AttributeError: 'module' object has no attribute 'Spider'

python,scrapy,scrapy-spider
I just started to learn scrapy, so I followed the scrapy documentation and wrote the first spider mentioned on that site. import scrapy class DmozSpider(scrapy.Spider): name = "dmoz" allowed_domains = ["dmoz.org"] start_urls = [ "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/", "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/" ] def parse(self, response): filename = response.url.split("/")[-2] with open(filename, 'wb') as f: f.write(response.body)...
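
This error almost always means one of two things: the installed Scrapy predates the release that added the top-level `scrapy.Spider` alias (older releases expose `scrapy.spider.BaseSpider` instead), or a local file named `scrapy.py` is shadowing the installed package. A quick diagnostic sketch:

```python
import scrapy

# An old Scrapy has no scrapy.Spider; upgrade with
# `pip install --upgrade scrapy` or use the old base class.
print(scrapy.__version__)

# If this prints a path inside your own project, a local scrapy.py
# is shadowing the real package -- rename it.
print(scrapy.__file__)

# Old-version equivalent of scrapy.Spider:
# from scrapy.spider import BaseSpider
```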

Scrapy python error - Missing scheme in request URL

python,web-crawler,scrapy,scrapy-spider
I'm trying to pull a file from a password protected FTP server. This is the code I'm using: import scrapy from scrapy.contrib.spiders import XMLFeedSpider from scrapy.http import Request from crawler.items import CrawlerItem class SiteSpider(XMLFeedSpider): name = 'site' allowed_domains = ['ftp.site.co.uk'] itertag = 'item' def start_requests(self): yield Request('ftp.site.co.uk/feed.xml', meta={'ftp_user': 'test', 'ftp_password':...
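
The error comes from the Request URL itself: `'ftp.site.co.uk/feed.xml'` has no scheme, and Scrapy refuses URLs without one. A sketch of the fix (credential meta keys as in the question; whether FTP is actually fetched depends on your Scrapy version's download handlers):

```python
# inside SiteSpider (Request imported as in the question)
def start_requests(self):
    # an explicit scheme is required, otherwise Scrapy raises
    # "Missing scheme in request url"
    yield Request(
        'ftp://ftp.site.co.uk/feed.xml',
        meta={'ftp_user': 'test', 'ftp_password': 'test'},
    )
```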

how to output multiple webpages crawled data into csv file using python with scrapy

python-2.7,web-scraping,web-crawler,scrapy,scrapy-spider
I have the following code below which crawls all the available pages from a website. It crawls the valid pages correctly, because when I use the print function I can see the data from the `'items'` list, but I don't see any output when I try to write a `.csv`...
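
Rather than writing the `.csv` by hand, the usual route is to yield the items and let Scrapy's feed exports serialize them: running `scrapy crawl myspider -o output.csv` writes every yielded item as a row. A minimal sketch (URL and field names are placeholders):

```python
import scrapy

class PagesSpider(scrapy.Spider):
    name = 'pages'
    start_urls = ['http://example.com/']  # hypothetical

    def parse(self, response):
        # each yielded item becomes one CSV row via `-o output.csv`
        yield {
            'url': response.url,
            'title': response.xpath('//title/text()').extract_first(),
        }
```

Note that yielding plain dicts and `extract_first()` need Scrapy 1.0+; on older versions, yield an `Item` and use `.extract()[0]`.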

Why is XMLFeedSpider failing to iterate through the designated nodes?

python,xml,rss,scrapy,scrapy-spider
I'm trying to parse through PLoS's RSS feed to pick up new publications. The RSS feed is located here. Below is my spider: from scrapy.contrib.spiders import XMLFeedSpider class PLoSSpider(XMLFeedSpider): name = "plos" itertag = 'entry' allowed_domains = ["plosone.org"] start_urls = [ ('http://www.plosone.org/article/feed/search' '?unformattedQuery=*%3A*&sort=Date%2C+newest+first') ] def parse_node(self, response, node): pass This...
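
The PLoS feed is Atom, so every node lives in the Atom namespace, and a bare `itertag = 'entry'` matches nothing (which also explains why `parse_node` never fires). XMLFeedSpider supports registering namespaces; a sketch:

```python
from scrapy.contrib.spiders import XMLFeedSpider  # scrapy.spiders in newer releases

class PLoSSpider(XMLFeedSpider):
    name = 'plos'
    allowed_domains = ['plosone.org']
    start_urls = [
        ('http://www.plosone.org/article/feed/search'
         '?unformattedQuery=*%3A*&sort=Date%2C+newest+first')
    ]
    iterator = 'xml'  # the namespace-aware iterator
    namespaces = [('atom', 'http://www.w3.org/2005/Atom')]
    itertag = 'atom:entry'  # qualify the tag with the registered prefix

    def parse_node(self, response, node):
        # child lookups need the same prefix
        self.log(node.xpath('atom:title/text()').extract())
```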

Rename output file after scrapy spider completes

python,scrapy,scrapy-spider,scrapyd
I am using Scrapy and Scrapyd to monitor certain sites. The output files are compressed jsonlines. Right after I submit a job schedule to scrapyd, I can see the output file being created, and it grows as the spider scrapes. My problem is I can't be sure when the output file...
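
One way is to hook the `spider_closed` signal and rename the file once the spider finishes. Caveat: the feed exporter finalizes the file on the same signal, so ordering is not guaranteed; treat this as a sketch (paths are placeholders) and verify against your scrapyd setup, or poll scrapyd's `listjobs.json` for the finished state instead.

```python
import os
from scrapy import signals

class RenameFeedOnClose(object):
    """Extension sketch: rename the output feed when the spider closes.
    Enable it via the EXTENSIONS setting."""

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        # placeholder paths -- derive them from your FEED_URI in practice
        os.rename('output.jl.gz', 'output.done.jl.gz')
```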

How to get scrapy results in order?

python,web-scraping,scrapy,scrapy-spider
Help me with scrapy. My code produces output, but it doesn't print it in the correct order. I also tried it inside another for loop, but that doesn't give the correct result either. If you spot something missing, please tell me. Code: import scrapy class YelpScrapy(scrapy.Spider): name = 'yelp' start_urls...

Is there a way using scrapy to export each item that is scraped into a separate json file?

web-scraping,scrapy,scrapy-spider
Currently I am using "yield item" after every item I scrape, but that gives me all the items in one single JSON file.
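
Yes: feed exports always write to a single file, but an item pipeline can write each item wherever you like. A minimal sketch (the filename scheme is arbitrary); remember to register it in `ITEM_PIPELINES`:

```python
import json

class JsonPerItemPipeline(object):
    def open_spider(self, spider):
        self.count = 0

    def process_item(self, item, spider):
        # one file per scraped item: item-0.json, item-1.json, ...
        with open('item-%d.json' % self.count, 'w') as f:
            json.dump(dict(item), f)
        self.count += 1
        return item
```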

Regular expression for Scrapy rules

python,regex,scrapy-spider
I want to crawl data from pages with format: http://www.vesselfinder.com/vessels?page=i where i is from 0 to some integer. Is the following regex correct for this pattern: start_urls = [ "http://www.vesselfinder.com/vessels" ] rules = ( Rule(LinkExtractor(allow=r"com/vessels\?page=[1-100]"), callback='parse_item', follow=True), ) ...
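
Not quite: `[1-100]` is a character class, matching a single character that is `1`, `0`, or in the (degenerate) range `1-1`; it does not mean "the numbers 1 through 100". `\d+` matches any page number; capping the range at 100 is easier to do in the callback than in the regex. A sketch:

```python
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor  # scrapy.linkextractors in newer releases

rules = (
    # matches .../vessels?page=1, ...?page=42, ...?page=100, etc.
    Rule(LinkExtractor(allow=r'com/vessels\?page=\d+'),
         callback='parse_item', follow=True),
)
```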

Remove first tag html using python & scrapy

python,xpath,scrapy,scrapy-spider
I have this HTML: <div class="abc"> <div class="xyz"> <div class="needremove"></div> <p>text</p> <p>text</p> <p>text</p> <p>text</p> </div> </div> I used: response.xpath('//div[contains(@class,"abc")]/div[contains(@class,"xyz")]').extract() Result: u'['<div class="xyz"> <div class="needremove"></div> <p>text</p> <p>text</p> <p>text</p> <p>text</p> </div>'] I want to remove...
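
Instead of deleting the unwanted node afterwards, you can select only the children you want to keep; here, every child of `div.xyz` whose `class` does not contain `needremove`:

```python
# inside the callback
parts = response.xpath(
    '//div[contains(@class,"abc")]'
    '/div[contains(@class,"xyz")]'
    '/*[not(contains(@class,"needremove"))]'
).extract()
# parts now holds the four <p> elements; join them if one string is needed
html = ''.join(parts)
```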

Scrapy crawler ignores `DOWNLOADER_MIDDLEWARES` when run as a script

python,scrapy,scrapy-spider
I'd like to acquire data, using Scrapy, from a few different sites and perform some analysis on that data. Since both the crawlers and the code to analyze the data relate to the same project, I'd like to store everything in the same Git repository. I created a minimal...
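
When a spider runs from a plain script, `settings.py` (and with it `DOWNLOADER_MIDDLEWARES`) is only picked up if Scrapy can locate the project. A sketch of the usual fix, assuming a project module named `myproject`:

```python
import os
from scrapy.utils.project import get_project_settings

# Either run the script from the directory containing scrapy.cfg, or
# point Scrapy at the settings module explicitly:
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'myproject.settings')

settings = get_project_settings()
print(settings.get('DOWNLOADER_MIDDLEWARES'))  # should no longer be empty
```

Then pass these settings to the crawler you construct, instead of letting it fall back to the defaults.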

Scrapy - scraped website authentication token expires while scraping

python,authentication,scrapy,scrapy-spider
To scrape a particular website 180 days into the future, an authentication token must be obtained in order to get the json data to scrape. While scraping, the token expires and the HTTP response returns a status code of 401 "Unauthorized". How do I get a new token into the...
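
One approach is a downloader middleware that spots the 401, fetches a fresh token, and re-issues the request. This is only a sketch: `get_fresh_token()` is a placeholder for however the token was obtained in the first place, and replacing the headers wholesale may need refining:

```python
class TokenRefreshMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status == 401:
            token = get_fresh_token()  # placeholder: re-run the auth step
            # re-issue the same request with the new token; dont_filter
            # stops the dupe filter from dropping the retry
            return request.replace(
                headers={'Authorization': 'Bearer %s' % token},
                dont_filter=True,
            )
        return response
```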

Extracting links with scrapy that have a specific css class

python,web-scraping,scrapy,screen-scraping,scrapy-spider
Conceptually simple question/idea. Using Scrapy, how do I use a LinkExtractor so that it only follows links with a given CSS class? Seems trivial, like it should already be built in, but I don't see it. Is it? It looks like I can use an XPath, but I'd prefer using...
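
There is no CSS-class argument on older LinkExtractors, but `restrict_xpaths` covers it, and newer Scrapy (1.0+) added `restrict_css`. A sketch, with `detail-link` standing in for your class:

```python
from scrapy.contrib.linkextractors import LinkExtractor

# only links found inside elements with the given class are extracted
le = LinkExtractor(restrict_xpaths='//a[contains(@class, "detail-link")]')

# Scrapy >= 1.0 equivalent:
# le = LinkExtractor(restrict_css='a.detail-link')
```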

Using ItemLoader but adding XPath, values etc. in Scrapy

python,xpath,web-scraping,scrapy,scrapy-spider
Currently I'm using the XPathItemLoader to scrape data: def parse_product(self, response): items = [] l = XPathItemLoader(item=MyItem(), response=response) l.default_input_processor = MapCompose(lambda v: v.split(), replace_escape_chars) l.default_output_processor = Join() l.add_xpath('name', 'div[2]/header/h1/text()') items.append(l.load_item()) return items and needed the v.split() to get rid of some spaces - that's working fine. But how can I...

Simple scrapy XML spider syntax error [closed]

python,xml,scrapy,scrapy-spider
I was just trying to make a simple spider using scrapy to grab data from an XML file. This is what I came up with: from scrapy.contrib.spiders import XMLFeedSpider class MySpider(XMLFeedSpider): name = 'testproject' allowed_domains = ['www.w3schools.com'] start_urls = ['http://www.w3schools.com/xml/note.xml'] itertag = 'note' def parse_node(self, response, node): to = node.select('to/text()').extract()...

Scrapy creating XML feed wraps content in “value” tags

python,xml,scrapy,scrapy-spider
I've had a bit of help on here and my code pretty much works. The only issue is that in the process of generating an XML, it wraps the content in "value" tags when I don't want it to. According to the docs this is due to this: Unless overridden...
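
The `<value>` wrapping happens because each field is a list (that is what `add_xpath` collects), and the XML exporter serializes list values as `<value>` children. Making each field a single string avoids it; with an item loader that is one processor change:

```python
from scrapy.contrib.loader.processor import TakeFirst  # scrapy.loader.processors in newer releases

# a scalar field is exported as <name>text</name>, with no <value> wrapper
l.default_output_processor = TakeFirst()
```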

Scrapy: If key exists, why do I get a KeyError?

python,list,key,scrapy,scrapy-spider
With items.py defined: import scrapy class CraigslistSampleItem(scrapy.Item): title = scrapy.Field() link = scrapy.Field() and populating each item via the spider thus: item = CraigslistSampleItem() item["title"] = $someXpath.extract() item["link"] = $someOtherXpath.extract() When I append these to a list (returned by parse()) and store this as e.g. a csv, I get two...
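
Declaring a `Field` on the item class does not create the key: a `scrapy.Item` only holds keys that were actually assigned. If an XPath matched nothing and the assignment was skipped, reading the field raises `KeyError` even though it is "defined". Read defensively:

```python
# scrapy.Item supports dict-style .get(), which avoids the KeyError
title = item.get('title', '')

# or test membership before reading
if 'title' in item:
    print(item['title'])
```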

scrapy crawling at depth not working

python,scrapy,scrapy-spider
I am writing scrapy code to crawl the first page and one additional depth of a given webpage. Somehow my crawler doesn't enter the additional depth; it just crawls the given starting urls and ends its operation. I added a filter_links callback function, but even that is not getting called, so clearly the rules are being ignored. What...

Scrapy returning zero results

python,scrapy,scrapy-spider
I am attempting to learn how to use scrapy, and am trying to do what I think is a simple project. I am attempting to pull 2 pieces of data from a single webpage - crawling additional links isn't needed. However, my code seems to be returning zero results. I...

Error using scrapy

python,web-scraping,scrapy,scrapy-spider
I have this code in python: import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors import LinkExtractor from site_auto_1.items import AutoItem class AutoSpider(CrawlSpider): name = "auto" allowed_host = ["autowereld.nl"] url = "http://www.autowereld.nl/" start_urls = [...

Scrapy - Issue with xpath on an xml crawl

python,xml,xpath,scrapy,scrapy-spider
I'm trying to make a simple spider to grab some xml and spit it out in a new format for an experiment. However it seems there is extra code contained within the xml which is spat out. The format I want is like this (no extra code or value tag)...

Scrapy Limit Requests For Testing

python,python-2.7,web-scraping,scrapy,scrapy-spider
I've been searching the scrapy documentation for a way to limit the number of requests my spiders are allowed to make. During development I don't want to sit here and wait for my spiders to finish an entire crawl, even though the crawls are pretty focused they can still take...
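
The CloseSpider extension does exactly this. `CLOSESPIDER_PAGECOUNT` stops the crawl after roughly N responses (in-flight requests still complete), and `CLOSESPIDER_ITEMCOUNT` works analogously for items. Set it per run with `scrapy crawl myspider -s CLOSESPIDER_PAGECOUNT=20`, or per spider (Scrapy 1.0+):

```python
import scrapy

class DevSpider(scrapy.Spider):
    name = 'dev'
    # stop after ~20 crawled responses; handy during development
    custom_settings = {'CLOSESPIDER_PAGECOUNT': 20}
```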

Scrapy collect data from first element and post's title

python,web-scraping,web-crawler,scrapy,scrapy-spider
I need Scrapy to collect data from this tag and retrieve all three parts in one piece. The output would be something like: Tonka double shock boys bike - $10 (Denver). <span class="postingtitletext">Tonka double shock boys bike - <span class="price">$10</span><small> (Denver)</small></span> Second is to collect data from first span tag....

Multiple inheritance in scrapy spiders

python,regex,scrapy,multiple-inheritance,scrapy-spider
Is it possible to create a spider which inherits the functionality from two base spiders, namely SitemapSpider and CrawlSpider? I have been trying to scrape data from various sites and realized that not all sites have listing of every page on the website, thus a need to use CrawlSpider. But...

scrapy itemloaders return list of items

scrapy,scrapy-spider
def parse(self, response): for link in LinkExtractor(restrict_xpaths="BLAH",).extract_links(response)[:-1]: yield Request(link.url) l = MyItemsLoader() l.add_value('main1', some xpath) l.add_value('main2', some xpath) l.add_value('main3', some xpath) rows = response.xpath("table[@id='BLAH']/tbody[contains(@id, 'BLOB')]") for row in rows: l.add_value('table1', some xpath based on rows) l.add_value('table2', some xpath based on rows) l.add_value('main3', some xpath based on rows) yield l.load_item() I am...

SgmlLinkExtractor not displaying results or following link

python,web-crawler,scrapy,scrapy-spider,sgml
I am having problems fully understanding how SGML Link Extractor works. When making a crawler with Scrapy, I can successfully extract data from links using specific URLS. The problem is using Rules to follow a next page link in a particular URL. I think the problem lies in the allow()...

Scrapy prints fields but doesn't populate XML file

python,xml,xpath,scrapy,scrapy-spider
I have a problem where the spider prints the fields correctly, but it doesn't populate the XML file with any content. The output in terminal is this: [u'Tove'] [u'Jani'] [u'Reminder'] [u"Don't forget me this weekend!"] However the output site_products.xml results in this (which is wrong, no data): <?xml version="1.0" encoding="utf-8"?>...
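
A common cause of exactly this symptom: the values are printed inside the callback but never returned, and the feed exporter only sees items that the callback returns or yields. A sketch (the item class is a placeholder):

```python
def parse_node(self, response, node):
    item = SiteProductItem()  # placeholder item class
    item['title'] = node.xpath('title/text()').extract()
    return item  # print() alone never reaches the XML exporter
```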

How to crawl classified websites [closed]

web-crawler,scrapy,scrapy-spider
I am trying to write a crawler with Scrapy to crawl a classified-type (target) site and fetch information from the links on the target site. The tutorial on Scrapy only helps me get the links from the target URL but not the second layer of data gathering that I seek....

While scraping, getting error: instance method has no attribute '__getitem__'

python,web-scraping,scrapy,web-crawler,scrapy-spider
I couldn't understand why I am getting this error -> instance method has no attribute '__getitem__'. I am just trying to scrape this site to extract the department names. import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import Selector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from urlparse import urljoin from amazon.items import AmazonItem...
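
In Python, "instance method has no attribute '__getitem__'" means something is being indexed while it is still a bound method, almost always because the calling parentheses were dropped. A typical instance of the bug and its fix:

```python
# buggy: .extract is a method object, so [0] fails with
# "instance method has no attribute '__getitem__'"
# title = response.xpath('//h2/text()').extract[0]

# fixed: call the method, then index the resulting list
title = response.xpath('//h2/text()').extract()[0]
```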

Scrapy: catch responses with specific HTTP server codes

python,web-scraping,scrapy,scrapy-spider
We have a pretty much standard Scrapy project (Scrapy 0.24). I'd like to catch specific HTTP response codes, such as 200, 500, 502, 503, 504 etc. Something like that: class Spider(...): def parse(...): processes HTTP 200 def parse_500(...): processes HTTP 500 errors def parse_502(...): processes HTTP 502 errors ... How...
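
By default, Scrapy's HttpError middleware filters non-2xx responses out before they reach the spider. Listing the codes in `handle_httpstatus_list` lets them through, and the callback can then dispatch on `response.status`:

```python
import scrapy

class StatusSpider(scrapy.Spider):
    name = 'status'
    handle_httpstatus_list = [500, 502, 503, 504]  # let these reach parse()

    def parse(self, response):
        if response.status == 500:
            return self.parse_500(response)
        if response.status in (502, 503, 504):
            return self.parse_gateway(response)
        # normal HTTP 200 handling continues here

    def parse_500(self, response):
        pass  # HTTP 500 handling

    def parse_gateway(self, response):
        pass  # HTTP 502/503/504 handling
```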

Why is xpath selecting only the last <li> inside the <ul>?

python,web-scraping,scrapy,scrapy-spider
I'm trying to scrape this site : http://www.kaymu.com.ng/. The part of the HTML I'm scraping is like this: <ul id="navigation-menu"> <li> some content </li> <li> some content </li> ... <li> some content </li> </ul> This is my spider : class KaymuSpider(Spider): name = "kaymu" allowed_domains = ["kaymu.com.ng"] start_urls = [...
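
Hard to be certain without seeing the selector, but the classic cause of "only one `<li>`" symptoms is an absolute XPath inside a loop: expressions starting with `//` restart from the document root rather than the current node. Anchoring with `.//` fixes it:

```python
# inside the callback
for li in response.xpath('//ul[@id="navigation-menu"]/li'):
    # '//a/text()' would search the whole document on every iteration;
    # './/a/text()' searches only inside the current <li>
    texts = li.xpath('.//a/text()').extract()
```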

Scrapy keep all unique pages based on a list of start urls

python,web-scraping,scrapy,scrapy-spider
I want to give Scrapy a list of start urls, and have it visit each link on each of those start pages. For every link, if it hasn't been to that page before, I want to download the page and keep it locally. How can I achieve this?
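
Scrapy's default dupefilter already skips URLs it has seen, so a spider that follows every link and saves each response body gets this behavior almost for free. A sketch (start URL is a placeholder):

```python
import scrapy
from urlparse import urljoin  # Python 2, matching the era's Scrapy

class SnapshotSpider(scrapy.Spider):
    name = 'snapshot'
    start_urls = ['http://example.com/']  # your list of start urls

    def parse(self, response):
        # crude local filename derived from the URL
        with open(response.url.replace('/', '_'), 'wb') as f:
            f.write(response.body)
        # follow every link; duplicate URLs are filtered automatically
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(urljoin(response.url, href))
```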

Scrapy command in shell script not executing when called from java

java,bash,shell,scrapy,scrapy-spider
I have the below shell script which invokes scrapy #!/bin/bash export PATH=usr/local/bin/scrapy:$PATH scrapy crawl flipkart -a key="$1" -o "$2"flipkart.xml scrapy crawl myntra -a key="$1" -o "$2"myntra.xml scrapy crawl jabong -a key="$1" -o "$2"jabong.xml echo $PATH In the java program which calls this script file the error stream says that scrapy:...
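
Note that `export PATH=usr/local/bin/scrapy:$PATH` is doubly wrong: the path is missing its leading `/`, and `PATH` entries must be directories, not binaries, so it should read `export PATH=/usr/local/bin:$PATH`. That also explains the symptom: a shell spawned from Java does not load the interactive profile that normally puts scrapy on the `PATH`.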

Why is scrapy not storing data into mongodb?

python,mongodb,web-scraping,scrapy,scrapy-spider
My main file: import scrapy from scrapy.exceptions import CloseSpider from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.http import Request class Product(scrapy.Item): brand = scrapy.Field() title = scrapy.Field() link = scrapy.Field() name = scrapy.Field() title = scrapy.Field() date = scrapy.Field() heading = scrapy.Field() data = scrapy.Field() Model_name =...
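
Two things stand out, hedged without seeing the full project: the item class defines `title = scrapy.Field()` twice (harmless, but a sign of copy-paste), and, more importantly, nothing reaches MongoDB unless the pipeline doing the inserts is registered under `ITEM_PIPELINES` in `settings.py` and the spider actually yields items to it. Checking the crawl log for an "Enabled item pipelines" line that names the MongoDB pipeline is the quickest diagnostic.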

Scrapy crawl and follow links within href

python,web-scraping,scrapy,scrapy-spider
I am very new to scrapy. I need to follow hrefs from the homepage of a url to multiple depths. Again, inside those href links I have multiple hrefs. I need to follow these hrefs until I reach my desired page to scrape. The sample html of my page is:...

Scrapy - Scrape multiple URLs using results from the first URL

python,scrapy,scrapy-spider
I use Scrapy to scrape data from the first URL. The first URL returns a response that contains a list of URLs. So far that works for me. My question is: how can I further scrape this list of URLs? After searching, I know I can return a request in...
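
The pattern is to yield a new Request for every URL the first response contains, pointing at a second callback. A sketch (URLs and XPath are placeholders):

```python
import scrapy

class ListThenDetailSpider(scrapy.Spider):
    name = 'list_then_detail'
    start_urls = ['http://example.com/list']  # the "first URL"

    def parse(self, response):
        # step 1: the first response contains a list of URLs
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(url),
                                 callback=self.parse_detail)

    def parse_detail(self, response):
        # step 2: scrape the actual data from each listed URL here
        yield {'url': response.url}
```

(`response.urljoin` and dict items need Scrapy 1.0+; older versions use `urlparse.urljoin` and `Item` classes.)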

Passing a list as arguments in Scrapy

python,flask,scrapy,scrapy-spider
I am trying to build an application using Flask and Scrapy. I have to pass a list of urls to the spider. I tried using the following syntax: __init__: in Spider self.start_urls = ["http://www.google.com/patents/" + x for x in u] Flask Method u = ["US6249832", "US20120095946"] os.system("rm static/s.json; scrapy crawl patents...
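
Spider arguments passed with `-a` arrive as single strings, so the usual pattern is to join the list on the Flask side and split it in the spider. A sketch:

```python
import scrapy

class PatentsSpider(scrapy.Spider):
    name = 'patents'

    def __init__(self, patents='', *args, **kwargs):
        super(PatentsSpider, self).__init__(*args, **kwargs)
        # invoked as: scrapy crawl patents -a patents=US6249832,US20120095946
        self.start_urls = ['http://www.google.com/patents/' + x
                           for x in patents.split(',') if x]
```

On the Flask side, `",".join(u)` builds the argument string; passing it through `os.system` works, though `subprocess` handles quoting more safely.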

Scraping iTunes Charts using Scrapy

python,web-scraping,scrapy,scrapy-spider
I am doing the following tutorial on using Scrapy to scrape iTunes charts: http://davidwalsh.name/python-scrape The tutorial is slightly outdated, in that some of the classes used have since been deprecated (e.g. HtmlXPathSelector, BaseSpider...) - I have been working on completing the tutorial with the current...

How to read xml directly from URLs with scrapy/python

python,xml,web-scraping,scrapy,scrapy-spider
In Scrapy you will have to define start_urls. But how can I crawl from other urls as well? Up to now I have a login script which logs into a webpage. After logging in, I want to extract xml from different urls. import scrapy class LoginSpider(scrapy.Spider): name = 'example' start_urls...

Scrapy not giving individual results for all the reviews of a phone

python,xpath,web-scraping,scrapy,scrapy-spider
This code is giving me results, but the output is not as desired. What is wrong with my xpath? And how do I iterate the rule by +10? I always have problems with these two. import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import Selector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from urlparse...
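
If "iterate the rule by +10" means walking a review listing whose URL takes a 10-per-page offset parameter, generating the offsets up front is simpler than a LinkExtractor rule. A sketch with a hypothetical URL pattern:

```python
# inside the spider class; hypothetical pattern:
# ...reviews?start=0, ?start=10, ?start=20, ...
base = 'http://example.com/product/reviews?start=%d'
start_urls = [base % offset for offset in range(0, 500, 10)]
```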

How to extract all the source code under <table> and export as html?

python,html,scrapy,scrapy-spider
I am a beginner with Scrapy. My goal is to extract selected tables from a big HTML page and then export the selected tables together in HTML format. So essentially, what I want is to get a shorter version of the original web page keeping only the <table> sections. The structure...
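
Extracting the `<table>` elements and re-wrapping them in a minimal page is enough for this. A sketch (URL and output path are placeholders):

```python
import scrapy

class TablesSpider(scrapy.Spider):
    name = 'tables'
    start_urls = ['http://example.com/big-page']  # hypothetical

    def parse(self, response):
        # .extract() on element nodes returns their full outer HTML
        tables = response.xpath('//table').extract()
        page = '<html><body>\n%s\n</body></html>' % '\n'.join(tables)
        with open('tables.html', 'w') as f:
            f.write(page.encode('utf-8'))  # Python 2: encode before writing
```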

Scrapy: how can I get the content of pages whose response.status=302?

web-scraping,scrapy,scrape,scrapy-spider
I get the following log when crawling: DEBUG: Crawled (302) <GET http://fuyuanxincun.fang.com/xiangqing/> (referer: http://esf.hz.fang.com/housing/151__1_0_0_0_2_0_0/) DEBUG: Scraped from <302 http://fuyuanxincun.fang.com/xiangqing/> But it actually returns nothing. How can I deal with these responses with status=302? Any help would be much appreciated!...
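
A 302 never reaches the callback because the redirect middleware follows it first. To inspect the 302 response itself, disable redirection for that request and whitelist the status code via meta:

```python
from scrapy.http import Request

# inside your spider's callback:
yield Request(
    'http://fuyuanxincun.fang.com/xiangqing/',
    meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
    callback=self.parse_page,  # now sees the raw 302 (often an empty body)
)
```

If the real content lives at the redirect target, the better fix is usually to let the redirect run and scrape the destination; sites sometimes 302 crawlers that lack expected cookies or headers.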

Scrapy: Spider optimization

python,web-scraping,scrapy,scrapy-spider
I'm trying to scrape an e-commerce web site, and I'm doing it in 2 steps. This website has a structure like this: the homepage has the links to the family-items and subfamily-items pages; each family & subfamily page has a paginated list of products. Right now I have 2 spiders:...

scrapy append to linkextractor links

python,web-scraping,scrapy,scrapy-spider
I am using CrawlSpider with LinkExtractor to crawl the links. How would I go about appending parameters to the links LinkExtractor finds? I would like to add '?pag_sortorder=0&pag_perPage=999' to each link that LinkExtractor extracts....
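
`LinkExtractor` takes a `process_value` callable that is applied to every extracted URL, which is the natural hook for appending parameters. A naive sketch (it assumes the links carry no query string yet):

```python
from scrapy.contrib.linkextractors import LinkExtractor

def add_paging_params(url):
    # naive concatenation; use urlparse if links may already have a query
    return url + '?pag_sortorder=0&pag_perPage=999'

le = LinkExtractor(process_value=add_paging_params)
```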

Scrapy CrawlSpider not following links

python,web-scraping,web-crawler,scrapy,scrapy-spider
I am trying to crawl some attributes from all (#123) detail pages given on this category page - http://stinkybklyn.com/shop/cheese/ - but scrapy is not able to follow the link pattern I set. I checked the scrapy documentation and some tutorials as well, but no luck! Below is the code: import scrapy from scrapy.contrib.linkextractors...

Crawl spider not crawling ~ Rule Issue

python,web-scraping,scrapy,scrapy-spider
I am having an issue with a spider that I am programming. I am trying to recursively scrape the courses off my university's website but I am having great trouble with Rule and LinkExtractor. # -*- coding: utf-8 -*- import scrapy from scrapy.spider import Spider from scrapy.contrib.spiders import CrawlSpider, Rule...

Cannot download image with relative URL Python Scrapy

python,web-crawler,scrapy,scrapy-spider
I'm using Scrapy to download images from http://www.vesselfinder.com/vessels However, I can only get the relative url of images, like this: http://www.vesselfinder.com/vessels/ship-photo/0-227349190-7c01e2b3a7a5078ea94fff9a0f862f8a/0 All of the images are named 0.jpg, but if I try to use that absolute url, I cannot access the image. My code: items.py import scrapy class VesselItem(scrapy.Item):...

Scrapy extracting from Link

python,scrapy,scrapy-spider
I am trying to extract information from certain links, but my spider never visits them; it only extracts from the start_url, and I am not sure why. Here is my code: import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from tutorial.items import DmozItem from scrapy.selector...
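
With a CrawlSpider, the usual culprit is a callback named `parse`: CrawlSpider routes every response through its own `parse()` to apply the rules, so overriding it silently disables link-following. Rename the callback (patterns below are hypothetical):

```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class DmozSpider(CrawlSpider):
    name = 'dmoz'
    start_urls = ['http://example.com/']  # placeholder
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/category/'),  # hypothetical pattern
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):  # anything but "parse"
        pass  # extraction goes here
```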

delete spiders from scrapinghub

delete,web-crawler,scrapy,scrapy-spider,scrapinghub
I am a new user of scrapinghub. I already searched on Google and read the scrapinghub docs, but I could not find any information about removing spiders from a project. Is it possible, and how? I do not want to replace a spider, I want to delete/remove it from scrapinghub...

Pass argument to scrapy spider within a python script

python,python-2.7,web-scraping,scrapy,scrapy-spider
I can run a crawl in a python script with the following recipe from the wiki: from twisted.internet import reactor from scrapy.crawler import Crawler from scrapy import log, signals from testspiders.spiders.followall import FollowAllSpider from scrapy.utils.project import get_project_settings spider = FollowAllSpider(domain='scrapinghub.com') settings = get_project_settings() crawler = Crawler(settings) crawler.signals.connect(reactor.stop, signal=signals.spider_closed) crawler.configure() crawler.crawl(spider) crawler.start()...

Web scraping error: exceptions.MemoryError

python,web-scraping,scrapy,scrapy-spider
I'm trying to download data from gsmarena. Sample code to download the HTC One ME spec from the site "http://www.gsmarena.com/htc_one_me-7275.php" is mentioned below. The data on the website is organized in the form of tables and table rows. The data is of the format: table header > td[@class='ttl'] >...

Scraping a huge site with scrapy never completes

scrapy,scrapy-spider
I'm scraping a site which has millions of pages and about hundreds of thousands of items. I'm using the CrawlSpider with LxmlLinkExtractors to define the exact path between different types of pages. Everything works fine and my scraper doesn't follow unwanted links. However, the whole site never seems to be...

Distinguishing between HTML and non-HTML pages in Scrapy

python,html,web-crawler,scrapy,scrapy-spider
I am building a Spider in Scrapy that follows all the links it can find, and sends the url to a pipeline. At the moment, this is my code: from scrapy import Spider from scrapy.http import Request from scrapy.http import TextResponse from scrapy.selector import Selector from scrapyTest.items import TestItem import...

How to exclude a particular html tag (without any id) from several tags while using scrapy?

python,html,web-scraping,scrapy,scrapy-spider
<div class="region size2of3"> <h2>Mumbai</h2> <strong>Fort</strong> <div>Elphinstone building, Horniman Circle,</div> <div>Veer Nariman Road, Fort</div> <div>Mumbai 400001</div> <div>Timings: 08:00-00:30 hrs (Mon-Sun)</div> <div><br></div> </div> I want to exclude the "Timings: 08:00-00:30 hrs (Mon-Sun)" div tag while parsing. Here's my code: import scrapy from job.items import StarbucksItem class StarbucksSpider(scrapy.Spider): name =...

Scrapy - generating items outside of parse callback

python,scrapy,scrapy-spider
This might be a bit of an odd one. I have a Scrapy project with a few spiders that inherit from CrawlSpider. Aside from their normal execution (going through the intended website), I also want to be able to push items outside of the scope of the original callback. I...