Scrapy CrawlSpider not following links

python,web-scraping,web-crawler,scrapy,scrapy-spider
I am trying to crawl some attributes from all (#123) detail pages listed on this category page - http://stinkybklyn.com/shop/cheese/ - but Scrapy is not able to follow the link pattern I set. I checked the Scrapy documentation and some tutorials as well, but no luck! Below is the code: import scrapy from scrapy.contrib.linkextractors...
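A common cause, for reference: CrawlSpider reserves parse() for its own rule handling, so overriding it silently disables the rules. A minimal sketch of the usual structure (the allow pattern, callback name, and extracted field are assumptions, not the asker's actual code):

```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class CheeseSpider(CrawlSpider):
    name = 'cheese'
    allowed_domains = ['stinkybklyn.com']
    start_urls = ['http://stinkybklyn.com/shop/cheese/']

    rules = (
        # Follow detail links under the category; the pattern is a guess
        Rule(LinkExtractor(allow=r'/shop/cheese/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Never name this callback parse(): CrawlSpider needs parse()
        # for its own link-following logic.
        # Scrapy 1.0+ accepts plain dicts; on 0.24 return an Item instead.
        yield {'name': response.xpath('//h1/text()').extract()}
```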

how to output multiple webpages crawled data into csv file using python with scrapy

python-2.7,web-scraping,web-crawler,scrapy,scrapy-spider
I have the following code below, which crawls all the available pages from a website. It is correctly `crawling` the valid pages, because when I use the print function I can see the data from the `'items'` list, but I don't see any output when I try to use `.csv`...
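For reference, the usual cause: Scrapy's feed exporter only sees items that the callback yields (or returns), so printing alone produces an empty CSV. A sketch, with a hypothetical selector:

```python
def parse(self, response):
    for sel in response.xpath('//div[@class="listing"]'):  # hypothetical selector
        # Items must be yielded (or returned) to reach the feed exporter;
        # print() only writes to stdout and the CSV stays empty.
        yield {'title': sel.xpath('.//h2/text()').extract()}
```

Then run `scrapy crawl myspider -o output.csv`; recent versions infer the format from the extension (on older versions add `-t csv`).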

Scrapy collect data from first element and post's title

python,web-scraping,web-crawler,scrapy,scrapy-spider
I need Scrapy to collect data from this tag and retrieve all three parts in one piece. The output would be something like: Tonka double shock boys bike - $10 (Denver). <span class="postingtitletext">Tonka double shock boys bike - <span class="price">$10</span><small> (Denver)</small></span> The second task is to collect data from the first span tag....
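One way to get all three parts in one piece, for reference: XPath's string(.) concatenates every descendant text node of an element. A sketch assuming the markup posted above:

```python
title = response.xpath('//span[@class="postingtitletext"]')
# "Tonka double shock boys bike - $10 (Denver)" as one string:
full_text = title.xpath('string(.)').extract()[0].strip()
# The first nested span (the price) can still be read on its own:
price = title.xpath('.//span[@class="price"]/text()').extract()
```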

Pass argument to scrapy spider within a python script

python,python-2.7,web-scraping,scrapy,scrapy-spider
I can run a crawl in a python script with the following recipe from the wiki: from twisted.internet import reactor from scrapy.crawler import Crawler from scrapy import log, signals from testspiders.spiders.followall import FollowAllSpider from scrapy.utils.project import get_project_settings spider = FollowAllSpider(domain='scrapinghub.com') settings = get_project_settings() crawler = Crawler(settings) crawler.signals.connect(reactor.stop, signal=signals.spider_closed) crawler.configure() crawler.crawl(spider) crawler.start()...
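For reference, on newer Scrapy (1.0+) the same recipe shrinks to CrawlerProcess, and spider arguments are just keyword arguments forwarded to the spider's __init__, exactly like `scrapy crawl -a domain=...`. A sketch:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from testspiders.spiders.followall import FollowAllSpider

process = CrawlerProcess(get_project_settings())
# Keyword arguments after the spider class are passed to its __init__
process.crawl(FollowAllSpider, domain='scrapinghub.com')
process.start()  # blocks until the crawl finishes
```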

Scrapy - generating items outside of parse callback

python,scrapy,scrapy-spider
This might be a bit of an odd one. I have a Scrapy project with a few spiders that inherit from CrawlSpider. Aside from their normal execution (going through the intended website), I also want to be able to push items outside of the scope of the original callback. I...

Scrapy prints fields but doesn't populate XML file

python,xml,xpath,scrapy,scrapy-spider
I have a problem where my spider prints the fields correctly, but doesn't populate the XML file with any content. The output in the terminal is this: [u'Tove'] [u'Jani'] [u'Reminder'] [u"Don't forget me this weekend!"] However, the output file site_products.xml ends up like this (which is wrong - no data): <?xml version="1.0" encoding="utf-8"?>...

How to extract all the source code under <table> and export as html?

python,html,scrapy,scrapy-spider
I am beginner of Scrapy. My goal is to extract selected tables from a big HTML page and then export the selected tables together in HTML format. So essentially, what I want is to get a shorter version of the original web page keeping only the <table> sections. The structure...

Scrapy: how can I get the content of pages whose response.status=302?

web-scraping,scrapy,scrape,scrapy-spider
I get the following log when crawling: DEBUG: Crawled (302) <GET http://fuyuanxincun.fang.com/xiangqing/> (referer: http://esf.hz.fang.com/housing/151__1_0_0_0_2_0_0/) DEBUG: Scraped from <302 http://fuyuanxincun.fang.com/xiangqing/> But it actually returns nothing. How can I deal with these responses with status=302? Any help would be much appreciated!...
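For reference, a 302 body is normally empty; the content lives at the Location target, which Scrapy follows by default. To receive the 302 response itself in a callback, both the redirect middleware and the HTTP-error filtering have to be told to stand down. A sketch (Scrapy 1.0+ style shown):

```python
import scrapy

class FangSpider(scrapy.Spider):
    name = 'fang302'

    def start_requests(self):
        yield scrapy.Request(
            'http://fuyuanxincun.fang.com/xiangqing/',
            meta={'dont_redirect': True,             # keep RedirectMiddleware away
                  'handle_httpstatus_list': [302]},  # let the 302 reach parse()
            callback=self.parse)

    def parse(self, response):
        # response.status is 302 here; the real page usually sits at the
        # Location header, often reachable with the right cookies/referer.
        self.logger.info('Got %s, Location: %s', response.status,
                         response.headers.get('Location'))
```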

How to crawl classified websites [closed]

web-crawler,scrapy,scrapy-spider
I am trying to write a crawler with Scrapy to crawl a classified-type (target) site and fetch information from the links on the target site. The tutorial on Scrapy only helps me get the links from the target URL but not the second layer of data gathering that I seek....

Using ItemLoader but adding XPath, values etc. in Scrapy

python,xpath,web-scraping,scrapy,scrapy-spider
Currently I'm using the XPathItemLoader to scrape data: def parse_product(self, response): items = [] l = XPathItemLoader(item=MyItem(), response=response) l.default_input_processor = MapCompose(lambda v: v.split(), replace_escape_chars) l.default_output_processor = Join() l.add_xpath('name', 'div[2]/header/h1/text()') items.append(l.load_item()) return items and needed the v.split() to get rid of some spaces - that's working fine. But how can I...

Regular expression for Scrapy rules

python,regex,scrapy-spider
I want to crawl data from pages with format: http://www.vesselfinder.com/vessels?page=i where i is from 0 to some integer. Is the following regex correct for this pattern: start_urls = [ "http://www.vesselfinder.com/vessels" ] rules = ( Rule(LinkExtractor(allow=r"com/vessels\?page=[1-100]"), callback='parse_item', follow=True), ) ...
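Worth noting, for reference: [1-100] is a character class (it matches a single character, '0' or '1'), not a numeric range, so it will not match pages 2-99. A sketch of a pattern that matches any page number:

```python
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import Rule

rules = (
    # \d+ matches any page number; a regex cannot express "1 to 100"
    # as a numeric range, so cap the page in the callback if needed.
    Rule(LinkExtractor(allow=r'com/vessels\?page=\d+'),
         callback='parse_item', follow=True),
)
```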

Error using scrapy

python,web-scraping,scrapy,scrapy-spider
I have this code in python: import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors import LinkExtractor from site_auto_1.items import AutoItem class AutoSpider(CrawlSpider): name = "auto" allowed_host = ["autowereld.nl"] url = "http://www.autowereld.nl/" start_urls = [...

Scrapy returning zero results

python,scrapy,scrapy-spider
I am attempting to learn how to use scrapy, and am trying to do what I think is a simple project. I am attempting to pull 2 pieces of data from a single webpage - crawling additional links isn't needed. However, my code seems to be returning zero results. I...

Scrapy extracting from Link

python,scrapy,scrapy-spider
I am trying to extract information from certain links, but the spider never visits them; it only extracts from the start_url, and I am not sure why. Here is my code: import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from tutorial.items import DmozItem from scrapy.selector...

SgmlLinkExtractor not displaying results or following link

python,web-crawler,scrapy,scrapy-spider,sgml
I am having problems fully understanding how the SGML link extractor works. When making a crawler with Scrapy, I can successfully extract data from links using specific URLs. The problem is using Rules to follow a next-page link in a particular URL. I think the problem lies in the allow()...

Scrapy - scraped website authentication token expires while scraping

python,authentication,scrapy,scrapy-spider
To scrape a particular website 180 days into the future, an authentication token must be obtained in order to get the json data to scrape. While scraping, the token expires and the HTTP response returns a status code of 401 "Unauthorized". How do I get a new token into the...

Scrapy: catch responses with specific HTTP server codes

python,web-scraping,scrapy,scrapy-spider
We have a pretty much standard Scrapy project (Scrapy 0.24). I'd like to catch specific HTTP response codes, such as 200, 500, 502, 503, 504, etc. Something like this: class Spider(...): def parse(...): processes HTTP 200 def parse_500(...): processes HTTP 500 errors def parse_502(...): processes HTTP 502 errors ... How...
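For reference: by default HttpErrorMiddleware drops non-2xx responses before they reach spider callbacks, so the statuses must be whitelisted first and then dispatched on response.status. A sketch:

```python
import scrapy

class StatusSpider(scrapy.Spider):
    name = 'status'
    # Without this, responses with these codes never reach parse()
    handle_httpstatus_list = [500, 502, 503, 504]

    def parse(self, response):
        if response.status == 500:
            return self.parse_500(response)
        if response.status == 502:
            return self.parse_502(response)
        return self.parse_200(response)

    def parse_200(self, response):
        pass  # normal processing

    def parse_500(self, response):
        pass  # handle server errors

    def parse_502(self, response):
        pass  # handle bad gateways
```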

Why scrapy not storing data into mongodb?

python,mongodb,web-scraping,scrapy,scrapy-spider
My main File: import scrapy from scrapy.exceptions import CloseSpider from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.http import Request class Product(scrapy.Item): brand = scrapy.Field() title = scrapy.Field() link = scrapy.Field() name = scrapy.Field() title = scrapy.Field() date = scrapy.Field() heading = scrapy.Field() data = scrapy.Field() Model_name =...

Scrapy command in shell script not executing when called from java

java,bash,shell,scrapy,scrapy-spider
I have the below shell script which invokes scrapy: #!/bin/bash export PATH=usr/local/bin/scrapy:$PATH scrapy crawl flipkart -a key="$1" -o "$2"flipkart.xml scrapy crawl myntra -a key="$1" -o "$2"myntra.xml scrapy crawl jabong -a key="$1" -o "$2"jabong.xml echo $PATH In the Java program which calls this script file, the error stream says that scrapy:...

Scraping iTunes Charts using Scrapy

python,web-scraping,scrapy,scrapy-spider
I am doing the following tutorial on using Scrapy to scrape the iTunes charts. http://davidwalsh.name/python-scrape The tutorial is slightly outdated, in that some of the classes it uses have since been deprecated in the current version of Scrapy (e.g. HtmlXPathSelector, BaseSpider...) - I have been working on completing the tutorial with the current...
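For anyone following along, the modern equivalents of those deprecated names are, for reference (valid since around Scrapy 0.24):

```python
from scrapy import Spider            # replaces BaseSpider
from scrapy.selector import Selector  # replaces HtmlXPathSelector
```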

AttributeError: 'module' object has no attribute 'Spider'

python,scrapy,scrapy-spider
I just started to learn scrapy, so I followed the scrapy documentation. I just wrote the first spider mentioned on that site. import scrapy class DmozSpider(scrapy.Spider): name = "dmoz" allowed_domains = ["dmoz.org"] start_urls = [ "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/", "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/" ] def parse(self, response): filename = response.url.split("/")[-2] with open(filename, 'wb') as f: f.write(response.body)...

Scrapy keep all unique pages based on a list of start urls

python,web-scraping,scrapy,scrapy-spider
I want to give Scrapy a list of start urls, and have it visit each link on each of those start pages. For every link, if it hasn't been to that page before, I want to download the page and keep it locally. How can I achieve this?
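A sketch of one way to do this, for reference: Scrapy's scheduler already de-duplicates requests by URL fingerprint, so each page is fetched at most once, and saving response.body in the callback covers the "keep it locally" part. The filename scheme below is just an illustration:

```python
from urlparse import urljoin  # Python 2, as in the other snippets here

from scrapy import Spider
from scrapy.http import Request

class MirrorSpider(Spider):
    name = 'mirror'
    start_urls = ['http://example.com/']  # your list of start urls

    def parse(self, response):
        # Duplicate URLs are filtered by the scheduler, so each unique
        # page arrives here exactly once.
        filename = response.url.replace('://', '_').replace('/', '_')
        with open(filename, 'wb') as f:
            f.write(response.body)
        for href in response.xpath('//a/@href').extract():
            yield Request(urljoin(response.url, href))
```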

Scrapy: Spider optimization

python,web-scraping,scrapy,scrapy-spider
I'm trying to scrape an e-commerce web site, and I'm doing it in 2 steps. The website has a structure like this: the homepage has the links to the family-items and subfamily-items pages, and each family & subfamily page has a paginated list of products. Right now I have 2 spiders:...

Passing list as arguments in Scrapy

python,flask,scrapy,scrapy-spider
I am trying to build an application using Flask and Scrapy. I have to pass a list of urls to the spider. I tried using the following syntax: in the spider's __init__: self.start_urls = ["http://www.google.com/patents/" + x for x in u] and in the Flask method: u = ["US6249832", "US20120095946"] os.system("rm static/s.json; scrapy crawl patents...
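For reference, spider arguments arrive as strings, so a list travels most easily as a comma-separated `-a` value split in __init__. A sketch (the spider and argument names are assumptions):

```python
import scrapy

class PatentsSpider(scrapy.Spider):
    name = 'patents'

    def __init__(self, ids='', *args, **kwargs):
        super(PatentsSpider, self).__init__(*args, **kwargs)
        # ids arrives as one string, e.g. "US6249832,US20120095946"
        self.start_urls = ['http://www.google.com/patents/' + x
                           for x in ids.split(',') if x]
```

Invoked as `scrapy crawl patents -a ids=US6249832,US20120095946 -o static/s.json`; from Flask, CrawlerProcess (see the CrawlerProcess sketch earlier in this list) is usually cleaner than os.system.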

How to exclude a particular html tag(without any id) from several tags while using scrapy?

python,html,web-scraping,scrapy,scrapy-spider
<div class="region size2of3"> <h2>Mumbai</h2> <strong>Fort</strong> <div>Elphinstone building, Horniman Circle,</div> <div>Veer Nariman Road, Fort</div> <div>Mumbai 400001</div> <div>Timings: 08:00-00:30 hrs (Mon-Sun)</div> <div><br></div> </div> I want to exclude the "Timings: 08:00-00:30 hrs (Mon-Sun)" div tag while parsing. Here's my code: import scrapy from job.items import StarbucksItem class StarbucksSpider(scrapy.Spider): name =...

Distinguishing between HTML and non-HTML pages in Scrapy

python,html,web-crawler,scrapy,scrapy-spider
I am building a Spider in Scrapy that follows all the links it can find, and sends the url to a pipeline. At the moment, this is my code: from scrapy import Spider from scrapy.http import Request from scrapy.http import TextResponse from scrapy.selector import Selector from scrapyTest.items import TestItem import...
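For reference, Scrapy already decides this while decoding the response: only textual responses are built as TextResponse (HtmlResponse is a subclass), so a type check inside the callback is enough. A sketch:

```python
from scrapy.http import TextResponse

def parse(self, response):
    # Binary payloads (PDFs, images, archives) arrive as plain Response
    # objects, so non-HTML pages can be skipped with one isinstance check.
    if not isinstance(response, TextResponse):
        return
    # ... proceed with selectors / link extraction ...
```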

Scrapy Limit Requests For Testing

python,python-2.7,web-scraping,scrapy,scrapy-spider
I've been searching the scrapy documentation for a way to limit the number of requests my spiders are allowed to make. During development I don't want to sit here and wait for my spiders to finish an entire crawl; even though the crawls are pretty focused, they can still take...
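For development runs there are built-in CloseSpider settings that do exactly this, for reference. A sketch of the knobs (the values are arbitrary):

```python
# settings.py -- or per spider via custom_settings on Scrapy 1.0+
CLOSESPIDER_PAGECOUNT = 50    # stop after ~50 responses have been crawled
CLOSESPIDER_ITEMCOUNT = 100   # ... or after 100 items, whichever comes first
CLOSESPIDER_TIMEOUT = 120     # ... or after 2 minutes
```

The counts are approximate, since in-flight requests are still allowed to finish. They can also be passed ad hoc: `scrapy crawl myspider -s CLOSESPIDER_PAGECOUNT=50`.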

Rename output file after scrapy spider complete

python,scrapy,scrapy-spider,scrapyd
I am using Scrapy and Scrapyd to monitor certain sites. The output files are compressed jsonlines. Right after I submit a job schedule to scrapyd, I can see the output file being created and is growing as it scrapes. My problem is I can't be sure when the output file...
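For reference, the spider_closed signal fires once the crawl is over, which is typically a safe place to rename the output as a "done" marker. A sketch assuming a Scrapy version where spiders expose from_crawler (1.0+), with hypothetical filenames:

```python
import os

import scrapy
from scrapy import signals

class MonitorSpider(scrapy.Spider):
    name = 'monitor'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MonitorSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_closed, signal=signals.spider_closed)
        return spider

    def on_closed(self, spider):
        # By now the feed file should be complete; rename it so downstream
        # consumers know it is safe to pick up (names are hypothetical).
        os.rename('output.jl.gz', 'output.jl.gz.done')
```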

Scrapy crawl and follow links within href

python,web-scraping,scrapy,scrapy-spider
I am very new to scrapy. I need to follow the hrefs from the homepage of a url to multiple depths. Again, inside those href links I have multiple hrefs. I need to follow these hrefs until I reach my desired page to scrape. The sample html of my page is:...

Scrapy not giving individual results of all the reviews of a phone?

python,xpath,web-scraping,scrapy,scrapy-spider
This code is giving me results, but the output is not as desired. What is wrong with my xpath? And how do I iterate the rule by +10? I always have problems with these two. import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import Selector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from urlparse...

Scrapy python error - Missing scheme in request URL

python,web-crawler,scrapy,scrapy-spider
I'm trying to pull a file from a password protected FTP server. This is the code I'm using: import scrapy from scrapy.contrib.spiders import XMLFeedSpider from scrapy.http import Request from crawler.items import CrawlerItem class SiteSpider(XMLFeedSpider): name = 'site' allowed_domains = ['ftp.site.co.uk'] itertag = 'item' def start_requests(self): yield Request('ftp.site.co.uk/feed.xml', meta={'ftp_user': 'test', 'ftp_password':...

scrapy append to linkextractor links

python,web-scraping,scrapy,scrapy-spider
I am using CrawlSpider with LinkExtractor to crawl links. How would I go about appending parameters to the links LinkExtractor finds? I would like to add '?pag_sortorder=0&pag_perPage=999' to each link that LinkExtractor extracts....
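For reference, LinkExtractor's process_value hook rewrites each extracted URL before a request is made, which fits this exactly. A sketch:

```python
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

def add_paging_params(url):
    # Append the query string, respecting any existing '?'
    sep = '&' if '?' in url else '?'
    return url + sep + 'pag_sortorder=0&pag_perPage=999'

class PagedSpider(CrawlSpider):
    name = 'paged'
    rules = (
        Rule(LinkExtractor(process_value=add_paging_params),
             callback='parse_item', follow=True),
    )
```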

Scraping a huge site with scrapy never completes

scrapy,scrapy-spider
I'm scraping a site which has millions of pages and hundreds of thousands of items. I'm using the CrawlSpider with LxmlLinkExtractors to define the exact path between different types of pages. Everything works fine and my scraper doesn't follow unwanted links. However, the whole site never seems to be...

Extracting links with scrapy that have a specific css class

python,web-scraping,scrapy,screen-scraping,scrapy-spider
Conceptually simple question/idea. Using Scrapy, how do I use LinkExtractor so that it only follows links with a given CSS class? It seems trivial, like it should already be built in, but I don't see it. Is it? It looks like I can use an XPath, but I'd prefer using...
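For reference: newer Scrapy (1.0+) grew a restrict_css argument on LinkExtractor; on older versions an XPath class test does the same job. A sketch, with a hypothetical class name:

```python
from scrapy.contrib.linkextractors import LinkExtractor

# Scrapy 1.0+: restrict link extraction with a CSS selector directly
links = LinkExtractor(restrict_css='a.detail-link')

# Older versions: the equivalent class test spelled out in XPath
links_old = LinkExtractor(
    restrict_xpaths='//a[contains(concat(" ", normalize-space(@class), " "),'
                    ' " detail-link ")]')
```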

Scrapy creating XML feed wraps content in “value” tags

python,xml,scrapy,scrapy-spider
I've had a bit of help on here, and my code pretty much works. The only issue is that in the process of generating an XML, it wraps the content in "value" tags when I don't want it to. According to the docs this is due to this: Unless overridden...

How to read xml directly from URLs with scrapy/python

python,xml,web-scraping,scrapy,scrapy-spider
In Scrapy you will have to define start_urls. But how can I crawl from other urls as well? Up to now I have a login script which logs into a webpage. After logging in, I want to extract xml from different urls. import scrapy class LoginSpider(scrapy.Spider): name = 'example' start_urls...

delete spiders from scrapinghub

delete,web-crawler,scrapy,scrapy-spider,scrapinghub
I am a new user of scrapinghub. I already searched on Google and read the scrapinghub docs, but I could not find any information about removing spiders from a project. Is it possible, and how? I do not want to replace a spider; I want to delete/remove it from scrapinghub...

Multiple inheritance in scrapy spiders

python,regex,scrapy,multiple-inheritance,scrapy-spider
Is it possible to create a spider which inherits the functionality from two base spiders, namely SitemapSpider and CrawlSpider? I have been trying to scrape data from various sites and realized that not all sites have a listing of every page on the website, thus the need to use CrawlSpider. But...

Scrapy - Issue with xpath on an xml crawl

python,xml,xpath,scrapy,scrapy-spider
I'm trying to make a simple spider to grab some xml and spit it out in a new format for an experiment. However it seems there is extra code contained within the xml which is spat out. The format I want is like this (no extra code or value tag)...

Scrapy: If key exists, why do I get a KeyError?

python,list,key,scrapy,scrapy-spider
With items.py defined: import scrapy class CraigslistSampleItem(scrapy.Item): title = scrapy.Field() link = scrapy.Field() and populating each item via the spider thus: item = CraigslistSampleItem() item["title"] = $someXpath.extract() item["link"] = $someOtherXpath.extract() When I append these to a list (returned by parse()) and store this as e.g. a csv, I get two...
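The wrinkle, for reference: declaring a Field on an Item does not create the key; a field exists only once a value has been assigned, so reading an unset field raises KeyError. Item behaves like a dict, so the usual guards apply. A sketch:

```python
from myproject.items import CraigslistSampleItem  # hypothetical project path

item = CraigslistSampleItem()
item['title'] = ['Tonka bike']   # assigned, so the key now exists

# item['link'] would raise KeyError here: a declared Field does not
# create the key; it exists only once a value has been assigned.
link = item.get('link', '')      # safe: returns the default instead
has_link = 'link' in item        # False until 'link' is assigned
```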

Scrapy crawler ignores `DOWNLOADER_MIDDLEWARES` when run as a script

python,scrapy,scrapy-spider
I'd like to acquire data, using Scrapy, from a few different sites and perform some analysis on that data. Since both the crawlers and the code to analyze the data relate to the same project, I'd like to store everything in the same Git repository. I created a minimal...

Remove first tag html using python & scrapy

python,xpath,scrapy,scrapy-spider
I have this HTML: <div class="abc"> <div class="xyz"> <div class="needremove"></div> <p>text</p> <p>text</p> <p>text</p> <p>text</p> </div> </div> I used: response.xpath('//div[contains(@class,"abc")]/div[contains(@class,"xyz")]').extract() Result: [u'<div class="xyz"> <div class="needremove"></div> <p>text</p> <p>text</p> <p>text</p> <p>text</p> </div>'] I want to remove...

While scraping getting error instance method has no attribute '__getitem__'

python,web-scraping,scrapy,web-crawler,scrapy-spider
I can't understand why I am getting this error: instance method has no attribute '__getitem__'. I am just trying to scrape this site to extract the department names. import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import Selector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from urlparse import urljoin from amazon.items import AmazonItem...

Scrapy - Scrape multiple URLs using results from the first URL

python,scrapy,scrapy-spider
I use Scrapy to scrape data from the first URL. The first URL returns a response that contains a list of URLs. So far, so good. My question is: how can I further scrape this list of URLs? After searching, I know I can return a request in...
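The standard pattern, for reference: yield one Request per discovered URL and point it at a second callback. A sketch with hypothetical selectors (Scrapy 1.0+ style shown):

```python
import scrapy

class ListThenDetailSpider(scrapy.Spider):
    name = 'list_then_detail'
    start_urls = ['http://example.com/list']  # the first URL

    def parse(self, response):
        # The first response contains the list of URLs to scrape next
        for href in response.xpath('//a[@class="detail"]/@href').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_detail)

    def parse_detail(self, response):
        yield {'title': response.xpath('//h1/text()').extract()}
```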

Why is XMLFeedSpider failing to iterate through the designated nodes?

python,xml,rss,scrapy,scrapy-spider
I'm trying to parse through PLoS's RSS feed to pick up new publications. The RSS feed is located here. Below is my spider: from scrapy.contrib.spiders import XMLFeedSpider class PLoSSpider(XMLFeedSpider): name = "plos" itertag = 'entry' allowed_domains = ["plosone.org"] start_urls = [ ('http://www.plosone.org/article/feed/search' '?unformattedQuery=*%3A*&sort=Date%2C+newest+first') ] def parse_node(self, response, node): pass This...
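One likely explanation, for reference: the feed is Atom, whose elements live in a namespace, and XMLFeedSpider's default 'iternodes' iterator does not match namespaced itertags, so parse_node never fires. A sketch of the usual fix (the extracted field is an assumption; Scrapy 1.0+ style shown):

```python
from scrapy.contrib.spiders import XMLFeedSpider

class PLoSSpider(XMLFeedSpider):
    name = 'plos'
    allowed_domains = ['plosone.org']
    start_urls = [('http://www.plosone.org/article/feed/search'
                   '?unformattedQuery=*%3A*&sort=Date%2C+newest+first')]

    # Atom elements are namespaced, so declare the namespace, switch to
    # the 'xml' iterator, and use the prefixed itertag.
    iterator = 'xml'
    namespaces = [('atom', 'http://www.w3.org/2005/Atom')]
    itertag = 'atom:entry'

    def parse_node(self, response, node):
        # parse_node must yield/return items or requests; an empty body
        # produces no output even when nodes do match.
        yield {'title': node.xpath('atom:title/text()').extract()}
```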

Crawl spider not crawling ~ Rule Issue

python,web-scraping,scrapy,scrapy-spider
I am having an issue with a spider that I am programming. I am trying to recursively scrape the courses off my university's website but I am having great trouble with Rule and LinkExtractor. # -*- coding: utf-8 -*- import scrapy from scrapy.spider import Spider from scrapy.contrib.spiders import CrawlSpider, Rule...

Simple scrapy XML spider syntax error [closed]

python,xml,scrapy,scrapy-spider
I was just trying to make a simple spider using scrapy to grab data from an XML file. This is what I came up with: from scrapy.contrib.spiders import XMLFeedSpider class MySpider(XMLFeedSpider): name = 'testproject' allowed_domains = ['www.w3schools.com'] start_urls = ['http://www.w3schools.com/xml/note.xml'] itertag = 'note' def parse_node(self, response, node): to = node.select('to/text()').extract()...

How to get scrapy results orderly?

python,web-scraping,scrapy,scrapy-spider
Help me with scrapy. My code produces output, but it doesn't print it in the correct order. I also tried putting it inside another for loop, but that did not give the correct result either. Anyway, if you find something missing in there, please tell me. Code: import scrapy class YelpScrapy(scrapy.Spider): name = 'yelp' start_urls...

Is there a way using scrapy to export each item that is scrapped into a separate json file?

web-scraping,scrapy,scrapy-spider
Currently I am using `yield item` after every item I scrape, but it gives me all the items in one single JSON file.
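Not out of the box, but a small item pipeline can do it, for reference. A sketch (the filename scheme is an assumption):

```python
import json

class PerItemJsonPipeline(object):
    """Writes every scraped item into its own JSON file."""

    def open_spider(self, spider):
        self.count = 0

    def process_item(self, item, spider):
        self.count += 1
        path = 'item-%05d.json' % self.count
        with open(path, 'w') as f:
            json.dump(dict(item), f)
        return item  # pass the item on to any later pipelines
```

Enable it in settings, e.g. `ITEM_PIPELINES = {'myproject.pipelines.PerItemJsonPipeline': 300}` (the dotted path is hypothetical).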

scrapy itemloaders return list of items

scrapy,scrapy-spider
def parse(self, response): for link in LinkExtractor(restrict_xpaths="BLAH",).extract_links(response)[:-1]: yield Request(link.url) l = MytemsLoader() l.add_value('main1', some xpath) l.add_value('main2', some xpath) l.add_value('main3', some xpath) rows = response.xpath("table[@id='BLAH']/tbody[contains(@id, 'BLOB')]") for row in rows: l.add_value('table1', some xpath based on rows) l.add_value('table2', some xpath based on rows) l.add_value('main3', some xpath based on rows) yield l.loaditem() I am...

Web scraping error: exceptions.MemoryError

python,web-scraping,scrapy,scrapy-spider
I'm trying to download data from gsmarena. Sample code to download the HTC One ME spec from the following site: "http://www.gsmarena.com/htc_one_me-7275.php" is given below. The data on the website is classified in the form of tables and table rows. The data is of the format: table header > td[@class='ttl'] >...

scrapy crawling at depth not working

python,scrapy,scrapy-spider
I am writing scrapy code to crawl the first page and one additional depth of a given webpage. Somehow my crawler doesn't enter the additional depth; it just crawls the given starting urls and ends its operation. I added a filter_links callback function, but even that's not getting called, so clearly the rules are being ignored. What...

Cannot download image with relative URL Python Scrapy

python,web-crawler,scrapy,scrapy-spider
I'm using Scrapy to download images from http://www.vesselfinder.com/vessels However, I can only get the relative url of images, like this: http://www.vesselfinder.com/vessels/ship-photo/0-227349190-7c01e2b3a7a5078ea94fff9a0f862f8a/0 All of the images are named 0.jpg, but if I try to use that absolute url, I cannot get access to the image. My code: items.py import scrapy class VesselItem(scrapy.Item):...

Why is xpath selecting only the last <li> inside the <ul>?

python,web-scraping,scrapy,scrapy-spider
I'm trying to scrape this site: http://www.kaymu.com.ng/. The part of the HTML I'm scraping is like this: <ul id="navigation-menu"> <li> some content </li> <li> some content </li> ... <li> some content </li> </ul> This is my spider: class KaymuSpider(Spider): name = "kaymu" allowed_domains = ["kaymu.com.ng"] start_urls = [...