How to get started with Scrapy Web Service? [closed]

web-services,scrapy
I have been using Scrapy for a long time now and I must say I am in love with it. Recently, I got to know about Scrapy Web Service. But I am unable to figure out how it works, or how I can use it to monitor my current spiders....

Scrapy xpath construction for tables of data - yielding empty brackets

html,xpath,scrapy
I am attempting to build out xpath constructs for data items I would like to extract from several hundred pages of a site that are all formatted the same. An example site is https://weedmaps.com/dispensaries/cannabicare As can be seen the site has headings and within those headings are rows of item...

Scrapy using python3 directory (Ubuntu)

python,scrapy
I installed scrapy with sudo pip install scrapy. Installed without errors, but it won't run. scrapy --version returns this: Traceback (most recent call last): File "/usr/local/bin/scrapy", line 9, in <module> load_entry_point('Scrapy==0.24.6', 'console_scripts', 'scrapy')() File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 521, in load_entry_point return get_distribution(dist).load_entry_point(group, name) File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2632, in load_entry_point return ep.load()...

How to update the new page source every time in scrapy xpath while using selenium?

python,selenium,xpath,web-scraping,scrapy
This Selenium setup merged with Scrapy is working fine, with only one problem: I need to update the sites = response.xpath() every time with the new source code the page generates; otherwise it returns the same repetitive results again and again. import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import...

Get JavaScript function call value using Selenium

python,selenium,selenium-webdriver,web-scraping,scrapy
I am scraping web pages using python-scrapy, which works pretty well for static content. I am trying to scrape a url from this page, but as it turns out, it is returned through a javascript call. For this I am using selenium, but I am unable to figure out how to do...

Remove first HTML tag using python & scrapy

python,xpath,scrapy,scrapy-spider
I have this HTML: <div class="abc"> <div class="xyz"> <div class="needremove"></div> <p>text</p> <p>text</p> <p>text</p> <p>text</p> </div> </div> I used: response.xpath('//div[contains(@class,"abc")]/div[contains(@class,"xyz")]').extract() Result: [u'<div class="xyz"> <div class="needremove"></div> <p>text</p> <p>text</p> <p>text</p> <p>text</p> </div>'] I want to remove...
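
A possible approach (a sketch, keyed to the class names shown in the question) is to skip the unwanted child in the XPath itself, by selecting every child element except the one carrying the needremove class:

    # Select all children of div.xyz except the div with class "needremove"
    parts = response.xpath(
        '//div[contains(@class,"abc")]/div[contains(@class,"xyz")]'
        '/*[not(contains(@class,"needremove"))]').extract()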

How to read xml directly from URLs with scrapy/python

python,xml,web-scraping,scrapy,scrapy-spider
In Scrapy you will have to define start_urls. But how can I crawl from other urls as well? Up to now I have a login script which logs into a webpage. After logging in, I want to extract xml from different urls. import scrapy class LoginSpider(scrapy.Spider): name = 'example' start_urls...
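
A minimal sketch of the usual pattern, with hypothetical URLs and form field names: once the login request succeeds, further Requests can be yielded from its callback, each with its own parsing callback:

    import scrapy

    class LoginSpider(scrapy.Spider):
        name = 'example'
        start_urls = ['http://www.example.com/login']

        def parse(self, response):
            # Fill and submit the login form found on the page
            return scrapy.FormRequest.from_response(
                response,
                formdata={'username': 'john', 'password': 'secret'},
                callback=self.after_login)

        def after_login(self, response):
            # Logged in: now request the XML pages
            for url in ['http://www.example.com/feed1.xml',
                        'http://www.example.com/feed2.xml']:
                yield scrapy.Request(url, callback=self.parse_xml)

        def parse_xml(self, response):
            response.selector.remove_namespaces()  # simplify namespaced XML
            for node in response.xpath('//item'):
                # plain dicts work as items on Scrapy >= 1.0
                yield {'title': node.xpath('title/text()').extract()}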

Scrapy middleware setup

python,web-scraping,web-crawler,scrapy
I am trying to access a public proxy using scrapy to get some data. I get the following error when I try to run the code: ImportError: Error loading object 'craiglist.middlewares.ProxyMiddleware': No module named middlewares I've created a middlewares.py file with the following code: import base64 # Start your middleware class class ProxyMiddleware(object):...
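
This error usually means the file is not where the dotted path says it is. A sketch of the layout and setting implied by the path 'craiglist.middlewares.ProxyMiddleware' (the priorities are illustrative):

    # craiglist/               <- project package (contains __init__.py and settings.py)
    #     middlewares.py       <- ProxyMiddleware must live here
    #     spiders/

    # settings.py
    DOWNLOADER_MIDDLEWARES = {
        'craiglist.middlewares.ProxyMiddleware': 100,
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    }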

Scrapy follow link and collect email

python,web-scraping,web-crawler,scrapy
I need help with saving emails with Scrapy. The row in the .csv file where emails are supposed to be collected is blank. Any help is very appreciated. Here is the code: # -*- coding: utf-8 -*- import scrapy # item class included here class DmozItem(scrapy.Item): # define the fields for...

Memory Leak in Scrapy

python,web-scraping,scrapy
I wrote the following code to scrape for email addresses (for testing purposes): import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors import LinkExtractor from scrapy.selector import Selector from crawler.items import EmailItem class LinkExtractorSpider(CrawlSpider): name = 'emailextractor' start_urls = ['http://news.google.com'] rules = ( Rule (LinkExtractor(), callback='process_item', follow=True),) def process_item(self, response):...

Why is scrapy not storing data into mongodb?

python,mongodb,web-scraping,scrapy,scrapy-spider
My main File: import scrapy from scrapy.exceptions import CloseSpider from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.http import Request class Product(scrapy.Item): brand = scrapy.Field() title = scrapy.Field() link = scrapy.Field() name = scrapy.Field() title = scrapy.Field() date = scrapy.Field() heading = scrapy.Field() data = scrapy.Field() Model_name =...

Scrapy LinkExtractor not working

scrapy
Having trouble with my Scrapy crawler following links. Below is my code. I want it to essentially go to YouTube pages, pull the twitter links, and then call parse_page3 and pull in information, but right now only the parse_page2 extraction part is working. Thanks! Eric import scrapy from scrapy.contrib.spiders import...

How to crawl a site and parse only pages that match a RegEx using Scrapy 0.24

python,regex,scrapy
I'm using Scrapy 0.24 on Python 2.7.9 on a Windows 64-bit machine. I'm trying to tell scrapy to start at a specific URL http://www.allen-heath.com/products/ and from there only gather data from pages where the url includes the string ahproducts. Unfortunately, when I do this no data is scraped at all....

Scrapy running from python script processes only start url

python,python-2.7,scrapy
I have written a Scrapy CrawlSpider. class SiteCrawlerSpider(CrawlSpider): name = 'site_crawler' def __init__(self, start_url, **kw): super(SiteCrawlerSpider, self).__init__(**kw) self.rules = ( Rule(LinkExtractor(allow=()), callback='parse_start_url', follow=True), ) self.start_urls = [start_url] self.allowed_domains = tldextract.extract(start_url).registered_domain def parse_start_url(self, response): external_links = LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response) for link in external_links: i =...

Is there a way using scrapy to export each item that is scraped into a separate json file?

web-scraping,scrapy,scrapy-spider
Currently I am using "yield item" after every item I scrape, but it gives me all the items in one single JSON file.
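
There is no built-in exporter for one file per item, but a small custom item pipeline can do it; a sketch (directory and naming scheme are arbitrary):

    # pipelines.py -- write every scraped item to its own JSON file
    import json
    import os

    class JsonPerItemPipeline(object):

        def open_spider(self, spider):
            self.counter = 0
            if not os.path.isdir('items'):
                os.makedirs('items')

        def process_item(self, item, spider):
            self.counter += 1
            path = os.path.join('items', '%s_%d.json' % (spider.name, self.counter))
            with open(path, 'w') as f:
                json.dump(dict(item), f)
            return item

Enable it in settings.py with ITEM_PIPELINES = {'myproject.pipelines.JsonPerItemPipeline': 300} (project name assumed).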

Scrapy follow pagination AJAX Request - POST

ajax,post,pagination,xmlhttprequest,scrapy
I am quite new to scrapy and have built a few spiders. I am trying to scrape reviews from this page. My spider so far crawls the first page and scrapes those items, but when it comes to pagination it does not follow links. I know this happens because it...

Scrapy: how can I get the content of pages whose response.status=302?

web-scraping,scrapy,scrape,scrapy-spider
I get the following log when crawling: DEBUG: Crawled (302) <GET http://fuyuanxincun.fang.com/xiangqing/> (referer: http://esf.hz.fang.com/housing/151__1_0_0_0_2_0_0/) DEBUG: Scraped from <302 http://fuyuanxincun.fang.com/xiangqing/> But it actually returns nothing. How can I deal with these responses with status=302? Any help would be much appreciated!...
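
By default the redirect middleware follows 302s, so the callback never sees them. One approach (a sketch) is to list the status code in handle_httpstatus_list so the 302 response itself is handed to the spider:

    class MySpider(scrapy.Spider):
        name = 'example'
        handle_httpstatus_list = [302]  # deliver 302 responses to the callback

        def parse(self, response):
            if response.status == 302:
                # A 302 body is usually empty; the target is in the header
                target = response.headers.get('Location')
                # ... decide whether to request `target` manually
                return
            # ... normal parsing for 200 responses

Alternatively, REDIRECT_ENABLED = False in settings.py turns redirect-following off globally.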

Stuck scraping a specific table with scrapy

python,xpath,scrapy
So the table I'm trying to scrape can be found here: http://www.betdistrict.com/tipsters I'm after the table titled 'June Stats'. Here's my spider: from __future__ import division from decimal import * import scrapy import urlparse from ttscrape.items import TtscrapeItem class BetdistrictSpider(scrapy.Spider): name = "betdistrict" allowed_domains = ["betdistrict.com"] start_urls = ["http://www.betdistrict.com/tipsters"] def...

Scrapy not entering parse method

python,selenium,web-scraping,web-crawler,scrapy
I don't understand why this code is not entering the parse method. It is pretty similar to the basic spider examples from the doc: http://doc.scrapy.org/en/latest/topics/spiders.html And I'm pretty sure this worked earlier in the day... Not sure if I modified something or not.. from selenium.webdriver.support.wait import WebDriverWait from selenium.webdriver.common.by import...

python install lxml on mac os 10.10.1

python,osx,python-2.7,scrapy,lxml
I bought a new macbook and I am so new to mac os. However, I read a lot on the internet about how to install Scrapy. I did everything, but I have a problem with installing lxml. I tried pip install lxml in the terminal and a lot of stuff started...

Scrapy: retain original order of scraped items in the output

python,scrapy
I have the following Scrapy spider to get the status of the pages from the list of urls in the file url.txt import scrapy from scrapy.contrib.spiders import CrawlSpider from pegasLinks.items import StatusLinkItem class FindErrorsSpider(CrawlSpider): handle_httpstatus_list = [404,400,401,500] name = "findErrors" allowed_domains = ["domain-name.com"] f = open("urls.txt") start_urls = [url.strip() for...
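
Responses come back in whatever order the downloader finishes them, so order has to be imposed explicitly. A sketch of one approach: tag each request with its position in urls.txt via request.meta, store it on the item, and sort the output afterwards (the index field name is made up here and must be declared on the item):

    def start_requests(self):
        with open('urls.txt') as f:
            for index, url in enumerate(f):
                yield scrapy.Request(url.strip(), callback=self.parse,
                                     meta={'index': index})

    def parse(self, response):
        item = StatusLinkItem()
        item['index'] = response.meta['index']  # sort key for post-processing
        item['status'] = response.status
        yield item

A cruder alternative is CONCURRENT_REQUESTS = 1 in settings.py, which mostly preserves order at the cost of speed.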

new to scrapy and table data extraction

python,web-scraping,scrapy
First day on Scrapy, and I want to get the table data from http://www.cottagehealthsystem.org/tabid/149/Default.aspx, i.e. Administration 569-7290 Anesthesia 569-7206 Birth Center 569-7232 Cancer Data Center 569-8280 Cardiac Care Unit 569-7222 Cardiac Electrophysiology 569-8234 Cardiac Rehabilitation 569-7201 Cardiology 569-8284 etc. I did this: scrapy shell "http://www.cottagehealthsystem.org/tabid/149/Default.aspx" response.selector.xpath('//table//td//text()').extract()...

Extracting links with scrapy that have a specific css class

python,web-scraping,scrapy,screen-scraping,scrapy-spider
Conceptually simple question/idea. Using Scrapy, how do I use a LinkExtractor that only follows links with a given CSS class? Seems trivial and like it should already be built in, but I don't see it? Is it? It looks like I can use an XPath, but I'd prefer using...
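
In older Scrapy versions LinkExtractor has no CSS parameter, but restrict_xpaths can point directly at the anchors you want; newer versions (1.0+) also accept restrict_css. A sketch (the class name is made up):

    from scrapy.contrib.linkextractors import LinkExtractor  # scrapy.linkextractors on newer versions

    # Only extract links from <a> elements carrying the given class
    extractor = LinkExtractor(restrict_xpaths='//a[contains(@class, "next-page")]')

    # On Scrapy >= 1.0 the same thing can be written with a CSS selector:
    # extractor = LinkExtractor(restrict_css='a.next-page')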

Scrapy Memory Error (too many requests) Python 2.7

python,django,python-2.7,memory,scrapy
I've been running a crawler in Scrapy to crawl a large site I'd rather not mention. I use the tutorial spider as a template, then I created a series of starting requests and let it crawl from there, using something like this: def start_requests(self): f = open('zipcodes.csv', 'r') lines =...
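
If memory is being exhausted by the huge backlog of pending Requests, Scrapy's scheduler can keep its queues on disk instead of in memory by giving the crawl a job directory; a sketch (the path is arbitrary):

    # settings.py
    JOBDIR = 'crawls/zipcode-run1'   # scheduler queues spill to disk here

    # or one-off on the command line:
    #   scrapy crawl myspider -s JOBDIR=crawls/zipcode-run1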

Scrapy returning zero results

python,scrapy,scrapy-spider
I am attempting to learn how to use scrapy, and am trying to do what I think is a simple project. I am attempting to pull 2 pieces of data from a single webpage - crawling additional links isn't needed. However, my code seems to be returning zero results. I...

How can I initialize a Field() to contain a nested python dict?

python,web-scraping,scrapy
I have a Field() in my items.py called: scores = Field() I want multiple scrapers to append a value to a nested dict inside scores. For example, one of my scrapers: item['scores']['baseball_score'] = '92' And another scraper would: item['scores']['basket_score'] = '21' So that when I retrieve scores: > item['scores'] {...
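
A Field stores whatever Python object is assigned to it, so the main thing is to create the dict before the first nested assignment; a sketch (item class name assumed):

    item = MyItem()
    if 'scores' not in item:      # Field was never assigned; avoid a KeyError
        item['scores'] = {}
    item['scores']['baseball_score'] = '92'
    item['scores']['basket_score'] = '21'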

scraping url and title from nested anchor tag

python,web-scraping,scrapy
This is my first scraper using scrapy. I am trying to scrape video urls and titles from the https://www.google.co.in/trends/hotvideos#hvsm=0 site. import scrapy from scrapy.item import Item, Field from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector class CraigslistItem(Item): title = Field() link = Field() class DmozSpider(scrapy.Spider): name = "google" allowed_domains = ["google.co.in"] start_urls...

Scrapy extracting from Link

python,scrapy,scrapy-spider
I am trying to extract information from certain links, but I never get to the links; I extract from the start_url and I am not sure why. Here is my code: import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from tutorial.items import DmozItem from scrapy.selector...

How to reset standard dupefilter in scrapy

scrapy
For some reason I would like to reset the list of seen urls that scrapy maintains internally at some point of my spider code. I know that by default scrapy uses the RFPDupeFilter class and that there is a fingerprint set. How can this set be cleared within spider code?...
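
There is no public API for this. Two options, both sketches: clear the fingerprint set by reaching into Scrapy internals (fragile, and the attribute chain can change between versions), or bypass the filter for particular requests with dont_filter:

    # Option 1: reach into the scheduler's dupefilter (internal attributes!)
    self.crawler.engine.slot.scheduler.df.fingerprints.clear()

    # Option 2: skip duplicate filtering for a specific request
    yield scrapy.Request(url, callback=self.parse, dont_filter=True)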

Scrapy: Wait for a specific url to be parsed before parsing others

python,scrapy
Brief Explanation: I have a Scrapy project that takes stock data from Yahoo! Finance. In order for my project to work, I need to ensure that a stock has been around for a desired amount of time. I do this by scraping CAT (Caterpillar Inc. (CAT) -NYSE) first, get the...

Is there a way to prevent parsing responses that are redirected from another page?

python,scrapy
I'm using a CrawlSpider to scrape a site. Some requests match a rule with a callback but are redirected to another page, and Scrapy is parsing these responses anyway. Is there a way to prevent this from happening?
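
RedirectMiddleware records the chain of URLs in response.meta['redirect_urls'], so a callback can detect and skip redirected responses; a sketch:

    def parse_item(self, response):
        if response.meta.get('redirect_urls'):
            return  # this response arrived via a redirect; ignore it
        # ... parse direct responses as usual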

Rename output file after scrapy spider complete

python,scrapy,scrapy-spider,scrapyd
I am using Scrapy and Scrapyd to monitor certain sites. The output files are compressed jsonlines. Right after I submit a job schedule to scrapyd, I can see the output file being created and is growing as it scrapes. My problem is I can't be sure when the output file...

xpath acting strangely in scrapy

html,xpath,scrapy
Assume I have this code: <div class="page-header" align="center"> <h4>[<a href='[email protected]%200DAY' data-placement='top' rel='tooltip' data-original-title='Browse 0DAY'><strong>FIRST</strong></a>] SECOND-</a><a href=/[email protected]%20GUSH rel='tooltip' data-original-title='Find more from GUSH'><b>THIRD</b></a> <h6>FOUR<br> <br/></h6> Search: <a href="https://xxx1">xxx</a>, </h4> <br/> </div> I want to...

what would be the right way of doing getallAttributes()

python,xpath,web-scraping,scrapy
I am trying to read the properties (attributes) of a given element. I want to extract a dictionary of all the attribute name-value pairs. What I am currently doing is using a regex and listing all the property values. But the problem here is, it only displays the value of the property...
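
Each Selector wraps an lxml element, whose attrib mapping already holds every attribute as name-value pairs; a sketch (recent Scrapy/parsel versions expose the element as .root; very old versions used ._root):

    for link in response.xpath('//a'):
        attrs = dict(link.root.attrib)  # e.g. {'href': '...', 'class': '...'}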

scrapy crawling at depth not working

python,scrapy,scrapy-spider
I am writing scrapy code to crawl the first page and one additional depth of a given webpage. Somehow my crawler doesn't enter the additional depth; it just crawls the given starting urls and ends its operation. I added a filter_links callback function, but even that's not getting called, so clearly the rules are getting ignored. What...

Python/Scrapy: Scraping Nasdaq's data? [closed]

javascript,python,selenium,scrapy,scrape
I am comfortable scraping most sites with Scrapy; however, I have never tried getting dynamic content from javascript, and I am running into a lot of conflicting arguments about how to start learning. I am attempting to scrape revenue data from the table at: http://www.nasdaq.com/symbol/scmp/revenue-eps I have heard a...

xpath: how to select items between item A and item B

xpath,scrapy
I have an HTML page with this structure: <big><b>Staff in:</b></big> <br> <a href='...'>Movie 1</a> <br> <a href='...'>Movie 2</a> <br> <a href='...'>Movie 3</a> <br> <br> <big><b>Cast in:</b></big> <br> <a href='...'>Movie 4</a> How do I select Movies 1, 2, and 3 using Xpath? I wrote this query '//big/b[text()="Staff in:"]/following::a' but it returns...
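
One way to express "after 'Staff in:' but before 'Cast in:'" is to keep only those following-sibling anchors that still have the 'Cast in:' heading ahead of them; a sketch of that query:

    staff_movies = response.xpath(
        '//big[b="Staff in:"]/following-sibling::a'
        '[following::big[b="Cast in:"]]/text()').extract()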

Does anyone have an example to store crawl data from scrapy to MySQLdb using Peewee?

python,mysql,scrapy,mysql-python,peewee
I'm new to Scrapy. I have Googled around and searched on Stack Overflow, but there's nothing that matches exactly what I want to do. I have been struggling with this for two days. This is what I have gotten so far for pipelines.py. Would anyone point out what's wrong with it or...

Scrapy writing XPath expression for unknown depth

html,xpath,web-scraping,scrapy
I have an html file which is like: <div id='author'> <div> <div> ... <a> John Doe </a> I do not know how many div's would be under the author div. It may have different depth for different pages. So what would be the XPath expression for this kind of xml?...
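
The descendant axis matches at any depth, so the nesting level under the author div is irrelevant; a sketch:

    # Any <a> anywhere beneath the div with id="author", however deeply nested
    response.xpath('//div[@id="author"]//a/text()').extract()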

How to set a rule according to the current URL?

python,scrapy
I'm using Scrapy and I want to be able to have more control over the crawler. To do this I would like to set rules depending on the current URL that I am processing. For example if I am on example.com/a I want to apply a rule with LinkExtractor(restrict_xpaths='//div[@class="1"]'). And...

Using Selenium with Scrapy disables the Pipeline feature? How can I re-enable it?

mongodb,selenium,twitter,scrapy
I'm currently coding a Twitter Scraper using Scrapy to scrape and process the data, and Selenium as an automation tool, as Twitter itself is an interactive page so I can "scroll-down" the tweets and get more data in one sweep. Using the MongoDB Pipeline I've set, it should theoretically send...

Scrapy Xpath construction producing empty brackets on dynamic site

python,selenium,selenium-webdriver,web-scraping,scrapy
I am trying to create a spider via scrapy to crawl a website and extract all links for specific stores. Ultimately, the spider would then use those store links to extract pricing information. The site is designed to break down store information into States and Regions. I have been able...

Scrapy crawler not processing XHR Request

python,web-scraping,xmlhttprequest,scrapy,scrape
My spider is only crawling the first 10 pages, so I am assuming it is not triggering the load more button through the Request. I am scraping this website: http://www.t3.com/reviews. My spider code: import scrapy from scrapy.conf import settings from scrapy.http import Request from scrapy.selector import Selector from reviews.items import...

Scraping dynamic content using python-Scrapy

python,web-scraping,scrapy
Disclaimer: I've seen numerous other similar posts on StackOverflow and tried to do it the same way, but they don't seem to work on this website. I'm using Python-Scrapy for getting data from koovs.com. However, I'm not able to get the product size, which is dynamically generated. Specifically, if...

Scrapy ImportError: cannot import name Request

python-2.7,scrapy
I have seen similar questions on this forum, but this is different from those. I have an Item class like this ... class NewCarItem(Item): car_petrol_engine_type = Item() car_petrol_engine_size = Item() car_petrol_engine_max_power = Item() car_petrol_engine_max_torque = Item() car_petrol_engine_fuel_supply_system = Item() car_diesel_engine_type = Item() car_diesel_engine_size = Item() car_diesel_engine_max_power = Item() car_diesel_engine_max_torque = Item()...

Get the links from website using scrapy

python,scrapy
I am trying to extract the links from one class and store them using scrapy. I am not really sure what the problem is. Here is the code: import scrapy from tutorial.items import DmozItem class DmozSpider(scrapy.Spider): name = "dmoz" allowed_domains = ["craigslist.org"] start_urls = [ "http://losangeles.craigslist.org/search/jjj" ] def parse(self, response): for...

Scraping a huge site with scrapy never completes

scrapy,scrapy-spider
I'm scraping a site which has millions of pages and about hundreds of thousands of items. I'm using the CrawlSpider with LxmlLinkExtractors to define the exact path between different types of pages. Everything works fine and my scraper doesn't follow unwanted links. However, the whole site never seems to be...

SgmlLinkExtractor in scrapy

web-crawler,scrapy,rules,extractor
I need some enlightenment about SgmlLinkExtractor in scrapy. For the link: example.com/YYYY/MM/DD/title i would write: Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\d{2}/\w+']), callback='parse_example')] For the link: example.com/news/economic/title should i write: r'\news\category\w+'or r'\news\w+/\w+' ? (category changes but the url contains always news) For the link: example.com/article/title should i write: r'\article\w+' ? (the url contains always article)...

Scrapy Limit Requests For Testing

python,python-2.7,web-scraping,scrapy,scrapy-spider
I've been searching the scrapy documentation for a way to limit the number of requests my spiders are allowed to make. During development I don't want to sit here and wait for my spiders to finish an entire crawl, even though the crawls are pretty focused they can still take...
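
The CloseSpider extension exposes settings that stop a crawl once a threshold is crossed; a sketch (values are arbitrary, and in-flight requests still finish, so the counts are approximate):

    # settings.py -- stop the crawl early during development
    CLOSESPIDER_PAGECOUNT = 100  # close after ~100 downloaded responses
    CLOSESPIDER_ITEMCOUNT = 50   # or after ~50 scraped items
    CLOSESPIDER_TIMEOUT = 300    # or after 300 seconds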

Scrapy delay request

python,web-crawler,scrapy
Every time I run my code, my IP gets banned. I need help to delay each request for 10 seconds. I've tried to place DOWNLOAD_DELAY in code but it gives no results. Any help is appreciated. # item class included here class DmozItem(scrapy.Item): # define the fields for your item...
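
DOWNLOAD_DELAY is a setting, not spider-body code, so it belongs in settings.py or as a download_delay attribute on the spider; a sketch:

    # settings.py
    DOWNLOAD_DELAY = 10  # seconds between consecutive requests to the same site

    # or per spider:
    class DmozSpider(scrapy.Spider):
        name = "dmoz"
        download_delay = 10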

Using ItemLoader but adding XPath, values etc. in Scrapy

python,xpath,web-scraping,scrapy,scrapy-spider
Currently I'm using the XPathItemLoader to scrape data: def parse_product(self, response): items = [] l = XPathItemLoader(item=MyItem(), response=response) l.default_input_processor = MapCompose(lambda v: v.split(), replace_escape_chars) l.default_output_processor = Join() l.add_xpath('name', 'div[2]/header/h1/text()') items.append(l.load_item()) return items and needed the v.split() to get rid of some spaces - that's working fine. But how can I...

delete spiders from scrapinghub

delete,web-crawler,scrapy,scrapy-spider,scrapinghub
I am a new user of scrapinghub. I already searched on Google and read the scrapinghub docs, but I could not find any information about removing spiders from a project. Is it possible, and how? I do not want to replace a spider; I want to delete/remove it from scrapinghub...

scrapy form-filling when form posts to a second web page

python,scrapy
New to scrapy and wondering if anyone can point me to a sample project using scrapy to submit to HTML forms that have hidden fields in cases where the action page of the form is not the same address as where the form itself is presented. What is the easiest...

Installing Scrapy on Mac OSX 10.9.5

osx,python-2.7,scrapy,pip,pip-install-cryptography
I'm new to Python and have hit a wall with installing scrapy. Environment Details: MacBook pro OSX 10.9.5 XCode and Command Line utilities are installed Python 2.7.9 is installed in /usr/local/bin/python Python 2.7.5 (distributed as part of OSX) is installed in /usr/bin/python using pip install Approach tried to date Initial...

Make shell prompt wait till all processes complete

bash,shell,process,scrapy
I am running a shell script from a java program. The shell script will start multiple scrapy crawlers. #!/bin/bash cd /Users/renny/Documents/WorkSpaces/Scrapy/tutorial scrapy crawl flipkart -a key="$1" -o "$2"flipkart.xml scrapy crawl myntra -a key="$1" -o "$2"myntra.xml scrapy crawl jabong -a key="$1" -o "$2"jabong.xml The shell script will not wait for the...

Distinguishing between HTML and non-HTML pages in Scrapy

python,html,web-crawler,scrapy,scrapy-spider
I am building a Spider in Scrapy that follows all the links it can find, and sends the url to a pipeline. At the moment, this is my code: from scrapy import Spider from scrapy.http import Request from scrapy.http import TextResponse from scrapy.selector import Selector from scrapyTest.items import TestItem import...

Why is scrapy not giving all the results, and why is the rules part not working?

python,xpath,web-scraping,web-crawler,scrapy
This script only provides me with the first result, i.e. .extract()[0]; if I change 0 to 1, then the next item. Why is it not iterating over the whole xpath? The rule part is also not working. I know the problem is in the response.xpath. How to deal with...

AttributeError: 'module' object has no attribute 'Spider'

python,scrapy,scrapy-spider
I just started to learn scrapy. So I followed the scrapy documentation. I just written the first spider mentioned in that site. import scrapy class DmozSpider(scrapy.Spider): name = "dmoz" allowed_domains = ["dmoz.org"] start_urls = [ "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/", "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/" ] def parse(self, response): filename = response.url.split("/")[-2] with open(filename, 'wb') as f: f.write(response.body)...

Send Post Request in Scrapy

python,rest,post,curl,scrapy
I am trying to crawl the latest reviews from the google play store, but I need to make a post request to get them. With Postman the post request works and I get the desired response, but a post request from the terminal gives me a server error. For ex: for...
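
In Scrapy a POST is usually made with FormRequest, which form-encodes the body the way most such endpoints expect; a sketch with a made-up endpoint and parameters:

    yield scrapy.FormRequest(
        'https://play.google.com/store/getreviews',  # endpoint and fields are illustrative
        formdata={'id': 'com.example.app', 'reviewSortOrder': '0', 'pageNum': '1'},
        callback=self.parse_reviews)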

Xpath text() wrong output

python,xpath,web-scraping,scrapy
This is my first scrapy program! I'm writing a program using python/scrapy, and I've tested my XPath in FirePath and it works perfectly, but it is not displaying properly in the console (still in the early testing phase). What I'm doing is attempting to scrape a page of amazon reviews....

How to crawl links on all pages of a web site with Scrapy

website,web-crawler,scrapy,extract
I'm learning about scrapy and I'm trying to extract all links that contain "http://lattes.cnpq.br/" followed by a sequence of numbers, example: http://lattes.cnpq.br/0281123427918302 But I don't know which page on the web site contains this information. For example this web site: http://www.ppgcc.ufv.br/ The links that I want are on this page: http://www.ppgcc.ufv.br/?page_id=697 What...

how to output crawled data from multiple webpages into a csv file using python with scrapy

python-2.7,web-scraping,web-crawler,scrapy,scrapy-spider
I have the following code below which crawls all the available pages from a website. This is perfectly crawling the valid pages, because when I use the print function I can see the data from the 'items' list, but I don't see any output when I try to use .csv...

Scrapy crawler ignores `DOWNLOADER_MIDDLEWARES` when run as a script

python,scrapy,scrapy-spider
I'd like to acquire data, using Scrapy, from a few different sites and perform some analysis on that data. Since both the crawlers and the code to analyze the data relate to the same project, I'd like to store everything in the same Git repository. I created a minimal...

Scrapy: If key exists, why do I get a KeyError?

python,list,key,scrapy,scrapy-spider
With items.py defined: import scrapy class CraigslistSampleItem(scrapy.Item): title = scrapy.Field() link = scrapy.Field() and populating each item via the spider thus: item = CraigslistSampleItem() item["title"] = $someXpath.extract() item["link"] = $someOtherXpath.extract() When I append these to a list (returned by parse()) and store this as e.g. a csv, I get two...
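
Declaring a field does not populate it: a scrapy.Item raises KeyError for any declared field that was never assigned, and an XPath that matched nothing leads to exactly that. A sketch of the usual guards (the XPath is illustrative):

    title = response.xpath('//p[@class="title"]/text()').extract()  # may be []
    item['title'] = title[0] if title else ''

    # or, when reading, fall back instead of indexing directly:
    value = item.get('title', '')  # Item supports dict-style .get()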

Iterate over all links/sub-links with Scrapy run from script

python,windows,python-2.7,web-scraping,scrapy
I want to run a Scrapy Spider from my script, but it works only for 1 request. I cannot execute the procedure self.parse_product from scrapy.http.Request(product_url, callback=self.parse_product). I guess it's due to the command crawler.signals.connect(callback, signal=signals.spider_closed). Please advise how to correctly go over all links and sub-links. The whole script is shown below. import...

XPath: Find first occurrence in children and siblings

xpath,scrapy
So I have some HTML that looks like this: <tr class="a"> <td>...</td> <td>...</td> </tr> <tr> <td>....</td> <td class="b">A</td> </tr> <tr>....</tr> <tr class="a"> <td class="b">B</td> <td>....</td> </tr> <tr> <td class="b">Not this</td> <td>....</td> </tr> I'm basically wanting to find the first instance of td class b following a tr with a class...

Make Scrapy follow links and collect data

python,web-scraping,web-crawler,scrapy
I am trying to write a program in Scrapy to open links and collect data from this tag: <p class="attrgroup"></p>. I've managed to make Scrapy collect all the links from the given URL but not to follow them. Any help is very appreciated....

Scrapy returning a null output when extracting an element from a table using xpath

python,xpath,web-scraping,web-crawler,scrapy
I have been trying to scrape this website that has details of oil wells in Colorado https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=12307555&type=WELL Scrapy scrapes the website, and returns the URL when I scrape it, but when I need to extract an element inside a table using its XPath (County of the oil well), all I...

How to get scrapy results in order?

python,web-scraping,scrapy,scrapy-spider
Help me with scrapy. My code produces output, but it doesn't print it the correct way. I also tried nesting another for loop, but that doesn't give the correct result either. If you find something missing in there, please tell me. Code: import scrapy class YelpScrapy(scrapy.Spider): name = 'yelp' start_urls...

How to use scrapy to crawl a website which hides the url as href="javascript:;" in the next button

javascript,python,pagination,web-crawler,scrapy
I am learning python and scrapy lately. I googled and searched around for a few days, but I couldn't find any instructions on how to crawl multiple pages on a website with hidden urls - <a href="javascript:;">. Basically each page contains 20 listings; each time you click on...

How to crawl classified websites [closed]

web-crawler,scrapy,scrapy-spider
I am trying to write a crawler with Scrapy to crawl a classified-type (target) site and fetch information from the links on the target site. The tutorial on Scrapy only helps me get the links from the target URL but not the second layer of data gathering that I seek....

Scrapy gathers data, but does not save it into the item

python,scrapy
I've built a spider that gets the stock data for a given stock from as many pages that the stock has (this can be 1 page of stock data, or 20 from Yahoo! Finance). It scrapes all the pages well, and gathers all of the data as it should. However,...

'NoneType' object has no attribute '_app_data' in scrapy\twisted\openssl

python,openssl,scrapy,twisted,pyopenssl
During the scraping process using scrapy, an error appears in my logs from time to time. It doesn't seem to come from anywhere in my code, and looks like it is something inside twisted/openssl. Any ideas what caused this and how to get rid of it? Stacktrace here: [Launcher,27487/stderr] Error during info_callback...

ImportError: cannot import name unwrap

python,scrapy,importerror
I have installed scrapy with pip install scrapy. But in python shell I am getting an ImportError: >>> from scrapy.spider import Spider Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/usr/local/lib/python2.7/dist-packages/scrapy/__init__.py", line 56, in <module> from scrapy.spider import Spider File "/usr/local/lib/python2.7/dist-packages/scrapy/spider.py", line 7, in <module> from...

Scrapy parse list of urls, open one by one and parse additional data

python,parsing,web-scraping,scrapy
I am trying to parse a site, an e-store. I parse a page with products, which are loaded with ajax, get the urls of these products, and then parse additional info about each product following these parsed urls. My script gets the list of the first 4 items on the page, their urls,...

Scrapy not giving individual results of all the reviews of a phone?

python,xpath,web-scraping,scrapy,scrapy-spider
This code is giving me results, but the output is not as desired. What is wrong with my xpath? And how do I iterate the rule by +10? I always have problems with these two. import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import Selector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from urlparse...

Scrapy: catch responses with specific HTTP server codes

python,web-scraping,scrapy,scrapy-spider
We have a pretty much standard Scrapy project (Scrapy 0.24). I'd like to catch specific HTTP response codes, such as 200, 500, 502, 503, 504, etc. Something like this: class Spider(...): def parse(...): processes HTTP 200 def parse_500(...): processes HTTP 500 errors def parse_502(...): processes HTTP 502 errors ... How...
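
Scrapy only passes 2xx responses to callbacks by default; listing the other codes in handle_httpstatus_list delivers them too, and the callback can then dispatch on response.status. A sketch:

    class MySpider(scrapy.Spider):
        name = 'example'
        handle_httpstatus_list = [500, 502, 503, 504]

        def parse(self, response):
            # Route each error status to its own handler
            handlers = {500: self.parse_500, 502: self.parse_502}
            handler = handlers.get(response.status)
            if handler is not None:
                return handler(response)
            # ... normal HTTP 200 processing

        def parse_500(self, response):
            self.log('server error at %s' % response.url)

        def parse_502(self, response):
            self.log('bad gateway at %s' % response.url)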

Scraping multi-level data using Scrapy, optimum way

python,selenium,data-structures,web-crawler,scrapy
I have been wondering what would be the best way to scrape multi-level data using scrapy. I will describe the situation in four stages: the current architecture that I am following to scrape this data, the basic code structure, the difficulties, and why I think there has to be...

Web scraping error: exceptions.MemoryError

python,web-scraping,scrapy,scrapy-spider
I'm trying to download data from gsmarena. Sample code to download the HTC One ME spec from the site "http://www.gsmarena.com/htc_one_me-7275.php" is mentioned below. The data on the website is classified in the form of tables and table rows. The data is of the format: table header > td[@class='ttl'] >...

XPath Query Finds Elements Not Inside Selector

python,xpath,scrapy
My XPath query is finding elements that aren't even inside it. For example (from my code below) business_div contains the HTML: <div class="foo"> <div> <table> ... <a class="bar" href="A">link</a> </table> </div> </div> When I run the XPath query business_div.xpath("//a[@class='bar']/@href").extract() it returns: ["A", "B", "D"] # should just be ["A"] How...
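
This is the classic relative-XPath pitfall: inside a selector, an expression starting with // still searches the whole document. Prefixing it with a dot anchors it to the selection; a sketch:

    # ".//" searches only within business_div instead of the whole page
    business_div.xpath(".//a[@class='bar']/@href").extract()  # -> ["A"]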

Scrapy (Python): Iterating over 'next' page without multiple functions

python,scrapy
I am using Scrapy to grab stock data from Yahoo! Finance. Sometimes, I need to loop over several pages, 19 in this example, in order to get all of the stock data. Previously (when I knew there would only be two pages), I would use one function for each...

How to exclude a particular html tag(without any id) from several tags while using scrapy?

python,html,web-scraping,scrapy,scrapy-spider
<div class="region size2of3"> <h2>Mumbai</h2> <strong>Fort</strong> <div>Elphinstone building, Horniman Circle,</div> <div>Veer Nariman Road, Fort</div> <div>Mumbai 400001</div> <div>Timings: 08:00-00:30 hrs (Mon-Sun)</div> <div><br></div> </div> I want to exclude the "Timings: 08:00-00:30 hrs (Mon-Sun)" div tag while parsing. Here's my code: import scrapy from job.items import StarbucksItem class StarbucksSpider(scrapy.Spider): name =...
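
One option is to filter the unwanted div out in the XPath itself rather than after parsing; a sketch keyed to this particular markup:

    # All child divs of the region except the one whose text starts with "Timings"
    response.xpath('//div[@class="region size2of3"]'
                   '/div[not(starts-with(normalize-space(.), "Timings"))]'
                   '/text()').extract()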

scrapy xpath not returning desired results. Any idea?

html,xpath,scrapy
Please look at this page http://164.100.47.132/LssNew/psearch/QResult16.aspx?qref=15845. As you would have guessed, I am trying to scrape all the fields on this page. All fields are yield-ed properly except the Answer field. What I find odd is that the page structure for the question and answer is almost the same (Table[1]...

How to use re() to extract data from javascript variable using scrapy?

javascript,python,regex,web-scraping,scrapy
My items.py file goes like this: from scrapy.item import Item, Field class SpiItem(Item): title = Field() lat = Field() lng = Field() add = Field() and the spider is: import scrapy import re from spi.items import SpiItem class HdfcSpider(scrapy.Spider): name = "hdfc" allowed_domains = ["hdfc.com"] start_urls = ["http://hdfc.com/branch-locator"] def parse(self,response):...

Scrapy redirects to homepage for some urls

scrapy,scrapy-shell
I am new to Scrapy framework & currently using it to extract articles from multiple 'Health & Wellness' websites. For some of the requests, scrapy is redirecting to homepage(this behavior is not observed in browser). Below is an example: Command: scrapy shell "http://www.bornfitness.com/blog/page/10/" Result: 2015-06-19 21:32:15+0530 [scrapy] DEBUG: Web service...

Scrapy CrawlSpider not following links

python,web-scraping,web-crawler,scrapy,scrapy-spider
I am trying to crawl some attributes from all (#123) detail pages given on this category page - http://stinkybklyn.com/shop/cheese/ - but scrapy is not able to follow the link pattern I set. I checked the scrapy documentation and some tutorials as well, but no luck! Below is the code: import scrapy from scrapy.contrib.linkextractors...

Crawl spider not crawling ~ Rule Issue

python,web-scraping,scrapy,scrapy-spider
I am having an issue with a spider that I am programming. I am trying to recursively scrape the courses off my university's website but I am having great trouble with Rule and LinkExtractor. # -*- coding: utf-8 -*- import scrapy from scrapy.spider import Spider from scrapy.contrib.spiders import CrawlSpider, Rule...

make scrapy request depending on outcome of prior request?

python,scrapy
I am scraping data where for each user, I don't know if there will be data for the entire time period. Therefore I would like to first call the API on a large chunk of time and then if there are results, call the API for smaller increments of time...

For scrapy/selenium is there a way to go back to a previous page?

python,selenium,scrapy
I essentially have a start_url that has my javascript search form and button, hence the need of selenium. I use selenium to select the appropriate items in my select box objects, and click the search button. The following page, I do some scrapy magic. However, now I want to go...
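
Selenium's WebDriver keeps ordinary browser history, so stepping back is built in; a sketch:

    driver.back()      # return to the previous page (e.g. the search form)
    # driver.forward() goes the other way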

Scrapy collect data from first element and post's title

python,web-scraping,web-crawler,scrapy,scrapy-spider
I need Scrapy to collect data from this tag and retrieve all three parts in one piece. The output would be something like: Tonka double shock boys bike - $10 (Denver). <span class="postingtitletext">Tonka double shock boys bike - <span class="price">$10</span><small> (Denver)</small></span> Second is to collect data from first span tag....

Trying to parse JSON files using Scrapy

python,json,web-scraping,scrapy
I'm trying to parse files much like this one, but for a lot of longitudes and latitudes. The crawler loops through all of the webpages, but doesn't output anything. Here is my code: import scrapy import json from tutorial.items import DmozItem from scrapy.http import Request from scrapy.contrib.spiders import CrawlSpider, Rule...

Cannot download image with relative URL Python Scrapy

python,web-crawler,scrapy,scrapy-spider
I'm using Scrapy to download images from http://www.vesselfinder.com/vessels However, I can only get the relative url of images like this http://www.vesselfinder.com/vessels/ship-photo/0-227349190-7c01e2b3a7a5078ea94fff9a0f862f8a/0 All of the images are named 0.jpg, but if I try to use that absolute url, I cannot get access to the image. My code: items.py import scrapy class VesselItem(scrapy.Item):...

How does scrapy write to the log while running a spider?

python,scrapy,scrapyd,portia
While running a scrapy spider, I see that the log message has "DEBUG:" entries: 1. DEBUG: Crawled (200) (GET http://www.example.com) (referer: None) 2. DEBUG: Scraped from (200 http://www.example.com) I want to know: 1. What do those "Crawled" and "Scraped from" mean? 2. Where do both of the above...

Having trouble selecting some specific xpath… (html table, scrapy, xpath)

html,xpath,scrapy
I'm trying to scrape data (using scrapy) from tables that can be found here: http://www.bettingtools.co.uk/tipster-table/tipsters My spider functions when I parse response within the following xpath: //*[@id="imagetable"]/tbody/tr Every table on the page shares that id, so I'm basically grabbing all the table data. However, I only want the table data...

Why is xpath selecting only the last <li> inside the <ul>?

python,web-scraping,scrapy,scrapy-spider
I'm trying to scrape this site: http://www.kaymu.com.ng/. The part of the HTML I'm scraping is like this: <ul id="navigation-menu"> <li> some content </li> <li> some content </li> ... <li> some content </li> </ul> This is my spider: class KaymuSpider(Spider): name = "kaymu" allowed_domains = ["kaymu.com.ng"] start_urls = [...

Check for xpath duplicates while running a for loop in scrapy for python

python,xpath,scrapy
I'm scraping xml data through scrapy and at the same time I want to check on duplicates. For this I'm using the following code: arr = [] for tr in response.xpath('/html/body/table[1]'): if tr.xpath('tr/td/text()').extract() not in arr: arr.append(tr.xpath('tr/td/text()').extract()) print arr This yields the following output (demo data): [[u'test1', u'12', u'test2', u'12',...
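
A set of tuples does the same membership check in O(1) and avoids calling extract() twice per row; a sketch that keeps the original loop shape:

    seen = set()
    rows = []
    for tr in response.xpath('/html/body/table[1]'):
        row = tuple(tr.xpath('tr/td/text()').extract())  # tuples are hashable
        if row not in seen:
            seen.add(row)
            rows.append(row)   # rows preserves first-seen order
    print rows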

The scrapy LinkExtractor(allow=(url)) gets the wrong crawled page; the regex doesn't work

python,web-crawler,scrapy
I want to crawl the page http://www.douban.com/tag/%E7%88%B1%E6%83%85/movie . And some part of my spider code is: class MovieSpider(CrawlSpider): name = "doubanmovie" allowed_domains = ["douban.com"] start_urls = ["http://www.douban.com/tag/%E7%88%B1%E6%83%85/movie"] rules = ( Rule(LinkExtractor(allow=(r'http://www.douban.com/tag/%E7%88%B1%E6%83%85/movie\?start=\d{2}'))), Rule(LinkExtractor(allow=(r"http://movie.douban.com/subject/\d+")), callback = "parse_item") ) def start_requests(self): yield...

Scrapy command in shell script not executing when called from java

java,bash,shell,scrapy,scrapy-spider
I have the below shell script which invokes scrapy: #!/bin/bash export PATH=usr/local/bin/scrapy:$PATH scrapy crawl flipkart -a key="$1" -o "$2"flipkart.xml scrapy crawl myntra -a key="$1" -o "$2"myntra.xml scrapy crawl jabong -a key="$1" -o "$2"jabong.xml echo $PATH In the java program which calls this script file, the error stream says that scrapy:...

Python: Generate a date time string that I can use for MySQL

python,mysql,time,scrapy
How to do this: from time import time import datetime current_time = time.strftime(r"%d.%m.%Y %H:%M:%S", time.localtime()) l.add_value('time', current_time) this will end up in an error: print time.strftime(r"%d.%m.%Y %H:%M:%S", time.localtime()) exceptions.AttributeError: 'builtin_function_or_method' object has no attribute 'strftime' I found plenty of information - but it seems as if I either need to...
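
The traceback is caused by "from time import time", which binds the time() function and shadows the module, so time.strftime no longer exists. Importing the module itself fixes it; a sketch, including a format MySQL's DATETIME column accepts directly:

    import time
    current_time = time.strftime('%d.%m.%Y %H:%M:%S', time.localtime())

    # or, in the 'YYYY-MM-DD HH:MM:SS' form MySQL DATETIME expects:
    import datetime
    current_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')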

How to restrict the area in which LinkExtractor is being applied?

scrapy
I have a scraper with the following rules: rules = ( Rule(LinkExtractor(allow=('\S+list=\S+'))), Rule(LinkExtractor(allow=('\S+list=\S+'))), Rule(LinkExtractor(allow=('\S+view=1\S+')), callback='parse_archive'), ) As you can see, the 1st and 2nd rules are exactly the same. What I would like to do is tell scrapy to extract the links I am interested in by referring to particular places...
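
restrict_xpaths confines each LinkExtractor to a region of the page, which is the usual way to handle "same allow pattern, different place"; a sketch with made-up region XPaths:

    rules = (
        Rule(LinkExtractor(allow=('\S+list=\S+',),
                           restrict_xpaths='//div[@id="listing"]')),
        Rule(LinkExtractor(allow=('\S+view=1\S+',),
                           restrict_xpaths='//div[@id="archive"]'),
             callback='parse_archive'),
    )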