
How to crawl classified websites [closed]


Tag: web-crawler,scrapy,scrapy-spider

I am trying to write a crawler with Scrapy to crawl a classified-type (target) site and fetch information from the links on the target site. The Scrapy tutorial only helps me get the links from the target URL, not the second layer of data gathering that I seek. Any leads?

So for instance, target site would be:

start_url = "http://newyork.craigslist.org/search/cta"

and for all the links on the target site I want to go to each listing and get the price, seller, location, and phone or email.

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin

class CompItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    location = scrapy.Field()

class criticspider(CrawlSpider):
    name = "craig"
    allowed_domains = ["newyork.craigslist.org"]
    start_urls = ["http://newyork.craigslist.org/search/cta"]

    def parse(self, response):
        sites = response.xpath('//div[@class="content"]')
        items = []

        for site in sites:
            item = CompItem()
            item['name'] = site.xpath('.//p[@class="row"]/span[@class="txt"]/span[@class="pl"]/a/text()').extract()
            item['price'] = site.xpath('.//p[@class="row"]/span[@class="txt"]/span[@class="l2"]/span[@class="price"]/text()').extract()
            item['location'] = site.xpath('.//p[@class="row"]/span[@class="txt"]/span[@class="l2"]/span[@class="pnr"]/small/text()').extract()
            items.append(item)
        return items


Scrapy xpath construction for tables of data - yielding empty brackets

I am attempting to build out xpath constructs for data items I would like to extract from several hundred pages of a site that are all formatted the same. An example site is https://weedmaps.com/dispensaries/cannabicare. As can be seen, the site has headings, and within those headings are rows of item...

Stuck scraping a specific table with scrapy

So the table I'm trying to scrape can be found here: http://www.betdistrict.com/tipsters I'm after the table titled 'June Stats'. Here's my spider: from __future__ import division from decimal import * import scrapy import urlparse from ttscrape.items import TtscrapeItem class BetdistrictSpider(scrapy.Spider): name = "betdistrict" allowed_domains = ["betdistrict.com"] start_urls = ["http://www.betdistrict.com/tipsters"] def...

Having trouble selecting some specific xpath… (html table, scrapy, xpath)

I'm trying to scrape data (using scrapy) from tables that can be found here: http://www.bettingtools.co.uk/tipster-table/tipsters My spider functions when I parse response within the following xpath: //*[@id="imagetable"]/tbody/tr Every table on the page shares that id, so I'm basically grabbing all the table data. However, I only want the table data...

scrapy xpath not returning desired results. Any idea?

Please look at this page As you would have guessed, I am trying to scrape all the fields on this page. All fields are yield-ed properly except the Answer field. What I find odd is that the page structure for the question and answer is almost the same (Table[1]...

Distinguishing between HTML and non-HTML pages in Scrapy

I am building a Spider in Scrapy that follows all the links it can find, and sends the url to a pipeline. At the moment, this is my code: from scrapy import Spider from scrapy.http import Request from scrapy.http import TextResponse from scrapy.selector import Selector from scrapyTest.items import TestItem import...

SgmlLinkExtractor in scrapy

I need some enlightenment about SgmlLinkExtractor in scrapy. For the link example.com/YYYY/MM/DD/title I would write: Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\d{2}/\w+']), callback='parse_example')] For the link example.com/news/economic/title, should I write r'\news\category\w+' or r'\news\w+/\w+'? (the category changes but the url always contains news) For the link example.com/article/title, should I write r'\article\w+'? (the url always contains article)...
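Worth noting while on this question: the allow patterns are ordinary regular expressions matched against the URL, so path segments are separated by forward slashes, not backslashes. A quick stdlib check of the three patterns (the URLs are made-up examples):

```python
import re

# allow patterns as they would be passed to SgmlLinkExtractor(allow=[...])
date_pattern = r'\d{4}/\d{2}/\d{2}/\w+'   # example.com/YYYY/MM/DD/title
news_pattern = r'/news/\w+/\w+'           # example.com/news/<category>/title
article_pattern = r'/article/\w+'         # example.com/article/title

print(bool(re.search(date_pattern, 'http://example.com/2015/06/19/some_title')))     # True
print(bool(re.search(news_pattern, 'http://example.com/news/economic/some_title')))  # True
print(bool(re.search(article_pattern, 'http://example.com/article/some_title')))     # True
```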

Scrapy running from python script processes only start url

I have written a Scrapy CrawlSpider. class SiteCrawlerSpider(CrawlSpider): name = 'site_crawler' def __init__(self, start_url, **kw): super(SiteCrawlerSpider, self).__init__(**kw) self.rules = ( Rule(LinkExtractor(allow=()), callback='parse_start_url', follow=True), ) self.start_urls = [start_url] self.allowed_domains = tldextract.extract(start_url).registered_domain def parse_start_url(self, response): external_links = LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response) for link in external_links: i =...

How does scrapy write to the log while running a spider?

While running a scrapy spider, I see log messages tagged "DEBUG:" such as: 1. DEBUG: Crawled (200) (GET http://www.example.com) (referer: None) 2. DEBUG: Scraped from (200 http://www.example.com) I want to know: 1. what do those "Crawled" and "Scraped from" mean? 2. From where those above both...

XPath: Find first occurrence in children and siblings

So I have some HTML that looks like thus: <tr class="a"> <td>...</td> <td>...</td> </tr> <tr> <td>....</td> <td class="b">A</td> </tr> <tr>....</tr> <tr class="a"> <td class="b">B</td> <td>....</td> </tr> <tr> <td class="b">Not this</td> <td>....</td> </tr> I'm basically wanting to find the first instance of td class b following a tr with a class...

Iterate over all links/sub-links with Scrapy run from script

I want to run a Scrapy Spider from my script, but it works only for 1 request. I cannot execute the procedure self.parse_product from scrapy.http.Request(product_url, callback=self.parse_product). I guess it's due to the command crawler.signals.connect(callback, signal=signals.spider_closed). Please advise how to correctly go over all links and sub-links. The whole script is shown below. import...

Scrapy not entering parse method

I don't understand why this code is not entering the parse method. It is pretty similar to the basic spider examples from the doc: http://doc.scrapy.org/en/latest/topics/spiders.html And I'm pretty sure this worked earlier in the day... Not sure if I modified something or not.. from selenium.webdriver.support.wait import WebDriverWait from selenium.webdriver.common.by import...

Extracting links with scrapy that have a specific css class

Conceptually simple question/idea. Using Scrapy, how do I use a LinkExtractor that only follows links with a given CSS class? Seems trivial and like it should already be built in, but I don't see it? Is it? It looks like I can use an XPath, but I'd prefer using...

T_STRING error in my php code [duplicate]

This question already has an answer here: PHP Parse/Syntax Errors; and How to solve them? 10 answers I have this PHP that is supposed to crawl End Clothing website for product IDs When I run it its gives me this error Parse error: syntax error, unexpected 'i' (T_STRING), expecting...

How to read xml directly from URLs with scrapy/python

In Scrapy you will have to define start_urls. But how can I crawl from other urls as well? Up to now I have a login script which logs into a webpage. After logging in, I want to extract xml from different urls. import scrapy class LoginSpider(scrapy.Spider): name = 'example' start_urls...

Check for xpath duplicates while running a for loop in scrapy for python

I'm scraping xml data through scrapy and at the same time I want to check on duplicates. For this I'm using the following code: arr = [] for tr in response.xpath('/html/body/table[1]'): if tr.xpath('tr/td/text()').extract() not in arr: arr.append(tr.xpath('tr/td/text()').extract()) print arr This yields the following output (demo data): [[u'test1', u'12', u'test2', u'12',...
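A common way to handle that duplicate check is to keep a set of already-seen rows rather than a list, converting each row to a tuple since lists are unhashable. A sketch with dummy row data standing in for what tr.xpath('tr/td/text()').extract() would return:

```python
seen = set()
unique_rows = []

# stand-ins for the per-row lists that .extract() would return
rows = [
    [u'test1', u'12', u'test2', u'12'],
    [u'test1', u'12', u'test2', u'12'],   # duplicate of the first row
    [u'test3', u'7'],
]

for row in rows:
    key = tuple(row)          # tuples are hashable, lists are not
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

print(unique_rows)  # the duplicate second row is dropped
```

Set membership is O(1), so this stays fast even over thousands of rows, unlike `not in arr` on a growing list.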

Why scrapy not storing data into mongodb?

My main File: import scrapy from scrapy.exceptions import CloseSpider from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.http import Request class Product(scrapy.Item): brand = scrapy.Field() title = scrapy.Field() link = scrapy.Field() name = scrapy.Field() title = scrapy.Field() date = scrapy.Field() heading = scrapy.Field() data = scrapy.Field() Model_name =...

scrapy crawling at depth not working

I am writing scrapy code to crawl the first page and one additional depth of a given webpage. Somehow my crawler doesn't enter the additional depth; it just crawls the given starting urls and ends its operation. I added a filter_links callback function, but even that's not getting called, so clearly the rules are being ignored. What...

Scrapy not giving individual results of all the reviews of a phone?

This code is giving me results but the output is not as desired. What is wrong with my xpath? How do I iterate the rule by +10? I always have problems with these two. import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import Selector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from urlparse...

Workload balancing between akka actors

I have 2 akka actors used for crawling links, i.e. find all links in page X, then find all links in all pages linked from X, etc... I want them to progress more or less at the same pace, but more often than not one of them becomes starved and...

Remove first tag html using python & scrapy

I have a HTML: <div class="abc"> <div class="xyz"> <div class="needremove"></div> <p>text</p> <p>text</p> <p>text</p> <p>text</p> </div> </div> I used: response.xpath('//div[contains(@class,"abc")]/div[contains(@class,"xyz")]').extract() Result: u'['<div class="xyz"> <div class="needremove"></div> <p>text</p> <p>text</p> <p>text</p> <p>text</p> </div>'] I want remove...
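A sketch of one way to drop such a child element, using stdlib ElementTree on a simplified copy of the markup (in Scrapy itself it is often easier to select only the wanted children instead, e.g. response.xpath('//div[contains(@class,"xyz")]/p')):

```python
import xml.etree.ElementTree as ET

html = """
<div class="abc">
  <div class="xyz">
    <div class="needremove"></div>
    <p>text</p>
    <p>text</p>
  </div>
</div>
"""

root = ET.fromstring(html)
xyz = root.find('.//div[@class="xyz"]')
bad = xyz.find('./div[@class="needremove"]')
xyz.remove(bad)  # drop the unwanted child, keep everything else

print(ET.tostring(xyz, encoding='unicode'))  # "needremove" div is gone
```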

Web Scraper for dynamic forms in python

I am trying to fill the form of this website http://www.marutisuzuki.com/Maruti-Price.aspx. It consists of three drop down lists. One is Model of the car, Second is the state and third is city. The first two are static and the third, city is generated dynamically depending upon the value of state,...

Xpath text() wrong output

This is my first scrapy program! I'm writing a program using python/scrapy and I've tested my Xpath in FirePath and it works perfectly, but it is not displaying properly in the console (still in the early testing phase) What I'm doing is attempting to scrape a page of amazon reviews....

Scraping a huge site with scrapy never completes

I'm scraping a site which has millions of pages and about hundred of thousands of items. I'm using the CrawlSpider with LxmlLinkExtractors to define the exact path between different type of pages. Everything works fine and my scraper doesn't follow unwanted links. However, the whole site never seems to be...

scraping url and title from nested anchor tag

This is my first scraper using scrapy. I am trying to scrape video urls and titles from the https://www.google.co.in/trends/hotvideos#hvsm=0 site. import scrapy from scrapy.item import Item, Field from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector class CraigslistItem(Item): title = Field() link = Field() class DmozSpider(scrapy.Spider): name = "google" allowed_domains = ["google.co.in"] start_urls...

Scrapy Limit Requests For Testing

I've been searching the scrapy documentation for a way to limit the number of requests my spiders are allowed to make. During development I don't want to sit here and wait for my spiders to finish an entire crawl, even though the crawls are pretty focused they can still take...
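Scrapy's CloseSpider extension settings are the usual answer to this; the values below are arbitrary development-time examples:

```python
# settings.py -- stop the spider early while developing
CLOSESPIDER_PAGECOUNT = 50    # close after 50 responses have been crawled
CLOSESPIDER_ITEMCOUNT = 10    # ...or after 10 items have been scraped
CLOSESPIDER_TIMEOUT = 120     # ...or after 120 seconds
```

Any one of the three conditions closes the spider gracefully, so they can also be combined.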

Scrapy Memory Error (too many requests) Python 2.7

I've been running a crawler in Scrapy to crawl a large site I'd rather not mention. I use the tutorial spider as a template, then I created a series of starting requests and let it crawl from there, using something like this: def start_requests(self): f = open('zipcodes.csv', 'r') lines =...

Scrapy CrawlSpider not following links

I am trying to crawl some attributes from all(#123) detail pages given on this category page - http://stinkybklyn.com/shop/cheese/ but scrapy is not able to follow link pattern I set, I checked on scrapy documentation and some tutorials as well but No Luck! Below is the code: import scrapy from scrapy.contrib.linkextractors...

Heritrix not finding CSS files in conditional comment blocks

The Problem/evidence Heritrix is not detecting the presence of files in conditional comments that open & close in one string, such as this: <!--[if (gt IE 8)|!(IE)]><!--> <link rel="stylesheet" href="/css/mod.css" /> <!--<![endif]--> However standard conditional blocks like this work fine: <!--[if lte IE 9]> <script src="/js/ltei9.js"></script> <![endif]--> I've identified the...

Scrapy: catch responses with specific HTTP server codes

We have a pretty much standard Scrapy project (Scrapy 0.24). I'd like to catch specific HTTP response codes, such as 200, 500, 502, 503, 504 etc. Something like that: class Spider(...): def parse(...): processes HTTP 200 def parse_500(...): processes HTTP 500 errors def parse_502(...): processes HTTP 502 errors ... How...

How to reset standard dupefilter in scrapy

For some reasons I would like to reset the list of seen urls that scrapy maintains internally at some point of my spider code. I know that by default scrapy uses the RFPDupeFilter class and that there is a fingerprint set. How can this set be cleared within spider code?...

Scrapy writing XPath expression for unknown depth

I have an html file which is like: <div id='author'> <div> <div> ... <a> John Doe </a> I do not know how many div's would be under the author div. It may have different depth for different pages. So what would be the XPath expression for this kind of xml?...
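The descendant-or-self axis answers the unknown-depth problem: //div[@id="author"]//a matches the <a> however many <div>s sit in between. A stdlib sketch of the same idea with ElementTree (which supports only a subset of XPath, so the descent is done in two steps):

```python
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div id="author">
    <div><div><div>
      <a>John Doe</a>
    </div></div></div>
  </div>
</body></html>
"""

root = ET.fromstring(html)
author_div = root.find('.//div[@id="author"]')  # locate the author block...
link = author_div.find('.//a')                  # ...then descend to any depth
print(link.text.strip())  # John Doe
```

In Scrapy this collapses to response.xpath('//div[@id="author"]//a/text()').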

Crawl spider not crawling ~ Rule Issue

I am having an issue with a spider that I am programming. I am trying to recursively scrape the courses off my university's website but I am having great trouble with Rule and LinkExtractor. # -*- coding: utf-8 -*- import scrapy from scrapy.spider import Spider from scrapy.contrib.spiders import CrawlSpider, Rule...

AttributeError: 'module' object has no attribute 'Spider'

I just started to learn scrapy. So I followed the scrapy documentation. I just written the first spider mentioned in that site. import scrapy class DmozSpider(scrapy.Spider): name = "dmoz" allowed_domains = ["dmoz.org"] start_urls = [ "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/", "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/" ] def parse(self, response): filename = response.url.split("/")[-2] with open(filename, 'wb') as f: f.write(response.body)...

Scrapy crawler ignores `DOWNLOADER_MIDDLEWARES` when run as a script

I'd like to acquire data, using Scrapy, from a few different sites and perform some analysis on that data. Since the both the crawlers and the code to analyze the data relate to the same project, I'd like to store everything in the same Git repository. I created a minimal...

How can I initialize a Field() to contain a nested python dict?

I have a Field() in my items.py called: scores = Field() I want multiple scrapers to append a value to a nested dict inside scores. For example, one of my scrapers: item['scores']['baseball_score'] = '92' And another scraper would: item['scores']['basket_score'] = '21' So that when I retrieve scores: > item['scores'] {...

xpath: how to select items between item A and item B

I have an HTML page with this structure: <big><b>Staff in:</b></big> <br> <a href='...'>Movie 1</a> <br> <a href='...'>Movie 2</a> <br> <a href='...'>Movie 3</a> <br> <br> <big><b>Cast in:</b></big> <br> <a href='...'>Movie 4</a> How do I select Movies 1, 2, and 3 using Xpath? I wrote this query '//big/b[text()="Staff in:"]/following::a' but it returns...
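One XPath that expresses "anchors whose nearest preceding <big> heading is 'Staff in:'" is //a[preceding-sibling::big[1][b="Staff in:"]] (untested against the real page). The same nearest-preceding-heading logic, sketched procedurally with stdlib ElementTree on a simplified, well-formed copy of the markup (the unclosed <br> tags are omitted because ElementTree requires XML):

```python
import xml.etree.ElementTree as ET

html = """
<div>
  <big><b>Staff in:</b></big>
  <a href='#'>Movie 1</a>
  <a href='#'>Movie 2</a>
  <a href='#'>Movie 3</a>
  <big><b>Cast in:</b></big>
  <a href='#'>Movie 4</a>
</div>
"""

root = ET.fromstring(html)
staff_movies = []
current_heading = None
for child in root:                       # walk the siblings in document order
    if child.tag == 'big':
        current_heading = child.find('b').text
    elif child.tag == 'a' and current_heading == 'Staff in:':
        staff_movies.append(child.text)

print(staff_movies)  # ['Movie 1', 'Movie 2', 'Movie 3']
```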

make scrapy request depending on outcome of prior request?

I am scraping data where for each user, I don't know if there will be data for the entire time period. Therefore I would like to first call the API on a large chunk of time and then if there are results, call the API for smaller increments of time...

Apache Nutch REST api

I'm trying to launch a crawl via the rest api. A crawl starts with injecting urls. Using a chrome developer tool "Advanced Rest Client" I'm trying to build this POST payload up but the response I get is a 400 Bad Request. POST - http://localhost:8081/job/create Payload { "crawl-id":"crawl-01", "type":"INJECT", "config-id":"default",...

How to iterate over many websites and parse text using web crawler

I am trying to parse text and run a sentiment analysis over the text from multiple websites. I have successfully been able to scrape just one website at a time and generate a sentiment score using the TextBlob library, but I am trying to replicate this over many websites, any...

Scrapy parse list of urls, open one by one and parse additional data

I am trying to parse a site, an e-store. I parse a page with products, which are loaded with ajax, get the urls of these products, and then parse additional info of each product following these parsed urls. My script gets the list of the first 4 items on the page, their urls,...

Why scrapy not giving all the results and the rules part is also not working?

This script is only providing me with the first result, or the .extract()[0]; if I change 0 to 1 then the next item. Why is it not iterating over the whole xpath again? The rule part is also not working. I know the problem is in the response.xpath. How to deal with...

Python 3.3 TypeError: can't use a string pattern on a bytes-like object in re.findall()

I am trying to learn how to automatically fetch urls from a page. In the following code I am trying to get the title of the webpage: import urllib.request import re url = "http://www.google.com" regex = '<title>(,+?)</title>' pattern = re.compile(regex) with urllib.request.urlopen(url) as response: html = response.read() title = re.findall(pattern,...
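The error comes from mixing a str pattern with the bytes that response.read() returns under Python 3; decoding first, or switching to a bytes pattern, fixes it. A self-contained sketch with a canned page instead of a live fetch (note the group should probably be .+? rather than ,+?):

```python
import re

# stand-in for response.read(), which yields bytes in Python 3
html = b"<html><head><title>Google</title></head><body></body></html>"

# Option 1: decode the bytes, then use a str pattern
title = re.findall(r'<title>(.+?)</title>', html.decode('utf-8'))
print(title)    # ['Google']

# Option 2: keep the bytes and use a bytes pattern
title_b = re.findall(rb'<title>(.+?)</title>', html)
print(title_b)  # [b'Google']
```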

Scrapy extracting from Link

I am trying to extract information in certain links, but I don't get to go to the links, I extract from the start_url and I am not sure why. Here is my code: import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from tutorial.items import DmozItem from scrapy.selector...

Web scraping error: exceptions.MemoryError

I'm trying to download data from gsmarena. A sample code to download HTC one me spec is from the following site: "http://www.gsmarena.com/htc_one_me-7275.php" as mentioned below. The data on the website is classified in form of tables and table rows. The data is of the format: table header > td[@class='ttl'] >...

Scrapy redirects to homepage for some urls

I am new to Scrapy framework & currently using it to extract articles from multiple 'Health & Wellness' websites. For some of the requests, scrapy is redirecting to homepage(this behavior is not observed in browser). Below is an example: Command: scrapy shell "http://www.bornfitness.com/blog/page/10/" Result: 2015-06-19 21:32:15+0530 [scrapy] DEBUG: Web service...

Check if element exists in fetched URL [closed]

I have a page with, say, 30 URLS, I need to click on each and check if an element exists. Currently, this means: $('area').each(function(){ $(this).attr('target','_blank'); var _href = $(this).attr("href"); var appID = (window.location.href).split('?')[1]; $(this).attr("href", _href + '?' + appID); $(this).trigger('click'); }); Which opens 30 new tabs, and I manually go...