

SgmlLinkExtractor in scrapy

Question:

Tags: web-crawler, scrapy, rules, extractor

I need some enlightenment about SgmlLinkExtractor in Scrapy.

For the link example.com/YYYY/MM/DD/title I would write:

Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\d{2}/\w+']), callback='parse_example')

For the link example.com/news/economic/title, should I write:

r'\news\category\w+' or r'\news\w+/\w+'? (The category changes, but the URL always contains news.)

For the link example.com/article/title, should I write:

r'\article\w+'? (The URL always contains article.)


Answer:

It's not possible to answer "should I" questions unless you provide complete example strings: what you want a regular expression to match, and what you don't want it to match.

My guess is that your regexes won't work because you use \ instead of /.
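
For reference, here is a minimal sketch of what the three rules might look like with forward slashes, so the patterns actually match the example URLs. The spider name, start URL, and callback are hypothetical; the imports match the pre-1.0 Scrapy versions that still ship SgmlLinkExtractor:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'  # hypothetical spider name
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        # example.com/YYYY/MM/DD/title
        Rule(SgmlLinkExtractor(allow=[r'/\d{4}/\d{2}/\d{2}/\w+']),
             callback='parse_example'),
        # example.com/news/<category>/title -- the category varies,
        # so match any word segment after /news/
        Rule(SgmlLinkExtractor(allow=[r'/news/\w+/\w+']),
             callback='parse_example'),
        # example.com/article/title
        Rule(SgmlLinkExtractor(allow=[r'/article/\w+']),
             callback='parse_example'),
    )

    def parse_example(self, response):
        self.log('matched %s' % response.url)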

I recommend you go to regex101 and test whether your URLs actually match your regular expressions.

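If you'd rather test locally instead, a quick throwaway check with Python's re module does the same job. Note that the allow patterns are applied with an unanchored search, so a match anywhere in the URL is enough:

import re

# The question's example URLs, with concrete placeholder values.
urls = [
    'http://example.com/2015/06/19/title',
    'http://example.com/news/economic/title',
    'http://example.com/article/title',
]

# The corrected patterns, using / rather than \.
patterns = [r'/\d{4}/\d{2}/\d{2}/\w+', r'/news/\w+/\w+', r'/article/\w+']

for url in urls:
    for pattern in patterns:
        if re.search(pattern, url):
            print('%s matches %s' % (url, pattern))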


Related:


Web Scraper for dynamic forms in python


python,web-scraping,web-crawler,mechanize
I am trying to fill the form of this website http://www.marutisuzuki.com/Maruti-Price.aspx. It consists of three drop down lists. One is Model of the car, Second is the state and third is city. The first two are static and the third, city is generated dynamically depending upon the value of state,...

scrapy crawling at depth not working


python,scrapy,scrapy-spider
I am writing scrapy code to crawl the first page and one additional depth of a given webpage. Somehow my crawler doesn't enter the additional depth; it just crawls the given starting urls and ends its operation. I added a filter_links callback function but even that's not getting called, so clearly the rules are getting ignored. what...

XPath: Find first occurrence in children and siblings


xpath,scrapy
So I have some HTML that looks like thus: <tr class="a"> <td>...</td> <td>...</td> </tr> <tr> <td>....</td> <td class="b">A</td> </tr> <tr>....</tr> <tr class="a"> <td class="b">B</td> <td>....</td> </tr> <tr> <td class="b">Not this</td> <td>....</td> </tr> I'm basically wanting to find the first instance of td class b following a tr with a class...

Web scraping error: exceptions.MemoryError


python,web-scraping,scrapy,scrapy-spider
I'm trying to download data from gsmarena. A sample code to download HTC one me spec is from the following site: "http://www.gsmarena.com/htc_one_me-7275.php" as mentioned below. The data on the website is classified in form of tables and table rows. The data is of the format: table header > td[@class='ttl'] >...

Crawl spider not crawling ~ Rule Issue


python,web-scraping,scrapy,scrapy-spider
I am having an issue with a spider that I am programming. I am trying to recursively scrape the courses off my university's website but I am having great trouble with Rule and LinkExtractor. # -*- coding: utf-8 -*- import scrapy from scrapy.spider import Spider from scrapy.contrib.spiders import CrawlSpider, Rule...

Scrapy parse list of urls, open one by one and parse additional data


python,parsing,web-scraping,scrapy
I am trying to parse a site, an e-store. I parse a page with products, which are loaded with ajax, get the urls of these products, and then parse additional info for each product following these parsed urls. My script gets the list of the first 4 items on the page, their urls,...

make scrapy request depending on outcome of prior request?


python,scrapy
I am scraping data where for each user, I don't know if there will be data for the entire time period. Therefore I would like to first call the API on a large chunk of time and then if there are results, call the API for smaller increments of time...

Scrapy crawler ignores `DOWNLOADER_MIDDLEWARES` when run as a script


python,scrapy,scrapy-spider
I'd like to acquire data, using Scrapy, from a few different sites and perform some analysis on that data. Since both the crawlers and the code to analyze the data relate to the same project, I'd like to store everything in the same Git repository. I created a minimal...

How scrapy write in log while running spider?


python,scrapy,scrapyd,portia
While running a scrapy spider, I see that the log message has "DEBUG:" which has 1. DEBUG: Crawled (200) (GET http://www.example.com) (referer: None) 2. DEBUG: Scraped from (200 http://www.example.com) I want to know: 1. what do those "Crawled" and "Scraped from" mean? 2. From where those above both...

How can I initialize a Field() to contain a nested python dict?


python,web-scraping,scrapy
I have a Field() in my items.py called: scores = Field() I want multiple scrapers to append a value to a nested dict inside scores. For example, one of my scrapers: item['scores']['baseball_score'] = '92' And another scraper would: item['scores']['basket_score'] = '21' So that when I retrieve scores: > item['scores'] {...

Scrapy Limit Requests For Testing


python,python-2.7,web-scraping,scrapy,scrapy-spider
I've been searching the scrapy documentation for a way to limit the number of requests my spiders are allowed to make. During development I don't want to sit here and wait for my spiders to finish an entire crawl, even though the crawls are pretty focused they can still take...

Why scrapy not giving all the results and the rules part is also not working?


python,xpath,web-scraping,web-crawler,scrapy
This script is only providing me with the first result, or the .extract()[0]; if I change 0 to 1 then the next item. Why is it not iterating over the whole xpath again? The rule part is also not working. I know the problem is in the response.xpath. How to deal with...

Scrapy not giving individual results of all the reviews of a phone?


python,xpath,web-scraping,scrapy,scrapy-spider
This code is giving me results but the output is not as desired. What is wrong with my xpath? How do I iterate the rule by +10? I always have problems with these two. import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import Selector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from urlparse...

Workload balancing between akka actors


multithreading,scala,web-crawler,akka,actor
I have 2 akka actors used for crawling links, i.e. find all links in page X, then find all links in all pages linked from X, etc... I want them to progress more or less at the same pace, but more often than not one of them becomes starved and...

Iterate over all links/sub-links with Scrapy run from script


python,windows,python-2.7,web-scraping,scrapy
I want to run a Scrapy Spider from my script, but it works only for 1 request. I cannot execute the procedure self.parse_product from scrapy.http.Request(product_url, callback=self.parse_product). I guess it's due to the command crawler.signals.connect(callback, signal=signals.spider_closed). Please advise how to correctly go over all links and sub-links. The whole script is shown below. import...

Check if element exists in fetched URL [closed]


javascript,jquery,python,web-crawler,window.open
I have a page with, say, 30 URLS, I need to click on each and check if an element exists. Currently, this means: $('area').each(function(){ $(this).attr('target','_blank'); var _href = $(this).attr("href"); var appID = (window.location.href).split('?')[1]; $(this).attr("href", _href + '?' + appID); $(this).trigger('click'); }); Which opens 30 new tabs, and I manually go...

Xpath text() wrong output


python,xpath,web-scraping,scrapy
This is my first scrapy program! I'm writing a program using python/scrapy and I've tested my Xpath in FirePath and it works perfectly, but it is not displaying properly in the console (still in the early testing phase) What I'm doing is attempting to scrape a page of amazon reviews....

Scrapy extracting from Link


python,scrapy,scrapy-spider
I am trying to extract information in certain links, but I don't get to go to the links, I extract from the start_url and I am not sure why. Here is my code: import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from tutorial.items import DmozItem from scrapy.selector...

Heritrix not finding CSS files in conditional comment blocks


java,web-crawler,heritrix
The Problem/evidence Heritrix is not detecting the presence of files in conditional comments that open & close in one string, such as this: <!--[if (gt IE 8)|!(IE)]><!--> <link rel="stylesheet" href="/css/mod.css" /> <!--<![endif]--> However standard conditional blocks like this work fine: <!--[if lte IE 9]> <script src="/js/ltei9.js"></script> <![endif]--> I've identified the...

Remove first tag html using python & scrapy


python,xpath,scrapy,scrapy-spider
I have HTML like this: <div class="abc"> <div class="xyz"> <div class="needremove"></div> <p>text</p> <p>text</p> <p>text</p> <p>text</p> </div> </div> I used: response.xpath('//div[contains(@class,"abc")]/div[contains(@class,"xyz")]').extract() Result: u'['<div class="xyz"> <div class="needremove"></div> <p>text</p> <p>text</p> <p>text</p> <p>text</p> </div>'] I want to remove...

How to reset standard dupefilter in scrapy


scrapy
For some reasons I would like to reset the list of seen urls that scrapy maintains internally at some point of my spider code. I know that by default scrapy uses the RFPDupeFilter class and that there is a fingerprint set. How can this set be cleared within spider code?...

Scrapy writing XPath expression for unknown depth


html,xpath,web-scraping,scrapy
I have an html file which is like: <div id='author'> <div> <div> ... <a> John Doe </a> I do not know how many div's would be under the author div. It may have different depth for different pages. So what would be the XPath expression for this kind of xml?...

Python 3.3 TypeError: can't use a string pattern on a bytes-like object in re.findall()


python-3.x,web-crawler
I am trying to learn how to automatically fetch urls from a page. In the following code I am trying to get the title of the webpage: import urllib.request import re url = "http://www.google.com" regex = '<title>(,+?)</title>' pattern = re.compile(regex) with urllib.request.urlopen(url) as response: html = response.read() title = re.findall(pattern,...

Stuck scraping a specific table with scrapy


python,xpath,scrapy
So the table I'm trying to scrape can be found here: http://www.betdistrict.com/tipsters I'm after the table titled 'June Stats'. Here's my spider: from __future__ import division from decimal import * import scrapy import urlparse from ttscrape.items import TtscrapeItem class BetdistrictSpider(scrapy.Spider): name = "betdistrict" allowed_domains = ["betdistrict.com"] start_urls = ["http://www.betdistrict.com/tipsters"] def...

Having trouble selecting some specific xpath… (html table, scrapy, xpath)


html,xpath,scrapy
I'm trying to scrape data (using scrapy) from tables that can be found here: http://www.bettingtools.co.uk/tipster-table/tipsters My spider functions when I parse response within the following xpath: //*[@id="imagetable"]/tbody/tr Every table on the page shares that id, so I'm basically grabbing all the table data. However, I only want the table data...

Why scrapy not storing data into mongodb?


python,mongodb,web-scraping,scrapy,scrapy-spider
My main File: import scrapy from scrapy.exceptions import CloseSpider from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.http import Request class Product(scrapy.Item): brand = scrapy.Field() title = scrapy.Field() link = scrapy.Field() name = scrapy.Field() title = scrapy.Field() date = scrapy.Field() heading = scrapy.Field() data = scrapy.Field() Model_name =...

Scrapy running from python script processes only start url


python,python-2.7,scrapy
I have written a Scrapy CrawlSpider. class SiteCrawlerSpider(CrawlSpider): name = 'site_crawler' def __init__(self, start_url, **kw): super(SiteCrawlerSpider, self).__init__(**kw) self.rules = ( Rule(LinkExtractor(allow=()), callback='parse_start_url', follow=True), ) self.start_urls = [start_url] self.allowed_domains = tldextract.extract(start_url).registered_domain def parse_start_url(self, response): external_links = LinkExtractor(allow=(), deny=self.allowed_domains).extract_links(response) for link in external_links: i =...

Apache Nutch REST api


api,rest,web-crawler,nutch
I'm trying to launch a crawl via the rest api. A crawl starts with injecting urls. Using a chrome developer tool "Advanced Rest Client" I'm trying to build this POST payload up but the response I get is a 400 Bad Request. POST - http://localhost:8081/job/create Payload { "crawl-id":"crawl-01", "type":"INJECT", "config-id":"default",...

Check for xpath duplicates while running a for loop in scrapy for python


python,xpath,scrapy
I'm scraping xml data through scrapy and at the same time I want to check on duplicates. For this I'm using the following code: arr = [] for tr in response.xpath('/html/body/table[1]'): if tr.xpath('tr/td/text()').extract() not in arr: arr.append(tr.xpath('tr/td/text()').extract()) print arr This yields the following output (demo data): [[u'test1', u'12', u'test2', u'12',...

T_STRING error in my php code [duplicate]


php,web-crawler
This question already has an answer here: PHP Parse/Syntax Errors; and How to solve them? 10 answers I have this PHP that is supposed to crawl the End Clothing website for product IDs. When I run it, it gives me this error: Parse error: syntax error, unexpected 'i' (T_STRING), expecting...

Scrapy not entering parse method


python,selenium,web-scraping,web-crawler,scrapy
I don't understand why this code is not entering the parse method. It is pretty similar to the basic spider examples from the doc: http://doc.scrapy.org/en/latest/topics/spiders.html And I'm pretty sure this worked earlier in the day... Not sure if I modified something or not.. from selenium.webdriver.support.wait import WebDriverWait from selenium.webdriver.common.by import...

Scrapy redirects to homepage for some urls


scrapy,scrapy-shell
I am new to the Scrapy framework & currently using it to extract articles from multiple 'Health & Wellness' websites. For some of the requests, scrapy is redirecting to the homepage (this behavior is not observed in a browser). Below is an example: Command: scrapy shell "http://www.bornfitness.com/blog/page/10/" Result: 2015-06-19 21:32:15+0530 [scrapy] DEBUG: Web service...

Scrapy: catch responses with specific HTTP server codes


python,web-scraping,scrapy,scrapy-spider
We have a pretty much standard Scrapy project (Scrapy 0.24). I'd like to catch specific HTTP response codes, such as 200, 500, 502, 503, 504 etc. Something like that: class Spider(...): def parse(...): processes HTTP 200 def parse_500(...): processes HTTP 500 errors def parse_502(...): processes HTTP 502 errors ... How...

scrapy xpath not returning desired results. Any idea?


html,xpath,scrapy
Please look at this page http://164.100.47.132/LssNew/psearch/QResult16.aspx?qref=15845. As you would have guessed, I am trying to scrape all the fields on this page. All fields are yield-ed properly except the Answer field. What I find odd is that the page structure for the question and answer is almost the same (Table[1]...

xpath: how to select items between item A and item B


xpath,scrapy
I have an HTML page with this structure: <big><b>Staff in:</b></big> <br> <a href='...'>Movie 1</a> <br> <a href='...'>Movie 2</a> <br> <a href='...'>Movie 3</a> <br> <br> <big><b>Cast in:</b></big> <br> <a href='...'>Movie 4</a> How do I select Movies 1, 2, and 3 using Xpath? I wrote this query '//big/b[text()="Staff in:"]/following::a' but it returns...

AttributeError: 'module' object has no attribute 'Spider'


python,scrapy,scrapy-spider
I just started to learn scrapy. So I followed the scrapy documentation. I just written the first spider mentioned in that site. import scrapy class DmozSpider(scrapy.Spider): name = "dmoz" allowed_domains = ["dmoz.org"] start_urls = [ "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/", "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/" ] def parse(self, response): filename = response.url.split("/")[-2] with open(filename, 'wb') as f: f.write(response.body)...

Scrapy Memory Error (too many requests) Python 2.7


python,django,python-2.7,memory,scrapy
I've been running a crawler in Scrapy to crawl a large site I'd rather not mention. I use the tutorial spider as a template, then I created a series of starting requests and let it crawl from there, using something like this: def start_requests(self): f = open('zipcodes.csv', 'r') lines =...

Distinguishing between HTML and non-HTML pages in Scrapy


python,html,web-crawler,scrapy,scrapy-spider
I am building a Spider in Scrapy that follows all the links it can find, and sends the url to a pipeline. At the moment, this is my code: from scrapy import Spider from scrapy.http import Request from scrapy.http import TextResponse from scrapy.selector import Selector from scrapyTest.items import TestItem import...

Scrap a huge site with scrapy never completed


scrapy,scrapy-spider
I'm scraping a site which has millions of pages and about hundreds of thousands of items. I'm using the CrawlSpider with LxmlLinkExtractors to define the exact path between different types of pages. Everything works fine and my scraper doesn't follow unwanted links. However, the whole site never seems to be...

scraping url and title from nested anchor tag


python,web-scraping,scrapy
This is my first scraper using scrapy. I am trying to scrape the video url and title from https://www.google.co.in/trends/hotvideos#hvsm=0 site. import scrapy from scrapy.item import Item, Field from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector class CraigslistItem(Item): title = Field() link = Field() class DmozSpider(scrapy.Spider): name = "google" allowed_domains = ["google.co.in"] start_urls...

Scrapy xpath construction for tables of data - yielding empty brackets


html,xpath,scrapy
I am attempting to build out xpath constructs for data items I would like to extract from several hundred pages of a site that are all formatted the same. An example site is https://weedmaps.com/dispensaries/cannabicare As can be seen the site has headings and within those headings are rows of item...

Scrapy CrawlSpider not following links


python,web-scraping,web-crawler,scrapy,scrapy-spider
I am trying to crawl some attributes from all (#123) detail pages given on this category page - http://stinkybklyn.com/shop/cheese/ but scrapy is not able to follow the link pattern I set. I checked the scrapy documentation and some tutorials as well, but no luck! Below is the code: import scrapy from scrapy.contrib.linkextractors...

Extracting links with scrapy that have a specific css class


python,web-scraping,scrapy,screen-scraping,scrapy-spider
Conceptually simple question/idea. Using Scrapy, how do I use a LinkExtractor that only follows links with a given CSS class? It seems trivial, like it should already be built in, but I don't see it? Is it? It looks like I can use an XPath, but I'd prefer using...

How to read xml directly from URLs with scrapy/python


python,xml,web-scraping,scrapy,scrapy-spider
In Scrapy you will have to define start_urls. But how can I crawl from other urls as well? Up to now I have a login script which logs into a webpage. After logging in, I want to extract xml from different urls. import scrapy class LoginSpider(scrapy.Spider): name = 'example' start_urls...

How to iterate over many websites and parse text using web crawler


python,web-crawler,sentiment-analysis
I am trying to parse text and run a sentiment analysis over the text from multiple websites. I have successfully been able to strip just one website at a time and generate a sentiment score using the TextBlob library, but I am trying to replicate this over many websites, any...