
How to crawl links on all pages of a web site with Scrapy


Tag: website,web-crawler,scrapy,extract

I'm learning about Scrapy and I'm trying to extract all links that contain "", for example: . But I don't know which page on the web site contains this information. For example, this web site:

The links that I want are on this page:

What could I do? I'm trying to use rules but I don't know how to use regular expressions correctly. Thank you

EDIT 1 ----

I need to search all pages of the main site ( ) for that kind of link ( ). My objective is to get all the links, but I don't know where they are. Right now I'm using simple code like this:

import random

import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = [""]
    start_urls = (
        # start URL removed from the original post
    )
    rules = [Rule(SgmlLinkExtractor(allow=[r'.*']), follow=True),
             Rule(SgmlLinkExtractor(allow=[r'@href']), callback='parse')]

    def parse(self, response):
        filename = str(random.randint(1, 9999))
        with open(filename, 'wb') as f:
            f.write(response.body)

# I'm trying to understand how to use rules correctly

EDIT 2 ----


from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = [""]
    start_urls = (
        # start URL removed from the original post
    )
    rules = [Rule(SgmlLinkExtractor(allow=[r'.*']), follow=True),
             Rule(SgmlLinkExtractor(allow=[r'@href']), callback='parse_links')]

    def parse_links(self, response):
        filename = "Lattes.txt"
        arquivo = open(filename, 'wb')
        extractor = LinkExtractor(allow=r'lattes\.cnpq\.br/\d+')
        for link in extractor.extract_links(response):
            url = link.url
            arquivo.write("%s\n" % url)
            print url

It shows me:

C:\Python27\Scripts\tutorial3>scrapy crawl example
2015-06-02 08:08:18-0300 [scrapy] INFO: Scrapy 0.24.6 started (bot: tutorial3)
2015-06-02 08:08:18-0300 [scrapy] INFO: Optional features available: ssl, http11
2015-06-02 08:08:18-0300 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial3.spiders', 'SPIDER_MODULES': ['tutorial3.spiders'], 'BOT_NAME': 'tutorial3'}
2015-06-02 08:08:19-0300 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState

2015-06-02 08:08:19-0300 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-02 08:08:19-0300 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-02 08:08:19-0300 [scrapy] INFO: Enabled item pipelines:
2015-06-02 08:08:19-0300 [example] INFO: Spider opened
2015-06-02 08:08:19-0300 [example] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-02 08:08:19-0300 [scrapy] DEBUG: Telnet console listening on
2015-06-02 08:08:19-0300 [scrapy] DEBUG: Web service listening on
2015-06-02 08:08:19-0300 [example] DEBUG: Crawled (200) <GET> (referer: None)
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to '': <GET
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to '': <GET>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to '': <GET>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to '': <GET>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to '': <GET>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to '': <GET
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to '': <GET>
2015-06-02 08:08:19-0300 [example] INFO: Closing spider (finished)
2015-06-02 08:08:19-0300 [example] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 215,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 18296,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 6, 2, 11, 8, 19, 912000),
         'log_count/DEBUG': 10,
         'log_count/INFO': 7,
         'offsite/domains': 7,
         'offsite/filtered': 42,
         'request_depth_max': 1,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2015, 6, 2, 11, 8, 19, 528000)}
2015-06-02 08:08:19-0300 [example] INFO: Spider closed (finished)

And looking at the source code of the site, there are more page links that the crawl didn't GET, so maybe my rules are incorrect.


So, a couple of things first:

1) the rules attribute only works if you're extending the CrawlSpider class; it won't work if you extend the simpler scrapy.Spider.

2) if you go the rules and CrawlSpider route, you should not override the default parse callback, because the default implementation is what actually calls the rules -- so you want to use another name for your callback.
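The point about not overriding parse can be illustrated with a toy stand-in (plain Python, not real Scrapy classes, so the names here are made up): the base class's default parse is what drives the rule machinery, so shadowing it disconnects the rules.

```python
class ToyCrawlSpider(object):
    """Toy stand-in for CrawlSpider: the default parse() drives the rules."""

    def parse(self, response):
        # In the real CrawlSpider, this default implementation is what
        # applies the rules and schedules the follow-up requests.
        return "rules applied to %s" % response


class BrokenSpider(ToyCrawlSpider):
    # Overriding parse() shadows the rule-driving default entirely.
    def parse(self, response):
        return "only my callback ran on %s" % response


class GoodSpider(ToyCrawlSpider):
    # A differently named callback leaves the default parse() intact.
    def parse_links(self, response):
        return "my callback ran on %s" % response


print(BrokenSpider().parse("page"))  # the rule machinery never runs
print(GoodSpider().parse("page"))    # inherited parse() still drives the rules
```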

3) to do the actual extraction of the links you want, you can use a LinkExtractor inside your callback to scrape the links from the page:

import scrapy
from scrapy.contrib.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):

    def parse_links(self, response):
        extractor = LinkExtractor(allow=r'lattes\.cnpq\.br/\d+')
        for link in extractor.extract_links(response):
            item = LattesItem()  # LattesItem defined in your items.py
            item['url'] = link.url
            yield item

I hope it helps.
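As a side note, since the question mentions not knowing how to use the regular expressions correctly: the allow pattern is just an ordinary Python regex, so you can sanity-check it with the re module before handing it to a LinkExtractor. A quick sketch (the sample URLs below are made up):

```python
import re

# The same pattern passed to LinkExtractor(allow=r'lattes\.cnpq\.br/\d+')
pattern = re.compile(r'lattes\.cnpq\.br/\d+')

# Hypothetical URLs to test the pattern against
candidates = [
    "http://lattes.cnpq.br/1234567890123456",  # a profile link: should match
    "http://www.cnpq.br/web/portal-lattes",    # a portal page: should not match
]

matched = [url for url in candidates if pattern.search(url)]
print(matched)  # -> ['http://lattes.cnpq.br/1234567890123456']
```

LinkExtractor applies the same search semantics against each extracted URL, so a pattern that behaves correctly here should behave the same way inside the spider.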

