How to crawl links on all pages of a web site with Scrapy

Question:

Tag: website,web-crawler,scrapy,extract

I'm learning Scrapy and I'm trying to extract all links that contain "http://lattes.cnpq.br/" followed by a sequence of numbers, for example: http://lattes.cnpq.br/0281123427918302. But I don't know which page of the web site contains this information. For example, on this web site:

http://www.ppgcc.ufv.br/

The links that I want are on this page:

http://www.ppgcc.ufv.br/?page_id=697

What could I do? I'm trying to use rules, but I don't know how to use regular expressions correctly. Thank you.
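For reference, the URL pattern itself is simple: the fixed host followed by a run of digits. A minimal sketch using Python's re module (the sample profile URL is the one above):

import re

# a Lattes profile URL: the fixed host followed by a sequence of digits
pattern = re.compile(r'lattes\.cnpq\.br/\d+')

print(bool(pattern.search('http://lattes.cnpq.br/0281123427918302')))  # True
print(bool(pattern.search('http://www.ppgcc.ufv.br/?page_id=697')))    # False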

EDIT 1 ----

I need to search all pages of the main site (ppgcc.ufv.br) for links of this kind (http://lattes.cnpq.br/asequenceofnumbers). My objective is to get all the lattes.cnpq.br/numbers links, but I don't know where they are. Currently I'm using simple code like this:

import random

import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["ppgcc.ufv.br"]
    start_urls = (
        'http://www.ppgcc.ufv.br/',
    )
    rules = [Rule(SgmlLinkExtractor(allow=[r'.*']), follow=True),
             Rule(SgmlLinkExtractor(allow=[r'@href']), callback='parse')]

    def parse(self, response):
        filename = str(random.randint(1, 9999))
        open(filename, 'wb').write(response.body)

# I'm trying to understand how to use rules correctly

EDIT 2 ----

Using:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = [".ppgcc.ufv.br"]
    start_urls = (
        'http://www.ppgcc.ufv.br/',
    )
    rules = [Rule(SgmlLinkExtractor(allow=[r'.*']), follow=True),
            Rule(SgmlLinkExtractor(allow=[r'@href']), callback='parse_links')]
    def parse_links(self, response):
        filename = "Lattes.txt"
        arquivo = open(filename, 'wb')
        extractor = LinkExtractor(allow=r'lattes\.cnpq\.br/\d+')
        for link in extractor.extract_links(response):
            url = link.url
            arquivo.write("%s\n" % url)
            print url

It shows me:

C:\Python27\Scripts\tutorial3>scrapy crawl example
2015-06-02 08:08:18-0300 [scrapy] INFO: Scrapy 0.24.6 started (bot: tutorial3)
2015-06-02 08:08:18-0300 [scrapy] INFO: Optional features available: ssl, http11
2015-06-02 08:08:18-0300 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial3.spiders', 'SPIDER_MODULES': ['tutorial3.spiders'], 'BOT_NAME': 'tutorial3'}
2015-06-02 08:08:19-0300 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState

2015-06-02 08:08:19-0300 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-02 08:08:19-0300 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-02 08:08:19-0300 [scrapy] INFO: Enabled item pipelines:
2015-06-02 08:08:19-0300 [example] INFO: Spider opened
2015-06-02 08:08:19-0300 [example] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-02 08:08:19-0300 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-02 08:08:19-0300 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-06-02 08:08:19-0300 [example] DEBUG: Crawled (200) <GET http://www.ppgcc.ufv.br/> (referer: None)
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to 'www.cgu.gov.br': <GET http://www.cgu.gov.br/acessoainformacaogov/>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to 'www.brasil.gov.br': <GET http://www.brasil.gov.br/>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to 'www.ppgcc.ufv.br': <GET http://www.ppgcc.ufv.br/>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to 'www.ufv.br': <GET http://www.ufv.br/>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to 'www.dpi.ufv.br': <GET http://www.dpi.ufv.br/>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to 'www.portal.ufv.br': <GET http://www.portal.ufv.br/?page_id=84>
2015-06-02 08:08:19-0300 [example] DEBUG: Filtered offsite request to 'www.wordpress.org': <GET http://www.wordpress.org/>
2015-06-02 08:08:19-0300 [example] INFO: Closing spider (finished)
2015-06-02 08:08:19-0300 [example] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 215,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 18296,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 6, 2, 11, 8, 19, 912000),
         'log_count/DEBUG': 10,
         'log_count/INFO': 7,
         'offsite/domains': 7,
         'offsite/filtered': 42,
         'request_depth_max': 1,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2015, 6, 2, 11, 8, 19, 528000)}
2015-06-02 08:08:19-0300 [example] INFO: Spider closed (finished)

And looking at the site's source code, there are more page links that the crawl didn't GET; maybe my rules are incorrect.


Answer:

So, a couple of things first:

1) the rules attribute only works if you're extending the CrawlSpider class; it won't work if you extend the simpler scrapy.Spider.

2) if you go the rules and CrawlSpider route, you should not override the default parse callback, because the default implementation is what actually calls the rules -- so you want to use another name for your callback.
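For example, here is a minimal sketch of that wiring (the spider name and the Scrapy 0.24-era import paths are assumptions, not something from the question):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class LattesSpider(CrawlSpider):  # hypothetical name
    name = "lattes"
    # note: no leading dot -- allowed_domains = [".ppgcc.ufv.br"] makes the
    # offsite middleware filter out the site's own pages, as the log above shows
    allowed_domains = ["ppgcc.ufv.br"]
    start_urls = ['http://www.ppgcc.ufv.br/']

    # follow every internal link, handing each fetched page to parse_links,
    # a callback name that doesn't clash with CrawlSpider's built-in parse
    rules = [Rule(LinkExtractor(), callback='parse_links', follow=True)]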

3) to do the actual extraction of the links you want, you can use a LinkExtractor inside your callback to scrape the links from the page:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    ...

    def parse_links(self, response):
        extractor = LinkExtractor(allow=r'lattes\.cnpq\.br/\d+')
        for link in extractor.extract_links(response):
            item = LattesItem()
            item['url'] = link.url
            yield item  # hand the populated item to the item pipeline
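Note that LattesItem is not defined in the snippet above; a minimal definition for it (hypothetical, assuming a single url field in your project's items.py) might look like:

import scrapy

class LattesItem(scrapy.Item):
    url = scrapy.Field()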

I hope it helps.

