focused crawler by modifying nutch

Question:

Tag: web-crawler,nutch

I want to create a focused crawler using Nutch. Is there any way to modify Nutch to make crawling faster? Can we use the metadata in Nutch to train a classifier that would reduce the number of URLs Nutch has to crawl for a given topic?


Answer:

If the extracted URLs can be distinguished by a regular expression, you can do that with the current Nutch simply by adding a specific regex URL filter rule. But if you want to classify URLs according to metadata features of the page, you have to implement a custom HtmlParseFilter that filters the Outlink[] array during the parse step. A sketch of both approaches follows the links below. For more information about how to develop a plugin for Nutch, follow these links:

http://wiki.apache.org/nutch/AboutPlugins

http://wiki.apache.org/nutch/WritingPluginExample
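As a quick illustration of the first approach, a topic-specific rule in conf/regex-urlfilter.txt might look like the following (a sketch only; the example.com pattern is a hypothetical placeholder, and since the first matching rule wins, the final "-." line rejects everything not explicitly allowed):

# allow only URLs under a topic-specific section (hypothetical pattern)
+^https?://www\.example\.com/sports/.*
# reject everything else
-.

For the second approach, here is a minimal sketch of a custom parse filter, assuming the Nutch 1.x HtmlParseFilter extension point (in Nutch 2.x the interface is named ParseFilter and the signatures differ). The package, class name, and the isOnTopic() heuristic are hypothetical; in a real focused crawler you would plug your trained classifier in there. The plugin must also be declared in its own plugin.xml and enabled via the plugin.includes property in nutch-site.xml.

package org.example.nutch.focus;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

// Sketch of a parse-time outlink filter for a focused crawl.
// Names other than the HtmlParseFilter contract are assumptions.
public class TopicOutlinkFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    if (parse == null) {
      return parseResult;
    }
    ParseData data = parse.getData();

    // Keep only outlinks whose anchor text or target URL looks on-topic.
    List<Outlink> kept = new ArrayList<Outlink>();
    for (Outlink link : data.getOutlinks()) {
      String anchor = link.getAnchor() == null ? "" : link.getAnchor();
      if (isOnTopic(anchor, link.getToUrl())) {
        kept.add(link);
      }
    }

    // Rebuild the ParseData with the reduced outlink set.
    ParseData filtered = new ParseData(data.getStatus(), data.getTitle(),
        kept.toArray(new Outlink[kept.size()]),
        data.getContentMeta(), data.getParseMeta());
    return ParseResult.createParseResult(content.getUrl(),
        new ParseImpl(parse.getText(), filtered));
  }

  // Placeholder heuristic; replace with a call to your trained classifier.
  private boolean isOnTopic(String anchor, String url) {
    String text = (anchor + " " + url).toLowerCase();
    return text.contains("sport"); // example topic keyword
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

With such a filter enabled, only the outlinks that pass isOnTopic() survive the parse step, so the generator has far fewer candidate URLs to schedule, which is what makes the crawl focused and therefore faster for a given topic.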


Related:


Python: urllib2 get nothing which does exist


python,web-scraping,web-crawler,urllib2
I'm trying to crawl my college website and I set cookie, add headers then: homepage=opener.open("website") content = homepage.read() print content I can get the source code sometimes but sometime just nothing. I can't figure it out what happened. Is my code wrong? Or the web matters? Does one geturl() can...

How to iterate over many websites and parse text using web crawler


python,web-crawler,sentiment-analysis
I am trying to parse text and run a sentiment analysis over the text from multiple websites. I have successfully been able to strip just one website at a time and generate a sentiment score using the TextBlob library, but I am trying to replicate this over many websites, any...

How to crawl images in Nutch 2.3 with HBase as backend?


nutch
I want to crawl images from certain sites. So far I tried modifying regex-urlfilter.txt. I changed: -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$ To: -\.(css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|js|JS)$ But it didn't work. I am surprised that I didn't find any documentation regarding crawling images using Nutch 2.3. Referral to any existing documentation would...

Nutch Error: JAVA_HOME is not set


java,ubuntu,nutch
I followed this tutorial http://saskia-vola.com/nutch-2-2-elasticsearch-1-x-hbase/ When I finally tried to run Nutch sudo bin/nutch inject urls I got this error Error: JAVA_HOME is not set. but when I echo JAVA_HOME it returns /usr/lib/jvm/java-7-openjdk-amd64 and it is also in /etc/environment JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64" and also I added line to end of file ~/.bashrc...

fullPage.js: Make all slides and sections visible in search engine results


jquery,seo,web-crawler,single-page-application,fullpage.js
I'm using fullpage.js jQuery plugin for a Single page application. I'm using mostly default settings and the plugin works like a charm. When I got to the SEO though I couldn't properly make Google crawl my website on a "per slide" basis. All my slides are loaded at the page...

Get all links from page on Wikipedia


python,python-2.7,web-crawler
I am making a Python web-crawler program to play The Wiki game. If you're unfamiliar with this game: Start from some article on Wikipedia Pick a goal article Try to get to the goal article from the start article just by clicking wiki/ links My process for doing this is:...

Selenium pdf automatic download not working


python,selenium,selenium-webdriver,web-scraping,web-crawler
I am new to selenium and I am writing a scraper to download pdf files automatically from a given site. Below is my code: from selenium import webdriver fp = webdriver.FirefoxProfile() fp.set_preference("browser.download.folderList",2); fp.set_preference("browser.download.manager.showWhenStarting",False) fp.set_preference("browser.download.dir", "/home/jill/Downloads/Dinamalar") fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf") browser = webdriver.Firefox(firefox_profile=fp)...

Nutch 2.3 REST curl syntax


rest,curl,nutch
I'm trying to use curl to test out the Nutch 2.X REST API. I'm able to start the nutchserver and inject URLS, but I'm having trouble getting the generate step to work. Here's what I've done: curl -i -X POST -H "Content-Type:application/json" http://localhost:8081/job/create -d '{"crawlId":"crawl-01","type":"INJECT","confId":"default","args":{"seedDir":"/Users/username/myNutchFolder/apache-nutch-2.3/runtime/local/urls/"}}' which when I look at...

want to keep running my single ruby crawler that doesn't need html or anything


ruby-on-rails,ruby,web-crawler
first of all, I'm a newbie. I just made a single ruby file, which crawls something on the certain web and put data into my google spreadsheet. But I want my crawler to do its job every morning 9:00 AM. Then what do I need? Maybe a gem and server?...

Heritrix single-site scrape, including required off-site assets


java,web-crawler,heritrix
I believe I need help compiling Heritrix decide rules, although I'm open to other Heritrix suggestions: https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Crawl+Scope+Using+DecideRules I need to scrape an entire copy of a website (in the crawler-beans.cxml seed list), but not scrape any external (off-site) pages. Any external resources needed to render the current website should be downloaded,...

My Java program reaches 80% cpu usage after 20-30 min


java,database,web-crawler,cpu
I have a java program that crawls for some data on some sites and inserts it into the database. The Program keeps doing this : Get the html Extract the relevant data with some splits Insert into to database For the first 5-10 min it runs perfectly and very fast...

How to keep a web crawler running?


javascript,node.js,web-crawler
I want to write my own web crawler in JS. I am thinking of using a node.js solution such as https://www.npmjs.com/package/js-crawler The objective is to have a "crawl" every 10 minutes - so every 10 minutes I want my crawler to fetch data from a website. I understand that I...

Make Scrapy follow links and collect data


python,web-scraping,web-crawler,scrapy
I am trying to write program in Scrapy to open links and collect data from this tag: <p class="attrgroup"></p>. I've managed to make Scrapy collect all the links from given URL but not to follow them. Any help is very appreciated....

Python: Transform a unicode variable into a string variable


python,unicode,casting,web-crawler,unicode-string
I used a web crawler to get some data. I stored the data in a variable price. The type of price is: <class 'bs4.element.NavigableString'> The type of each element of price is: <type 'unicode'> Basically the price contains some white space and line feeds followed by: $520. I want to...

Heritrix not finding CSS files in conditional comment blocks


java,web-crawler,heritrix
The Problem/evidence Heritrix is not detecting the presence of files in conditional comments that open & close in one string, such as this: <!--[if (gt IE 8)|!(IE)]><!--> <link rel="stylesheet" href="/css/mod.css" /> <!--<![endif]--> However standard conditional blocks like this work fine: <!--[if lte IE 9]> <script src="/js/ltei9.js"></script> <![endif]--> I've identified the...

Scraping Multi level data using Scrapy, optimum way


python,selenium,data-structures,web-crawler,scrapy
I have been wondering what would be the best way to scrap the multi level of data using scrapy I will describe the situation in four stage, current architecture that i am following to scrape this data basic code structure the difficulties and why i think there has to be...

Scrapy middleware setup


python,web-scraping,web-crawler,scrapy
I am trying to access public proxy using scrapy to get some data. I get the following error when i try to run the code: ImportError: Error loading object 'craiglist.middlewares.ProxyMiddleware': No module named middlewares I've created middlewares.py file with following code: import base64 # Start your middleware class class ProxyMiddleware(object):...

Web Crawler - TooManyRedirects: Exceeded 30 redirects. (python)


python,web-crawler
I've tried to follow one of the youtube tutorial however I've met some issue. Anyone able to help? I'm new to python, I understand that there is one or two similar question, however, I read and don't understand. Can someone help me out? Thanks import requests from bs4 import BeautifulSoup...

Scrapy collect data from first element and post's title


python,web-scraping,web-crawler,scrapy,scrapy-spider
I need Scrapy to collect data from this tag and retrieve all three parts in one piece. The output would be something like: Tonka double shock boys bike - $10 (Denver). <span class="postingtitletext">Tonka double shock boys bike - <span class="price">$10</span><small> (Denver)</small></span> Second is to collect data from first span tag....

Gora MongoDb Exception, can't serialize Utf8


mongodb,nutch,gora
I'm trying to get nutch 2.3 work with mongoDB but I get the following exception: java.lang.IllegalArgumentException: can't serialize class org.apache.avro.util.Utf8 at org.bson.BasicBSONEncoder._putObjectField(BasicBSONEncoder.java:284) at org.bson.BasicBSONEncoder.putObject(BasicBSONEncoder.java:185) I've found the following ticket related to this problem, which says it should be resolved in nutch 2.3: https://issues.apache.org/jira/browse/NUTCH-1843 There's another ticket for the Gora project...

Apache Nutch REST api


api,rest,web-crawler,nutch
I'm trying to launch a crawl via the rest api. A crawl starts with injecting urls. Using a chrome developer tool "Advanced Rest Client" I'm trying to build this POST payload up but the response I get is a 400 Bad Request. POST - http://localhost:8081/job/create Payload { "crawl-id":"crawl-01", "type":"INJECT", "config-id":"default",...

Scrapy CrawlSpider not following links


python,web-scraping,web-crawler,scrapy,scrapy-spider
I am trying to crawl some attributes from all(#123) detail pages given on this category page - http://stinkybklyn.com/shop/cheese/ but scrapy is not able to follow link pattern I set, I checked on scrapy documentation and some tutorials as well but No Luck! Below is the code: import scrapy from scrapy.contrib.linkextractors...

New to Python, what am I doing wrong and not seeing tag (links) returned with BS4


python,beautifulsoup,web-crawler,bs4
I'm new to python and learning it. Basically I am trying to pull all the links from my e-commerce store products that is stored in the html below. I'm getting no results returned though and I can't seem to figure out why not. <h3 class="two-lines-name"> <a title="APPLE IPOD IPOD A1199...

The scrapy LinkExtractor(allow=(url)) gets the wrong crawled page, the regex doesn't work


python,web-crawler,scrapy
I want to crawl the page http://www.douban.com/tag/%E7%88%B1%E6%83%85/movie . And some part of my spider code is : class MovieSpider(CrawlSpider): name = "doubanmovie" allowed_domains = ["douban.com"] start_urls = ["http://www.douban.com/tag/%E7%88%B1%E6%83%85/movie"] rules = ( Rule(LinkExtractor(allow=(r'http://www.douban.com/tag/%E7%88%B1%E6%83%85/movie\?start=\d{2}'))), Rule(LinkExtractor(allow=(r"http://movie.douban.com/subject/\d+")), callback = "parse_item") ) def start_requests(self): yield...

Python 3.3 TypeError: can't use a string pattern on a bytes-like object in re.findall()


python-3.x,web-crawler
I am trying to learn how to automatically fetch urls from a page. In the following code I am trying to get the title of the webpage: import urllib.request import re url = "http://www.google.com" regex = '<title>(,+?)</title>' pattern = re.compile(regex) with urllib.request.urlopen(url) as response: html = response.read() title = re.findall(pattern,...

Authorization issue with cron crawler inserting data into Google spreadsheet using Google API in Ruby


ruby,cron,google-api,web-crawler,google-api-client
My project is to crawl the certain web data and put them into my Google spreadsheet every morning 9:00. And it has to get the authorization to read & write something. That's why the code below is located at the top. # Google API CLIENT_ID = blah blah CLIENT_SECRET =...

Scrapy delay request


python,web-crawler,scrapy
every time i run my code my ip gets banned. I need help to delay each request for 10 seconds. I've tried to place DOWNLOAD_DELAY in code but it gives no results. Any help is appreciated. # item class included here class DmozItem(scrapy.Item): # define the fields for your item...

Scrapy not entering parse method


python,selenium,web-scraping,web-crawler,scrapy
I don't understand why this code is not entering the parse method. It is pretty similar to the basic spider examples from the doc: http://doc.scrapy.org/en/latest/topics/spiders.html And I'm pretty sure this worked earlier in the day... Not sure if I modified something or not.. from selenium.webdriver.support.wait import WebDriverWait from selenium.webdriver.common.by import...

Why scrapy not giving all the results and the rules part is also not working?


python,xpath,web-scraping,web-crawler,scrapy
This script is only providing me with the first result or the .extract()[0] if I change 0 to 1 then next item. Why it is not iterating the whole xpath again? The rule part is also not working. I know the problem is in the response.xpath. How to deal with...

Unable to click in CasperJS


javascript,web-crawler,phantomjs,casperjs
I want to crawl the HTML data, and I tried a headless browser in CasperJS, but I can't click. The following is the code I tried in CasperJS: var casper = require('casper').create(); var mouse = require('mouse').create(casper); casper.start('http://sts.kma.go.kr/jsp/home/contents/climateData/smart/smartStatisticsSearch.do', function() { this.echo('START'); }); casper.then(function() { this.capture("1.png"); this.mouse.click('li[class="item1"]'); casper.wait(5000, function() { this.capture("2.png"); }); });...

Distinguishing between HTML and non-HTML pages in Scrapy


python,html,web-crawler,scrapy,scrapy-spider
I am building a Spider in Scrapy that follows all the links it can find, and sends the url to a pipeline. At the moment, this is my code: from scrapy import Spider from scrapy.http import Request from scrapy.http import TextResponse from scrapy.selector import Selector from scrapyTest.items import TestItem import...

Scrapy follow link and collect email


python,web-scraping,web-crawler,scrapy
i need help with saving email with Scrapy. The row in .csv file where emails are supposed to be collected is blank. Any help is very appreciated. Here is the code: # -*- coding: utf-8 -*- import scrapy # item class included here class DmozItem(scrapy.Item): # define the fields for...

Cannot ant runtime in Apache nutch 2.3


java,apache,ant,nutch
I followed this tutorial https://wiki.apache.org/nutch/Nutch2Tutorial. When I tried to run ant runtime I was getting this message BUILD FAILED /usr/local/nutch/framework/apache-nutch-2.3/build.xml:113: The following error occurred while executing this line: /usr/local/nutch/framework/apache-nutch-2.3/src/plugin/build.xml:35: The following error occurred while executing this line: /usr/local/nutch/framework/apache-nutch-2.3/src/plugin/build-plugin.xml:117: Compile failed; see the compiler error output for details. This is on...

How to crawl links on all pages of a web site with Scrapy


website,web-crawler,scrapy,extract
I'm learning about scrapy and I'm trying to extract all links that contains: "http://lattes.cnpq.br/andasequenceofnumbers" , example: http://lattes.cnpq.br/0281123427918302 But I don't know what is the page on the web site that contains these information. For example this web site: http://www.ppgcc.ufv.br/ The links that I want are on this page: http://www.ppgcc.ufv.br/?page_id=697 What...

PHP web crawler, check URL for path


php,url,path,web-crawler,bots
I'm writing a simple web crawler to grab some links from a site. I need to check the returned links to make sure I selectively collect what I want. For example, here's a few links returned from http://www.polygon.com/ [0] http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide#comments [1] http://www.polygon.com/videos [2] http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide [3] http://www.polygon.com/features so link 0 and...

SgmlLinkExtractor in scrapy


web-crawler,scrapy,rules,extractor
i need some enlightenment about SgmlLinkExtractor in scrapy. For the link: example.com/YYYY/MM/DD/title i would write: Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\d{2}/\w+']), callback='parse_example')] For the link: example.com/news/economic/title should i write: r'\news\category\w+'or r'\news\w+/\w+' ? (category changes but the url contains always news) For the link: example.com/article/title should i write: r'\article\w+' ? (the url contains always article)...

Ruby - WebCrawler how to visit the links of the found links?


ruby,url,hyperlink,web-crawler,net-http
I am trying to make a WebCrawler which finds links from a homepage and visits the found links again and again. Now I have written code with a parser which shows me the found links and prints statistics of some tags of this homepage, but I don't get it...

how to download image in Goutte


php,web-crawler,guzzle,goutte
I want to download an image on this page. The image source is http://i2.pixiv.net/c/600x600/img-master/img/2015/01/19/12/17/13/48258889_p0_master1200.jpg. I try to download it using this: $client = new Goutte\Client (); $client->getClient->get($img_url, array('save_to' => $img_url_save_name)); But I failed, then I realized that if I directly access http://i2.pixiv.net/c/600x600/img-master/img/2015/01/19/12/17/13/48258889_p0_master1200.jpg, I am denied by the CDN nginx server. I have to access...

how to check whether a program using requests module is dead or not


python,web-crawler,downloading
I am trying to using python download a batch of files, and I use requests module with stream turned on, in other words, I retrieve each file in 200K blocks. However, sometimes, the downloading may stop as it just gets stuck (no response) and there is no error. I guess...

How can I get the value of a Monad without System.IO.Unsafe? [duplicate]


haskell,web-crawler,monads
This question already has an answer here: How to get normal value from IO action in Haskell 2 answers I just started learning Haskell and got my first project working today. Its a small program that uses Network.HTTP.Conduit and Graphics.Rendering.Chart (haskell-chart) to plot the amount of google search results...

Crawling & parsing results of querying google-like search engine


java,parsing,web-crawler,jsoup
I have to write a parser in Java (my first HTML parser written this way). For now I'm using the jsoup library and I think it is a very good solution for my problem. The main goal is to get some information from Google Scholar (h-index, number of publications, years of scientific career). I...

Workload balancing between akka actors


multithreading,scala,web-crawler,akka,actor
I have 2 akka actors used for crawling links, i.e. find all links in page X, then find all links in all pages linked from X, etc... I want them to progress more or less at the same pace, but more often than not one of them becomes starved and...

T_STRING error in my php code [duplicate]


php,web-crawler
This question already has an answer here: PHP Parse/Syntax Errors; and How to solve them? 10 answers I have this PHP that is supposed to crawl End Clothing website for product IDs When I run it its gives me this error Parse error: syntax error, unexpected 'i' (T_STRING), expecting...

Scrapy returning a null output when extracting an element from a table using xpath


python,xpath,web-scraping,web-crawler,scrapy
I have been trying to scrape this website that has details of oil wells in Colorado https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=12307555&type=WELL Scrapy scrapes the website, and returns the URL when I scrape it, but when I need to extract an element inside a table using it's XPath (County of the oil well), all i...

Web Scraper for dynamic forms in python


python,web-scraping,web-crawler,mechanize
I am trying to fill the form of this website http://www.marutisuzuki.com/Maruti-Price.aspx. It consists of three drop down lists. One is Model of the car, Second is the state and third is city. The first two are static and the third, city is generated dynamically depending upon the value of state,...

Check if element exists in fetched URL [closed]


javascript,jquery,python,web-crawler,window.open
I have a page with, say, 30 URLS, I need to click on each and check if an element exists. Currently, this means: $('area').each(function(){ $(this).attr('target','_blank'); var _href = $(this).attr("href"); var appID = (window.location.href).split('?')[1]; $(this).attr("href", _href + '?' + appID); $(this).trigger('click'); }); Which opens 30 new tabs, and I manually go...

Howto use scrapy to crawl a website which hides the url as href=“javascript:;” in the next button


javascript,python,pagination,web-crawler,scrapy
I am learning python and scrapy lately. I googled and searched around for a few days, but I don't seem to find any instruction on how to crawl multiple pages on a website with hidden urls - <a href="javascript:;". Basically each page contains 20 listings, each time you click on...