Nutch skip url containing #

I am learning Nutch. I have set up Nutch and started crawling sites, but I cannot figure out how to exclude URLs containing #, which are causing a lot of duplication. I have checked regex-urlfilter.txt: # skip URLs containing certain characters...
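To illustrate the duplication described above, here is a minimal Python sketch showing that URLs differing only in their # fragment collapse to one address once the fragment is dropped. This is not Nutch's own mechanism; the example.com URLs and the `strip_fragment` helper are hypothetical and only demonstrate the idea.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL without its '#...' fragment (hypothetical helper)."""
    return urldefrag(url).url


# Hypothetical URLs: the first three point at the same document.
urls = [
    "http://example.com/page",
    "http://example.com/page#top",
    "http://example.com/page#comments",
    "http://example.com/other#a",
]

# Deduplicate after removing fragments.
unique = sorted({strip_fragment(u) for u in urls})
print(unique)
```

In a crawler, the same effect is usually obtained either by rejecting such URLs in the URL filter or by normalizing them before they are stored; which one fits depends on the setup.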

Apache Nutch REST API

I'm trying to launch a crawl via the REST API. A crawl starts with injecting URLs. Using the Chrome tool "Advanced Rest Client", I'm trying to build up this POST payload, but the response I get is a 400 Bad Request. POST - http://localhost:8081/job/create Payload { "crawl-id":"crawl-01", "type":"INJECT", "config-id":"default",...

Nutch Error: JAVA_HOME is not set

I followed this tutorial: http://saskia-vola.com/nutch-2-2-elasticsearch-1-x-hbase/ When I finally tried to run Nutch with sudo bin/nutch inject urls, I got this error: Error: JAVA_HOME is not set. But when I echo JAVA_HOME it returns /usr/lib/jvm/java-7-openjdk-amd64, and it is also set in /etc/environment as JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64", and I also added a line to the end of the file ~/.bashrc...

How to crawl images in Nutch 2.3 with HBase as backend?

I want to crawl images from certain sites. So far I have tried modifying regex-urlfilter.txt. I changed: -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$ To: -\.(css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|js|JS)$ But it didn't work. I am surprised that I didn't find any documentation on crawling images with Nutch 2.3. A referral to any existing documentation would...
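As a quick check of what the edited rule does, the sketch below applies both suffix patterns in Python, assuming they contain no embedded whitespace. Nutch evaluates these rules with Java regular expressions, so this is only an approximation; the `is_excluded` helper and the example.com URLs are hypothetical. A leading `-` in regex-urlfilter.txt marks a pattern whose matches are excluded, so here a match means "rejected".

```python
import re

# Original exclusion pattern (image extensions included).
OLD_RULE = (
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
    r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
    r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"
)

# Edited exclusion pattern (image extensions removed).
NEW_RULE = (
    r"\.(css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG"
    r"|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|js|JS)$"
)


def is_excluded(rule: str, url: str) -> bool:
    """True if the URL matches the exclusion rule (hypothetical helper)."""
    return re.search(rule, url) is not None


for url in ("http://example.com/photo.jpg", "http://example.com/style.css"):
    print(url, "old:", is_excluded(OLD_RULE, url), "new:", is_excluded(NEW_RULE, url))
```

With the edited pattern the .jpg URL is no longer rejected by this rule, which suggests that if images are still missing, the cause may lie somewhere other than this line of the filter.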

Nutch 2.3 REST curl syntax

I'm trying to use curl to test out the Nutch 2.x REST API. I'm able to start the nutchserver and inject URLs, but I'm having trouble getting the generate step to work. Here's what I've done: curl -i -X POST -H "Content-Type:application/json" http://localhost:8081/job/create -d '{"crawlId":"crawl-01","type":"INJECT","confId":"default","args":{"seedDir":"/Users/username/myNutchFolder/apache-nutch-2.3/runtime/local/urls/"}}' which when I look at...
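The curl call above can also be expressed as a small Python sketch, which makes it easier to see the JSON body that is being sent. It only builds the request for the INJECT payload quoted in the question; the `build_job_request` helper is hypothetical, and the arguments needed for other job types are not shown in the excerpt, so they are left to the caller.

```python
import json
from urllib.request import Request

NUTCH_SERVER = "http://localhost:8081"  # server address taken from the question


def build_job_request(job_type, args, crawl_id="crawl-01", conf_id="default"):
    """Build (but do not send) a POST to /job/create (hypothetical helper)."""
    payload = {
        "crawlId": crawl_id,
        "type": job_type,
        "confId": conf_id,
        "args": args,
    }
    return Request(
        NUTCH_SERVER + "/job/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


inject = build_job_request(
    "INJECT",
    {"seedDir": "/Users/username/myNutchFolder/apache-nutch-2.3/runtime/local/urls/"},
)
print(inject.get_method(), inject.full_url)
print(inject.data.decode("utf-8"))
# Sending requires a running nutchserver: urllib.request.urlopen(inject)
```

Serializing the payload with `json.dumps` avoids the quoting mistakes that are easy to make when the JSON is typed inline in a shell command.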

How to resume a previous incomplete job in the Apache Nutch crawler

I am using Nutch 2.3. During any stage of a Nutch crawl (fetch, parse, index, etc.), a network problem or a power outage may occur. How can I resume the previous incomplete job? Please give an example as explanation...

Gora MongoDb Exception, can't serialize Utf8

I'm trying to get Nutch 2.3 to work with MongoDB, but I get the following exception: java.lang.IllegalArgumentException: can't serialize class org.apache.avro.util.Utf8 at org.bson.BasicBSONEncoder._putObjectField(BasicBSONEncoder.java:284) at org.bson.BasicBSONEncoder.putObject(BasicBSONEncoder.java:185) I've found the following ticket related to this problem, which says it should be resolved in Nutch 2.3: https://issues.apache.org/jira/browse/NUTCH-1843 There's another ticket for the Gora project...

Nutch, NoSuchElementException error after removing table from HBase

I use Nutch for crawling some sites. At one point I decided to clear all crawl results and simply removed the "webpage" table from the HBase store using the hbase shell. After that, Nutch throws this exception: java.util.NoSuchElementException at java.util.TreeMap.key(TreeMap.java:1221) at java.util.TreeMap.firstKey(TreeMap.java:285) at org.apache.gora.memory.store.MemStore.execute(MemStore.java:125) at org.apache.gora.query.impl.QueryBase.execute(QueryBase.java:73) at org.apache.gora.mapreduce.GoraRecordReader.executeQuery(GoraRecordReader.java:68) at...

Parsing huge amount of HTML with Java [closed]

What's the best way to pass HTML to Java? Specifically, I need to crawl through 2TB of HTML files (.warc format, using nutchWAX) and feed them to my Java program one at a time. Workflow: crawl a page, send the page to the Java program, wait for an answer, and then continue crawling...

Focused crawler by modifying Nutch

I want to create a focused crawler using Nutch. Is there any way to modify Nutch to make crawling faster? Can we use the metadata in Nutch to train a classifier that would reduce the number of URLs Nutch has to crawl for a given topic?

Cannot run ant runtime in Apache Nutch 2.3

I followed this tutorial: https://wiki.apache.org/nutch/Nutch2Tutorial. When I tried to run ant runtime, I got this message: BUILD FAILED /usr/local/nutch/framework/apache-nutch-2.3/build.xml:113: The following error occurred while executing this line: /usr/local/nutch/framework/apache-nutch-2.3/src/plugin/build.xml:35: The following error occurred while executing this line: /usr/local/nutch/framework/apache-nutch-2.3/src/plugin/build-plugin.xml:117: Compile failed; see the compiler error output for details. This is on...

./bin/hbase shell command not working

I am integrating Nutch with HBase. While doing a dummy test of HBase by typing ./bin/hbase shell, I am getting the following error: ./bin/hbase: line 392: /etc/java-7-openjdk//bin/java: No such file or directory Thank you...

Apache Nutch and Solr: queries

I have just started using Nutch 1.9 and Solr 4.10. After browsing through certain pages, I see that the syntax for running this version has changed, and I have to update certain XML files to configure Nutch and Solr. This version of the package doesn't require Tomcat to run. I started Solr:...