

Apache Nutch REST api

api,rest,web-crawler,nutch
I'm trying to launch a crawl via the rest api. A crawl starts with injecting urls. Using a chrome developer tool "Advanced Rest Client" I'm trying to build this POST payload up but the response I get is a 400 Bad Request. POST - http://localhost:8081/job/create Payload { "crawl-id":"crawl-01", "type":"INJECT", "config-id":"default",...

How can I scrape pages with dynamic content using node.js?

node.js,request,web-crawler,phantomjs,cheerio
I am trying to scrape a website, but I don't get some of the elements because these elements are created dynamically. I use cheerio in Node.js, and my code is below. var request = require('request'); var cheerio = require('cheerio'); var url = "http://www.bdtong.co.kr/index.php?c_category=C02"; request(url, function (err, res, html) {...

How to check whether the html has changed?

javascript,html,web-scraping,firefox-addon,web-crawler
Apologies if this is the wrong place, but I have no clue where to ask. We are building a Firefox add-on that works on selected websites. Now, because those websites tend to change once in a while, I want to run a JavaScript script once a day that will check whether the specific...

Net/HTTPS not getting all the content

ruby,web-crawler,nokogiri,net-http,mechanize-ruby
I need to login into Jenkins through a crawler to collect some data, but Net/HTTPS gets an incomplete page in comparison to Jenkins' source, here are both sources: Net/HTTPS' HTML <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <meta http-equiv="refresh" content="1;url=/login?from=%2F"> <script> window.location.replace('/login?from=%2F'); </script>...

jsoup crawler error when called inside a servlet

java,google-app-engine,servlets,web-crawler,jsoup
I'm trying to crawl flipkart product specifications and the code works fine when I run it as a java application. But when I call it inside a servlet it gives me an error: org.jsoup.nodes.Document doc; Elements specs = null; try { doc = Jsoup.connect(link).timeout(250000).get(); specs = doc.select("table[class=specTable]"); System.out.println(specs); } catch...

Get all links from page on Wikipedia

python,python-2.7,web-crawler
I am making a Python web-crawler program to play The Wiki game. If you're unfamiliar with this game: Start from some article on Wikipedia Pick a goal article Try to get to the goal article from the start article just by clicking wiki/ links My process for doing this is:...
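As an illustration of the link-collection step such a Wiki-game crawler needs, here is a minimal sketch; the question only specifies Python 2.7, so the use of requests and BeautifulSoup here is an assumption.

```python
import requests
from bs4 import BeautifulSoup

def wiki_links(title):
    """Return the set of article titles linked from one Wikipedia article."""
    url = "https://en.wikipedia.org/wiki/" + title
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    links = set()
    for a in soup.select('a[href^="/wiki/"]'):
        target = a["href"][len("/wiki/"):]
        if ":" not in target:  # skip File:, Category:, Help: and other non-article namespaces
            links.add(target)
    return links

print(len(wiki_links("Python_(programming_language)")))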

Crawling websites and dynamic urls

php,website,seo,web-crawler
Do search engine robots crawl my dynamically generated URLs? By this I mean HTML pages generated by PHP based on GET variables in the URL. The links would look like this: http://www.mywebsite.com/view.php?name=something http://www.mywebsite.com/view.php?name=somethingelse http://www.mywebsite.com/view.php?name=something I have tried crawling my website with a test crawler found here: http://robhammond.co/tools/seo-crawler but it only...

Get Facebook name from id with API/crawler

facebook,facebook-graph-api,web-crawler
I am trying to get the Facebook username of a user according to his Facebook id. For example, if my Facebook name is: Mor Amit and my Facebook id is: 875810135770071, I want to get the Facebook username, which is: mor.amit.3. Do you know how I can do it? Thank you...

How to output crawled data from multiple webpages into a CSV file using Python with Scrapy

python-2.7,web-scraping,web-crawler,scrapy,scrapy-spider
I have the following code below which crawls all the available pages from a website. It is `crawling` the valid pages perfectly, because when I use the print function I can see the data from the `'items'` list, but I don't see any output when I try to use `.csv`...
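When items show up with print but the CSV stays empty, the usual cause is that the callback prints instead of yielding, so the feed exporter never sees an item. A hedged sketch follows; the spider name, selectors, and field names are placeholders, and yielding plain dicts needs Scrapy 1.0+ (older versions need an Item class).

```python
import scrapy

class ListingSpider(scrapy.Spider):          # hypothetical spider
    name = "listings"
    start_urls = ["http://example.com/page/1"]

    def parse(self, response):
        for row in response.xpath('//div[@class="listing"]'):   # placeholder selector
            # items must be yielded (or returned), not printed, to reach the exporter
            yield {
                "title": row.xpath(".//h2/text()").extract_first(),
                "price": row.xpath('.//span[@class="price"]/text()').extract_first(),
            }

# then export from the command line:
#   scrapy crawl listings -o items.csv
```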

How to keep a web crawler running?

javascript,node.js,web-crawler
I want to write my own web crawler in JS. I am thinking of using a node.js solution such as https://www.npmjs.com/package/js-crawler The objective is to have a "crawl" every 10 minutes - so every 10 minutes I want my crawler to fetch data from a website. I understand that I...

Stop Scrapy crawling the same URLs

python,web-scraping,web-crawler,scrapy,duplication
I've written a basic Scrapy spider to crawl a website which seems to run fine other than the fact it doesn't want to stop, i.e. it keeps revisiting the same urls and returning the same content - I always end up having to stop it. I suspect it's going over...
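Scrapy's built-in duplicate filter only drops byte-identical request URLs, so links that differ in query-string order or fragments keep coming back. One hedged sketch (not the asker's code) is to canonicalize every link and keep a seen set before scheduling it; w3lib ships with Scrapy.

```python
import scrapy
from w3lib.url import canonicalize_url

class SiteSpider(scrapy.Spider):              # hypothetical spider
    name = "site"
    start_urls = ["http://example.com/"]      # placeholder
    seen = set()

    def parse(self, response):
        for href in response.xpath("//a/@href").extract():
            url = canonicalize_url(response.urljoin(href))  # sort query args, drop fragments
            if url in self.seen:
                continue
            self.seen.add(url)
            yield scrapy.Request(url, callback=self.parse)
```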

“TypeError: 'Rule' object is not iterable” webscraping an .aspx page in python

python-2.7,selenium,web-crawler
I am using the following code to scrape this website (http://profiles.ehs.state.ma.us/Profiles/Pages/ChooseAPhysician.aspx?Page=1); however, I obtain the following TypeError: "File "C:\Users\Anaconda2\lib\site-packages\scrapy\contrib\spiders\crawl.py", line 83, in _compile_rules self._rules = [copy.copy(r) for r in self.rules] TypeError: 'Rule' object is not iterable". I don't have any code written on line 83, so I am wondering if anyone has...
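The traceback comes from Scrapy iterating over self.rules, so this error usually means rules was assigned a single Rule rather than a tuple of them. A hedged sketch of the fix, using the scrapy.contrib import paths shown in the quoted traceback and a placeholder rule:

```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class PhysicianSpider(CrawlSpider):           # hypothetical spider name
    name = "physicians"
    start_urls = ["http://profiles.ehs.state.ma.us/Profiles/Pages/ChooseAPhysician.aspx?Page=1"]

    rules = (
        # the trailing comma makes this a 1-tuple; a bare Rule(...) here
        # raises "TypeError: 'Rule' object is not iterable" in _compile_rules
        Rule(LinkExtractor(allow=(r"Page=\d+",)), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        self.log("visited %s" % response.url)
```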

Check if element exists in fetched URL [closed]

javascript,jquery,python,web-crawler,window.open
I have a page with, say, 30 URLS, I need to click on each and check if an element exists. Currently, this means: $('area').each(function(){ $(this).attr('target','_blank'); var _href = $(this).attr("href"); var appID = (window.location.href).split('?')[1]; $(this).attr("href", _href + '?' + appID); $(this).trigger('click'); }); Which opens 30 new tabs, and I manually go...

My Java program reaches 80% cpu usage after 20-30 min

java,database,web-crawler,cpu
I have a Java program that crawls for some data on some sites and inserts it into the database. The program keeps doing this: get the HTML, extract the relevant data with some splits, insert it into the database. For the first 5-10 min it runs perfectly and very fast...

Create accounts only for real people

session,cookies,web-crawler
I am building a simple website where users can try the site without registering. I basically create a shadow account and log users in without them knowing, so I don't have to bother with separate functionality for not-logged-in users. I then set a cookie for the user so they can come...

Heritrix single-site scrape, including required off-site assets

java,web-crawler,heritrix
I believe I need help compiling Heritrix decide rules, although I'm open to other Heritrix suggestions: https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Crawl+Scope+Using+DecideRules I need to scrape an entire copy of a website (in the crawler-beans.cxml seed list), but not scrape any external (off-site) pages. Any external resources needed to render the current website should be downloaded,...

Selenium Click() not working with scrapy spider

javascript,selenium-webdriver,click,web-crawler,scrapy
I am trying to scrape links to product pages from a listing page using a scrapy spider. The page shows the first 10 machines and has a button for 'show all machines' that calls some javascript. The javascript is reasonably complicated (i.e. I can't just look at the function and...

How to Pass variables inside functions using new method

java,variables,web-crawler
I have this code in my web crawler project: public static void main(String[] args) throws Exception { String frontierUrl = "http://www.cnn.com"; //creates a new instance of class WebCrawler WebCrawler webCrawler = new WebCrawler(); //Add the frontier url to the queue first webCrawler.enque(new LinkNode(frontierUrl)); webCrawler.processQueue(); } void enque(LinkNode link){ link.setEnqueTime(new Date());...

Python 3.3 TypeError: can't use a string pattern on a bytes-like object in re.findall()

python-3.x,web-crawler
I am trying to learn how to automatically fetch urls from a page. In the following code I am trying to get the title of the webpage: import urllib.request import re url = "http://www.google.com" regex = '<title>(,+?)</title>' pattern = re.compile(regex) with urllib.request.urlopen(url) as response: html = response.read() title = re.findall(pattern,...
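In Python 3, response.read() returns bytes, so a str regex cannot be applied to it. A minimal sketch that decodes first (and uses .+? for the capture group, assuming the page is UTF-8):

```python
import re
import urllib.request

url = "http://www.google.com"
pattern = re.compile(r"<title>(.+?)</title>")            # str pattern

with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8", errors="replace")  # bytes -> str before matching

print(re.findall(pattern, html))
```

Alternatively, the pattern itself can be compiled from bytes (rb"<title>(.+?)</title>") and applied to the raw response.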

Web Scraper for dynamic forms in python

python,web-scraping,web-crawler,mechanize
I am trying to fill the form of this website http://www.marutisuzuki.com/Maruti-Price.aspx. It consists of three drop down lists. One is Model of the car, Second is the state and third is city. The first two are static and the third, city is generated dynamically depending upon the value of state,...

Make Scrapy follow links and collect data

python,web-scraping,web-crawler,scrapy
I am trying to write program in Scrapy to open links and collect data from this tag: <p class="attrgroup"></p>. I've managed to make Scrapy collect all the links from given URL but not to follow them. Any help is very appreciated....

PageRank toy example fails to converge

python,web-crawler,graph-algorithm,pagerank
I'm coding a toy PageRank, including a crawler as well. It looks a bit odd, as my code fails to converge the PR values. I can also note that the delta between each iteration is 0, part of the output would be: url: http://en.m.wikipedia.org/wiki/Israel_State_Cup links_to_node: set(['http://en.m.wikipedia.org/wiki/Association_football', 'http://en.m.wikipedia.org/wiki/Wikipedia:General_disclaimer']) links_from_node: set(['http://en.m.wikipedia.org/wiki/Israel_State_Cup']) PR_score:...
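For reference, a toy implementation of the damped update PR(u) = (1-d)/N + d * sum over v linking to u of PR(v)/outdegree(v) might look like the sketch below; this is a generic illustration, not the asker's code, and a delta of exactly 0 on every pass usually means the scores are never reassigned between iterations.

```python
def pagerank(links_from, d=0.85, iterations=50):
    """links_from[u] is the set of pages that u links to."""
    nodes = set(links_from) | {v for targets in links_from.values() for v in targets}
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_pr = {}
        for u in nodes:
            incoming = (pr[v] / len(links_from[v])
                        for v in nodes if u in links_from.get(v, ()))
            new_pr[u] = (1 - d) / n + d * sum(incoming)
        pr = new_pr  # reassigning here is what makes the per-iteration delta non-zero
    return pr

print(pagerank({"a": {"b"}, "b": {"a", "c"}, "c": {"a"}}))
```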

Get a substring from a list item in python after a word

python,regex,beautifulsoup,web-crawler
I am using BeautifulSoup to get the title of a book from a Goodreads page. Sample HTML - <td class="field title"><a href="/book/show/12996.Othello" title="Othello"> Othello </a></td> I want to get the text between the anchor tags. Using the code below, I can get all the children of the td with class="field title" in...

Scrape result export problem

python,web-crawler,scrapy
I have written a simple spider to search details on a website. When I run it on the console I'm getting the output, but if I put it into a file using -o filename.json it is just giving me a [ in the file. What do I do? My spider...

Why the difference of set-cookie after curl call in php

php,curl,cookies,header,web-crawler
I have two pages http://site.aspx?page=AddressData2&AddressID=298587,466579,66052 http://site.aspx?page=AddressData2&ShowPanel=EID where the second link is meant to maintain the same cookie/session info that is stored after the first link was accessed. I stored the cookie from the first one by: $cookies =array( $cookies[0]=>$cookies[1], "__utma"=>"250300755.603693956.1425821004.1425827777.1425854702.4", "__utmb"=>"250300755", "__utmc"=>"250300755",...

Python: Transform a unicode variable into a string variable

python,unicode,casting,web-crawler,unicode-string
I used a web crawler to get some data. I stored the data in a variable price. The type of price is: <class 'bs4.element.NavigableString'> The type of each element of price is: <type 'unicode'> Basically the price contains some white space and line feeds followed by: $520. I want to...
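A NavigableString already behaves like a unicode string in Python 2, so converting and stripping it is usually enough. A minimal sketch with a stand-in for the crawled markup:

```python
from bs4 import BeautifulSoup

# stand-in markup; price is a bs4.element.NavigableString, as in the question
price = BeautifulSoup("<span>\n   $520\n</span>", "html.parser").span.string

price_text = unicode(price).strip()        # Python 2: plain unicode, whitespace removed -> u"$520"
price_bytes = price_text.encode("utf-8")   # a byte string (str), if one is really required
amount = int(price_text.lstrip("$"))       # 520 as an integer
```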

Scrapy middleware setup

python,web-scraping,web-crawler,scrapy
I am trying to access a public proxy using scrapy to get some data. I get the following error when I try to run the code: ImportError: Error loading object 'craiglist.middlewares.ProxyMiddleware': No module named middlewares I've created a middlewares.py file with the following code: import base64 # Start your middleware class class ProxyMiddleware(object):...
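That ImportError means Scrapy cannot import craiglist.middlewares, so the file has to live inside the project package (next to settings.py) and be registered in the settings. A hedged sketch of the layout and setting; the proxy address and credentials are placeholders.

```python
# craiglist/middlewares.py -- must sit inside the "craiglist" package, beside settings.py
import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # route every request through a (hypothetical) authenticated proxy
        request.meta["proxy"] = "http://proxy.example.com:8080"
        credentials = base64.b64encode(b"user:password").decode("ascii")
        request.headers["Proxy-Authorization"] = "Basic " + credentials

# craiglist/settings.py
DOWNLOADER_MIDDLEWARES = {
    "craiglist.middlewares.ProxyMiddleware": 410,
}
```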

How to get file access information on linux (debian)

linux,logging,web-crawler,monitoring,server
Recently I have been having some issues with the robots.txt file on my webserver, according to Google Webmaster Tools. More precisely, I get a "Crawl postponed because robots.txt was inaccessible." message. This is weird, because if you try to access it at http://www.newsflow24.com/robots.txt it looks just fine; even the Google crawl...

Brute-force web crawler: how to use LinkExtractor for increased automation in Scrapy

python,xpath,hyperlink,web-crawler,scrapy
I'm using a Scrapy web crawler to extract a bunch of data. As I describe here, I've figured out a brute-force way to get the information I want, but it's really pretty crude: I just enumerate all the pages I want to scrape, which is a few hundred. I...

Why is scrapy not giving all the results, and why is the rules part also not working?

python,xpath,web-scraping,web-crawler,scrapy
This script is only providing me with the first result, i.e. .extract()[0]; if I change 0 to 1, then the next item. Why is it not iterating over the whole XPath again? The rule part is also not working. I know the problem is in the response.xpath. How do I deal with...

Workload balancing between akka actors

multithreading,scala,web-crawler,akka,actor
I have 2 akka actors used for crawling links, i.e. find all links in page X, then find all links in all pages linked from X, etc... I want them to progress more or less at the same pace, but more often than not one of them becomes starved and...

How to prevent search engines from indexing a span of text?

html,web-crawler,robots.txt,googlebot,noindex
From the information I have been able to find so far, <noindex> is supposed to achieve this, making a single section of a page hidden from search engine spiders. But then it also seems this is not obeyed by many browsers - so if that is the case, what markup...

New to Python, what am I doing wrong and not seeing tag (links) returned with BS4

python,beautifulsoup,web-crawler,bs4
I'm new to python and learning it. Basically I am trying to pull all the links from my e-commerce store products that is stored in the html below. I'm getting no results returned though and I can't seem to figure out why not. <h3 class="two-lines-name"> <a title="APPLE IPOD IPOD A1199...

python3 - can't pass through authorization

authentication,python-3.x,web-crawler,authorization
I need to build a web crawler for internal usage, and I need to log into the administration area. I'm trying to use the requests lib; I've tried these ways: import urllib.parse import requests base_url = "https://target.url" data = ({'login': 'login', 'pass': 'password'}) params = urllib.parse.urlencode(data) r = requests.post(base_url, data=params) print(r.text) and import requests base_url...
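A common pattern is to post the credentials as a plain dict (requests does the form-encoding itself) to the actual login endpoint and reuse the same Session so the auth cookie persists. A sketch; the login URL and field names are assumptions and must match the site's login form.

```python
import requests

base_url = "https://target.url"
login_url = base_url + "/login"                      # assumed login endpoint
payload = {"login": "login", "pass": "password"}     # field names must match the <form> inputs

with requests.Session() as session:
    r = session.post(login_url, data=payload)        # no manual urlencode needed
    r.raise_for_status()
    admin = session.get(base_url + "/admin")         # same session carries the auth cookie
    print(admin.status_code, admin.url)
```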

Distinguishing between HTML and non-HTML pages in Scrapy

python,html,web-crawler,scrapy,scrapy-spider
I am building a Spider in Scrapy that follows all the links it can find, and sends the url to a pipeline. At the moment, this is my code: from scrapy import Spider from scrapy.http import Request from scrapy.http import TextResponse from scrapy.selector import Selector from scrapyTest.items import TestItem import...

Web Crawler - TooManyRedirects: Exceeded 30 redirects. (python)

python,web-crawler
I've tried to follow one of the YouTube tutorials; however, I've run into an issue. Is anyone able to help? I'm new to Python. I understand that there are one or two similar questions; however, I read them and still don't understand. Can someone help me out? Thanks. import requests from bs4 import BeautifulSoup...

Scrapy follow link and collect email

python,web-scraping,web-crawler,scrapy
I need help with saving emails with Scrapy. The row in the .csv file where emails are supposed to be collected is blank. Any help is very much appreciated. Here is the code: # -*- coding: utf-8 -*- import scrapy # item class included here class DmozItem(scrapy.Item): # define the fields for...

Python: urllib2 get nothing which does exist

python,web-scraping,web-crawler,urllib2
I'm trying to crawl my college website, and I set the cookie and add headers, then: homepage=opener.open("website") content = homepage.read() print content I can get the source code sometimes, but sometimes just nothing. I can't figure out what happened. Is my code wrong, or is it the website that matters? Does one geturl() can...

limit web scraping extractions to once per xpath item, returning too many copies

python,xpath,web-crawler,scrapy
I'm using the following Scrapy-based web crawling script to extract some elements of this page; however, it's returning the same information over and over, which is complicating the post-processing I have to do. Is there a good way to limit these extractions to once per XPath item?...

How to crawl classified websites [closed]

web-crawler,scrapy,scrapy-spider
I am trying to write a crawler with Scrapy to crawl a classified-type (target) site and fetch information from the links on the target site. The tutorial on Scrapy only helps me get the links from the target URL but not the second layer of data gathering that I seek....

how to download image in Goutte

php,web-crawler,guzzle,goutte
I want to download an image on this page. The image source is http://i2.pixiv.net/c/600x600/img-master/img/2015/01/19/12/17/13/48258889_p0_master1200.jpg. I try to download it using this: $client = new Goutte\Client (); $client->getClient->get($img_url, array('save_to' => $img_url_save_name)); But I failed; then I realized that if I directly access http://i2.pixiv.net/c/600x600/img-master/img/2015/01/19/12/17/13/48258889_p0_master1200.jpg, I am denied by the CDN nginx server. I have to access...

Selenium pdf automatic download not working

python,selenium,selenium-webdriver,web-scraping,web-crawler
I am new to selenium and I am writing a scraper to download pdf files automatically from a given site. Below is my code: from selenium import webdriver fp = webdriver.FirefoxProfile() fp.set_preference("browser.download.folderList",2); fp.set_preference("browser.download.manager.showWhenStarting",False) fp.set_preference("browser.download.dir", "/home/jill/Downloads/Dinamalar") fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf") browser = webdriver.Firefox(firefox_profile=fp)...

Scrapy python error - Missing scheme in request URL

python,web-crawler,scrapy,scrapy-spider
I'm trying to pull a file from a password protected FTP server. This is the code I'm using: import scrapy from scrapy.contrib.spiders import XMLFeedSpider from scrapy.http import Request from crawler.items import CrawlerItem class SiteSpider(XMLFeedSpider): name = 'site' allowed_domains = ['ftp.site.co.uk'] itertag = 'item' def start_requests(self): yield Request('ftp.site.co.uk/feed.xml', meta={'ftp_user': 'test', 'ftp_password':...

How to use MessageQueue in Crawler?

architecture,web-crawler,message-queue
It seems that a MessageQueue should be a good architectural solution for building a web crawler, but I still can't understand how to do it. Let's consider the first case with a shared database: it is pretty clear how to do it, as the algorithm would be the classical graph traversal. There are multiple...

ValueError:(“Invalid XPath: %s” % query) XPath Checker generating erroneous code

python,html,xpath,web-crawler,scrapy
If I use id('div_a1')/x:div[3] in an attempt to extract the single character 匞 from the subsection ◎ 基本解释 of this website, I get the error: ValueError:("Invalid XPath: %s" % query). However, if I just cut it down to id('div_a1') I get no error, though I extract far too much...

how to check whether a program using requests module is dead or not

python,web-crawler,downloading
I am trying to use Python to download a batch of files, and I use the requests module with stream turned on; in other words, I retrieve each file in 200K blocks. However, sometimes the download may stop as it just gets stuck (no response), and there is no error. I guess...
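requests never times out on its own, so a stalled socket read can hang forever. A sketch that bounds the connect and each chunk read with a timeout and treats the exception as "stuck"; the function and its arguments are illustrative, not the asker's code.

```python
import requests

def download(url, path, chunk_size=200 * 1024):
    try:
        # the timeout covers connecting and each socket read, not the whole file
        r = requests.get(url, stream=True, timeout=30)
        r.raise_for_status()
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                if chunk:
                    f.write(chunk)
        return True
    except requests.exceptions.Timeout:
        print("download stalled, retry later: %s" % url)
        return False
```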

Scrapy collect data from first element and post's title

python,web-scraping,web-crawler,scrapy,scrapy-spider
I need Scrapy to collect data from this tag and retrieve all three parts in one piece. The output would be something like: Tonka double shock boys bike - $10 (Denver). <span class="postingtitletext">Tonka double shock boys bike - <span class="price">$10</span><small> (Denver)</small></span> Second is to collect data from first span tag....

How can I get the value of a Monad without System.IO.Unsafe? [duplicate]

haskell,web-crawler,monads
This question already has an answer here: How to get normal value from IO action in Haskell 2 answers I just started learning Haskell and got my first project working today. Its a small program that uses Network.HTTP.Conduit and Graphics.Rendering.Chart (haskell-chart) to plot the amount of google search results...

SgmlLinkExtractor not displaying results or following link

python,web-crawler,scrapy,scrapy-spider,sgml
I am having problems fully understanding how SGML Link Extractor works. When making a crawler with Scrapy, I can successfully extract data from links using specific URLS. The problem is using Rules to follow a next page link in a particular URL. I think the problem lies in the allow()...

Scrapy CrawlSpider not following links

python,web-scraping,web-crawler,scrapy,scrapy-spider
I am trying to crawl some attributes from all (#123) detail pages given on this category page - http://stinkybklyn.com/shop/cheese/ - but Scrapy is not able to follow the link pattern I set. I checked the Scrapy documentation and some tutorials as well, but no luck! Below is the code: import scrapy from scrapy.contrib.linkextractors...

Get Web Bot To Properly Crawl All Pages Of A Site

python,web-scraping,web-crawler,beautifulsoup
I am trying to crawl through all the pages of a website and pull out all instances of a certain tag/class. It seems to be pulling information from the same page over and over again, but I'm not sure why, because there's a bell-curve-ish change in len(urls). #The stack of...

Scrapy delay request

python,web-crawler,scrapy
Every time I run my code my IP gets banned. I need help delaying each request by 10 seconds. I've tried placing DOWNLOAD_DELAY in the code, but it gives no results. Any help is appreciated. # item class included here class DmozItem(scrapy.Item): # define the fields for your item...
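DOWNLOAD_DELAY is a project setting rather than spider code, so it only takes effect from settings.py (or the -s command-line switch). A sketch of the relevant settings:

```python
# settings.py of the Scrapy project, or on the command line:
#   scrapy crawl dmoz -s DOWNLOAD_DELAY=10
DOWNLOAD_DELAY = 10              # seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True  # jitter the delay between 0.5x and 1.5x
AUTOTHROTTLE_ENABLED = True      # optional: back off further when the server slows down
```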

Authorization issue with cron crawler inserting data into Google spreadsheet using Google API in Ruby

ruby,cron,google-api,web-crawler,google-api-client
My project is to crawl the certain web data and put them into my Google spreadsheet every morning 9:00. And it has to get the authorization to read & write something. That's why the code below is located at the top. # Google API CLIENT_ID = blah blah CLIENT_SECRET =...

PHP Curl for encrypted pages

php,curl,encryption,web-crawler
I do a PHP cURL request to this website: http://www.hoovers.com/company-information/company-search.html. But it returned a 404. It looks like something is encrypted, or something like that. Can you give me a clue about this problem? Thanks. // Get cURL resource $curl = curl_init(); // Set some options - we are passing in a useragent too here curl_setopt_array($curl, array(...

How to resume a previous incomplete job in apache nutch crawler

apache,web-crawler,nutch,resume
I am using Nutch 2.3. There is a chance that during any stage of Nutch (fetch, parse, index, etc.), a network problem occurs or a power shutdown happens. How can I resume the previous incomplete job? Please give some example as explanation...

How to access the web page contents

java,html,web,web-crawler,jsoup
I am storing the text of a webpage in a string, but some contents of the web page are not stored in the string. I don't know why the contents of div-like elements are not stored. Even the links inside the div are not accessible using a web...

fullPage.js: Make all slides and sections visible in search engine results

jquery,seo,web-crawler,single-page-application,fullpage.js
I'm using fullpage.js jQuery plugin for a Single page application. I'm using mostly default settings and the plugin works like a charm. When I got to the SEO though I couldn't properly make Google crawl my website on a "per slide" basis. All my slides are loaded at the page...

Redis - list of visited sites from crawler

python,url,redis,queue,web-crawler
I'm currently working on a crawler coded in Python with a combination of Gevent/requests/lxml to crawl a defined set of pages. I use redis as a db to hold lists such as the pending queue, fetching, and sites that have been crawled. For each url, I have a key url_ and I'm...
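Redis sets give O(1) membership tests, so one hedged sketch (the key names here are made up) is to mark a URL as seen at enqueue time and only push unseen ones onto the pending list:

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

def enqueue(url):
    # SADD returns 1 only the first time a member is added, 0 for repeats
    if r.sadd("crawler:seen", url):
        r.rpush("crawler:pending", url)

def next_url():
    return r.lpop("crawler:pending")  # None once the queue is empty

enqueue("http://example.com/")
print(next_url())
```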

How to use scrapy to crawl a website which hides the url as href=“javascript:;” in the next button

javascript,python,pagination,web-crawler,scrapy
I am learning python and scrapy lately. I googled and searched around for a few days, but I don't seem to find any instruction on how to crawl multiple pages on a website with hidden urls - <a href="javascript:;". Basically each page contains 20 listings, each time you click on...

efficient XPath syntax exclusively extract single component

html,xpath,web-scraping,web-crawler,scrapy
Using the Firefox-Aurora I determined the following HTML snippet from this website: http://www.zdic.net/z/19/js/5DCD.htm. I want to extract only the component 丨フ丨ノ一丨ノ丶フノ一ノ丨フ一一ノフフ丶. It's located near the bottom of the following code block: <tr> <td class="z_i_t4_uno" align="center"> <a href="http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=5DCD" target="_blank"> <img src="/images/unicode2.gif" border="0" align="absmiddle"> </a> U+5DCD </td> <td...

Scrapy, detect when new start_url is being

python,scrapy,web-crawler
I'm trying to estimate the progress of a spider by counting how many start_url it has processed but I'm not sure how to detect this. I know it's nowhere near a real measure of current progress as the spider has no clue how big the remaining sites to be crawled...

Extracting data from webpage using lxml XPath in Python

python,xpath,web-crawler,lxml,python-requests
I am having some unknown trouble when using xpath to retrieve text from an HTML page from lxml library. The page url is www.mangapanda.com/one-piece/1/1 I want to extract the selected chapter name text from the drop down select tag. Now I just want the first option so the XPath to...

Cannot download image with relative URL Python Scrapy

python,web-crawler,scrapy,scrapy-spider
I'm using Scrapy to download images from http://www.vesselfinder.com/vessels. However, I can only get the relative url of images, like this: http://www.vesselfinder.com/vessels/ship-photo/0-227349190-7c01e2b3a7a5078ea94fff9a0f862f8a/0 All of the images are named 0.jpg, but if I try to use that absolute url, I cannot get access to the image. My code: items.py import scrapy class VesselItem(scrapy.Item):...
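When the images pipeline refuses relative src values, the usual fix is to join each one against the page URL before putting it in image_urls. A sketch; the spider name and the XPath are placeholders, not the asker's code.

```python
import urlparse  # Python 2 stdlib; use urllib.parse on Python 3
import scrapy

class VesselSpider(scrapy.Spider):
    name = "vessel"
    start_urls = ["http://www.vesselfinder.com/vessels"]

    def parse(self, response):
        srcs = response.xpath("//img/@src").extract()        # placeholder selector
        # the images pipeline needs absolute URLs
        yield {"image_urls": [urlparse.urljoin(response.url, s) for s in srcs]}
```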

How to make a parser for a web crawler maintainable

ruby,web-crawler,nokogiri
I wrote a Ruby web-crawler that retrieves data from a third-party website. I am using Nokogiri to extract information based on a specific CSS div and specific fields (accessing children and elements of the nodes I extract). From time to time, the structure of the third-party website changes which breaks...

The scrapy LinkExtractor(allow=(url)) gets the wrong crawled page, the regex doesn't work

python,web-crawler,scrapy
I want to crawl the page http://www.douban.com/tag/%E7%88%B1%E6%83%85/movie . And some part of my spider code is : class MovieSpider(CrawlSpider): name = "doubanmovie" allowed_domains = ["douban.com"] start_urls = ["http://www.douban.com/tag/%E7%88%B1%E6%83%85/movie"] rules = ( Rule(LinkExtractor(allow=(r'http://www.douban.com/tag/%E7%88%B1%E6%83%85/movie\?start=\d{2}'))), Rule(LinkExtractor(allow=(r"http://movie.douban.com/subject/\d+")), callback = "parse_item") ) def start_requests(self): yield...

How do I cache pages that are created on the fly by a Java servlet so they are reusable and indexable

java,tomcat,amazon-web-services,web-crawler
I'm using Amazon Web Services with Tomcat to deploy a Java application. The application consists of a Lucene index of artist data and a website that allows a user to search for a musical artist (e.g. Madonna, Beatles); it will then return information about that artist generated from the...

Redirecting Crawler to internal service

facebook,nginx,service,web-crawler
I want to setup nginx to have certain crawlers get data from an internal service running on port 9998. So for instance, when a browser requests www.mywebsite.com/resource/1 it will look at the root folder but when the same resource is requested by a crawler (for instance the FB crawler) it...

Web crawler class not working

python,web,web-crawler
Recently, I began working on constructing a simple web crawler. My initial code that just iterated twice worked perfectly, but when I attempted to turn it into a class with error exception handling, it no longer compiled. import re, urllib class WebCrawler: """A Simple Web Crawler That Is Readily Extensible"""...

How could I get a part of a match string by RegEx in Python?

python,regex,web-crawler
I'm now making a web spider in Python, and part of the program requires me to get strings like data-id="48859672" from a website. I've successfully got these strings using: pattern=re.compile(r'\bdata-id="\d+"') m=pattern.search(html,start) But I'm now wondering how to get only the number part of the strings, instead of the whole string?...
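A capture group around the digits keeps the match anchored on data-id but returns only the number; a minimal sketch:

```python
import re

html = '<div data-id="48859672">listing</div>'
pattern = re.compile(r'\bdata-id="(\d+)"')   # parentheses capture just the digits

m = pattern.search(html)
if m:
    print(m.group(0))   # data-id="48859672"  (whole match)
    print(m.group(1))   # 48859672            (captured number only)

print(pattern.findall(html))  # findall returns only the captured group: ['48859672']
```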

How to iterate over many websites and parse text using web crawler

python,web-crawler,sentiment-analysis
I am trying to parse text and run a sentiment analysis over the text from multiple websites. I have successfully been able to strip just one website at a time and generate a sentiment score using the TextBlob library, but I am trying to replicate this over many websites, any...
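A sketch of wrapping the single-site version in a loop over a URL list; the fetching and text extraction here use requests and BeautifulSoup, which is an assumption since the excerpt only names TextBlob, and the URLs are placeholders.

```python
import requests
from bs4 import BeautifulSoup
from textblob import TextBlob

urls = [
    "http://example.com/article-1",   # placeholder URLs
    "http://example.com/article-2",
]

scores = {}
for url in urls:
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException as exc:
        print("skipping %s: %s" % (url, exc))
        continue
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    scores[url] = TextBlob(text).sentiment.polarity   # -1.0 (negative) .. 1.0 (positive)

print(scores)
```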

Crawling & parsing results of querying google-like search engine

java,parsing,web-crawler,jsoup
I have to write a parser in Java (my first HTML parser, by the way). For now I'm using the jsoup library, and I think it is a very good solution for my problem. The main goal is to get some information from Google Scholar (h-index, number of publications, years of scientific career). I...

Making AngularJS and Parse Web App Crawlable with Prerender

angularjs,parse.com,web-crawler,google-crawlers,prerender
I have been trying to get my AngularJS and Parse web app crawlable for Google and Facebook share and even with prerender-parse I have not been able to get it working. I have tried using tips from this Parse Developers thread for engaging HTML5 Mode. Nothing will work using the...

How to crawl links on all pages of a web site with Scrapy

website,web-crawler,scrapy,extract
I'm learning about scrapy and I'm trying to extract all links that contain "http://lattes.cnpq.br/" followed by a sequence of numbers, for example: http://lattes.cnpq.br/0281123427918302. But I don't know which page on the web site contains this information. For example, on this web site: http://www.ppgcc.ufv.br/ the links that I want are on this page: http://www.ppgcc.ufv.br/?page_id=697 What...

delete spiders from scrapinghub

delete,web-crawler,scrapy,scrapy-spider,scrapinghub
I am a new user of scrapinghub. I already searched on Google and read the scrapinghub docs, but I could not find any information about removing spiders from a project. Is it possible, and how? I do not want to replace a spider, I want to delete/remove it from scrapinghub...

PhantomJS console charset

node.js,character-encoding,web-crawler,phantomjs
I'm trying to run below code but my console prints weird charset. var page = require('webpage').create(); var url = "http://www.bdtong.co.kr/index.php?c_category=C02" //var url = "http://www.baemin.com/"; /* var option = { encoding : "euc-kr" } */ page.onConsoleMessage = function(msg, line, source) { //phantom.outputEncoding = "utf8"; console.log('console> '+msg); }; page.open(url, function() { page.includeJs("http://ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js",...

Single session multiple post/get in python requests

python,web-crawler,python-requests
I am trying to write a crawler to automatically download some files using python requests module. However, I met a problem. I initialized a new requests session, then I used post method to login into the website, after that as long as I try to use post/get method (a simplified...

How to retrieve redirect url given in window.location

python,beautifulsoup,web-crawler,python-requests,url-redirection
I am trying to make a crawler using Python. I am making use of the beautifulsoup and requests libraries and need the set of URLs for a given website. However, in a certain part there is a redirect, and when I print response.text, i.e. the page content, I get the following...

Scraping Multi level data using Scrapy, optimum way

python,selenium,data-structures,web-crawler,scrapy
I have been wondering what would be the best way to scrape multiple levels of data using Scrapy. I will describe the situation in four stages: the current architecture that I am following to scrape this data, the basic code structure, the difficulties, and why I think there has to be...

Python: Can I use Chrome's “Inspect Element” XPath create tool as a Scrapy spider XPath?

python,google-chrome,xpath,web-crawler,scrapy
My spider class is as follows: class MySpider(BaseSpider): name = "dropzone" allowed_domains = ["dropzone.com"] start_urls = ["http://www.dropzone.com/cgi-bin/forum/gforum.cgi?post=4724043"] def parse(self, response): hxs = HtmlXPathSelector(response) reply = response.xpath('//*[@id="wrapper"]/div/div/table/tbody/tr/td/div/div/center/table/tbody/tr/td/table/tbody/tr/td/font/table/tbody/tr/td/table/tbody/tr/td/font/b') dates =...

Cannot Write Web Crawler in Python

python,web-crawler,beautifulsoup,urllib2
I'm having an issue writing a basic web crawler. I'd like to write about 500 pages of raw html to files. The problem is my search is either too broad or too narrow. It either goes too deep, and never gets past the first loop, or doesn't go deep enough,...

focused crawler by modifying nutch

web-crawler,nutch
I want to create a focused crawler using nutch. Is there any way to modify nutch so as to make crawling faster? Can we use the metadata in nutch to train a classifier that would reduce the number of urls nutch has to crawl for a given topic??

Unable to click in CasperJS

javascript,web-crawler,phantomjs,casperjs
I want to crawl the HTML data, and I tried a headless browser in CasperJS, but I'm not able to click. The following is the code I tried in CasperJS. var casper = require('casper').create(); var mouse = require('mouse').create(casper); casper.start('http://sts.kma.go.kr/jsp/home/contents/climateData/smart/smartStatisticsSearch.do', function() { this.echo('START'); }); casper.then(function() { this.capture("1.png"); this.mouse.click('li[class="item1"]'); casper.wait(5000, function() { this.capture("2.png"); }); });...

PHP web crawler, check URL for path

php,url,path,web-crawler,bots
I'm writing a simple web crawler to grab some links from a site. I need to check the returned links to make sure I selectively collect what I want. For example, here's a few links returned from http://www.polygon.com/ [0] http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide#comments [1] http://www.polygon.com/videos [2] http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide [3] http://www.polygon.com/features so link 0 and...

Is it possible to list all the functions called after clicking the page with the use of Chrome Developer Tools

javascript,google-chrome,debugging,web-crawler
While writing a spider, I always have to find out which function sends the HTTP request on the JavaScript page after I click it. As there may be so many functions involved, I have to jump from one to another, guessing which is the key one. While we run an incorrect program written in...

T_STRING error in my php code [duplicate]

php,web-crawler
This question already has an answer here: PHP Parse/Syntax Errors; and How to solve them? 10 answers I have this PHP script that is supposed to crawl the End Clothing website for product IDs. When I run it, it gives me this error: Parse error: syntax error, unexpected 'i' (T_STRING), expecting...

Why getting unexpected with loop of URL results when solving equation

java,web-crawler,equation
I am trying to write a web crawler algorithm. To do that I use the equations below, and I wrote this code to solve them: public class URLWeight { public static List<LinkNode> weight(LinkNode sourceLink, List<LinkNode> links){ List<LinkNode> interLinks = new LinkedList<>(); List<LinkNode> intraLinks = new LinkedList<>(); for (LinkNode link...

Scrapy not entering parse method

python,selenium,web-scraping,web-crawler,scrapy
I don't understand why this code is not entering the parse method. It is pretty similar to the basic spider examples from the doc: http://doc.scrapy.org/en/latest/topics/spiders.html And I'm pretty sure this worked earlier in the day... Not sure if I modified something or not.. from selenium.webdriver.support.wait import WebDriverWait from selenium.webdriver.common.by import...

How to retrieve all the images, js, css urls

python,http,web,web-crawler,scrapy
I was going through all the scrapy examples and tutorials I can find and I couldn't find an example where I can go and get all the urls of the images, css, and js files being sent from the server. Is there a way to do that with scrapy? If...

Why python print is delayed?

python,python-3.x,web-crawler,python-requests
I am trying to download a file using requests, and print a dot every time 100k of the file is retrieved, but all the dots are printed out at the end. See the code. with open(file_name,'wb') as file: print("begin downloading, please wait...") respond_file = requests.get(file_url,stream=True) size = len(respond_file.content)//1000000 #the next line will not...
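Two things delay the dots in the quoted code: respond_file.content reads the whole body before the loop starts, and stdout is buffered. A sketch that streams 100K chunks and flushes after every dot; the URL and filename are placeholders.

```python
import sys
import requests

file_url = "http://example.com/big.bin"   # placeholder
file_name = "big.bin"

print("begin downloading, please wait...")
response = requests.get(file_url, stream=True)          # stream=True: body is not read up front
with open(file_name, "wb") as f:
    for chunk in response.iter_content(chunk_size=100 * 1024):
        f.write(chunk)
        print(".", end="")
        sys.stdout.flush()                               # push the dot out immediately
print("\ndone")
```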

How to get Google to re-index a page after removing noindex metatag?

web-crawler,sitemap,meta-tags,google-webmaster-tools,noindex
By accident, I had put <meta name="robots" content="noindex"> into lots of pages on my domain. I have now removed this meta-tag, but how can I get these pages to be re-indexed by Google? Any tip? I have tried re-submitting my sitemap.xml in Webmaster Tools, but I'm not sure if it...

Selenium interpret javascript on mac?

selenium,web-crawler,mechanize
I'm trying to make a web crawler that clicks on ads (yes, I know); it's very sophisticated, but I realise that Google Ads aren't shown when JavaScript is disabled. Today I use Mechanize, and it doesn't "accept" JavaScript. I heard Selenium uses another system to crawl the net. The only...

Python: Scrapy start_urls list able to handle .format()?

python,function,while-loop,web-crawler,scrapy
I want to parse a list of stocks so I am trying to format the end of my start_urls list so I can just add the symbol instead of the entire url. Spider class with start_urls inside stock_list method: class MySpider(BaseSpider): symbols = ["SCMP"] name = "dozen" allowed_domains = ["yahoo.com"]...
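start_urls is an ordinary list, so it can be built from the symbol list with a comprehension. A sketch in which the Yahoo quote URL pattern is only an assumption and the spider is written against scrapy.Spider rather than the question's BaseSpider:

```python
import scrapy

SYMBOLS = ["SCMP", "AAPL", "GOOG"]

class DozenSpider(scrapy.Spider):
    name = "dozen"
    allowed_domains = ["yahoo.com"]
    # one URL per ticker instead of hard-coding every full URL
    start_urls = ["http://finance.yahoo.com/q?s={}".format(s) for s in SYMBOLS]

    def parse(self, response):
        self.log("parsing %s" % response.url)
```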

How to use file_get_contents to find 'a' and click 'a' to get inner contents

php,ajax,web-crawler
I am making a crawler to fetch data from pakwheels.com, I was able to fetch data from this website from this code <?php for ($y = 1; $y <= 5; $y++) { $pakwheels = file_get_contents('http://www.pakwheels.com/used-cars/search/-/?page=' . $y . ''); $file2 = 'pakwheels.txt'; file_put_contents($file2 , $pakwheels, FILE_APPEND); } ?> But requirement...

Scrapy returning a null output when extracting an element from a table using xpath

python,xpath,web-scraping,web-crawler,scrapy
I have been trying to scrape this website that has details of oil wells in Colorado: https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=12307555&type=WELL Scrapy scrapes the website and returns the URL when I scrape it, but when I need to extract an element inside a table using its XPath (the county of the oil well), all I...

Heritrix not finding CSS files in conditional comment blocks

java,web-crawler,heritrix
The Problem/evidence Heritrix is not detecting the presence of files in conditional comments that open & close in one string, such as this: <!--[if (gt IE 8)|!(IE)]><!--> <link rel="stylesheet" href="/css/mod.css" /> <!--<![endif]--> However standard conditional blocks like this work fine: <!--[if lte IE 9]> <script src="/js/ltei9.js"></script> <![endif]--> I've identified the...

WxPython using Listbox and other UserInput with a Button

python,listbox,wxpython,web-crawler
I am trying to create a web crawler based on specific user input. For example, the User Input I am trying to receive is from a ListBox and a text field. Once I have that information, I would like the user to click a button to start the search with...

SgmlLinkExtractor in scrapy

web-crawler,scrapy,rules,extractor
I need some enlightenment about SgmlLinkExtractor in scrapy. For the link example.com/YYYY/MM/DD/title I would write: Rule(SgmlLinkExtractor(allow=[r'\d{4}/\d{2}/\d{2}/\w+']), callback='parse_example')] For the link example.com/news/economic/title should I write r'\news\category\w+' or r'\news\w+/\w+'? (category changes, but the url always contains news) For the link example.com/article/title should I write r'\article\w+'? (the url always contains article)...

scrapy crawling multiple pages [3 levels] but scraped data not linking properly

python,arrays,web-crawler,scrapy
I'm trying to scrape 3 levels of data: TV name -> season -> episodes. The issue I'm having is that I'm getting all the episodes, but the first two levels are not linking. For example, season 1 has 5 episodes and season 2 has 10 episodes; the output I'm getting...
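The usual way to keep the three levels linked is to pass the partially built item down through request.meta so each episode callback still knows its show and season. A hedged sketch with placeholder URLs and selectors, not the asker's code:

```python
import scrapy

class TvSpider(scrapy.Spider):                 # hypothetical spider
    name = "tv"
    start_urls = ["http://example.com/shows"]  # placeholder

    def parse(self, response):
        for show in response.xpath('//a[@class="show"]'):
            item = {"show": show.xpath("text()").extract_first()}
            url = response.urljoin(show.xpath("@href").extract_first())
            yield scrapy.Request(url, callback=self.parse_season, meta={"item": item})

    def parse_season(self, response):
        for season in response.xpath('//a[@class="season"]'):
            item = dict(response.meta["item"])          # copy so seasons don't overwrite each other
            item["season"] = season.xpath("text()").extract_first()
            url = response.urljoin(season.xpath("@href").extract_first())
            yield scrapy.Request(url, callback=self.parse_episodes, meta={"item": item})

    def parse_episodes(self, response):
        item = dict(response.meta["item"])
        item["episodes"] = response.xpath('//li[@class="episode"]/text()').extract()
        yield item
```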

Want to keep running my single Ruby crawler that doesn't need HTML or anything

ruby-on-rails,ruby,web-crawler
First of all, I'm a newbie. I just made a single Ruby file which crawls something on a certain website and puts data into my Google spreadsheet. But I want my crawler to do its job every morning at 9:00 AM. What do I need? Maybe a gem and a server?...

Ruby - WebCrawler how to visit the links of the found links?

ruby,url,hyperlink,web-crawler,net-http
I am trying to make a WebCrawler which finds links from a homepage and visits the found links again and again. Now I have written code with a parser which shows me the found links and prints statistics of some tags of this homepage, but I don't get it...