Scrapy crawler not processing XHR Request

python,web-scraping,xmlhttprequest,scrapy,scrape
My spider is only crawling the first 10 pages, so I assume it is not entering the 'load more' button through the Request. I am scraping this website: http://www.t3.com/reviews. My spider code: import scrapy from scrapy.conf import settings from scrapy.http import Request from scrapy.selector import Selector from reviews.items import...

How can I initialize a Field() to contain a nested python dict?

python,web-scraping,scrapy
I have a Field() in my items.py called: scores = Field() I want multiple scrapers to append a value to a nested dict inside scores. For example, one of my scrapers: item['scores']['baseball_score'] = '92' And another scraper would: item['scores']['basket_score'] = '21' So that when I retrieve scores: > item['scores'] {...
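
One likely fix, as a minimal sketch (the field name is taken from the question, the rest is assumed): a scrapy Field stores whatever object you assign to it, so the nested dict has to be created once before any scraper writes into it.

```python
from scrapy.item import Item, Field

class ScoreItem(Item):
    scores = Field()

item = ScoreItem()
item['scores'] = {}                       # the dict must exist before nesting
item['scores']['baseball_score'] = '92'   # one scraper
item['scores']['basket_score'] = '21'     # another scraper
print(item['scores'])
```

If several scrapers may touch the item independently, item.setdefault('scores', {}) avoids clobbering values that were already written.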

Error writing data to CSV due to ascii error in Python

python,csv,web-scraping,non-ascii-chars
import requests from bs4 import BeautifulSoup import csv from urlparse import urljoin import urllib2 base_url = 'http://www.baseball-reference.com' data = requests.get("http://www.baseball-reference.com/teams/BAL/2014-schedule-scores.shtml") soup = BeautifulSoup(data.content) outfile = open("./Balpbp.csv", "wb") writer = csv.writer(outfile) url = [] for link in soup.find_all('a'): if not link.has_attr('href'): continue if link.get_text() != 'boxscore': continue url.append(base_url + link['href']) for...
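
A common fix in Python 2, sketched under the assumption that the offending cells are unicode strings: encode each cell to UTF-8 bytes before handing the row to csv.writer.

```python
# -*- coding: utf-8 -*-
import csv

def encode_row(row):
    # csv.writer in Python 2 only handles bytes; unicode cells containing
    # non-ASCII characters must be encoded first
    return [cell.encode('utf-8') if isinstance(cell, unicode) else cell
            for cell in row]

with open('Balpbp.csv', 'wb') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(encode_row([u'José', u'Camden Yards', u'W 2–1']))
```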

Iterate over all links/sub-links with Scrapy run from script

python,windows,python-2.7,web-scraping,scrapy
I want to run a Scrapy spider from my script, but it works only for 1 request. I cannot execute the procedure self.parse_product from scrapy.http.Request(product_url, callback=self.parse_product). I guess it's due to the command crawler.signals.connect(callback, signal=signals.spider_closed). Please advise how to correctly go over all links and sub-links. The whole script is shown below. import...

PHP scrape links from table

php,web-scraping
How to get only one link from the table? <table> <tr class="title"> <td width="40%">a </td> <td width="40%">b</td> <td width="10%">c</td> <td width="10%">d</td> </tr> <tr> <td>abc.com</td> <td>123.123.526.12</td> <td><a class="update" href="fruit/grape"</a></td> <td><a class="delete" href="fruit/grape"></a></td> <td> </td> </tr> <tr> <td>bcd.com</td>...

URL Variable is not being recognized using NSURL

ios,swift,parsing,web-scraping,nsurl
I am attempting to parse a website; however, when I was simply starting to set up my code I got the error "playMusicViewController.swift does not have a member named 'url'", as well as the error "expected declaration" under task.resume. I did include an import statement as well. let url...

scraping url and title from nested anchor tag

python,web-scraping,scrapy
This is my first scraper using scrapy. I am trying to scrape video urls and titles from the https://www.google.co.in/trends/hotvideos#hvsm=0 site. import scrapy from scrapy.item import Item, Field from scrapy.spider import BaseSpider from scrapy.selector import HtmlXPathSelector class CraigslistItem(Item): title = Field() link = Field() class DmozSpider(scrapy.Spider): name = "google" allowed_domains = ["google.co.in"] start_urls...

Grabbing text data from Baseball-reference Python

python,web-scraping,html-parsing
http://www.baseball-reference.com/players/split.cgi?id=aardsda01&year=2015&t=p I would like to get the data of what arm this pitcher pitches with. If it were a table I would be able to grab the data, but I don't know how to get the text. David Aardsma \ARDS-mah\ David Allan Aardsma (twitter: @TheDA53) Position: Pitcher Bats: Right, Throws:...

Scrapy: how can I get the content of pages whose response.status=302?

web-scraping,scrapy,scrape,scrapy-spider
I get the following log when crawling: DEBUG: Crawled (302) <GET http://fuyuanxincun.fang.com/xiangqing/> (referer: http://esf.hz.fang.com/housing/151__1_0_0_0_2_0_0/) DEBUG: Scraped from <302 http://fuyuanxincun.fang.com/xiangqing/> But it actually returns nothing. How can I deal with these responses with status=302? Any help would be much appreciated!...
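
A minimal sketch of one way to receive the 302 itself rather than the followed redirect, using Scrapy's documented dont_redirect and handle_httpstatus_list meta keys. Note that a 302 body is usually empty; the interesting part is the Location header, and the real content normally lives at that target.

```python
import scrapy

class FangSpider(scrapy.Spider):
    name = 'fang'

    def start_requests(self):
        yield scrapy.Request(
            'http://fuyuanxincun.fang.com/xiangqing/',
            meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
            callback=self.parse)

    def parse(self, response):
        # a 302 body is usually empty; follow the Location target for content
        self.log('redirects to: %s' % response.headers.get('Location'))
```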

Get JavaScript function call value using Selenium

python,selenium,selenium-webdriver,web-scraping,scrapy
I am scraping web pages using python-scrapy which works pretty well for static content. I am trying to scrape a url from this page but as it turns out, it is returned through a javascript call. For this I am using selenium but unable to figure out how to do...

Not able to scrape data from dropdown box

javascript,python,html,web-scraping
In the following website, "http://www.msamb.com/apmcpri_rpt.aspx", the output changes every time I click on an element in a dropdown, but the url remains the same. It calls a JavaScript function if the value of the dropdown changes. I tracked the Network tab and checked the request headers and form key-values and...

Download csv file via submit button in R

r,csv,automation,web-scraping
I want to automate the following task in R: Go to the following page with historical data: http://www.ariva.de/XXX/historische_kurse, where the XXX stands for some ticker, like e.g. DBX0BT So the URL in this case would be: http://www.ariva.de/DBX0BT/historische_kurse At the right hand bottom there is a button Download. I want to...

How to parse Selenium driver elements?

python,parsing,selenium,selenium-webdriver,web-scraping
I'm new to Selenium with Python. I'm trying to scrape some data but I can't figure out how to parse outputs from commands like this: driver.find_elements_by_css_selector("div.flightbox") I was trying to google some tutorial but I've found nothing for Python. Could you give me a hint?...

How to identify an element via XPath when IDs keep changing

html,xml,xpath,selenium-webdriver,web-scraping
I am using a website, where the values of the elements are changing dynamically every time the elements load. The id's are dynamic and so is the XPath. I don't seem to have any unique identifier to locate the elements. Please advise on the best way to uniquely identify the...

What would be the right way of doing getallAttributes()?

python,xpath,web-scraping,scrapy
I am trying to read the property (attributes) of given element. I want to extract a Dictionary of all the attributes name-value pairs. What I am currently doing is using regex and listing all the property values. But the problem here is, it only displays the value of the property...

Chrome element inspector Xpath with @href won't show link text

html,xml,url,xpath,web-scraping
I am trying to get url from a website's html and here is the Xpath code I tried for StackOverflow's landing page: $x('//*[@id="question-summary-30429261"]/div[2]/h3/a/@href') I think it should return the text after the equal sign in the following statement: href=xxxx but it doesn't. It just returns null. I tried Googling this...

Ruby - Find Tag by ID

ruby,web-scraping,nokogiri
I'm using mechanize and nokogiri. I'm trying to find this tag. When I inspect the HTML it looks like this. <table class="matchupBox" id="MLB_5_block "> When I print it out in my console it looks like this #<Nokogiri::XML::Element:0x2cc1a1c name="table" attributes=[ #<Nokogiri::XML::Attr:0x2cc1940 name="class" value="matchupBox">, #<Nokogiri::XML::Attr:0x2cc192c name="id" value="MLB_5_block\r\n ">] I am using this...

Python - Selenium and XPATH to extract all rows from a table

python,selenium,xpath,selenium-webdriver,web-scraping
I am using Selenium and XPATH to extract all rows from a table, but can only get the first row. Here is what I am doing: from selenium import webdriver path_to_chromedriver = '/Users/me/Desktop/chromedriver' browser = webdriver.Chrome(executable_path = path_to_chromedriver) url = "http://www.psacard.com/smrpriceguide/SetDetail.aspx?SMRSetID=1055" browser.get(url) browser.implicitly_wait(10) SMRtable = browser.find_element_by_xpath('//*[@class="set-detail-table"]/tbody') for i in SMRtable.find_element_by_xpath('.//tr'):...
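
The likely culprit is the singular find_element_by_xpath, which always returns just the first match. A minimal sketch with the plural form:

```python
from selenium import webdriver

browser = webdriver.Chrome()   # assumes chromedriver is on PATH
browser.get("http://www.psacard.com/smrpriceguide/SetDetail.aspx?SMRSetID=1055")
browser.implicitly_wait(10)

table = browser.find_element_by_xpath('//*[@class="set-detail-table"]/tbody')
for row in table.find_elements_by_xpath('.//tr'):   # note: elements, plural
    print(row.text)
```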

Can SPARQL handle blank results for specific cells?

web-scraping,sparql,scrape,dbpedia
I am writing a SPARQL query and can't figure out how to allow blank results for specific columns. My current request is: select * where { ?game a dbpedia-owl:Game ; dbpprop:name ?name ; dbpedia-owl:publisher ?publisher . } Some Games have an owl for publisher while others do not. The above...

Beautifulsoup: Getting a newline when trying to access the soup.head.next_sibling value with Beautifulsoup4

python,python-2.7,web,web-scraping,beautifulsoup
I am trying an example from the BeautifulSoup docs and found it acting weird. When I try to access the next_sibling value, instead of the "body", a '\n' comes into the picture. html_doc = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were...
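
A minimal sketch of the usual workaround: the sibling of <head> here is the literal newline between the tags, so skip whitespace-only text nodes (or simply use soup.body).

```python
from bs4 import BeautifulSoup, NavigableString

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p></body></html>"""
soup = BeautifulSoup(html_doc)

node = soup.head.next_sibling
while isinstance(node, NavigableString):   # step over the '\n' text node
    node = node.next_sibling
print(node.name)                           # body
```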

Getting table in R from HTMLInternalDocument object

json,xml,r,web-scraping
I have to download several tables from a website; the table id is "tabela". I tried various functions (XML::readHTMLTable, XML::xmlTreeParse), but only the rvest package loads it: require(rvest) url="http://www.pse.pl/index.php?modul=21&id_rap=2&data=2013-01-01" wpkd <- html(url) class(wpkd) [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" "XMLAbstractDocument" str(wpkd) Classes 'HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr> Now I would like to extract...

Scrapy Limit Requests For Testing

python,python-2.7,web-scraping,scrapy,scrapy-spider
I've been searching the scrapy documentation for a way to limit the number of requests my spiders are allowed to make. During development I don't want to sit here and wait for my spiders to finish an entire crawl, even though the crawls are pretty focused they can still take...
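
A minimal sketch using Scrapy's built-in CloseSpider extension settings, which exist for exactly this:

```python
# in settings.py; both belong to the built-in CloseSpider extension
CLOSESPIDER_PAGECOUNT = 50    # stop after ~50 responses have been downloaded
CLOSESPIDER_ITEMCOUNT = 100   # or after ~100 items have been scraped
```

They can also be set per run: scrapy crawl myspider -s CLOSESPIDER_PAGECOUNT=50. The limits are approximate, since requests already in flight still complete.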

Return html code of dynamic page using selenium

python,python-2.7,selenium,selenium-webdriver,web-scraping
I'm trying to crawl this website; the problem is that it's dynamically loaded. Basically, I want what I can see from the browser console, not what I see when I right click > show sources. I've tried some selenium examples but I can't get what I need. The code below uses selenium...
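
A minimal sketch of the usual approach (URL and wait time are placeholders): let the browser execute the JavaScript, then read driver.page_source, which is the rendered DOM rather than the raw source.

```python
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("http://example.com")    # hypothetical URL
time.sleep(5)                       # crude; WebDriverWait is more robust
html = driver.page_source           # the DOM after the scripts have run
driver.quit()
soup = BeautifulSoup(html)
```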

How to get javascript output in python BeautifulSoup or any other module

javascript,python,html,web-scraping,beautifulsoup
In my attempt to make a scraper, I found a website that uses javascript a lot in its code. Is it possible to retrieve the output of the script? e.g. <html> <head> <title>Python</title> </head> <body> <script type="text/javascript" src='test.js'></script> <p> some stuff <br> more stuff <br> code <br> video <br> picture <br>...

Error fetching data from website

objective-c,osx,web-scraping,request
Hi so I wanted to retrieve data from this website: http://www.timeapi.org/utc/now for an app that I was making, but when I make the request with the following code, I always get null: NSURL * timeAPI = [[NSURL alloc]initWithString:@"http://www.timeapi.org/utc/now"]; NSURLRequest * urlRequest = [[NSURLRequest alloc]initWithURL:timeAPI]; __block NSData * responseData; [NSURLConnection sendAsynchronousRequest:urlRequest...

Crawl spider not crawling ~ Rule Issue

python,web-scraping,scrapy,scrapy-spider
I am having an issue with a spider that I am programming. I am trying to recursively scrape the courses off my university's website but I am having great trouble with Rule and LinkExtractor. # -*- coding: utf-8 -*- import scrapy from scrapy.spider import Spider from scrapy.contrib.spiders import CrawlSpider, Rule...

Selenium pdf automatic download not working

python,selenium,selenium-webdriver,web-scraping,web-crawler
I am new to selenium and I am writing a scraper to download pdf files automatically from a given site. Below is my code: from selenium import webdriver fp = webdriver.FirefoxProfile() fp.set_preference("browser.download.folderList",2); fp.set_preference("browser.download.manager.showWhenStarting",False) fp.set_preference("browser.download.dir", "/home/jill/Downloads/Dinamalar") fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf") browser = webdriver.Firefox(firefox_profile=fp)...
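
A minimal sketch of the standard fix, assuming Firefox's built-in PDF viewer is intercepting the files: disable pdf.js in the same profile.

```python
from selenium import webdriver

fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.download.dir", "/home/jill/Downloads/Dinamalar")
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
# the key addition: stop the built-in viewer (pdf.js) from opening PDFs
# before the download manager ever sees them
fp.set_preference("pdfjs.disabled", True)
browser = webdriver.Firefox(firefox_profile=fp)
```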

How to get exact page content in wget if error code is 404

python-3.x,curl,web-scraping,wget
I have two urls: one is a working url and the other is a deleted page. The working url is fine, but for the deleted page, instead of getting the exact page content, wget receives a 404. Working url import os def curl(url): data = os.popen('wget -qO- %s '% url).read() print (url) print (len(data)) #print...

web scraping and data processing in java

java,html,regex,web-scraping,data-processing
I am writing a web scraper program to extract stock quotes from yahoo finance, google finance or nasdaq. I can get the html element containing the stock prices, but I only need the dollar value from the result. For example, the sample output looks like the image below: I am using...

Scraping with BeautifulSoup: want to scrape entire column including header and title rows

python,web-scraping,beautifulsoup
I'm trying to get a hold of the data under the columns having the code "SEVNYXX", where "XX" are the numbers that follow (eg. 01, 02, etc) on the site using Python. With the code below I can get the first row of all the Columns data that I want....

Parse an HTML table with Nokogiri in Ruby

html,ruby,web-scraping,nokogiri
I have an HTML table that looks like the following: <table id="TTdata" border="0" cellspacing="0" cellpadding="3" align="center"> <tbody> <tr class="TTdata_ltblue"> <td class="ctr"><b>#</b></td> <td class="ctr"><b><a href="http://www.baseballprospectus.com/sortable/index.php?cid=1819124&amp;newsort1column=YEAR">YEAR</a><img src="/images/up.gif"></b></td> <td class="ctr" title="Player's name."><b><a...

Xpath text() wrong output

python,xpath,web-scraping,scrapy
This is my first scrapy program! I'm writing a program using python/scrapy and I've tested my Xpath in FirePath and it works perfectly, but it is not displaying properly in the console (still in the early testing phase) What I'm doing is attempting to scrape a page of amazon reviews....

Scrapy not giving individual results of all the reviews of a phone?

python,xpath,web-scraping,scrapy,scrapy-spider
This code is giving me results, but the output is not as desired. What is wrong with my xpath? And how do I iterate the rule by +10? I always have problems with these two. import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.selector import Selector from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from urlparse...

Can I get Nokogiri to scrape text from span in Ruby?

html,ruby,web-scraping,nokogiri,curb
I'm trying to scrape info from a website using nokogiri and curb, but I can't seem to find the right name/title to work out where to scrape from (I'm trying to scrape the api key, which is at the bottom of the html code as "xxxxxxx"), or even how to. Please...

CasperJS/SpookyJS css selector is existing and not-existing

javascript,css-selectors,web-scraping,phantomjs,spookyjs
I have a strange problem during screen scraping with spookyjs / casperjs. I want to catch information from the following website: 'https://www.rwe-smarthome.de/is-bin/INTERSHOP.enfinity/WFS/RWEEffizienz-SmartHome-Site/de_DE/-/EUR/ViewApplication-DisplayWelcomePage'. Because the site contains more than one page of products, I want to open the other pages too. Normally one could use this.click(selector, function() {}); to achieve this....

Scrapy writing XPath expression for unknown depth

html,xpath,web-scraping,scrapy
I have an html file which is like: <div id='author'> <div> <div> ... <a> John Doe </a> I do not know how many div's would be under the author div. It may have different depth for different pages. So what would be the XPath expression for this kind of xml?...
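
A minimal sketch: the descendant axis (//) matches at any depth, so the number of intermediate divs does not matter (shown with lxml, but the same expression works in Scrapy selectors):

```python
from lxml import html

doc = html.fromstring(
    "<div id='author'><div><div><a> John Doe </a></div></div></div>")
# the descendant axis (//) matches the <a> at any nesting depth
print(doc.xpath("//div[@id='author']//a/text()"))   # [' John Doe ']
```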

How to exclude a particular html tag(without any id) from several tags while using scrapy?

python,html,web-scraping,scrapy,scrapy-spider
<div class="region size2of3"> <h2>Mumbai</h2> <strong>Fort</strong> <div>Elphinstone building, Horniman Circle,</div> <div>Veer Nariman Road, Fort</div> <div>Mumbai 400001</div> <div>Timings: 08:00-00:30 hrs (Mon-Sun)</div> <div><br></div> </div> I want to exclude the "Timings: 08:00-00:30 hrs (Mon-Sun)" div tag while parsing. Here's my code: import scrapy from job.items import StarbucksItem class StarbucksSpider(scrapy.Spider): name =...

Scraping data using simple html dom and simpleXML

php,web-scraping,simplexml,simple-html-dom
I'm trying to scrape data from several links which I retrieve from an xml file. However, I keep getting an error which only seems to appear on some of the news items. Below you can see the output I get: http://www.hltv.org/news/14971-rgn-pro-series-groups-drawnRGN Pro Series groups drawn http://www.hltv.org/news/14969-k1ck-reveal-new-teamk1ck reveal new team http://www.hltv.org/news/14968-world-championships-captains-unveiled Fatal...

Is there a way using scrapy to export each item that is scrapped into a separate json file?

web-scraping,scrapy,scrapy-spider
Currently I am using "yield item" after every item I scrape, though it gives me all the items in one single JSON file.
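
One way, sketched as a minimal item pipeline (file naming and project path hypothetical): write each item to its own file instead of using the single-file feed export.

```python
import json

class JsonPerItemPipeline(object):
    def __init__(self):
        self.counter = 0

    def process_item(self, item, spider):
        self.counter += 1
        # one numbered JSON file per scraped item
        with open('item_%d.json' % self.counter, 'w') as f:
            json.dump(dict(item), f)
        return item
```

Enable it with something like ITEM_PIPELINES = {'myproject.pipelines.JsonPerItemPipeline': 300} in settings.py.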

Scraping dynamic data with iMacros to Excel

web-scraping,imacros
I want to scrape dynamic data (refreshable every 4 seconds; it's a number) with iMacros and represent that number changing over time in Excel (or any other way). How can I do this? iMacros, as far as I know, can get the data...

Save image from url to special folder

python,web-scraping,beautifulsoup
I want to save images from a url to a special folder, for example 'my_images', and not to the default one (where my *.py file is). Is it possible to do that? Because my code saves all images to the folder with the *.py file. Here is my code: import urllib.request from bs4 import BeautifulSoup import re...
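
A minimal Python 3 sketch (image URL hypothetical): create the folder once and join it into every filename.

```python
import os
import urllib.request

folder = 'my_images'
os.makedirs(folder, exist_ok=True)          # create the folder once

url = 'http://example.com/picture.jpg'      # hypothetical image URL
filename = os.path.join(folder, url.split('/')[-1])
urllib.request.urlretrieve(url, filename)
```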

Web Scraper for dynamic forms in python

python,web-scraping,web-crawler,mechanize
I am trying to fill the form of this website http://www.marutisuzuki.com/Maruti-Price.aspx. It consists of three drop down lists. One is Model of the car, Second is the state and third is city. The first two are static and the third, city is generated dynamically depending upon the value of state,...

Scrapy CrawlSpider not following links

python,web-scraping,web-crawler,scrapy,scrapy-spider
I am trying to crawl some attributes from all (#123) detail pages given on this category page - http://stinkybklyn.com/shop/cheese/ - but scrapy is not able to follow the link pattern I set. I checked the scrapy documentation and some tutorials as well, but no luck! Below is the code: import scrapy from scrapy.contrib.linkextractors...
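
A minimal CrawlSpider sketch (the allow pattern is an assumption to be checked against the real hrefs). Two classic pitfalls: a CrawlSpider must not override parse(), and the LinkExtractor pattern must actually match the links on the category page.

```python
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class CheeseSpider(CrawlSpider):
    name = 'stinky'
    allowed_domains = ['stinkybklyn.com']
    start_urls = ['http://stinkybklyn.com/shop/cheese/']

    rules = (
        # assumed detail-page pattern; verify against the real hrefs
        Rule(LinkExtractor(allow=r'/shop/cheese/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):     # NOT called parse()
        self.log('detail page: %s' % response.url)
```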

Can't get value from xpath python

python,html,xpath,web-scraping,html-parsing
I want to get values from page: http://www.tabele-kalorii.pl/kalorie,Actimel-cytryna-miod-Danone.html I can get all values from first section, but I can't get values from table "Wartości odżywcze" I use this xpath: ''.join(tree2.xpath("//html/body/div[1]/div[3]/article/div[2]/div/div[4]/div[3]/div/div[1]/div[3]/table[1]/tr[3]/td[2]/span/text()")) But I'm not getting anything. With xpath like this: ''.join(tree2.xpath("//html/body/div[1]/div[3]/article/div[2]/div/div[4]/div[3]/div/div[1]/div[3]/table[1]/tr[3]/td[2]//text()")) I'm...
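
A likely explanation: browsers insert <tbody> into the DOM, but it is often absent from the raw HTML your parser sees, so any path containing /tbody/ silently matches nothing. A minimal sketch with a tbody-free, relative expression (the exact table/row indexes still need checking against the raw markup):

```python
import requests
from lxml import html

page = requests.get("http://www.tabele-kalorii.pl/kalorie,"
                    "Actimel-cytryna-miod-Danone.html")
tree = html.fromstring(page.content)
# tbody-free, relative version of the browser-generated path
values = tree.xpath("//table[1]//tr[3]/td[2]//text()")
print(''.join(values))
```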

Using XPath to select the href attribute of the following-sibling

html,google-chrome,xpath,web-scraping
I am attempting to scrape the following site: http://www.hudson211.org/zf/profile/service/id/659837 I am trying to select the href next to the "web address" text. The following xpath selector gets the tag I am after: $x("//th[contains(text(), 'Web Address')]/following-sibling::td/a") returns <a href="http://www.co.sullivan.ny.us">www.co.sullivan.ny.us</a> However, when I specifically try to extract the href using @href, the...

Getting specific element in Url using Nokogiri

ruby,web-scraping,nokogiri
I have this kind of html structure : <table class="list"> <tbody> <tr> <td> </td> <td> <a href="club.do?codeClub=01670001&millesime=2015"></a> </td> </tr> </tbody> </table> I want to get the link contained in the second <td> of each <tr> contained in the table that has the class list. Then actually in each Url I...

Loop through downloading files using selenium in Python

python,selenium,selenium-webdriver,web-scraping,python-3.4
This is a follow-up question to this previous question on how to download ~1000 files from Google Patents. I would like to iterate through a list of filenames fname = ["ipg150106.zip", "ipg150113.zip"] and simulate clicking and saving these files to my computer. The following example works for me and downloads...

Webpage content doesn't match the page's source code

html,web,web-scraping,beautifulsoup
I've been playing around with scraping webpages using BeautifulSoup for a few weeks now. An issue I recently ran into, and hadn't seen before, is that the content of the webpage is different from what's shown as the page's source code and what's given in the url request response. For...

Scraping dynamic content using python-Scrapy

python,web-scraping,scrapy
Disclaimer: I've seen numerous other similar posts on StackOverflow and tried to do it the same way, but they don't seem to work on this website. I'm using Python-Scrapy for getting data from koovs.com. However, I'm not able to get the product size, which is dynamically generated. Specifically, if...

getting specific images from page

python,html,web-scraping,beautifulsoup,html-parsing
I am pretty new to BeautifulSoup. I am trying to print image links from http://www.bing.com/images?q=owl: redditFile = urllib2.urlopen("http://www.bing.com/images?q=owl") redditHtml = redditFile.read() redditFile.close() soup = BeautifulSoup(redditHtml) productDivs = soup.findAll('div', attrs={'class' : 'dg_u'}) for div in productDivs: print div.find('a')['t1'] #works fine print div.find('img')['src'] # this raises KeyError: 'src' But this gives only...
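
A minimal sketch of the usual guard: fetch the attribute with .get(), which returns None instead of raising KeyError, and fall back to a lazy-load attribute (the exact fallback names, e.g. 'src2' or 'data-src', are assumptions to verify in the page source):

```python
import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("http://www.bing.com/images?q=owl").read())
for div in soup.findAll('div', attrs={'class': 'dg_u'}):
    img = div.find('img')
    if img is None:
        continue
    # .get() returns None instead of raising KeyError when the attribute
    # is missing; lazy-loaded images often keep the real URL elsewhere
    src = img.get('src') or img.get('src2') or img.get('data-src')
    if src:
        print src
```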

Scraping the second page of a website in Python does not work

python,python-2.7,web-scraping,beautifulsoup,urlopen
Let's say I want to scrape the data here. I can do it nicely using urlopen and BeautifulSoup in Python 2.7. Now if I want to scrape data from the second page with this address, what I get is the data from the first page! I looked at the page...

Python Web Scraping title in a special div & Page 1 + 15

python,css,xpath,web-scraping,request
Hey guys, following problem: I want to scrape data from a website. But there are 2 issues: I have set it up to check pricing. That works very well, but it only works for pages 1 and 15. I want all from 1-15, like 1,2,3,4,5 etc. I have the problem...

scrape data of different <li> tags and convert them to integers in python

python,web-scraping
I am trying to scrape webpage data having two li tags in a ul: <ul class="yt-lockup-meta-info"><li>1 year ago</li><li>17,838 views</li></ul> There are many lines like the above on the webpage. Following is my code to scrape the data. video_data['views'] = [li.get_text().split("<li>") for li in soup.select('ul.yt-lockup-meta-info')] This gives me the following result...
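
A minimal sketch: take the second <li> of each block, then strip the thousands separators and the ' views' suffix before converting:

```python
from bs4 import BeautifulSoup

html = ('<ul class="yt-lockup-meta-info">'
        '<li>1 year ago</li><li>17,838 views</li></ul>')
soup = BeautifulSoup(html)

views = []
for ul in soup.select('ul.yt-lockup-meta-info'):
    items = ul.find_all('li')
    if len(items) > 1:
        # '17,838 views' -> '17,838' -> '17838' -> 17838
        views.append(int(items[1].get_text().split()[0].replace(',', '')))
print(views)   # [17838]
```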

    GoogleScraper keeps searches in database

    python,bash,web-scraping
    I am using GoogleScraper for some automated searches in python. GoogleScraper keeps search results for search queries in its database named google_scraper.db, e.g. if I have searched site:*.us engineering books and, due to an internet issue while GoogleScraper is making the json file, the result is missed and the json file is not like...

    Parse Json data to Excel

    python,json,excel,parsing,web-scraping
    I have data in Json format available on this link: Json Data What would be the best way to get this done? I know this could be done with Python but I am not sure how....

    Scrapy returning a null output when extracting an element from a table using xpath

    python,xpath,web-scraping,web-crawler,scrapy
    I have been trying to scrape this website that has details of oil wells in Colorado: https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=12307555&type=WELL Scrapy scrapes the website and returns the URL when I scrape it, but when I need to extract an element inside a table using its XPath (the county of the oil well), all I...

    Download files using Python 3.4 from Google Patents

    python,python-3.x,download,web-scraping
    I would like to download (using Python 3.4) all (.zip) files on the Google Patent Bulk Download Page http://www.google.com/googlebooks/uspto-patents-grants-text.html (I am aware that this amounts to a large amount of data.) I would like to save all files for one year in directories [year], so 1976 for all the (weekly)...
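
A minimal Python 3 sketch (the href regex and the ipgYYMMDD naming rule are assumptions; the oldest archives use different file prefixes and would need their own rule):

```python
import os
import re
import urllib.request
from urllib.parse import urljoin

index = "http://www.google.com/googlebooks/uspto-patents-grants-text.html"
html = urllib.request.urlopen(index).read().decode('utf-8', 'ignore')

for href in re.findall(r'href="([^"]+\.zip)"', html):
    url = urljoin(index, href)        # handles relative links too
    name = url.rsplit('/', 1)[-1]     # e.g. ipg150106.zip
    year = '20' + name[3:5]           # assumes ipgYYMMDD naming
    os.makedirs(year, exist_ok=True)
    urllib.request.urlretrieve(url, os.path.join(year, name))
```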

    How to access response Body after simulating a POST request in Node.js?

    node.js,web-scraping,http-post,reddit
    I have been trying this out for a long time now. I want to scrape content from a subreddit that has adult content. But the problem is that you have to answer a simple question before you are given access to that page, i.e. whether you are 18+ or not....

    How to read xml directly from URLs with scrapy/python

    python,xml,web-scraping,scrapy,scrapy-spider
    In Scrapy you will have to define start_urls. But how can I crawl from other urls as well? Up to now I have a login script which logs into a webpage. After logging in, I want to extract xml from different urls. import scrapy class LoginSpider(scrapy.Spider): name = 'example' start_urls...

    Scraping location data in rvest

    javascript,r,web-scraping,scraper,rvest
    I'm currently trying to scrape latitude/longitude data from a list of urls I have using rvest. Each URL has an embedded google map with a specific location, but the urls themselves don't show the path that the API is taking. When looking at the page source, I see that the...

    get div attribute val and div text body

    python,web-scraping,beautifulsoup
    Here is small code to get a div attr value. All div names are the same, with the same attr name. redditFile = urllib2.urlopen("http://www.bing.com/videos?q=owl") redditHtml = redditFile.read() redditFile.close() soup = BeautifulSoup(redditHtml) productDivs = soup.findAll('div', attrs={'class' : 'dg_u'}) for div in productDivs: print div.find('div', {"class":"vthumb"})['smturl'] #print div.find("div", {"class":"tl text-body"}) This prints None rather than...

    Promises for going through URLS

    javascript,web-scraping,promise
    I'm trying to figure out a way to process a dynamic number of URLs. The idea is to have a while loop run until we reach the limit of whatever we are searching for - let's say URLs for example. return new Promise(function(resolve, reject) { var links = []; var...

    Adding elements to BeautifulSoup's find_all list as a string

    python,windows,python-2.7,web-scraping,beautifulsoup
    I am testing a webscraping concept with BeautifulSoup's findall() function. I'm trying to get the contents of the p tags that have the class='first' inside of div class='dinner'. from bs4 import BeautifulSoup import urllib2 html_doc=""" <html> <head> <title>The practice html document</title> </head> <body> <div class='dinner'> <p class='first'>I like pizza</p> <p...
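
A minimal sketch: find the enclosing div first, then search only inside it (soup.select('div.dinner p.first') is an equivalent one-liner):

```python
from bs4 import BeautifulSoup

html_doc = """
<div class='dinner'>
  <p class='first'>I like pizza</p>
  <p class='second'>I like pasta</p>
</div>"""
soup = BeautifulSoup(html_doc)

dinner = soup.find('div', attrs={'class': 'dinner'})
for p in dinner.find_all('p', attrs={'class': 'first'}):
    print(p.get_text())
```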

    Scraping successive pages until the last page using Nokogiri and Mechanize

    ruby,web-scraping,nokogiri,mechanize
    I am trying to scrape multiple pages from a website. I want to scrape a page, then click on next, get that page, and repeat until I hit the end. I wrote this so far: page = agent.submit(form, form.buttons.first) #submitting a form while lien = page.link_with(:text=>'Next') # while I have...

    Scraping Data From Interactive Map

    python,svg,web-scraping
    I would like to scrape the voter registration data underlying this map: http://www.bostonglobe.com/metro/2012/08/28/registration-figures-show-massachusetts-voters-continue-abandon-two-major-political-parties/p0zW7Snj9R07DK913P36kM/igraphic.html?p1=Article_Graphic As you hover over each town, both the total and the by-party figures in the box below change. I would like to record the name of each town and registration counts by party. Any suggestions about how...

    Python Beautiful Soup Table Data Scraping Specific TD Tags

    python,table,web-scraping,beautifulsoup,html-table
    This webpage... http://www.nfl.com/player/tombrady/2504211/gamelogs has multiple tables on it. Within the HTML all of the tables are labeled the exact same: <table class="data-table1" width="100%" border="0" summary="Game Logs For Tom Brady In 2014"> I can scrape data from only the first table (Preseason table) but I do not know how to skip...

    Beautifulsoup can't find tag by text

    python,web-scraping,beautifulsoup
    Beautifulsoup suddenly can't find a tag by its text. I have a html in which this tag appears: <span class="date">Telefon: <b>+421 902 808 344</b></span> BS4 can't find this tag: telephone = soup.find('span',{'text':re.compile('.*Telefon.*')}) print telephone >>> None I've tried many ways like find('span',text='Telefon: ') or find('span', text=re.compile('Telefon: .*') But nothing works....
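
A minimal sketch of the usual workaround: find() with text= only matches tags whose entire content is that one string, and a span containing a <b> child has no such single string. Match the text node instead, then walk up to its parent:

```python
import re
from bs4 import BeautifulSoup

html = '<span class="date">Telefon: <b>+421 902 808 344</b></span>'
soup = BeautifulSoup(html)

text_node = soup.find(text=re.compile('Telefon'))   # the NavigableString
span = text_node.parent                             # the enclosing <span>
print(span.get_text())                              # Telefon: +421 902 808 344
```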

    How can I use beautiful soup to get the current price of a stock on Google Finance?

    python,web-scraping
    I have the following python code and the goal is to get the current price of this stock, which is $110.80. import urlparse import urllib2 import pdb from bs4 import BeautifulSoup from pprint import pprint url = "https://www.google.com.hk/finance?q=0001&ei=yF14VYC4F4Wd0ASb64CoCw" def WebCrawl(url): htmltext = urllib2.urlopen(url).read() soup = BeautifulSoup(htmltext) P = soup.find() print...

    Scrapy Xpath construction producing empty brackets on dynamic site

    python,selenium,selenium-webdriver,web-scraping,scrapy
    I am trying to create a spider via scrapy to crawl a website and extract all links for specific stores. Ultimately, the spider would then use those store links to extract pricing information. The site is designed to break down store information into States and Regions. I have been able...

    Scrapy parse list of urls, open one by one and parse additional data

    python,parsing,web-scraping,scrapy
    I am trying to parse a site, an e-store. I parse a page with products, which are loaded with ajax, get the urls of these products, and then parse additional info about each product following these parsed urls. My script gets the list of the first 4 items on the page, their urls,...

    python 2.7: scraping a website

    python,python-2.7,web-scraping
    I am probably doing my scraping incorrectly given that I know little programming, but I would like to know how to scrape data from an html table in python and associate it with its own class... I don't really know what I'm doing, so here is an example: <div class="example"> <a href="/example/thisexample">...

    Using Selenium and Python, how to check whether the button is still clickable?

    python,selenium,web-scraping
    So I am doing some web scraping using Selenium with Python and I am having a problem. I am clicking a Next button to move to the next page on a certain website, but I need to stop clicking it when I reach the last page. Now, my idea of...
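
One common approach, sketched with a hypothetical locator: wrap the lookup in try/except and also check the element's state, since sites tend to disable or remove 'Next' on the last page.

```python
from selenium.common.exceptions import NoSuchElementException

def next_clickable(driver):
    try:
        btn = driver.find_element_by_link_text('Next')   # hypothetical locator
    except NoSuchElementException:
        return False                                     # button removed
    return btn.is_displayed() and btn.is_enabled()       # button disabled?

# assumes `driver` is an already-initialised webdriver on the first page
while next_clickable(driver):
    driver.find_element_by_link_text('Next').click()
```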

    iMacros TAG to Find TXT and Click Nearby (previous) Link

    javascript,dom,web-scraping,scrape,imacros
    Below is the example code of the WordPress backend tag management section. I'm trying to write an iMacros script to find a tag and delete it. However, the tag text doesn't belong to any HTML tag. <div class="tagchecklist"> <span> <a id="post_tag-check-num-0" class="ntdelbutton" tabindex="0">X</a> &nbsp;Orange </span> <span> <a id="post_tag-check-num-1" class="ntdelbutton" tabindex="0">X</a> &nbsp;Apple </span>...

    Why is scrapy not giving all the results, and why is the rules part not working?

    python,xpath,web-scraping,web-crawler,scrapy
    This script is only providing me with the first result, or the .extract()[0]; if I change 0 to 1, then the next item. Why is it not iterating over the whole xpath again? The rule part is also not working. I know the problem is in the response.xpath. How to deal with...

    Extracting text between link tags using BeautifulSoup in Python

    python,html,web-scraping,beautifulsoup
    I have HTML code that looks like this: <a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/> Nationwide - <span title="college students/staff of schools in valid states">Restrictions</span> - Ends 6/30/15</a> and I'm trying to extract the text displayed when this HTML is rendered....
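
A minimal sketch: get_text() concatenates every text node under the <a> and skips the <img>; a separator plus strip=True tidies the result:

```python
from bs4 import BeautifulSoup

html = ('<a href="/Content.aspx?id=102966" id="mylink" target="_blank">'
        'EZSTORAGE - PACK IT. STORE IT. WIN - '
        '<img src="/images/usa.png"/> Nationwide - '
        '<span title="...">Restrictions</span> - Ends 6/30/15</a>')
soup = BeautifulSoup(html)

link = soup.find('a', id='mylink')
print(link.get_text(' ', strip=True))   # all text nodes, <img> skipped
```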

    Why is scrapy not storing data into mongodb?

    python,mongodb,web-scraping,scrapy,scrapy-spider
    My main File: import scrapy from scrapy.exceptions import CloseSpider from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor from scrapy.http import Request class Product(scrapy.Item): brand = scrapy.Field() title = scrapy.Field() link = scrapy.Field() name = scrapy.Field() title = scrapy.Field() date = scrapy.Field() heading = scrapy.Field() data = scrapy.Field() Model_name =...
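
One frequent cause is simply that no pipeline is registered, so items never reach MongoDB. A minimal pipeline sketch (database and collection names hypothetical; insert_one is the pymongo 3.x call, older pymongo uses insert):

```python
import pymongo

class MongoPipeline(object):
    def open_spider(self, spider):
        self.client = pymongo.MongoClient('localhost', 27017)
        self.db = self.client['scrapy_db']          # hypothetical db name

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db['products'].insert_one(dict(item))  # hypothetical collection
        return item
```

It still needs registering, e.g. ITEM_PIPELINES = {'myproject.pipelines.MongoPipeline': 300} in settings.py.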

    Extracting links with scrapy that have a specific css class

    python,web-scraping,scrapy,screen-scraping,scrapy-spider
    Conceptually simple question/idea. Using Scrapy, how do I use LinkExtractor so that it only follows links with a given CSS class? Seems trivial and like it should already be built in, but I don't see it? Is it? It looks like I can use an XPath, but I'd prefer using...

    Using R to download *.xls files generates error

    r,web-scraping,rvest
    I'm trying to download a large number of xls files from the BLS servers. When I manually download any of the files, they open perfectly. But when I try to download the file from inside R: library(readxl) tp <- "http://www.bea.gov/histdata/Releases/GDP_and_PI/2014/Q4/Third_March-27-2015/Section1ALL_Hist.xls" temp <- paste0(tempfile(), ".xls") download.file(tp, destfile = temp, mode =...

    Link checker within R [closed]

    r,web-scraping
    Is there a way within R to list (find) all links for a given webpage? I'd like to enter a URL and produce a directory tree of all links from that site. The purpose is to find the relevant sub-page to scrape. Here is link to similar question on SO...

    Python text extraction does not work on some pdfs

    python,pdf,web-scraping,pypdf,pdfminer
    I am trying to read a pdf through url. I followed many stackoverflow suggestions and used PyPdf2 FileReader to extract text from the pdf. My code looks like this : url = "http://kat.kar.nic.in:8080/uploadedFiles/C_13052015_ch1_l1.pdf" #url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf" f = urlopen(Request(url)).read() fileInput = StringIO(f) pdf = PyPDF2.PdfFileReader(fileInput) print pdf.getNumPages() print pdf.getDocumentInfo() print...

    How to use GoogleScraper package to scrape link from different search engines in Python

    python,web-scraping,scrape
    I want to scrape links from different search engines for my search query in python. For e.g. Query: "who is Sachin Tendulkar" Output: want links from google search, bing search. After digging through many links I found the GoogleScraper package. GoogleScraper link: https://pypi.python.org/pypi/GoogleScraper/0.1.37 But I didn't...

    Ruby Mechanize form input field text

    ruby,csv,automation,web-scraping,mechanize
    Resolved - the "abc = list.scan(/[([^)]+)]/).last.first" line was correct but also included the quotes, which the website search form did not accept. Corrected it to abc = list.scan(/\"([^)]+)\"/).join. Thanks for all the help. I have to automate a search using a list of 100 keywords that is in a csv...

    Scrapy middleware setup

    python,web-scraping,web-crawler,scrapy
    I am trying to access a public proxy using scrapy to get some data. I get the following error when I try to run the code: ImportError: Error loading object 'craiglist.middlewares.ProxyMiddleware': No module named middlewares I've created a middlewares.py file with the following code: import base64 # Start your middleware class class ProxyMiddleware(object):...

    How is Ruby Mechanize fast after first get request?

    ruby,web-scraping,mechanize
    I recently programmed a scraper with Ruby's Mechanize gem for the first time. It had to hit the server (some 'xyz.com/a/number') where the number will be generated by the script. Like 'xyz.com/a/2' and 'xyz.com/a/3'. It turned out that the first request took a lot of time -- around 1.5s on...

    Scrapy not entering parse method

    python,selenium,web-scraping,web-crawler,scrapy
    I don't understand why this code is not entering the parse method. It is pretty similar to the basic spider examples from the doc: http://doc.scrapy.org/en/latest/topics/spiders.html And I'm pretty sure this worked earlier in the day... Not sure if I modified something or not.. from selenium.webdriver.support.wait import WebDriverWait from selenium.webdriver.common.by import...

    BeautifulSoup is not getting all data, only some

    python,html,web-scraping,beautifulsoup,html-parsing
    import requests from bs4 import BeautifulSoup def trade_spider(max_pages): page = 0 while page <= max_pages: url = 'http://orangecounty.craigslist.org/search/foa?s=' + str(page * 100) source_code = requests.get(url) plain_text = source_code.text soup = BeautifulSoup(plain_text) for link in soup.findAll('a', {'class':'hdrlnk'}): href = 'http://orangecounty.craigslist.org/' + link.get('href') title = link.string print title #print href get_single_item_data(href) page...

    Rvest loop breaks on redirecting site

    r,for-loop,web-scraping,vectorization,rvest
    My situation: I have a long (20k lines) list of URLs that I need to scrape particular data elements from for an analysis. For the purpose of this example, I'm looking for a particular field called "sol-num", which is the solicitation number. Using the following function, I can fetch the...

    Memory Leak in Scrapy

    python,web-scraping,scrapy
    I wrote the following code to scrape for email addresses (for testing purposes): import scrapy from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors import LinkExtractor from scrapy.selector import Selector from crawler.items import EmailItem class LinkExtractorSpider(CrawlSpider): name = 'emailextractor' start_urls = ['http://news.google.com'] rules = ( Rule (LinkExtractor(), callback='process_item', follow=True),) def process_item(self, response):...

    Scrapy collect data from first element and post's title

    python,web-scraping,web-crawler,scrapy,scrapy-spider
    I need Scrapy to collect data from this tag and retrieve all three parts in one piece. The output would be something like: Tonka double shock boys bike - $10 (Denver). <span class="postingtitletext">Tonka double shock boys bike - <span class="price">$10</span><small> (Denver)</small></span> Second is to collect data from first span tag....

    xpath exclude certain child element with class

    html,xpath,web-scraping
    I have an html structure which, for instance, could look like below in a simplified version. I want to exclude the yarpp-related div from the xpath content. Here is what I'm using at the moment: //div[@class='entry-content'] How can I exclude the yarpp-related div? html structure <div class="entry-content"> <div class="yarpp-related"> </div>...

    Scrapy: catch responses with specific HTTP server codes

    python,web-scraping,scrapy,scrapy-spider
    We have a pretty much standard Scrapy project (Scrapy 0.24). I'd like to catch specific HTTP response codes, such as 200, 500, 502, 503, 504 etc. Something like that: class Spider(...): def parse(...): processes HTTP 200 def parse_500(...): processes HTTP 500 errors def parse_502(...): processes HTTP 502 errors ... How...
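
A minimal sketch using the spider attribute handle_httpstatus_list, which stops HttpErrorMiddleware from filtering those responses, then dispatches on response.status:

```python
import scrapy

class StatusSpider(scrapy.Spider):
    name = 'status'
    # without this, HttpErrorMiddleware silently drops non-2xx responses
    handle_httpstatus_list = [500, 502, 503, 504]
    start_urls = ['http://example.com/']   # hypothetical

    def parse(self, response):
        dispatch = {500: self.parse_500, 502: self.parse_502}
        handler = dispatch.get(response.status)
        if handler is not None:
            return handler(response)
        # normal HTTP 200 processing goes here

    def parse_500(self, response):
        self.log('server error at %s' % response.url)

    def parse_502(self, response):
        self.log('bad gateway at %s' % response.url)
```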

    HTTP Error 999: Request denied

    python,web-scraping,beautifulsoup,linkedin,mechanize
    I am trying to scrape some web pages from LinkedIn using BeautifulSoup and I keep getting the error "HTTP Error 999: Request denied". Is there a way to avoid this error? If you look at my code, I have tried Mechanize and URLLIB2 and both are giving me the same...

    Adding items to ArrayList with separation

    java,arraylist,web-scraping,data-modeling
    I'm doing a project on web scraping using ArrayLists. When I scrape the info, it comes back as item [0] = pencil, item [1] = $1.50. I would like these items to be together, or if possible it would be even better if the prices and item each had their...

    Web scraping error: exceptions.MemoryError

    python,web-scraping,scrapy,scrapy-spider
    I'm trying to download data from gsmarena. Sample code to download the HTC One ME spec from the following site, "http://www.gsmarena.com/htc_one_me-7275.php", is given below. The data on the website is classified in form of tables and table rows. The data is of the format: table header > td[@class='ttl'] >...

    How to use re() to extract data from javascript variable using scrapy?

    javascript,python,regex,web-scraping,scrapy
    My items.py file goes like this: from scrapy.item import Item, Field class SpiItem(Item): title = Field() lat = Field() lng = Field() add = Field() and the spider is: import scrapy import re from spi.items import SpiItem class HdfcSpider(scrapy.Spider): name = "hdfc" allowed_domains = ["hdfc.com"] start_urls = ["http://hdfc.com/branch-locator"] def parse(self,response):...

    Javascript function not returning updated object

    javascript,json,web-scraping,cheerio
    I wrote this function and when I run it, it returns {}. When I log json in the function passed to $('.price').filter, it shows the json object has been updated with the correct data. However, at the end of my function it returns an empty object. I don't understand...

    Python Beautiful Soup Web Scraping Specific Numbers

    python,html,web-scraping,beautifulsoup,html-parsing
    On this page the final score (number) of each team has the same class name class="finalScore". When I call the final score of the away team (on top) the code calls that number without a problem. If ... favLastGM = 'A' When I try to call the final score of...

    Find and click links in ugly table with Python and Selenium webdriver

    python-2.7,selenium,xpath,web-scraping
    I'm trying to get Selenium Webdriver to click x number of links in a table, and I can't get it to work. I can print the links like this: links = driver.find_elements_by_xpath("//table[2]/tbody/tr/td/p/strong/a") for i in range(0,len(links)): print links[i].text But when I try to do a links[i].click() instead of printing python...
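
The usual cause: the first click() navigates away, which stales every remaining WebElement in the list. A minimal sketch that snapshots the hrefs first (assumes driver is an already-initialised webdriver):

```python
# assumes `driver` is an already-initialised selenium webdriver
links = driver.find_elements_by_xpath("//table[2]/tbody/tr/td/p/strong/a")
urls = [link.get_attribute("href") for link in links]   # snapshot first
for url in urls:
    driver.get(url)
    # ... scrape the detail page here ...
```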

    Extract Google Analytics UA code with Javascript

    javascript,google-analytics,web-scraping
    How do you extract the Google Analytics UA code on a page with Javascript? Could this be done by manipulating the ga function or by scraping the site for the code?

    VBA skipping code directly after submitting form in IE

    vba,internet-explorer,excel-vba,web-scraping
    Currently I have 2 pieces of code that work separately, but when used together they don't work properly. The first code asks the user to input information which is stored. It then navigates to the correct webpage where it uses the stored user input information to navigate via filling and...