How to take data from variable and put it into another

python,web-scraping,beautifulsoup,screen-scraping
i'm having a little bit of an issue: I would like to take this data, for item in g_data: print item.contents[1].find_all("a", {"class":"a-link-normal s-access-detail-page a-text-normal"})[0]["href"] print item.contents[1].find_all("a", {"class":"a-link-normal s-access-detail-page a-text-normal"})[1]["href"] print item.contents[1].find_all("a", {"class":"a-link-normal s-access-detail-page a-text-normal"})[2]["href"] print item.contents[1].find_all("a", {"class":"a-link-normal s-access-detail-page a-text-normal"})[3]["href"] and use the...

Get text outside tags using BeautifulSoup

python,beautifulsoup
I am very new to all this and am having a hard time getting specific text outside of any tags using BeautifulSoup. Here is my code: from bs4 import BeautifulSoup soup = BeautifulSoup(''' <li id="SalesRank" style="list-style : none"> <b>Sellers Rank:</b> #81 in Fun (<a href="http://www.google.com">See Top 100</a>) </li> ''') theRank...

Get content of tags with empty id in BeautifulSoup

python,beautifulsoup
from bs4 import BeautifulSoup page = """<span id="something">useless</span> <span id="">some text</span> <span id="different">useless</span>""" soup = BeautifulSoup(page) How can I get some text only? Using soup.find_all('span', {'id': ""}) finds everything....
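
One way to approach the empty-id case above is to pass a callable as the attribute filter, so the comparison against the empty string is explicit. This is a minimal sketch using the built-in html.parser backend (an assumption; the question does not name a parser):

```python
from bs4 import BeautifulSoup

page = """<span id="something">useless</span>
<span id="">some text</span>
<span id="different">useless</span>"""

soup = BeautifulSoup(page, "html.parser")

# A callable filter receives the attribute value, so it can test for an
# id that is exactly the empty string instead of "any id".
empty_id_spans = soup.find_all("span", id=lambda value: value == "")
texts = [span.get_text() for span in empty_id_spans]
print(texts)
```

The CSS form soup.select('span[id=""]') is likely an equivalent alternative on bs4 versions that ship with soupsieve.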

How to parse an XML file with multiple non-specific elements

python,xml,beautifulsoup,fogbugz,fogbugz-api
I'm trying to parse a list of cases that is returned from the Fogbugz API. Current code: from fogbugz import FogBugz from datetime import datetime, timedelta import fbSettings fb = FogBugz(fbSettings.URL, fbSettings.TOKEN) resp = fb.search(q='project:"Python Testing"',cols='ixBug') print resp.cases.case.ixbug.string The problem is that the XML has multiple cases returned simply as...

BeautifulSoup : TypeError: 'unicode' object is not callable

python,beautifulsoup
Here's my code : v_card = soup.find('div', {'class':'col subgroup vcard'}) if v_card is not None : print v_card.prettify() infos = v_card.findAll('li') print infos[0].text() Here's the output : <div class="col subgroup vcard"> <ul> <li> infos I need to get </li> <li> infos I need to get </li> <li> </li> </ul> </div>...
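
The error in the excerpt above is consistent with calling .text as if it were a method. In bs4, .text is a property that already holds a string, while get_text() is the callable form. A minimal sketch, with the HTML shortened from the question and html.parser assumed as the backend:

```python
from bs4 import BeautifulSoup

html = """<div class="col subgroup vcard"> <ul>
<li> infos I need to get </li> <li> infos I need to get </li> <li> </li>
</ul> </div>"""

soup = BeautifulSoup(html, "html.parser")
v_card = soup.find("div", class_="vcard")

first_info = None
if v_card is not None:
    infos = v_card.find_all("li")
    # .text is a property (already a string), so infos[0].text() fails;
    # get_text() is the method form and accepts options such as strip=True.
    first_info = infos[0].get_text(strip=True)
print(first_info)
```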

Regular expression for class using Beautifulsoup

python,html,regex,beautifulsoup,html-parsing
I am using BeautifulSoup for easy scraping. I have figured out that there are more than 5 divs in the webpage which I want to scrape. Their names are different but follow a pattern. These divs are: divnewthing divnew divnewstring etc. So the pattern is divnew*, a kind of regular expression. And I am...
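
A compiled regular expression can be passed wherever find_all() accepts an attribute value. The sketch below assumes the "names" in the question are class values (the title suggests so); the sample markup is made up for illustration:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: three divs follow the divnew* pattern, one does not.
html = """<div class="divnewthing">a</div>
<div class="divnew">b</div>
<div class="divnewstring">c</div>
<div class="other">d</div>"""

soup = BeautifulSoup(html, "html.parser")

# The pattern is tested against each class value of every div.
matches = soup.find_all("div", class_=re.compile(r"^divnew"))
names = [div["class"][0] for div in matches]
print(names)
```

If the pattern actually lives in the id attribute, the same idea should work with id=re.compile(...).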

How to remove parent tag with BeautifulSoup

python,beautifulsoup,html-parsing
I am trying to remove the header cells from a html table using BeautifulSoup. I have something like; <tr> <th> head1 </th> <th> head2 </th> </tr> I am using the following code to remove all the header cells; soup = BeautifulSoup(url) for headless in soup.find_all('th'): headless.decompose() This works great, except...

How do I sort an unordered list using BeautifulSoup 4 by a child element

python,sorting,python-3.x,beautifulsoup,html-lists
I am new to Python coding and BeautifulSoup4. I have a list in HTML that I need to sort, which follows the pattern: <div id="mgioLangSelector"> <ul id="mgioLangList"> <li><a href="" class="mgio-autonym"><span class="mgioAutonymNative" lang="am">አማርኛ</span><span class="mgioAutonymSeperator"> / </span><span class="mgioAutonymEnglish">Amharic</span</a></li> <li><a href="" class="mgio-autonym"><span class="mgioAutonymNative"...

Beautifulsoup: Getting a new line when I tried to access the soup.head.next_sibling value with Beautifulsoup4

python,python-2.7,web,web-scraping,beautifulsoup
I am trying an example from the BeautifulSoupDocs and found it acting weird. When I try to access the next_sibling value, instead of the "body" a '\n' is coming in to picture. html_doc = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were...
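
The newline in the excerpt above is a text node: whitespace between tags is part of the parse tree, so next_sibling returns it. find_next_sibling() skips text nodes and returns the next tag. A short sketch with an abbreviated version of the document and html.parser assumed:

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# next_sibling returns whatever node comes next, including whitespace text.
raw_sibling = soup.head.next_sibling
# find_next_sibling() only considers tags.
tag_sibling = soup.head.find_next_sibling()

print(repr(raw_sibling))
print(tag_sibling.name)
```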

Beautiful Soup: Get text data from html

python,html,beautifulsoup
Here is my html code now I want extract data from following html code using beautiful soup <tr class="tr-option"> <td class="td-option"><a href="">A.</a></td> <td class="td-option">120 m</td> <td class="td-option"><a href="">B.</a></td> <td class="td-option">240 m</td> <td class="td-option"><a href="">C.</a></td> <td class="td-option" >300 m</td> <td class="td-option"><a href="">D.</a></td> <td...

Python Beautiful Soup Scraping Exact Content From Charts

python,table,data,beautifulsoup,scrape
In python using beautiful soup I want to be able to grab specific text/numbers from a sortable table online. http://www.nfl.com/stats/categorystats?archive=false&conference=null&role=OPP&offensiveStatisticCategory=null&defensiveStatisticCategory=INTERCEPTIONS&season=2014&seasonType=REG&tabSeq=2&qualified=false&Submit=Go I have attempted this about a million times and can't figure it out. This is the best i could do: from bs4 import BeautifulSoup import urllib2 import requests import pymongo...

Regex TypeError: 'NoneType' object is not callable

python,regex,string,beautifulsoup
I'm trying to extract some data from a web page. I'm using Beautiful Soup 4 and regexes. The problem is that it returns an error but I can't figure out why the error is raised. Here is a piece of my code: urls = soup.findall('a',href = re.compile(r'/katalog/stavebnictvi/'+'.')) Here is the...
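
The method is spelled find_all (or findAll in the older style). An unknown attribute such as soup.findall is treated as a lookup for a child tag named findall, which yields None, and calling None raises the TypeError. A minimal sketch with made-up links:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical links; only the first one matches the catalogue path.
html = """<a href="/katalog/stavebnictvi/firma-1">one</a>
<a href="/katalog/jine/firma-2">two</a>"""

soup = BeautifulSoup(html, "html.parser")

# Attribute access falls back to "find a tag with this name", hence None.
missing = soup.findall

pattern = re.compile(r"/katalog/stavebnictvi/.")
urls = [a["href"] for a in soup.find_all("a", href=pattern)]
print(urls)
```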

How should I show results of BeautifulSoup parsing in Django?

python,django,django-templates,django-views,beautifulsoup
I'm trying to scrape a web page using BeautifulSoup and Django. Here's my views.py which do this task: def detail(request, article_id): article = get_object_or_404(Article, pk=article_id) html = urllib2.urlopen("...url...") soup = BeautifulSoup(html) title = soup.title return render(request, 'detail.html', {'article': article, 'title':title}) But when I use {{ title }} in django template...

BeautifulSoup gives garbage for html conversion

python,html,pdf,utf-8,beautifulsoup
I am trying to scrape this URL: url = 'http://www.jmlr.org/proceedings/papers/v36/li14.pdf'. This is my code: html = requests.get(url) htmlText = html.text soup = BeautifulSoup(htmlText) print soup #gives garbage However, it gives weird symbols that I think are garbage. It's an html file so it shouldn't be trying to parse it as...

How can I dependably web-scrape a largely unattached line effectively?

python,python-2.7,web-scraping,beautifulsoup
Sorry if that was a vague title. I'm trying to scrape the number of XKCD web-comics on a consistent basis. I saw that http://xkcd.com/ always has their newest comic on the front page along with a line further down the site saying: Permanent link to this comic: http://xkcd.com/1520/ Where 1520...

Python Beautiful Soup Web Scraping Specific Numbers

python,html,web-scraping,beautifulsoup,html-parsing
On this page the final score (number) of each team has the same class name class="finalScore". When I call the final score of the away team (on top) the code calls that number without a problem. If ... favLastGM = 'A' When I try to call the final score of...

Reading 1000s of XML documents with BeautifulSoup

python,xml,beautifulsoup,enthought
I'm trying to read a bunch of xml files and do stuff to them. The first thing I want to do is rename them based on a number that's inside the file. You can see a sample of the data here. Warning: this will initiate a download of a 108MB zip...

Beautifulsoup, grab text with link

python,python-3.x,beautifulsoup
I'm making a web spider to automate some of my work. I have a table with lots of drivers and different versions for different operating systems. So far everything works fine, but I'm having a hard time separating the links for each operating system. I'll post part of the html...

Extracting strings from HTML with Python wont work with regex or BeautifulSoup

python,regex,parsing,beautifulsoup,python-requests
I'm using Python 2.7, BeautifulSoup4, regex, and requests on Windows 7. I've scraped some code from a website and I am having problems parsing and extracting the bits I want and storing them in a dictionary. What I'm after is text that is presented as follows in the code: @CAD_DTA\">I...

Save image from url to special folder

python,web-scraping,beautifulsoup
I want to save images from url to special folder, for example 'my_images', but not to default(where my *.py file is). Is it possible to make it? Because my code saves all images to folder with *.py file. Here is my code: import urllib.request from bs4 import BeautifulSoup import re...

trying to regex in python

python,regex,python-3.x,beautifulsoup,urllib
Can anyone please help me understand this code snippet, from http://garethrees.org/2007/05/07/python-challenge/ Level2 >>> import urllib >>> def get_challenge(s): ... return urllib.urlopen('http://www.pythonchallenge.com/pc/' + s).read() ... >>> src = get_challenge('def/ocr.html') >>> import re >>> text = re.compile('<!--((?:[^-]+|-[^-]|--[^>])*)-->', re.S).findall(src)[-1] >>> counts = {} >>> for c in text: counts[c] = counts.get(c, 0) +...

Python nested html tags with Beautifulsoup

python,html,regex,beautifulsoup
I'm trying to get all the href URLs from nested HTML code: ... <li class="dropdown"> <a href="#" class="dropdown-toggle wide-nav-link" data-toggle="dropdown">TEXT_1 <b class="caret"></b></a> <ul class="dropdown-menu"> <li class="class_A"><a title="Title_1" href="http://www.customurl_1.com">Title_1</a></li> <li class="class_B"><a title="Title_2" href="http://www.customurl_2.com">Title_2</a></li> ... <li class="class_A"><a...

Downloading Image Data URIs from Webpages via BeautifulSoup

python,python-2.7,beautifulsoup
I need to retrieve an image from a website using Python. However, the image is not in the form of a linked file, but as a GIF Data URI. How do I download this and store it in a .gif file?
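
A base64 data URI carries the image bytes inline, so nothing needs to be downloaded: split off the header and decode the payload. The sketch below uses only the standard library; decode_data_uri is a hypothetical helper name and the sample URI is a stand-in 1x1 GIF, not data from the question:

```python
import base64


def decode_data_uri(uri):
    """Split a base64 data URI into (mime_type, raw_bytes)."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URI")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# Stand-in value; in practice this would come from an <img> tag's src attribute.
sample_uri = (
    "data:image/gif;base64,"
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)
mime_type, image_bytes = decode_data_uri(sample_uri)

# To store the result, write the bytes in binary mode:
# with open("image.gif", "wb") as handle:
#     handle.write(image_bytes)
```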

Get text of HTML tags without text of inner child tags

python,python-2.7,beautifulsoup
Example: Sometimes the HTML is: <div id="1"> <div id="2"> this is the text i do NOT want </div> this is the text i want here </div> Other times it's just: <div id="1"> this is the text i want here </div> I want to get only the text in the one...
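
One approach that covers both layouts above is to join only the text nodes that are direct children of the outer div and ignore anything inside nested tags. own_text is a hypothetical helper name; html.parser is assumed:

```python
from bs4 import BeautifulSoup, NavigableString


def own_text(tag):
    """Return the text that belongs to tag itself, not to its child tags."""
    pieces = [child for child in tag.children if isinstance(child, NavigableString)]
    return "".join(pieces).strip()


nested = BeautifulSoup(
    '<div id="1"> <div id="2"> this is the text i do NOT want </div>'
    " this is the text i want here </div>",
    "html.parser",
)
flat = BeautifulSoup('<div id="1"> this is the text i want here </div>', "html.parser")

nested_text = own_text(nested.find("div", id="1"))
flat_text = own_text(flat.find("div", id="1"))
print(nested_text)
print(flat_text)
```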

Python: Save Excel File As-Is To Folder

python,beautifulsoup
I'm downloading Excel files from a website using beautifulsoup4. I only need to download the files. I don't need to rename them, just download them to a folder, relative to where the code is. the function takes in a beautifulsoup call, searches for <a> then makes a call to the...

Find all HTML and non-HTML encoded URLs in string

python,html,regex,beautifulsoup,lxml
I would like to find all URLs in a string. I found various solutions on StackOverflow that vary depending on the content of the string. For example, supposing my string contained HTML, this answer recommends using either BeautifulSoup or lxml. On the other hand, if my string contained only a...

Create content snippet with Jinja filter

python,flask,beautifulsoup,jinja2
I want to create content snippets for my home page. An example post looks something like <p>Your favorite Harry Potter characters enter the Game of Thrones universe, and you'll never guess what happens!</p> <readmore/> <p>...they all die</p> On the home page I only want the things before <readmore/> to show...

Print outputs double results

python,beautifulsoup
the script is printing double results and I can't really pin down the problem. # -*- coding: utf-8 -*- import requests from bs4 import BeautifulSoup as bs word = ("mission") with requests.Session() as s: r = s.get('http://www.tabula.ge/en') soup = bs(r.text) div = soup.find("div", {"class": "sets"}) for i in div.find_all('li'): for...

Labelling nodes in networkx

python,graph,beautifulsoup,label,networkx
I'm trying to extract one set of values from a URL. This set has a unique list of numbers. The number of elements in this list should be equal to the number of nodes. So the label that these nodes get should come from the list extracted. How can this...

How to limit the result of select tag in beautifulsoup?

python,html,beautifulsoup,html-parsing
For example, I have this: result = soup.select('div#test > div.filters > span.text') I want to limit the result of the above list to 10 items. In case of find_all() one can use the limit argument but what about select()?...
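
Since select() returns a list, slicing is the simplest way to cap the result. The markup below is generated purely for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical page with 25 matching spans.
spans = "".join('<span class="text">item %d</span>' % i for i in range(25))
html = '<div id="test"><div class="filters">' + spans + "</div></div>"

soup = BeautifulSoup(html, "html.parser")

# Keep only the first 10 matches.
result = soup.select("div#test > div.filters > span.text")[:10]
print(len(result))
```

Recent bs4 releases appear to accept a limit keyword on select() as well; that is worth checking against the installed version before relying on it.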

BeautifulSoup parsing unicode giving variable results

python-2.7,unicode,beautifulsoup,ipython-notebook
I am trying to to parse the following ipython notebook however I am getting varying results when I read the unicode into a BeautifulSoup object, i.e. from IPython.nbconvert.exporters import HTMLExporter from IPython.config import Config from bs4 import BeautifulSoup filepath = '2015-05-01_test2.ipynb' config = Config({'CSSHTMLHeaderTransformer': {'enabled': True, 'highlight_class': '.highlight-ipynb'}}) exporter =...

Printing and formatting results in BeautifulSoup

python,string,beautifulsoup
My problem is that I want to print only the results with '1', not '-1', but when I use find() I just get '1' or '-1'. I know that it is working, but is there any function to print only those with '1', not the number but the whole line? import requests import...

Logic flow - trying to iterate thru website pages with BeautifulSoup and CSV Writer

python,csv,beautifulsoup
I can't seem to figure out the proper indents/clause placements to get this to loop thru more than 1 page. This code currently prints out a CSV file fine, but only does it for the first page. Any help?? #THIS WORKS BUT ONLY PRINTS THE FIRST PAGE from bs4 import...

BeautifulSoup is not getting all data, only some

python,html,web-scraping,beautifulsoup,html-parsing
import requests from bs4 import BeautifulSoup def trade_spider(max_pages): page = 0 while page <= max_pages: url = 'http://orangecounty.craigslist.org/search/foa?s=' + str(page * 100) source_code = requests.get(url) plain_text = source_code.text soup = BeautifulSoup(plain_text) for link in soup.findAll('a', {'class':'hdrlnk'}): href = 'http://orangecounty.craigslist.org/' + link.get('href') title = link.string print title #print href get_single_item_data(href) page...

BeautifulSoup invalid syntax in Python 3.4 (after 2to3.py)

python,python-3.x,beautifulsoup,python-3.4
I am trying to install Beautiful Soup 4 in Python 3.4. I installed it from the command line, (got the invalid syntax error because I had not converted it), ran the 2to3.py conversion script to bs4 and now I get a new invalid syntax error. >>> from bs4 import BeautifulSoup...

Beautifulsoup can't find tag by text

python,web-scraping,beautifulsoup
Beautifulsoup suddenly can't find a tag by its text. I have a html in which this tag appears: <span class="date">Telefon: <b>+421 902 808 344</b></span> BS4 can't find this tag: telephone = soup.find('span',{'text':re.compile('.*Telefon.*')}) print telephone >>> None I've tried many ways like find('span',text='Telefon: ') or find('span', text=re.compile('Telefon: .*') But nothing works....
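
Two things probably explain the None above. A dict passed as the second argument is read as attribute filters, so {'text': ...} looks for an HTML attribute named text. And the span has two children (a string and a b tag), so its .string is None and a text filter on the tag itself has nothing to match. A sketch that searches the text node first and then climbs to its parent:

```python
import re
from bs4 import BeautifulSoup

html = '<span class="date">Telefon: <b>+421 902 808 344</b></span>'
soup = BeautifulSoup(html, "html.parser")

# Find the text node that contains the label, then step up to the enclosing tag.
label = soup.find(string=re.compile(r"Telefon"))
span = label.parent
number = span.b.get_text()
print(number)
```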

NoneType Error when using Beautiful Soup object inside function

python,beautifulsoup,python-requests
Why is it that the penultimate line of this snippet completes successfully, but the last one gives the error: TypeError: 'NoneType' object is not callable? What is different inside the scope of the function, and how can it be fixed? import requests from bs4 import BeautifulSoup def findDiv(soup): print soup.body.FindAll("div")...

Python BS4 Unsupported format: White space in attribute selector

python,css,beautifulsoup
I am beginning web scraping with BeautifulSoup in Python. Website I am trying to parse "http://www.moneycontrol.com/india/stockpricequote/computers-software/techmahindra/TM4" My code as below previous_close = content.select(".gD_12 PB3"); I have the following error when the line is interpreted previous_close = content.select(".gD_12 PB3"); File "/usr/local/lib/python2.7/dist-packages/bs4/element.py", line 1313, in select 'Unsupported or invalid CSS selector: "%s"'...

HTML list comprehension issue while using Beautiful Soup w Python

python,html,beautifulsoup,list-comprehension
I've narrowed my HTML down and I want to pull the hrefs from each line IF the content following the a tag is past 2010. What's the best way to do this? I'll post my code first, and then the HTML. Code: links = [STEM_URL + row.a["href"] for row in...

Beautifulsoup Table Scraping table navigation

beautifulsoup
I am trying to learn beautifulsoup to scrape HTML and have a difficult challenge. The HTML I am trying to scrape is not well formatted, and with my lack of knowledge of beautifulsoup I am kind of stuck. The HTML I am trying to scrape is as below <table> <tr> <td><b>Value 1<b/>HiddenValue1</td>...

BeautifulSoup: Parsing bad Wordpress HTML

python,html,regex,wordpress,beautifulsoup
So I need to scrape a site using Python but the problem is that the markup is random, unstructured, and proving hard to work with. For example <p style='font-size: 24px;'> <strong>Title A</strong> </p> <p> <strong> First Subtitle of Title A </strong> "Text for first subtitle" </p> Then it will...

Accessing text in html using BeautifulSoup

python,beautifulsoup
I'm trying to access the string Out of Stock using BeautifulSoup but cannot find the way to it: <span style="color: #727272; font-size: 14px; font-weight: normal;"> <strong>Price: $790</strong> (Out of stock) </span> Can anybody give hints how can I do this?...
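
The stock status above is a bare text node that follows the strong tag, so it can be reached with next_sibling. A minimal sketch, assuming html.parser:

```python
from bs4 import BeautifulSoup

html = """<span style="color: #727272; font-size: 14px; font-weight: normal;">
<strong>Price: $790</strong> (Out of stock) </span>"""

soup = BeautifulSoup(html, "html.parser")
price_tag = soup.find("strong")

# The text right after </strong> is the next sibling node; trim the
# surrounding whitespace and then the parentheses.
status = price_tag.next_sibling.strip().strip("()")
print(status)
```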

Trying extract data under of specific div and sub div

python,beautifulsoup
I am trying to get it so I can have it print the title of the book and the chapters but only each book and title. So basically "The First Book of Jacob" Chapters 1-7 instead of it iterating over all the books. Here is the page layout (url included...

Delete index in list if multiple strings are matched

python,list,beautifulsoup
I've scraped a website containing a table and I want to format the headers for my desired final out. headers = [] for row in table.findAll('tr'): for item in row.findAll('th'): for link in item.findAll('a', text=True): headers.append(link.contents[0]) print headers Which returns: [u'Rank ', u'University Name ', u'Entry Standards', u'Click here to...

BeautifulSoup4 get input 'value' throws an error with good code?

html,parsing,beautifulsoup,html-parsing
print [(element['name'], element['value']) for element in soup.find_all('input')] I copied this code to get the value of an input and it throws this error: File "messager.py", line 116, in main print [(element['name'], element['value']) for element in soup.find_all('input')] File "C:\PYTHON27\lib\site-packages\bs4\element.py", line 905, in __getitem__ return self.attrs[key] KeyError: 'value' If I only provide...
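
The KeyError above means at least one input on the page has no value attribute. Subscript access raises for a missing attribute, while .get() returns None. A sketch with made-up form fields:

```python
from bs4 import BeautifulSoup

# Hypothetical form: the second input has no value attribute.
html = """<form>
<input name="user" value="alice">
<input name="remember" type="checkbox">
</form>"""

soup = BeautifulSoup(html, "html.parser")

# .get() returns None instead of raising when the attribute is absent.
pairs = [(el.get("name"), el.get("value")) for el in soup.find_all("input")]
print(pairs)
```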

Searching Large String for file path. Return filepath + filename

python,regex,string,beautifulsoup,html-parsing
I've got a little project where I’m trying to download a series of wallpapers from a web page. I'm new to python. I'm using the urllib library, which is returning a long string of web page data which includes <a href="http://website.com/wallpaper/filename.jpg"> I know that every filename I need to download...

Python, beautifulsoup scraping specific or exact numbers from a stat table

python,table,statistics,beautifulsoup,screen-scraping
On a player stat page. How can I make my anchor point the year "2014" and grab specific numbers in the 2014 column (scrape numbers to the right of 2014) The code below is skipping the "Passing" table (with all of the career passing stats) and trying to grab stats...

How to match a particular tag through css selectors where the class attribute contains spaces?

python,html,css-selectors,beautifulsoup,html-parsing
I want to select a table tag which has the value of class attribute as: drug-table data-table table table-condensed table-bordered So I tried the below code: for i in soup.select('table[class="drug-table data-table table table-condensed table-bordered"]'): print(i) But it fails to work: ValueError: Unsupported or invalid CSS selector: "table[class="drug-table" spaces in the...
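
In CSS, a space separates ancestor from descendant, so a multi-class element is usually matched by chaining its class names with dots rather than quoting the whole attribute value. A sketch with a cut-down table (the cell contents are made up):

```python
from bs4 import BeautifulSoup

html = """<table class="drug-table data-table table table-condensed table-bordered">
<tr><td>x</td></tr></table>
<table class="other"><tr><td>y</td></tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# Each .name requires that class to be present; the order does not matter.
tables = soup.select("table.drug-table.data-table.table-condensed.table-bordered")
print(len(tables))
```

The same idea applies to a selector such as ".gD_12 PB3", which would normally be written ".gD_12.PB3".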

encode_contents vs encode(“utf-8”) in Python BeautifulSoup

python,beautifulsoup,encode
OK, so as a beginner web scraper I feel as though I've seen both used, seemingly interchangeably when converting the default unicode of text in HTML. I know contents() is a list object but other than that, what the heck is the difference? I've noticed that .encode("utf-8") seems to work more...

Python 2.7.10 Trying to print text from website using Beautiful Soup 4

python,python-2.7,beautifulsoup,urllib2
I want my output to be like: count:0 - Bournemouth and Watford to go head-to-head for Abdisalam Ibrahim Olympiacos midfielder Abdisalam Ibrahim is a target for Premier League new-boys Bournemouth and Watford.The former Manchester City man is keen to leave Greece this summer, and his potential availability has alerted Eddie...

return RSS attribute values via BeautifulSoup

python,attributes,rss,beautifulsoup
RSS: (in a file called myfeed.rss) <?xml version="1.0" encoding="utf-8" ?> <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:newznab="http://www.newznab.com/DTD/2010/feeds/attributes/"> <channel> <title>MyFeed</title> <link>http://website</link> <description>RSS Feed</description> <language>en-us</language> <item> <title>title goes here</title> <pubDate>Tue, 09 Jun 2015 15:15:23 -0600</pubDate>...

Beautiful soup 4 constructor error

python,beautifulsoup
I am running python 3.5 with BeautifulSoup4 and getting an error when I attempt to pass the plain text of a webpage to the constructor. The source code I am trying to run is import requests from bs4 import BeautifulSoup tcg = 'http://magic.tcgplayer.com/db/deck_search_result.asp?Format=Commander' sourcecode = requests.get(tcg) plaintext = sourcecode.text soup...

regex + beautifulsoup

python,regex,beautifulsoup
I've isolated a line of HTML procured from BeautifulSoup that i want to run regex on, but I keep getting AttributeError: 'NoneType' object has no attribute 'groups' I read another stackoverflow question (using regex on beautiful soup tags) but I can't see what I need to do to fix my...

Beautiful Soup - Nested table

python,web-scraping,beautifulsoup
I am first going to point out that I am new to all of this but struggling with trying to get to a nested tables cells. Here is the square footage field I am trying to get to down around line 282: view-source:http://services.wakegov.com/realestate/Account.asp?id=0355891 'square_feet': soup.findAll('table')[10].findAll('tr')[15].get_text().strip(), The error I receive is:...

Utilizing BeautifulSoup on a completely flat HTML hierarchy

python,html,beautifulsoup
so I'm a webscraping noob, and ran into some HTML format i've never seen before. All the info I need is in a completely flat hierarchy. I need to grab the Date/MovieName/Location/Amenities. It's laid out so (just like this): <div class="caption"> <strong>July 1</strong> <br> <em>Top Gun</em> <br> "Location: Millennium Park"...

Scraping nested tags

python,beautifulsoup,html-parsing
I know this type of question comes up frequently, however I have been browsing and have not seen a similar problem. <div class="casts"> <table cellpadding="0" cellspacing="0"> <tbody> <tr> <td class=""> <a class="cast"> <span class="title"> Nested data 1 <span class="schedule"> Nested data 2 </span> </span> </a> </td> </tr> </tbody> </table> </div>...

get div attribute val and div text body

python,web-scraping,beautifulsoup
Here is a small piece of code to get a div attr value. All divs have the same name with the same attr name. redditFile = urllib2.urlopen("http://www.bing.com/videos?q=owl") redditHtml = redditFile.read() redditFile.close() soup = BeautifulSoup(redditHtml) productDivs = soup.findAll('div', attrs={'class' : 'dg_u'}) for div in productDivs: print div.find('div', {"class":"vthumb"})['smturl'] #print div.find("div", {"class":"tl text-body"}) This prints None rather than...

Beautifulsoup Cannot FindAll

python,beautifulsoup
I'm trying to scrape nature.com to perform some analysis on journal articles. When I execute the following: import requests from bs4 import BeautifulSoup import re query = "http://www.nature.com/search?journal=nature&order=date_desc" for page in range (1, 10): req = requests.get(query + "&page=" + str(page)) soup = BeautifulSoup(req.text) cards = soup.findAll("li", "mb20 card cleared")...

AttributeError when web-scraping data using Python

python,beautifulsoup
I'm trying to access the data in the Table in this URL. I am using the code below but I'm coming across the Error AttributeError: 'NoneType' object has no attribute 'find' in the line data = iter(soup.find("table", {"class": "xtTblCon"}).find("div", {"id": "MATURITYY%"}).find_all_next("li")). The code is as follows: from bs4 import BeautifulSoup...

How to extract data within a cdata tag using python?

python,html,xml,beautifulsoup,cdata
I used beautiful soup to get CDATA from a html page but i have to extract contents from it and put it in a csv file. this is my code: from bs4 import BeautifulSoup from urllib.request import urlopen import re import csv f = open('try.html') ff = csv.writer(open("profiletry.csv", "w")) ff.writerow(["cdata"])...

Extracting Numbers From a Table on a Website

python,table,website,beautifulsoup
I am new to this website and programming in general, so bear with me please as my formatting for the question may be incorrect. I am trying to extract data from a website for personal use. I only want the precipitation at the top of the hour. I am nearly...

New to Python, what am I doing wrong and not seeing tag (links) returned with BS4

python,beautifulsoup,web-crawler,bs4
I'm new to python and learning it. Basically I am trying to pull all the links from my e-commerce store products that is stored in the html below. I'm getting no results returned though and I can't seem to figure out why not. <h3 class="two-lines-name"> <a title="APPLE IPOD IPOD A1199...

python BeautifulSoup find all input for specific form

python,html,forms,beautifulsoup,html-parsing
I'm trying to use BeautifulSoup to extract input fields for a specific form only. Extracting the form using the following: soup.find('form') Now I want to extract all input fields which are a child to that form only. How can I do that with BS?...
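
Every Tag supports the same search methods as the soup object, so calling find_all() on the form restricts the search to that form's subtree. A sketch with made-up field names:

```python
from bs4 import BeautifulSoup

# Hypothetical page: one input outside the form, two inside it.
html = """<input name="outside">
<form id="login"><input name="user"><div><input name="password"></div></form>"""

soup = BeautifulSoup(html, "html.parser")
form = soup.find("form")

# Searching from the form tag ignores inputs elsewhere on the page.
names = [field.get("name") for field in form.find_all("input")]
print(names)
```

find_all() descends through nested tags by default; passing recursive=False would limit it to direct children of the form.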

Extracting text between link tags using BeautifulSoup in Python

python,html,web-scraping,beautifulsoup
I have HTML code that looks like this: <a href="/Content.aspx?id=102966" id="mylink" target="_blank">EZSTORAGE - PACK IT. STORE IT. WIN - <img src="/images/usa.png" style="border:none; height:14px; margin-bottom:-2px;"/> Nationwide - <span title="college students/staff of schools in valid states">Restrictions</span> - Ends 6/30/15</a> and I'm trying to extract the text displayed when this HTML is rendered....

How to get javascript output in python BeautifulSoup or any other module

javascript,python,html,web-scraping,beautifulsoup
In my attempt to make a scraper, I found a website that uses javascript a lot in its code. Is it possible to retrieve the output of the script, e.g. <html> <head> <title>Python</title> </head> <body> <script type="text/javascript" src='test.js'></script> <p> some stuff <br> more stuff <br> code <br> video <br> picture <br>...

Replace text without escaping in BeautifulSoup

python,html,escaping,beautifulsoup
I would like to wrap some words that are not already links with anchor links in BeautifulSoup. I use this to achieve it: from bs4 import BeautifulSoup import re text = ''' replace this string ''' soup = BeautifulSoup(text) pattern = 'replace' for txt in soup.findAll(text=True): if re.search(pattern,txt,re.I) and txt.parent.name...

HTTP Error 999: Request denied

python,web-scraping,beautifulsoup,linkedin,mechanize
I am trying to scrape some web pages from LinkedIn using BeautifulSoup and I keep getting error "HTTP Error 999: Request denied". Is there a way around to avoid this error. If you look at my code, I have tried Mechanize and URLLIB2 and both are giving me the same...

Scraping the second page of a website in Python does not work

python,python-2.7,web-scraping,beautifulsoup,urlopen
Let's say I want to scrape the data here. I can do it nicely using urlopen and BeautifulSoup in Python 2.7. Now if I want to scrape data from the second page with this address. What I get is the data from the first page! I looked at the page...

Python + BS Picking a specific word (location) from webpage table

python,beautifulsoup
Hello all… I want to pick a word at a specific location from a table on a webpage. The source code is like: table = ''' <TABLE class=form border=0 cellSpacing=1 cellPadding=2 width=500> <TBODY> <TR> <TD vAlign=top colSpan=3><IMG class=ad src="/images/ad.gif" width=1 height=1></TD></TR> <TR> <TH vAlign=top width=22>Code:</TH> <TD class=dash vAlign=top width=5 lign="left">&nbsp;</TD> <TD class=dash vAlign=top...

execution error: The variable display is not defined. (-2753)

python,osx,web-scraping,beautifulsoup
I'm using beautiful soup to extract 2 sets of data from a website . However strangely, I'm getting the following error ! Error; 0:7: execution error: The variable display is not defined. (-2753) Code : import requests import os from bs4 import BeautifulSoup word = [] meaning = [] r...

Error logging into instagram with python

python,beautifulsoup,mechanize
I am trying to log into my instagram via a python script using argparse. It seems to connect but it prints out "This page could not be loaded. If you have cookies disabled in your browser, or you are browsing in Private Mode, please try enabling cookies or turning off Private...

Webpage content doesn't match the page's source code

html,web,web-scraping,beautifulsoup
I've been playing around with scraping webpages using BeautifulSoup for a few weeks now. An issue I recently ran into, and hadn't seen before, is where the content of the webpage is different from what's shown as the page's source code and what's given in the url request response. For...

Remove Tags - Beautiful Soup

xml,python-2.7,beautifulsoup
I'm having an issue where my code is returning the information I want from XML with the tags where I only want the information between the tags. My output looks like [<weekendingdate>2015-05-02</weekendingdate>] but it should be 2015-05-02. Thanks for the help! Below is my attempt and the XML code. Attempt:...
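
find_all() returns Tag objects, and printing the list shows their markup. Calling get_text() on each tag yields only the content between the tags. A sketch with a one-element document; the report wrapper element is made up, and html.parser is assumed:

```python
from bs4 import BeautifulSoup

# Hypothetical wrapper element around the tag shown in the question.
xml = "<report><weekendingdate>2015-05-02</weekendingdate></report>"
soup = BeautifulSoup(xml, "html.parser")

# get_text() drops the surrounding tag and keeps the inner text.
dates = [tag.get_text() for tag in soup.find_all("weekendingdate")]
print(dates)
```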

Not sure how to parse this

python,html,beautifulsoup,html-parsing
<div class="meaning"><span class="hinshi">[副]</span>物事の重点・大勢を述べるときに用いる。</div> All I need from this is おもに。もっぱら。物事の重点・大勢を述べるときに用いる. Usually the hinshi class is separate from the sentences I'm trying to parse, but for some of them they seem to be combined together. Is there any way to just print the sentence while ignoring the [副]?...

Scraping with BeautifulSoup: want to scrape entire column including header and title rows

python,web-scraping,beautifulsoup
I'm trying to get a hold of the data under the columns having the code "SEVNYXX", where "XX" are the numbers that follow (eg. 01, 02, etc) on the site using Python. With the code below I can get the first row of all the Columns data that I want....

what makes a python webscrape output unicode?

python,unicode,beautifulsoup
i'm playing around with BeautifulSoup scraping a table and its contents and i've noticed I get different outputs based on how I end it - if i print it outright I get an output that has no unicode notation. html = urlopen('http://www.bcsfootball.org').read() soup = BeautifulSoup(html) for row in soup('table', {'class':'mod-data'})[0].tbody('tr'):...

BeautifulSoup scraping nested tables

python,beautifulsoup,html-parsing
I have been trying to scrape the data from a website which is using a good amount of tables. I have been researching on the beautifulsoup documentation as well as here on stackoverflow but am still lost. Here is the said table: <form action="/rr/" class="form"> <table border="0" width="100%" cellpadding="2" cellspacing="0"...

AttributeError when scraping data from URL via Python

python,pandas,beautifulsoup
I am using the code below to try to extract the data from the table in this URL. I asked the same question here and got an answer for it. However, although the code from the answer worked at the time, I've now come to realize that data in the...

To split html code using beautifulsoup for the required format

python-2.7,beautifulsoup
I have an HTML snippet which looks like following: <div class="myTestCode"> <strong>Abc: </strong> test1</br> <strong>Def: </strong> test2</br> </div> How do I parse it in Beautiful Soup to get: Abc: test1, Def: test2 This is what I have tried so far : data = """<div class="myTestCode"> <strong>Abc: </strong> test1</br> <strong>Def: </strong>...

Finding all tags and attributes in a HTML

python,html,xml-parsing,beautifulsoup,html-parsing
I am a newbie and am looking at HTML code for the first time. For my research I need to know the number of tags and attributes in a webpage. I looked at various parsers and found Beautiful Soup to be one of the most preferred ones. The following code (taken from...

Edit text from html with BeautifulSoup

python,html,beautifulsoup
I'm currently trying to extract the html elements which have a text on their own and wrap them with a special tag. For example, my HTML looks like this: <ul class="myBodyText"> <li class="fields"> This text still has children <b> Simple Text </b> <div class="s"> <ul class="section"> <li style="padding-left: 10px;"> Hello...

using python urllib and beautiful soup to extract information from html site

python,beautifulsoup,urllib
I am trying to extract some information from this website, i.e. the line which says: Scale(Virgo + GA + Shapley): 29 pc/arcsec = 0.029 kpc/arcsec = 1.72 kpc/arcmin = 0.10 Mpc/degree but everything after the : varies depending on galtype. I have written code which used BeautifulSoup and...

Parsing table for a link

python,html,beautifulsoup
I've been able to isolate a row in an HTML table using Beautiful Soup in Python 2.7. It's been a learning experience, but I'm happy to get that far. Unfortunately I'm a bit stuck on this next bit. I need to get the link that follows the "Select document Remittance Report I...

Beautifulsoup, unable to compare strings

python,python-3.x,beautifulsoup
I'm trying to write a web spider to gather some links and text. I have a table I'm working with, and the second cell of each row has a number in it. All I want to do is get that number and, if it's the one I need, then grab...

Ignoring a table's cell class in BeautifulSoup

python,class,beautifulsoup
I'm scraping the data off this website to create a table. I plan on creating a function to iterate through every subject but testing on just Accounting & Finance first. So far I have the following code: import os import requests from bs4 import BeautifulSoup import pandas as pd main_url...

Scraping a javascript / json object from a webpage using BeautifulSoup?

javascript,python,html,json,beautifulsoup
I am using BeautifulSoup to get the HTML of a webpage. That works fine so far. But what I really want are the contents of this javascript chunk inside the HTML, which is encapsulated with <script type="text/javascript"> and then inside that tag, eventually there is a giant array thing that...

In BeautifulSoup4, Python3, How to stop recursing inside a found tag?

python,python-3.x,beautifulsoup
My html document looks like: <html> <body> <font color="#151B54"> outer font <font color="#512222"> inner font </font> </font> <p> <font color="#512222"> sibling font </font> </p> </body> </html> I want to extract all the text between the 'font' tags. Expected Output: outer font inner font sibling font What I have tried is:...
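A sketch of one way to get the expected output, assuming the goal is the text sitting directly inside each font tag rather than everything nested below it. It walks each tag's direct children and keeps only the strings; "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup, NavigableString

html = """<html> <body> <font color="#151B54"> outer font
<font color="#512222"> inner font </font> </font>
<p> <font color="#512222"> sibling font </font> </p> </body> </html>"""
soup = BeautifulSoup(html, "html.parser")

texts = []
for font in soup.find_all("font"):
    # Only look at direct children, so nested <font> text is not repeated
    # when the outer tag is processed.
    for child in font.children:
        if isinstance(child, NavigableString) and child.strip():
            texts.append(child.strip())

print(texts)
```

find_all still visits the inner font tag on its own, which is why its text appears once rather than twice.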

Adding elements to BeautifulSoup's find_all list as a string

python,windows,python-2.7,web-scraping,beautifulsoup
I am testing a webscraping concept with BeautifulSoup's find_all() function. I'm trying to get the contents of the p tags that have the class='first' inside of div class='dinner'. from bs4 import BeautifulSoup import urllib2 html_doc=""" <html> <head> <title>The practice html document</title> </head> <body> <div class='dinner'> <p class='first'>I like pizza</p> <p...
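A minimal sketch of scoping one find_all inside another, assuming the aim is only the p.first tags that sit under div.dinner. The practice document is cut off above, so the markup below is a hypothetical reduction of it; only the "I like pizza" paragraph is taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second <p> and the 'lunch' div are invented
# for illustration.
html_doc = """
<div class='dinner'>
  <p class='first'>I like pizza</p>
  <p class='second'>I like salad</p>
</div>
<div class='lunch'>
  <p class='first'>I like soup</p>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Search for the container first, then search again within each match.
first_items = [
    p.get_text()
    for div in soup.find_all("div", class_="dinner")
    for p in div.find_all("p", class_="first")
]
print(first_items)
```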

How to count the number of lines of code retrieved using beautiful soup?

python,printing,count,beautifulsoup
Is there any function in beautiful soup to count the number of lines retrieved? Or is there any other way this can be done? from bs4 import BeautifulSoup import string content = open("webpage.html","r") soup = BeautifulSoup(content) divTag = soup.find_all("div", {"class":"classname"}) for tag in divTag: ulTags = tag.find_all("ul", {"class":"classname"}) for tag...

Unique content identifier with Selenium: InvalidSelectorError

python-2.7,selenium,beautifulsoup
I'm trying to grab data from: http://www.boerse-frankfurt.de/de/etfs/ishares+msci+world+momentum+factor+ucits+etf+DE000A12BHF2 The types of data I'm looking for are located in the classes named singlebox list_component. Let's say I want to extract the Total Expense Ratio (0.30%). It is located in a td class called: right column-datavalue lastColOfRow. But if I do: dues =...

How can I get data from a specific class of a html tag using beautifulsoup?

python,beautifulsoup,scrape
I want to get the data (name, city and address) located in a div tag from an HTML file like this: <div class="mainInfoWrapper"> <h4 itemprop="name">name</h4> <div> <a href="/Wiki/Province/Tehran"></a> city <a href="/Wiki/City/Tehran"></a> Address </div> </div> I don't know how to get the data that I want from that specific tag. Obviously I'm using Python...
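A sketch against the snippet shown in the question, assuming the city and address are always the loose text nodes of the inner div, in that order. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

html = """<div class="mainInfoWrapper"> <h4 itemprop="name">name</h4>
<div> <a href="/Wiki/Province/Tehran"></a> city
<a href="/Wiki/City/Tehran"></a> Address </div> </div>"""
soup = BeautifulSoup(html, "html.parser")

wrapper = soup.find("div", class_="mainInfoWrapper")

# The name lives in its own tag, so it can be looked up by attribute.
name = wrapper.find("h4", itemprop="name").get_text(strip=True)

# City and address are bare strings between the empty <a> tags;
# stripped_strings yields them without the surrounding whitespace.
details = list(wrapper.find("div").stripped_strings)
print(name, details)
```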

How do I completely remove all style, scripts, and html tags from an html page

python,html,beautifulsoup
Here is what I have so far: from bs4 import BeautifulSoup def cleanme(html): soup = BeautifulSoup(html) # create a new bs4 object from the html data loaded for script in soup(["script"]): script.extract() text = soup.get_text() return text testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And...
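A possible extension of the function above, sketched under the assumption that style blocks should be dropped the same way as scripts. The function name clean_text and the sample markup are hypothetical; the test string in the question is cut off, so it is not reproduced here.

```python
from bs4 import BeautifulSoup


def clean_text(html):
    """Return the visible text of html with script and style content removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Calling the soup with a list of names is shorthand for find_all.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining strings with single spaces and trim each one.
    return soup.get_text(separator=" ", strip=True)


# Hypothetical sample input, loosely modelled on the question's test string.
sample = ("<html><head><title>Example</title>"
          "<style>.call {font-family:Arial;}</style>"
          "<script>getit</script></head>"
          "<body>I need this text captured</body></html>")
text = clean_text(sample)
print(text)
```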

Writing loop over multiple pages with BeautifulSoup

python,loops,beautifulsoup,mechanize,bs4
I'm attempting to scrape several pages of results from the county search tool here: http://www2.tceq.texas.gov/oce/waci/index.cfm?fuseaction=home.main But I can't seem to figure out how to iterate over more than just the first page. import csv from mechanize import Browser from bs4 import BeautifulSoup url = 'http://www2.tceq.texas.gov/oce/waci/index.cfm?fuseaction=home.main' br = Browser() br.set_handle_robots(False) br.open(url)...

BeautifulSoup Troubleshooting involving flat HTML hierarchy and next_sibling loop

python,beautifulsoup
I have HTML with a flat hierarchy, like this: <div class="caption"> <strong>July 1</strong> <br> <em>Top Gun</em> <br> "Location: Millennium Park" <br> "Amenities: Please be a volleyball tournament..." <br> <em>Captain Phillips</em> <br> "Location: Montgomery Ward Park" <br> <br> <strong>July 2</strong> <br> <em>The Fantastic Mr. Fox </em> And I'm getting tripped up...

How to collect a continuous set of webpages using python?

python,regex,url,beautifulsoup,matching
https://example.net/users/x Here, x is a number that ranges from 1 to 200000. I want to run a loop to get all the URLs and extract contents from every URL using beautiful soup. from bs4 import BeautifulSoup from urllib.request import urlopen import re content = urlopen(re.compile(r"https://example.net/users/[0-9]//")) soup = BeautifulSoup(content) Is this...
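urlopen expects a URL string (or a Request object), not a compiled regular expression, so the URLs have to be built one at a time. A minimal sketch, assuming the ids run from 1 to 200000 inclusive; user_urls and fetch are hypothetical helper names, and fetch is defined but not called so the example makes no network requests.

```python
from urllib.request import urlopen

BASE_URL = "https://example.net/users/{}"


def user_urls(first=1, last=200000):
    """Yield one URL per user id, from first to last inclusive."""
    for user_id in range(first, last + 1):
        yield BASE_URL.format(user_id)


def fetch(url):
    """Download one page and return its raw bytes (not called here)."""
    with urlopen(url) as response:
        return response.read()


# Show the first few generated URLs without fetching anything.
sample = list(user_urls(1, 3))
print(sample)
```

Each result of fetch could then be passed to BeautifulSoup in the loop body; error handling and rate limiting are left out of this sketch.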

getting specific images from page

python,html,web-scraping,beautifulsoup,html-parsing
I am pretty new to BeautifulSoup. I am trying to print image links from http://www.bing.com/images?q=owl: redditFile = urllib2.urlopen("http://www.bing.com/images?q=owl") redditHtml = redditFile.read() redditFile.close() soup = BeautifulSoup(redditHtml) productDivs = soup.findAll('div', attrs={'class' : 'dg_u'}) for div in productDivs: print div.find('a')['t1'] #works fine print div.find('img')['src'] #This getting issue KeyError: 'src' But this gives only...

Python Beautiful Soup Table Data Scraping Specific TD Tags

python,table,web-scraping,beautifulsoup,html-table
This webpage... http://www.nfl.com/player/tombrady/2504211/gamelogs has multiple tables on it. Within the HTML all of the tables are labeled the exact same: <table class="data-table1" width="100%" border="0" summary="Game Logs For Tom Brady In 2014"> I can scrape data from only the first table (Preseason table) but I do not know how to skip...

BeautifulSoup: AttributeError: 'NavigableString' object has no attribute 'children'

python,beautifulsoup
When using BeautifulSoup4, I can run this code to get one "Shout" without problems. When I use the for loop, I get the error AttributeError: 'NavigableString' object has no attribute 'children' class Shout: def __init__(self, user, msg, date): self.user = user self.msg = msg self.date = date def getShouts(): #s...
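This error commonly appears when a loop walks over a tag's children and hits the whitespace between tags, which BeautifulSoup represents as NavigableString objects rather than Tag objects. Whether that is the cause here is an assumption, since the loop is cut off above. A sketch of the usual guard, on hypothetical markup:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup; the newlines between <li> tags become
# NavigableString children of <ul>.
html = "<ul>\n<li><b>a</b></li>\n<li><b>b</b></li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

counts = []
for node in soup.ul.children:
    if not isinstance(node, Tag):
        continue  # skip text nodes, which have no .children
    counts.append(len(list(node.children)))

print(counts)
```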

Find out which parser BeautifulSoup4 is using?

python,html,beautifulsoup,html-parsing
I've written a script using beautifulsoup4 that works on one machine but not another. The reason is that on the other machine, the BeautifulSoup() constructor auto-converts <br> to <br/>, whereas it doesn't on my machine. Believe it or not, it matters to my script. I figured that the two...

BeautifulSoup: Difficulty accessing correct table

python,beautifulsoup
I'm using BeautifulSoup4 to scrape a page and the following function is giving me 2 issues: def getTeamRoster(teamURL): html = urllib.request.urlopen(teamURL).read() soup = BeautifulSoup(html) teamPlayers = [] #second table corebody = soup.find(id = "corebody") teamTable = corebody.table.next_sibling.next_sibling.next_sibling.next_sibling print(teamTable) tableBody = teamTable.find('tbody') print(tableBody) tableRows = tableBody.findAll('tr') 1) When I call ".next_sibling"...