

Removing newlines (\n) with BeautifulSoup

python,regex,bs4
I'm parsing an HTML page with BS4:

    import re
    import codecs
    import MySQLdb
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(open("sprt.htm"), from_encoding='utf-8')
    sprt = [[0 for x in range(3)] for x in range(300)]
    i = 0
    for para in soup.find_all('p'):
        if para.strong is not None:
            sprt[i][0] = para.strong.get_text()
            sprt[i][1] = para.get_text()...
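A minimal sketch of one way to drop newlines from text of the kind `get_text()` returns, assuming the goal is simply to collapse every run of whitespace (newlines included) into a single space. The helper name `collapse_whitespace` and the sample string are hypothetical, not taken from the question.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace, including newlines, with one space."""
    # str.split() with no arguments splits on any whitespace and drops
    # empty pieces, so joining with " " also trims both ends.
    return " ".join(text.split())


# Hypothetical stand-in for a string returned by para.get_text().
sample = "First line\nsecond line\n\n  third line\n"
cleaned = collapse_whitespace(sample)
print(cleaned)
```

This uses only plain string methods, so it can be applied to each extracted string before it is stored.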

New to Python: what am I doing wrong, and why are no tags (links) returned with BS4?

python,beautifulsoup,web-crawler,bs4
I'm new to Python and learning it. Basically, I am trying to pull all the links for my e-commerce store's products, which are stored in the HTML below. No results are returned, though, and I can't figure out why. <h3 class="two-lines-name"> <a title="APPLE IPOD IPOD A1199...
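A hedged sketch of how links inside such headings could be collected with a CSS selector. The markup is a hypothetical stand-in modelled on the truncated snippet: the titles and `href` values are invented for illustration, and only the `two-lines-name` class comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's titles and hrefs are not shown
# in the question, so these values are placeholders.
html = """
<h3 class="two-lines-name">
  <a title="Example product" href="/products/example-1">Example product</a>
</h3>
<h3 class="two-lines-name">
  <a title="Another product" href="/products/example-2">Another product</a>
</h3>
"""

soup = BeautifulSoup(html, "html.parser")

# Select every <a> that has an href and sits inside an
# <h3 class="two-lines-name">, then read the attribute.
links = [a["href"] for a in soup.select("h3.two-lines-name a[href]")]
print(links)
```

If a selector like this returns nothing on the live page, one possible cause is that the HTML the script receives differs from what the browser shows, for example when the products are filled in by JavaScript.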

BeautifulSoup (bs4) parsing wrong

python,html,python-2.7,bs4
Parsing this sample document with bs4, from Python 2.7.6:

    <html>
    <body>
    <p>HTML allows omitting P end-tags.
    <p>Like that and this.
    <p>And this, too.
    <p>What happened?</p>
    <p>And can we <p>nest a paragraph, too?</p></p>
    </body>
    </html>

Using:

    from bs4 import BeautifulSoup as BS
    ...
    tree = BS(fh)

HTML has, for ages, allowed...

Extract News article content from stored .html pages

python,urllib2,bs4
I am reading text from HTML files and doing some analysis. These .html files are news articles. Code:

    html = open(filepath,'r').read()
    raw = nltk.clean_html(html)
    raw.unidecode(item.decode('utf8'))

Now I just want the article content and not the rest of the text, such as advertisements, headings, etc. How can I do so relatively...

Writing loop over multiple pages with BeautifulSoup

python,loops,beautifulsoup,mechanize,bs4
I'm attempting to scrape several pages of results from the county search tool here: http://www2.tceq.texas.gov/oce/waci/index.cfm?fuseaction=home.main But I can't seem to figure out how to iterate over more than just the first page.

    import csv
    from mechanize import Browser
    from bs4 import BeautifulSoup

    url = 'http://www2.tceq.texas.gov/oce/waci/index.cfm?fuseaction=home.main'
    br = Browser()
    br.set_handle_robots(False)
    br.open(url)...

Extract the URL of a stored HTML file

python,urllib2,bs4
I have stored some HTML files and renamed them. Is there a way to extract the URL of an HTML file in Python? EDIT: I wish to find the URL of the .html file itself, not the links present in it. I am looking for a generalised approach...

BS4 and onclick(): how to make action?

python,django,bs4
In order to get the info, the user has to click this link: <a style="cursor:pointer;" onclick="startScorebot();">here</a> How can I get it clicked with bs4? Or maybe there is another solution?...
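BeautifulSoup only parses markup and does not execute JavaScript, so it cannot click the link itself. A small sketch of what it can do is shown here: locate the element and read the handler name from its `onclick` attribute. Actually running `startScorebot()` would need something that executes JavaScript, such as a browser automation tool, and that part is outside this sketch.

```python
from bs4 import BeautifulSoup

# The anchor exactly as quoted in the question.
html = '<a style="cursor:pointer;" onclick="startScorebot();">here</a>'

soup = BeautifulSoup(html, "html.parser")

# onclick=True matches any <a> that carries an onclick attribute.
link = soup.find("a", onclick=True)
handler = link["onclick"]
print(handler)
```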