Issue with parsing html with lxml by xpath

python,parsing,xpath,lxml,lxml.html
I am trying to parse data from a Google interactive website. It is rendered in JS, so I use Qt to load the site to parse from. I believe I have the site loaded and rendered properly, but for some reason I am getting an empty list returned to me...

How to use lxml to find all the src attributes and replace them

python,html,html-parsing,lxml,lxml.html
I want to use lxml to get the src contents and replace them with a space, but the body is still not replaced. Please help me. Thank you. import re import lxml.html # the content of source.log is web page source code I got with scrapy with open("source.log", "r") as bb: c_str =...

Capturing name in source page using xpath in python

python,xpath,lxml.html
I have the following page source: <input type="hidden" name="QQQ" value="AAA" /> <input type="hidden" name="WWW" value="BBB" /> <input type="hidden" name="EEE" value="CCC" /> <input type="hidden" name="WANTED" value="DDD" /> I want to extract the name WANTED, whose value is DDD, from that. What I tried is: token=tree.xpath('//input[@type="hidden"]/input[@value="DDD"]/@name') but it gives me QQQ...
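The expression in the question asks for an `input` nested inside another `input`, while the inputs shown are siblings. A minimal sketch that puts both conditions on the same element; the wrapping `<form>` is an assumption added so the snippet parses as a single fragment:

```python
import lxml.html

# The four inputs from the question, wrapped in a hypothetical <form>.
html = """
<form>
<input type="hidden" name="QQQ" value="AAA" />
<input type="hidden" name="WWW" value="BBB" />
<input type="hidden" name="EEE" value="CCC" />
<input type="hidden" name="WANTED" value="DDD" />
</form>
"""

tree = lxml.html.fromstring(html)

# Both predicates apply to the same <input>, then its name attribute is read.
names = tree.xpath('//input[@type="hidden"][@value="DDD"]/@name')
print(names)  # ['WANTED']
```

`xpath` returns a list, so the single token would be `names[0]` once the list is known to be non-empty.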

Get value using lxml

python,html,html-parsing,lxml,lxml.html
I have the following HTML: <div class="txt-block"> <h4 class="inline">Aspect Ratio:</h4> 2.35 : 1 </div> I want to get the value "2.35 : 1" from the content. However, when I try using lxml, it returns an empty string (I am able to get the 'Aspect Ratio' value, probably because that is...
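In lxml, text that follows an element's closing tag is stored on that element as its `tail`, not as `text` of the parent. A minimal sketch using the markup from the question; the variable names are illustrative:

```python
import lxml.html

html = '<div class="txt-block"> <h4 class="inline">Aspect Ratio:</h4> 2.35 : 1 </div>'

div = lxml.html.fromstring(html)
h4 = div.find('h4')

# The label is the element's own text; the value is the text after </h4>.
label = h4.text
value = h4.tail.strip()
print(label, value)  # Aspect Ratio: 2.35 : 1

# The same text node can also be reached with XPath.
via_xpath = div.xpath('./h4/following-sibling::text()')[0].strip()
```

If the `h4` might be the last node in its parent, `h4.tail` can be `None`, so a guard such as `(h4.tail or '').strip()` is safer.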

Scraping IMDb Review Page with lxml and requests package

python,lxml,lxml.html
I want to extract the user reviews of a particular movie with the help of lxml. Before that, I need to find out the number of reviews. An example review page is the one for Interstellar. I found the XPath where the user reviews are located with the help of Firebug: /html/body/div[1]/div/layer/div[4]/div[3]/div[3]/div[3]/table[2]/tbody/tr/td[2] I have...

Why is this lxml.etree.HTMLPullParser leaking memory?

python,lxml.html
I'm trying to use lxml's HTMLPullParser on Linux Mint but I'm finding that the memory usage keeps increasing and I'm not sure why. Here's my test code: # -*- coding: utf-8 -*- from __future__ import division, absolute_import, print_function, unicode_literals import lxml.etree import resource from io import DEFAULT_BUFFER_SIZE for _ in...

Parsing xpath with python

python,xpath,lxml,lxml.html
I'm trying to parse a web page that contains this: <table style="width: 100%; border-top: 1px solid black; border-bottom: 1px solid black;"> <tr> <td colspan="2" style="border-top: 1px solid black; border-bottom: 1px solid black; background-color: #f0ffd3;">February 20, 2015</td> </tr> <tr> <td style="border-top: 1px solid gray; font-weight: bold;">9:00 PM</td> <td style="border-top: 1px solid...