Python XML parsing, lxml, urllib.request

python,xml,lxml,urllib
I am a little stuck trying to parse an XML file retrieved from a URL. My goal is to get this XML file into a well-structured object so that I can easily retrieve its data. My current code results in the following error: >>> tree = etree.parse(data) Traceback (most recent call last):...
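
The traceback above is cut off, so the actual cause is not known. If `data` holds the raw bytes of the response, one thing worth checking is that etree.parse expects a filename, URL, or file-like object rather than the document itself. A minimal sketch under that assumption (the payload is made up and no network request is made; the import falls back to the standard library so the sketch runs without lxml):

```python
import io

try:
    from lxml import etree  # the calls below are ElementTree-compatible
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical payload standing in for urllib.request.urlopen(url).read()
data = b"<catalog><item id='1'>first</item><item id='2'>second</item></catalog>"

# parse() wants a file-like object, so wrap the bytes in BytesIO ...
tree = etree.parse(io.BytesIO(data))
root = tree.getroot()

# ... or build the root element directly from the bytes.
same_root = etree.fromstring(data)

# Collect the data into a plain dict for easy lookup.
items = {item.get("id"): item.text for item in root.iter("item")}
print(items)
```

Whether parse or fromstring fits better depends on whether the code already has a file-like response object or an in-memory byte string.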

LXML - parse td content within tr tag

python,html,python-3.x,lxml
I want to parse each individual statistic from the Yahoo Finance tables for formatting purposes; when parsing the entire table, the formatting is terrible! I am currently using the code below, and I would have to repeat the 4 lines of contentA code, slightly altered, to retrieve the stats...

python - find xpath of element containing string

python,html,xml,xpath,lxml
I built a small script that is supposed to find a specific string in a page and return the XPath of the element containing this string. The purpose is to use this XPath to find strings with the same context. I'm using this code: import requests from lxml import html page =...

Extracting between specific tag

python,lxml
I am able to extract all the tags with the following code. However, I don't know how to look at what is between the <script> and </script> tags. In particular, say I wanted just this part (there is more in between, but I am not interested in that): <script> var quoteDataObj...

How do I get all content between two html tags in Python?

python,xml,xpath,beautifulsoup,lxml
I am trying to extract all content (tags and text) from one main tag on an HTML page. For example: `my_html_page = ''' <html> <body> <div class="post_body"> <span class="polor"> <a class="p-color">Some text</a> <a class="p-color">another text</a> </span> <a class="p-color">hello world</a> <p id="bold"> some text inside p <ul> <li class="list">one li</li> <li>second li</li> </ul>...

how to install lxml with pypy in virtualenv

python,lxml,pypy
I am trying to use PyPy in a virtualenv for better performance when running my Python program. I was able to install all the required modules except lxml. So far, I have tried pip install lxml and also pip install --upgrade lxml. It shows the following message at the end:...

Namespace argument in lxml parsing

python,lxml
I have an html page that I am trying to parse. Here is what I'm doing with lxml: node=etree.fromstring(html) >>> node <Element {http://www.w3.org/1999/xhtml}html at 0x110676a70> >>> node.xpath('//body') [] >>> node.xpath('body') [] Unfortunately, all my xpath calls are now returning an empty list. Why is this occurring and how would I...
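
The repr above shows the root in the XHTML namespace, so unqualified names such as body match nothing. A minimal sketch of binding a prefix to that namespace (the prefix `x` and the sample document are arbitrary choices for illustration; the import falls back to the standard library so the sketch runs without lxml):

```python
try:
    from lxml import etree  # the calls below are ElementTree-compatible
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

XHTML_NS = "http://www.w3.org/1999/xhtml"
doc = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>hi</p></body></html>'
root = etree.fromstring(doc)

# Unqualified name: looks for <body> in no namespace, finds nothing.
unqualified = root.findall(".//body")

# Qualified name: map a prefix of our choosing to the XHTML namespace.
ns = {"x": XHTML_NS}
qualified = root.findall(".//x:body", namespaces=ns)

print(unqualified, [el.tag for el in qualified])
```

With lxml, the same mapping can be passed to xpath, as in node.xpath('//x:body', namespaces=ns).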

XSLT template does not apply to all elements using lxml

xml,xslt,lxml
I am confused about the use of XSLT templates and when/how they are applied. Suppose I have the following XML file: <book> <chapter> 1 </chapter> <chapter> 2 </chapter> </book> and I'd like to match all chapters in order. This is a XSLT stylesheet: <?xml version="1.0"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" > <xsl:template...

Find all HTML and non-HTML encoded URLs in string

python,html,regex,beautifulsoup,lxml
I would like to find all URLs in a string. I found various solutions on StackOverflow that vary depending on the content of the string. For example, supposing my string contained HTML, this answer recommends using either BeautifulSoup or lxml. On the other hand, if my string contained only a...

How to get full text inside lxml element

python,lxml
I have the following html: <span class="episode">Episode: <a href="/title/tt2071912/"> !Que ve el Bisbe!</a> (2011) </span> How would I get the year from this? When I get the episode object, it only gives me the 'text' before the <a>: result.cssselect('.episode')[0].text 'Episode: ' The best I have so far is: year =...
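
In the ElementTree data model that lxml follows, text that comes after a child element is stored on that child's tail, not on the parent's text. A minimal sketch using the snippet from the question (parsed here as XML for simplicity; the import falls back to the standard library so the sketch runs without lxml):

```python
try:
    from lxml import etree  # the calls below are ElementTree-compatible
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

snippet = ('<span class="episode">Episode: '
           '<a href="/title/tt2071912/"> !Que ve el Bisbe!</a> (2011) </span>')
span = etree.fromstring(snippet)

# The text after </a> lives on the <a> element's tail.
link = span[0]
year = link.tail.strip()

# itertext() walks text and tails in document order, giving the full text.
full_text = "".join(span.itertext())

print(repr(span.text), repr(year), repr(full_text))
```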

Parse HTML from local file

python,html,google-app-engine,lxml
I'm using Google App Engine with Python. I want to get the tree of an HTML file from the same project as my Python script. I tried many things, like using the absolute url (e.g. http://localhost:8080/nl/home.html) and the relative url (/nl/home.html). Neither seems to work. I use this code:...

type error saving lxml elements in shelve

python,lxml,pickle
I'm processing some xml files. pb_id is a string. page_elements is a list. pb_id = x.xpath('//pb/@xml:id')[0] page_elements = x.xpath('//@xml:id[preceding::pb]') I want to save these values in a shelve cache: s = shelve.open('cache.shelve') s[str(pb_id)] = page_elements But it returns this error: can't pickle _Element objects Do I need to cast page_elements...

lxml — how to isolate element from children

python,html,xml,html-parsing,lxml
Using lxml, I'd like to be able to get an HTML element and turn it into a string, excluding its children. How do I do this? Thanks...
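
One possible approach is to build a shallow copy that carries the tag, attributes, and leading text but none of the children, then serialize that copy. A minimal sketch (the sample markup is made up and parsed as XML; the import falls back to the standard library so the sketch runs without lxml):

```python
try:
    from lxml import etree  # the calls below are ElementTree-compatible
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical element with one child and text on both sides of it.
el = etree.fromstring('<div id="main">intro <p>child</p> outro</div>')

# Copy tag and attributes only; children are deliberately left behind.
shallow = etree.Element(el.tag, dict(el.attrib))
shallow.text = el.text  # text before the first child

markup = etree.tostring(shallow).decode()
print(markup)
```

Text that follows a child (" outro" here) is stored on the child's tail, so this sketch drops it; keeping it would mean appending each child's tail to the copy's text.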

LXML to write in unicode?

python,unicode,lxml
I am currently using lxml to write a file. I build the node and then I write it to a file using etree.tostring(node, pretty_print=True). However, it seems to be using htmlencoding -- <Synopsis> Abila schlie&#223;lich die ersten sechs Aufgaben zu meistern. Wird der Junge auch </Synopsis> In order to decipher...

Print Soap Body data using lxml

python,lxml
I have the following XML. I need to store the whole body XML from the SOAP request in a variable. <soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:cre="http://www.code.com/abc/V1/createCase"> <soapenv:Header><wsse:Security xmlns:wsse="http://docs.oasis-open.org/2" xmlns:wsu="http://docs.oasis-open.org/a.xsd"></wsse:Security> </soapenv:Header> <soapenv:Body xmlns:wsu="http://docs.oasis-open.org/30.xsd" wsu:Id="id-14"> <cre:createCase> <cre:Request>...

Scrape data using lxml python

python,json,dictionary,lxml
I'm trying to create a function which scrapes the league into a dictionary. However, it seems to add an array to the dictionary instead of just the string. Why is that? This is the HTML I'm trying to scrape: <fieldset> <legend align="center"> <a href="/dota2/events/297-the-summit-3">The Summit 3</a> </legend> </fieldset> Python get_league function. self.url...

Append parent to xml

python,lxml,elementtree
I want to add one more block to an XML file. Basically, under the parent Tss I want to create a subelement Entry with its attributes. Here is what I want to add to the XML file: <Entry> <System string = "rbs005019"/> <Type string = "SECURE"/> <User string = "rbs"/> <Password string = "rbs005019"/>...
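
Appending a child under an existing parent can be sketched with SubElement. The surrounding document below (a Config root holding Tss) is a made-up stand-in, and only the four children visible in the excerpt are created; the import falls back to the standard library so the sketch runs without lxml:

```python
try:
    from lxml import etree  # the calls below are ElementTree-compatible
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical surrounding document; only the Tss parent comes from the question.
root = etree.fromstring("<Config><Tss/></Config>")
tss = root.find("Tss")

# SubElement creates the child and appends it to the parent in one step.
entry = etree.SubElement(tss, "Entry")
for tag, value in [("System", "rbs005019"), ("Type", "SECURE"),
                   ("User", "rbs"), ("Password", "rbs005019")]:
    etree.SubElement(entry, tag, string=value)  # keyword becomes the attribute

print(etree.tostring(root).decode())
```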

Issue with parsing html with lxml by xpath

python,parsing,xpath,lxml,lxml.html
I am trying to parse data from a Google interactive website. It is rendered in JS, so I use Qt to load the site to parse from. I believe I have the site loaded and rendered properly, but for some reason I am getting an empty list returned to me...

Get inner text from lxml

python,lxml
lxml.html.fromstring insists on wrapping up everything in a tag (p default). From this tag tree, <p>this is <b>the</b> good stuff<p> I want to extract the string: this is <b>the</b> good stuff How do I do this?...

How to set up XPath query for HTML parsing?

python,xml,parsing,xpath,lxml
Here is some HTML code from http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0, as shown in Google Chrome, that I want to parse for a project. <div id="names"> <h2>Names and Synonyms</h2> <div class="ds"><button class="toggle1Col"title="Toggle display between 1 column of wider results and multiple columns.">&#8596;</button> <h3 id="yui_3_18_1_3_1434394159641_407">Name of Substance</h3> <ul> <li id="ds2"> `` <div>Acetaldehyde</div> </li> </ul>...

Why won't lxml strip section tags?

python,html,lxml
I'm trying to parse some HTML with lxml and Python. I want to remove section tags. lxml seems to be capable of removing all other tags I specify but not section tags. e.g. test_html = '<section> <header> Test header </header> <p> Test text </p> </section>' to_parse_html = etree.fromstring(test_html) etree.strip_tags(to_parse_html,'header') etree.tostring(to_parse_html)...

Creating xsd document from file download

python,amazon-s3,xsd,lxml
I am trying to load an xsd document that is stored on s3. It gives me the following error: >>> from lxml import etree >>> xsd_url = 'https://s3-us-west-1.amazonaws.com/premiere-avails/movie.xsd.xml' >>> node=etree.fromstring(requests.get(xsd_url).text) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "lxml.etree.pyx", line 3092, in lxml.etree.fromstring (src/lxml/lxml.etree.c:70473) File "parser.pxi",...

Xpath extract current node content including all child node

python,xpath,lxml
I've run into a problem while extracting the content of the current node including all its child nodes. As in the following code, I want to get the string abcdefg<b>b1b2b3</b> in the pre tag, but I could not use "child::*" to get it. If I use "/text()", I lose the b tag formatting information. Please help me out....

Gracefully recover from parse error in lxml

python,xml,parsing,exception,lxml
I want to continue parsing an invalid XML file, but capture the number of invalid files in a variable. Trying this: try: parser = etree.XMLParser(recover=False) tree = etree.parse(rawfile, parser=parser) print "Good XML!" except etree.XMLSyntaxError: parser = etree.XMLParser(recover=True) tree = etree.parse(rawfile, parser=parser) print "Bad XML!" misformattedXMLFile += 1 root = tree.getroot()...

lxml - how to get minimal xpath of element?

python,xpath,xml-parsing,html-parsing,lxml
tree.xpath("/exact/path/to/element") yields [<Element I want>]. exact/path/to/element is procured by a call to tree.getroottree().getpath(element). If I find the minimal xpath to the element with e.g. Firebug, tree.xpath("//@minimal-descriptor") yields [<Element I want>]. Question How do I get the minimal xpath from element using lxml, or other Python library?...

Python 3.4 : LXML web scraping

python,lxml
I am using the following code to try to return a list of tickers on that website. The result of the code is an empty list. I copy the xpath from google chromium developer tools. What am I doing wrong? from lxml import html import requests url = 'http://en.wikipedia.org/wiki/List_of_S%26P_500_companies' resp...

Japanese characters screwing up lxml parsing

python,lxml
How would I do the following in lxml? runtime_text = node.xpath("//dl/dt[text()=u'Runtime:' or text()=u'Laufzeit:' or text()=u'再生時間:']/following-sibling::dd")[0].text.strip() It works fine without the Kanji, but as soon as that line is added in, it fails with: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "lxml.etree.pyx", line 1498, in lxml.etree._Element.xpath...

how to use lxml find all the src tags and replace them

python,html,html-parsing,lxml,lxml.html
I want to use lxml to get the src contents and replace them with a space, but the body is still not replaced. Please help me. Thank you. import re import lxml.html #the content of source.log is a webpage source code I got by scrapy with open("source.log", "r") as bb: c_str =...

Installing lxml on RHEL for Python 2.7

python,lxml,easy-install
So the RHEL machine originally had Python 2.4. I installed Python 2.7. I want to install lxml module and I cannot use pip or setup tools due to proxy concerns. I used sudo yum install python-lxml It installed lxml for Python 2.4 which is the default. How can I make...

Extracting data from webpage using lxml XPath in Python

python,xpath,web-crawler,lxml,python-requests
I am having some unknown trouble when using xpath to retrieve text from an HTML page from lxml library. The page url is www.mangapanda.com/one-piece/1/1 I want to extract the selected chapter name text from the drop down select tag. Now I just want the first option so the XPath to...

How can I strip namespaces out of an lxml tree?

python,xml,lxml,xml-namespaces,prefix
Following on from Removing child elements in XML using python ... Thanks to @Tichodroma, I have this code: If you can use lxml, try this: import lxml.etree tree = lxml.etree.parse("leg.xml") for dog in tree.xpath("//Leg1:Dog", namespaces={"Leg1": "http://what.not"}): parent = dog.xpath("..")[0] parent.remove(dog) parent.text = None tree.write("leg.out.xml") Now leg.out.xml looks like this: <?xml...

Issue Parsing a site with lxml and xpath in python

python,xpath,lxml
I think I am messing up my xpath. What I am trying to do is get the information of each row on the table in this page. This is what I have so far but it's not outputting what I'm looking for. import requests from lxml import etree r =...

Python: specifying the namespace in an lxml.etree path

python,xml,svg,lxml,xml-namespaces
I'm trying to figure out how to access a specific element by id in an SVG file. I was using the python library of lxml to parse through the file, but it always comes up empty. Here is the python script I used to access the element: #!/usr/bin/env python from...

Parsing HTML tree in lxml : how can I retrieve the text inside the element?

python,html-parsing,lxml
I'm trying to retrieve the correct text inside an element. Here is the output: (Pdb) p etree.tostring(els[0]) '<h5 class="msg-delivered" style="padding:0;text-rendering:optimizeLegibility;line-height:1.1;margin-bottom:15px;-webkit-font-smoothing:antialiased;font-family:&quot;Open Sans&quot;, &quot;Helvetica Neue&quot;, Arial, Helvetica, sans-serif;color:#888888;vertical-align:middle;margin:0;font-size:13px;font-weight:300 !important">&#13;\n<i class="ic-icon-delivered"...

With Xpath, how do you select these elements but not those?

xpath,lxml,scrape
With a general XPath (or with specific functions of lxml in python), how do you select a set of elements that have a set of tags like this? <div class="cl1 a"> <div class="cl1 b"> but not <div class="cl1"> ...

Scraping IMDb Review Page with lxml and requests package

python,lxml,lxml.html
I want to extract the user reviews of a particular movie with help of lxml. Before that, I need to find out the number of reviews first. An example review page is Interstellar I found the XPath where User Reviews are found with the help of Firebug: /html/body/div[1]/div/layer/div[4]/div[3]/div[3]/div[3]/table[2]/tbody/tr/td[2] I have...

Parsing XML in Python using LXML

python,xml,lxml
tree = etree.parse("pinnacle_feed.xml") fdtime = tree.xpath('//rsp/fd/fdTime/text()') global lasttime lasttime = fdtime[0] for leagues in tree.getiterator('league'): leagueid = tree.xpath('//id/text()') for elt in leagues.getiterator('event'): startDateTime = elt.xpath('//startDateTime/text()') eventId = elt.xpath('//id/text()') homeTeam = elt.xpath('./homeTeam/name/text()') awayTeam = elt.xpath('./awayTeam/name/text()') homeTeamOdds = elt.xpath('./periods/period/moneyLine/homePrice/text()') awayTeamOdds =...

How to indent XML with lxml?

python,lxml
Suppose that I have created this XML document with lxml: from lxml import etree album=etree.Element("album") doc=etree.ElementTree(album) album.append(etree.Element("autor")) album.append(etree.Element("titulo")) album.append(etree.Element("formato")) album.append(etree.Element("localizacion")) album[0].text="album name" album[0].attrib["pais"]="ES" album[1].text="artist name" album[2].text="MP3" album[3].text="Varios CD5" How can I save this XML to file so that there is reasonable indentation?...

Python lxml - find tag block amend

python,xml,xml-parsing,tags,lxml
I have the below xml which I have opened and parsed and I now need to find the specific product block with the territory 'IE' and then amend its 'cleared_for_sale' and 'wholesale_price_tier' values, but am unsure how to do it. Here's what doesn't work: a = 0 territory = "IE"...

Case insensitive xpath [duplicate]

python,xpath,lxml
This question already has an answer here: Is it possible for lxml to work in a case-insensitive manner? 3 answers case-insensitive matching in xpath? 3 answers How would I match the following two items with one xpath? <locales> <locale name="nl-NL"> </locales> <locales> <locale name="NL-NL"> </locales> So far I have...

how to get unresolved entities from html attributes using python and lxml

python,html,python-2.7,lxml
When parsing HTML with python/lxml, I would like to retrieve the actual attribute text for html elements but instead, I get the attribute text with resolved entities. That is, if the actual attribute reads this &amp; that, I get back this & that. Is there a way to get the...

lxml - how to remove element but not its content?

python,lxml
Let's assume I have the following markup: <div id="first"> <div id="second"> <a></a> <ul>...</ul> </div> </div> Here's my code: div_parents = root_element.xpath('//div[div]') for div in reversed(div_parents): if len(div.getchildren()) == 1: # remove second div and replace it with its content I'm reaching divs with div children and then I want to remove...

Select by xpath knowing only ending of element's attribute

python,xml,xpath,web-scraping,lxml
Given an XML file like the one below, how can I select only the tag whose href attribute ends with parent, like the third element below? Determining it by position, as in elem = tree.findall('{*}CustomProperty')[2], does not fit because some documents might have only one parent href, others 5-10, and the third might not have such hrefs...

TypeError: unhashable type: 'list' while posting data

python,list,lxml
I want to post some parameters in order to log in to my page: session=requests.Session() cont=session.get('http://mywebsite.com/').content tree=html.fromstring(cont) token=tree.xpath[...] post_data={'A':'B', token:'1'} The last line gives me the error: TypeError: unhashable type: 'list' ...
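
For node-set queries, lxml's xpath returns a list of matches, and a list cannot be used as a dictionary key. A minimal sketch of the failure and one way around it, using a plain list as a stand-in for the xpath result (the XPath expression is elided in the question, and the token value here is hypothetical):

```python
# Stand-in for the result of tree.xpath(...): a list of matches.
token = ["csrf_token_name"]  # hypothetical value

error = None
try:
    # A list is unhashable, so using it as a dict key raises TypeError.
    post_data = {"A": "B", token: "1"}
except TypeError as exc:
    error = str(exc)

# Taking the first match gives a string, which is hashable.
post_data = {"A": "B", token[0]: "1"}

print(error)
print(post_data)
```

Indexing with [0] assumes the query matched at least once; an empty result would raise IndexError instead.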

Fix the location of attributes in etree.Element

python-2.7,lxml,xml.etree
I use the Python lxml package. I am wondering if someone knows how to output an element with fixed, specified positions for the attributes. MMain = etree.Element('DockingConfig', FormatVersion="8", InsideFill="True", InnerMinimum="20, 20", SavedAt="1/27/2014 2:01:47 PM") outfile.write(etree.tostring(MMain, pretty_print=True)) If I output this, it will sort the attributes alphabetically, which is not...

Obtaining position info when parsing HTML in Python

python,html,parsing,lxml,html5lib
I'm trying to find a way to parse (potentially malformed) HTML in Python and, if a set of conditions are met, output that piece of the document with the position (line, column). The position information is what is tripping me up here. And to be clear, I have no need...

Saving spider results to database

python,python-3.x,sqlalchemy,web-scraping,lxml
Currently thinking about a good way to save my scraped data into a database. App flow: Run spider (data scraper), file located in spiders/ When data has been collected successfully save the data/items (title, link, pubDate) to the database by use of the class in pipeline.py I would like your...

Scraping data python lxml

python,lxml
I'm trying to retrieve a specific string by scraping. However, it seems to return nothing. I'm using Python and lxml, but it does not seem to return the string inside the a tag. Here is the HTML I'm trying to retrieve: <fieldset> <legend align="center"> <a href="/counterstrike/events/302-cs-go-champions-league">CS:GO Champions League</a> </legend> </fieldset> Here is...

python lxml loop through all tags

python,xml,dictionary,lxml
I have a dict mapping each xml tag to a dict key. I want to loop through each tag and text field in the xml, and compare it with the associated dict key value which is the key in another dict. <2gMessage> <Request> <pid>daemon</pid> <emf>123456</emf> <SENum>2041788209</SENum> <MM> <MID>jbr1</MID> <URL>http://jimsjumbojoint.com</URL> </MM>...
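
Walking every element and looking its tag up in a mapping can be sketched with iter(). The mapping values below are hypothetical, and the root is renamed to Message because an XML element name cannot start with a digit, so <2gMessage> would not parse as written; the import falls back to the standard library so the sketch runs without lxml:

```python
try:
    from lxml import etree  # the calls below are ElementTree-compatible
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

doc = ("<Message><Request><pid>daemon</pid><emf>123456</emf>"
       "<MM><MID>jbr1</MID></MM></Request></Message>")

# Hypothetical tag -> key mapping; tags not listed here are skipped.
tag_to_key = {"pid": "process", "emf": "serial", "MID": "merchant"}

root = etree.fromstring(doc)
found = {}
for el in root.iter():  # depth-first, document order, root included
    key = tag_to_key.get(el.tag)
    if key is not None and el.text:
        found[key] = el.text.strip()

print(found)
```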

Parsing xpath with python

python,xpath,lxml,lxml.html
I'm trying to parse a web page that contains this: <table style="width: 100%; border-top: 1px solid black; border-bottom: 1px solid black;"> <tr> <td colspan="2" style="border-top: 1px solid black; border-bottom: 1px solid black; background-color: #f0ffd3;">February 20, 2015</td> </tr> <tr> <td style="border-top: 1px solid gray; font-weight: bold;">9:00 PM</td> <td style="border-top: 1px solid...

Print only not null values

python,python-2.7,xpath,lxml
I am trying to print only not null values but I am not sure why even the null values are coming up in the output: Input: from lxml import html import requests import linecache i=1 read_url = linecache.getline('stocks_url',1) while read_url != '': page = requests.get(read_url) tree = html.fromstring(page.text) percentage =...

How to efficiently extract content from an xml with python?

python,xml,python-2.7,pandas,lxml
I have the following xml: <?xml version="1.0" encoding="UTF-8" standalone="no"?><author id="user23"> <document><![CDATA["@username: That boner came at the wrong time ???? http://t.co/5X34233gDyCaCjR" HELP I'M DYING ]]></document> <document><![CDATA[Ugh ]]></document> <document><![CDATA[YES !!!! WE GO FOR IT. http://t.co/fiI23324E83b0Rt ]]></document> <document><![CDATA[@username Shout out to me???? ]]></document> </author> What is the most efficient...

crawling imdb database using python and lxml

python,web-crawler,lxml
hxs = lxml.html.document_fromstring(requests.get("http://www.imdb.com/title/" + id).content) movie = {} try: movie['title'] = hxs.xpath('//*[@id="overview-top"]/h1/span[1]/text()')[0].strip() except IndexError: movie['title'] I am not able to understand the meaning of "hxs.xpath('//*[@id="overview-top"]/h1/span[1]/text()')[0].strip()"...

how to write the opening of an xml doc in lxml?

python,xml,lxml,cxml
I'm using lxml to write out a cXML file, but I can't figure out how to get it to write out the opening <?xml version="1.0" encoding="UTF-8"?> along with the doctype following it. When I started this, I started straight in on the document itself, with the first Element being cXML...

lxml not parsing unicode properly for HTML

python,unicode,lxml
I am trying to parse HTML, but unfortunately lxml is not allowing me to grab the actual text: node = lxml.html.fromstring(r.content) self.fingerprint['Title'] = node.cssselect('.document-title div')[0].text print '@@####', self.fingerprint['Title'] # @@#### Démineurs What do I need to do to correctly parse this text? Here is the web page: https://play.google.com/store/movies/details/D%C3%A9mineurs?id=KChu8wf5eVo&hl=fr and the...

Join consecutive HTML tags of the same kind, same CSS class

html,xslt,lxml
I am trying to process several HTML files that were automatically generated and I'm in a situation where I need to join consecutive span elements of the same class. The class is more or less known a priori. Edit #1: Example input #1: <p class='sC8420256'> <span class="s32A37344">OPINION EN PARTIE...

Get value using lxml

python,html,html-parsing,lxml,lxml.html
I have the following html: <div class="txt-block"> <h4 class="inline">Aspect Ratio:</h4> 2.35 : 1 </div> I want to get the value "2.35 : 1" from the content. However, when I try using lxml, it returns an empty string (I am able to get the 'Aspect Ratio' value, probably because that is...

URL with Ukrainian characters giving UnicodeEncodeError

python,character-encoding,lxml
I'm trying to extract dictionary entry: url = 'http://www.lingvo.ua/uk/Interpret/uk-ru/вікно' # parsed_url = urlparse(url) # parameters = parse_qs(parsed_url.query) # url = parsed_url._replace(query=urlencode(parameters, doseq=True)).geturl() page = urllib.request.urlopen(url) pageWritten = page.read() pageReady = pageWritten.decode('utf-8') xmldata = lxml.html.document_fromstring(pageReady) text = xmldata.xpath(//div[@class="js-article-html g-card"]) either with commented lines on or off, it keeps getting an error:...
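
The error text is cut off, but the title points at the non-ASCII characters in the URL path. One way to handle that is to percent-encode the word before opening the URL. A minimal sketch that only builds the URL and makes no request:

```python
from urllib.parse import quote, unquote

base = "http://www.lingvo.ua/uk/Interpret/uk-ru/"
word = "вікно"

# quote() encodes the text as UTF-8 and percent-escapes each byte,
# leaving a URL that contains ASCII characters only.
url = base + quote(word)

print(url)
print(unquote(url))  # round-trips back to the readable form
```

The resulting string can then be passed to urllib.request.urlopen in place of the original URL.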

How to merge two different paths in a XML file?

python,xml,parsing,xml-parsing,lxml
This is my xml file: <File> <Paths> <Path> <Node> <NodeName>Initial_Node</NodeName> <InnerNode> <Signal>Test_sig</Signal> <InnerNode> <Signal>Test_sig_1</Signal> <NodeRef>Ref0</NodeRef> </InnerNode> </InnerNode> </Node> </Path> <Path> <Node> <NodeName>Name1</NodeName> <InnerNode> <Signal>Test_sig_0</Signal> <InnerNode> <Signal>Test_sig_2</Signal>...

python lxml.html.parse not reading url

python,lxml,python-requests
Why is html.parse(url) failing, when using requests then html.fromstring works and html.parse(url2) works? lxml 3.4.2 Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32 Type "copyright", "credits" or "license()" for more information. >>> import requests >>> from lxml import html >>> url = 'http://www.oddschecker.com' >>>...

Merging xpath generated lists, Python, lxml

python,xml,list,xpath,lxml
I am currently scraping an xml file for some data and I need help merging 3 lists with values generated from 3 xpath calls. Code: import urllib.request import lxml.etree as ET opener = urllib.request.build_opener() tree = ET.parse(opener.open('https://nordfront.se/feed')) The data I am interested in: >>> tree.xpath("/rss/channel/item/title/text()") ['Klistermärkesuppsättning i Trollhättan', 'Dror Feiler...
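
If the three lists line up item by item, zip is one way to merge them. A minimal sketch with made-up lists standing in for the three xpath results (only the title query appears in the excerpt, so the link and date lists are assumptions):

```python
# Hypothetical stand-ins for the three lists returned by tree.xpath(...)
titles = ["title one", "title two"]
links = ["http://example.com/1", "http://example.com/2"]
dates = ["Mon, 01 Jun 2015", "Tue, 02 Jun 2015"]

# zip pairs the lists up by position: one tuple per feed item.
merged = list(zip(titles, links, dates))

# Or keep the field names by building one dict per item.
records = [{"title": t, "link": l, "date": d} for t, l, d in merged]

print(merged)
print(records)
```

zip matches purely by position and stops at the shortest list, so this only holds when every item has all three fields; otherwise, iterating over each item element and querying relative to it is safer.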

Close open tags without wrapping in

python,django,lxml
I am trying to write a clean method for my form so that if a user leaves an open in-line tag, it will be closed, e.g.: Here's an <b> open tag -> Here's an <b> open tag</b> <i>Here are two <b>open tags -> <i>Here are two <b> open tags</b></i> My...

Insert xml node in specific location

python,xml,lxml
I would like to build the following xml: <Item> <Name>Hello</Name> <Date>2014-01-01</Date> <Hero>1</Hero> </Item> Given the following code structure, how would I insert the <Date> node before the hero node? item = etree.SubElement(self.xml_node, 'Item') etree.SubElement(item, 'Name').text = 'Hello' etree.SubElement(item, 'Hero').text = 1 # Now, how to insert the 'Date' element before...
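
Elements behave like lists of their children, so a node can be placed at a given position with insert(). A minimal sketch that builds a standalone Item instead of attaching it to self.xml_node (the import falls back to the standard library so the sketch runs without lxml):

```python
try:
    from lxml import etree  # the calls below are ElementTree-compatible
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

item = etree.Element("Item")
etree.SubElement(item, "Name").text = "Hello"
etree.SubElement(item, "Hero").text = "1"  # .text must be a string, not an int

date = etree.Element("Date")
date.text = "2014-01-01"

# Find where Hero currently sits and insert Date at that position,
# which pushes Hero one slot to the right.
hero = item.find("Hero")
item.insert(list(item).index(hero), date)

print(etree.tostring(item).decode())
```

lxml additionally offers hero.addprevious(date), which does the same thing without computing an index; that method is specific to lxml and is not part of the standard library's ElementTree.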

lxml: I can't remove a span tag and the text inside

html,lxml
I have an html file with some divs like this (a lot simplified): <div num="1" class="class1"> <div class="class1-text"> <span class="class2"> <span class="class3"> some chinese text </span> some english text </span> </div> </div> I'm trying to remove all the Chinese text by removing the span node that contains it with lxml:...

Python 3.4 : LXML : Parsing Tables

python,python-3.x,datatables,lxml
I want to parse an entire table from yahoo finance. As I understand it 'tbody' and 'thead' tags are not registered by lxml but rather as additional TR so I switched the xpath from: /html/body/div[4]/div[4]/table[2]/tbody/tr[2]/td/table[2]/tbody/tr/td/table/tbody to what is seen in the code below url = 'http://finance.yahoo.com/q/is?s=MMM+Income+Statement&annual' tree = html.parse(url) tick_content...

How to add spaces between nodes when using string() on a tree in XPath

html,xslt,xpath,lxml
I have an HTML tree where I use the 'string()' query on the root to get all the text from the nodes. However, I'd like to add a space between each node. I.e., string() on '<root><div>abc</div><div>def</div></root>' currently becomes 'abcdef', whereas string() on '<root><div>abc</div><div>def</div></root>' should become 'abc def '...
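
XPath's string() simply concatenates the text nodes, so one alternative is to collect the text pieces in Python and join them with a space. A minimal sketch using the example from the question (the import falls back to the standard library so the sketch runs without lxml):

```python
try:
    from lxml import etree  # the calls below are ElementTree-compatible
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

root = etree.fromstring("<root><div>abc</div><div>def</div></root>")

# itertext() yields each text node in document order.
pieces = list(root.itertext())

# Joining with a space puts a separator between nodes (none at the end).
joined = " ".join(pieces)

print(pieces, repr(joined))
```

This gives 'abc def' without the trailing space shown in the question; appending one afterwards is trivial if it is really wanted.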

Convert XML to string to find length

python,lxml
I am trying to find the length of an XML document and would like to know how to convert an XML document into a string so that I can find its length.
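
Serializing with tostring gives a byte string whose length can be measured directly. A minimal sketch with a made-up document (the import falls back to the standard library so the sketch runs without lxml):

```python
try:
    from lxml import etree  # the calls below are ElementTree-compatible
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

root = etree.fromstring("<a><b>x</b></a>")

# tostring() returns bytes by default; len() then counts bytes.
serialized = etree.tostring(root)
byte_length = len(serialized)

# Decode first if the length in characters is wanted instead.
char_length = len(serialized.decode("utf-8"))

print(serialized, byte_length, char_length)
```

The two lengths differ once the serialized form contains non-ASCII characters, and both depend on the encoding passed to tostring.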

Pandas read_html equivalent for a lxml table

python,pandas,lxml
Hi I have about 10 tables which I have used lxml to classify. >>>import pandas as pd >>>import lxml >>>root = lxml.etree.HTML(htmlcontent) >>>tables = root.findall('.//*[@id="info-container"]/table') >>>readabletables = tables[::2] >>>len(readabletables) = 5 >>>readabletables[0] <Element table at 0x105241e60> I want these 5 tables to be read and interpreted by pandas just like...

python install lxml on mac os 10.10.1

python,osx,python-2.7,scrapy,lxml
I bought a new MacBook and I am new to Mac OS. I read a lot on the internet about how to install Scrapy and did everything, but I have a problem installing lxml. I tried this in the terminal: pip install lxml, and a lot of stuff started...

lxml startswith for xpath

python,lxml
How would I get the following (using the % for a LIKE statement) -- assets['HasEN'] = self.node.xpath('//data_file[@role="source"]/locale[@name="en%"]') In other words, the name could be en, it could be en-US, it could be en-GB, etc. Is there a way to do that with lxml or do I have to do that...

Combine multiple tags with lxml

python,html,xpath,lxml
I have an html file which looks like: ... <p> <strong>This is </strong> <strong>a lin</strong> <strong>e which I want to </strong> <strong>join.</strong> </p> <p> 2. <strong>But do not </strong> <strong>touch this</strong> <em>Maybe some other tags as well.</em> bla bla blah... </p> ... What I need is, if all the tags...

Python 2.7 Etree/lxml minimizing [duplicate]

python,xsd,formatting,lxml
This question already has an answer here: Close a tag with no text in lxml 3 answers I'm using lxml/Etree to parse and write to XSD documents. I have the basic structure tree = ET.parse('file.xsd') # do stuff tree.write('output.xsd') But tags get minimized in some instances, for example: <Cars>...

SOLVED: Installing lxml, libxml2, libxslt on Windows 8.1

python,windows,module,installation,lxml
After additional exploration, I found a solution to installing lxml with pip and wheel. Additional comments on approach welcomed. I'm finding the existing Python documentation for Linux distributions excellent. For Windows... not so much. I've configured my Linux system fine but I need some help getting a Windows 8.1 tablet...

Parsing XML using LXML and Python

python,xml,lxml
I have read the other questions on stack overflow but am still unsure how to proceed. I have simplified the XML document for ease of reading. <event> <time>2015-01-30T08:59:00Z</time> <homeTeam type="Team1"> <name>United Arab Emirates</name> </homeTeam> <awayTeam type="Team2"> <name>Iraq</name> </awayTeam> <periods> <period lineId="168809488"> <number>0</number> <description>Match</description>...