How to extract the text between anchor tag in PHP?

php,html,string,html-parsing,anchor
I have a string in a variable named $message, as follows: $message = 'posted an event in <a href="http://52.1.47.143/group/186/">TEST PRA</a>'; I only want to get the text within the anchor tag, i.e. TEST PRA in this case, using PHP. How should I do this efficiently? Can someone please...

getting specific images from page

python,html,web-scraping,beautifulsoup,html-parsing
I am pretty new with BeautifulSoup. I am trying to print image links from http://www.bing.com/images?q=owl: redditFile = urllib2.urlopen("http://www.bing.com/images?q=owl") redditHtml = redditFile.read() redditFile.close() soup = BeautifulSoup(redditHtml) productDivs = soup.findAll('div', attrs={'class' : 'dg_u'}) for div in productDivs: print div.find('a')['t1'] #works fine print div.find('img')['src'] #This getting issue KeyError: 'src' But this gives only...

Jsoup: Extracting innertext from anchor tag

java,html,html-parsing,jsoup
Here's my problem. I have HTML code like this: <div> <a href="#"> innerText </a> </div> I need to extract the "innerText". While trying this in Jsoup I found that the inner text goes outside the anchor tag when parsed by Jsoup. Here's my code: Document doc=Jsoup.parse("<div> <a href="#"> innerText </a>...

Python Regex matching string between abcd="_blank"> and </a>

python,html,regex,python-2.7,html-parsing
How can I match strings between abcd="_blank"> and </a> using Regex in Python 2.7? For example, for abcd="_blank">ABBA</a> the result should be ABBA.
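A minimal sketch of one way to do this, assuming the marker abcd="_blank"> appears literally and the link text contains no nested </a>. The function name extract_link_texts and the sample string are made up for illustration. A non-greedy group stops at the first closing tag:

```python
import re

# Non-greedy capture of everything between the literal marker and the next </a>.
PATTERN = re.compile(r'abcd="_blank">(.*?)</a>', re.DOTALL)

def extract_link_texts(html):
    """Return every string found between abcd="_blank"> and </a>."""
    return PATTERN.findall(html)

# Hypothetical sample input, not taken from the question.
sample = '<a href="#" abcd="_blank">ABBA</a> and <a abcd="_blank">Queen</a>'
print(extract_link_texts(sample))  # ['ABBA', 'Queen']
```

The pattern uses only basic syntax, so it should behave the same on Python 2.7 and Python 3. For arbitrary real-world HTML a parser is usually more robust than a regex.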

Beautifulsoup no img

python,html,python-2.7,beautifulsoup,html-parsing
I'm trying to code up a script in python 2.7 using bs4 to scrape the images and rename the files to my server and display it in a low bandwidth friendly manner, and update it on cronjobs every 3hrs by overwriting the existing images. The problem in my code is...

JSoup how to parse table 3 rows

java,html-parsing,jsoup
I have a table like this that I want to parse to get the data-code value of row.id and the second and third columns of the table. <table> <tr class="id" data-code="100"> <td></td> <td>18</td> <td class="name">John</td> <tr/> <tr class="id" data-code="200"> <td></td> <td>21</td> <td class="name">Mark</td> <tr/> </table> I want to print out....

Extracting the main product image from an e-commerce product page

magento,html-parsing,bigcommerce,html-parser,image-extraction
I am looking for options to extract the main image from a product page on a retailer website. The problem is that there are multiple images on a product page (related images). One approach I thought would work is to extract all the image links, download each one of...

Regular expression for class using Beautifulsoup

python,html,regex,beautifulsoup,html-parsing
I am using BeautifulSoup for easy scraping. I have found that there are more than 5 divs in the webpage that I want to scrape. Their names are different but follow a pattern. These divs are: divnewthing, divnew, divnewstring, etc. So the pattern is a divnew* kind of regular expression. And I am...

Python-Beautiful Soup not parsing entire unordered list

python,html,web-scraping,beautifulsoup,html-parsing
I am trying to scrape a website and there is one part that is just baffling me. There is an unordered list of locations served by organizations and I can't seem to parse the entire list. Here is an example of what the HTML looks like: <div id="current_tab"> <p class="view_label_type_geoserved" id="view_label_field_geoserved">Geographies...

How to remove parent tag with BeautifulSoup

python,beautifulsoup,html-parsing
I am trying to remove the header cells from an HTML table using BeautifulSoup. I have something like: <tr> <th> head1 </th> <th> head2 </th> </tr> I am using the following code to remove all the header cells: soup = BeautifulSoup(url) for headless in soup.find_all('th'): headless.decompose() This works great, except...

Evaluate image in html table using Python

python,html,beautifulsoup,html-parsing
I am trying to parse a table and save it into a CSV file. However, some of the cells are images (*.gif) of a checkmark and I am unsure how to evaluate them when exporting to CSV. Here is some HTML code: <BODY> <TABLE> <TH> <H3> <BR>TABLE 1 </H3> </TH> <TR>...

Parsing html using Xpath in python

python,html,xpath,html-parsing
I have the HTML below, which I was trying to parse using XPath, but I only get an empty string in return. Can anyone please tell me where I am mistaken? I have tried everything but couldn't succeed. XPath code for label: divLbl=ch.xpath("//div[@class='left-container']/article/ul[@class='list-unstyled row']/li[@class='col-sm-6 mrg-bottom']/span[@class='text-light']") XPath code for value...

Jsoup: take text and url

java,android,html,html-parsing,jsoup
I've this HTML block: <div class="singolo-contenuto link_azure"> <p>I'm a TEXTXXXXXXXXXXXXXXXX<p> <a href="http://example.com">Name of URL</a></p></p> <ul class="list_attachments"><li><a href="DON'T TOUCH"><img src='/img/fileicons/file.png' alt='file'/> TITLE</a></li></ul> </div> <div class="clear"></div> Actually I'm taking text with: document.select(".singolo-contenuto").text(); That returns to me: "I'm a TEXTXXXXXXXXXXXXXXXX...

Get data from web pages and save locally

wpf,wpf-controls,html-parsing
I am starting to learn WPF and there is something that is still unclear to me: I want to create an app that can get information from online websites (e.g. news). How can I parse data from pages? And a second question, which is connected to the first one: how can I save user...

How to get images from a saved html page

html,ruby,html-parsing,nokogiri
I have a huge number of saved HTML pages on my PC. I parsed the HTML page and got the image src. I need to store the images from every HTML page in a specific structure in a separate directory. I tried out Net::HTTP.get but I am getting a...

Simplexml: parsing HTML leaves out nested elements inside an element with a text node

php,xml,parsing,html-parsing,simplexml
I'm trying to parse a specific HTML document, some sort of a dictionary, with about 10000 words and descriptions. It went well until I noticed that entries in a specific format don't get parsed well. Here is an example: <?php $html = ' <p> <b> <span>zot; zotz </span> </b> <span>Nista; nula....

URL as a link in listview in android

android,listview,hyperlink,html-parsing
I am new to Android and I am trying to contribute to an Android project. There is a ListView of messages, and when a user clicks on a specific message, another view opens and shows the details, as shown in the image below (Android view). All this information is being fetched from web...

BeautifulSoup4 missing tags

python,html,beautifulsoup,html-parsing
I'm using BeautifulSoup 4 under Anaconda's distribution as bs4. Correct me if I'm wrong - my understanding is that BeautifulSoup is a library for transforming ill-formed HTML into well-formed HTML. But when I pass HTML to its constructor, I lose more than half of its characters. Shouldn't it be only fixing HTML and...

Regular expression: Identify all html tag except <tr>, <td>, </td>, or </tr>

java,regex,html-parsing
I would like to ask how can I remove all html tags except <tr>, <td>, </td>, or </tr> I can - Identify all html tag using <.*?> - Identify <tr>, <td>, </td>, </tr> using ^((?!<tr>)(?!<td>)(?!</td>)(?!</tr>).)*$. But I just do not know how to combine both criteria into one. Thank you...
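The two criteria can be combined by moving the exclusion into a negative lookahead placed right after the opening <. The sketch below is written in Python for brevity; the pattern itself uses only lookahead and \b, which Java's regex engine also supports, so it should port to String.replaceAll with the backslash doubled. The function name strip_tags_except_rows and the sample input are illustrative assumptions.

```python
import re

# Match any tag whose name is not exactly tr or td (opening or closing form).
# The \b keeps tags such as <track> or <tdata> from being mistaken for tr/td.
OTHER_TAGS = re.compile(r'<(?!/?(?:tr|td)\b)[^>]*>', re.IGNORECASE)

def strip_tags_except_rows(html):
    """Remove every tag except <tr>, </tr>, <td> and </td>; keep the text."""
    return OTHER_TAGS.sub('', html)

# Hypothetical sample input.
sample = '<table><tr><td><b>18</b></td><td><i>John</i></td></tr></table>'
print(strip_tags_except_rows(sample))  # <tr><td>18</td><td>John</td></tr>
```

This treats everything from < to the next > as a tag, so comments or attribute values containing > would need a real HTML parser.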

Can't get value from xpath python

python,html,xpath,web-scraping,html-parsing
I want to get values from page: http://www.tabele-kalorii.pl/kalorie,Actimel-cytryna-miod-Danone.html I can get all values from first section, but I can't get values from table "Wartości odżywcze" I use this xpath: ''.join(tree2.xpath("//html/body/div[1]/div[3]/article/div[2]/div/div[4]/div[3]/div/div[1]/div[3]/table[1]/tr[3]/td[2]/span/text()")) But I'm not getting anything. With xpath like this: ''.join(tree2.xpath("//html/body/div[1]/div[3]/article/div[2]/div/div[4]/div[3]/div/div[1]/div[3]/table[1]/tr[3]/td[2]//text()")) I'm...

Searching Large String for file path. Return filepath + filename

python,regex,string,beautifulsoup,html-parsing
I've got a little project where I’m trying to download a series of wallpapers from a web page. I'm new to python. I'm using the urllib library, which is returning a long string of web page data which includes <a href="http://website.com/wallpaper/filename.jpg"> I know that every filename I need to download...

Get value using lxml

python,html,html-parsing,lxml,lxml.html
I have the following html: <div class="txt-block"> <h4 class="inline">Aspect Ratio:</h4> 2.35 : 1 </div> I want to get the value "2.35 : 1" from the content. However, when I try using lxml, it returns an empty string (I am able to get the 'Aspect Ratio' value, probably because that is...

How to use lxml to find all the src tags and replace them

python,html,html-parsing,lxml,lxml.html
I want to use lxml to get the src contents and replace them with a space, but the body is still not replaced. Please help me. Thank you. import re import lxml.html #the content of source.log is a webpage source code I got by scrapy with open("source.log", "r") as bb: c_str =...

Best approach for parsing html string Programmatically without appending to DOM

javascript,html-parsing
I am currently building a parser and compiler to handle my own custom 2-way data bindings, everything is working fine however it involves me appending my template string via .innerHTML which is not very efficient. When parsing I need to access certain DOM methods such as .getElementsByTagName('*') and perform other...

How to create <div> for h1,h2 and so on using DOM parser in php?

php,html,dom,html-parsing
HTML <h1>heading 1</h1> <h2>heading 2</h2> <h1>heading 1</h1> <h2>heading 2</h2> <h3>heading 3</h3> Expected output <div class="sect1"> <h1>heading 1</h1> <div class="sect2"> <h2>heading 2</h2> </div> </div> <div class="sect1"> <h1>heading 1</h1> <div class="sect2"> <h2>heading 2</h2> <div class="sect3"> <h3>heading 3</h3> </div> </div> </div> I need to wrap h...

HTML tables with python beautiful soup

python,html,beautifulsoup,html-parsing,scrapy
I have an HTML table which looks like this: <table border=0 cellspacing=1 cellpadding=2 class=form> <tr class=form><td class=formlabel>Heating Coils in Bunker Tanks</td><td class=form>N</td></tr> <tr class=forma><td class=formlabel>Heating Coils in Cargo Tanks</td><td class=form>U</td></tr> <tr class=form><td class=formlabel>Manifold Type</td><td class=form>N</td></tr> <tr class=forma><td class=formlabel>No....

Regex strip all html except background style url

regex,html-parsing
I have the following regex that will find all the background style URLs in my HTML. I'm trying to strip all the HTML except for the background image URLs. My goal is to extract a list of background image URLs from my HTML page. Expression URL\(\s*(['"]?)(.*?)\1\s*\) Example HTML <a href="#"><img...

UTF Encoding in selenium webdriver

python,selenium,utf-8,selenium-webdriver,html-parsing
I currently have the following: from selenium import webdriver d = webdriver.Chrome() # request the url and get the page contents title = result.find("span", {"class": "episode"}).find("a").text However, the 'text' that is returned to me is: # Note the truncation on the word "envol" <td class="title"><a href="/title/tt1844708/">La grande envol</a></td> However, when...

Design regex expression for this html

html,regex,string,vbscript,html-parsing
In this HTML line: <b>Cash Out: </b> 2.46x </p> <p> <b>Played: </b>Sat Ene 12 2015 00:20:13 How could I match the 2.46x value? This is the expression that I've tried to design: "(cash.+out.+\s+)([\d\.]+[^\s])" PS: I know that the RegEx engine was not designed to parse HTML, but anyway it can be...

Regex Match All Characters Between Tags on nth occurrence

regex,html-parsing
I need to match text between two tags, but starting at a specific occurrence of the tag. Imagine this text: Some long <br> text goes <br> here. And some <br> more can <br> go here.<br> In my example, I would like to match here. And some. I successfully matched the...
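One simple way to target the n-th occurrence is to split on the tag and index into the result instead of building a single counting regex. A minimal sketch, assuming the separator is a plain <br> (optionally written <br/> or <br />) and that surrounding whitespace should be trimmed. The function name segment_after_nth_br is hypothetical.

```python
import re

# Accept <br>, <br/> and <br /> in any letter case.
BR = re.compile(r'<br\s*/?>', re.IGNORECASE)

def segment_after_nth_br(text, n):
    """Return the text between the n-th <br> (1-based) and the next <br> or end of text."""
    parts = BR.split(text)
    if n < 1 or n >= len(parts):
        raise ValueError("text has fewer than %d <br> tags" % n)
    return parts[n].strip()

text = "Some long <br> text goes <br> here. And some <br> more can <br> go here.<br>"
print(segment_after_nth_br(text, 2))  # here. And some
```

Splitting keeps the occurrence count as an ordinary integer argument, which tends to be easier to read and change than a quantified group such as (?:.*?<br>){2}.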

Not sure how to parse this

python,html,beautifulsoup,html-parsing
<div class="meaning"><span class="hinshi">[副]</span>物事の重点・大勢を述べるときに用いる。</div> All I need from this is おもに。もっぱら。物事の重点・大勢を述べるときに用いる. Usually the hinshi class is separate from the sentences I'm trying to parse, but for some of them they seem to be combined together. Is there any way to just print the sentence while ignoring the [副]?...

Parsing a HTML table in Perl

html,perl,table,html-parsing
I am trying to parse the following HTML table: <table cellspacing="0" border="1" width="100%"> <tr bgcolor="#d0d0d0"> <th style="COLOR: #ff6600">number</th> <th style="COLOR: #ff6600">id</th> <th style="COLOR: #ff6600">result</th> <th style="COLOR: #ff6600">reason</th> </tr> <tr> <td>1027</td> <td><a href="<url>">21cs_337</a></td> <td>0</td> <td>catch-all caught </td> <td>reason</td>...

Can't find node using HTMLAgilityPack

c#,html,html-parsing
I have used the code sample from the following video: https://youtu.be/8e3Wklc1H_A The code looks like this: var webGet = new HtmlWeb(); var doc = webGet.Load("http://pastebin.com/raw.php?i=gF0DG08s"); HtmlNode OurNone = doc.DocumentNode.SelectSingleNode("//div[@id='footertext']"); if (OurNone != null) richTextBox1.Text = OurNone.InnerHtml; else richTextBox1.Text = "nothing found"; I thought at first that the original website might be...

Python Beautiful Soup Web Scraping Specific Numbers

python,html,web-scraping,beautifulsoup,html-parsing
On this page the final score (number) of each team has the same class name class="finalScore". When I call the final score of the away team (on top) the code calls that number without a problem. If ... favLastGM = 'A' When I try to call the final score of...

The second row and third row should be a single row

python,html,pandas,beautifulsoup,html-parsing
from bs4 import BeautifulSoup import urllib2 from lxml.html import fromstring import re import csv import pandas as pd wiki = "http://en.wikipedia.org/wiki/List_of_Test_cricket_records" header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia req = urllib2.Request(wiki,headers=header) page = urllib2.urlopen(req) soup = BeautifulSoup(page) try: table = soup.find_all('table')[1] except AttributeError as e:...

lxml - how to get minimal xpath of element?

python,xpath,xml-parsing,html-parsing,lxml
tree.xpath("/exact/path/to/element") yields [<Element I want>]. exact/path/to/element is procured by a call to tree.getroottree().getpath(element). If I find the minimal xpath to the element with e.g. Firebug, tree.xpath("//@minimal-descriptor") yields [<Element I want>]. Question How do I get the minimal xpath from element using lxml, or other Python library?...

Does HTML(5) ignore graphemes?

html,html5,unicode,utf-8,html-parsing
A grapheme is the smallest "unit" in writing. In English, we normally just think of the characters A-Z, but other languages have accents. UTF allows you to add accents to characters to form a grapheme. There's a generalized algorithm that lets you break a sequence of UTF code points into...

BeautifulSoup is not getting all data, only some

python,html,web-scraping,beautifulsoup,html-parsing
import requests from bs4 import BeautifulSoup def trade_spider(max_pages): page = 0 while page <= max_pages: url = 'http://orangecounty.craigslist.org/search/foa?s=' + str(page * 100) source_code = requests.get(url) plain_text = source_code.text soup = BeautifulSoup(plain_text) for link in soup.findAll('a', {'class':'hdrlnk'}): href = 'http://orangecounty.craigslist.org/' + link.get('href') title = link.string print title #print href get_single_item_data(href) page...

PHP split or explode string on tag

php,regex,split,html-parsing,explode
I would like to split a string on a tag into different parts. $string = 'Text <img src="hello.png" /> other text.'; The following function doesn't work correctly yet. $array = preg_split('/<img .*>/i', $string); The output should be array( 0 => 'Text ', 1 => '<img src="hello.png" />',...

Extracting the body HTML and clean comments using PHP and Regex

php,html,regex,html-parsing
I want to clean the comments and some other garbage or tags from the <body> section in HTML using PHP and regex, but my code does not work: $str=preg_replace_callback('/<body>(.*?)<\/body>/s', function($matches){ return '<body>'.preg_replace(array( '/<!--(.|\s)*?-->/', ), array( '', ), $matches[1]).'</body>'; }, $str); The problem is that nothing happens. Comments will remain where they...

How to match a particular tag through css selectors where the class attribute contains spaces?

python,html,css-selectors,beautifulsoup,html-parsing
I want to select a table tag which has the value of class attribute as: drug-table data-table table table-condensed table-bordered So I tried the below code: for i in soup.select('table[class="drug-table data-table table table-condensed table-bordered"]'): print(i) But it fails to work: ValueError: Unsupported or invalid CSS selector: "table[class="drug-table" spaces in the...

How to get float from the string?

html,regex,html-parsing
For example, I have a string like: #resultStats{opacity:0;top:13px}</style><div id="extabar"><div id="topabar" style="position:relative"><div class="ab_tnav_wrp" id="slim_appbar"><div id="sbfrm_l"><div id="resultStats">About 5,320 results<nobr> (0.13 seconds)&nbsp;</nobr></div></div></div></div><div id="botabar" style="display:none"></div></div><div></div></div><div class="mw" data-jibp="h" data-jiis="uc"...
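The excerpt is cut off before it says which number is wanted, so the sketch below rests on an assumption: that the float of interest is the elapsed time inside "(0.13 seconds)". It is written in Python since the question is tagged only with regex; the function name elapsed_seconds is made up for illustration.

```python
import re

# Assumption: the wanted float is the number directly before the word "seconds"
# inside parentheses, e.g. "(0.13 seconds)".
SECONDS = re.compile(r'\((\d+(?:\.\d+)?)\s*seconds\)')

def elapsed_seconds(html):
    """Return the seconds value as a float, or None if the marker is absent."""
    match = SECONDS.search(html)
    return float(match.group(1)) if match else None

snippet = '<div id="resultStats">About 5,320 results<nobr> (0.13 seconds)&nbsp;</nobr></div>'
print(elapsed_seconds(snippet))  # 0.13
```

Anchoring on the parentheses and the word "seconds" keeps the pattern from picking up other digits in the markup, such as 5,320 or the 13px in the style rule.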

Find out which parser BeautifulSoup4 is using?

python,html,beautifulsoup,html-parsing
I've written a script using beautifulsoup4 that works on one machine but not another. The reason is that on the other machine, the BeautifulSoup() constructor auto-converts <br> to <br/>, whereas that is not the behaviour on my machine. Believe it or not, it matters to my script. I figured that the two...

Scraping nested tags

python,beautifulsoup,html-parsing
I know this type of question comes up frequently, however I have been browsing and have not seen a similar problem. <div class="casts"> <table cellpadding="0" cellspacing="0"> <tbody> <tr> <td class=""> <a class="cast"> <span class="title"> Nested data 1 <span class="schedule"> Nested data 2 </span> </span> </a> </td> </tr> </tbody> </table> </div>...

How can I parse an attribute string to an array in PHP?

php,arrays,regex,parsing,html-parsing
In PHP, I need to parse parameters in a string like: {keyword name1=val1 name2='val2' name3="val3"} And end up with an array like: { name1 => "val1", name2 => "val2", name3 => "val3" } Each value may or may not be quoted, and can be quoted using either single or double...

BeautifulSoup find only elements where an attribute contains a sub-string? Is this possible?

python,html,beautifulsoup,html-parsing
I have a call to find_all() in my BeautifulSoup code. This works currently to get me all images, but if I wanted to target only images which have a sub-string of "placeholder" in their src, how could I do this? for t in soup.find_all('img'): # WHERE img.href.contains("placeholder") ...

Python - BeautifulSoup Webscrape

python,html,web-scraping,beautifulsoup,html-parsing
I am trying to scrape a list of URLs off of the following website (http://thedataweb.rm.census.gov/ftp/cps_ftp.html), but I am having zero luck following the tutorials. Here is one example of the code I have tried: from bs4 import BeautifulSoup import urllib2 url = "http://thedataweb.rm.census.gov/ftp/cps_ftp.html" page = urllib2.urlopen(url) soup = BeautifulSoup(page.read()) cpsLinks...

Remove almost all HTML comments using Regex

php,html,regex,html-parsing,conditional-comments
Using this regex expression: preg_replace( '/<!--(?!<!)[^\[>].*?-->/', '', $output ) I'm able to remove all HTML comments from my page except for anything that looks like this: <!--[if IE 6]> Special instructions for IE 6 here <![endif]--> How can I modify this to also exclude HTML comments which include a unique...

What's wrong in my DOM parser php code?

php,html,dom,html-parsing
HTML $html='<h1>some text<h1> sometext <h2>some text</h2> sometext <h1>some text<h1> sometext <h2>some text</h2> sometext <h3>some text</h3> sometext'; I need to wrap h tags with div. Parent-child relationship is like h1->h2->h3 and so on. So, I need to wrap div according to it $dom = new DOMDocument(); $dom->loadHTML($html); $elements = $dom->getElementsByTagName('*'); for...

BeautifulSoup scraping nested tables

python,beautifulsoup,html-parsing
I have been trying to scrape the data from a website which is using a good amount of tables. I have been researching on the beautifulsoup documentation as well as here on stackoverflow but am still lost. Here is the said table: <form action="/rr/" class="form"> <table border="0" width="100%" cellpadding="2" cellspacing="0"...

Parsing HTML using LINQ

c#,linq,parsing,html-parsing
I need help parsing an HTML file. I'm new to C# and LINQ, and nothing I have tried has succeeded in extracting the "link" and the "Name 1": <tr class="Row"> <td width="80"> <div align="left"> <a href="link">details</a> </div> </td> <td width="152">Name 1</td> <td width="151">Name 2</td> <td width="152">Name 3</td> <td...

Get Text outside of tags as well

python,html,python-3.x,beautifulsoup,html-parsing
I'm trying to get some text out of a god-awful website. This is the part where I'm stumped: <tr><td valign="top"> <br> <b>AGFA&nbsp;ACCUSET,&nbsp;<i>1994</i></b>&nbsp;<font color=grey>(46965)</font><br> <br> <b>Equipements : </b><br>AGFA 9800<br> WITH RIP VIPER N°2<br> FILM PROCESSOR GLUNZ AND JENSEN ML35 n°26498<br> (LAIZE 450/600mm)<br> Spectraset 2200<br> <b>Availability :...

Improve my regex+php replacement

php,regex,html-parsing
I'm trying to replace a string with a part of it using regex. My code does the job, but is it the right way? $string = 'blabla <!-- s:D --><img src="{SMILIES_PATH}/icon_biggrin.gif" alt=":D" title="Very Happy" /><!-- s:D --> blabla <!-- scat --><img src="{SMILIES_PATH}/cat2.gif" alt="cat" title="Cat" /><!-- scat --> blabla'; $pattern =...

Removing a span with a specific class from HTML , but not the content using regular expression

php,regex,html-parsing
Here is the sample HTML: <div> <span class="target"> Remove parent span class only and save this text </span> </div> I want to turn the above HTML into the following, using regex functions only: <div> Remove parent span class only and save this text </div> I have tried this: $html = preg_replace('#<h3 class="target>(.*?)</h3>#', '',...

Regex for extracting substring in href

regex,html-parsing
I have an HTML element like the one below: <a href="/Test/URL/Page/" title="" >Test</a> I am trying to extract the value 'Page' from the href. I have tried /href="\/(.*)\/" which gives out 'Test/URL/Page' but couldn't figure out how to proceed further. Tried /href="\/([^\/]*$)\/" but this doesn't work. Without going into details, I do...
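One way to isolate the last path segment is to let a repeated group consume every "segment/" pair and capture only what is left before an optional trailing slash. A minimal sketch in Python, assuming the href is double-quoted and the wanted value is always the final non-empty segment; the function name last_path_segment is hypothetical.

```python
import re

# (?:[^"/]*/)* skips over leading "segment/" pieces; the capture group then
# holds the final segment, with an optional trailing slash before the quote.
LAST_SEGMENT = re.compile(r'href="(?:[^"/]*/)*([^"/]+)/?"')

def last_path_segment(tag):
    """Return the last non-empty path segment of a double-quoted href, or None."""
    match = LAST_SEGMENT.search(tag)
    return match.group(1) if match else None

tag = '<a href="/Test/URL/Page/" title="" >Test</a>'
print(last_path_segment(tag))  # Page
```

The second attempt quoted in the question likely fails because $ anchors at the end of the whole input rather than at the end of the attribute value, so the closing quote is used here as the end marker instead.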

lxml — how to isolate element from children

python,html,xml,html-parsing,lxml
Using lxml, I'd like to be able to get an HTML element and turn it into a string, excluding its children. How do I do this? Thanks...

How to delete a part of the htmlparse?

javascript,node.js,dom,html-parsing,html-parser
I run htmlparse on a webpage and I get a DOM of the page with this chunk: { raw: 'td', data: 'td', type: 'tag', name: 'td', children: [ { raw: '600', data: '600', type: 'text' } ] }, How can I delete all nodes of type "text" from that htmlparse output?...

Is it possible to parse dynamically growing web pages?

java,android,html-parsing,jsoup
I'm writing an Android app that parses a web page (via JSoup), filters the image links from it and loads them in a WebView. It works fine for static pages, but I have no idea how to handle pages that dynamically add content as I scroll down, such as 9gag,...

XPath for HTML table cell contents starting from given contents

python,html,xml,xpath,html-parsing
This is the HTML in tabular format: <tr><td style="width: 150px;">Development Name:</td><td><b>Bellewoods</b></td></tr> <tr><td style="width: 150px;">Property Type:</td><td><b>Executive Condominium</b></td></tr> <tr><td style="width: 150px;">Developer:</td><td><b>Qingjian Realty (Woodlands) Pte Ltd</b></td></tr> <tr><td style="width:...

Webscraping an IMDb page using BeautifulSoup

python,html,web-scraping,beautifulsoup,html-parsing
I am new to WebScraping/Python and BeautifulSoup and am having difficulty getting my code to work. I would like to scrape the URL http://m.imdb.com/feature/bornondate to get the name of the celebrity, celebrity image, profession, and best work for the ten celebrities on that page. I am not sure what I am...

how to access elements by path?

python,html,beautifulsoup,html-parsing
I am trying to parse an awful HTML page with BeautifulSoup to retrieve a few pieces of information. The code below: import bs4 with open("smartradio.html") as f: html = f.read() soup = bs4.BeautifulSoup(html) x = soup.find_all("div", class_="ue-alarm-status", playerid="43733") print(x) extracts the fragments I would like to analyze further: [<div alarmid="f319e1fb" class="ue-alarm-status" playerid="43733">...

regular expression to match the text inside a div but ignore the child elements if they exist

regex,html-parsing
I am trying to match the string that is contained inside a <div>. The issue is that I need to ignore anything inside any child elements within the div, and I can't seem to get it to match how I need it to. I have to keep a 3 part format...

PHP Regex to remove last paragraph and contents

php,regex,html-parsing
I have the following stored in a MySQL table: <p>First paragraph</p><p>Second paragraph</p><p>Third paragraph</p><div class="item"><p>Some paragraph here</p><p><strong><u>Specs</u>:</strong><br /><br /><strong>Weight:</strong> 10kg<br /><br /><strong>LxWxH:</strong> 5mx1mx40cm</p><p>This is the paragraph I am trying to remove with regex.</p></div> I'm trying to remove the last...

Extract non-HTML tags with regular expression in PHP

php,html,regex,html-parsing
I'm trying to extract non-HTML tags (like: <!This TAG>) from strings. I use the regular expression below to extract tags: $Tags = preg_split('/(<![^>]*[^\/]>)/i', $Content, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE); But the problem is that all HTML comment tags (like <!-- This One -->) will be extracted as well. I can...

How to limit the result of select tag in beautifulsoup?

python,html,beautifulsoup,html-parsing
For example, I have this: result = soup.select('div#test > div.filters > span.text') I want to limit the result of the above list to 10 items. In case of find_all() one can use the limit argument but what about select()?...

Remove <br /> from inside <pre></pre> tag

php,regex,html-parsing,preg-replace
I made a simple BBCode script and it all works fine. But later I started using a JavaScript library to beautify my code in <pre></pre>. Now the only problem I am facing is that I have <br /> tags after each line of code in <pre></pre> tags. So the question is how...

Scraping a website with clickable content in Python

python,python-2.7,web-scraping,html-parsing
I would like to scrape the content of the following website: http://financials.morningstar.com/ratios/r.html?t=AMD There, under Key Ratios, I would like to click on the "Growth" button and then scrape the data in Python. How can I do that?...

Jsoup and list of attachments

java,android,parsing,html-parsing,jsoup
I've this HTML block: <ul class="list_attachments"><li> <a href="www.site1.com"><img src='pdf.png' alt='pdf'/> File1</a></li><li> <a href="www.site2.com"><img src='pdf.png' alt='pdf'/> File2</a></li> </ul> I would like to extract all the "a href" rows, in particular the site and file name information. So I tried this: String [] fileName = new String[2]; String [] url = new String[2];...

How to parse html by part of a class name with JSOUP?

html-parsing,jsoup
I'm trying to get a piece of html, something like: <tr class="myclass-1234" rel="5678"> <td class="lst top">foo 1</td> <td class="lst top">foo 2</td> <td class="lst top">foo-5678</td> <td class="lst top nw" style="text-align:right;"> <span class="nw">1.00</span> foo </td> <td class="top">01.05.2015</td> </tr> I'm completely new to JSOUP, and first what came to mind is to get...

Ordered and Unordered List is not rendered during the PDF Generation

pdf,pdf-generation,html-parsing,html-lists,itext
I am using iText-5.5.6 and XMLWorker-5.5.6. I am having a strange issue during PDF generation: I am not able to see the ordered or unordered lists from the source HTML content. I am getting the HTML content from the Editor Control, and the content is like the below:...

preg_replace add target=“_blank”, but exclude certain instances

php,regex,html-parsing
I'm having trouble putting together the proper RegEx pattern to add target="_blank" to my links. To add that to all links.. no problem, but I need to exclude certain instances based on the pattern. This is the preg_replace() I'm using to update ALL links with target that are showing http://...

Parsing HTML tree in lxml : how can I retrieve the text inside the element?

python,html-parsing,lxml
I'm trying to retrieve the correct text inside an element. Here is the output: (Pdb) p etree.tostring(els[0]) '<h5 class="msg-delivered" style="padding:0;text-rendering:optimizeLegibility;line-height:1.1;margin-bottom:15px;-webkit-font-smoothing:antialiased;font-family:&quot;Open Sans&quot;, &quot;Helvetica Neue&quot;, Arial, Helvetica, sans-serif;color:#888888;vertical-align:middle;margin:0;font-size:13px;font-weight:300 !important">&#13;\n<i class="ic-icon-delivered"...

Beautiful soup and extracting values

python,html,beautifulsoup,html-parsing
I would be grateful if you could give me some guidance on how I would grab the date of birth "16 June 1723" below while using BeautifulSoup. Using my code I have managed to grab the values which you see below under results; however, all I need is...

Python RegEx for this HTML String

python,html,regex,html-parsing
I've got a string which is like that: <span class=\"market_listing_price market_listing_price_with_fee\">\r \t\t\t\t\t&#36;92.53 USD\t\t\t\t<\/span> I need to find this string via RegEx. My try: (^<span class=\\"market_listing_price market_listing_price_with_fee\\">\\r\\t\\t\\t\\t\\t&) But my problem is, the count of "\t" and "\r" may vary.. And of course this is not the Regular Expression for the whole...
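Since the number of \r and \t sequences varies, one option is to replace the fixed run with a repeated alternation that accepts any count of them. The sketch below assumes the input contains literal backslash escapes (as in JSON-encoded HTML) rather than real control characters, and that the goal is to pull out the price text; the excerpt is truncated, so both are assumptions, and the function name find_price is hypothetical.

```python
import re
from html import unescape  # Python 3; decodes entities such as &#36;

# Assumption: \" \r \t and \/ appear as literal backslash sequences in the input.
PRICE = re.compile(
    r'<span class=\\"market_listing_price market_listing_price_with_fee\\">'
    r'(?:\\[rnt]|\s)*'   # any number of literal \r, \n, \t escapes or real whitespace
    r'(.*?)'             # the price text itself
    r'(?:\\[rnt]|\s)*'
    r'<\\/span>'
)

def find_price(raw):
    """Return the decoded price text, or None when the span is not found."""
    match = PRICE.search(raw)
    return unescape(match.group(1)) if match else None

raw = r'<span class=\"market_listing_price market_listing_price_with_fee\">\r \t\t\t\t\t&#36;92.53 USD\t\t\t\t<\/span>'
print(find_price(raw))  # $92.53 USD
```

If the text really is JSON, decoding it first with json.loads and then parsing the resulting HTML would avoid matching the escape sequences by hand.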

python BeautifulSoup find all input for specific form

python,html,forms,beautifulsoup,html-parsing
I'm trying to use BeautifulSoup to extract input fields for a specific form only. Extracting the form using the following: soup.find('form') Now I want to extract all input fields which are a child to that form only. How can I do that with BS?...

Grabbing text data from Baseball-reference Python

python,web-scraping,html-parsing
http://www.baseball-reference.com/players/split.cgi?id=aardsda01&year=2015&t=p I would like to get the data on which arm this pitcher pitches with. If it were a table I would be able to grab the data, but I don't know how to get the text. David Aardsma \ARDS-mah\ David Allan Aardsma (twitter: @TheDA53) Position: Pitcher Bats: Right, Throws:...

PHP replace characters except the HTML tags

php,regex,string,replace,html-parsing
I need to replace the characters 0,1,2,...,9 with \xD9\xA0,\xD9\xA1,\xD9\xA2,...,\xD9\xA9 in a string. This string comes from the CKEditor, so it may contain HTML tags. Using the following code $body = str_replace("1", "\xD9\xA1", $body); it replaces every 1 with \xD9\xA1, so it affects the tag <h1> and also <table border="1"> while...

beautifulsoup not returning span results

python,html,python-2.7,beautifulsoup,html-parsing
I'm learning bs4 and trying to scrape the span tag data from this website and put them in a list, but no results are returned. What am I doing wrong? import requests import bs4 root_url = 'http://www.timeanddate.com' index_url = root_url + '/astronomy/tonga/nukualofa' response = requests.get(index_url) soup = bs4.BeautifulSoup(response.text) spans = soup.find_all('span',...

Scrape data from a website using CSS style with Beautiful Soup

python,html,python-2.7,beautifulsoup,html-parsing
I have a website from which I want to scrape coupon codes. I have two issues here. I am using Python and Beautiful Soup. 1) Some coupons displayed in span tags don't have a class or id, so I am not able to get coupons from these tags. I need to get them from the strong tag (AXISCB50) <h6><span...

How to check whether a tag exists in the Jsoup HTML parser on Android

android,parsing,html-parsing,jsoup
I parse the tag "a" in my HTML using Jsoup. Document doc = Jsoup.parse(my html); Element p = doc.body().child(0); Element a = p.child(0); String text = a.text(); Log.d("tag", text); But when the tag "a" doesn't exist, I get an exception: java.lang.IndexOutOfBoundsException: Invalid index 0, size is 0 How do I check whether the tag...

PHP dom parsing

php,parsing,dom,html-parsing
I'm trying to get the values of the following table. I tried both curl/regex (I know it's not recommended) and DOM separately, but wasn't able to get the values properly. There are multiple rows in the page, so I'll need to use a foreach. I need an exact match of...

Extracting Table Data with JSoup on Yahoo Finance

java,html-parsing,jsoup
Trying to practice extracting data from tables using JSoup. Can't figure out why I can't pull the "Shares Outstanding" field from https://finance.yahoo.com/q/ks?s=AAPL+Key+Statistics Here are two attempts where 's' is AAPL: public class YahooStatistics { String sharesOutstanding = "Shares Outstanding:"; public YahooStatistics(String s) { String keyStatisticsURL = ("https://finance.yahoo.com/q/ks?s="+s+"+Key+Statistics"); //Attempt 1 try {...

Why can't I get track titles from url?

python,html,python-2.7,beautifulsoup,html-parsing
I am trying to write a Python script that uses BeautifulSoup to scrape the track titles from this Internet Archive page. I'd like to be able to output: 391106 - Bruce-Partington Plans 400311 - The Retired Colourman ... But I am unable to find the tags. Here is my script:...

Parsing with Jsoup in arraylist

android,xml,html-parsing,jsoup
How could I parse this with jsoup? <!-- NOVINEEE --> <div class="right_naslov"><a href="/e-novine">e-novine</a></div> <div class="right_post"> <span class="right_post_nadnaslov"><font class="nadnaslov">Zanimljiv zadatak</font></span><span class="right_post_datum"><font class="datum">12.12.2014.</font></span> <span class="right_post_naslov_v"><font class="naslov"><a href="/e-novine/n/?id=340">Profesor učenicima zadao...

BeautifulSoup4 get input 'value' throws an error with good code?

html,parsing,beautifulsoup,html-parsing
print [(element['name'], element['value']) for element in soup.find_all('input')] I copied this code to get the value of an input and it throws this error: File "messager.py", line 116, in main print [(element['name'], element['value']) for element in soup.find_all('input')] File "C:\PYTHON27\lib\site-packages\bs4\element.py", line 905, in __getitem__ return self.attrs[key] KeyError: 'value' If I only provide...
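The `KeyError` suggests that at least one `<input>` on the page has no `value` attribute. A hedged sketch of one way around it is to read attributes with `Tag.get`, which returns `None` (or a supplied default) when the attribute is missing. It assumes `bs4` is installed, uses Python 3 syntax rather than the Python 2 `print` statement in the question, and the markup is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second input has no value attribute.
page = '<input name="user" value="alice"><input name="token">'
soup = BeautifulSoup(page, "html.parser")

# Tag.get returns a default instead of raising KeyError when the
# attribute is absent; element["value"] would raise for the second input.
pairs = [
    (element.get("name"), element.get("value", ""))
    for element in soup.find_all("input")
]
print(pairs)
```

Whether an empty string or `None` is the better default depends on how the pairs are used afterwards.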

how to extract text from a html element by id and assign to a php variable?

php,html,html-parsing
I have this: <h4 class="modal-title" id="exampleModalLabel"> hello </h4> and I want to extract the word hello using its id and assign it to a PHP variable, but I have no idea how. If it were an input it would be easier, but I have to use a different element...

Beautiful Soup Get list elements not separating by commas

python,html,beautifulsoup,html-parsing
I am trying to parse a list of cities from a website using Beautiful Soup. Here is the output: MonroeMatthewsWaxhawIndian Trail, Matthews What I need is: Monroe, Matthews, Waxhaw, Indian Trail, Matthews Here is the HTML: <div id="current_tab"> <p class="view_label_type_geoserved" id="view_label_field_geoserved">Geographies Served</p> <ul> <li class="view_type_geoserved" id="view_field_geoserved"> <p style="font-weight: bold; border-bottom:...
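Text from adjacent elements runs together when the whole container is read in one call. One way to keep control of the separator is to read each element on its own and join the results. The sketch below assumes `bs4` is installed and uses a simplified, hypothetical list, since the markup in the question is truncated.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real list in the question is truncated.
page = """
<ul>
  <li>Monroe</li>
  <li>Matthews</li>
  <li>Waxhaw</li>
  <li>Indian Trail</li>
</ul>
"""

soup = BeautifulSoup(page, "html.parser")

# Read each <li> separately, then join, so the separator is explicit.
cities = [item.get_text(strip=True) for item in soup.find_all("li")]
joined = ", ".join(cities)
print(joined)
```

`get_text` also accepts a `separator` argument, which may be enough when the city names are the only text nodes in the container.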

how to pass search key and get result through bs4

python,html,beautifulsoup,html-parsing
def get_main_page_url("https://malwr.com/analysis/search/", strDestPath, strMD5): base_url = 'https://malwr.com/' url = 'https://malwr.com/account/login/' username = 'myname' password = 'pswd' session = requests.Session() # getting csrf value response = session.get(url) soup = bs4.BeautifulSoup(response.content) form = soup.form csrf = form.find('input', attrs={'name': 'csrfmiddlewaretoken'}).get('value') ## csrf1 = form.find('input', attrs ={'name': 'search'}).get('value') # logging in data = {...

Regular Expression to find a complete html tag in source file

html,regex,replace,html-parsing,phpstorm
I want a regular expression (to search in PhpStorm) that will search for the starting tag and its corresponding closing tag in all HTML files of my project directory. E.g., here is my code: <ul class="sub-menu"> <li id="menu-item-215" class="menu-item menu-item-type-post_type menu-item-object-page menu-item-215"><a href="news-updates/index.html"><span>News / Updates</span></a></li> <li id="menu-item-295" class="menu-item menu-item-type-custom menu-item-object-custom...

Parsing content with the HTML Agility Pack and Linq

c#,linq,html-parsing,html-agility-pack
I am trying to get significant content for searched keywords in HTML, using the code below to generate an HtmlNodeCollection: var findclasses = doc.DocumentNode.SelectNodes("//body//*[not(self::script)]").Where(x => x.InnerHtml.Contains("SearchedKeywordText") && x.InnerHtml.Contains("SearchedKeyword1Text")).OrderBy(x => x.Name); string FirstContent = findclasses.First().InnerText; And I am getting this result: Results View Expanding the Results View will enumerate the IEnumerable...

Why do PhantomJS/Selenium remove duplicated attributes on a single HTML element

html5,selenium-webdriver,html-parsing,phantomjs
It seems that PhantomJS/Selenium automatically removes duplicate attributes on HTML elements. Is this mandated by the HTML standard itself (any pointers?) or is it a de facto implementation behavior of WebKit/Gecko?

Using JSoup to get data-code value of a table

java,html,html-parsing,jsoup
How would I be able to use JSoup to get the data-code value from a table row? Here is what I have tried but it just prints nothing: Document doc = Jsoup.connect("http://www.example.com").get(); Elements dataCodes = doc.select("table[class=team-list]"); for (Element dataCode : dataCodes) { System.out.println(dataCode.attr("data-code")); } The HTML code looks like this:...

Preg_replace, please little support?

php,html-parsing,preg-replace
So I have this preg_replace function (from a script someone else wrote) that adds a target="_blank" attribute to all the links. However, when I have a link that already has the target="_blank" attribute, it adds another one. This results in a double target="_blank" attribute in the link. Is there a...

Solved: HTMLParser does not parse the entire input

class,python-3.x,html-parsing
[Sorry, I don't know which keywords might give an answer to my question, although it might be a general one. I was not able to find other questions pointing at my problem.] I wrote some scripts in Python, but never really worked with classes. Now I need one...

Android - Derive the parent node of a child node using JSOUP

android,html,html-parsing,jsoup
I have to change the html code of a web page before showing it on my Android App. This is my situation: <html> <div class="something"> <a class="inner_something"> <span class="title">Titolo1</span> </a> </div> <div class="something"> <a class="inner_something"> <span class="title">Titolo2</span> </a> </div> </html> I want to remove the div that contains within it...

Why I am getting shifted output when write to a text after converting it to xls file in java [closed]

java,excel,csv,io,html-parsing
I am writing the output of parsed web pages into two text files. "CrawledURLS.txt" saves the crawled pages and "CrawledURLSERROR.txt" saves the uncrawled web pages. Since I need to plot the output data, I converted the .txt files to an .xls file. I am getting more than "300.000" URLs. When I stop...

Get short description part from Google search results

java,html-parsing,jsoup
I use the jsoup HTML parser to filter URLs. I would also like to get the short descriptions from the result lists, like this: Stack Overflow is a privately held website, the flagship site of the Stack Exchange Network, created in 2008 by Jeff Atwood and Joel Spolsky, as a more open ......

Why the logic for array conversion and HTML parsing is not working in following scenario?

php,arrays,xpath,html-parsing,domdocument
I have an associative array titled $allFeeds (after executing print_r($allFeeds);) as follows: Note: The actual associative array $allFeeds is very large. For clarity, I've included only one element from this large array. Array ( [0] => Array ( [feed_image] => Array ( [0] => <a href="http://52.1.47.143/photo/928/2_onclick_ok/userid_244/" class="...

Finding all tags and attributes in a HTML

python,html,xml-parsing,beautifulsoup,html-parsing
I am a newbie and looking at HTML code for the first time. For my research I need to know the number of tags and attributes in a webpage. I looked at various parsers and found Beautiful Soup to be one of the most preferred. The following code (taken from...
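Counting tags and attributes does not strictly need Beautiful Soup. As a rough sketch, the standard-library `html.parser` module (swapped in here in place of Beautiful Soup) can tally them while it parses. The class name `TagAttributeCounter` and the sample page are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagAttributeCounter(HTMLParser):
    """Tally start tags and attribute names seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.attributes = Counter()

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; self-closing tags such as <br/> are
        # routed here as well by the default handle_startendtag.
        self.tags[tag] += 1
        for name, _value in attrs:
            self.attributes[name] += 1


# Hypothetical sample page.
page = (
    '<html><body><p class="a" id="x">Hi</p>'
    '<p class="b">Bye<br/></p></body></html>'
)

counter = TagAttributeCounter()
counter.feed(page)
counter.close()

print(sum(counter.tags.values()), dict(counter.tags))
print(sum(counter.attributes.values()), dict(counter.attributes))
```

The counts reflect the tags as written in the source; unlike a browser or a repairing parser, `html.parser` does not insert missing elements.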

Issue Scraping with PHP

php,regex,web-scraping,html-parsing
I am trying to obtain a value, but step over other values that change dynamically. The table section looks as follows: Total 1.18 3.33 $20,000 16.2% The code I am using to find the third value in preg_match is: <?php function get_total(){ $file_string = file_get_contents('url'); preg_match('#Total</td><td>\d\.\d+</td><td>d\.\d+</td><td>$(\d+)</td><td>d+\.\d\%\</td></tr></table><br><span id="ExStockDetailTableF1F2"#',$file_string, $data); $loss =...