Parsing HTML with Lxml


Problem Description


I need help parsing some text out of a page with lxml. I tried BeautifulSoup, but the HTML of the page I am parsing is so broken that it wouldn't work. So I have moved on to lxml, but the docs are a little confusing and I was hoping someone here could help me.

Here is the page I am trying to parse; I need to get the text under the "Additional Info" section. Note that I have a lot of pages like this on the site to parse, and each page's HTML is not always exactly the same (it might contain some extra empty "td" tags). Any suggestions as to how to get at that text would be very much appreciated.

Thanks for the help.

Solution

import lxml.html as lh
import urllib2

def text_tail(node):
    # lxml stores the text that follows an element's closing tag in
    # node.tail, so yield both pieces of text for every node.
    yield node.text
    yield node.tail

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urllib2.urlopen(url))
for elt in doc.iter('td'):
    text = elt.text_content()
    # The page's markup has two spaces between "Additional" and "Info".
    if text.startswith('Additional  Info'):
        # Gather every non-empty, non-&nbsp; piece of text from the
        # <td> siblings that follow the "Additional Info" cell.
        blurb = [text for node in elt.itersiblings('td')
                 for subnode in node.iter()
                 for text in text_tail(subnode) if text and text != u'\xa0']
        break
print('\n'.join(blurb))

yields

For over 65 years, Carl Stirn's Marine has been setting new standards of excellence and service for boating enjoyment. Because we offer quality merchandise, caring, conscientious, sales and service, we have been able to make our customers our good friends.

Our 26,000 sq. ft. facility includes a complete parts and accessories department, full service department (Merc. Premier dealer with 2 full time Mercruiser Master Tech's), and new, used, and brokerage sales.
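The text_tail helper matters because lxml keeps the text that appears after an element's closing tag in that element's tail attribute, not in its parent's text. Here is a minimal, self-contained illustration of the text/tail model (the fragment is made up purely for demonstration):

from lxml import etree

# Hypothetical fragment: "first" comes before the <br/>, "second" comes after it.
elt = etree.fromstring('<td>first<br/>second</td>')
br = elt[0]
print(elt.text)  # 'first'  -- text before the first child element
print(br.tail)   # 'second' -- text following <br/>, stored on the <br/> element

This is why the solution iterates over node.iter() and yields both text and tail: it recovers all of the visible text inside each td, wherever lxml has stored it.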

Edit: Here is an alternate solution based on Steven D. Majewski's xpath which addresses the OP's comment that the number of tags separating 'Additional Info' from the blurb can be unknown:

import lxml.html as lh
import urllib2

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urllib2.urlopen(url))

# Locate the <td> whose child element contains "Additional  Info" (two spaces,
# as in the page's markup) and take the text of every <td> sibling after it.
blurb = doc.xpath('//td[child::*[text()="Additional  Info"]]/following-sibling::td/text()')

blurb = [text for text in blurb if text != u'\xa0']
print('\n'.join(blurb))
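Both snippets are written for Python 2 (urllib2). A rough Python 3 sketch of the XPath version, assuming the page is still reachable and structured the same way, only needs urllib.request in place of urllib2:

import urllib.request

import lxml.html as lh

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urllib.request.urlopen(url))

# Same double-spaced "Additional  Info" as in the page's markup.
blurb = doc.xpath('//td[child::*[text()="Additional  Info"]]/following-sibling::td/text()')
blurb = [text for text in blurb if text != '\xa0']
print('\n'.join(blurb))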
