Retrieving tail text from html


Problem description

Python 2.7 using lxml

I have some annoyingly formed html that looks like this:

<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>

So basically it's a single td with a ton of stuff in it. I'm trying to compile a list or dict of the names and their addresses.

So far what I've done is gotten a list of nodes with names using tree.xpath('//td/b'). So let's assume I'm currently on the b node for John.

I'm trying to get whatever.xpath('string()') for everything following the current node but preceding the next b node (Sally). I've tried a bunch of different xpath queries but can't seem to get this right. In particular, any time I use an and operator in an expression that has no [] brackets, it returns a bool rather than a list of all nodes meeting the conditions. Can anyone help out?

Solution

This should work:

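from lxml import etree

# Parse the file with lxml's lenient HTML parser
parser = etree.HTMLParser()
with open('./test.html', 'r') as html:
    tree = etree.fromstring(html.read(), parser)

my_dict = {}

for b in tree.iter('b'):
    # The element right after each <b> is a <br>; its tail holds the street line
    street = b.getnext().tail.replace('\n', '')
    my_dict[b.text.replace('\n', '')] = street

# print() with a single argument behaves the same on Python 2.7 and 3
print(my_dict)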

This code prints:

{'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}

(You may want to strip the quotation marks out!)
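
As a minimal sketch of that cleanup, assuming the dict looks exactly like the output printed above, `str.strip('"')` removes the leading and trailing double quotes from each key and value. The name `cleaned` is only illustrative.

```python
# Result dict as printed above; the double quotes are part of the data.
my_dict = {'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}

# str.strip('"') removes double quotes from both ends of a string only,
# so any quote inside a value would be left alone.
cleaned = {name.strip('"'): street.strip('"') for name, street in my_dict.items()}

print(cleaned)
```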

Rather than using xpath, you could use one of lxml's parsers in order to easily navigate the HTML. The parser will turn the HTML document into an "etree", which you can navigate with provided methods. Every element in the tree has an iter() method, which takes a tag name and yields all elements with that name. In your case, if you use this to obtain all of the <b> elements, you can then step to the following <br> element with getnext() and retrieve its tail text, which contains the information you need. Note that the code above only reads the first <br> after each name (the street); the city sits in the tail of the next <br>. You can find more about this in the "Elements contain text" section of the lxml.etree tutorial.
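
The question asks for everything between one name and the next, so here is a hedged sketch that extends the same tail-text idea: for each <b>, walk its following siblings with `itersiblings()` and collect tail text until the next <b>. The helper names `clean` and `addresses_by_name` are made up for this example, and the sample markup is wrapped in `<table><tr>` as an assumption so the `<td>` has a valid parent.

```python
from lxml import etree

# Sample markup modeled on the question (the <table><tr> wrapper is an assumption).
HTML = """<table><tr><td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td></tr></table>"""


def clean(text):
    """Drop newlines, surrounding whitespace, and the enclosing double quotes."""
    return (text or "").replace("\n", "").strip().strip('"')


def addresses_by_name(tree):
    """Map each <b> name to the list of text lines found before the next <b>."""
    result = {}
    for b in tree.iter("b"):
        lines = []
        # Text directly after </b> lives in b.tail.
        if clean(b.tail):
            lines.append(clean(b.tail))
        # Text after each following sibling lives in that sibling's tail.
        for sibling in b.itersiblings():
            if sibling.tag == "b":
                break  # reached the next name
            if clean(sibling.tail):
                lines.append(clean(sibling.tail))
        result[clean(b.text)] = lines
    return result


tree = etree.fromstring(HTML, etree.HTMLParser())
people = addresses_by_name(tree)
print(people)
```

Each value is a list so the street and city stay separate; joining them with `', '.join(lines)` would give a single address string instead.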
