When parsing HTML, why do I need item.text sometimes and item.text_content() other times?


Question

I am still learning lxml. I discovered that sometimes I cannot get the text of an element in a tree using item.text, yet item.text_content() works. I am not sure I see why yet. Any hints would be appreciated.

I am not sure how to provide an example without making you handle a file, so here is some code I wrote to try to figure out why I was not getting the text I expected:

from lxml import html

# notmatched[0] is the path of an HTML file (the list is built elsewhere)
with open(notmatched[0]) as f:
    theTree = html.fromstring(f.read())
text = []
text_content = []
notText = []
hasText = []
for each in theTree.iter():
    if each.text:
        text.append(each.text)
        hasText.append(each)   # elements whose .text is truthy
    else:
        notText.append(each)   # elements whose .text is None or empty
    text_content.append(each.text_content())  # text of the element and all its descendants

After running this, I look at:

>>> len(notText)
3612
>>> notText[40]
<Element b at 26ab650>
>>> notText[40].text_content()
'(I.R.S. Employer'
>>> notText[40].text

Solution

According to the docs, the text_content method:

Returns the text content of the element, including the text content of its children, with no markup.

So for example,

import lxml.html as lh
data = """<a><b><c>blah</c></b></a>"""
doc = lh.fromstring(data)
print(doc)
# <Element a at b76eb83c>

doc is the Element a. No text immediately follows the opening <a> tag (that is, nothing sits between <a> and <b>), so doc.text is None:

print(doc.text)
# None

But the c element, a descendant of a, does contain text, so doc.text_content() is not None:

print(doc.text_content())
# blah
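
The same reasoning may explain the b element in the question. The fragment below is a made-up guess at what such markup could look like (the real file is not shown, and the font child and the trailing words are assumptions): the text sits inside a child of b, so b.text is None while b.text_content() still finds it.

```python
import lxml.html as lh

# Hypothetical fragment: <b> wraps a <font> child, so the visible text
# belongs to <font>, not to <b> itself.
doc = lh.fromstring(
    "<p><b><font>(I.R.S. Employer</font></b> Identification No.)</p>"
)
b = doc.find("b")

print(b.text)            # None: nothing sits between <b> and <font>
print(b[0].text)         # (I.R.S. Employer
print(b.text_content())  # (I.R.S. Employer
```

If you only need the visible text and do not care which element owns it, text_content() is usually the simpler choice.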

PS. There is a clear description of the meaning of the text attribute here. Although it is part of the docs for lxml.etree.Element, I think the meaning of the text and tail attributes applies equally well to lxml.html.Element objects.
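
To make the text and tail attributes concrete, here is a small sketch on an invented paragraph (the markup is an assumption chosen for illustration). An element's text is whatever comes before its first child; an element's tail is whatever follows its closing tag, up to the next sibling or the parent's end.

```python
import lxml.html as lh

p = lh.fromstring("<p>Hello <b>bold</b> world</p>")
b = p.find("b")

print(repr(p.text))            # 'Hello '  (before the first child of <p>)
print(repr(b.text))            # 'bold'    (inside <b>)
print(repr(b.tail))            # ' world'  (after </b>, stored on <b>)
print(repr(p.text_content()))  # 'Hello bold world'
print(list(p.itertext()))      # ['Hello ', 'bold', ' world']
```

Note that ' world' is not part of p.text: it is stored as the tail of b, which is why walking a tree and reading only .text misses some of the text.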
