When parsing HTML, why do I need item.text sometimes and item.text_content() others?

Posted: 2018/6/21 18:01:33 · Tags: python, html, parsing, lxml


Question

Still learning lxml. I discovered that sometimes I cannot get the text of an item from a tree using item.text, but if I use item.text_content() I am good to go. I am not sure I see why yet. Any hints would be appreciated.

Okay, I am not sure exactly how to provide an example without making you handle a file.

Here is some code I wrote to try to figure out why I was not getting some of the text I expected:
from lxml import html

# notmatched is a list of file paths built earlier (not shown)
with open(notmatched[0]) as f:
    theTree = html.fromstring(f.read())
text = []
text_content = []
notText = []
hasText = []
for each in theTree.iter():
    if each.text:
        text.append(each.text)
        hasText.append(each)   # elements whose .text is truthy
    else:
        notText.append(each)   # elements whose .text is None or empty
    text_content.append(each.text_content())  # the text for all elements
So after I run this, I look at:
>>> len(notText)
3612
>>> notText[40]
<Element b at 26ab650>
>>> notText[40].text_content()
'(I.R.S. Employer'
>>> notText[40].text

Solution
According to the docs, the text_content method:

  Returns the text content of the element, including the text content of
  its children, with no markup.
So for example,
import lxml.html as lh
data = """<a><b><c>blah</c></b></a>"""
doc = lh.fromstring(data)
print(doc)
# <Element a at b76eb83c>
doc is the Element a. The a tag has no text immediately following it (between the <a> and the <b>). So doc.text is None:
print(doc.text)
# None
But there is text inside the c element (right after the <c> tag), so doc.text_content() is not None:
print(doc.text_content())
# blah
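The same reasoning probably explains the b element in the question. Here is a minimal sketch, assuming the bold text was wrapped in an inner tag; the font tag and the markup are made up for illustration, since the original file is not shown:

```python
import lxml.html as lh

# Hypothetical markup: the text sits inside <font>, which is a child of <b>.
snippet = "<div><b><font>(I.R.S. Employer</font></b></div>"
div = lh.fromstring(snippet)
b = div[0]       # the <b> element
font = b[0]      # the <font> element nested inside <b>

b_text = b.text                # None: nothing between <b> and <font>
font_text = font.text          # '(I.R.S. Employer'
b_content = b.text_content()   # '(I.R.S. Employer', gathered from the child

print(repr(b_text), repr(font_text), repr(b_content))
```

If the real document looks like this, b.text is None even though b.text_content() returns the string, which matches the output in the question.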
PS. The lxml.etree documentation has a clear description of the meaning of the text attribute. Although it is written for lxml.etree.Element, I think the meaning of the text and tail attributes applies equally well to lxml.html.Element objects.
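To see how text and tail divide up mixed content, here is a small sketch using lxml.html (the markup is invented for illustration):

```python
import lxml.html as lh

p = lh.fromstring("<p>Hello <b>bold</b> world</p>")
b = p[0]   # the <b> child of <p>

p_text = p.text               # text before the first child of <p>
b_text = b.text               # text inside <b>
b_tail = b.tail               # text after </b>, up to the next tag
pieces = list(p.itertext())   # every text chunk, in document order
content = p.text_content()    # all chunks joined, with no markup

print(repr(p_text), repr(b_text), repr(b_tail))
print(pieces)
print(content)
```

Here .text only covers the text before the first child, the text after a child lives in that child's .tail, and text_content() stitches all of it together.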
