Python, lxml - access text

Views: 81 | Published: 2020/5/4 8:34:57 | Tags: python, text, html-parsing, lxml


Question

I'm currently a bit out of ideas, and I really hope that you can give me a hint. It's probably best to explain my question with a small piece of sample code:

from lxml import etree
from io import StringIO

testStr = "<b>text0<i>text1</i><ul><li>item1</li><li>item2</li></ul>text2<b/><b>sib</b>"
parser = etree.HTMLParser()
# generate html tree
htmlTree   = etree.parse(StringIO(testStr), parser)
print(etree.tostring(htmlTree, pretty_print=True).decode("utf-8"))
bElem = htmlTree.getroot().find("body/b") 
print(".text only contains the first part: "+bElem.text+ " (which makes sense in some way)")
for text in bElem.itertext():
    print(text)

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <body>
    <b>text0<i>text1</i><ul><li>item1</li><li>item2</li></ul>text2<b/><b>sib</b></b>
  </body>
</html>

.text only contains the first part: text0 (which makes sense in some way)
text0
text1
item1
item2
text2
sib

My question:

I would like to access "text2" directly, or get a list of all text parts that includes only the ones found directly in the parent tag. So far I have only found itertext(), which does display "text2".

Is there any other way I could retrieve "text2"?

Now you might be asking why I need this. Basically, itertext() already does pretty much what I want:

Create a list that contains all text found in an element's children
However, I want to process any tables and lists that are encountered with a different function, which then creates a list structure like this: ["text0 text1",["item1","item2"],"text2"] or, for a table (row 1 with one column, row 2 with two columns): ["1. row, 1 col",["2. row, 1. col","2. row, 2. col"]]

Maybe I'm taking a completely wrong approach?

Recommended answer
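One way to get such a nested structure is to reimplement itertext() as a generator that walks the tree itself and hands selected tags to special handlers. The listing below is only a sketch reconstructed from the later listings and from the output that follows, not necessarily the answer's original code. The helper name ul_as_list is hypothetical, and the sample string is closed with </b> so that it also parses as XML with the standard library when lxml is not installed.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Standard-library fallback; elements expose the same .text/.tail API.
    import xml.etree.ElementTree as etree


def ul_as_list(el):
    """Hypothetical handler: emit all text inside <ul> as one nested list."""
    yield list(el.itertext())  # built-in itertext() skips el's own tail
    if el.tail:
        yield el.tail  # text that follows </ul> belongs to the parent


def itertext(root, handlers={"ul": ul_as_list}):
    """Yield text in document order, delegating handled tags to handlers."""
    if root.text:
        yield root.text
    for el in root:
        # Use the special handler for this tag, or recurse by default.
        yield from handlers.get(el.tag, itertext)(el)
    if root.tail:
        yield root.tail


root = etree.fromstring(
    "<b>text0<i>text1</i><ul><li>item1</li>"
    "<li>item2</li></ul>text2<b/><b>sib</b></b>")
result = list(itertext(root))
print(result)
```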

Output

['text0', 'text1', ['item1', 'item2'], 'text2', 'sib']

Note: on Python versions older than 3.3, yield from X can be replaced by for x in X: yield x.
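As for reaching "text2" directly: in the ElementTree data model that lxml follows, text that comes after a child element is stored on that child as its .tail, not on the parent. The sketch below illustrates this under a few assumptions: own_text_parts is a hypothetical helper name, the sample is a well-formed version of the question's string, and the import falls back to the standard library so the snippet runs without lxml.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Standard-library fallback; elements expose the same .text/.tail API.
    import xml.etree.ElementTree as etree


def own_text_parts(elem):
    """Return only the text that sits directly inside elem, not in children."""
    parts = [elem.text] if elem.text else []
    for child in elem:
        if child.tail:  # text between this child's end tag and the next node
            parts.append(child.tail)
    return parts


b = etree.fromstring(
    "<b>text0<i>text1</i><ul><li>item1</li><li>item2</li></ul>text2</b>")

text2 = b.find("ul").tail  # "text2" is the tail of the <ul> child
print(text2)
print(own_text_parts(b))
```

With lxml itself, the XPath call b.xpath("text()") should give the same list of direct text nodes without a helper function.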

To join adjacent strings:

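from lxml import html  # html.fromstring() is used in the call below

def joinadj(iterable, join=' '.join):
    """Join runs of adjacent strings; yield non-string items unchanged."""
    adj = []
    for item in iterable:
        if isinstance(item, str):
            adj.append(item) # save for later
        else:
            if adj: # yield items accumulated so far
                yield join(adj)
                del adj[:] # remove yielded items
            yield item # not a string, yield as is

    if adj: # yield the rest
        yield join(adj)

# itertext() here is the answer's custom generator, not lxml's Element.itertext()
print(list(joinadj(itertext(html.fromstring(
                "<b>text0<i>text1</i><ul><li>item1</li>"
                "<li>item2</li></ul>text2<b/><b>sib</b>")))))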

Output

['text0 text1', ['item1', 'item2'], 'text2 sib']

To allow tables and nested lists inside <ul>, the handler should call itertext() recursively:

def ul_handler(el):
    yield list(itertext(el, with_tail=False))
    if el.tail:
        yield el.tail

def itertext(root, handlers=dict(ul=ul_handler), with_tail=True):
    if root.text:
        yield root.text
    for el in root:
        yield from handlers.get(el.tag, itertext)(el)
    if with_tail and root.tail:
        yield root.tail

print(list(joinadj(itertext(html.fromstring(
                    "<b>text0<i>text1</i><ul><li>item1</li>"
                    "<li>item2<ul><li>sub1<li>sub2</li></ul></ul>"
                    "text2<b/><b>sib</b>")))))

Output

['text0 text1', ['item1', 'item2', ['sub1', 'sub2']], 'text2 sib']
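The same handler mechanism can be extended to tables. The following is a minimal sketch, not part of the original answer: table_handler and cell_text are hypothetical names, and the row format (a plain string for a single-cell row, a list for a multi-cell row) is taken from the structure described in the question. It uses a well-formed sample and falls back to the standard library when lxml is unavailable.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Standard-library fallback; elements expose the same .text/.tail API.
    import xml.etree.ElementTree as etree


def cell_text(cell):
    """Collapse all text inside one cell into a whitespace-normalized string."""
    return " ".join("".join(cell.itertext()).split())


def table_handler(table):
    """Yield one list per table: a string per 1-cell row, a list otherwise."""
    rows = []
    for tr in table.iter("tr"):
        cells = [cell_text(c) for c in tr if c.tag in ("td", "th")]
        if cells:
            rows.append(cells[0] if len(cells) == 1 else cells)
    yield rows
    if table.tail:
        yield table.tail  # text after </table> belongs to the parent


def itertext(root, handlers={"table": table_handler}):
    """Yield text in document order, delegating handled tags to handlers."""
    if root.text:
        yield root.text
    for el in root:
        yield from handlers.get(el.tag, itertext)(el)
    if root.tail:
        yield root.tail


doc = etree.fromstring(
    "<div>intro<table>"
    "<tr><td>1. row, 1 col</td></tr>"
    "<tr><td>2. row, 1. col</td><td>2. row, 2. col</td></tr>"
    "</table>outro</div>")
result = list(itertext(doc))
print(result)
```

Note that table.iter("tr") also visits rows of nested tables, so a table inside a cell would be flattened into the outer list; handling that case would need a recursive variant along the lines of ul_handler.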
