Extract text from XML documents in Python

Views: 669  Published: 2020/5/1 10:09:53  Tags: python, linux, xml-parsing, ubuntu-10.04


Question

Here is a sample XML document:

<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>300.00</price>
    </book>

    <book category="CHILDREN">
        <title lang="english">Harry Potter</title>
        <author>J K. Rowling </author>
        <year>2005</year>
        <price>625.00</price>
    </book>
</bookstore>

I want to extract the text without specifying the elements. How can I do this? I have 10 such documents. The user enters a word that I do not know in advance, and it has to be searched for in the text portions of all 10 XML documents. For that to work, I need to find where the text lies without knowing the element names. One more thing: all of these documents are different from one another.

Please help!

Recommended answer

You could simply strip out all the tags:

>>> import re
>>> txt = """<bookstore>
...     <book category="COOKING">
...         <title lang="english">Everyday Italian</title>
...         <author>Giada De Laurentiis</author>
...         <year>2005</year>
...         <price>300.00</price>
...     </book>
...
...     <book category="CHILDREN">
...         <title lang="english">Harry Potter</title>
...         <author>J K. Rowling </author>
...         <year>2005</year>
...         <price>625.00</price>
...     </book>
... </bookstore>"""
>>> exp = re.compile(r'<.*?>')
>>> text_only = exp.sub('',txt).strip()
>>> text_only
'Everyday Italian\n        Giada De Laurentiis\n        2005\n        300.00\n    \n\n    \n        Harry Potter\n        J K. Rowling \n        2005\n        625.00'
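
The regex above treats the document as plain text, so it does not decode entities such as &amp; and can be confused by comments or CDATA sections. As an alternative sketch (not part of the original answer), the standard library's xml.etree.ElementTree can parse the document, and itertext() then yields every text node without naming any element. The helper name extract_text and the shortened sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def extract_text(xml_string):
    """Return all text in the document as a list of stripped, non-empty strings."""
    root = ET.fromstring(xml_string)
    # itertext() walks the whole tree in document order, so no element
    # names need to be known in advance. Whitespace-only nodes (the
    # indentation between tags) are dropped.
    return [t.strip() for t in root.itertext() if t.strip()]


# Shortened version of the sample document from the question.
sample = """<bookstore>
    <book category="COOKING">
        <title lang="english">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
    </book>
</bookstore>"""

print(extract_text(sample))
```

Because the parser builds a tree first, malformed XML raises xml.etree.ElementTree.ParseError instead of silently producing odd output, which may or may not be what you want for user-supplied files.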

But if you just want to search files for some text on Linux, you can use grep:

burhan@sandbox:~$ grep "Harry Potter" file.xml
        <title lang="english">Harry Potter</title>

If you want to search in a file, use the grep command above, or open the file and search it in Python:

>>> import re
>>> exp = re.compile(r'<.*?>')
>>> with open('file.xml') as f:
...     # read the whole file at once, then strip the tags
...     text_only = exp.sub('', f.read()).strip()
...
>>> if 'Harry Potter' in text_only:
...     print('It exists')
... else:
...     print('It does not')
...
It exists
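
The question also asks where the text lies and mentions several documents with different structures. A minimal sketch of one way to do that, assuming the files are well-formed XML: walk every element with ElementTree's iter() and report the tag of each element whose text contains the word. The function names find_word and demo, the case-insensitive match, and the temporary file are assumptions for illustration, not part of the original answer.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def find_word(paths, word):
    """Yield (path, tag, text) for each element whose own text contains word."""
    needle = word.lower()
    for path in paths:
        root = ET.parse(path).getroot()
        # iter() visits every element in the tree, whatever it is called.
        for elem in root.iter():
            # Only elem.text is checked; text that follows a child element
            # (elem.tail) is ignored in this sketch.
            text = (elem.text or "").strip()
            if needle in text.lower():
                yield path, elem.tag, text


def demo():
    """Write a small sample file to a temporary directory and search it."""
    sample = ("<bookstore><book><title>Harry Potter</title>"
              "<author>J K. Rowling</author></book></bookstore>")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file.xml")
        with open(path, "w", encoding="utf-8") as f:
            f.write(sample)
        return [(tag, text) for _, tag, text in find_word([path], "harry")]


print(demo())
```

For the ten documents in the question, you would pass their paths as the first argument, for example a list built with glob.glob('*.xml').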
