Extract field list from reStructuredText


Problem description

Say I have the following reST input:

Some text ...

:foo: bar

Some text ...

What I would like to end up with is a dict like this:

{"foo": "bar"}

I tried using this:

tree = docutils.core.publish_parts(text)

It does parse the field list, but I end up with some pseudo-XML in tree["whole"]:

<document source="<string>">
    <docinfo>
        <field>
            <field_name>
                foo
            <field_body>
                <paragraph>
                    bar

Since the tree dict does not contain any other useful information, and tree["whole"] is just a string, I am not sure how to get the field list out of the reST document. How would I do that?

Recommended answer

You can try something like the following code. Rather than publish_parts, I have used publish_doctree, which returns the document tree itself instead of a rendered string. I then convert that tree to an XML DOM in order to extract all the field elements, and take the first field_name and field_body element of each field.

from docutils.core import publish_doctree

source = """Some text ...

:foo: bar

Some text ...
"""

# Parse reStructuredText input, returning the Docutils doctree as
# an `xml.dom.minidom.Document` instance.
doctree = publish_doctree(source).asdom()

# Get all field elements in the document.
fields = doctree.getElementsByTagName('field')

d = {}

for field in fields:
    # I am assuming that `getElementsByTagName` only returns one element.
    field_name = field.getElementsByTagName('field_name')[0]
    field_body = field.getElementsByTagName('field_body')[0]

    d[field_name.firstChild.nodeValue] = \
        " ".join(c.firstChild.nodeValue for c in field_body.childNodes)

print(d)  # Prints {'foo': 'bar'} on Python 3

The xml.dom module isn't the easiest to work with (why do I need .firstChild.nodeValue rather than just .nodeValue, for example?), so you may wish to use the xml.etree.ElementTree module, which I find a lot easier to work with. If you use lxml, you can also use XPath notation to find all of the field, field_name and field_body elements.
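The ElementTree suggestion above could look roughly like the sketch below. It is only an illustration, not the answer's own code: the helper name extract_fields and the SAMPLE_XML string are made up for this example, and SAMPLE_XML is a hand-written stand-in modeled on the pseudo-XML in the question rather than captured Docutils output. With Docutils installed, the XML text would instead come from publish_doctree(source).asdom().toxml().

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in (an assumption, not captured output) for the XML
# that `publish_doctree(source).asdom().toxml()` would produce.
SAMPLE_XML = """\
<document source="&lt;string&gt;">
    <paragraph>Some text ...</paragraph>
    <field_list>
        <field>
            <field_name>foo</field_name>
            <field_body>
                <paragraph>bar</paragraph>
            </field_body>
        </field>
    </field_list>
    <paragraph>Some text ...</paragraph>
</document>
"""


def extract_fields(xml_text):
    """Return a dict mapping each field name to its body text.

    `extract_fields` is a hypothetical helper written for this example.
    """
    root = ET.fromstring(xml_text)
    result = {}
    # iter() walks the whole tree, so fields are found whether they sit
    # under <field_list> or under <docinfo>.
    for field in root.iter("field"):
        name = field.find("field_name")
        body = field.find("field_body")
        if name is None or body is None:
            continue  # skip malformed fields rather than raising
        # itertext() also collects text inside nested inline elements;
        # split() + join() collapses indentation and line breaks.
        name_text = " ".join("".join(name.itertext()).split())
        body_text = " ".join("".join(body.itertext()).split())
        result[name_text] = body_text
    return result


print(extract_fields(SAMPLE_XML))
```

ElementTree's find() and findall() also accept a limited XPath subset (for example root.findall(".//field")), so lxml is only needed if fuller XPath support is required.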
