Extract field list from reStructuredText
Question
Say I have the following reST input:
Some text ...

:foo: bar

Some text ...
What I would like to end up with is a dict like this:
{"foo": "bar"}
I tried using this:
tree = docutils.core.publish_parts(text)
It does parse the field list, but I end up with some pseudo XML in tree["whole"]:
<document source="<string>">
    <docinfo>
        <field>
            <field_name>
                foo
            <field_body>
                <paragraph>
                    bar
Since the tree dict does not contain any other useful information and that is just a string, I am not sure how to parse the field list out of the reST document. How would I do that?
Recommended answer
You can try something like the following code. Rather than using the publish_parts method, I have used publish_doctree to get the Docutils document tree (the structure that the pseudo-XML output displays). I then convert it to an XML DOM in order to extract all the field elements, and take the first field_name and field_body element of each field element.
from docutils.core import publish_doctree

# Blank lines around the field list are required for reST to parse it as one.
source = """Some text ...

:foo: bar

Some text ...
"""

# Parse reStructuredText input, then convert the Docutils doctree to
# an `xml.dom.minidom.Document` instance.
doctree = publish_doctree(source).asdom()

# Get all field elements in the document.
fields = doctree.getElementsByTagName('field')

d = {}
for field in fields:
    # I am assuming that each field has exactly one name and one body,
    # so only the first match from `getElementsByTagName` is used.
    field_name = field.getElementsByTagName('field_name')[0]
    field_body = field.getElementsByTagName('field_body')[0]
    # Join the text of every child (paragraph) of the field body.
    d[field_name.firstChild.nodeValue] = \
        " ".join(c.firstChild.nodeValue for c in field_body.childNodes)

print(d)  # Prints {'foo': 'bar'}
The xml.dom module isn't the easiest to work with (why do I need to use .firstChild.nodeValue rather than just .nodeValue, for example?), so you may wish to use the xml.etree.ElementTree module, which I find a lot easier to work with. If you use lxml you can also use XPath notation to find all of the field, field_name and field_body elements.
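As a rough sketch of the xml.etree.ElementTree route mentioned above: the example below works on an XML string rather than on a DOM. SAMPLE_XML is a hand-written stand-in that I am assuming resembles what publish_doctree(source).asdom().toxml() would give for the input in the question, and extract_fields is a hypothetical helper name, not part of Docutils. ElementTree only supports a limited subset of XPath, but ".//field" is enough here.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in (an assumption) for the XML form of the doctree.
# With Docutils installed, publish_doctree(source).asdom().toxml() should
# produce a comparable string.
SAMPLE_XML = """<document source="&lt;string&gt;">
  <paragraph>Some text ...</paragraph>
  <field_list>
    <field>
      <field_name>foo</field_name>
      <field_body><paragraph>bar</paragraph></field_body>
    </field>
  </field_list>
  <paragraph>Some text ...</paragraph>
</document>"""


def extract_fields(xml_text):
    """Return a dict mapping each field name to its body text.

    Hypothetical helper for illustration; not a Docutils API.
    """
    root = ET.fromstring(xml_text)
    result = {}
    # ".//field" is ElementTree's limited XPath for "every <field> descendant".
    for field in root.iterfind(".//field"):
        name = field.findtext("field_name", default="").strip()
        body = field.find("field_body")
        if body is None:
            result[name] = ""
            continue
        # itertext() yields all nested text, so bodies with several
        # paragraphs are collected too; split/join normalizes whitespace.
        result[name] = " ".join("".join(body.itertext()).split())
    return result


print(extract_fields(SAMPLE_XML))
```

Because the lookup matches any field element regardless of its parent, the same helper should also cover the docinfo case shown in the question, where the fields sit under docinfo instead of field_list.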