Trying to use Python to parse a docx document in XML format to print words that are in bold
Problem description
I have a Word docx file and would like to print the words that are in bold. Looking through the document in XML format, it seems the words I want to print have the following attribute.
<w:r w:rsidRPr="00510F21">
<w:rPr><w:b/>
<w:noProof/>
<w:sz w:val="22"/>
<w:szCs w:val="22"/>
</w:rPr>
<w:t>Print this Sentence</w:t>
</w:r>
Specifically, the w:rsidRPr="00510F21" attribute, which seems to specify that the text is bold. Below is more of the XML document, to give a better idea of the structure.
<w:p w14:paraId="64E19BC3" w14:textId="4D8C930F" w:rsidR="00FF6AD1" w:rsidRDefault="00FF6AD1" w:rsidP="00C11B48">
<w:pPr>
<w:ind w:left="360" w:hanging="360"/>
<w:jc w:val="both"/>
<w:rPr>
<w:sz w:val="22"/>
<w:szCs w:val="22"/>
</w:rPr>
</w:pPr>
<w:r>
<w:rPr><w:b/>
<w:noProof/><w:sz w:val="22"/>
<w:szCs w:val="22"/>
</w:rPr><w:t xml:space="preserve">Some text</w:t>
</w:r>
<w:r w:rsidRPr="0009466D">
<w:rPr><w:i/><w:noProof/>
<w:sz w:val="22"/><w:szCs w:val="22"/>
</w:rPr>
<w:t>For example</w:t>
</w:r>
<w:r>
<w:rPr>
<w:noProof/>
<w:sz w:val="22"/>
<w:szCs w:val="22"/>
</w:rPr><w:t xml:space="preserve">
</w:t>
</w:r>
<w:r w:rsidRPr="00510F21">
<w:rPr>
<w:b/>
<w:noProof/>
<w:sz w:val="22"/>
<w:szCs w:val="22"/>
</w:rPr>
<w:t>Print this stuff</w:t>
</w:r>
After doing some research and trying to do this with the python-docx library, I decided to try lxml instead. I was getting an error about the namespace and tried adding that namespace, but now it returns an empty set. Below are some of the namespace declarations from the document.
<w:document
xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:mv="urn:schemas-microsoft-com:mac:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
mc:Ignorable="w14 w15 wp14">
Below is the code I'm using. Again, I'd like to print the text if the attribute is w:rsidRPr="00510F21".
from lxml import etree
root = etree.parse("document.xml")
namespaces = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
wr_roots = root.findall('w:r', namespaces)
print wr_roots # prints empty set
for atype in wr_roots:
    if w:rsidRPr == '00510F21':
        print(atype.get('w:t'))
Recommended answer
If you want to find all the bold text, you can use findall() with an XPath expression:
from lxml import etree
namespaces = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
root = etree.parse('document.xml').getroot()
for e in root.findall('.//w:r/w:rPr/w:b/../../w:t', namespaces):
    print(e.text)
Instead of looking for w:r nodes with w:rsidRPr="00510F21" as an attribute (which I am not convinced denotes bold text), look for run nodes (w:r) that have w:b in their run properties tag (w:rPr), and then access the text tag (w:t) within. The w:b tag is the bold property, as documented here.
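As a rough illustration of why the code in the question printed an empty list, and of the run-based selection described above, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra installs (the same findall() paths should behave the same way in lxml). The SAMPLE string is a hypothetical, trimmed-down version of the XML from the question, not the real document.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical sample, trimmed from the XML shown in the question.
SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">Some text</w:t></w:r>
      <w:r w:rsidRPr="0009466D"><w:rPr><w:i/></w:rPr><w:t>For example</w:t></w:r>
      <w:r w:rsidRPr="00510F21"><w:rPr><w:b/></w:rPr><w:t>Print this stuff</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""

root = ET.fromstring(SAMPLE)

# 'w:r' on its own only looks at direct children of <w:document>,
# and the runs sit deeper (document > body > p > r), so nothing matches.
direct_runs = root.findall("w:r", NS)

# './/w:r' searches all descendants and finds the three runs.
all_runs = root.findall(".//w:r", NS)

# Keep the text of runs whose run properties contain a <w:b> element.
bold_text = [
    run.findtext("w:t", default="", namespaces=NS)
    for run in all_runs
    if run.find("w:rPr/w:b", NS) is not None
]

print(len(direct_runs), len(all_runs))
print(bold_text)
```

Note that the first bold run in the sample has no w:rsidRPr attribute at all, which is consistent with the suspicion that w:rsidRPr is not what marks text as bold (as far as I know, it is a revision-tracking identifier).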
The XPath expression can be simplified to './/w:b/../../w:t', although this is less rigorous and might result in false matches.
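One kind of false match that a purely structural path can produce is a run where bold is explicitly switched off, for example <w:b w:val="0"/>. The sketch below is one possible way to filter those out. It assumes that "0", "false" and "off" are the values that disable the property, and it only looks at formatting set directly on the run, so bold inherited from a style would not be detected. The is_bold helper and the SAMPLE string are hypothetical names introduced for illustration, and the standard library's xml.etree.ElementTree is used instead of lxml to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Assumption: these w:val values mean "bold off".
OFF_VALUES = {"0", "false", "off"}


def is_bold(run):
    """Return True if the run's own properties switch bold on."""
    b = run.find("w:rPr/w:b", NS)
    if b is None:
        return False
    # Attributes are namespaced too, so the key is '{namespace-uri}val'.
    # A bare <w:b/> has no w:val and is treated as "on".
    return b.get(f"{{{W}}}val", "true") not in OFF_VALUES


# Hypothetical sample covering the three cases of interest.
SAMPLE = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p>
      <w:r><w:rPr><w:b/></w:rPr><w:t>on</w:t></w:r>
      <w:r><w:rPr><w:b w:val="0"/></w:rPr><w:t>off</w:t></w:r>
      <w:r><w:rPr><w:b w:val="true"/></w:rPr><w:t>also on</w:t></w:r>
      <w:r><w:t>plain</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""

root = ET.fromstring(SAMPLE)

# root.iter() takes the fully qualified tag name rather than a prefix map.
bold_text = [
    run.findtext("w:t", default="", namespaces=NS)
    for run in root.iter(f"{{{W}}}r")
    if is_bold(run)
]

print(bold_text)
```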