Trying to use Python to parse a docx document in XML format to print words that are in bold

Question

I have a Word docx file, and I would like to print the words that are in bold. Looking through the document in XML format, it seems the words I want to print have the following attribute.

<w:r w:rsidRPr="00510F21">
  <w:rPr><w:b/>
     <w:noProof/>
     <w:sz w:val="22"/>
     <w:szCs w:val="22"/>
  </w:rPr>
  <w:t>Print this Sentence</w:t>
</w:r>

Specifically, I believe the w:rsidRPr="00510F21" attribute is what specifies that the text is bold. Below is more of the XML document, to give a better idea of the structure.

<w:p w14:paraId="64E19BC3" w14:textId="4D8C930F" w:rsidR="00FF6AD1" w:rsidRDefault="00FF6AD1" w:rsidP="00C11B48">
<w:pPr>
   <w:ind w:left="360" w:hanging="360"/>
   <w:jc w:val="both"/>
   <w:rPr>
       <w:sz w:val="22"/>
       <w:szCs w:val="22"/>
   </w:rPr>
 </w:pPr>
 <w:r>
    <w:rPr><w:b/>
       <w:noProof/><w:sz w:val="22"/>
       <w:szCs w:val="22"/>
    </w:rPr><w:t xml:space="preserve">Some text</w:t>
 </w:r>
 <w:r w:rsidRPr="0009466D">
     <w:rPr><w:i/><w:noProof/>
          <w:sz w:val="22"/><w:szCs w:val="22"/>
     </w:rPr>
     <w:t>For example</w:t>
 </w:r>
 <w:r>
     <w:rPr>
        <w:noProof/>
        <w:sz w:val="22"/>
        <w:szCs w:val="22"/>
     </w:rPr><w:t xml:space="preserve">
     </w:t>
 </w:r>
 <w:r w:rsidRPr="00510F21">
     <w:rPr>
         <w:b/>
         <w:noProof/>
         <w:sz w:val="22"/>
         <w:szCs w:val="22"/>
     </w:rPr>
     <w:t>Print this stuff</w:t>
 </w:r>

After doing some research and trying to do this with the python-docx library, I decided to try lxml instead. I was getting an error about the namespace, so I tried adding that namespace, but now the search returns an empty list. Below are some of the namespace declarations from the document.

<w:document
xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" 
xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" 
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" 
xmlns:mv="urn:schemas-microsoft-com:mac:vml" 
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" 
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:w10="urn:schemas-microsoft-com:office:word" 
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" 
xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" 
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
mc:Ignorable="w14 w15 wp14">

Below is the code I'm using. Again, I'd like to print the text if the attribute is w:rsidRPr="00510F21".

from lxml import etree
root = etree.parse("document.xml")

namespaces = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

wr_roots = root.findall('w:r', namespaces)
print wr_roots # prints empty set

for atype in wr_roots:
   if w:rsidRPr == '00510F21':
       print(atype.get('w:t'))

Recommended answer

If you want to find all the bold text, you can use findall() with an XPath expression:

from lxml import etree

namespaces = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

root = etree.parse('document.xml').getroot()
for e in root.findall('.//w:r/w:rPr/w:b/../../w:t', namespaces):
    print(e.text)
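The snippet above assumes that document.xml has already been extracted. A .docx file is a zip archive whose main body is stored as word/document.xml, so the part can also be read directly. The following is only a sketch: it uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra dependencies, and load_document_root and make_demo_docx are hypothetical helper names. The in-memory archive is a stand-in for a real file and is not a complete Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NAMESPACES = {"w": W_NS}


def load_document_root(docx_file):
    """Parse word/document.xml from a .docx path or file-like object."""
    with zipfile.ZipFile(docx_file) as archive:
        with archive.open("word/document.xml") as part:
            return ET.parse(part).getroot()


def make_demo_docx():
    """Build a tiny in-memory archive for demonstration (hypothetical data)."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body><w:p>'
        "<w:r><w:rPr><w:b/></w:rPr><w:t>Print this stuff</w:t></w:r>"
        "<w:r><w:rPr><w:i/></w:rPr><w:t>For example</w:t></w:r>"
        "</w:p></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


root = load_document_root(make_demo_docx())
# Same expression as in the answer: runs with w:b in their run properties.
bold_texts = [
    t.text for t in root.findall(".//w:r/w:rPr/w:b/../../w:t", NAMESPACES)
]
print(bold_texts)
```

With a real file, load_document_root("some_file.docx") would take the place of the demo archive; the file name here is only a placeholder.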

Instead of looking for w:r nodes that have w:rsidRPr="00510F21" as an attribute (which I am not convinced denotes bold text), look for run nodes (w:r) that have w:b in their run properties tag (w:rPr), and then access the text tag (w:t) within. The w:b tag is the bold run property.
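The same idea can be written as an explicit loop, which also shows two likely reasons the code in the question found nothing: 'w:r' without a leading './/' only matches direct children of the root, and a namespaced attribute has to be looked up by its full {namespace}name form rather than 'w:rsidRPr'. This is a minimal sketch under stated assumptions: SAMPLE_XML is a hypothetical, trimmed-down stand-in for the real document.xml, bold_run_texts is an illustrative helper name, and the standard library's xml.etree.ElementTree is used in place of lxml so the block is self-contained.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NAMESPACES = {"w": W_NS}

# Hypothetical stand-in for document.xml, modelled on the excerpt in the question.
SAMPLE_XML = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p>
      <w:r><w:rPr><w:b/></w:rPr><w:t>Some text</w:t></w:r>
      <w:r w:rsidRPr="0009466D"><w:rPr><w:i/></w:rPr><w:t>For example</w:t></w:r>
      <w:r w:rsidRPr="00510F21"><w:rPr><w:b/></w:rPr><w:t>Print this stuff</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def bold_run_texts(root):
    """Return the text of every run whose run properties contain w:b."""
    texts = []
    for run in root.findall(".//w:r", NAMESPACES):  # ".//" searches all descendants
        if run.find("w:rPr/w:b", NAMESPACES) is None:
            continue  # no bold element in this run's properties
        texts.extend(t.text or "" for t in run.findall("w:t", NAMESPACES))
    return texts


root = ET.fromstring(SAMPLE_XML)

# Without ".//" only direct children of w:document are searched, so nothing matches.
direct_children = root.findall("w:r", NAMESPACES)

# Namespaced attributes are read with the "{namespace}name" form.
rsid_values = [
    run.get(f"{{{W_NS}}}rsidRPr") for run in root.findall(".//w:r", NAMESPACES)
]

bold_texts = bold_run_texts(root)
print(direct_children)
print(rsid_values)
print(bold_texts)
```

In this sample the first bold run carries no w:rsidRPr attribute at all, which is consistent with the answer's doubt that the attribute marks bold text.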

The XPath expression can be simplified to './/w:b/../../w:t', although this is less rigorous and might result in false matches.
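Going the other way, a stricter check may be worth considering. In WordprocessingML a w:b element can carry a w:val attribute, and a value such as "0" or "false" switches bold off, so the mere presence of w:b does not always mean the run is bold. The sketch below is an assumption-laden illustration, not part of the original answer: is_bold and OFF_VALUES are hypothetical names, the sample paragraph is invented, and bold inherited from paragraph or character styles is not considered at all.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NAMESPACES = {"w": W_NS}
OFF_VALUES = {"0", "false", "off"}  # w:val settings treated here as "not bold"

# Invented sample paragraph covering the cases of interest.
SAMPLE_XML = f"""<w:p xmlns:w="{W_NS}">
  <w:r><w:rPr><w:b/></w:rPr><w:t>Bold by default</w:t></w:r>
  <w:r><w:rPr><w:b w:val="1"/></w:rPr><w:t>Bold explicitly</w:t></w:r>
  <w:r><w:rPr><w:b w:val="0"/></w:rPr><w:t>Bold switched off</w:t></w:r>
  <w:r><w:t>No run properties</w:t></w:r>
</w:p>"""


def is_bold(run):
    """True if the run has a direct w:b property that is not switched off."""
    bold = run.find("w:rPr/w:b", NAMESPACES)
    if bold is None:
        return False
    # A missing w:val is treated as "on".
    return bold.get(f"{{{W_NS}}}val", "true") not in OFF_VALUES


paragraph = ET.fromstring(SAMPLE_XML)
bold_texts = [
    run.findtext("w:t", default="", namespaces=NAMESPACES)
    for run in paragraph.findall("w:r", NAMESPACES)
    if is_bold(run)
]
print(bold_texts)
```

The run with w:val="0" contains a w:b element, so both XPath expressions above would report it, while this check skips it.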
