Is there any way to read a .docx file, including auto numbering, using python-docx?
Problem description
Problem statement: Extract sections from a .docx file, including autonumbering.
I tried python-docx to extract text from a .docx file, but the extracted text excludes the autonumbering.
from docx import Document

document = Document("wadali.docx")

# Style-name prefixes of the paragraphs to keep.
STYLE_PREFIXES = ('Agt', 'TOC', 'Heading', 'Title', 'Table Normal', 'List')

def iter_items(paragraphs):
    """Yield the paragraphs whose style name starts with one of the prefixes."""
    for paragraph in paragraphs:
        # str.startswith accepts a tuple, so each paragraph is yielded at most once.
        if paragraph.style.name.startswith(STYLE_PREFIXES):
            yield paragraph

for item in iter_items(document.paragraphs):
    print(item.text)
Recommended answer
It appears that python-docx v0.8 does not currently fully support numbering, so you need to do some hacking.
First, for the demo, you need to write your own iterator to walk through the document paragraphs. Here is something functional:
import docx.document
import docx.oxml.table
import docx.oxml.text.paragraph
import docx.table
import docx.text.paragraph


def iter_paragraphs(parent, recursive=True):
    """
    Yield each paragraph within *parent*, in document order, including the
    paragraphs found in table cells when *recursive* is true.
    Each returned value is an instance of Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, docx.document.Document):
        parent_elm = parent.element.body
    elif isinstance(parent, docx.table._Cell):
        parent_elm = parent._tc
    else:
        raise TypeError(repr(type(parent)))

    for child in parent_elm.iterchildren():
        if isinstance(child, docx.oxml.text.paragraph.CT_P):
            yield docx.text.paragraph.Paragraph(child, parent)
        elif isinstance(child, docx.oxml.table.CT_Tbl):
            if recursive:
                # Descend into each cell of the table.
                table = docx.table.Table(child, parent)
                for row in table.rows:
                    for cell in row.cells:
                        for child_paragraph in iter_paragraphs(cell):
                            yield child_paragraph
You can use it to find all document paragraphs, including paragraphs in table cells.
For example:
import docx

document = docx.Document("sample.docx")
for paragraph in iter_paragraphs(document):
    print(paragraph.text)
To access the numbering property, you need to look in the "protected" members: paragraph._p.pPr.numPr is a docx.oxml.numbering.CT_NumPr object (or None when the paragraph is not numbered):
for paragraph in iter_paragraphs(document):
    p_pr = paragraph._p.pPr  # None when the paragraph has no properties element
    num_pr = p_pr.numPr if p_pr is not None else None
    if num_pr is not None:
        print(num_pr)  # type: docx.oxml.numbering.CT_NumPr
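The numPr element only says which list a paragraph belongs to (its numId) and at which level (its ilvl); the visible number still has to be computed. The sketch below shows one way to do that counting. It is not part of python-docx: number_labels is a hypothetical helper, it assumes every level is decimal and starts at 1, and it ignores the real formats defined in numbering.xml. The (numId, ilvl) pairs would come from num_pr.numId.val and num_pr.ilvl.val, either of which may be missing in a given document.

```python
from collections import defaultdict


def number_labels(items):
    """Return a dotted decimal label for each (num_id, ilvl) pair.

    Assumption: every level is decimal and starts at 1. The real format,
    start value and label text are defined in numbering.xml and are ignored.
    """
    counters = defaultdict(dict)  # num_id -> {ilvl: current count}
    labels = []
    for num_id, ilvl in items:
        levels = counters[num_id]
        levels[ilvl] = levels.get(ilvl, 0) + 1
        # Moving back to a shallower level restarts every deeper level.
        for deeper in [lvl for lvl in levels if lvl > ilvl]:
            del levels[deeper]
        # Levels that were skipped are treated as 1 (another assumption).
        labels.append(".".join(str(levels.get(lvl, 1)) for lvl in range(ilvl + 1)))
    return labels


# Hypothetical (numId, ilvl) pairs, in document order.
pairs = [(1, 0), (1, 1), (1, 1), (1, 0), (1, 1), (2, 0)]
print(number_labels(pairs))  # ['1', '1.1', '1.2', '2', '2.1', '1']
```

Each list (numId) keeps its own counters, which is why the last pair starts again at 1.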
Note that this object only references a numbering definition; the definitions themselves are stored in the numbering.xml file (inside the docx), if it exists.
To access it, you need to read your docx file like a package. For instance:
import docx.package
import docx.parts.document
import docx.parts.numbering

package = docx.package.Package.open("sample.docx")

main_document_part = package.main_document_part
assert isinstance(main_document_part, docx.parts.document.DocumentPart)

numbering_part = main_document_part.numbering_part
assert isinstance(numbering_part, docx.parts.numbering.NumberingPart)

ct_numbering = numbering_part._element
print(ct_numbering)  # CT_Numbering
for num in ct_numbering.num_lst:
    print(num)  # CT_Num
    print(num.abstractNumId)  # CT_DecimalNumber
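As an alternative to the private python-docx objects, the numbering definitions can be read with the standard library alone. The following is a minimal sketch, not a complete implementation: parse_numbering and SAMPLE are hypothetical names introduced here, the sample XML is made up for illustration, and level overrides (w:lvlOverride) are ignored. It maps each numId to the number format and label text of each level.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(element):
    """Return the w:val attribute of *element*, or None if it is missing."""
    return None if element is None else element.get("{%s}val" % W)


def parse_numbering(xml_bytes):
    """Map each numId to {ilvl: (numFmt, lvlText)} from numbering.xml content."""
    root = ET.fromstring(xml_bytes)
    abstract = {}
    for abstract_num in root.findall("w:abstractNum", NS):
        levels = {}
        for lvl in abstract_num.findall("w:lvl", NS):
            ilvl = int(lvl.get("{%s}ilvl" % W))
            levels[ilvl] = (_val(lvl.find("w:numFmt", NS)),
                            _val(lvl.find("w:lvlText", NS)))
        abstract[abstract_num.get("{%s}abstractNumId" % W)] = levels
    result = {}
    for num in root.findall("w:num", NS):
        # Each w:num points at the w:abstractNum that holds the level definitions.
        abstract_id = _val(num.find("w:abstractNumId", NS))
        result[int(num.get("{%s}numId" % W))] = abstract.get(abstract_id, {})
    return result


# Made-up numbering.xml content, for illustration only.
SAMPLE = b"""<w:numbering xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0"><w:start w:val="1"/><w:numFmt w:val="decimal"/><w:lvlText w:val="%1."/></w:lvl>
    <w:lvl w:ilvl="1"><w:start w:val="1"/><w:numFmt w:val="lowerLetter"/><w:lvlText w:val="%2)"/></w:lvl>
  </w:abstractNum>
  <w:num w:numId="1"><w:abstractNumId w:val="0"/></w:num>
</w:numbering>"""

print(parse_numbering(SAMPLE))
# {1: {0: ('decimal', '%1.'), 1: ('lowerLetter', '%2)')}}
```

For a real file, the bytes could be obtained with zipfile.ZipFile("sample.docx").read("word/numbering.xml"), assuming the part uses that conventional name; the authoritative location is given by the relationships of the main document part.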
More information is available in the Office Open XML documentation.