Extracting a Word document with the styles associated with its content
Question
I'm trying to extract the formatting of a Word document that contains text in different fonts and font sizes, images, comments, etc. I have used the zipfile module to extract the files that make up the Word document.
The files in the archive are:
['[Content_Types].xml',
'_rels/.rels',
'word/_rels/document.xml.rels',
'word/document.xml',
'word/footer2.xml',
'word/header1.xml',
'word/footer1.xml',
'word/endnotes.xml',
'word/footnotes.xml',
'word/_rels/header1.xml.rels',
'word/header2.xml',
'word/_rels/header2.xml.rels',
'word/embeddings/Microsoft_Word_97_-_2003_Document1.doc',
'word/media/image3.wmf',
'word/media/image2.emf',
'word/theme/theme1.xml',
'word/media/image1.png',
'word/embeddings/oleObject1.bin',
'word/comments.xml',
'word/settings.xml',
'word/styles.xml',
'customXml/itemProps1.xml',
'word/numbering.xml',
'customXml/_rels/item1.xml.rels',
'customXml/item1.xml',
'docProps/app.xml',
'word/stylesWithEffects.xml',
'word/webSettings.xml',
'word/fontTable.xml',
'docProps/core.xml',
'docProps/custom.xml']
I'm unable to understand the styles associated with the content present in word/document.xml.
I'm trying to encapsulate the results in the following manner:
{
"text": "some-text-in-document",
"font": "some-font",
"font_size": 10,
"some_field": "some-more-value",
...
}
I tried using python-docx to get the fonts and font sizes, but the value is mostly None.
Here is the code snippet:
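from docx import Document
from docx.enum.style import WD_STYLE_TYPE

# "example.docx" is a placeholder path; the original snippet did not show how document was created
document = Document("example.docx")
styles = document.styles
paragraph_styles = [s for s in styles if s.type == WD_STYLE_TYPE.PARAGRAPH]
# Print the font of every paragraph style that sets a font name explicitly
for style in paragraph_styles:
    if style.font.name:
        print(style.font.name, style.font.size)
# Print each run's text and the size set on its character style (often None)
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        print(run.text)
        font = run.style.font
        print(font.size)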
The results are mostly None for both font and size.
Recommended answer
A value of None for style means Normal.
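One way to turn those None values into usable data is to walk the inheritance chain yourself. The sketch below is not part of the original answer: effective_font and describe are hypothetical helper names, and it assumes the python-docx attributes run.font, run.style, paragraph.style and style.base_style. It checks the run's direct formatting first, then the run's character style and its ancestors, then the paragraph style and its ancestors. As far as I know, python-docx does not expose the document-wide defaults stored in styles.xml, so None can still come back when no style in the chain sets the value.

```python
def _walk_styles(style):
    """Yield a style, then each ancestor reached through base_style."""
    while style is not None:
        yield style
        style = style.base_style


def effective_font(run, paragraph):
    """Return (font_name, size_in_points) for a run, following inheritance.

    Precedence assumed here: direct run formatting, then the character
    style chain, then the paragraph style chain.
    """
    fonts = [run.font]
    fonts += [s.font for s in _walk_styles(run.style)]
    fonts += [s.font for s in _walk_styles(paragraph.style)]
    # Take the first explicitly set value for each property independently.
    name = next((f.name for f in fonts if f.name is not None), None)
    size = next((f.size for f in fonts if f.size is not None), None)
    return name, (size.pt if size is not None else None)


def describe(document):
    """Build one record per run, in the shape sketched in the question."""
    records = []
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            name, size = effective_font(run, paragraph)
            records.append({"text": run.text, "font": name, "font_size": size})
    return records
```

With python-docx installed, describe(Document("example.docx")) would return a list of such records, where the file name is again only a placeholder.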
All paragraphs have a style; it's just that most of them share the same one, so Word doesn't spell it out for that majority case, perhaps to save space.
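The same thing can be seen in the raw XML the question started from. The following is a rough sketch rather than the answerer's method, and paragraph_style_ids is a hypothetical helper name. It assumes that a paragraph names its style through a w:pStyle element inside w:pPr in word/document.xml, and that the default paragraph style is the one flagged w:default="1" in word/styles.xml. Paragraphs that omit w:pStyle are assigned that default.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by document.xml and styles.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_style_ids(docx):
    """Return (text, style_id) per paragraph, filling in the default style id."""
    with zipfile.ZipFile(docx) as archive:
        styles = ET.fromstring(archive.read("word/styles.xml"))
        document = ET.fromstring(archive.read("word/document.xml"))

    # Find the paragraph style marked as the default in styles.xml.
    default_id = None
    for style in styles.findall("w:style", NS):
        is_paragraph = style.get(f"{{{W}}}type") == "paragraph"
        is_default = style.get(f"{{{W}}}default") in ("1", "true")
        if is_paragraph and is_default:
            default_id = style.get(f"{{{W}}}styleId")
            break

    result = []
    for p in document.iter(f"{{{W}}}p"):
        p_style = p.find("w:pPr/w:pStyle", NS)
        # No w:pStyle element means Word left the default style implicit.
        style_id = p_style.get(f"{{{W}}}val") if p_style is not None else default_id
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        result.append((text, style_id))
    return result
```

The docx argument can be a path or a file-like object, since zipfile.ZipFile accepts both.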