Extracting a Word document with the styles associated with its content

Question

I'm trying to extract the formatting of a Word document that contains text in different fonts and font sizes, images, comments, etc. I used the zipfile module to extract the XML files of the Word document.

The files in the archive are:

['[Content_Types].xml',
 '_rels/.rels',
 'word/_rels/document.xml.rels',
 'word/document.xml',
 'word/footer2.xml',
 'word/header1.xml',
 'word/footer1.xml',
 'word/endnotes.xml',
 'word/footnotes.xml',
 'word/_rels/header1.xml.rels',
 'word/header2.xml',
 'word/_rels/header2.xml.rels',
 'word/embeddings/Microsoft_Word_97_-_2003_Document1.doc',
 'word/media/image3.wmf',
 'word/media/image2.emf',
 'word/theme/theme1.xml',
 'word/media/image1.png',
 'word/embeddings/oleObject1.bin',
 'word/comments.xml',
 'word/settings.xml',
 'word/styles.xml',
 'customXml/itemProps1.xml',
 'word/numbering.xml',
 'customXml/_rels/item1.xml.rels',
 'customXml/item1.xml',
 'docProps/app.xml',
 'word/stylesWithEffects.xml',
 'word/webSettings.xml',
 'word/fontTable.xml',
 'docProps/core.xml',
 'docProps/custom.xml']
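
A listing like the one above can be produced with the standard-library zipfile module, since a .docx file is a ZIP archive. The following is a minimal sketch; the helper names `list_docx_parts` and `read_docx_part` are hypothetical and not part of any library.

```python
import zipfile


def list_docx_parts(docx_file):
    """Return the names of all parts stored inside a .docx archive."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.namelist()


def read_docx_part(docx_file, part_name):
    """Return the raw bytes of one part, e.g. 'word/document.xml'."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read(part_name)
```

Both helpers accept a path or a binary file-like object. Reading 'word/document.xml' and 'word/styles.xml' this way is usually the starting point for relating content to styles.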

I'm unable to understand how the styles relate to the content in word/document.xml.

I'm trying to encapsulate the results in the following manner:

{
    "text": "some-text-in-document",
    "font": "some-font",
    "font_size": 10,
    "some_field": "some-more-value",
    ...
}

I tried using python-docx to get the fonts and font sizes, but the value is mostly None.

Here is the code snippet:

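from docx import Document
from docx.enum.style import WD_STYLE_TYPE

# "example.docx" is a placeholder path; the original snippet did not show how
# `document` was created.
document = Document("example.docx")

# Print the font of every paragraph style that names one explicitly.
styles = document.styles
paragraph_styles = [s for s in styles if s.type == WD_STYLE_TYPE.PARAGRAPH]
for style in paragraph_styles:
    if style.font.name:
        print(style.font.name, style.font.size)

# Print each run's text and the font size of its character style.
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        print(run.text)
        font = run.style.font
        print(font.size)  # None unless the size is set on this style itself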

The results are mostly None for font and size.

Recommended answer

A value of None for style means the default style, Normal.
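
In python-docx, None on a font attribute similarly means the value is not set at that level and is inherited from the style hierarchy. A rough sketch of resolving it by hand follows. It assumes the lookup order is direct run formatting, then the run's character style chain, then the paragraph's style chain, which simplifies Word's actual rules. The function names are hypothetical; only the `font`, `style` and `base_style` attributes come from python-docx.

```python
def inherited_font_value(style, attr):
    """Walk a style and its base_style ancestors; return the first non-None value."""
    while style is not None:
        value = getattr(style.font, attr)
        if value is not None:
            return value
        style = style.base_style
    return None


def effective_font_value(run, paragraph, attr):
    """Resolve a font attribute such as 'name' or 'size' for one run.

    Assumed lookup order (a simplification): direct run formatting,
    the run's character style chain, then the paragraph's style chain.
    """
    direct = getattr(run.font, attr)
    if direct is not None:
        return direct
    from_char_style = inherited_font_value(run.style, attr)
    if from_char_style is not None:
        return from_char_style
    return inherited_font_value(paragraph.style, attr)
```

With a loaded document this could be called as `effective_font_value(run, paragraph, "size")` inside the loop over `document.paragraphs` and `paragraph.runs`. The result may still be None when the value lives only in the document defaults or the theme, which this sketch does not consult.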

All paragraphs have a style; it's just that most share the same one, so Word doesn't spell it out for that majority case, perhaps to save space.
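
At the XML level, this means a paragraph in word/document.xml without a w:pStyle element uses the paragraph style marked w:default="1" in word/styles.xml. The sketch below, using only the standard library, shows one way to build the kind of dictionary asked for in the question. It is an illustration under stated assumptions, not a complete resolver: it reads only w:rFonts/@w:ascii and w:sz, follows w:basedOn for paragraph styles, and ignores character styles, document defaults, theme fonts, and runs nested inside hyperlinks. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w_attr(element, name):
    """Read a w:-namespaced attribute, or None if element or attribute is absent."""
    return None if element is None else element.get(f"{{{W}}}{name}")


def load_paragraph_styles(styles_root):
    """Map styleId -> style element and find the default paragraph style id."""
    styles, default_id = {}, None
    for style in styles_root.findall("w:style", NS):
        if w_attr(style, "type") != "paragraph":
            continue
        style_id = w_attr(style, "styleId")
        styles[style_id] = style
        if w_attr(style, "default") in ("1", "true", "on"):
            default_id = style_id
    return styles, default_id


def font_from_rpr(rpr):
    """Return (font name, size in points) from a w:rPr element; either may be None."""
    if rpr is None:
        return None, None
    name = w_attr(rpr.find("w:rFonts", NS), "ascii")
    half_points = w_attr(rpr.find("w:sz", NS), "val")  # w:sz is in half-points
    size = int(half_points) / 2 if half_points else None
    return name, size


def extract_runs(docx_file):
    """Yield one dict per run with its text, paragraph style id, font and size."""
    with zipfile.ZipFile(docx_file) as archive:
        document = ET.fromstring(archive.read("word/document.xml"))
        styles_root = ET.fromstring(archive.read("word/styles.xml"))
    styles, default_id = load_paragraph_styles(styles_root)
    for paragraph in document.iter(f"{{{W}}}p"):
        # No w:pStyle means the paragraph uses the default style (usually Normal).
        style_id = w_attr(paragraph.find("w:pPr/w:pStyle", NS), "val") or default_id
        for run in paragraph.findall("w:r", NS):
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            name, size = font_from_rpr(run.find("w:rPr", NS))
            current, seen = style_id, set()
            # Fill gaps by walking the style and its w:basedOn ancestors.
            while (name is None or size is None) and current in styles and current not in seen:
                seen.add(current)
                s_name, s_size = font_from_rpr(styles[current].find("w:rPr", NS))
                name = name or s_name
                size = size if size is not None else s_size
                current = w_attr(styles[current].find("w:basedOn", NS), "val")
            yield {"text": text, "style": style_id, "font": name, "font_size": size}
```

Calling `list(extract_runs(path))` on a .docx path or file-like object gives one dictionary per run. Fields that remain None were not found in the run or its paragraph style chain and would need the sources this sketch skips.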
