Extracting MS Word document formatting elements along with raw text information

Problem description

In this post, @mikemaccana describes how to use python-docx to extract raw text data from an MS Word document from within Python. I'd like to go one step further. Instead of simply extracting the raw text, can I also use this module to harvest information about font face (e.g. bold versus italic) or font size (e.g. 12 versus 18pt)? The closest I was able to come was this post asking about using this module to extract highlighted text entries.

That approach seemed a little abstract, and I'm not totally sure what's going on there. Is there a more straightforward way to extract formatting information from a Word doc in Python? By way of a quick document template:

Here the first line is a large header with one sentence.

The second line is slightly smaller. It also has two sentences.

Even smaller. But that's not all. This line has three sentences.

And finally here's a regular line of unbolded text.

If we call these four lines my Word document, I'd like to write a parsing function, call it doc_parser, that returns something like the following:

>>> doc_data = doc_parser(path_to_example_doc)
>>> print(doc_data)
[{'font': 18, 'face': 'bold', 'n_sentence': 1},
 {'font': 16, 'face': 'bold', 'n_sentence': 2},
 {'font': 14, 'face': 'bold', 'n_sentence': 3},
 {'font': 12, 'face': 'plain', 'n_sentence': 1}]

Solution

The character-level formatting ("font") properties are available at the run level. A paragraph is made up of runs, so you can get what you want by going down to that level, like:

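from docx import Document

document = Document(path_to_example_doc)  # path to the .docx file
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        font = run.font
        is_bold = font.bold      # True, False, or None (inherited)
        is_italic = font.italic  # same tri-state as bold
        size = font.size         # a Length or None; size.pt gives points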

The biggest problem you're likely to encounter with that is that the run only knows about formatting that's been directly applied to it. If it looks the way it does because a style has been applied to it, you would have to query the style (which also has a font object) to see what properties it has.
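As a rough illustration of that style lookup, here is a minimal sketch of the doc_parser function from the question. It assumes the first non-empty run is representative of each paragraph, counts sentences with a naive regular expression, and falls back from the run to its character style and then to the paragraph style by following base_style links. The helper names effective_font_value and describe_paragraph are made up for this example, and document-level defaults are not consulted, so a value can still come back as None.

```python
import re


def effective_font_value(run, paragraph, attr):
    """Return run.font.<attr>, falling back to the style chain when it is None."""
    # Formatting applied directly to the run wins.
    value = getattr(run.font, attr)
    if value is not None:
        return value
    # Otherwise check the run's character style, then the paragraph style,
    # following base_style links until some style sets the attribute.
    for style in (run.style, paragraph.style):
        while style is not None:
            value = getattr(style.font, attr)
            if value is not None:
                return value
            style = style.base_style
    return None  # still inherited from document defaults, not checked here


def describe_paragraph(paragraph):
    """Summarise one paragraph in the shape the question asks for."""
    runs = [r for r in paragraph.runs if r.text.strip()]
    if not runs:
        return None  # empty paragraph
    first = runs[0]  # assumption: the first non-empty run represents the line
    size = effective_font_value(first, paragraph, "size")
    bold = effective_font_value(first, paragraph, "bold")
    # Naive sentence count: split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.text.strip()) if s]
    return {
        "font": size.pt if size is not None else None,
        "face": "bold" if bold else "plain",
        "n_sentence": len(sentences),
    }


def doc_parser(path):
    """Hypothetical doc_parser from the question; requires python-docx."""
    from docx import Document  # imported lazily so the helpers stay dependency-free

    rows = (describe_paragraph(p) for p in Document(path).paragraphs)
    return [row for row in rows if row is not None]
```

Calling doc_parser(path_to_example_doc) should then return one dict per non-empty paragraph. Whether the sizes match what Word displays depends on how the document was formatted, since any value that comes only from the document defaults is reported as None here.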

Note that the python-docx that Mike was talking about is the legacy version which was completely rewritten after v0.2.0 (now 0.8.6). Docs are here: http://python-docx.readthedocs.org/en/latest/
