读取docx文件,识别并存储斜体文本 [英] Reading docx files, recognizing and storing italicized text
问题描述
我应该如何使用Python读取.docx文件并能够识别斜体文本并将其存储为字符串?
How should I go about reading a .docx file with Python and being able to recognize the italicized text and storing it as a string?
我看了看docx python软件包,但所看到的只是写入.docx文件的功能.
I looked at the docx python package but all I see is features for writing to a .docx file.
我非常感谢您的帮助
推荐答案
这是我的示例文档 TestDocument.docx
的样子.
Here's what my example document, TestDocument.docx
, looks like.
注意:斜体"一词在斜体中,但是强调"使用的样式是强调".
Note: The word "Italic" is in Italics, but "Emphasis" uses the style, Emphasis.
如果安装 python-docx
模块.这是一个相当简单的练习.
If you install the python-docx
module. This is a fairly simple exercise.
>>> from docx import Document
>>> document = Document('TestDocument.docx')
>>> for p in document.paragraphs:
... for run in p.runs:
... print run.text, run.italic, run.bold
...
Test Document None None
Italics True None
Emp None None
hasis None None
>>> [[run.text for run in p.runs if run.italic] for p in document.paragraphs]
[[], ['Italics'], []]
Run.italic
属性捕获文本是否设置为斜体格式,但是它不知道文本块是否具有以斜体呈现的样式,但是可以通过检查来检测Run.style.name(如果您知道文档中的样式以斜体显示.
The Run.italic
attribute captures whether the text is formatted as Italic, but it doesn't know if a text block has a Style that is rendered in Italic, but it can be detected by checking Run.style.name (if you know what styles in your document are rendered in Italics.
>>> [[run.text for run in p.runs if run.style.name=='Emphasis'] for p in document.paragraphs]
[[], [], ['Emp', 'hasis']]
这篇关于读取docx文件,识别并存储斜体文本的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!