读取docx文件,识别并存储斜体文本 [英] Reading docx files, recognizing and storing italicized text

查看:63
本文介绍了读取docx文件,识别并存储斜体文本的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我应该如何使用Python读取.docx文件并能够识别斜体文本并将其存储为字符串?

How should I go about reading a .docx file with Python and being able to recognize the italicized text and storing it as a string?

我看了看docx python软件包,但所看到的只是写入.docx文件的功能.

I looked at the docx python package but all I see is features for writing to a .docx file.

我非常感谢您的帮助

推荐答案

这是我的示例文档 TestDocument.docx 的样子.

Here's what my example document, TestDocument.docx, looks like.

注意:斜体"一词在斜体中,但是强调"使用的样式是强调".

Note: The word "Italic" is in Italics, but "Emphasis" uses the style, Emphasis.

如果安装 python-docx 模块.这是一个相当简单的练习.

If you install the python-docx module. This is a fairly simple exercise.

>>> from docx import Document
>>> document = Document('TestDocument.docx')
>>> for p in document.paragraphs:
...     for run in p.runs:
...             print run.text, run.italic, run.bold
... 
Test Document None None
Italics True None
Emp None None
hasis None None
>>> [[run.text for run in p.runs if run.italic] for p in document.paragraphs]
[[], ['Italics'], []]

Run.italic 属性捕获文本是否设置为斜体格式,但是它不知道文本块是否具有以斜体呈现的样式,但是可以通过检查来检测Run.style.name(如果您知道文档中的样式以斜体显示.

The Run.italic attribute captures whether the text is formatted as Italic, but it doesn't know if a text block has a Style that is rendered in Italic, but it can be detected by checking Run.style.name (if you know what styles in your document are rendered in Italics.

>>> [[run.text for run in p.runs if run.style.name=='Emphasis'] for p in document.paragraphs]
[[], [], ['Emp', 'hasis']]

这篇关于读取docx文件,识别并存储斜体文本的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆