Extracting headings' text from a Word doc

Question

I am trying to extract the text of headings (of any level) in an MS Word document (.docx file). I am currently trying to do this with python-docx, but after reading its documentation I still cannot tell whether it is even feasible (maybe I am mistaken).

I looked for solutions online but found nothing specific to my task. It would be great if someone could guide me here.

Recommended answer

The fundamental challenge is identifying heading paragraphs. As far as a reader is concerned, nothing stops an author from formatting a "regular" paragraph to look like (and serve as) a heading.

However, it's not uncommon for authors to reliably use styles to create headings, because doing so makes it possible to automatically compile those headings into a table of contents.

In that case, you can just iterate over the paragraphs, and pick out those with one of the heading styles.

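from docx import Document

def iter_headings(paragraphs):
    """Yield only the paragraphs whose style name starts with 'Heading'."""
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Heading'):
            yield paragraph

# 'example.docx' is a placeholder path, not part of the original answer.
document = Document('example.docx')
for heading in iter_headings(document.paragraphs):
    print(heading.text)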

Heading levels may be parsed from the full style name if they've kept the defaults (like 'Heading 1', 'Heading 2', ...).
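As a minimal sketch of that parsing step, assuming the default style names are unchanged, the level can be pulled out of the name with a regular expression. The helper name `heading_level` is made up for illustration and is not part of python-docx.

```python
import re

# Matches default built-in names such as "Heading 1", "Heading 2", ...
_HEADING_NAME = re.compile(r"^Heading (\d+)$")

def heading_level(style_name):
    """Return the level encoded in a default heading style name, else None."""
    match = _HEADING_NAME.match(style_name or "")
    return int(match.group(1)) if match else None
```

With python-docx this would be called as `heading_level(paragraph.style.name)`; any name that does not follow the default pattern, such as "Title" or a renamed style, gives `None`.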

This may need to be adjusted if the author has renamed the heading styles.
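One possible adjustment, sketched here under the assumption that the author's custom heading styles are based on the built-in ones, is to follow each style's `base_style` chain (a property python-docx exposes on paragraph styles) until a name starting with "Heading" turns up. The function name `is_heading_style` and the stand-in objects below are hypothetical; the stand-ins only mimic the two attributes the function reads.

```python
from types import SimpleNamespace

def is_heading_style(style, prefix="Heading"):
    """True if the style, or any style it is based on, has a name starting with prefix."""
    seen = set()
    while style is not None and id(style) not in seen:
        seen.add(id(style))  # guard against a circular base-style chain
        if (style.name or "").startswith(prefix):
            return True
        style = getattr(style, "base_style", None)
    return False

# Stand-in objects mimicking the .name and .base_style attributes of a style.
heading1 = SimpleNamespace(name="Heading 1", base_style=None)
chapter = SimpleNamespace(name="Chapter Title", base_style=heading1)
body = SimpleNamespace(name="Body Text", base_style=None)

print(is_heading_style(chapter))  # True
print(is_heading_style(body))     # False
```

This does not help when a heading-like style was created from scratch with no built-in heading style as its base.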

There are more sophisticated approaches that are more reliable (in that they don't depend on style names), but those don't have API support, so I expect you'd need to dig into the internal code and interact with some of the style XML directly.
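To give a rough idea of what working with the style XML could look like, here is a sketch that bypasses python-docx and reads the package with the standard library only. It is one possible reading of "style-name independent", not the answer's own method. It assumes heading styles are the ones that carry a `w:outlineLvl` setting, either directly or inherited through `w:basedOn`, and it ignores outline levels set directly on a paragraph. The function names are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def outline_levels(styles_root):
    """Map style id -> 0-based outline level, following basedOn inheritance."""
    own, parent = {}, {}
    for style in styles_root.iter(W + "style"):
        style_id = style.get(W + "styleId")
        lvl = style.find(f"{W}pPr/{W}outlineLvl")
        if lvl is not None:
            own[style_id] = int(lvl.get(W + "val"))
        based_on = style.find(W + "basedOn")
        if based_on is not None:
            parent[style_id] = based_on.get(W + "val")

    def resolve(style_id, seen=()):
        if style_id in own:
            return own[style_id]
        base = parent.get(style_id)
        if base is None or base in seen:  # no base style, or a cycle
            return None
        return resolve(base, seen + (style_id,))

    levels = {}
    for style_id in set(own) | set(parent):
        lvl = resolve(style_id)
        # Assumption: values 0-8 are heading levels; 9 is treated as body text.
        if lvl is not None and 0 <= lvl <= 8:
            levels[style_id] = lvl
    return levels

def iter_outline_headings(docx_path):
    """Yield (level, text) for paragraphs whose style carries an outline level."""
    with zipfile.ZipFile(docx_path) as archive:
        styles = ET.fromstring(archive.read("word/styles.xml"))
        body = ET.fromstring(archive.read("word/document.xml"))
    levels = outline_levels(styles)
    for para in body.iter(W + "p"):
        p_style = para.find(f"{W}pPr/{W}pStyle")
        style_id = p_style.get(W + "val") if p_style is not None else None
        if style_id in levels:
            text = "".join(t.text or "" for t in para.iter(W + "t"))
            yield levels[style_id] + 1, text  # report 1-based levels
```

Calling `list(iter_outline_headings(path))` on a .docx path (or a file-like object) would give `(level, text)` pairs regardless of what the heading styles are called.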
