Best way to extract text from a Word doc without using COM/automation?

Views: 103 | Published: 2020/5/13 1:18:49 | Tags: python, ms-word


Question

Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platform - that's non-negotiable in this case.)

Antiword seems like it might be a reasonable option, but it also appears to be abandoned.

A Python solution would be ideal, but doesn't appear to be available.

Recommended answer

I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have wrapped this in Python functions, so it is easy to use from the parsing system (which is written in Python).

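import subprocess

def doc_to_text_catdoc(filename):
    # os.popen3 no longer exists in Python 3; subprocess.run replaces it.
    # Passing the arguments as a list keeps the filename away from the shell,
    # so quotes or spaces in the name cannot break the command.
    result = subprocess.run(
        ["catdoc", "-w", filename],
        capture_output=True,
        text=True,
    )
    # As in the original version, any output on stderr counts as a failure.
    if result.stderr:
        raise OSError("Executing the command caused an error: %s" % result.stderr)
    return result.stdout

# similar doc_to_text_antiword()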

The -w switch to catdoc turns off line wrapping, BTW.
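The answer only hints at the antiword counterpart, so here is a rough sketch of how the two wrappers might share one helper and fall back from one tool to the other. The names run_extractor and doc_to_text and the candidates parameter are made up for illustration and are not part of the original answer. antiword is called without any options here, and a non-zero exit code is treated as an error in addition to stderr output, which is an assumption on my part.

```python
import shutil
import subprocess


def run_extractor(argv):
    """Run an external command and return its stdout as text.

    Hypothetical helper: raises OSError if the command exits with a
    non-zero status or writes anything to stderr.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0 or result.stderr:
        raise OSError(
            "Executing %r caused an error: %s" % (argv[0], result.stderr)
        )
    return result.stdout


def doc_to_text_antiword(filename):
    # Plain invocation; no antiword options are assumed here.
    return run_extractor(["antiword", filename])


def doc_to_text(filename, candidates=(("catdoc", "-w"), ("antiword",))):
    """Try each extractor in turn and return the first successful output."""
    for argv in candidates:
        # Skip tools that are not installed on this machine.
        if shutil.which(argv[0]) is None:
            continue
        try:
            return run_extractor([*argv, filename])
        except OSError:
            # This tool failed on the file; try the next candidate.
            continue
    raise OSError("No working extractor found for %s" % filename)
```

Calling doc_to_text("report.doc") would then try catdoc first and antiword second (the file name is just an example). If you care more about which output is easier to parse than about which tool is installed, reorder the candidates tuple.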
