RTF / doc / docx text extraction in a program written in C++/Qt

Question

I am writing a program in Qt/C++, and I need to read text from Microsoft Word (doc), RTF, and docx files.

I am looking for a command-line program that can do that extraction. It could be several programs.

The closest thing I have found is DocToText, but it has several bugs, so I can't use it. I also have Microsoft Word installed on the PC. Maybe there is some way to read the text through it (I have no idea how to use COM)?

Recommended answer

Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction from docx files. Obviously, to use this in a Qt program you'd have to spawn a process for it and read its output, but the command line I've hacked together is:

unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'

Here is what each stage does:

unzip -p file.docx: -p means "unzip to stdout"

grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)

sed 's/<[^<]*>//g': Remove the tags themselves, leaving only the text between them

grep -v '^[[:space:]]*$': Remove blank lines

There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.
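For anyone who would rather do the filtering inside the program than in grep and sed, the three text stages can be mirrored with the C++ standard library. This is only a sketch under assumptions: extractWordText is a hypothetical helper name, and it expects the document XML to be in memory already (for example, captured from the unzip step). It uses std::regex rather than a real XML parser, so it has the same limitations as the pipeline.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper (not part of the original answer): applies the same
// grep | sed | grep stages to XML text that is already held in memory.
std::string extractWordText(const std::string& xml) {
    static const std::regex tag("<[^<]*>");  // same pattern as the sed expression
    std::istringstream in(xml);
    std::string line;
    std::string out;
    while (std::getline(in, line)) {
        // grep '<w:t': keep only lines that contain a text element
        if (line.find("<w:t") == std::string::npos) {
            continue;
        }
        // sed 's/<[^<]*>//g': delete every tag, keep the text between tags
        const std::string text = std::regex_replace(line, tag, "");
        // grep -v '^[[:space:]]*$': drop lines that are empty or whitespace only
        const bool blank = std::all_of(
            text.begin(), text.end(),
            [](unsigned char c) { return std::isspace(c) != 0; });
        if (!blank) {
            out += text;
            out += '\n';
        }
    }
    return out;
}
```

Like the shell version, this works line by line, so a document whose XML sits on one long line comes back as one long line of text with no paragraph breaks.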

As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform, despite being a bit of an ugly hack ;)
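As a rough illustration of the "spawn a process" step, here is a minimal sketch that runs a shell command and captures its standard output. In a Qt program QProcess would be the more natural tool; this version uses POSIX popen only to stay free of dependencies. runAndCapture is a hypothetical helper name, and the sketch assumes the command string is trusted and already correctly quoted for the shell.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the original answer): runs a command
// through the shell and returns everything it wrote to stdout.
// popen/pclose are POSIX; the Windows CRT offers _popen/_pclose instead.
std::string runAndCapture(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (pipe == nullptr) {
        throw std::runtime_error("popen failed");
    }
    std::string output;
    std::array<char, 4096> buffer;
    std::size_t n = 0;
    // Read raw bytes until the child closes its end of the pipe.
    while ((n = std::fread(buffer.data(), 1, buffer.size(), pipe)) > 0) {
        output.append(buffer.data(), n);
    }
    pclose(pipe);
    return output;
}
```

Passing the whole pipeline from above as the command string would return the extracted text as a single std::string. A file name that comes from user input needs proper shell escaping before it is spliced into that string.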
