RTF / doc / docx text extraction in a program written in C++/Qt
Question
I am writing a program in Qt/C++, and I need to read text from Microsoft Word (.doc), RTF and .docx files.
I am looking for a command-line program that can do that extraction. It could be several programs.
The closest thing I have found is DocToText, but it has several bugs, so I can't use it. I also have Microsoft Word installed on the PC. Maybe there is some way to read the text using it (I have no idea how to use COM)?
Recommended answer
Now, this is pretty ugly and pretty hacky, but it seems to work for me for basic text extraction from .docx files. Obviously, to use this in a Qt program you'd have to spawn a process for it, etc., but the command line I've hacked together is:
unzip -p file.docx | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
So that's:
unzip -p file.docx: -p == "unzip to stdout"
grep '<w:t': Grab just the lines containing '<w:t' (<w:t> is the Word 2007 XML element for "text", as far as I can tell)
sed 's/<[^<]*>//g': Remove the tags themselves (everything from each '<' to its closing '>'), leaving only the text between them
grep -v '^[[:space:]]*$': Remove blank lines
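
If spawning grep and sed from the Qt program is unappealing, the filtering half of the pipeline (everything after unzip) could probably be done in-process instead. What follows is only a minimal sketch using the C++ standard library. It assumes the output of unzip -p file.docx has already been read into a std::string, for example from a spawned process. The function name extractDocxText is made up for illustration and is not part of Qt or any other library.

```cpp
#include <regex>
#include <sstream>
#include <string>

// Hypothetical helper that mirrors the three filter stages of the pipeline:
//   grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'
// Assumption: `xml` holds the raw output of `unzip -p file.docx`.
std::string extractDocxText(const std::string& xml) {
    // Same pattern as the sed expression: a '<', any run of non-'<', then '>'.
    static const std::regex tag("<[^<]*>");

    std::istringstream in(xml);
    std::string line;
    std::string result;

    while (std::getline(in, line)) {
        // grep '<w:t': keep only lines that contain a text element.
        if (line.find("<w:t") == std::string::npos) {
            continue;
        }
        // sed 's/<[^<]*>//g': delete every tag, keep what is between them.
        const std::string text = std::regex_replace(line, tag, "");

        // grep -v '^[[:space:]]*$': drop lines that are empty or whitespace only.
        if (text.find_first_not_of(" \t\r\n\v\f") == std::string::npos) {
            continue;
        }
        result += text;
        result += '\n';
    }
    return result;
}
```

Like the shell version, this works line by line and does not decode XML entities such as &amp;amp;, so it shares the same limitations as the original hack.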
There is likely a more efficient way to do this, but it seems to work for me on the few docs I've tested it with.
As far as I'm aware, unzip, grep and sed all have ports for Windows and any of the Unixes, so it should be reasonably cross-platform. Despite being a bit of an ugly hack ;)