How can I count words in complex documents (.rtf, .doc, .odt, etc.)?


Question

I'm trying to write a Python function that, given the path to a document file, returns the number of words in that document. This is fairly easy to do with .txt files, and there are tools that would let me hack together support for a few more complex document formats, but I want a really comprehensive solution.
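For reference, the easy .txt case might look like the minimal sketch below. The function name `count_words_txt` and the UTF-8 default are illustrative assumptions, and a "word" is taken to be any whitespace-separated token, which may not match how a word processor counts.

```python
def count_words_txt(path, encoding="utf-8"):
    """Count whitespace-separated tokens in a plain-text file."""
    with open(path, encoding=encoding) as handle:
        # str.split() with no arguments splits on any run of whitespace
        # and ignores leading/trailing whitespace, so blank lines add 0.
        return sum(len(line.split()) for line in handle)
```

This reads the file line by line, so it does not need to hold the whole document in memory.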

Looking at OpenOffice.org's py-uno scripting interface and its list of supported formats, it would seem ideal to load the documents in a headless OOo and call its word-count function. However, I can't find any py-uno tutorials or sample code that go beyond basic document generation, and even the code snippets I have found are half a decade out of date and no longer work.

Whether by using OOo and Uno or not, how can I get reliable word counts for documents of various formats?

Recommended answer

"load the documents in a headless OOo and call its word-count function"

PyODConverter is a recent (November 2009) script that uses OOo to convert between multiple file types. Judging by its source, it handles basic loading of every document format OOo supports.

This is how you start OOo as a headless service:

soffice -headless -accept="socket,host=127.0.0.1,port=8100;urp;" -nofirststartwizard

Then you just have to write a small bootstrapper that calls OOo on the command line, runs your script, then closes OOo.
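A rough sketch of such a bootstrapper is below. It is not taken from PyODConverter; the helper names (`soffice_command`, `uno_url`, `count_words`) and the retry settings are made up for illustration. It assumes that `soffice` is on the PATH, that the script runs under a Python that can import OOo's `uno` module, and that the loaded file opens as a Writer text document exposing the `WordCount` property. The flags, host, and port are copied from the command above.

```python
import subprocess
import time

HOST, PORT = "127.0.0.1", 8100  # same values as in the soffice command above


def soffice_command(host=HOST, port=PORT):
    """Build the headless soffice command line shown above."""
    return [
        "soffice",
        "-headless",
        f"-accept=socket,host={host},port={port};urp;",
        "-nofirststartwizard",
    ]


def uno_url(host=HOST, port=PORT):
    """UNO connection string matching the -accept option."""
    return f"uno:socket,host={host},port={port};urp;StarOffice.ComponentContext"


def count_words(path, retries=20, delay=0.5):
    """Start OOo, load `path`, read its word count, then stop OOo."""
    import os
    import uno  # provided by the OOo Python bridge; imported lazily
    from com.sun.star.beans import PropertyValue
    from com.sun.star.connection import NoConnectException

    office = subprocess.Popen(soffice_command())
    try:
        local_ctx = uno.getComponentContext()
        resolver = local_ctx.ServiceManager.createInstanceWithContext(
            "com.sun.star.bridge.UnoUrlResolver", local_ctx)
        # The office needs a moment to open its socket, so retry the connect.
        for _ in range(retries):
            try:
                ctx = resolver.resolve(uno_url())
                break
            except NoConnectException:
                time.sleep(delay)
        else:
            raise RuntimeError("could not connect to soffice")
        desktop = ctx.ServiceManager.createInstanceWithContext(
            "com.sun.star.frame.Desktop", ctx)
        hidden = PropertyValue()
        hidden.Name = "Hidden"
        hidden.Value = True
        doc = desktop.loadComponentFromURL(
            uno.systemPathToFileUrl(os.path.abspath(path)),
            "_blank", 0, (hidden,))
        try:
            # Assumption: the file loaded as a text document.
            return doc.WordCount
        finally:
            doc.close(True)
    finally:
        office.terminate()
        office.wait()
```

Calling `count_words("report.odt")` (a hypothetical file) would then return the count that OOo itself reports. On some platforms `soffice` is only a launcher, so terminating that process may not stop the actual office process; treat the shutdown step as a starting point rather than a guarantee.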
