Java: Apache POI: Can I get clean text from MS Word (.doc) files?

Problem description

The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can look at when I open the files with MS Word.

When using the following code:

File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());

the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' being numeric digits), "PAGEREF _Toc########## \h 4", etc.

The following code "fixes" the single-line problem, but maintains all the invalid characters and unwanted text:

File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){
  System.out.println(paragraph);
}


I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?

If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?

Recommended answer

There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).

The first option is to use WordExtractor, but pass each paragraph through stripFields(String) before using it. That removes the text-based fields embedded in the text, such as the HYPERLINK entries you've seen. Your code would become:

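// NPOIFSFileSystem comes from older POI releases; newer releases use POIFSFileSystem instead
NPOIFSFileSystem fs = new NPOIFSFileSystem(file);
WordExtractor extractor = new WordExtractor(fs.getRoot());

for (String rawText : extractor.getParagraphText()) {
  // stripFields is a static helper on WordExtractor
  String text = WordExtractor.stripFields(rawText);
  System.out.println(text);
}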
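
If you are curious what such field stripping involves, or need a filter of your own, here is a minimal sketch using only the JDK. It is not POI's implementation. It assumes the usual Word binary convention that a field starts with character 0x13, separates its instruction from its displayed result with 0x14, and ends with 0x15. The class name FieldFilter and the method stripFieldCodes are made up for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of Apache POI.
public class FieldFilter {
    // Assumed Word field marker characters (begin, separator, end).
    private static final char FIELD_BEGIN = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    /** Drops field instructions and markers, keeps field results; tolerates nesting. */
    public static String stripFieldCodes(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        // One entry per open field: true while we are still in its instruction part.
        Deque<Boolean> openFields = new ArrayDeque<>();
        int hiddenDepth = 0; // how many open fields are currently in their instruction part

        for (char c : raw.toCharArray()) {
            if (c == FIELD_BEGIN) {
                openFields.push(true);
                hiddenDepth++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from "instruction" to "result".
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(false);
                    hiddenDepth--;
                }
            } else if (c == FIELD_END) {
                // A field closed without a separator was still hiding its text.
                if (!openFields.isEmpty() && openFields.pop()) {
                    hiddenDepth--;
                }
            } else if (hiddenDepth == 0) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See page " + FIELD_BEGIN + " PAGEREF _Toc123 \\h "
                + FIELD_SEPARATOR + "4" + FIELD_END + " now";
        System.out.println(stripFieldCodes(raw)); // prints "See page 4 now"
    }
}
```

Other control characters that show up as 'boxes' (table cell marks, for example) would need similar handling; this sketch only deals with field markers.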

The other option is to use Apache Tika. Tika provides text extraction and metadata for a wide variety of file formats, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text from your Word document (you can also get XHTML if you'd rather), you'd do something like:

TikaConfig tika = TikaConfig.getDefaultConfig();
TikaInputStream stream = TikaInputStream.get(file);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
// Parse the stream opened above; the handler collects the body text
tika.getParser().parse(stream, handler, metadata, new ParseContext());
String text = handler.toString();
