Java: Apache POI: Can I get clean text from MS Word (.doc) files?

Question

The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I can look at when I open the files with MS Word.

When using the following code:

File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
HWPFDocument wordDoc = new HWPFDocument(inputStrm);
System.out.println(wordDoc.getText());

the output is a single line with many 'invalid' characters (yes, the 'boxes'), and many unwanted strings, like "FORMTEXT", "HYPERLINK \l "_Toc##########"" ('#' being numeric digits), "PAGEREF _Toc########## \h 4", etc.

The following code "fixes" the single-line problem, but maintains all the invalid characters and unwanted text:

File someFile = new File("some\\path\\MSWFile.doc");
InputStream inputStrm = new FileInputStream(someFile);
WordExtractor wordExtractor = new WordExtractor(inputStrm);
for(String paragraph:wordExtractor.getParagraphText()){
  System.out.println(paragraph);
}


I don't know if I'm using the wrong method for extracting the text, but that's what I've come up with when looking at POI's quick-guide. If I am, what is the correct approach?

If that output is correct, is there a standard way for getting rid of the unwanted text, or will I have to write a filter of my own?

Recommended answer

There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).

The first option is to use WordExtractor, but pass each paragraph through stripFields(String) (http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html#stripFields%28java.lang.String%29). That removes the text-based fields embedded in the text, such as the HYPERLINK entries you've seen. Your code would become:

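// "file" is the java.io.File pointing at the .doc document
NPOIFSFileSystem fs = new NPOIFSFileSystem(file); // newer POI releases use POIFSFileSystem here
WordExtractor extractor = new WordExtractor(fs.getRoot());

for (String rawText : extractor.getParagraphText()) {
    // stripFields is a static method; it drops field codes such as HYPERLINK and PAGEREF
    String text = WordExtractor.stripFields(rawText);
    System.out.println(text);
}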
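
To make it clearer what "stripping fields" means, here is a minimal, self-contained sketch of such a filter in plain Java. It is not POI's implementation. It assumes the usual Word field layout, where a field starts with the control character 0x13, optionally switches from its instruction (for example HYPERLINK \l "_Toc...") to its visible result at 0x14, and ends at 0x15. The class name FieldStripper and its strip method are made up for this illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative only: drops Word field instructions and keeps field results. */
public final class FieldStripper {
    // Field marks as assumed for this sketch (begin, separator, end).
    private static final char FIELD_BEGIN = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    private FieldStripper() {}

    public static String strip(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        // One entry per open field: true while we are still inside its instruction part.
        Deque<Boolean> openFields = new ArrayDeque<>();
        int hidden = 0; // number of open fields that are currently in their instruction part

        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from instruction to result text.
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                    hidden--;
                }
            } else if (c == FIELD_END) {
                // A field without a separator has no result, so all of it is dropped.
                if (!openFields.isEmpty() && openFields.pop()) {
                    hidden--;
                }
            } else if (hidden == 0) {
                out.append(c); // ordinary text, or the visible result of a field
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See " + FIELD_BEGIN + " HYPERLINK \\l \"_Toc1\" "
                + FIELD_SEPARATOR + "Intro" + FIELD_END + " now";
        System.out.println(strip(raw)); // prints "See Intro now"
    }
}
```

The stack is there because fields can nest, for instance a PAGEREF inside the result of a HYPERLINK in a table of contents. For real documents, the library's own stripFields is the safer choice; the sketch only shows the idea behind it.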

The other option is to use Apache Tika. Tika provides text extraction and metadata for a wide variety of file types, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text from your Word document (you can also get XHTML if you'd rather), you'd do something like:

TikaConfig tika = TikaConfig.getDefaultConfig();
// "file" is the java.io.File pointing at the document
TikaInputStream stream = TikaInputStream.get(file);
ContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
// Parse the stream opened above; the handler collects the body text
tika.getParser().parse(stream, handler, metadata, new ParseContext());
String text = handler.toString();
