Apache Tika: remove extra line breaks in result string
Problem description
I have an HTML file:
<html><head></head><body><div style="font-family: Verdana;font-size: 12.0px;">
<div>Test message.</div>
<div> </div>
<div>More content here...</div>
<div> </div>
<div>Best regards,</div>
<div>Mr. Crowley</div></div></body></html>
I try to get the content of the file above using Apache Tika...
final InputStream input = new FileInputStream("file.html");
final ContentHandler handler = new BodyContentHandler();
final Metadata metadata = new Metadata();
final HtmlParser htmlParser = new HtmlParser();
htmlParser.parse(input, handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);
...and all is fine except for the extra line breaks:
Test message.
More content here...
Best regards,
Mr. Crowley
<and 3 empty lines here>
Is it possible to avoid this behavior? Is it possible to get a result closer to what I expect:
Test message.
More content here...
Best regards,
Mr. Crowley
?
Code constructs like
plainText = plainText.replaceAll("(\n)+", "\n");
are unfortunately not an option for me. I also can't change the structure of my HTML file.
Recommended answer
One solution is to implement a custom ContentHandler that does not write those new lines (new lines from the original document are still kept):
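import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class OriginalBodyContentHandler extends BodyContentHandler {
    @Override
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        // Intentionally empty: drops the extra new lines generated by
        // XHTMLContentHandler, which arrive as ignorableWhitespace events.
    }
}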
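To see why overriding ignorableWhitespace is enough, the mechanism can be sketched with the JDK's SAX classes alone. This is an illustration only, not Tika code: the names IgnorableWhitespaceDemo, TextCollector and replay are hypothetical, and the event sequence is hand-written under the assumption that each block of text is followed by one layout newline delivered as an ignorableWhitespace event.

```java
import org.xml.sax.helpers.DefaultHandler;

public class IgnorableWhitespaceDemo {

    /** Hypothetical handler: collects text and optionally drops ignorableWhitespace events. */
    static class TextCollector extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        private final boolean keepIgnorable;

        TextCollector(boolean keepIgnorable) {
            this.keepIgnorable = keepIgnorable;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Real document text, including its own line breaks, is always kept.
            out.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            // Layout-only whitespace is written only when requested.
            if (keepIgnorable) {
                out.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Replays a hand-written event sequence: each block's text, then one layout newline. */
    public static String replay(boolean keepIgnorable, String... blocks) {
        TextCollector handler = new TextCollector(keepIgnorable);
        char[] newline = {'\n'};
        for (String block : blocks) {
            char[] text = block.toCharArray();
            handler.characters(text, 0, text.length);
            handler.ignorableWhitespace(newline, 0, newline.length);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        String[] blocks = {"Test message.\n", "More content here...\n"};
        System.out.print(replay(true, blocks));   // layout newlines kept
        System.out.print(replay(false, blocks));  // layout newlines dropped
    }
}
```

With Tika on the classpath, the answer's handler should slot into the question's code by creating new OriginalBodyContentHandler() where it currently creates new BodyContentHandler().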