Using boilerpipe to extract non-English articles

Views: 357. Published: 2018/6/20 15:22:58. Tags: java, html, text-extraction

Problem description

I am trying to use the boilerpipe Java library to extract news articles from a set of websites. It works great for English text, but in text with special characters, for example words with accent marks (história), those characters are not extracted correctly. I think it is an encoding problem.
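The symptom described here is typical of bytes being decoded with the wrong charset. As a minimal sketch that uses only the JDK (it does not call boilerpipe, and the class name `MojibakeDemo` is made up for illustration), the following shows how the same UTF-8 bytes come out garbled when read as ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1.
    // Every two-byte UTF-8 sequence turns into two unrelated characters.
    static String decodeWrongly(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encodes and decodes with the same charset, so the text survives intact.
    static String decodeCorrectly(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "hist\u00f3ria"; // "história", written with an escape to avoid source-encoding issues
        System.out.println(decodeWrongly(word));
        System.out.println(decodeCorrectly(word));
    }
}
```

If the extracted text shows this kind of two-character substitution for each accented letter, the page was most likely decoded with a charset other than the one it was written in.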

The boilerpipe FAQ says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in that paper.

My question is: are there any parameters in boilerpipe where I can specify the encoding? Is there any way to work around this and get the text correctly?

How I'm using the library.

First attempt, based on the URL:

    URL url = new URL(link);
    String article = ArticleExtractor.INSTANCE.getText(url);

Second attempt, on the HTML source code:

    String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);
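In the second attempt the caller builds the HTML string, so the charset used to decode the downloaded bytes is under the caller's control. One possible approach, sketched below with JDK classes only, is to take the charset from the HTTP `Content-Type` header and fall back to UTF-8 when none is declared. The class `CharsetSniffer` and its methods are hypothetical helpers, not part of boilerpipe, and the sketch does not inspect `<meta>` tags inside the page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetSniffer {
    // Matches "charset=..." in a header value such as "text/html; charset=ISO-8859-1".
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header, or the fallback if none is usable.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    // Decodes the raw response body with the declared charset (UTF-8 by default).
    static String decode(byte[] body, String contentType) {
        return new String(body, fromContentType(contentType, StandardCharsets.UTF_8));
    }
}
```

The string returned by `decode` could then be passed to `ArticleExtractor.INSTANCE.getText(...)` as in the second attempt above.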

Solution
OK, I got a solution. As Andrei said, I had to change the class HTMLFetcher, which is in the package de.l3s.boilerpipe.sax. What I did was convert all the fetched text to UTF-8. At the end of the fetch function, I had to add two lines and change the last one:

    final byte[] data = bos.toByteArray(); // stays the same
    byte[] utf8 = new String(data, cs.displayName()).getBytes("UTF-8"); // new line (conversion)
    cs = Charset.forName("UTF-8"); // new line: set the charset to UTF-8
    return new HTMLDocument(utf8, cs); // edited line
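The conversion step in that fix can be tried in isolation. The sketch below is not boilerpipe code: `Utf8Transcoder` is a hypothetical helper that decodes the bytes with the source charset and re-encodes them as UTF-8, which is what the two added lines do. It assumes the source charset was detected correctly. If that charset is wrong, re-encoding as UTF-8 does not repair the text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8Transcoder {
    // Decodes the bytes using the source charset, then re-encodes them as UTF-8.
    static byte[] toUtf8(byte[] data, Charset source) {
        return new String(data, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "história" encoded as ISO-8859-1: one byte per character.
        byte[] latin1 = "hist\u00f3ria".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        // After transcoding, the accented letter occupies two bytes.
        System.out.println(latin1.length + " bytes -> " + utf8.length + " bytes");
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```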
