Using boilerpipe to extract non-English articles


Problem description

I am trying to use the boilerpipe Java library to extract news articles from a set of websites. It works great for English text, but for text with special characters, for example words with accent marks (história), the special characters are not extracted correctly. I think it is an encoding problem.

The boilerpipe FAQ says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in that paper.

My question is: are there any parameters in boilerpipe that let me specify the encoding? Is there any workaround to get the text extracted correctly?

How I'm using the library (first attempt, based on the URL):

URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);

(second attempt, based on the HTML source code):

String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);

Solution

OK, I got a solution. As Andrei said, I had to change the class HTMLFetcher, which is in the package de.l3s.boilerpipe.sax. What I did was convert all the fetched text to UTF-8. At the end of the fetch function, I had to add two lines and change the last one:

final byte[] data = bos.toByteArray(); // stays the same
byte[] utf8 = new String(data, cs.displayName()).getBytes("UTF-8"); // new line (conversion to UTF-8)
cs = Charset.forName("UTF-8"); // new line: set the charset to UTF-8
return new HTMLDocument(utf8, cs); // edited line
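
The conversion step above can also be tried outside the library. The following is only a sketch using the standard Java charset classes; the class `Utf8Sketch` and both method names are hypothetical and not part of boilerpipe. It assumes you already have the raw bytes of the page and either know the charset it was served in or can read it from the HTTP Content-Type header.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; it is not part of boilerpipe.
final class Utf8Sketch {

    private Utf8Sketch() {}

    /**
     * Decodes the bytes using the charset the page was served in,
     * then re-encodes the resulting text as UTF-8.
     */
    static byte[] toUtf8(byte[] data, Charset source) {
        String decoded = new String(data, source);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Picks a charset from a Content-Type header value such as
     * "text/html; charset=ISO-8859-1". Returns the fallback when the
     * header is null, carries no charset parameter, or names a charset
     * this JVM does not accept.
     */
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for a "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

With the charset known, `new String(bytes, charset)` yields a correctly decoded page that could then be passed to `ArticleExtractor.INSTANCE.getText(String)` as in the question's second attempt. Whether that alone fixes the extraction depends on how the HTML string was originally produced.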
