Using boilerpipe to extract non-English articles
Problem description
I am trying to use the boilerpipe Java library to extract news articles from a set of websites. It works great for texts in English, but for text with special characters, for example words with accent marks (história), these special characters are not extracted correctly. I think it is an encoding problem.
In the boilerpipe FAQ, it says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in this paper.
My question is: are there any parameters in boilerpipe where I can specify the encoding? Is there any way to work around this and get the text correctly?
How I'm using the library (first attempt, based on the URL):
URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);
(second attempt, from the HTML source code):
String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);
OK, got a solution. As Andrei said, I had to change the class HTMLFetcher, which is in the package de.l3s.boilerpipe.sax. What I did was convert all the fetched text to UTF-8. At the end of the fetch function, I had to add two lines and change the last one:
final byte[] data = bos.toByteArray(); // stays the same
byte[] utf8 = new String(data, cs).getBytes(Charset.forName("UTF-8")); // new line: decode with the detected charset, re-encode as UTF-8
cs = Charset.forName("UTF-8"); // new line: the bytes are now UTF-8
return new HTMLDocument(utf8, cs); // edited line: pass the converted bytes
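The conversion step above can be tried on its own, without boilerpipe. The following is a minimal sketch using only the JDK; the class name Utf8Transcoder and the method toUtf8 are hypothetical names chosen for illustration and are not part of boilerpipe. It assumes the source charset is already known (in HTMLFetcher that would be the value held in cs).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of boilerpipe.
final class Utf8Transcoder {
    private Utf8Transcoder() {}

    /** Decodes the bytes with the given source charset and re-encodes them as UTF-8. */
    static byte[] toUtf8(byte[] data, Charset source) {
        return new String(data, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "história" encoded as ISO-8859-1: one byte per character.
        byte[] latin1 = "hist\u00f3ria".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);

        // The accented letter takes two bytes in UTF-8, so the array grows by one.
        System.out.println(latin1.length + " -> " + utf8.length); // prints "8 -> 9"

        // Decoding the result as UTF-8 gives back the original word.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals("hist\u00f3ria")); // prints "true"
    }
}
```

Note that this only helps when the source charset is detected correctly. If the page is decoded with the wrong charset in the first place, re-encoding to UTF-8 preserves the garbled characters rather than repairing them.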