How can I extract only the main textual content from an HTML page?
Update
Boilerpipe appears to work really well, but I realized that I don't need only the main content: many pages don't have an article at all, just links to the full texts with a short description of each (this is common on news portals), and I don't want to discard these short texts.
So if an API does this, that is, returns the different textual parts/blocks separately rather than merged into a single text (everything in one text is not useful), please let me know.
The Question
I have downloaded some pages from random sites, and now I want to analyze the textual content of each page.
The problem is that a web page has a lot of other content, such as menus, advertising, banners, etc.
I want to try to exclude everything that is not related to the content of the page.
Taking this page as an example, I don't want the menus at the top or the links in the footer.
Important: All pages are HTML and come from many different sites. I need suggestions on how to exclude this content.
At the moment, I am thinking of excluding content inside "menu" and "banner" classes in the HTML, as well as runs of consecutive words that look like proper names (first letter capitalized).
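The second part of that idea, flagging runs of consecutive capitalized words, could be prototyped roughly as follows. This is only a sketch using the standard `java.util.regex` package; the class name `CapitalizedRunFinder` and the threshold of two or more words are assumptions made for illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CapitalizedRunFinder {
    // Assumption: a "run" is two or more words that each start with an
    // uppercase letter, separated only by spaces or tabs. The lookbehind
    // stops a match from starting in the middle of a word (e.g. "iPhone").
    private static final Pattern RUN =
        Pattern.compile("(?<!\\p{L})\\p{Lu}\\p{L}*(?:[ \\t]+\\p{Lu}\\p{L}*)+");

    /** Returns every run of consecutive capitalized words found in the text. */
    public static List<String> findRuns(String text) {
        List<String> runs = new ArrayList<>();
        Matcher m = RUN.matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        String text = "Home News Sports Contact the editor about this story.";
        // prints [Home News Sports Contact]
        System.out.println(findRuns(text));
    }
}
```

A heuristic like this will also flag legitimate content, for example a sentence that opens with a capitalized word followed by a person's name, so it is probably better used as one signal among several than as a filter on its own.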
The solution can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).
Edit: I want to do this inside my Java code, not with an external application (if possible).
I tried parsing the HTML content in the way described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
There are a few ways to feed HTML into Boilerpipe and extract the text.
You can use a URL:
String text = ArticleExtractor.INSTANCE.getText(url); // url is a java.net.URL
You can use a String:
String text = ArticleExtractor.INSTANCE.getText(myHtml); // myHtml holds the page's HTML source
You can also pass a Reader, which lets the HTML come from any source you can read characters from, such as a file or a stream.
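As a rough illustration of the Reader route, the sketch below uses only the standard library to wrap a raw byte stream in a `Reader` with an explicit charset. The class and method names (`HtmlReaderSource`, `open`, `readAll`) are hypothetical, and the Boilerpipe call appears only in a comment so that the example runs without the library.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HtmlReaderSource {
    /** Wraps a raw byte stream in a Reader that decodes with the given charset. */
    public static Reader open(InputStream in, Charset charset) {
        return new BufferedReader(new InputStreamReader(in, charset));
    }

    /** Drains a Reader into a String; used here only to show what the Reader yields. */
    public static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Simulated download: a page encoded as ISO-8859-1 rather than UTF-8.
        byte[] bytes = "<html><body><p>Caf\u00e9 menu</p></body></html>"
            .getBytes(StandardCharsets.ISO_8859_1);
        try (Reader reader = open(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1)) {
            // With Boilerpipe on the classpath, this Reader could be handed to
            // ArticleExtractor.INSTANCE.getText(reader) instead of being drained here.
            System.out.println(readAll(reader));
        }
    }
}
```

Choosing the charset yourself can matter for pages downloaded from many different sites, since they will not all use the same encoding.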