How can I extract only the main textual content from an HTML page?



Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content: many pages don't have an article, only links with short descriptions pointing to the full texts (this is common on news portals), and I don't want to discard these short texts.

So if an API can do this, that is, return the different textual parts or blocks separately instead of merged into a single text (everything in one text is not useful), please let me know.


The Question

I have downloaded some pages from random sites, and now I want to analyze their textual content.

The problem is that a web page has a lot of content such as menus, advertising, banners, etc.

I want to try to exclude everything that is not related to the content of the page.

Taking this page as an example, I want neither the menus at the top nor the links in the footer.

Important: All pages are HTML and come from various different sites. I need suggestions on how to exclude this content.

At the moment, I am thinking of excluding content inside "menu" and "banner" classes from the HTML, as well as consecutive words that look like proper names (initial capital letter).
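
A minimal sketch of the second idea (runs of capitalized words), using only the Java standard library. The class name ProperNameRuns and the rule "two or more consecutive capitalized words" are assumptions made for illustration. The heuristic also flags ordinary phrases, such as a capitalized sentence opener followed by a name, so it is a starting point at best.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds runs of two or more capitalized words.
final class ProperNameRuns {
    // One capitalized word, followed by at least one more capitalized word,
    // each separated by a single space.
    private static final Pattern RUN =
        Pattern.compile("\\b\\p{Lu}\\p{L}*(?: \\p{Lu}\\p{L}*)+\\b");

    // Returns every matching run in the order it appears in the text.
    static List<String> find(String text) {
        List<String> runs = new ArrayList<>();
        Matcher m = RUN.matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }
}
```

Calling ProperNameRuns.find(text) on the tag-free text of a page yields candidate runs that could then be dropped or inspected.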

The solution can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).

Edit: I want to do this inside my Java code, not in an external application (if possible).

I tried the approach to parsing HTML content described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering

Solution

Take a look at Boilerpipe. It is designed to do exactly what you're looking for: removing the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

There are a few ways to feed HTML into Boilerpipe and extract the text.

You can use a URL:

ArticleExtractor.INSTANCE.getText(url);

You can use a String:

ArticleExtractor.INSTANCE.getText(myHtml);

You can also pass a Reader, which opens up many other input sources.
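
For the case raised in the update, where each text block should be kept separately rather than merged into one string, here is a rough standard-library sketch. It is not Boilerpipe's API. The class name TextBlocks and the list of block-level tags are assumptions, and regex-based tag handling is fragile on real-world markup, so a proper HTML parser such as Jsoup would be a sturdier base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: splits HTML into separate plain-text blocks.
final class TextBlocks {
    // Script and style elements carry no readable text, so drop them whole.
    private static final Pattern DROP =
        Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Assumed set of tags that start or end a visual block.
    private static final Pattern BLOCK_TAG = Pattern.compile(
        "(?i)</?(?:p|div|li|ul|ol|h[1-6]|tr|td|section|article|header|footer|nav|br)\\b[^>]*>");
    // Any remaining (inline) tag, for example <a> or <b>.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    // Returns the non-empty text blocks in document order.
    // Character entities such as &amp; are not decoded here.
    static List<String> split(String html) {
        String cleaned = DROP.matcher(html).replaceAll(" ");
        List<String> blocks = new ArrayList<>();
        for (String chunk : BLOCK_TAG.split(cleaned)) {
            String text = ANY_TAG.matcher(chunk).replaceAll(" ")
                                 .replaceAll("\\s+", " ")
                                 .trim();
            if (!text.isEmpty()) {
                blocks.add(text);
            }
        }
        return blocks;
    }
}
```

Each returned string could then be scored on its own (for example by length, or by how much of it is link text) instead of the page being accepted or rejected as a single text.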
