How can I extract only the main textual content from an HTML page?

Views: 111 | Posted: 2018/6/19 19:49:11 | Tags: java html information-retrieval jsoup


Update

Boilerpipe appears to work really well, but I realized that I don't need only the main content: many pages don't have an article at all, just links with short descriptions pointing to the full texts (this is common on news portals), and I don't want to discard these short texts.

So if an API can do this, that is, return the different textual parts/blocks separately rather than as one single text (everything merged into one text is not useful), please let me know.

The Question

I downloaded some pages from random sites, and now I want to analyze the textual content of those pages.

The problem is that a web page has a lot of content such as menus, advertising, banners, etc.

I want to try to exclude everything that is not related to the content of the page.

Taking this page as an example, I want neither the menus at the top nor the links in the footer.

Important: All pages are HTML and come from various different sites. I need suggestions on how to exclude this content.

At the moment, I am thinking of excluding content inside "menu" and "banner" classes from the HTML, as well as runs of consecutive words that look like proper names (capitalized first letter).
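
The second half of that idea (dropping runs of capitalized words, which is what a flattened menu such as "Home News Sports Contact" tends to look like) could be sketched in plain Java roughly as follows. This is only an illustration of the heuristic as described, not a recommended solution; the class name `CapitalizedRunFilter`, the method name and the `minRun` parameter are made up for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes runs of consecutive capitalized words from plain text.
final class CapitalizedRunFilter {
    // A "capitalized word": one uppercase letter followed by any number of letters.
    private static final String CAP_WORD = "\\p{Lu}\\p{L}*";

    /**
     * Removes every run of at least minRun capitalized words separated only by
     * whitespace, then collapses the remaining whitespace.
     */
    static String stripCapitalizedRuns(String text, int minRun) {
        if (minRun < 2) {
            throw new IllegalArgumentException("minRun must be at least 2");
        }
        // Lookbehind/lookahead keep the match from starting or ending inside a word.
        Pattern run = Pattern.compile(
                "(?<![\\p{L}\\p{N}])" + CAP_WORD
                + "(?:\\s+" + CAP_WORD + "){" + (minRun - 1) + ",}"
                + "(?![\\p{L}\\p{N}])");
        String stripped = run.matcher(text).replaceAll(" ");
        return stripped.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String text = "Home News Sports Contact\nthis page explains boilerpipe.";
        System.out.println(stripCapitalizedRuns(text, 3));
        // prints: this page explains boilerpipe.
    }
}
```

Note the obvious weakness: real names and headlines in title case ("The New York Times") are removed as well, and a sentence that starts right after a menu loses its first capitalized word, so this would need to be combined with other signals.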

The solution can be based on the text content (without HTML tags) or on the HTML content (with the HTML tags).

Edit: I want to do this inside my Java code, not in an external application (if that is possible).

I tried the approach to parsing the HTML content described in this question: https://stackoverflow.com/questions/7035150/how-to-traverse-the-dom-tree-using-jsoup-doing-some-content-filtering
Solution
Take a look at Boilerpipe. It is designed to do exactly what you're looking for: remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

There are a few ways to feed HTML into Boilerpipe and extract the text.

You can use a URL:
ArticleExtractor.INSTANCE.getText(url);
You can use a String:
ArticleExtractor.INSTANCE.getText(myHtml);
You can also use a Reader, which opens up a large number of possibilities.
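
Regarding the update in the question (keeping short teaser blocks separately instead of getting one merged article text), one library-free direction is to split the page into blocks yourself and score each block by how much of its text sits inside links. The sketch below is only a rough illustration of that idea using regular expressions; it is not Boilerpipe's algorithm. The class `TextBlockSplitter`, the list of block-level tags and the 0.5 threshold are all assumptions made for the example. Regex-based HTML handling is brittle (entities are not decoded, malformed markup is not handled), so a real parser such as jsoup would be a better foundation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits HTML into text blocks and scores each by link density.
final class TextBlockSplitter {
    static final class Block {
        final String text;         // plain text of the block
        final double linkDensity;  // share of the block's words that are inside <a> tags
        Block(String text, double linkDensity) {
            this.text = text;
            this.linkDensity = linkDensity;
        }
    }

    // Scripts, styles and comments never contain visible text.
    private static final Pattern NON_TEXT =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>|<!--.*?-->");
    // Assumed set of tags that start or end a visual block.
    private static final Pattern BLOCK_TAG = Pattern.compile(
            "(?i)</?(?:p|div|li|ul|ol|h[1-6]|td|tr|table|br|section|article|header|footer|nav)\\b[^>]*>");
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static List<Block> split(String html) {
        String cleaned = NON_TEXT.matcher(html).replaceAll(" ");
        List<Block> blocks = new ArrayList<>();
        for (String fragment : BLOCK_TAG.split(cleaned)) {
            String text = plainText(fragment);
            int words = countWords(text);
            if (words == 0) {
                continue; // nothing visible between these two block tags
            }
            int linkWords = 0;
            Matcher m = ANCHOR.matcher(fragment);
            while (m.find()) {
                linkWords += countWords(plainText(m.group(1)));
            }
            blocks.add(new Block(text, (double) linkWords / words));
        }
        return blocks;
    }

    // Keeps the text of blocks whose link density does not exceed the threshold.
    static List<String> keepContent(List<Block> blocks, double maxLinkDensity) {
        List<String> kept = new ArrayList<>();
        for (Block b : blocks) {
            if (b.linkDensity <= maxLinkDensity) {
                kept.add(b.text);
            }
        }
        return kept;
    }

    private static String plainText(String fragment) {
        return ANY_TAG.matcher(fragment).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    private static int countWords(String s) {
        String t = s.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    public static void main(String[] args) {
        String html = "<div class=\"menu\"><a href=\"/\">Home</a> <a href=\"/news\">News</a></div>"
                + "<p>Boilerpipe removes boilerplate from pages.</p>"
                + "<p>Short teaser text <a href=\"/full\">read more</a></p>";
        for (String text : keepContent(split(html), 0.5)) {
            System.out.println(text);
        }
    }
}
```

With a threshold like this, a menu that consists only of links is dropped, while a teaser made of a short description plus a "read more" link survives as its own block. The right threshold depends on the sites involved and would need tuning.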
