Identifying a Page's Primary Content

Views: 114 Published: 2020/5/25 1:39:54 Tags: parsing, semantics

Problem description

Given an HTML page that is a text-heavy article, I would like to identify and parse out the primary content.

Using http://www.fivethirtyeight.com/2009/08/chavismo-obama-and-monroe-doctrine.html as an example, I want to identify div#post-4438372351887392855, which contains the title and article.

I know nothing can be perfect or work 100% of the time, but is there an approach that can give me the desired result in a reasonable number of circumstances?

My present thought is to iterate through each div, strip out the markup, and then find the innermost div that contains the most text.
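As a rough illustration of that idea, here is a minimal sketch using only Python's standard-library `html.parser`. It rests on one assumption about what "innermost div that contains the most text" means: each run of text is credited to its nearest enclosing div only, so an outer wrapper does not win just because it contains everything. The names `DivTextScorer` and `find_primary_div` are made up for this example, and real pages would likely need more than a raw character count.

```python
from html.parser import HTMLParser

# Text inside these elements is not visible article content, so it is ignored.
SKIP_TEXT_IN = {"script", "style"}


class DivTextScorer(HTMLParser):
    """Counts, for every div, the text whose nearest enclosing div is that div."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.divs = []        # one record per div, in document order
        self.div_stack = []   # indices into self.divs for currently open divs
        self.skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.divs.append({"id": dict(attrs).get("id"), "chars": 0})
            self.div_stack.append(len(self.divs) - 1)
        elif tag in SKIP_TEXT_IN:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.div_stack:
            self.div_stack.pop()
        elif tag in SKIP_TEXT_IN and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Credit the text to the innermost open div only (assumed heuristic).
        if self.div_stack and not self.skip_depth:
            self.divs[self.div_stack[-1]]["chars"] += len(data.strip())


def find_primary_div(html):
    """Return the record of the div owning the most text, or None if no divs."""
    parser = DivTextScorer()
    parser.feed(html)
    parser.close()
    if not parser.divs:
        return None
    return max(parser.divs, key=lambda d: d["chars"])


if __name__ == "__main__":
    page = (
        '<div id="wrapper"><div id="nav"><a href="/">Home</a></div>'
        '<div id="post"><h1>Title</h1><p>Long article body text here.</p></div>'
        "</div>"
    )
    print(find_primary_div(page))
```

Crediting text only to the nearest div is what keeps a page-wide wrapper from always scoring highest. A page that splits its article across many small sibling divs would defeat this, so it is a starting point rather than a general solution.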

At this point, I'm just getting started, so I'm looking for input I can put towards a conceptual approach. Or, if something is out there, an open source library would be nice.

Thanks in advance for your insight.
