Extracting pure content / text from HTML pages by excluding navigation and chrome content

Views: 21 Published: 2022/1/2 17:59:24 Tags: html artificial-intelligence nlp html-content-extraction text-extraction


Problem description

I am crawling news websites and want to extract the news title, the news abstract (first paragraph), etc.

I plugged into the WebKit parser code so that I can easily navigate a web page as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for this). Then I run a diff algorithm comparing the text of various articles from the same website, which eliminates the text they share. This gives me the content minus the common navigation content, etc.
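The diff step described above could look roughly like the following sketch. It is not the WebKit-based code from the question: it assumes the text of each page has already been extracted with one block per line, and it uses Python's standard `difflib` as a stand-in for whatever diff algorithm was actually used. The function name `remove_shared_lines` and the sample pages are hypothetical.

```python
import difflib


def remove_shared_lines(article: str, reference: str) -> list[str]:
    """Return the lines of `article` that the diff does not match in `reference`.

    Both arguments are plain-text versions of two pages from the same site.
    Lines matched by the diff are treated as shared navigation/chrome.
    """
    a = [line.strip() for line in article.splitlines() if line.strip()]
    b = [line.strip() for line in reference.splitlines() if line.strip()]

    # Compare the two pages line by line; autojunk is disabled so that
    # frequently repeated lines are not silently ignored on long pages.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)

    shared = set()
    for block in matcher.get_matching_blocks():
        # Each block marks a run of identical lines: a[block.a : block.a + block.size]
        shared.update(range(block.a, block.a + block.size))

    return [line for i, line in enumerate(a) if i not in shared]


if __name__ == "__main__":
    # Hypothetical sample pages, for illustration only.
    page_1 = "Home\nSports\nTitle A\nFirst paragraph of story A\nCopyright"
    page_2 = "Home\nSports\nTitle B\nFirst paragraph of story B\nCopyright"
    print(remove_shared_lines(page_1, page_2))
```

A sketch like this only drops text that is identical across the compared pages. Junk that varies from page to page (for example, lists of related headlines) would survive it, which may be one source of the leftover junk described next.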

Despite the above approach, I am still getting quite a lot of junk in my final text. This results in an incorrect news abstract being extracted. The error rate is 5 in 10 articles, i.e. 50%. Error as in

Can you

Suggest an alternative strategy for extracting pure content,

Would/can learning Natural Language Processing help in extracting the correct abstract from these articles?

How would you approach the above problem?

Are there any research papers on the same?

Regards

Ankur Gupta
