Extracting pure content / text from HTML pages by excluding navigation and chrome content


Problem description

I am crawling news websites and want to extract the news title, the news abstract (first paragraph), and so on.

I plugged into the WebKit parser code so that I can easily navigate a web page as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for this). Then I run a diff algorithm that compares the text of various articles from the same website, so that text common to those articles is eliminated. This gives me the content minus the common navigation content, etc.
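The diff step described above could look roughly like the sketch below. It is only an illustration, not the asker's actual code: it assumes the tag-stripped text of two pages from the same site is already available as plain strings, compares them line by line with Python's standard `difflib`, and drops every line of the target page that also appears in the reference page. The function name `strip_shared_lines` and the sample pages are hypothetical.

```python
from difflib import SequenceMatcher


def strip_shared_lines(target: str, reference: str) -> str:
    """Remove lines of `target` that the diff matches against `reference`."""
    # Work on non-empty, whitespace-trimmed lines of each page's text.
    target_lines = [ln.strip() for ln in target.splitlines() if ln.strip()]
    reference_lines = [ln.strip() for ln in reference.splitlines() if ln.strip()]

    # autojunk=False keeps difflib from ignoring frequently repeated lines.
    matcher = SequenceMatcher(a=target_lines, b=reference_lines, autojunk=False)

    # Collect the indices of target lines that fall inside a matching block.
    shared = set()
    for block in matcher.get_matching_blocks():
        shared.update(range(block.a, block.a + block.size))

    return "\n".join(ln for i, ln in enumerate(target_lines) if i not in shared)


# Hypothetical tag-stripped text of two articles from the same site.
page_a = """Home
Sports
Mayor opens new bridge
The bridge spans the river.
Copyright 2010"""

page_b = """Home
Sports
Local team wins final
Fans celebrated downtown.
Copyright 2010"""

cleaned = strip_shared_lines(page_a, page_b)
print(cleaned)
```

Comparing against a single reference page only removes text that is identical on both pages, so junk that varies from page to page (related-article links, timestamps, comment counts) survives this filter.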

Despite the above approach, I am still getting quite a bit of junk in my final text. This results in an incorrect news abstract being extracted. The error rate is 5 in 10 articles, i.e. 50%. Error as in

Could you:

  1. Suggest an alternative strategy for extracting pure content?
  2. Would learning Natural Language Processing help in extracting the correct abstract from these articles?
  3. How would you approach the above problem?
  4. Are there any research papers on this topic?

Regards,

Ankur Gupta

Recommended answer

For question (1), I am not sure. I haven't done this before. Maybe one of the other answers will help.

For question (2), automatic creation of abstracts is not a well-developed field. It is usually referred to as "sentence selection", because the typical approach right now is to just select entire sentences from the original text.

For question (3), the basic way to create abstracts with machine learning would be to:

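  1. Create a corpus of existing abstracts.
  2. Annotate the abstracts in a useful way. For example, you may want to indicate whether each sentence in the original text was selected, and why (or why not).
  3. Train some kind of classifier on the corpus, then use it to classify the sentences in new articles.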
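As a rough illustration of the three steps above, the sketch below trains a tiny bag-of-words naive Bayes classifier on sentences labelled as selected (`True`) or not selected (`False`) for an abstract. Naive Bayes is just one possible classifier, chosen here because it fits in a few lines of standard-library Python. The class name `NaiveBayesSentenceSelector` and the six training sentences are made up for the example; a real corpus would need far more annotated data.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


class NaiveBayesSentenceSelector:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocabulary = set()
        for counts in self.word_counts.values():
            self.vocabulary.update(counts)
        return self

    def log_score(self, sentence, label):
        """Log prior plus smoothed log likelihood of the known words."""
        total_words = sum(self.word_counts[label].values())
        denominator = total_words + len(self.vocabulary)
        prior = self.class_counts[label] / sum(self.class_counts.values())
        score = math.log(prior)
        for word in tokenize(sentence):
            if word in self.vocabulary:  # words never seen in training are skipped
                score += math.log((self.word_counts[label][word] + 1) / denominator)
        return score

    def predict(self, sentence):
        """Return the label with the highest score for this sentence."""
        return max(self.class_counts, key=lambda label: self.log_score(sentence, label))


# Steps 1 and 2: a toy annotated corpus (hypothetical data).
# True means the sentence was selected for the abstract.
corpus = [
    ("The mayor opened the new bridge on Monday.", True),
    ("The council approved the budget on Tuesday.", True),
    ("Officials announced the election results today.", True),
    ("Click here to subscribe to our newsletter.", False),
    ("Share this article with your friends.", False),
    ("Read more stories in our sports section.", False),
]

# Step 3: train, then classify sentences from a new article.
selector = NaiveBayesSentenceSelector().fit(
    [sentence for sentence, _ in corpus],
    [label for _, label in corpus],
)
print(selector.predict("The council announced the new budget."))
print(selector.predict("Subscribe to our newsletter here."))
```

Word counts are only one possible feature set. The position of a sentence in the article or its length could be added as further features, which matters here because the asker is mainly after the first paragraph.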

My favourite reference on machine learning is Tom Mitchell's Machine Learning. It lists a number of ways to implement step (3).

For question (4), I am sure there are a few papers, because my advisor mentioned them last year, but I do not know where to start since I'm not an expert in the field.
