Extracting pure content / text from HTML Pages by excluding navigation and chrome content


Question



I am crawling news websites and want to extract the News Title, News Abstract (first paragraph), etc.

I plugged into the WebKit parser code to easily navigate a webpage as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for this). Then I run a diff algorithm comparing the text of various articles from the same website, which eliminates the text they share. This gives me the content minus the common navigation content, etc.
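The diff step described above can be sketched with Python's standard `difflib`: compare an article against a sibling page from the same site and keep only the lines that do not appear in both. The page strings and function name here are illustrative, not the asker's actual code.

```python
# Hypothetical sketch of the diff-based boilerplate removal described above.
# Lines shared between two articles from the same site (navigation, footer,
# etc.) are dropped; lines unique to the article are kept as candidate content.
import difflib

def extract_unique_lines(article_text: str, sibling_text: str) -> list[str]:
    """Keep only lines of article_text that do not also appear in sibling_text."""
    article_lines = article_text.splitlines()
    sibling_lines = sibling_text.splitlines()
    matcher = difflib.SequenceMatcher(a=sibling_lines, b=article_lines)
    unique = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # 'equal' blocks are text common to both pages -> likely site chrome
        if tag in ("insert", "replace"):
            unique.extend(article_lines[j1:j2])
    return [line for line in unique if line.strip()]

page_a = ("Site Nav\nHome | News | Sports\n"
          "Big storm hits coast\nFirst paragraph of storm story.\nCopyright 2009")
page_b = ("Site Nav\nHome | News | Sports\n"
          "Election results announced\nFirst paragraph of election story.\nCopyright 2009")
print(extract_unique_lines(page_a, page_b))
# -> ['Big storm hits coast', 'First paragraph of storm story.']
```

Note that this line-based diff only removes chrome that is byte-identical across pages, which is consistent with the junk the asker still sees: navigation that varies per page (e.g. "related articles" links) survives the diff.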

Despite the above approach I am still getting quite a lot of junk in my final text, which results in an incorrect News Abstract being extracted. The error rate is 5 in 10 articles, i.e. 50%.

Can you:

  1. Suggest an alternative strategy for extraction of pure content,

  2. Would/Can learning Natural Language Processing help in extracting a correct abstract from these articles?

  3. How would you approach the above problem?

  4. Are there any research papers on the same?

Regards

Ankur Gupta

Solution

For question (1), I am not sure. I haven't done this before. Maybe one of the other answers will help.

For question (2), automatic creation of abstracts is not a developed field. It is usually referred to as 'sentence selection', because the typical approach right now is to just select entire sentences.
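A minimal sketch of the 'sentence selection' idea mentioned above: score each whole sentence by the frequency of the words it contains, then keep the highest-scoring sentences, in original order, as the abstract. The scoring rule here is a toy heuristic of my own, not a method from the answer or any particular paper.

```python
# Toy extractive summarizer: select whole sentences (no rewriting),
# ranked by the average corpus frequency of their words.
import re
from collections import Counter

def select_sentences(text: str, k: int = 2) -> list[str]:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sent: str) -> float:
        # Average frequency of the sentence's words across the whole text.
        toks = re.findall(r"[a-z']+", sent.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:k])  # restore original document order
    return [sentences[i] for i in chosen]

text = "The cat sat. The cat ran fast today. Dogs bark loudly sometimes maybe."
print(select_sentences(text))
# -> ['The cat sat.', 'The cat ran fast today.']
```

Because entire sentences are selected verbatim, the output is always grammatical, which is exactly why sentence selection is the typical approach.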

For question (3), the basic way to create abstracts from machine learning would be to:

  1. Create a corpus of existing abstracts
  2. Annotate the abstracts in a useful way. For example, you'd probably want to indicate whether each sentence in the original was chosen and why (or why not).
  3. Train a classifier of some sort on the corpus, then use it to classify the sentences in new articles.
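The three steps above can be sketched end to end. The corpus below is a tiny hand-labelled stand-in for a real annotated corpus, and the hand-rolled Naive Bayes model is just one possible choice of classifier (Mitchell's book covers it and several alternatives); none of this is from the original answer.

```python
# Steps 1-3: a (hypothetical) labelled corpus of sentences marked as
# chosen/not-chosen for an abstract, a Naive Bayes classifier trained on it,
# and classification of sentences from a new article.
import math
import re
from collections import Counter, defaultdict

def tokenize(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

class NaiveBayesSentenceClassifier:
    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for sent, label in zip(sentences, labels):
            for tok in tokenize(sent):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, sentence):
        toks = tokenize(sentence)
        total = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for label, count in self.class_counts.items():
            lp = math.log(count / total)  # class prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in toks:
                # Laplace smoothing so unseen words don't zero out the score
                lp += math.log((self.word_counts[label][tok] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

# Step 1 + 2: tiny hand-labelled corpus (annotations are made up):
train = [
    ("The government announced a new policy today", "chosen"),
    ("Officials confirmed the decision this morning", "chosen"),
    ("Click here to subscribe to our newsletter", "not_chosen"),
    ("Share this article on social media", "not_chosen"),
]
# Step 3: train, then classify sentences from a new article.
clf = NaiveBayesSentenceClassifier().fit([s for s, _ in train], [l for _, l in train])
print(clf.predict("The minister announced the policy"))       # likely 'chosen'
print(clf.predict("Subscribe to our daily newsletter here"))  # likely 'not_chosen'
```

In practice the corpus would need to be far larger, and each sentence would carry richer features (position in article, length, overlap with the title) rather than bare words.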

My favourite reference on machine learning is Tom Mitchell's Machine Learning. It lists a number of ways to implement step (3).

For question (4), I am sure there are a few papers because my advisor mentioned it last year, but I do not know where to start since I'm not an expert in the field.
