What algorithm does Readability use for extracting text from URLs?


Question

For a while, I've been trying to find a way of intelligently extracting the "relevant" text from a URL by eliminating the text related to ads and all the other clutter. After several months of research, I gave it up as a problem that cannot be accurately determined. (I've tried different ways, but none were reliable.)

A week back, I stumbled across Readability - a plugin that converts any URL into readable text. It looks pretty accurate to me. My guess is that they somehow have an algorithm that's smart enough to extract the relevant text.

Does anyone know how they do it? Or how I could do it reliably?

Answer

Readability mainly consists of heuristics that "just somehow work well" in many cases.

I have written some research papers about this topic, and I would like to explain the background of why it is easy to come up with a solution that works well, and when it gets hard to get close to 100% accuracy.

There seems to be a linguistic law underlying human language that is also (but not exclusively) manifest in Web page content, and it already quite clearly separates two types of text (full-text vs. non-full-text or, roughly, "main content" vs. "boilerplate").

To get the main content from HTML, it is in many cases sufficient to keep only the HTML text elements (i.e. blocks of text that are not interrupted by markup) which have more than about 10 words. It appears that humans choose from two types of text ("short" and "long", measured by the number of words they emit) for two different motivations of writing text. I would call them "navigational" and "informational" motivations.
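
As a rough illustration of that word-count heuristic (not Readability's actual code), here is a minimal sketch in Java using the jsoup parser; the 10-word threshold and the set of block-level tags are assumptions for the example:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class WordCountHeuristic {

    // Threshold is an assumption; the answer says "more than about 10 words".
    private static final int MIN_WORDS = 10;

    public static String extractMainText(String html) {
        Document doc = Jsoup.parse(html);
        StringBuilder content = new StringBuilder();

        // Treat common block-level elements as markup-delimited text blocks.
        for (Element block : doc.select("p, div, td, li")) {
            String text = block.ownText().trim();
            if (text.isEmpty()) {
                continue;
            }
            int wordCount = text.split("\\s+").length;
            // Keep only "long" (informational) blocks; drop "short" (navigational) ones.
            if (wordCount > MIN_WORDS) {
                content.append(text).append("\n\n");
            }
        }
        return content.toString();
    }
}
```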

If an author wants you to quickly get what is written, he/she uses "navigational" text, i.e. few words (like "STOP", "Read this", "Click here"). This is the most prominent type of text in navigational elements (menus etc.).

If an author wants you to deeply understand what he/she means, he/she uses many words. This way, ambiguity is removed at the cost of an increase in redundancy. Article-like content usually falls into this class as it has more than only a few words.

While this separation seems to work in a plethora of cases, it is getting tricky with headlines, short sentences, disclaimers, copyright footers etc.

There are more sophisticated strategies and features that help separate main content from boilerplate. Examples include the link density (the number of words in a block that are linked versus the overall number of words in the block), the features of the previous/next blocks, the frequency of a particular block's text on the "whole" Web, the DOM structure of the HTML document, the visual image of the page, etc.
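
For instance, the link density feature mentioned above could be computed along these lines (again a sketch using jsoup, not code taken from any particular library):

```java
import org.jsoup.nodes.Element;

public class LinkDensityFeature {

    /**
     * Link density = words inside <a> tags divided by all words in the block.
     * Returns 0 for an empty block.
     */
    public static double linkDensity(Element block) {
        int totalWords = countWords(block.text());
        if (totalWords == 0) {
            return 0.0;
        }
        int linkedWords = countWords(block.select("a").text());
        return (double) linkedWords / totalWords;
    }

    private static int countWords(String s) {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}
```

A navigational block such as a menu tends to have a link density close to 1.0, while an article paragraph is usually close to 0, which is what makes this feature a strong discriminator.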

You can read my latest article "Boilerplate Detection using Shallow Text Features" to get some insight from a theoretical perspective. You may also watch the video of my paper presentation on VideoLectures.net.

"Readability" uses some of these features. If you carefully watch the SVN changelog, you will see that the number of strategies varied over time, and so did the extraction quality of Readability. For example, the introduction of link density in December 2009 helped improve it very much.

In my opinion, it therefore makes no sense to say "Readability does it like that" without mentioning the exact version number.

I have published an Open Source HTML content extraction library called boilerpipe, which provides several different extraction strategies. Depending on the use case, one extractor or the other works better. You can try these extractors on pages of your choice using the companion boilerpipe-web app on Google AppEngine.
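
For reference, typical boilerpipe usage in Java looks roughly like this (a sketch based on the library's documented entry points; the URL is a placeholder):

```java
import java.net.URL;

import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.extractors.DefaultExtractor;

public class BoilerpipeDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder URL -- substitute any article page you want to test.
        URL url = new URL("https://example.com/some-article.html");

        // ArticleExtractor is tuned for news-article-like pages;
        // DefaultExtractor is a more generic strategy.
        String articleText = ArticleExtractor.INSTANCE.getText(url);
        String genericText = DefaultExtractor.INSTANCE.getText(url);

        System.out.println(articleText);
    }
}
```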

To let numbers speak, see the "Benchmarks" page on the boilerpipe wiki which compares some extraction strategies, including boilerpipe, Readability and Apple Safari.

I should mention that these algorithms assume that the main content is actually full text. There are cases where the "main content" is something else, e.g. an image, a table, a video etc. The algorithms won't work well for such cases.

Cheers,

Christian
