Create Great Parser - Extract Relevant Text From HTML/Blogs


Problem description


I'm trying to create a generalized HTML parser that works well on blog posts. I want to point my parser at a specific entry's URL and get back the clean text of the post itself. My basic approach (from Python) has been to use a combination of BeautifulSoup and urllib2, which is okay, but it assumes you know the proper tags for the blog entry. Does anyone have any better ideas?

Here are some thoughts that someone might be able to expand on; I don't yet have the knowledge/know-how to implement them.

  1. The unix program 'lynx' seems to parse blog posts especially well - what parser do they use, or how could this be utilized?

  2. Are there any services/parsers that automatically remove junk ads, etc?

  3. I have a vague notion that it may be a reasonable assumption that blog posts are usually contained in a certain defining tag with class="entry" or something similar. Thus, it may be possible to create an algorithm that finds the enclosing tags with the most clean text between them - any ideas on this?
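
To make the third idea concrete, here is a rough sketch of what "find the enclosing tag with the most clean text" could look like. It uses only Python's standard `html.parser` rather than BeautifulSoup, and everything in it (the `DensestBlockFinder` class, the `extract_main_text` function, and the `CONTAINERS`/`SKIP` tag sets) is a hypothetical name or an assumption made for illustration, not a tested solution. Text is credited to its nearest enclosing container, and link, script, and style text is ignored.

```python
from html.parser import HTMLParser

# Assumed tag sets; tune them for the blogs you actually parse.
CONTAINERS = {"div", "article", "section", "td", "main", "body"}
SKIP = {"script", "style", "a"}
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class DensestBlockFinder(HTMLParser):
    """Credits each text chunk to its nearest enclosing container tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements as (tag, block index or None)
        self.blocks = []  # one entry per container element seen

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # void elements never get a closing tag
        if tag in CONTAINERS:
            self.blocks.append({"tag": tag, "attrs": dict(attrs), "chunks": []})
            self.stack.append((tag, len(self.blocks) - 1))
        else:
            self.stack.append((tag, None))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text or any(tag in SKIP for tag, _ in self.stack):
            return
        for _, index in reversed(self.stack):
            if index is not None:
                self.blocks[index]["chunks"].append(text)
                break


def extract_main_text(html):
    """Return the text of the container holding the most non-link text."""
    finder = DensestBlockFinder()
    finder.feed(html)
    finder.close()
    if not finder.blocks:
        return ""
    best = max(finder.blocks, key=lambda b: sum(len(c) for c in b["chunks"]))
    return " ".join(best["chunks"])
```

This heuristic is easy to fool (for example by a long comment thread or a large sidebar), which is why weighting by class names and link density is the natural next step.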

Thanks!

Solution

Boy, do I have the perfect solution for you.

Arc90's readability algorithm does exactly this. Given HTML content, it picks out the content of the main blog post text, ignoring headers, footers, navigation, etc.
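
To give a feel for how this kind of scoring works, here is a minimal Python sketch that is only loosely inspired by the general approach: score each paragraph, credit the score to its parent and grandparent, adjust by class/id hints, and discount link-heavy blocks. It is not Arc90's actual code. The names `Node`, `TreeBuilder`, `class_weight`, and `extract_article`, the regexes, and all the constants are assumptions chosen for illustration, and it uses only the standard library.

```python
import re
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}
# Assumed hint patterns; real ports use longer, tuned lists.
POSITIVE = re.compile(r"article|body|content|entry|main|post|text", re.I)
NEGATIVE = re.compile(r"comment|footer|header|menu|nav|sidebar|sponsor", re.I)


class Node:
    def __init__(self, tag, attrs, parent):
        self.tag, self.attrs, self.parent = tag, attrs, parent
        self.children = []  # child Nodes and raw text strings

    def text(self, only_links=False, in_link=False):
        """All text below this node, or only the text inside <a> tags."""
        in_link = in_link or self.tag == "a"
        parts = []
        for child in self.children:
            if isinstance(child, Node):
                if child.tag not in ("script", "style"):
                    parts.append(child.text(only_links, in_link))
            elif in_link or not only_links:
                parts.append(child)
        return "".join(parts)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root", {}, None)
        self.current = self.root
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        node = Node(tag, dict(attrs), self.current)
        self.current.children.append(node)
        self.current = node
        if tag == "p":
            self.paragraphs.append(node)

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # stray closing tags are ignored
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def clean(text):
    return " ".join(text.split())


def class_weight(node):
    hint = f"{node.attrs.get('class') or ''} {node.attrs.get('id') or ''}"
    weight = 0
    if NEGATIVE.search(hint):
        weight -= 25
    if POSITIVE.search(hint):
        weight += 25
    return weight


def inside(node, ancestor):
    while node is not None:
        if node is ancestor:
            return True
        node = node.parent
    return False


def extract_article(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    scores = {}
    for p in builder.paragraphs:
        text = clean(p.text())
        if len(text) < 25:
            continue  # too short to be evidence of article content
        points = 1 + text.count(",") + min(len(text) // 100, 3)
        # Parent gets the full score, grandparent gets half.
        for ancestor, share in ((p.parent, 1.0), (p.parent.parent, 0.5)):
            if ancestor is None or ancestor.tag == "#root":
                continue
            if ancestor not in scores:
                scores[ancestor] = class_weight(ancestor)
            scores[ancestor] += points * share
    if not scores:
        return ""

    def final_score(node):
        total = len(clean(node.text())) or 1
        links = len(clean(node.text(only_links=True)))
        return scores[node] * (1 - links / total)  # discount link-heavy blocks

    best = max(scores, key=final_score)
    return "\n\n".join(clean(p.text()) for p in builder.paragraphs
                       if inside(p, best))
```

Calling `extract_article(html_string)` returns the paragraphs of the highest-scoring block joined by blank lines. A maintained port of the algorithm will handle far more edge cases (unclosed tags, `<br>`-separated text, nested layouts) than this sketch does.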

Here are implementations in:

I've also released a Perl port to CPAN.

Hope this helps!
