Creating a Blog Summary in Python?


Question

Is there a good library (or some regex magic) that can convert a blog entry into a blog summary? I'd like the summary to show the first four sentences, the first paragraph, or the first X characters; I'm not sure which would work best. Ideally it would keep HTML formatting tags such as <a>, <b>, <u> and <i>, and strip all other HTML tags, JavaScript and CSS.

More specifically, the input is an HTML string representing an entire blog post. The output should be an HTML string containing the first few sentences, the first paragraph, or the first X characters, with all potentially unsafe HTML tags removed. In Python, please.

Recommended answer

If you're working with the HTML, you'll need to parse it. In addition to the aforementioned BeautifulSoup, lxml.html has some nice HTML handling tools.
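
To make the parsing approach concrete, here is a minimal sketch. It uses the standard library's html.parser instead of BeautifulSoup or lxml.html so that it runs without extra dependencies; the same idea carries over to either library. The names summarize_html, SummaryBuilder and max_chars are made up for illustration, and the allowlist simply mirrors the tags named in the question. It truncates by character count, so it may cut a word in half, and it should not be treated as a vetted HTML sanitizer.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"a", "b", "u", "i"}    # formatting tags the question wants to keep
SKIPPED_CONTENT = {"script", "style"}  # tags whose text content is dropped entirely


class SummaryBuilder(HTMLParser):
    """Collects a truncated, allowlisted copy of the HTML it is fed."""

    def __init__(self, max_chars):
        super().__init__(convert_charrefs=True)
        self.remaining = max_chars  # visible characters still allowed
        self.parts = []             # output fragments
        self.open_tags = []         # allowed tags opened but not yet closed
        self.skip_depth = 0         # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and self.remaining > 0 and not self.skip_depth:
            attr_text = ""
            if tag == "a":
                # Keep only href, and only for http(s) links; drop onclick etc.
                href = dict(attrs).get("href") or ""
                if href.startswith(("http://", "https://")):
                    attr_text = ' href="%s"' % escape(href, quote=True)
            self.parts.append("<%s%s>" % (tag, attr_text))
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in SKIPPED_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.open_tags:
            # Close back to the matching tag so the output stays well nested.
            while self.open_tags:
                closed = self.open_tags.pop()
                self.parts.append("</%s>" % closed)
                if closed == tag:
                    break

    def handle_data(self, data):
        if self.skip_depth or self.remaining <= 0:
            return
        chunk = data[: self.remaining]
        self.remaining -= len(chunk)
        self.parts.append(escape(chunk, quote=False))

    def result(self):
        # Close anything still open, innermost first.
        closing = "".join("</%s>" % t for t in reversed(self.open_tags))
        return "".join(self.parts) + closing


def summarize_html(html, max_chars=200):
    """Return the first max_chars text characters, keeping only allowed tags."""
    builder = SummaryBuilder(max_chars)
    builder.feed(html)
    builder.close()
    return builder.result()
```

Calling summarize_html(post_html, max_chars=300) returns an HTML string in which every tag outside the allowlist is dropped, script and style content is removed, and any tag left open at the cut-off point is closed. Paragraph boundaries are lost because <p> is not on the allowlist; cutting at a sentence or paragraph boundary instead would need extra logic on top of this.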

However, if it's a blog, you may find it even easier to work with RSS/Atom feeds. Feedparser is fantastic and would make this easy. You'd gain compatibility and durability (RSS is more strictly defined, so things will change less), but if the feed doesn't include what you need, it won't help you.
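
The feed route can be sketched in a few lines as well. The example below is only an illustration under stated assumptions: it reads an inline, hand-written RSS 2.0 sample with the standard library's xml.etree.ElementTree as a stand-in, whereas Feedparser, which the answer recommends, copes with Atom, malformed XML and other real-world feed quirks far better. The names SAMPLE_RSS and feed_summaries are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample feed (an assumption for illustration, not a real blog).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Short summary of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>
"""


def feed_summaries(rss_text):
    """Return one dict per <item>; 'summary' is None when the feed omits it."""
    root = ET.fromstring(rss_text)
    summaries = []
    for item in root.iter("item"):
        summaries.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "summary": item.findtext("description"),  # None if absent
        })
    return summaries
```

The second item in the sample has no description, which shows the caveat from the answer: when the feed does not carry a summary, this route gives you nothing and you are back to parsing the post's HTML. ElementTree is also not hardened against malicious XML, so this sketch is not suitable for untrusted feeds as written.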
