Extract news article content from stored .html pages

Question

I am reading text from HTML files and doing some analysis on it. These .html files are news articles.

Code:

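 import nltk
 from unidecode import unidecode

 html = open(filepath, 'r').read()
 raw = nltk.clean_html(html)  # works in NLTK 2.x; in NLTK 3 this call raises NotImplementedError
 raw = unidecode(raw.decode('utf8'))  # Python 2: decode the bytes, then transliterate to ASCII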

Now I want only the article content, not the rest of the text such as advertisements, headings, etc. How can I do this relatively accurately in Python?

I know of some tools such as Jsoup (a Java API) and boilerpipe, but I want to do this in Python. I found some techniques using bs4, but they are limited to one type of page, and I have news pages from numerous sources. There is also a dearth of sample code.

I am looking for something exactly like this, but in Python: http://www.psl.cs.columbia.edu/wp-content/uploads/2011/03/3463-WWWJ.pdf

To make this concrete, please write sample code that extracts the content of the following link: http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general

Recommended answer

There are libraries for this in Python too :)

Since you mentioned Java, there is a Python wrapper for boilerpipe that lets you use it directly inside a Python script: https://github.com/misja/python-boilerpipe

If you want to use pure-Python libraries, there are two options:

https://github.com/buriy/python-readability

https://github.com/grangier/python-goose

Of the two, I prefer Goose. However, be aware that recent versions sometimes fail to extract text for some reason (my recommendation is to use version 1.0.22 for now).
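For comparison, here is a minimal sketch of the other option, python-readability (published on PyPI as `readability-lxml`). Its `Document.summary()` returns the article as cleaned HTML rather than plain text, so the sketch adds a small standard-library tag stripper. `html_to_text` and `extract_with_readability` are hypothetical helper names introduced here, not part of the library, and the set of tags treated as block-level is an assumption:

```python
from html.parser import HTMLParser

# Assumption: closing one of these tags ends a line of visible text.
_BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextCollector(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "br":
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in _BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Strip tags from an HTML string and return its visible text."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return collector.text()


def extract_with_readability(html):
    """Return (title, article text) for one page of raw HTML."""
    # Imported here so html_to_text stays usable without readability-lxml installed.
    from readability import Document

    doc = Document(html)
    return doc.title(), html_to_text(doc.summary())
```

Calling `extract_with_readability(open(filepath).read())` on a saved page should give the title and body text, though how well it copes with any particular news site is something to check page by page.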

Here is sample code using Goose:

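from goose import Goose
from requests import get

# Download the page; response.content is the raw HTML
response = get('http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
extractor = Goose()
# Pass the markup via raw_html so Goose parses it directly
article = extractor.extract(raw_html=response.content)
# cleaned_text is the extracted article body
text = article.cleaned_text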
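Since the question is about pages already saved to disk, the same extractor can be fed file contents instead of a downloaded response. This is a minimal sketch, assuming the pages sit in one directory and end in `.html`. `extract_saved_pages` and its `extract_text` parameter are hypothetical names introduced here; any callable that maps raw HTML to text can be plugged in:

```python
from pathlib import Path


def extract_saved_pages(directory, extract_text, pattern="*.html"):
    """Run extract_text over every saved page in directory.

    extract_text is any callable that takes raw HTML as bytes and returns
    the article text, for example a thin wrapper around Goose.
    Returns a dict mapping file name to extracted text.
    """
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        raw_html = path.read_bytes()  # keep bytes; decoding is left to the extractor
        results[path.name] = extract_text(raw_html)
    return results


# Usage with the Goose call from the answer (assumes goose is installed;
# "saved_pages" is a hypothetical directory name):
#   from goose import Goose
#   goose = Goose()
#   texts = extract_saved_pages(
#       "saved_pages",
#       lambda raw: goose.extract(raw_html=raw).cleaned_text,
#   )
```

Keeping the extractor as a parameter makes it easy to compare Goose, python-readability, and boilerpipe on the same set of files.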
