Extract news article content from stored .html pages

Question

I am reading text from HTML files and doing some analysis. These .html files are news articles.

Code:

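 import nltk
 from unidecode import unidecode

 html = open(filepath, 'r').read()
 raw = nltk.clean_html(html)  # strips tags; removed in NLTK 3, so this needs NLTK 2.x
 raw = unidecode(raw.decode('utf8'))  # transliterate to ASCII (Python 2 byte string)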

Now I want only the article content, not the rest of the text such as advertisements and headings. How can I do this relatively accurately in Python?

I know of some tools such as Jsoup (a Java API) and boilerpipe, but I want to do this in Python. I found some techniques using bs4, but they are limited to one type of page, and I have news pages from numerous sources. There is also a dearth of sample code.

I am looking for something exactly like this, but in Python: http://www.psl.cs.columbia.edu/wp-content/uploads/2011/03/3463-WWWJ.pdf

To make this concrete, please write sample code that extracts the content of the following link: http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general

Recommended answer

There are libraries for this in Python too :)

Since you mentioned Java, there's a Python wrapper for boilerpipe that lets you use it directly inside a Python script: https://github.com/misja/python-boilerpipe

If you want to use pure-Python libraries, there are two options:

https://github.com/buriy/python-readability

https://github.com/grangier/python-goose
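
As a rough sketch of the first option: python-readability (installed from PyPI as readability-lxml) provides a Document class whose summary() returns the main content as HTML rather than plain text. The html_to_text helper and the extract_with_readability wrapper below are hypothetical names added for illustration, not part of the library; the helper strips the returned HTML down to text using only the standard library.

```python
from html.parser import HTMLParser

# Tags whose end marks a line break in the extracted text.
BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(fragment):
    """Reduce an HTML fragment to plain text, one line per block element."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    lines = (" ".join(line.split()) for line in "".join(collector.parts).splitlines())
    return "\n".join(line for line in lines if line)


def extract_with_readability(html):
    """Return (title, article_text) for one page using python-readability."""
    # Imported lazily so html_to_text stays usable without the third-party package.
    from readability import Document  # pip install readability-lxml

    doc = Document(html)
    return doc.title(), html_to_text(doc.summary())
```

Calling extract_with_readability(html) on the contents of a stored page should give the title and the article body as text; how well it copes with any particular news site is something to check on your own pages.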

Of the two, I prefer Goose. Be aware, however, that recent versions of it sometimes fail to extract text for some reason (my recommendation is to use version 1.0.22 for now).

Here's sample code using Goose:

from goose import Goose
from requests import get

response = get('http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text
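
The sample above fetches a live page, whereas the question is about .html files already stored on disk. A minimal sketch of that case follows, written for Python 3 and assuming the pages sit in one directory. The function name extract_stored_articles and its extract_text parameter are hypothetical; the extractor is passed in as a callable so the loop does not depend on any one library.

```python
from pathlib import Path


def extract_stored_articles(directory, extract_text, pattern="*.html"):
    """Map each stored HTML file name to the text returned by extract_text."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        # Decode leniently: stored pages from many sources rarely share one
        # clean encoding, and a stray byte should not abort the run.
        html = path.read_bytes().decode("utf-8", errors="replace")
        try:
            results[path.name] = extract_text(html)
        except Exception as exc:
            # One page the extractor cannot handle should not stop the batch.
            print(f"skipped {path.name}: {exc}")
    return results
```

With Goose, extract_text could be something like `lambda html: extractor.extract(raw_html=html).cleaned_text`, reusing the extractor object and the raw_html call from the sample above.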
