Extract news article content from stored .html pages

Views: 199 · Published: 2020/9/20 6:11:59 · Tags: python, urllib2, bs4

Problem description

I am reading text from HTML files and doing some analysis. These .html files are news articles.

Code:

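 import nltk
 from unidecode import unidecode

 html = open(filepath, 'r').read()
 raw = nltk.clean_html(html)  # NLTK 2 only; raises NotImplementedError in NLTK 3
 raw = unidecode(raw.decode('utf8'))  # transliterate to ASCII (Python 2 str)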

Now I want only the article content, not the rest of the text such as advertisements and headings. How can I do this relatively accurately in Python?

I know of some tools such as Jsoup (a Java API) and boilerpipe, but I want to do this in Python. I found some techniques using bs4, but they are limited to one type of page, and I have news pages from numerous sources. There is also a dearth of sample code.

I am looking for something exactly like http://www.psl.cs.columbia.edu/wp-content/uploads/2011/03/3463-WWWJ.pdf, but in Python.

To make this concrete, please write sample code that extracts the content of the following link: http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general

Recommended answer

There are libraries for this in Python too :)

Since you mentioned Java, there's a Python wrapper for boilerpipe that lets you use it directly inside a Python script: https://github.com/misja/python-boilerpipe

If you want to use pure-Python libraries, there are two options:

https://github.com/buriy/python-readability

and

https://github.com/grangier/python-goose

Of the two, I prefer Goose. Be aware, however, that recent versions of it sometimes fail to extract text for some reason (my recommendation is to use version 1.0.22 for now).

Here's sample code using Goose:

from goose import Goose
from requests import get

# Download the page, then hand the raw HTML to Goose
response = get('http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
# cleaned_text holds the extracted article body
text = article.cleaned_text
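
Since the question is about .html files already stored on disk, the same call can be fed from files instead of a download. The following is only a rough sketch under a few assumptions: `extract_stored` is a hypothetical helper name, the file names in the usage comment are placeholders, the files are assumed to be UTF-8 (undecodable bytes are replaced), and the extractor is assumed to expose the interface used in the sample above, that is `extract(raw_html=...)` returning an object with a `cleaned_text` attribute.

```python
import io


def extract_stored(paths, extractor):
    """Return a dict mapping each stored .html path to its extracted text.

    `extractor` is any object offering the interface used in the answer's
    sample: extract(raw_html=...) -> object with a .cleaned_text attribute.
    (Hypothetical helper, not part of Goose itself.)
    """
    texts = {}
    for path in paths:
        # Assumption: files are UTF-8; undecodable bytes are replaced
        with io.open(path, encoding="utf-8", errors="replace") as f:
            html = f.read()
        article = extractor.extract(raw_html=html)
        texts[path] = article.cleaned_text
    return texts


# Usage (requires Goose to be installed; file names are placeholders):
# from goose import Goose
# texts = extract_stored(["article1.html", "article2.html"], Goose())
```

Passing the extractor in as an argument means one Goose instance is reused for every file, and it leaves room to try a different library behind a thin adapter if Goose fails on some sources.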
