Extract news article content from stored .html pages
Question
I am reading text from html files and doing some analysis. These .html files are news articles.
Code:
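import nltk
from unidecode import unidecode

html = open(filepath, 'r').read()
raw = nltk.clean_html(html)  # strips the HTML tags (NLTK 2.x only)
raw = unidecode(raw.decode('utf8'))  # transliterate to ASCII (Python 2 byte string)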
Now I want just the article content, not the rest of the text such as advertisements, headings, etc. How can I do this relatively accurately in Python?
I know of some tools such as Jsoup (a Java API) and boilerpipe, but I want to do this in Python. I found some techniques using bs4, but they are limited to one type of page, and I have news pages from numerous sources. There is also a dearth of sample code.
I am looking for something exactly like this, but in Python: http://www.psl.cs.columbia.edu/wp-content/uploads/2011/03/3463-WWWJ.pdf
To help me understand, please write sample code that extracts the content of the following link: http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general
Recommended answer
There are libraries for this in Python too :)
Since you mentioned Java, there is a Python wrapper for boilerpipe that lets you use it directly inside a Python script: https://github.com/misja/python-boilerpipe
If you want to use pure-Python libraries, there are two options:
https://github.com/buriy/python-readability
and
https://github.com/grangier/python-goose
Of the two, I prefer Goose. Be aware, however, that recent versions of it sometimes fail to extract the text for some reason (my recommendation is to use version 1.0.22 for now).
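For comparison, here is a minimal sketch of the other pure-Python option, python-readability. It assumes the package is installed under its PyPI name readability-lxml and that the page has already been read into a string. The helper names `html_to_text` and `readability_text` are made up for this illustration; only `Document(...).summary()` comes from the library.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Return the text of an HTML fragment with whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    # Join text nodes with spaces, then collapse runs of whitespace.
    return ' '.join(' '.join(collector.parts).split())


def readability_text(html):
    """Extract the main article text from a full HTML page (a string)."""
    # Imported lazily: requires the third-party readability-lxml package.
    from readability import Document
    summary_html = Document(html).summary()  # HTML of the main content block
    return html_to_text(summary_html)
```

Since `summary()` returns HTML rather than plain text, the tags still have to be stripped afterwards. The standard-library parser above is one simple way to do that; it separates every text node with a space, which may be too crude for some pages.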
Here is sample code using Goose:
from goose import Goose
from requests import get

# Download the page, then hand the raw HTML to Goose
response = get('http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text  # the extracted article body as plain text
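Since the question is about pages already stored on disk, the same `extract(raw_html=...)` call can be applied to files instead of a downloaded response. The following is only a sketch: the function names `extract_stored_articles` and `goose_text` and the folder layout (a single directory of `.html` files) are assumptions made for illustration, not part of Goose.

```python
import glob
import os


def extract_stored_articles(folder, extract_text):
    """Return {file name: extracted text} for every .html file in `folder`.

    `extract_text` is any callable that takes raw HTML and returns text.
    """
    results = {}
    for path in sorted(glob.glob(os.path.join(folder, '*.html'))):
        with open(path, 'rb') as handle:  # bytes, like response.content
            raw_html = handle.read()
        results[os.path.basename(path)] = extract_text(raw_html)
    return results


def goose_text(raw_html):
    """Extract the article body with Goose, as in the sample above."""
    from goose import Goose  # imported lazily; third-party package
    return Goose().extract(raw_html=raw_html).cleaned_text
```

Calling `extract_stored_articles('pages', goose_text)` would then return the cleaned text for each stored page, where `'pages'` stands for whatever directory holds the files. Passing the extractor in as a callable keeps the file loop independent of the library, so a different extractor can be swapped in for comparison.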