Extracting comments from news articles


Problem description


My question is similar to the one asked here: https://stackoverflow.com/questions/14599485/news-website-comment-analysis

I am trying to extract comments from any news article. For example, I have a news URL here: http://www.cnn.com/2013/09/24/politics/un-obama-foreign-policy/

I am trying to use BeautifulSoup in Python to extract the comments. However, it seems the comment section is either embedded within an iframe or loaded through JavaScript. Viewing the source through Firebug does not reveal the source of the comments section, but explicitly viewing the source of the comments through the browser's view-source feature does. How do I go about extracting the comments, especially when they come from a different URL embedded within the news web page?

This is what I have done so far, although it is not much:

    import urllib2
    from bs4 import BeautifulSoup

    opener = urllib2.build_opener()

    url = 'http://www.cnn.com/2013/08/28/health/stem-cell-brain/index.html'

    # Fetch the article and parse it.
    urlContent = opener.open(url).read()
    soup = BeautifulSoup(urlContent, 'html.parser')
    title = soup.title.text

    print title

    # Write the text of <body> to a file, dropping non-ASCII characters.
    body = soup.findAll('body')
    with open("brain.txt", "w+") as outfile:
        for i in body:
            text = i.text.encode('ascii', 'ignore')
            outfile.write(text + '\n')

Any help in what I need to do or how to go about it will be much appreciated.

Solution

The comments are inside an iframe. Look for a frame with id="dsq2".

The iframe has a src attribute, which is a link to the actual page that holds the comments.

So in Beautiful Soup, use soup.select("#dsq2") and get the URL from the src attribute. It will lead you to a page that contains only the comments.
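A minimal sketch of that step, written for Python 3 with bs4 and run on an inline HTML string rather than the live page. The helper name `find_comment_frame_url`, the sample markup, and the example.com URL are all assumptions for illustration. If the iframe is only added to the page by JavaScript, it will not be in the fetched HTML and the function returns None.

```python
from bs4 import BeautifulSoup


def find_comment_frame_url(html, frame_id="dsq2"):
    """Return the src of the iframe with the given id, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    frame = soup.select_one("iframe#" + frame_id)
    if frame is None:
        return None
    return frame.get("src")


# Hypothetical markup standing in for the fetched article page.
sample_html = """
<html><body>
  <div id="comments">
    <iframe id="dsq2" src="http://example.com/comments-frame"></iframe>
  </div>
</body></html>
"""

print(find_comment_frame_url(sample_html))
```

Returning None instead of raising keeps the "iframe not present in the raw HTML" case easy to detect before making a second request.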

To get the actual comments, fetch the page that src points to and then use this CSS selector: .post-message p
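As a rough illustration of that selector, here is a Python 3 sketch using bs4 on an inline HTML string. The function name `extract_comments` and the sample markup are assumptions; the real comments page may be structured differently.

```python
from bs4 import BeautifulSoup


def extract_comments(html):
    """Return the text of every <p> nested under a .post-message element."""
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.select(".post-message p")]


# Hypothetical markup standing in for the page behind the iframe's src.
sample_html = """
<ul>
  <li><div class="post-message"><p>First comment.</p></div></li>
  <li><div class="post-message"><p>Second comment,</p><p>in two paragraphs.</p></div></li>
</ul>
"""

for comment in extract_comments(sample_html):
    print(comment)
```

Note that the selector yields one entry per paragraph, so a comment with several paragraphs appears as several items; grouping by the enclosing .post-message element would be needed to keep each comment together.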

If you want to load more comments, clicking the "more comments" button appears to send this request:

http://disqus.com/api/3.0/threads/listPostsThreaded?limit=50&thread=1660715220&forum=cnn&order=popular&cursor=2%3A0%3A0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F
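To request further pages, one option is to reuse the captured URL and change only its cursor parameter. The sketch below (Python 3, standard library only) shows that rewriting step without making any network call. The helper name `with_query` is hypothetical, `YOUR_API_KEY` is a placeholder for the key in the captured request, and the cursor value `3:0:0` is an assumed example, since the correct next cursor would have to come from the previous response.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query(url, **overrides):
    """Return url with the given query parameters replaced or added."""
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query))  # values are percent-decoded here
    params.update(overrides)
    return urlunsplit(parts._replace(query=urlencode(params)))


# The request from the answer, with the API key replaced by a placeholder.
captured = (
    "http://disqus.com/api/3.0/threads/listPostsThreaded"
    "?limit=50&thread=1660715220&forum=cnn&order=popular"
    "&cursor=2%3A0%3A0&api_key=YOUR_API_KEY"
)

# Assumed cursor value, for illustration only.
next_url = with_query(captured, cursor="3:0:0")
print(next_url)
```

Parsing and re-encoding the query, rather than editing the string by hand, keeps the percent-encoding of the cursor (`:` as `%3A`) consistent.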
