Web data scraping (online news comments) with Scrapy (Python)


Question


I want to scrape web comments data from online news purely for research. And I noticed that I have to learn about Scrapy...


Usually, I do my programming in Python. I thought it would be easy to learn, but I ran into some problems.

I want to scrape the comments on http://news.yahoo.com/congress-wary--but-unlikely-to-blow-up-obama-s-iran-deal-230545228.html.


But the problem is that there is a button (> View Comments (452)) to reveal the comments. In addition, what I want to do is scrape all the comments on that news item. Unfortunately, I have to click another button (View more comments) to see each further batch of 10 comments.


How can I handle this problem?


The code I've written so far is below. Sorry for the poor code.

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["news.yahoo.com"]

    start_urls = ["http://news.yahoo.com/blogs/oddnews/driver-offended-by-%E2%80%9Cwh0-r8x%E2%80%9D-license-plate-221720503.html"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div/p')
        items = []
        for site in sites:
            item = DmozItem()
            # './text()' selects the text children of the current node;
            # a leading '/text()' is an absolute path and matches nothing
            item['title'] = site.xpath('./text()').extract()
            items.append(item)
        return items


You can see how much is left to be done to solve my problem, but I have to hurry. I will do my best anyway.

Answer


Since you seem like the try-first, ask-questions-later type (that's a very good thing), I won't give you an answer, but a (very detailed) guide on how to find it.


The thing is, unless you are a Yahoo developer, you probably don't have access to the source code of the site you're trying to scrape. That is to say, you don't know exactly how the site is built or how your requests to it as a user are processed on the server side. You can, however, investigate the client side and try to emulate it. I like using Chrome Developer Tools for this, but you can use others such as Firefox's Firebug.


So first off we need to figure out what's going on. The way it works is: you click on 'View Comments' and it loads the first ten; then you need to keep clicking for the next ten comments each time. Notice, however, that all this clicking isn't taking you to a different link; it fetches the comments live, which is a very neat UI but for our case requires a bit more work. I can tell two things right away:

  1. They load the comments using JavaScript (since I stay on the same page).
  2. Each click loads them dynamically via an AJAX call (meaning the comments are not loaded with the page and merely revealed; each click issues another request to the server).


Now let's right-click and inspect element on that button. It's actually just a simple span with text:

<span>View Comments (2077)</span>


By looking at that we still don't know how it's generated or what it does when clicked. Fine. Now, keeping the devtools window open, let's click on it. This opened up the first ten comments; in fact, a request was made to fetch them, and Chrome devtools recorded it. We look in the Network tab of the devtools and see a lot of confusing data. Wait, here's one that makes sense:

http://news.yahoo.com/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc&_device=full&count=10&sortBy=highestRated&isNext=true&offset=20&pageNumber=2&_media.modules.content_comments.switches._enable_view_others=1&_media.modules.content_comments.switches._enable_mutecommenter=1&enable_collapsed_comment=1


See? _xhr and then get_comments. That makes a lot of sense. Going to that link in the browser gave me a JSON object (it looks like a Python dictionary) containing all ten comments that the request fetched. Now that's the request you need to emulate, because it's the one that gives you what you want. First let's translate it into a normal request that a human can read:

go to this url: http://news.yahoo.com/_xhr/contentcomments/get_comments/
include these parameters: {'_device': 'full',
          '_media.modules.content_comments.switches._enable_mutecommenter': '1',
          '_media.modules.content_comments.switches._enable_view_others': '1',
          'content_id': '42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc',
          'count': '10',
          'enable_collapsed_comment': '1',
          'isNext': 'true',
          'offset': '20',
          'pageNumber': '2',
          'sortBy': 'highestRated'}
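The request above can be rebuilt programmatically. This is a minimal sketch using only the standard library; the parameter values are the ones captured in devtools, and Yahoo may have changed or removed this endpoint since.

```python
# Sketch: rebuilding the comments XHR URL from its parameters.
from urllib.parse import urlencode

BASE = "http://news.yahoo.com/_xhr/contentcomments/get_comments/"

params = {
    "_device": "full",
    "_media.modules.content_comments.switches._enable_mutecommenter": "1",
    "_media.modules.content_comments.switches._enable_view_others": "1",
    "content_id": "42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc",
    "count": "10",
    "enable_collapsed_comment": "1",
    "isNext": "true",
    "offset": "20",
    "pageNumber": "2",
    "sortBy": "highestRated",
}

# Fetching this URL (e.g. with the requests library or Scrapy's Request)
# should return the JSON object with the next batch of comments.
url = BASE + "?" + urlencode(params)
```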


Now it's just a matter of trial-and-error. However, a few things to note here:


  1. Obviously the count is what decides how many comments you're getting. I tried changing it to 100 to see what happens and got a bad request. And it was nice enough to tell me why - "Offset should be multiple of total rows". So now we understand how to use offset.
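The count/offset observation above suggests a simple paging loop. This is a sketch, with `fetch_batch` standing in as a hypothetical helper that performs the XHR call and returns one batch of comments:

```python
# Sketch: walk the offset in multiples of the batch size, as the API requires.
def iter_comments(fetch_batch, total, count=10):
    """Yield comment batches; offset takes the values 0, count, 2*count, ... < total."""
    for offset in range(0, total, count):
        yield fetch_batch(count=count, offset=offset)

# Demo with a fake fetcher that returns comment indices instead of real data:
fake = lambda count, offset: list(range(offset, offset + count))
batches = list(iter_comments(fake, total=25, count=10))
```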


The content_id is probably something that identifies the article you are reading. Meaning you need to fetch that from the original page somehow. Try digging around a little, you'll find it.
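If the id shows up as a literal string in the article's HTML, a regex is one way to dig it out. The markup in this sketch is invented for illustration; inspect the real page source to see where the id actually appears.

```python
# Sketch: extract a 36-character content_id from the article HTML.
import re

# Hypothetical snippet of page source containing the id:
html = '<div data-url="/_xhr/contentcomments/get_comments/?content_id=42f7f6e0-7bae-33d3-aa1d-3dfc7fb5cdfc&count=10">'

match = re.search(r'content_id=([0-9a-f-]{36})', html)
content_id = match.group(1) if match else None
```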


Also, you obviously don't want to fetch 10 comments at a time, so it's probably a good idea to find a way to fetch the number of total comments somehow (either find out how the page gets it, or just fetch it from within the article itself)
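One place the total is visible is the "View Comments (2077)" span shown earlier, so a small regex over the page source could recover it. This assumes the count really appears in that text form:

```python
# Sketch: pull the total comment count out of the "View Comments (N)" span.
import re

html = '<span>View Comments (2077)</span>'
m = re.search(r'View Comments \((\d+)\)', html)
total = int(m.group(1)) if m else 0
```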


Using the devtools you have access to all client-side scripts. So by digging you can find that that link to /get_comments/ is kept within a javascript object named YUI. You can then try to understand how it is making the request, and try to emulate that (though you can probably figure it out yourself)


You might need to overcome some security measures. For example, you might need a session-key from the original article before you can access the comments. This is used to prevent direct access to some parts of the sites. I won't trouble you with the details, because it doesn't seem like a problem in this case, but you do need to be aware of it in case it shows up.


Finally, you'll have to parse the JSON object (python has excellent built-in tools for that) and then parse the html comments you are getting (for which you might want to check out BeautifulSoup).
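That final step might look like the sketch below. The shape of the JSON (a "comments" key holding an HTML fragment) is an assumption for illustration; check the actual response you get back.

```python
# Sketch: decode the JSON response, then parse the HTML fragment inside it.
import json
from bs4 import BeautifulSoup

# Hypothetical raw response body:
raw = '{"comments": "<div class=\\"comment\\"><p>Nice article!</p></div>"}'

data = json.loads(raw)                                  # now a Python dict
soup = BeautifulSoup(data["comments"], "html.parser")   # parse the HTML fragment
texts = [p.get_text() for p in soup.find_all("p")]      # comment texts
```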


As you can see, this will require some work, but despite all I've written, it's not an extremely complicated task either.

So don't panic.


It's just a matter of digging and digging until you find gold (also, having some basic WEB knowledge doesn't hurt). Then, if you face a roadblock and really can't go any further, come back here to SO, and ask again. Someone will help you.

Good luck!
