Can Scrapy be used to scrape dynamic content from websites that use AJAX?


Question

I have recently been learning Python and am trying my hand at building a web scraper. It's nothing fancy at all; its only purpose is to get data off a betting website and put that data into Excel.

Most of the issues are solvable and I'm having a good time tinkering. However, I'm hitting a massive hurdle on one issue. If a site loads a table of horses and lists current betting prices, this information is not in any source file. The clue is that this data is sometimes live, with the numbers obviously being updated from some remote server. The HTML on my PC simply has a hole where their servers push through all the interesting data that I need.

My experience with dynamic web content is limited, so this is something I'm having trouble getting my head around.

I think Java or JavaScript is the key; it comes up often.

The scraper is simply an odds comparison engine. Some sites have APIs, but I need this for those that don't. I'm using the Scrapy library with Python 2.7.

I apologize if this question is too open-ended. In short, my question is: how can Scrapy be used to scrape this dynamic data so that I can use it, so that I can scrape this betting odds data in real time?

Cheers, people :)

Recommended answer

WebKit-based browsers (like Google Chrome or Safari) have built-in developer tools. In Chrome you can open them via Menu->Tools->Developer Tools. The Network tab lets you see all the information about every request and response.

At the bottom of the picture you can see that I've filtered the requests down to XHR; these are requests made by JavaScript code.

Tip: the log is cleared every time you load a page; the black dot button at the bottom of the picture will preserve the log.

After analyzing the requests and responses, you can simulate these requests from your web crawler and extract the valuable data. In many cases this is easier than parsing HTML, because the data contains no presentation logic and is already formatted to be consumed by JavaScript code.
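As a rough illustration of replaying such a request outside the browser, here is a minimal sketch using only the Python 3 standard library. Everything specific in it is an assumption made for the example: the `ODDS_URL` endpoint, the `X-Requested-With` header, and the JSON shape (`runners`, `name`, `price`) are hypothetical and must be replaced with whatever the Network tab actually shows for the site in question.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint: substitute the URL observed in the Network tab.
ODDS_URL = "https://example.com/api/odds?race=123"


def build_xhr_request(url):
    """Build a request that resembles the browser's XHR call."""
    return Request(
        url,
        headers={
            # Some sites check this header; whether it is needed is site-specific.
            "X-Requested-With": "XMLHttpRequest",
            "Accept": "application/json",
        },
    )


def parse_odds(body):
    """Turn the (assumed) JSON payload into a list of (horse, price) rows."""
    payload = json.loads(body)
    return [(runner["name"], runner["price"]) for runner in payload["runners"]]


def fetch_odds(url=ODDS_URL):
    """Send the request and parse the response; requires network access."""
    with urlopen(build_xhr_request(url)) as response:
        return parse_odds(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Parse a made-up sample payload instead of hitting the network.
    sample = (
        '{"runners": [{"name": "Red Rum", "price": "5/1"},'
        ' {"name": "Arkle", "price": "7/2"}]}'
    )
    for horse, price in parse_odds(sample):
        print(horse, price)
```

In a Scrapy spider the same idea would presumably apply: yield a `scrapy.Request` for the XHR URL with matching headers, and decode the response body with `json.loads` in the callback instead of running HTML selectors over it.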

Firefox has a similar extension called Firebug. Some will argue that Firebug is even more powerful, but I like the simplicity of WebKit.

