Parsing ajax responses to retrieve final url content in Scrapy?

Problem description

I have the following problem:

My scraper starts at a "base" URL. This page contains a dropdown that creates another dropdown via AJAX calls, and this cascades 2-3 times until there is enough information to reach the "final" page, which holds the actual content I want to scrape.

Rather than clicking things (and having to use Selenium or similar), I use the page's exposed JSON API to mimic this behavior. Instead of clicking dropdowns, I simply send a request and read a JSON response containing the array of information used to generate the next dropdown's contents, and repeat this until I have the final URL for one item. This URL takes me to the final item page that I want to actually parse.
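
As a minimal sketch of what one such step might look like in a Scrapy callback (the payload shape, the "department_id" parameter, and the callback names here are hypothetical, not taken from the actual site):

import json
from scrapy.http import Request

def parse_dropdown_level(self, response):
    # Assumption: the AJAX endpoint returns a JSON array of options,
    # e.g. [{"id": "62", "name": "name 1"}, ...]; field and parameter
    # names are illustrative placeholders.
    options = json.loads(response.text)
    for option in options:
        next_url = response.url + "&department_id=" + str(option["id"])
        yield Request(url=next_url, callback=self.parse_next_level)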

I am confused about how to use Scrapy to get the "final" URL for every combination of dropdown boxes. I wrote a crawler using urllib that used a ton of loops to iterate through every combination of URLs, but Scrapy seems to work a bit differently. I moved away from urllib and lxml because Scrapy seemed like a more maintainable solution that is easier to integrate with Django projects.

Essentially, I am trying to force Scrapy to take a certain path that I generate along the way as I read the contents of the JSON responses, and only really parse the last page in the chain to get the real content. It needs to do this for every possible page, and I would love to parallelize it so things are efficient (and use Tor, but these are later issues).

I hope I have explained this well; let me know if you have any questions. Thank you so much for your help!

EDIT: Added an example

[base url]/?location=120&section=240

Returns:

<departments>
<department id="62" abrev="SIG" name="name 1"/>
<department id="63" abrev="ENH" name="name 2"/>
<department id="64" abrev="GGTH" name="name 3"/>
...[more]
</departments>

Then I grab the department id and add it to the URL like so:

[base url]/?location=120&section=240&department_id=62

Returns:

<courses>
<course id="1" name="name 1"/>
<course id="2" name="name 2"/>
</courses>

This continues until I end up with the actual link to the listing.
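
For the XML responses shown above, this intermediate step might look like the following sketch, using Scrapy's built-in response.xpath selectors (the callback names are placeholders):

from scrapy.http import Request

def parse_departments(self, response):
    # Pull the id attribute of every <department> element in the XML
    # response and build the next request in the cascade.
    for dep_id in response.xpath('//department/@id').extract():
        next_url = response.url + '&department_id=' + dep_id
        yield Request(url=next_url, callback=self.parse_courses)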

This is essentially what this looks like on the page (though in my case, there is a final "submit" button on the form that sends me to the actual listing that I want to parse): http://roshanbh.com.np/dropdown/

So, I need some way of scraping every combination of the dropdowns so that I get all the possible listing pages. The intermediate step of walking the AJAX XML responses to generate the final listing URLs is messing me up.

Answer

You can use a chain of callback functions, starting from the main callback. Say you're implementing a spider extending BaseSpider; write your parse function like this:

...

from scrapy.http import Request

def parse(self, response):
    # other code
    yield Request(url=self.base_url, callback=self.first_dropdown)

def first_dropdown(self, response):
    ids = self.parse_first_response(response)  # code for parsing the first dropdown's contents
    for i in ids:
        req_url = response.url + "/?location=" + i
        yield Request(url=req_url, callback=self.second_dropdown)

def second_dropdown(self, response):
    ids = self.parse_second_response(response)  # code for parsing the second dropdown's contents
    for i in ids:
        req_url = response.url + "&section=" + i
        yield Request(url=req_url, callback=self.third_dropdown)

...

The last callback function will have the code needed to extract your data.
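
For example, a final callback might look like this sketch (the XPath expressions and field names are hypothetical; adapt them to the real listing page):

def parse_listing(self, response):
    # Hypothetical extraction; replace the XPaths with ones matching
    # the actual listing page markup.
    yield {
        'title': response.xpath('//h1/text()').extract_first(),
        'price': response.xpath('//span[@class="price"]/text()').extract_first(),
        'url': response.url,
    }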

Be careful: you're asking to try all possible combinations of input, and this can lead to a very high number of requests very fast.
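
If the request volume becomes a problem, Scrapy's standard settings can cap concurrency and pace the crawl; a minimal sketch (the values shown are illustrative, not recommendations):

# settings.py -- illustrative values, tune to the target site
CONCURRENT_REQUESTS = 16             # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-domain cap
DOWNLOAD_DELAY = 0.5                 # seconds between requests to one domain
AUTOTHROTTLE_ENABLED = True          # adapt the delay to server response times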
