Scraping site that uses AJAX


Problem description


I've read some relevant posts here but couldn't figure an answer.


I'm trying to crawl a web page with reviews. When the site is visited, only 10 reviews are shown at first, and the user has to press "Show more" to load 10 more (which also appends #add10 to the end of the site's address) each time they scroll to the end of the review list. In fact, a user can get the full review list by appending #add1000 (where 1000 is the number of additional reviews) to the end of the site's address. The problem is that in my spider, site_url#add1000 returns only the first 10 reviews, just like site_url, so this approach doesn't work.
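The reason the #add1000 trick fails in a spider is that a URL fragment is consumed by the page's JavaScript in the browser and is never sent to the server, so the HTTP response is identical to site_url. A minimal sketch (using a placeholder URL) showing that the fragment is not part of what goes over the wire:

```python
from urllib.parse import urlsplit

# Placeholder URL standing in for the real review page.
parts = urlsplit("https://site_url/reviews#add1000")

# Only the path (and query, if any) appear in the HTTP request line.
print(parts.path)      # /reviews
print(parts.fragment)  # add1000 -- stays client-side, never sent to the server
```

This is why the server returns the same first-10-reviews page regardless of the fragment; only the page's own JavaScript reacts to it.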


I also can't find a way to make an appropriate Request imitating the original one issued by the site. The original AJAX URL has the form 'domain/ajaxlst?par1=x&par2=y', and I tried all of the following:

Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all) 
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
        headers={all_headers}, cookies={all_cookies})
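A common reason a replayed AJAX URL returns 404 is a server-side check on headers such as X-Requested-With or Referer (an assumption here; the exact requirement varies by site and should be copied from the browser's network tab). A minimal standard-library sketch, reusing the placeholder 'domain/ajaxlst' endpoint and 'par1'/'par2' parameters from the question:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint and parameters from the question above.
base = "https://domain/ajaxlst"
url = f"{base}?{urlencode({'par1': 'x', 'par2': 'y'})}"

# Headers that AJAX endpoints frequently require (an assumption; verify
# against the actual request the browser sends for the real site).
req = Request(url, headers={
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "https://domain/reviews",
})
print(req.full_url)
```

In Scrapy, the same headers can be passed through the headers= argument of Request, as the question already attempts; the point is to match the browser's AJAX request exactly, not just the URL.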


But every time I get a 404 error. Can anyone explain what I'm doing wrong?

Answer


What you need here is a headless browser, since the requests module cannot handle AJAX well.

One such headless browser is Selenium.

i.e.:

driver.find_element_by_id("show more").click() # This is just an example case

