Scraping site that uses AJAX
Question
I've read some relevant posts here but couldn't figure out an answer.
I'm trying to crawl a web page with reviews. When the site is visited, only 10 reviews are shown at first, and the user has to press "Show more" to get 10 more reviews (which also appends #add10 to the end of the site's address) every time they scroll to the end of the review list. In fact, a user can get the full review list by appending #add1000 (where 1000 is the number of additional reviews) to the end of the site's address. The problem is that in my spider site_url#add1000 returns only the first 10 reviews, just like site_url, so this approach doesn't work.
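The reason the #add1000 trick fails outside the browser is that the URL fragment (everything after #) is never sent to the server; the site's own JavaScript reads it client-side to decide how many reviews to render. The standard library makes this easy to see (the URL below is a placeholder, not the real site):

```python
from urllib.parse import urlsplit

# Hypothetical review-page URL; only path and query are sent in the HTTP request.
parts = urlsplit("http://site_url/reviews#add1000")
print(parts.path)      # "/reviews"  -- what the server actually receives
print(parts.fragment)  # "add1000"   -- stays in the browser, invisible to the server
```

So from the server's point of view, site_url and site_url#add1000 are the same request.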
I also can't find a way to make an appropriate Request imitating the original one from the site. The original AJAX URL is of the form 'domain/ajaxlst?par1=x&par2=y', and I tried all of the following:
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all)
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
headers={all_headers})
Request(url='domain/ajaxlst?par1=x&par2=y', callback=self.parse_all,
headers={all_headers}, cookies={all_cookies})
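One common cause of a 404 in attempts like these is passing a scheme-less URL such as 'domain/ajaxlst?...' instead of the full absolute URL as captured in the browser's Network tab. A hedged sketch of building the absolute AJAX URL with the standard library, where the host, path, and parameters are placeholders taken from the question:

```python
from urllib.parse import urlencode, urljoin

base = "http://domain/"              # placeholder -- use the real scheme and host
params = {"par1": "x", "par2": "y"}  # placeholder parameters from the question
ajax_url = urljoin(base, "ajaxlst") + "?" + urlencode(params)
print(ajax_url)  # http://domain/ajaxlst?par1=x&par2=y
```

The resulting ajax_url can then be passed to scrapy.Request. Note that many AJAX endpoints also check request headers (e.g. X-Requested-With: XMLHttpRequest or a Referer), so comparing your spider's headers against the browser's captured request is worth doing before reaching for a browser-based solution.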
But every time I get a 404 error. Can anyone explain what I'm doing wrong?
Recommended answer
What you need here is a headless browser, since the requests module cannot handle AJAX well.
One such headless browser is Selenium.
e.g.:
driver.find_element_by_id("show more").click()  # just an example; in Selenium 4+ use driver.find_element(By.ID, "show more")
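A fuller sketch of this approach for the reviews page, assuming hypothetical element ids and selectors (the real ones must be taken from the page with the browser's inspector); it requires a running browser and a matching driver binary:

```python
# Hedged sketch: URL, element id, and CSS selector are assumptions, not the real site's.
import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()   # or webdriver.Chrome()
driver.get("http://site_url")  # placeholder URL from the question

while True:
    try:
        # "show-more" is a hypothetical id; inspect the page for the real one.
        driver.find_element(By.ID, "show-more").click()
        time.sleep(1)          # crude wait; WebDriverWait is more robust
    except NoSuchElementException:
        break                  # no more "Show more" button -> all reviews loaded

reviews = driver.find_elements(By.CSS_SELECTOR, ".review")  # hypothetical selector
driver.quit()
```

Once every review is loaded, the page source (driver.page_source) can be handed to your usual parsing code.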