Troubles using Scrapy with the JavaScript __doPostBack method
Question
Trying to automatically grab the search results from a public search, but running into some trouble. The URL is of the form
http://www.website.com/search.aspx?keyword=#&&page=1&sort=Sorting
As I click through the pages, after visiting this page, it changes slightly to
http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page=2
Problem being, if I then try to directly visit the second link without first visiting the first link, I am redirected to the first link. My current attempt at this is defining a long list of start_urls in scrapy.
class websiteSpider(BaseSpider):
    name = "website"
    allowed_domains = ["website.com"]
    baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
    start_urls = [(baseUrl + str(i)) for i in range(1, 1000)]
Currently this code simply ends up visiting the first page over and over again. I feel like this is probably straightforward, but I don't quite know how to get around this.
UPDATE: Made some progress investigating this and found that the site updates each page by sending a POST request to the previous page using __doPostBack(arg1, arg2). My question now is how exactly do I mimic this POST request using scrapy. I know how to make a POST request, but not exactly how to pass it the arguments I want.
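For context, `__doPostBack(arg1, arg2)` just fills two hidden inputs (`__EVENTTARGET` and `__EVENTARGUMENT`) and submits the page's form, and ASP.NET pages typically also require the `__VIEWSTATE`/`__EVENTVALIDATION` hidden fields from the current page, or the postback is rejected — which would explain the redirect back to page 1 when visiting page 2 directly. A minimal sketch of the payload such a postback submits (the helper name and the state-field value are placeholders of mine, not from the site):

```python
def dopostback_payload(event_target, event_argument, hidden_fields=None):
    """Build the POST body that __doPostBack(event_target, event_argument)
    would submit.

    hidden_fields: dict of ASP.NET state fields scraped from the current
    page (typically __VIEWSTATE and __EVENTVALIDATION). Without them the
    server usually rejects the postback, hence the redirect to page 1.
    """
    payload = dict(hidden_fields or {})
    payload['__EVENTTARGET'] = event_target
    payload['__EVENTARGUMENT'] = event_argument
    return payload

payload = dopostback_payload('ctl00$empcnt$ucResults$pagination', '2',
                             {'__VIEWSTATE': '...scraped value...'})
```

In Scrapy, `FormRequest.from_response(response, formdata={'__EVENTTARGET': target, '__EVENTARGUMENT': argument})` should copy those hidden state fields over from the previous response automatically, so you don't have to scrape them by hand.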
SECOND UPDATE: I've been making a lot of progress! I think... I looked through examples and documentation and eventually slapped together this version of what I think should do the trick:
def start_requests(self):
    baseUrl = "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page="
    target = 'ctl00$empcnt$ucResults$pagination'
    requests = []
    for i in range(1, 5):
        url = baseUrl + str(i)
        argument = str(i + 1)
        data = {'__EVENTTARGET': target, '__EVENTARGUMENT': argument}
        currentPage = FormRequest(url, data)
        requests.append(currentPage)
    return requests
The idea is that this treats the POST request just like a form and updates accordingly. However, when I actually try to run this I get the following traceback(s) (Condensed for brevity):
2013-03-22 04:03:03-0400 [guru] ERROR: Unhandled error on engine.crawl()
    dfd.addCallbacks(request.callback or spider.parse, request.errback)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 280, in addCallbacks
    assert callable(callback)
exceptions.AssertionError:

2013-03-22 04:03:03-0400 [-] ERROR: Unhandled error in Deferred:
2013-03-22 04:03:03-0400 [-] Unhandled Error
Traceback (most recent call last):
Failure: scrapy.exceptions.IgnoreRequest: Skipped (request already seen)
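The first traceback looks like an argument-passing issue rather than anything about the POST itself: `Request`'s second positional parameter is the callback, so `FormRequest(url, data)` puts the form dict in the callback slot, and Twisted's `addCallbacks` then asserts it is callable. A tiny pure-Python illustration (no Scrapy needed; the function names are mine):

```python
# Mirror the check from twisted.internet.defer.Deferred.addCallbacks
# that produces "assert callable(callback)" in the traceback above.
def would_pass_addCallbacks(callback):
    return callable(callback)

# The form dict, passed positionally, lands in the callback slot:
form_data = {'__EVENTTARGET': 'ctl00$empcnt$ucResults$pagination',
             '__EVENTARGUMENT': '2'}

def parse(response):  # an actual callback function would be fine
    pass

print(would_pass_addCallbacks(parse))      # True
print(would_pass_addCallbacks(form_data))  # False -> AssertionError
```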
Changing question to be more directed at what this post has turned into.
Thoughts?
P.S. When the second error happens, Scrapy is unable to cleanly shut down, and I have to send SIGINT twice to get things to actually wrap up.
Answer
FormRequest doesn't have a positional formdata argument in its constructor:
class FormRequest(Request):
    def __init__(self, *args, **kwargs):
        formdata = kwargs.pop('formdata', None)
So you do have to pass formdata= explicitly:
requests.append(FormRequest(url, formdata=data))
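Putting it together, here is a sketch of the corrected start_requests loop. Two things here are my additions, not from the question: `dont_filter=True` (to get past the "request already seen" dupefilter skip in the second traceback; `dont_filter` is a standard `Request` keyword), and the `request_cls` parameter, which stands in for `scrapy.FormRequest` so the sketch runs without Scrapy installed:

```python
def build_page_requests(request_cls, base_url, target, pages):
    """Build one __doPostBack-style form request per results page.

    request_cls: scrapy.FormRequest in a real spider; parameterized
    here so the sketch is self-contained.
    """
    requests = []
    for i in pages:
        data = {'__EVENTTARGET': target, '__EVENTARGUMENT': str(i + 1)}
        requests.append(request_cls(base_url + str(i),
                                    formdata=data,       # keyword, not positional
                                    dont_filter=True))   # skip the dupefilter
    return requests

class DummyRequest:
    """Minimal stand-in for FormRequest, for demonstration only."""
    def __init__(self, url, formdata=None, dont_filter=False):
        self.url, self.formdata, self.dont_filter = url, formdata, dont_filter

reqs = build_page_requests(
    DummyRequest,
    "http://www.website.com/search.aspx?keyword=#&&sort=Sorting&page=",
    'ctl00$empcnt$ucResults$pagination',
    range(1, 5))
```

In the spider itself you would simply return `build_page_requests(FormRequest, baseUrl, target, range(1, 5))` from start_requests, or inline the loop with `FormRequest(url, formdata=data, dont_filter=True)`.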