Scrapy crawling not working on ASPX website

Problem description

I'm scraping the Madrid Assembly's website, which is built in ASPX, and I have no idea how to simulate clicks on the links that lead to the politicians of each group. I tried this:

import scrapy

class AsambleaMadrid(scrapy.Spider):

name        =   "Asamblea_Madrid"
start_urls  =   ['http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx']

def parse(self, response):

    for id in response.css('div#moduloBusqueda div.sangria div.sangria ul li a::attr(id)'):
        target                  =   id.extract()
        url                     =   "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx"

        formdata=   {'__EVENTTARGET': target,
                     '__VIEWSTATE': '/wEPDwUBMA9kFgJmD2QWAgIBD2QWBAIBD2QWAgIGD2QWAmYPZBYCAgMPZBYCAgMPFgIeE1ByZXZpb3VzQ29udHJvbE1vZGULKYgBTWljcm9zb2Z0LlNoYXJlUG9pbnQuV2ViQ29udHJvbHMuU1BDb250cm9sTW9kZSwgTWljcm9zb2Z0LlNoYXJlUG9pbnQsIFZlcnNpb249MTQuMC4wLjAsIEN1bHR1cmU9bmV1dHJhbCwgUHVibGljS2V5VG9rZW49NzFlOWJjZTExMWU5NDI5YwFkAgMPZBYMAgMPZBYGBSZnXzM2ZWEwMzEwXzg5M2RfNGExOV85ZWQxXzg4YTEzM2QwNjQyMw9kFgJmD2QWAgIBDxYCHgtfIUl0ZW1Db3VudAIEFghmD2QWAgIBDw8WBB4PQ29tbWFuZEFyZ3VtZW50BTRHcnVwbyBQYXJsYW1lbnRhcmlvIFBvcHVsYXIgZGUgbGEgQXNhbWJsZWEgZGUgTWFkcmlkHgRUZXh0BTRHcnVwbyBQYXJsYW1lbnRhcmlvIFBvcHVsYXIgZGUgbGEgQXNhbWJsZWEgZGUgTWFkcmlkZGQCAQ9kFgICAQ8PFgQfAgUeR3J1cG8gUGFybGFtZW50YXJpbyBTb2NpYWxpc3RhHwMFHkdydXBvIFBhcmxhbWVudGFyaW8gU29jaWFsaXN0YWRkAgIPZBYCAgEPDxYEHwIFL0dydXBvIFBhcmxhbWVudGFyaW8gUG9kZW1vcyBDb211bmlkYWQgZGUgTWFkcmlkHwMFL0dydXBvIFBhcmxhbWVudGFyaW8gUG9kZW1vcyBDb211bmlkYWQgZGUgTWFkcmlkZGQCAw9kFgICAQ8PFgQfAgUhR3J1cG8gUGFybGFtZW50YXJpbyBkZSBDaXVkYWRhbm9zHwMFIUdydXBvIFBhcmxhbWVudGFyaW8gZGUgQ2l1ZGFkYW5vc2RkBSZnX2MxNTFkMGIxXzY2YWZfNDhjY185MWM3X2JlOGUxMTZkN2Q1Mg9kFgRmDxYCHgdWaXNpYmxlaGQCAQ8WAh8EaGQFJmdfZTBmYWViMTVfOGI3Nl80MjgyX2ExYjFfNTI3ZDIwNjk1ODY2D2QWBGYPFgIfBGhkAgEPFgIfBGhkAhEPZBYCAgEPZBYEZg9kFgICAQ8WAh8EaBYCZg9kFgQCAg9kFgQCAQ8WAh8EaGQCAw8WCB4TQ2xpZW50T25DbGlja1NjcmlwdAW7AWphdmFTY3JpcHQ6Q29yZUludm9rZSgnVGFrZU9mZmxpbmVUb0NsaWVudFJlYWwnLDEsIDEsICdodHRwOlx1MDAyZlx1MDAyZnd3dy5hc2FtYmxlYW1hZHJpZC5lc1x1MDAyZkVTXHUwMDJmUXVlRXNMYUFzYW1ibGVhXHUwMDJmQ29tcG9zaWNpb25kZWxhQXNhbWJsZWFcdTAwMmZMb3NEaXB1dGFkb3MnLCAtMSwgLTEsICcnLCAnJykeGENsaWVudE9uQ2xpY2tOYXZpZ2F0ZVVybGQeKENsaWVudE9uQ2xpY2tTY3JpcHRDb250YWluaW5nUHJlZml4ZWRVcmxkHgxIaWRkZW5TY3JpcHQFIVRha2VPZmZsaW5lRGlzYWJsZWQoMSwgMSwgLTEsIC0xKWQCAw8PFgoeCUFjY2Vzc0tleQUBLx4PQXJyb3dJbWFnZVdpZHRoAgUeEEFycm93SW1hZ2VIZWlnaHQCAx4RQXJyb3dJbWFnZU9mZnNldFhmHhFBcnJvd0ltYWdlT2Zmc2V0WQLrA2RkAgEPZBYCAgUPZBYCAgEPEBYCHwRoZBQrAQBkAhcPZBYIZg8PFgQfAwUPRW5nbGlzaCBWZXJzaW9uHgtOYXZpZ2F0ZVVybAVfL0VOL1F1ZUVzTGFBc2FtYmxlYS9Db21wb3NpY2lvbmRlbGFBc2FtYmxlYS9Mb3NEaXB1dGFkb3MvUGF
nZXMvUmVsYWNpb25BbGZhYmV0aWNhRGlwdXRhZG9zLmFzcHhkZAICDw8WBB8DBQZQcmVuc2EfDgUyL0VTL0JpZW52ZW5pZGFQcmVuc2EvUGFnaW5hcy9CaWVudmVuaWRhUHJlbnNhLmFzcHhkZAIEDw8WBB8DBRpJZGVudGlmaWNhY2nDs24gZGUgVXN1YXJpbx8OBTQvRVMvQXJlYVVzdWFyaW9zL1BhZ2luYXMvSWRlbnRpZmljYWNpb25Vc3Vhcmlvcy5hc3B4ZGQCBg8PFgQfAwUGQ29ycmVvHw4FKGh0dHA6Ly9vdXRsb29rLmNvbS9vd2EvYXNhbWJsZWFtYWRyaWQuZXNkZAIlD2QWAgIDD2QWAgIBDxYCHwALKwQBZAI1D2QWAgIHD2QWAgIBDw8WAh8EaGQWAgIDD2QWAmYPZBYCAgMPZBYCAgUPDxYEHgZIZWlnaHQbAAAAAAAAeUABAAAAHgRfIVNCAoABZBYCAgEPPCsACQEADxYEHg1QYXRoU2VwYXJhdG9yBAgeDU5ldmVyRXhwYW5kZWRnZGQCSQ9kFgICAg9kFgICAQ9kFgICAw8WAh8ACysEAWQYAgVBY3RsMDAkUGxhY2VIb2xkZXJMZWZ0TmF2QmFyJFVJVmVyc2lvbmVkQ29udGVudDMkVjRRdWlja0xhdW5jaE1lbnUPD2QFKUNvbXBvc2ljacOzbiBkZSBsYSBBc2FtYmxlYVxMb3MgRGlwdXRhZG9zZAVHY3RsMDAkUGxhY2VIb2xkZXJUb3BOYXZCYXIkUGxhY2VIb2xkZXJIb3Jpem9udGFsTmF2JFRvcE5hdmlnYXRpb25NZW51VjQPD2QFGkluaWNpb1xRdcOpIGVzIGxhIEFzYW1ibGVhZJ',
                     '__EVENTVALIDATION': '/wEWCALIhqvYAwKh2YVvAuDF1KUDAqCK1bUOAqCKybkPAqCKnbQCAqCKsZEJAvejv84Dtkx5dCFr3QGqQD2wsFQh8nP3iq8',
                     '__VIEWSTATEGENERATOR': 'BAB98CB3',
                     '__REQUESTDIGEST': '0x476239970DCFDABDBBDF638A1F9B026BD43022A10D1D757B05F1071FF3104459B4666F96A47B4845D625BCB2BE0D88C6E150945E8F5D82C189B56A0DA4BC859D'}

        yield scrapy.FormRequest(url=url, formdata= formdata, callback=self.takeEachParty)


def takeEachParty(self, response):

     print(response.css('ul.listadoVert02 ul li::text').extract())

Looking at the source code of the website, I can see what the links look like and how they send the JavaScript query. This is one of the links I need to access:

<a id="ctl00_m_g_36ea0310_893d_4a19_9ed1_88a133d06423_ctl00_Repeater1_ctl00_lnk_Grupo" href="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl00$lnk_Grupo&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, true))">Grupo Parlamentario Popular de la Asamblea de Madrid</a>

I have been reading many articles about this, but the problem is probably my own ignorance of the subject.

Thanks in advance.

Edited:

SOLUTION: I finally did it! I translated the very helpful code from Padraic Cunningham into the Scrapy way. Since I asked specifically about Scrapy, I want to post the result in case someone has the same problem I had.

Here it is:

import scrapy
import js2xml


class AsambleaMadrid(scrapy.Spider):

    name        =   "AsambleaMadrid"
    start_urls  =   ['http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx']

    def parse(self, response):

        source  =   response
        hrefs   =   response.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href").extract()
        form_data = self.validate(source)
        for ref in hrefs:
            # js2xml allows us to parse the JS function and params, and so to grab the __EVENTTARGET
            js_xml            = js2xml.parse(ref)
            _id               = js_xml.xpath(
                "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[0]
            form_data["__EVENTTARGET"] = _id.text

            url_diputado    =   'http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx'
            # The proper way to send a POST in scrapy is by using the FormRequest
            yield scrapy.FormRequest(url=url_diputado, formdata=form_data, callback=self.extract_parties, method='POST')

    def validate(self, source):
        # these hidden fields are the minimum required; they are generated per page, so they cannot be hardcoded
        data = {"__VIEWSTATEGENERATOR": source.xpath("//*[@id='__VIEWSTATEGENERATOR']/@value")[0].extract(),
                "__EVENTVALIDATION": source.xpath("//*[@id='__EVENTVALIDATION']/@value")[0].extract(),
                "__VIEWSTATE": source.xpath("//*[@id='__VIEWSTATE']/@value")[0].extract(),
                "__REQUESTDIGEST": source.xpath("//*[@id='__REQUESTDIGEST']/@value")[0].extract()}
        return data

    def extract_parties(self, response):
        source      =   response
        # names of the politicians listed for the selected group
        name        =   source.xpath("//ul[@class='listadoVert02']/ul/li/a/text()").extract()
        print(name)

I hope it is clear. Thanks again, everybody!

Recommended answer

If you look at the data posted to the form in Chrome or Firebug, you can see that many fields are passed in the POST request. A few of them are essential and must be parsed from the original page. Parsing the ids from the div.sangria ul li a tags is not sufficient, because the data actually posted is slightly different: what is posted is the argument of the JavaScript function WebForm_DoPostBackWithOptions, which is in the href, not the id attribute:

href='javascript:WebForm_DoPostBackWithOptions(new 
 WebForm_PostBackOptions("ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl03$lnk_Grupo", "", true, "", "", false, true))'>

Sometimes all the underscores in the id are simply replaced with dollar signs, so a str.replace is enough to get the posted value. That is not really the case here: the control name itself contains underscores (g_36ea0310_893d_...) that must stay as they are. We could use a regex to parse the href, but I like the js2xml lib, which can parse a JavaScript function and its args into an XML tree.
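For comparison, the regex route mentioned above could look roughly like the sketch below. It is only an illustration, not the approach used in this answer: the helper name event_target_from_href and the pattern are made up for the example, and it assumes the target is always the first string argument of WebForm_PostBackOptions, quoted either with a plain double quote or with the &quot; entity seen in the page source.

```python
import re

# Assumption: the postback target is the first string argument of
# WebForm_PostBackOptions(...). The quote may appear literally or as the
# HTML entity &quot;, depending on whether the href was already unescaped.
_TARGET_RE = re.compile(
    r'WebForm_PostBackOptions\(\s*(?:"|&quot;)(.*?)(?:"|&quot;)'
)


def event_target_from_href(href):
    """Return the value to send as __EVENTTARGET, or None if not found."""
    match = _TARGET_RE.search(href)
    return match.group(1) if match else None


href = (
    'javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions('
    '"ctl00$m$g_36ea0310_893d_4a19_9ed1_88a133d06423$ctl00$Repeater1$ctl03$lnk_Grupo", '
    '"", true, "", "", false, true))'
)
print(event_target_from_href(href))
```

A regex like this is more brittle than js2xml if the site changes how it writes the call, which is one reason to prefer a real parser.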

The following code, which uses requests, shows how you can get the data from the initial request and reach all the pages you want:

import requests
from lxml import html
import js2xml

post = "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx"


def validate(xml):
    # these hidden fields are the minimum required; they are generated per page, so they cannot be hardcoded
    data = {"__VIEWSTATEGENERATOR": xml.xpath("//*[@id='__VIEWSTATEGENERATOR']/@value")[0],
            "__EVENTVALIDATION": xml.xpath("//*[@id='__EVENTVALIDATION']/@value")[0],
            "__VIEWSTATE": xml.xpath("//*[@id='__VIEWSTATE']/@value")[0],
            "__REQUESTDIGEST": xml.xpath("//*[@id='__REQUESTDIGEST']/@value")[0]}
    return data



with requests.Session() as s:
    # make the initial request to get the links/hrefs and the form fields
    r = s.get(
        "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx")
    xml = html.fromstring(r.content)
    hrefs = xml.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href")
    form_data = validate(xml)
    for h in hrefs:
        js_xml = js2xml.parse(h)
        _id = js_xml.xpath(
            "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[
            0]
        form_data["__EVENTTARGET"] = _id.text
        r = s.post(post, data=form_data)
        xml = html.fromstring(r.content)
        print(xml.xpath("//ul[@class='listadoVert02']/ul/li/a/text()"))

If we run the code above, we see the text of all the anchor tags, one list per group:

In [2]: with requests.Session() as s:
   ...:         r = s.get(
   ...:             "http://www.asambleamadrid.es/ES/QueEsLaAsamblea/ComposiciondelaAsamblea/LosDiputados/Paginas/RelacionAlfabeticaDiputados.aspx")
   ...:         xml = html.fromstring(r.content)
   ...:         hrefs = xml.xpath("//*[@id='moduloBusqueda']//div[@class='sangria']/ul/li/a/@href")
   ...:         form_data = validate(xml)
   ...:         for h in hrefs:
   ...:                 js_xml = js2xml.parse(h)
   ...:                 _id = js_xml.xpath(
   ...:                     "//identifier[@name='WebForm_PostBackOptions']/following-sibling::arguments/string[starts-with(.,'ctl')]")[
   ...:                     0]
   ...:                 form_data["__EVENTTARGET"] = _id.text
   ...:                 r = s.post(post, data=form_data)
   ...:                 xml = html.fromstring(r.content)
   ...:                 print(xml.xpath("//ul[@class='listadoVert02']/ul/li/a/text()"))
   ...:         
[u'Abo\xedn Abo\xedn, Sonsoles Trinidad', u'Adrados Gautier, M\xaa Paloma', u'Aguado Del Olmo, M\xaa Josefa', u'\xc1lvarez Padilla, M\xaa Nadia', u'Arribas Del Barrio, Jos\xe9 M\xaa', u'Ballar\xedn Valc\xe1rcel, \xc1lvaro C\xe9sar', u'Berrio Fern\xe1ndez-Caballero, M\xaa In\xe9s', u'Berzal Andrade, Jos\xe9 Manuel', u'Cam\xedns Mart\xednez, Ana', u'Carballedo Berlanga, M\xaa Eugenia', 'Cifuentes Cuencas, Cristina', u'D\xedaz Ayuso, Isabel Natividad', u'Escudero D\xedaz-Tejeiro, Marta', u'Fermosel D\xedaz, Jes\xfas', u'Fern\xe1ndez-Quejo Del Pozo, Jos\xe9 Luis', u'Garc\xeda De Vinuesa Gardoqui, Ignacio', u'Garc\xeda Mart\xedn, Mar\xeda Bego\xf1a', u'Garrido Garc\xeda, \xc1ngel', u'G\xf3mez Ruiz, Jes\xfas', u'G\xf3mez-Angulo Rodr\xedguez, Juan Antonio', u'Gonz\xe1lez Gonz\xe1lez, Isabel Gema', u'Gonz\xe1lez Jim\xe9nez, Bartolom\xe9', u'Gonz\xe1lez Taboada, Jaime', u'Gonz\xe1lez-Mo\xf1ux V\xe1zquez, Elena', u'Gonzalo L\xf3pez, Rosal\xeda', 'Izquierdo Torres, Carlos', u'Li\xe9bana Montijano, Pilar', u'Mari\xf1o Ortega, Ana Isabel', u'Moraga Valiente, \xc1lvaro', u'Mu\xf1oz Abrines, Pedro', u'N\xfa\xf1ez Guijarro, Jos\xe9 Enrique', u'Olmo Fl\xf3rez, Luis Del', u'Ongil Cores, M\xaa Gador', 'Ortiz Espejo, Daniel', u'Ossorio Crespo, Enrique Mat\xedas', 'Peral Guerra, Luis', u'P\xe9rez Baos, Ana Isabel', u'P\xe9rez Garc\xeda, David', u'Pla\xf1iol De Lacalle, Regina M\xaa', u'Redondo Alcaide, M\xaa Isabel', u'Roll\xe1n Ojeda, Pedro', u'S\xe1nchez Fern\xe1ndez, Alejandro', 'Sanjuanbenito Bonal, Diego', u'Serrano Guio, Jos\xe9 Tom\xe1s', u'Serrano S\xe1nchez-Capuchino, Alfonso Carlos', 'Soler-Espiauba Gallo, Juan', 'Toledo Moreno, Lucila', 'Van-Halen Acedo, Juan']
[u'Andaluz Andaluz, M\xaa Isabel', u'Ardid Jim\xe9nez, M\xaa Isabel', u'Carazo G\xf3mez, M\xf3nica', u'Casares D\xedaz, M\xaa Luc\xeda Inmaculada', u'Cepeda Garc\xeda De Le\xf3n, Jos\xe9 Carmelo', 'Cruz Torrijos, Diego', u'Delgado G\xf3mez, Carla', u'Franco Pardo, Jos\xe9 Manuel', u'Freire Campo, Jos\xe9 Manuel', u'Gabilondo Pujol, \xc1ngel', 'Gallizo Llamas, Mercedes', u"Garc\xeda D'Atri, Ana", u'Garc\xeda-Rojo Garrido, Pedro Pablo', u'G\xf3mez Montoya, Rafael', u'G\xf3mez-Chamorro Torres, Jos\xe9 \xc1ngel', u'Gonz\xe1lez Gonz\xe1lez, M\xf3nica Silvana', u'Leal Fern\xe1ndez, M\xaa Isaura', u'Llop Cuenca, M\xaa Pilar', 'Lobato Gandarias, Juan', u'L\xf3pez Ruiz, M\xaa Carmen', u'Manguan Valderrama, Eva M\xaa', u'Maroto Illera, M\xaa Reyes', u'Mart\xednez Ten, Carmen', u'Mena Romero, M\xaa Carmen', u'Moreno Navarro, Juan Jos\xe9', u'Moya Nieto, Encarnaci\xf3n', 'Navarro Lanchas, Josefa', 'Nolla Estrada, Modesto', 'Pardo Ortiz, Josefa Dolores', u'Quintana Viar, Jos\xe9', u'Rico Garc\xeda-Hierro, Enrique', u'Rodr\xedguez Garc\xeda, Nicol\xe1s', u'S\xe1nchez Acera, Pilar', u'Sant\xedn Fern\xe1ndez, Pedro', 'Segovia Noriega, Juan', 'Vicente Viondi, Daniel', u'Vinagre Alc\xe1zar, Agust\xedn']
['Abasolo Pozas, Olga', 'Ardanuy Pizarro, Miguel', u'Beirak Ulanosky, Jazm\xedn', u'Camargo Fern\xe1ndez, Ra\xfal', 'Candela Pokorna, Marco', 'Delgado Orgaz, Emilio', u'D\xedaz Rom\xe1n, Laura', u'Espinar Merino, Ram\xf3n', u'Espinosa De La Llave, Mar\xeda', u'Fern\xe1ndez Rubi\xf1o, Eduardo', u'Garc\xeda G\xf3mez, M\xf3nica', 'Gimeno Reinoso, Beatriz', u'Guti\xe9rrez Benito, Eduardo', 'Huerta Bravo, Raquel', u'L\xf3pez Hern\xe1ndez, Isidro', u'L\xf3pez Rodrigo, Jos\xe9 Manuel', u'Mart\xednez Abarca, Hugo', u'Morano Gonz\xe1lez, Jacinto', u'Ongil L\xf3pez, Miguel', 'Padilla Estrada, Pablo', u'Ruiz-Huerta Garc\xeda De Viedma, Lorena', 'Salazar-Alonso Revuelta, Cecilia', u'San Jos\xe9 P\xe9rez, Carmen', u'S\xe1nchez P\xe9rez, Alejandro', u'Serra S\xe1nchez, Isabel', u'Serra S\xe1nchez, Clara', 'Sevillano De Las Heras, Elena']
[u'Aguado Crespo, Ignacio Jes\xfas', u'\xc1lvarez Cabo, Daniel', u'Gonz\xe1lez Pastor, Dolores', u'Iglesia Vicente, M\xaa Teresa De La', 'Lara Casanova, Francisco', u'Marb\xe1n De Frutos, Marta', u'Marcos Arias, Tom\xe1s', u'Meg\xedas Morales, Jes\xfas Ricardo', u'N\xfa\xf1ez S\xe1nchez, Roberto', 'Reyero Zubiri, Alberto', u'Rodr\xedguez Dur\xe1n, Ana', u'Rubio Ruiz, Juan Ram\xf3n', u'Ruiz Fern\xe1ndez, Esther', u'Sol\xeds P\xe9rez, Susana', 'Trinidad Martos, Juan', 'Veloso Lozano, Enrique', u'Zafra Hern\xe1ndez, C\xe9sar']

You can add the exact same logic to your spider; I just used requests to show you a working example. You should also be aware that not every ASP.NET site behaves the same: you may have to re-validate for every post, as in this related answer.
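To give an idea of what re-validating for every post could mean in practice, here is a minimal sketch using only the standard library. The names HiddenFieldParser, hidden_fields and next_payload are hypothetical, and the sample HTML is invented. It assumes the hidden inputs carry the same name as the field to be posted, which is the usual ASP.NET layout but should be checked on the site you are scraping. The idea is to rebuild the payload from the response you just received instead of reusing the fields from the first page.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def hidden_fields(page_html):
    """Return all hidden form fields found in the given HTML text."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    return parser.fields


def next_payload(page_html, event_target):
    """Build a fresh POST payload from the page that was just received."""
    payload = hidden_fields(page_html)
    payload["__EVENTTARGET"] = event_target
    return payload


# Invented sample page, only to show the shape of the result.
sample = """
<form>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="q" value="ignored" />
</form>
"""
print(next_payload(sample, "ctl00$example$lnk"))
```

With requests you would call next_payload(r.text, target) on each response before the next s.post. In a spider the same step would go at the top of each callback, before yielding the next FormRequest.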
