Trouble running a parser created using scrapy with selenium

Problem Description

I've written a scraper in Python scrapy in combination with selenium to scrape some titles from a website. The CSS selectors defined within my scraper are flawless. I wish my scraper to keep clicking on the next page and parse the information embedded in each page. It does fine on the first page, but when it comes time for the selenium part to play its role, the scraper keeps clicking on the same link over and over again.

As this is my first time working with selenium along with scrapy, I don't have any idea how to move forward successfully. Any fix will be highly appreciated.

If I try like this, then it works smoothly (there is nothing wrong with the selectors):

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def parse(self,response):
        self.driver.get(response.url)

        while True:
            for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR,"h1.faqsno-heading"))):
                name = elem.find_element_by_css_selector("div[id^='arrowex']").text
                print(name)

            try:
                # Click "Next", then wait for the old element to go stale,
                # which signals that the next page has finished loading.
                self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
                self.wait.until(EC.staleness_of(elem))
            except TimeoutException:
                break

But my intention is to make my script run this way:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def click_nextpage(self,link):
        self.driver.get(link)
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

        # It keeps clicking on the same link over and over again

        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()  
        self.wait.until(EC.staleness_of(elem))


    def parse(self,response):
        while True:
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}

            try:
                self.click_nextpage(response.url)  # initiate the method to do the clicking
            except TimeoutException:
                break

These are the titles visible on that landing page (to let you know what I'm after):

INDIA INCLUSION FOUNDATION
INDIAN WILDLIFE CONSERVATION TRUST
VATSALYA URBAN AND RURAL DEVELOPMENT TRUST

I'm not after getting the data from that site by some other route, so any alternative approach other than what I've tried above is of no use to me. My only intention is to find a solution related to the way I tried in my second approach.

Recommended Answer

Your initial code was almost correct, with one key piece missing from it: you were always using the same response object, whereas the response object needs to be built from the latest page source.

Also, you were navigating to the link again and again inside click_nextpage, which reset the listing to page 1 every time. That is why you got pages 1 and 2 at most. You need to fetch the URL only once, in the parse stage, and then let the next-page clicks happen.
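Concretely, the refresh step is a single line executed right after each click (taken from the final code below), so that the CSS selectors in parse() run against the freshly loaded page:

# Rebuild the Scrapy response from Selenium's current page source
response = response.replace(body=self.driver.page_source)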

Below is the final code, working fine:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def click_nextpage(self, link):
        # self.driver.get(link)  <-- removed: re-fetching the URL here was
        # what reset the listing back to page 1 on every call
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))

        # Click "Next", then wait for the old element to go stale, i.e. for
        # the next page to finish loading.
        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
        self.wait.until(EC.staleness_of(elem))


    def parse(self, response):
        self.driver.get(response.url)

        while True:
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}

            try:
                self.click_nextpage(response.url)  # initiate the method to do the clicking
                # Rebuild the response from the browser's latest page source
                response = response.replace(body=self.driver.page_source)
            except TimeoutException:
                break

After that modification it works perfectly.
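One small addition worth making (it is not part of the original answer) is shutting the browser down once the crawl ends. A minimal sketch using Scrapy's closed() hook, which is called when the spider closes:

def closed(self, reason):
    # Quit the Selenium-driven Chrome instance so no orphaned
    # browser processes are left behind after the crawl.
    self.driver.quit()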
