Pass Selenium HTML string to Scrapy to add URLs to Scrapy list of URLs to scrape


Question

I'm very new to Python, Scrapy and Selenium. Thus, any help you could provide would be most appreciated.

I'd like to be able to take HTML I've obtained from Selenium as the page source and processes it into a Scrapy Response object. The main reason is to be able to add the URLs in the Selenium Webdriver page source to the list of URLs Scrapy will parse.
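One way to do this (a minimal sketch of a common approach, assuming Scrapy's HtmlResponse class; this is not from the original post) is to wrap the Selenium page source in an HtmlResponse, which then supports the usual Scrapy selector API:

from selenium import webdriver
from scrapy.http import HtmlResponse

driver = webdriver.Firefox()
driver.get("https://www.abcdef.com/page/12345")  # placeholder URL from the question

# Wrap the rendered HTML so it behaves like a normal Scrapy response
scrapy_response = HtmlResponse(
    url=driver.current_url,
    body=driver.page_source,
    encoding="utf-8",
)

links = scrapy_response.xpath("//a/@href").extract()
driver.quit()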

Thanks again for your help.

As a quick second question, does anyone know how to view the list of URLs that are in or were in the list of URLs Scrapy found and scraped?
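Two options for this (both sketches, not from the original post): Scrapy already logs every fetched URL at DEBUG level as Crawled (200) <GET ...> lines, so running scrapy crawl ab_spider -s LOG_FILE=crawl.log captures the full list in a file. Alternatively, the spider can record the URLs itself; here seen_urls and the closed() logging are illustrative additions:

from scrapy.spiders import CrawlSpider


class AB_Spider(CrawlSpider):
    name = "ab_spider"
    seen_urls = []  # illustrative: collects every URL this spider parses

    def parse_abcs(self, response):
        self.seen_urls.append(response.url)
        ...

    def closed(self, reason):
        # Scrapy calls closed() automatically when the spider finishes
        for url in self.seen_urls:
            self.logger.info("Scraped: %s", url)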

Thanks!

*******EDIT******* Here is an example of what I am trying to do. I can't figure out Part 5.

from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from selenium import webdriver


class AB_Spider(CrawlSpider):
    name = "ab_spider"
    allowed_domains = ["abcdef.com"]
    #start_urls = ["https://www.kickstarter.com/projects/597507018/pebble-e-paper-watch-for-iphone-and-android"
    #, "https://www.kickstarter.com/projects/801465716/03-leagues-under-the-sea-the-seaquestor-flyer-subm"]
    start_urls = ["https://www.abcdef.com/page/12345"]

    def parse_abcs(self, response):
        sel = Selector(response)
        hxs = sel  # fallback selector in case Selenium is never needed

        # Part 1: check if a certain element is on the webpage
        last_chk = sel.xpath('//ul/li[@last_page="true"]')
        a_len = len(last_chk)

        # Part 2: if not, then get the page via the Selenium webdriver
        if a_len == 0:
            driver = webdriver.Firefox()
            driver.get(response.url)

            # Part 3: interact with the page until the element appears
            # (nested inside the if-block so driver is always defined here)
            while a_len == 0:
                print("ELEMENT NOT FOUND, USING SELENIUM TO GET THE WHOLE PAGE")

                # scroll down one time
                driver.execute_script("window.scrollTo(0, 1000000000);")

                # get the page source and check if the last-page marker is there
                selen_html = driver.page_source
                hxs = Selector(text=selen_html)
                last_chk = hxs.xpath('//ul/li[@last_page="true"]')
                a_len = len(last_chk)

            driver.close()

        # Part 4: extract the URLs from the Selenium page source
        # ('//a/@href' rather than 'a/@href', which matches nothing from the root)
        all_URLS = hxs.xpath('//a/@href').extract()

        # Part 5: add all_URLS to the Scrapy URLs to be scraped

Answer

Just yield Request instances from the method and provide a callback:

from scrapy import Request


class AB_Spider(CrawlSpider):
    ...

    def parse_abcs(self, response):
        ...

        all_URLS = hxs.xpath('//a/@href').extract()

        for url in all_URLS:
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # Do the parsing here
        pass
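One caveat worth adding (an assumption on my part, not part of the original answer): href attributes extracted this way are often relative, and Request() needs absolute URLs. Scrapy's response.urljoin() resolves them; a minimal sketch of the loop above with that change:

        for href in all_URLS:
            # urljoin() resolves a relative href such as "/page/2"
            # against response.url before the Request is scheduled
            yield Request(response.urljoin(href), callback=self.parse_page)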
