Pass Selenium HTML string to Scrapy to add URLs to Scrapy list of URLs to scrape


Question

I'm very new to Python, Scrapy and Selenium. Thus, any help you could provide would be most appreciated.

I'd like to be able to take HTML I've obtained from Selenium as the page source and processes it into a Scrapy Response object. The main reason is to be able to add the URLs in the Selenium Webdriver page source to the list of URLs Scrapy will parse.
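One way to do this (a minimal sketch of a common approach, assuming Scrapy's HtmlResponse class; this is not from the original post) is to wrap the Selenium page source in an HtmlResponse, which then supports the usual Scrapy selector API:

from selenium import webdriver
from scrapy.http import HtmlResponse

driver = webdriver.Firefox()
driver.get("https://www.abcdef.com/page/12345")  # placeholder URL from the question

# Wrap the rendered HTML so it behaves like a normal Scrapy response
scrapy_response = HtmlResponse(
    url=driver.current_url,
    body=driver.page_source,
    encoding="utf-8",
)

links = scrapy_response.xpath("//a/@href").extract()
driver.quit()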

Thanks again for your help.

As a quick second question, does anyone know how to view the list of URLs that are in or were in the list of URLs Scrapy found and scraped?
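Two options for this (both sketches, not from the original post): Scrapy already logs every fetched URL at DEBUG level as Crawled (200) <GET ...> lines, so running scrapy crawl ab_spider -s LOG_FILE=crawl.log captures the full list in a file. Alternatively, the spider can record the URLs itself; here seen_urls and the closed() logging are illustrative additions:

from scrapy.spiders import CrawlSpider


class AB_Spider(CrawlSpider):
    name = "ab_spider"
    seen_urls = []  # illustrative: collects every URL this spider parses

    def parse_abcs(self, response):
        self.seen_urls.append(response.url)
        ...

    def closed(self, reason):
        # Scrapy calls closed() automatically when the spider finishes
        for url in self.seen_urls:
            self.logger.info("Scraped: %s", url)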

Thanks!

*******EDIT******* Here is an example of what I am trying to do. I can't figure out Part 5.

from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from selenium import webdriver


class AB_Spider(CrawlSpider):
    name = "ab_spider"
    allowed_domains = ["abcdef.com"]
    #start_urls = ["https://www.kickstarter.com/projects/597507018/pebble-e-paper-watch-for-iphone-and-android"
    #, "https://www.kickstarter.com/projects/801465716/03-leagues-under-the-sea-the-seaquestor-flyer-subm"]
    start_urls = ["https://www.abcdef.com/page/12345"]

    def parse_abcs(self, response):
        sel = Selector(response)
        hxs = sel  # fallback selector in case Selenium is never needed

        # Part 1: check if a certain element is on the webpage
        last_chk = sel.xpath('//ul/li[@last_page="true"]')
        a_len = len(last_chk)

        # Part 2: if not, then get the page via the Selenium webdriver
        if a_len == 0:
            driver = webdriver.Firefox()
            driver.get(response.url)

            # Part 3: interact with the page until the element appears
            # (nested inside the if-block so driver is always defined here)
            while a_len == 0:
                print("ELEMENT NOT FOUND, USING SELENIUM TO GET THE WHOLE PAGE")

                # scroll down one time
                driver.execute_script("window.scrollTo(0, 1000000000);")

                # get the page source and check if the last-page marker is there
                selen_html = driver.page_source
                hxs = Selector(text=selen_html)
                last_chk = hxs.xpath('//ul/li[@last_page="true"]')
                a_len = len(last_chk)

            driver.close()

        # Part 4: extract the URLs from the Selenium page source
        # ('//a/@href' rather than 'a/@href', which matches nothing from the root)
        all_URLS = hxs.xpath('//a/@href').extract()

        # Part 5: add all_URLS to the Scrapy URLs to be scraped

Answer

Just yield Request instances from the method and provide a callback:

from scrapy import Request


class AB_Spider(CrawlSpider):
    ...

    def parse_abcs(self, response):
        ...

        all_URLS = hxs.xpath('//a/@href').extract()

        for url in all_URLS:
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # Do the parsing here
        pass
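One caveat worth adding (an assumption on my part, not part of the original answer): href attributes extracted this way are often relative, and Request() needs absolute URLs. Scrapy's response.urljoin() resolves them; a minimal sketch of the loop above with that change:

        for href in all_URLS:
            # urljoin() resolves a relative href such as "/page/2"
            # against response.url before the Request is scheduled
            yield Request(response.urljoin(href), callback=self.parse_page)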
