Pass Selenium HTML string to Scrapy to add URLs to Scrapy's list of URLs to scrape
Question
I'm very new to Python, Scrapy and Selenium, so any help you can provide would be much appreciated.
I'd like to be able to take the HTML I've obtained from Selenium as the page source and process it into a Scrapy Response object. The main reason is to be able to add the URLs found in the Selenium webdriver's page source to the list of URLs Scrapy will parse.
Thanks again for your help.
As a quick second question, does anyone know how to view the list of URLs that Scrapy has found and scraped?
Thanks!
*******EDIT******* Here is an example of what I am trying to do. I can't figure out part 5.
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from selenium import webdriver

class AB_Spider(CrawlSpider):
    name = "ab_spider"
    allowed_domains = ["abcdef.com"]
    #start_urls = ["https://www.kickstarter.com/projects/597507018/pebble-e-paper-watch-for-iphone-and-android"
    #, "https://www.kickstarter.com/projects/801465716/03-leagues-under-the-sea-the-seaquestor-flyer-subm"]
    start_urls = ["https://www.abcdef.com/page/12345"]

    def parse_abcs(self, response):
        sel = Selector(response)
        URL = response.url

        # Part 1: check if a certain element is on the webpage
        last_chk = sel.xpath('//ul/li[@last_page="true"]')
        a_len = len(last_chk)

        # Part 2: if not, then get the page via the Selenium webdriver
        if a_len == 0:
            # OPEN WEBDRIVER AND GET PAGE
            driver = webdriver.Firefox()
            driver.get(response.url)

            # Part 3: run script to interact with page until the element appears
            while a_len == 0:
                print "ELEMENT NOT FOUND, USING SELENIUM TO GET THE WHOLE PAGE"
                # scroll down one time
                driver.execute_script("window.scrollTo(0, 1000000000);")
                # get page source and check if the last page is there
                selen_html = driver.page_source
                hxs = Selector(text=selen_html)
                last_chk = hxs.xpath('//ul/li[@last_page="true"]')
                a_len = len(last_chk)
            driver.close()

            # Part 4: extract the URLs from the Selenium webdriver page source
            all_URLS = hxs.xpath('a/@href').extract()

            # Part 5: add all_URLS to the Scrapy URLs to be scraped
Answer
Just yield `Request` instances from the method and provide a callback:
class AB_Spider(CrawlSpider):
    ...

    def parse_abcs(self, response):
        ...
        all_URLS = hxs.xpath('a/@href').extract()
        for url in all_URLS:
            yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # Do the parsing here