Trouble running a parser created using scrapy with selenium
Question
I've written a scraper in Python scrapy in combination with selenium to scrape some titles from a website. The css selectors defined within my scraper are flawless. I want my scraper to keep clicking through to the next page and parse the information embedded in each page. It does fine for the first page, but when the selenium part comes into play, the scraper keeps clicking on the same link over and over again.
As this is my first time working with selenium along with scrapy, I have no idea how to move forward successfully. Any fix will be highly appreciated.
If I try it like this, it runs smoothly (there is nothing wrong with the selectors):
class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            for elem in self.wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "h1.faqsno-heading"))):
                name = elem.find_element_by_css_selector("div[id^='arrowex']").text
                print(name)
            try:
                self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
                self.wait.until(EC.staleness_of(elem))
            except TimeoutException:
                break
But my intention is to make my script run this way:
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def click_nextpage(self, link):
        self.driver.get(link)
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))
        # It keeps clicking on the same link over and over again
        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
        self.wait.until(EC.staleness_of(elem))

    def parse(self, response):
        while True:
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}
            try:
                self.click_nextpage(response.url)  # initiate the method to do the clicking
            except TimeoutException:
                break
These are the titles visible on that landing page (to let you know what I'm after):
INDIA INCLUSION FOUNDATION
INDIAN WILDLIFE CONSERVATION TRUST
VATSALYA URBAN AND RURAL DEVELOPMENT TRUST
I'm not really after the data from that site, so any approach other than the one I tried above is of no use to me. My only intention is to find a solution tied to the way I attempted it in my second approach.
Answer
Your initial code was almost correct, with one key piece missing: you were always using the same response object. The response object needs to be rebuilt from the latest page source.
Also, you were navigating to the link again and again inside click_nextpage, which reset the site to page 1 every time. That is why you only got pages 1 and 2 (at most). You need to load the URL just once, in the parse stage, and then let the next-page clicks happen.
Below is the final code, working fine:
class IncomeTaxSpider(scrapy.Spider):
    name = "taxspider"

    start_urls = [
        'https://www.incometaxindia.gov.in/Pages/utilities/exempted-institutions.aspx',
    ]

    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def click_nextpage(self, link):
        # self.driver.get(link)  # removed: reloading the URL reset the site to page 1
        elem = self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div[id^='arrowex']")))
        self.wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id$='_imgbtnNext']"))).click()
        self.wait.until(EC.staleness_of(elem))

    def parse(self, response):
        self.driver.get(response.url)
        while True:
            for item in response.css("h1.faqsno-heading"):
                name = item.css("div[id^='arrowex']::text").extract_first()
                yield {"Name": name}
            try:
                self.click_nextpage(response.url)  # initiate the method to do the clicking
                response = response.replace(body=self.driver.page_source)
            except TimeoutException:
                break
After this modification, it works perfectly.