Scrapy - Scraping links found while scraping
Question
I can only presume this is one of the most basic things to do in Scrapy but I just cannot work out how to do it. Basically, I scrape one page to get a list of urls that contain updates for the week. I then need to go into these urls one by one and scrape the information from them. I currently have both scrapers set up and they work perfectly manually. So I first scrape the urls from the first scraper then hard code them as the start_urls[] on the second scraper.
What is the best way to do it? Is it as simple as calling another function in the scraper file that takes a list of urls and does the scraping there?
This is the scraper that gets the list of urls:
import scrapy
from bs4 import BeautifulSoup


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [ .....
    ]

    def parse(self, response):
        rows = response.css('table.apas_tbl tr').extract()
        urls = []
        for row in rows[1:]:
            # Re-parse each row with BeautifulSoup to pull out the
            # <input> values that make up the target url.
            soup = BeautifulSoup(row, 'lxml')
            dates = soup.find_all('input')
            urls.append("http://myurl{}.com/{}".format(dates[0]['value'], dates[1]['value']))
This is the scraper that then goes through the urls one by one:
import scrapy
from bs4 import BeautifulSoup


class Planning(scrapy.Spider):
    name = "planning"
    start_urls = [
        ...
    ]

    def parse(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        pages = soup.find(id='apas_form_text')
        for link in pages.find_all('a'):
            url = 'myurl.com/{}'.format(link['href'])
        resultTable = soup.find("table", { "class" : "apas_tbl" })
I then save resultTable into a file. At the moment, I take the output of the urls list and copy it into the other scraper.
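For reference, the "save resultTable into a file" step can be a plain file write of the tag's markup; a minimal sketch (the save_table name and the filename below are made up for illustration):

```python
def save_table(result_table, path):
    # result_table is what soup.find("table", ...) returned;
    # str() yields its HTML markup (it also works if a string is passed).
    with open(path, "w", encoding="utf-8") as f:
        f.write(str(result_table))

# Illustrative usage:
# save_table(resultTable, "apas_table.html")
```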
Answer
For every link that you find in parse, you can request it and parse the content in another callback:
import scrapy
from bs4 import BeautifulSoup


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [ .....
    ]

    def parse(self, response):
        rows = response.css('table.apas_tbl tr').extract()
        for row in rows[1:]:
            soup = BeautifulSoup(row, 'lxml')
            dates = soup.find_all('input')
            url = "http://myurl{}.com/{}".format(dates[0]['value'], dates[1]['value'])
            # Yield a Request for each url found; Scrapy fetches it and
            # passes the response to parse_page_contents.
            yield scrapy.Request(url, callback=self.parse_page_contents)

    def parse_page_contents(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        pages = soup.find(id='apas_form_text')
        for link in pages.find_all('a'):
            url = 'myurl.com/{}'.format(link['href'])
        resultTable = soup.find("table", { "class" : "apas_tbl" })
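One caveat with the snippet above: 'myurl.com/{}'.format(link['href']) builds a URL with no scheme, which scrapy.Request rejects with a "Missing scheme" error. Inside a callback you can use response.urljoin(link['href']) instead; its behaviour is that of urllib.parse.urljoin, sketched here with a hypothetical helper (absolute_url and the example URLs are illustrative):

```python
from urllib.parse import urljoin

def absolute_url(page_url, href):
    """Resolve a possibly-relative href against the page it came from.

    Mirrors what response.urljoin(href) does inside a Scrapy callback.
    """
    return urljoin(page_url, href)

# A relative link is resolved against the page URL:
print(absolute_url("http://myurl.com/planning/list", "detail?id=42"))
# -> http://myurl.com/planning/detail?id=42

# An absolute href passes through unchanged:
print(absolute_url("http://myurl.com/planning/list", "http://other.com/x"))
# -> http://other.com/x
```

In Scrapy 1.4 and later, response.follow(link['href'], callback=...) performs the join and the request in a single step.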