Crawl website from list of values using scrapy
Question
I have a list of NPIs, stored in a CSV file, and I want to scrape the corresponding provider names from npidb.org.
I can do this manually by pasting the URLs into the code, but I can't figure out how to do it for a whole list of NPIs, each of which needs its provider name looked up.
Here is my current code:
import scrapy


# Spider that downloads each lookup page and saves it as an HTML file.
class MySpider(scrapy.Spider):
    name = "npidb"

    def start_requests(self):
        urls = [
            'https://npidb.org/npi-lookup/?npi=1366425381',
            'https://npidb.org/npi-lookup/?npi=1902873227',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The last URL segment is the query string, e.g. "?npi=1366425381".
        page = response.url.split("/")[-1]
        filename = 'npidb-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Recommended answer
It depends on the structure of your CSV file, but if it contains one NPI per line, you could do something like this:
def start_requests(self):
    # Read one NPI per line and request its lookup page.
    with open('npis.csv') as f:
        for line in f:
            yield scrapy.Request(
                url='https://npidb.org/npi-lookup/?npi={}'.format(line.strip()),
                callback=self.parse
            )
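If the file is not strictly one bare NPI per line (say it has a header row, blank lines, or extra columns), a slightly more defensive variant might look like the sketch below. It is only an illustration under assumptions: the helper names `read_npis` and `npi_url`, the `column` parameter, and the rule that an NPI is any all-digit cell are mine, not part of the question or of Scrapy. The helpers use only the standard library, so they can be checked without running a crawl.

```python
import csv
from urllib.parse import urlencode

# Lookup endpoint taken from the URLs in the question.
BASE_URL = "https://npidb.org/npi-lookup/"


def read_npis(path, column=0):
    """Yield NPI strings from one column of a CSV file.

    Hypothetical helper: rows that are empty, too short, or whose cell
    is not purely digits (for example a header such as "npi") are skipped.
    """
    with open(path, newline="") as f:
        for row in csv.reader(f):
            if len(row) <= column:
                continue
            value = row[column].strip()
            if value.isdigit():
                yield value


def npi_url(npi):
    """Build the lookup URL for a single NPI (hypothetical helper)."""
    return BASE_URL + "?" + urlencode({"npi": npi})


# Possible use inside the spider from the question:
#
#     def start_requests(self):
#         for npi in read_npis("npis.csv"):
#             yield scrapy.Request(url=npi_url(npi), callback=self.parse)
```

Keeping the file parsing separate from `start_requests` means the CSV handling can be adjusted (a different column, a different delimiter) without touching the request logic.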