Crawl website from list of values using scrapy


Problem description

I have a list of NPIs for which I want to scrape the provider names from npidb.org. The NPI values are stored in a csv file.

I am able to do it manually by pasting the URLs into the code. However, I cannot figure out how to do it when I have a list of NPIs, each of which I want the provider name for.

Here is my current code:

import scrapy


# scrapy.spider.BaseSpider was removed in modern Scrapy;
# scrapy.Spider is the current base class.
class MySpider(scrapy.Spider):
    name = "npidb"

    def start_requests(self):
        urls = [

            'https://npidb.org/npi-lookup/?npi=1366425381',
            'https://npidb.org/npi-lookup/?npi=1902873227',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The last "/" segment of these URLs is "?npi=<number>", so take
        # the value after "=" to get a clean filename.
        npi = response.url.split("=")[-1]
        filename = 'npidb-%s.html' % npi
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Answer

Well, it depends on the structure of your csv file, but if it contains the NPIs on separate lines, you could do something like:

def start_requests(self):
    with open('npis.csv') as f:
        for line in f:
            yield scrapy.Request(
                url='https://npidb.org/npi-lookup/?npi={}'.format(line.strip()), 
                callback=self.parse
            )
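The line-by-line approach above assumes one bare NPI per line. If the csv file has a header row or multiple columns, Python's csv module is safer than splitting raw lines. A minimal sketch of building the request URLs that way, assuming a column named "npi" (adjust the field name to match your file):

```python
import csv
import io


def npi_urls(csv_text, field="npi"):
    """Build npidb lookup URLs from CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        "https://npidb.org/npi-lookup/?npi=" + row[field].strip()
        for row in reader
        if row[field].strip()  # skip rows with an empty NPI value
    ]


sample = "npi\n1366425381\n1902873227\n"
print(npi_urls(sample))
```

In the spider, you would iterate over these URLs in start_requests and yield a scrapy.Request for each one, exactly as in the answer's snippet.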
