Scrapy - Importing Excel .csv as start_url


Question


So I'm building a scraper that imports a .csv Excel file containing a single row of ~2,400 websites (each website in its own column) and uses them as the start_urls. I keep getting an error saying that I am passing in a list and not a string. I think this is because my list basically just contains one really long list representing that row. How can I overcome this and put each website from my .csv into the list as its own separate string?

raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
    exceptions.TypeError: Request url must be str or unicode, got list:
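The error comes from how csv.reader works: it yields each row as a list of cell strings, so a single-row file of URLs produces one nested list rather than a flat list of URL strings. A minimal stdlib sketch (the sample URLs here are made up for illustration):

```python
import csv
import io

# A one-row CSV of URLs, like the asker's websites.csv (sample data)
sample = io.StringIO("http://example.com,http://example.org")

# csv.reader yields each row as a list of cells, so we get ONE nested list
rows = [row for row in csv.reader(sample)]
print(rows)   # [['http://example.com', 'http://example.org']]

# Flatten so each URL becomes its own string in the list
urls = [cell for row in rows for cell in row]
print(urls)   # ['http://example.com', 'http://example.org']
```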


import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse
from tutorial.items import DanishItem
from scrapy.http import Request
import csv

with open('websites.csv', 'rbU') as csv_file:
  data = csv.reader(csv_file)
  scrapurls = []
  for row in data:
    scrapurls.append(row)

class DanishSpider(scrapy.Spider):
  name = "dmoz"
  allowed_domains = []
  start_urls = scrapurls

  def parse(self, response):
    for sel in response.xpath('//link[@rel="icon" or @rel="shortcut icon"]'):
      item = DanishItem()
      item['website'] = response
      item['favicon'] = sel.xpath('./@href').extract()
      yield item

Thanks!

Joey

Answer


Just generating a list for start_urls does not work, as is clearly stated in the Scrapy documentation.

From the documentation:



You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.


The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.

I would rather do it this way:

def get_urls_from_csv():
    # text mode; the original 'rbU' mode is Python 2-only
    with open('websites.csv', newline='') as csv_file:
        data = csv.reader(csv_file)
        scrapurls = []
        for row in data:
            # each row is itself a list of cells, so add every cell (URL)
            # individually instead of appending the whole row as one item
            scrapurls.extend(row)
        return scrapurls


class DanishSpider(scrapy.Spider):

    ...

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]
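To check the flattening behaviour without running Scrapy at all, the helper can be exercised against a throwaway CSV. This standalone variant with a path parameter is an illustration, not part of the original answer:

```python
import csv
import os
import tempfile

def get_urls_from_csv(path):
    # standalone variant of the answer's helper, parameterised by path
    with open(path, newline='') as csv_file:
        # flatten: each cell in each row becomes its own URL string
        return [cell for row in csv.reader(csv_file) for cell in row]

# Write a one-row CSV like the asker's, then read it back
with tempfile.NamedTemporaryFile('w', suffix='.csv',
                                 delete=False, newline='') as tmp:
    tmp.write("http://example.com,http://example.org\n")
    path = tmp.name

urls = get_urls_from_csv(path)
os.remove(path)
print(urls)  # each website is now its own separate string
```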

