Scrapy - Importing Excel .csv as start_url
Question
So I'm building a scraper that imports a .csv Excel file containing a single row of ~2,400 websites (each website in its own column) and uses them as the start_urls. I keep getting an error saying that I am passing in a list rather than a string. I think this is because my list contains just one very long inner list representing that row. How can I fix this and put each website from the .csv into the list as its own separate string?
raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
exceptions.TypeError: Request url must be str or unicode, got list:
import scrapy
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse
from tutorial.items import DanishItem
from scrapy.http import Request
import csv

with open('websites.csv', 'rbU') as csv_file:
    data = csv.reader(csv_file)
    scrapurls = []
    for row in data:
        scrapurls.append(row)

class DanishSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = []
    start_urls = scrapurls

    def parse(self, response):
        for sel in response.xpath('//link[@rel="icon" or @rel="shortcut icon"]'):
            item = DanishItem()
            item['website'] = response
            item['favicon'] = sel.xpath('./@href').extract()
            yield item
Thanks!
Joey
Answer
Just generating a list for start_urls does not work, as is clearly written in the Scrapy documentation.
From the documentation:
You start by generating the initial Requests to crawl the first URLs, and specify a callback function to be called with the response downloaded from those requests.
The first requests to perform are obtained by calling the start_requests() method which (by default) generates Request for the URLs specified in the start_urls and the parse method as callback function for the Requests.
I would rather do it this way:
def get_urls_from_csv():
    with open('websites.csv', 'rbU') as csv_file:
        data = csv.reader(csv_file)
        scrapurls = []
        for row in data:
            scrapurls.extend(row)  # extend, not append: each cell becomes its own string
        return scrapurls

class DanishSpider(scrapy.Spider):
    ...

    def start_requests(self):
        return [scrapy.http.Request(url=start_url) for start_url in get_urls_from_csv()]
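Note that 'rbU' is a Python 2 file mode; in Python 3 the csv module wants a text-mode file opened with newline=''. A minimal Python 3 sketch of the same helper (the file name websites.csv is taken from the question; the file is assumed to hold one row with one website per column) that flattens the row so each cell comes back as its own URL string:

```python
import csv

def get_urls_from_csv(path='websites.csv'):
    """Yield every non-empty cell of the CSV as its own URL string."""
    with open(path, newline='') as csv_file:  # Python 3: text mode, newline=''
        for row in csv.reader(csv_file):
            for url in row:       # flatten the row: one string per cell
                url = url.strip()
                if url:           # skip empty trailing cells
                    yield url
```

Because it yields plain strings, start_requests() can simply do `yield scrapy.Request(url=u)` for each value it produces.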