Read .csv file from URL into Python 3.x - _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)


Problem description

I've been struggling with this simple problem for too long, so I thought I'd ask for help. I am trying to read a list of journal articles from the National Library of Medicine FTP site into Python 3.3.2 (on Windows 7). The journal articles are in a .csv file.

I tried the following code:

import csv
import urllib.request

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream)
data = [row for row in csvfile]

It results in the following error:

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    data = [row for row in csvfile]
  File "<pyshell#4>", line 1, in <listcomp>
    data = [row for row in csvfile]
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

I presume I should be working with strings, not bytes? Any help with this simple problem, and an explanation of what is going wrong, would be greatly appreciated.

Recommended answer

The problem is that urllib returns bytes, while csv.reader expects an iterator that yields strings. As a check, you can download the csv file with your browser and open it as a regular file in text mode: the problem is gone.
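The same behaviour can be reproduced without any network access. The sketch below is only an illustration: it uses io.BytesIO as a stand-in for the object returned by urlopen, and the sample data is made up. It shows that a binary stream triggers the error, while the same lines decoded to str are accepted.

```python
import csv
import io

# Made-up sample data; io.BytesIO stands in for the urlopen() result.
raw = b"name,year\nPLoS ONE,2013\n"

# Iterating a binary stream yields bytes, which csv.reader rejects.
byte_stream = io.BytesIO(raw)
error_message = None
try:
    rows = list(csv.reader(byte_stream))
except csv.Error as exc:
    error_message = str(exc)

# Decoding each line to str first makes csv.reader accept the input.
byte_stream.seek(0)
text_lines = (line.decode("utf-8") for line in byte_stream)
rows = list(csv.reader(text_lines))

print(error_message)
print(rows)
```

The exact wording of the error message varies between Python versions, but in each case it complains about receiving bytes instead of strings.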

A similar problem was addressed here.

It can be solved by decoding the bytes to strings with the appropriate encoding. For example:

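import csv
import io
import urllib.request

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
# Decode the whole download to str, then wrap it in a text stream so that
# csv.reader iterates over lines rather than over single characters.
text = ftpstream.read().decode('utf-8')  # with the appropriate encoding
csvfile = csv.reader(io.StringIO(text, newline=''))
data = [row for row in csvfile]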

The last line could also be data = list(csvfile), which can be easier to read.

By the way, since the csv file is very big, reading it all at once can be slow and memory-consuming. It may be preferable to use a generator.

Edit:
Using codecs, as proposed by Steven Rumbalski, it is not necessary to read the whole file in order to decode it. Memory consumption is reduced and speed is increased.

import csv
import urllib.request
import codecs

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(codecs.iterdecode(ftpstream, 'utf-8'))
for line in csvfile:
    print(line)  # do something with line

Note that the list is not created either, for the same reason.
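If the rows are needed in more than one place, the streaming approach can be wrapped in a generator. This is a minimal sketch, not part of the original answer: the function name iter_csv_rows is hypothetical, and io.BytesIO with made-up data stands in for the urlopen result so that the example runs offline.

```python
import codecs
import csv
import io


def iter_csv_rows(byte_stream, encoding="utf-8"):
    """Yield csv rows one at a time from an iterable of bytes lines."""
    # iterdecode decodes lazily, so the whole file is never held in memory.
    text_lines = codecs.iterdecode(byte_stream, encoding)
    yield from csv.reader(text_lines)


# Made-up sample data standing in for the downloaded file.
sample = io.BytesIO("journal,year\nActa Médica,2012\n".encode("utf-8"))
first_rows = list(iter_csv_rows(sample))
print(first_rows)
```

With a real download, the object returned by urllib.request.urlopen(url) would be passed in place of sample, and the rows consumed in a for loop rather than collected with list().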
