Read .csv file from URL into Python 3.x - _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

Question

I've been struggling with this simple problem for too long, so I thought I'd ask for help. I am trying to read a list of journal articles from the National Library of Medicine FTP site into Python 3.3.2 (on Windows 7). The journal articles are in a .csv file.

I tried the following code:

import csv
import urllib.request

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(ftpstream)
data = [row for row in csvfile]

It results in the following error:

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    data = [row for row in csvfile]
  File "<pyshell#4>", line 1, in <listcomp>
    data = [row for row in csvfile]
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

I presume I should be working with strings, not bytes? Any help with this simple problem, and an explanation of what is going wrong, would be greatly appreciated.

Recommended answer

The problem stems from urllib returning bytes, while csv.reader in Python 3 expects an iterator that yields strings. As proof, you can download the csv file with your browser and open it as a regular file in text mode; the problem goes away.
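
The mismatch can be reproduced without any network access. The following is a minimal sketch, assuming a small in-memory sample: `io.BytesIO` stands in for the object returned by `urlopen`, and the sample rows are made up for illustration.

```python
import csv
import io

# Simulated download: BytesIO stands in for the stream returned by urlopen().
# The sample content is hypothetical.
raw = io.BytesIO(b"journal,year\nNature,2013\n")

# Iterating a binary stream yields bytes, which csv.reader rejects.
try:
    rows = list(csv.reader(raw))
except csv.Error as exc:
    error_message = str(exc)

# Decoding each line to str first lets csv.reader parse the same data.
raw.seek(0)
rows = list(csv.reader(line.decode("utf-8") for line in raw))

print(error_message)
print(rows)
```

The exact wording of the error message may differ between Python versions, but in each case it complains about receiving bytes instead of strings.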

A similar problem was addressed here.

It can be solved by decoding the bytes to strings with the appropriate encoding. For example:

import csv
import urllib.request

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
# csv.reader iterates its argument line by line, so decode and then split into lines
csvfile = csv.reader(ftpstream.read().decode('utf-8').splitlines())  # with the appropriate encoding
data = [row for row in csvfile]

The last line could also be written as data = list(csvfile), which is easier to read.
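
As an offline illustration of the decode-everything approach, here is a minimal sketch that assumes a small UTF-8 payload. The `payload` bytes are a made-up stand-in for `ftpstream.read()`, and wrapping the decoded text in `io.StringIO` is one way (not the only one) to give `csv.reader` the line-by-line iterable it expects.

```python
import csv
import io

# Hypothetical sample payload standing in for ftpstream.read().
payload = b'journal,year\n"Cell, Reports",2012\n'

# Decode the whole payload at once, using the encoding the file is assumed to have.
text = payload.decode("utf-8")

# csv.reader consumes an iterable of lines; StringIO provides one.
# newline="" leaves line endings untouched so the csv module can handle them.
data = list(csv.reader(io.StringIO(text, newline="")))

print(data)
```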

By the way, since the csv file is very large, reading and decoding it all at once can be slow and memory-consuming. It may be preferable to use a generator.

Using codecs.iterdecode, as proposed by Steven Rumbalski, the stream is decoded line by line, so it is not necessary to read the whole file before decoding. This reduces memory consumption and improves speed.

import csv
import urllib.request
import codecs

url = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/file_list.csv"
ftpstream = urllib.request.urlopen(url)
csvfile = csv.reader(codecs.iterdecode(ftpstream, 'utf-8'))
for line in csvfile:
    print(line)  # do something with line

Note that, for the same reason, the list is not created either; rows are processed one at a time.
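
The generator idea can also be sketched with `io.TextIOWrapper` from the standard library, which decodes a binary stream incrementally. This is an alternative to the codecs approach, not something from the original answer. The helper name `iter_csv_rows` is hypothetical, the example uses an in-memory `io.BytesIO` sample instead of the FTP stream, and whether a given `urlopen` response can be wrapped this way is an assumption that would need checking.

```python
import csv
import io


def iter_csv_rows(binary_stream, encoding="utf-8"):
    """Yield parsed rows one at a time from a binary file-like object.

    Hypothetical helper for illustration only.
    """
    # TextIOWrapper decodes lazily, so the whole file is never held in memory.
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    for row in csv.reader(text_stream):
        yield row


# In-memory stand-in for a downloaded file (made-up sample data).
sample = io.BytesIO(b"journal,year\nNature,2013\nScience,2012\n")

rows = iter_csv_rows(sample)
first = next(rows)   # only the first row has been parsed at this point
rest = list(rows)    # consume the remaining rows

print(first)
print(rest)
```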
