Using numpy.genfromtxt to read a csv file with strings containing commas
Question
I am trying to read a csv file with numpy.genfromtxt, but some of the fields are strings that contain commas. The strings are in quotes, but numpy does not recognize the quotes as delimiting a single string. For example, with the data in 't.csv':
2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0
The code
np.genfromtxt('t.csv', delimiter=',')
produces the error:
ValueError: Some errors were detected ! Line #2 (got 4 columns instead of 3)
The data structure I am looking for is:
array([['2012', 'Louisville KY', '3.5'],
['2011', 'Lexington, KY', '4.0']],
dtype='|S13')
Looking over the documentation, I don't see any options to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?
Recommended answer
You can use pandas (which is becoming the default library for working with dataframes (heterogeneous data) in scientific Python) for this. Its read_csv function can handle quoted fields. From the docs:
quotechar : string
The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
The default value is ". An example:
In [1]: import pandas as pd
In [2]: from io import StringIO  # Python 3; on Python 2 this was "from StringIO import StringIO"
In [3]: s="""year, city, value
...: 2012, "Louisville KY", 3.5
...: 2011, "Lexington, KY", 4.0"""
In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
   year           city  value
0  2012  Louisville KY    3.5
1  2011  Lexington, KY    4.0
The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma delimiter. Without it, each quoted field starts with a space rather than a quote character, so the quotes are not treated as quoting.
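To get from there to the all-string array the question asks for, something like the following should work. It is only a sketch: it inlines the two rows of 't.csv' so it runs on its own, passes header=None because that file has no header row, and uses dtype=str so every field stays text. The variable names are made up for illustration.

```python
from io import StringIO

import pandas as pd

# The two rows of t.csv from the question, inlined (assumption: no header row).
data = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

# quotechar='"' is the default; it is spelled out here only for clarity.
# skipinitialspace=True drops the space after each comma so the quotes are seen.
df = pd.read_csv(
    StringIO(data),
    header=None,
    quotechar='"',
    skipinitialspace=True,
    dtype=str,
)

# Convert the DataFrame to a 2-D NumPy array of strings.
arr = df.to_numpy()
print(arr)
```

Dropping dtype=str lets pandas infer a numeric type for the first and third columns instead, which is usually what you want for further analysis.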
Apart from its powerful csv reader, I can also strongly advise using pandas for the heterogeneous data you have (the example output in numpy you give is all strings, although you could use structured arrays).
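If you would rather stay with numpy, the csv-module route mentioned in the question also works. Here is a rough sketch, again with the rows of 't.csv' inlined. The field names and widths in the structured dtype ("year", "city", "value", U13) are my own choices for this example, not something numpy infers.

```python
import csv
from io import StringIO

import numpy as np

# The two rows of t.csv from the question, inlined so the sketch is self-contained.
data = '2012, "Louisville KY", 3.5\n2011, "Lexington, KY", 4.0\n'

# The csv module honours quotes; skipinitialspace=True drops the space after each comma.
rows = list(csv.reader(StringIO(data), quotechar='"', skipinitialspace=True))

# All-string 2-D array, the shape of result the question asks for.
strings = np.array(rows)

# Structured array: one typed field per column (names and widths are illustrative).
dtype = [("year", "i4"), ("city", "U13"), ("value", "f8")]
records = np.array([(int(y), c, float(v)) for y, c, v in rows], dtype=dtype)

print(strings)
print(records["city"])
```

The structured array keeps the numeric columns numeric, at the cost of fixing the string width up front.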