Using numpy.genfromtxt to read a csv file with strings containing commas

Question

I am trying to read in a csv file with numpy.genfromtxt but some of the fields are strings which contain commas. The strings are in quotes, but numpy is not recognizing the quotes as defining a single string. For example, with the data in 't.csv':

2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0

the code

np.genfromtxt('t.csv', delimiter=',')

produces the error:

ValueError: Some errors were detected ! Line #2 (got 4 columns instead of 3)

The data structure I am looking for is:

array([['2012', 'Louisville KY', '3.5'],
       ['2011', 'Lexington, KY', '4.0']], 
      dtype='|S13')

Looking over the documentation, I don't see any options to deal with this. Is there a way to do it with numpy, or do I just need to read in the data with the csv module and then convert it to a numpy array?

Recommended answer

You can use pandas for this (it is becoming the default library for working with dataframes, i.e. heterogeneous data, in scientific Python). Its read_csv function can handle quoted fields. From the docs:

quotechar : string

The character used to denote the start and end of a quoted item. Quoted items
can include the delimiter and it will be ignored.

The default value is ". An example:

In [1]: import pandas as pd

In [2]: from io import StringIO  # Python 3; on Python 2 this was `from StringIO import StringIO`

In [3]: s="""year, city, value
   ...: 2012, "Louisville KY", 3.5
   ...: 2011, "Lexington, KY", 4.0"""

In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
   year           city  value
0  2012  Louisville KY    3.5
1  2011  Lexington, KY    4.0

The trick here is that you also have to use skipinitialspace=True to deal with the spaces after the comma-delimiter.
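As a script rather than an IPython session, the same call could look like the sketch below. It assumes Python 3 (where StringIO lives in the io module) and a reasonably recent pandas that provides DataFrame.to_numpy. The names s, df and arr are only illustrative. The last step casts every column to str to get an all-string array like the one asked for in the question.

```python
from io import StringIO

import pandas as pd

# Same sample data as above, including the spaces after each comma.
s = '''year, city, value
2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0'''

# skipinitialspace=True drops the space after each delimiter, so the
# opening quote is seen at the start of the field and the comma inside
# "Lexington, KY" is kept as part of the string.
df = pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)

# Cast everything to str to mimic the all-string array in the question.
arr = df.astype(str).to_numpy()
print(arr)
```

Dropping the astype(str) step keeps the year and value columns numeric, which is usually what you want for further work.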

Apart from its powerful csv reader, I can also strongly advise using pandas for the heterogeneous data you have (the example numpy output you give is all strings, although you could use structured arrays).
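If you would rather stay with numpy only, the route mentioned in the question (read with the csv module, then convert) also works. Below is a minimal sketch, assuming Python 3 and the three-column layout of t.csv. The field names year, city and value and the chosen dtypes are assumptions made for illustration, not something the file itself defines.

```python
import csv
from io import StringIO

import numpy as np

# Stand-in for the contents of t.csv; with a real file, pass
# open('t.csv', newline='') to csv.reader instead of StringIO(data).
data = '''2012, "Louisville KY", 3.5
2011, "Lexington, KY", 4.0'''

# csv.reader honours the quotes once the space after each comma is skipped.
reader = csv.reader(StringIO(data), skipinitialspace=True)
rows = [tuple(row) for row in reader]

# Plain 2-D array of strings, as in the question.
strings = np.array(rows)

# Structured array with one type per column (field names are assumed).
dtype = [("year", "i4"), ("city", "U13"), ("value", "f8")]
records = np.array([(int(y), c, float(v)) for y, c, v in rows], dtype=dtype)

print(strings)
print(records["city"])
```

The structured array keeps the numbers as numbers, so records["value"].mean() works directly, whereas the plain string array would need a cast first.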
