Definitive way to parse alphanumeric CSVs in Python with scipy/numpy

Question

I've been trying to find a good and flexible way to parse CSV files in Python but none of the standard options seem to fit the bill. I am tempted to write my own but I think that some combination of what exists in numpy/scipy and the csv module can do what I need, and so I don't want to reinvent the wheel.

I'd like the standard features of being able to specify delimiters, specify whether or not there's a header, how many rows to skip, comments delimiter, which columns to ignore, etc. The central feature I am missing is being able to parse CSV files in a way that gracefully handles both string data and numeric data. Many of my CSV files have columns that contain strings (not of the same length necessarily) and numeric data. I'd like to be able to have numpy array functionality for this numeric data, but also be able to access the strings. For example, suppose my file looks like this (imagine columns are tab-separated):

# my file
name  favorite_integer  favorite_float1  favorite_float2  short_description
johnny  5  60.2  0.52  johnny likes fruitflies
bob  1  17.52  0.001  bob, bobby, robert

data = loadcsv('myfile.csv', delimiter='\t', parse_header=True, comment='#')

I'd like to be able to access data in two ways:


  1. As a matrix of values: it's important for me to get a numpy.array so that I can easily transpose and access the columns that are numeric. In this case, I want to be able to do something like:

floats_and_ints = data.matrix
floats_and_ints[:, 0]  # access the integers
floats_and_ints[:, 1:3]  # access some of the floats
transpose(floats_and_ints)  # etc.

  2. As a dictionary-like object where I don't have to know the order of the headers: I'd also like to access the data by header name. For example, I'd like to do:

data['favorite_float1']  # get all the values of the column with header "favorite_float1"
data['name']  # get all the names of the rows

I don't want to have to know that favorite_float1 is the second column in the table, since this might change.

It's also important for me to be able to iterate through the rows and access the fields by name. For example:

for row in data:
  # print the name and favorite integer of every row
  print("Name:", row["name"], row["favorite_integer"])



The representation in (1) suggests a numpy.array, but as far as I can tell, this does not deal well with strings and requires me to specify the data type ahead of time as well as the header labels.
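
For comparison, here is a minimal sketch of how numpy's `genfromtxt` can infer per-column types and read field names from the header, producing a structured array rather than a plain 2-D array. The inline sample data, the variable names, and the choice of `skip_header=1` (to step over the `# my file` line) are assumptions made for illustration; this is one possible approach, not necessarily what the asker tried.

```python
import io

import numpy as np

# Assumed sample data mirroring the file shown above (tab-separated).
RAW = (
    "# my file\n"
    "name\tfavorite_integer\tfavorite_float1\tfavorite_float2\tshort_description\n"
    "johnny\t5\t60.2\t0.52\tjohnny likes fruitflies\n"
    "bob\t1\t17.52\t0.001\tbob, bobby, robert\n"
)

# dtype=None asks numpy to guess each column's type; names=True takes the
# field names from the first line read after the skipped comment line.
data = np.genfromtxt(
    io.StringIO(RAW),
    delimiter="\t",
    skip_header=1,
    names=True,
    dtype=None,
    encoding="utf-8",
)

# Columns are accessible by header name.
names = data["name"]
float1 = data["favorite_float1"]

# Rows can be iterated and their fields read by name.
for row in data:
    print("Name:", row["name"], row["favorite_integer"])

# Stack only the numeric fields into an ordinary 2-D array.
numeric_fields = [
    field for field in data.dtype.names
    if np.issubdtype(data.dtype[field], np.number)
]
matrix = np.column_stack([data[field] for field in numeric_fields])
```

The structured array keeps the strings and the numbers together, while `matrix` is a regular array that can be sliced and transposed. Building `matrix` copies the numeric columns, so changes to it are not reflected in `data`.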

The representation in (2) suggests a list of dictionaries, and this is what I have been using. However, this works poorly for CSV files that have two string fields while the rest of the columns are numeric. For the numeric values, you really do sometimes want access to the matrix representation so you can manipulate it as a numpy.array.

Is there a combination of csv/numpy/scipy features that allows the flexibility of both worlds? Any advice on this would be greatly appreciated.

In summary, the main features are:


  1. Standard ability to specify delimiters, number of rows to skip, columns to ignore, etc.
  2. The ability to get a numpy.array/matrix representation of the data so that numeric values can be manipulated
  3. The ability to extract columns and rows by header name (as in the above example)


Recommended answer

Have a look at pandas, which is built on top of numpy. Here is a small example:

In [7]: df = pd.read_csv('data.csv', sep='\t', index_col='name')
In [8]: df
Out[8]: 
        favorite_integer  favorite_float1  favorite_float2        short_description
name                                                                               
johnny                 5            60.20            0.520  johnny likes fruitflies
bob                    1            17.52            0.001       bob, bobby, robert
In [9]: df.describe()
Out[9]: 
       favorite_integer  favorite_float1  favorite_float2
count          2.000000         2.000000         2.000000
mean           3.000000        38.860000         0.260500
std            2.828427        30.179317         0.366988
min            1.000000        17.520000         0.001000
25%            2.000000        28.190000         0.130750
50%            3.000000        38.860000         0.260500
75%            4.000000        49.530000         0.390250
max            5.000000        60.200000         0.520000
In [13]: df.loc['johnny', 'favorite_integer']
Out[13]: 5
In [15]: df['favorite_float1'] # or attribute: df.favorite_float1
Out[15]: 
name
johnny    60.20
bob       17.52
Name: favorite_float1
In [16]: df['mean_favorite'] = df.mean(axis=1, numeric_only=True)
In [17]: df.iloc[:, 3:]
Out[17]: 
              short_description  mean_favorite
name                                          
johnny  johnny likes fruitflies      21.906667
bob          bob, bobby, robert       6.173667
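
The session above can be tied back to the three features listed in the question. The following is a minimal sketch, assuming a recent pandas release; the inline sample data and the variable names are illustrative and not part of the original answer.

```python
import io

import pandas as pd

# Assumed sample data mirroring the question's file (tab-separated).
RAW = (
    "# my file\n"
    "name\tfavorite_integer\tfavorite_float1\tfavorite_float2\tshort_description\n"
    "johnny\t5\t60.2\t0.52\tjohnny likes fruitflies\n"
    "bob\t1\t17.52\t0.001\tbob, bobby, robert\n"
)

# Delimiter and comment character are read_csv options; the fully
# commented first line is skipped before the header is read.
df = pd.read_csv(io.StringIO(RAW), sep="\t", comment="#")

# (1) Numeric columns as a plain numpy array.
matrix = df.select_dtypes(include="number").to_numpy()
integers = matrix[:, 0]    # the integer column (as floats in the mixed array)
floats = matrix[:, 1:3]    # the two float columns
transposed = matrix.T

# (2) Columns by header name, independent of their position.
names = df["name"].tolist()
float1 = df["favorite_float1"].to_numpy()

# (3) Iterate over the rows and read fields by name.
for _, row in df.iterrows():
    print("Name:", row["name"], row["favorite_integer"])
```

Because the selected columns mix integers and floats, `to_numpy()` returns a single floating-point array. If the integer type matters, the column can be taken on its own with `df["favorite_integer"].to_numpy()`.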
