Reading CSV files in numpy where delimiter is ","

Problem description

I've got a CSV file with a format that looks like this:


"FieldName1", "FieldName2", "FieldName3", "FieldName4"
"04/13/2010 14:45:07.008", "7.59484916392", "10", "6.552373"
"04/13/2010 14:45:22.010", "6.55478493312", "9", "3.5378543"
...

Note that there are double quote characters at the start and end of each line in the CSV file, and the "," string is used to delimit fields within each line. The number of fields in the CSV file can vary from file to file.

When I try to read this into numpy via:
import numpy as np
data = np.genfromtxt(csvfile, dtype=None, delimiter=',', names=True)
all the data gets read in as string values, surrounded by double-quote characters. Not unreasonable, but not much use to me, as I then have to go back and convert every column to its correct type.

When I use delimiter='","' instead, everything works as I'd like, except for the 1st and last fields. Each line starts and ends with a single double-quote character, which isn't seen as a valid delimiter, so the 1st and last fields get read in as e.g. "04/13/2010 14:45:07.008 and 6.552373" - note the leading and trailing double-quote characters respectively. Because of these redundant characters, numpy assumes the 1st and last fields are both string types; I don't want that to be the case.

Is there a way of instructing numpy to read in files formatted in this fashion as I'd like, without having to go back and "fix" the structure of the numpy array after the initial read?

Recommended answer

The basic problem is that NumPy doesn't understand the concept of stripping quotes (whereas the csv module does). When you say delimiter='","', you're telling NumPy that the column delimiter is literally a quoted comma, i.e. the quotes are around the comma, not the value, so the extra quotes you get on the first and last columns are expected.
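To make the point above concrete, here is a minimal sketch using plain string splitting rather than NumPy itself. It assumes a line with no spaces around the commas, as the question describes; the variable names are only illustrative.

```python
line = '"04/13/2010 14:45:07.008","7.59484916392","10","6.552373"'

# Splitting on the three-character string '","' consumes only the quotes
# that sit next to a comma; the outermost quotes are left in place.
fields = line.split('","')
print(fields[0])   # still starts with a double quote
print(fields[-1])  # still ends with a double quote

# Removing the outer quotes before splitting gives clean fields.
clean = line.strip('"').split('","')
print(clean)
```

The same thing happens inside genfromtxt: the delimiter is matched literally, so nothing ever removes the quote at the start and end of the line.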

Looking at the function docs, I think you'll need to set the converters parameter to strip quotes for you (the default does not):

import re
import numpy as np

fieldFilter = re.compile(r'^"?([^"]*)"?$')
def filterTheField(s):
    m = fieldFilter.match(s.strip())
    if m:
        return float(m.group(1))
    else:
        return 0.0 # or whatever default

#...

# Yes, sorry, you have to know the number of columns, since the NumPy docs
# don't say you can specify a default converter for all columns.
convs = dict((col, filterTheField) for col in range(numColumns))
data = np.genfromtxt(csvfile, dtype=None, delimiter=',', names=True, 
    converters=convs)

Or abandon np.genfromtxt() and let csv.reader give you the file's contents a row at a time, as lists of strings with the quotes already removed; then you just iterate through the elements and build the matrix:

import csv

# csvfile must be an open file object here, not a filename.
reader = csv.reader(csvfile)
header = next(reader)  # the first row holds the column headings
result = np.array([[float(col) for col in row] for row in reader])
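To see how the standard csv module's reader function handles the quotes, here is a small self-contained sketch. The sample data is hypothetical (all-float columns in the question's layout), and io.StringIO stands in for an open file.

```python
import csv
import io

# Hypothetical sample in the question's layout: every field is quoted.
sample = (
    '"FieldName1","FieldName2","FieldName3"\n'
    '"7.59484916392","10","6.552373"\n'
    '"6.55478493312","9","3.5378543"\n'
)

reader = csv.reader(io.StringIO(sample))
header = next(reader)  # first row holds the column names
rows = [[float(col) for col in row] for row in reader]

# If the file has a space after each comma, the default dialect keeps the
# quotes on those fields; skipinitialspace=True drops the space first.
spaced = '"a", "b"\n'
default_row = next(csv.reader(io.StringIO(spaced)))
fixed_row = next(csv.reader(io.StringIO(spaced), skipinitialspace=True))

print(header)
print(rows)
print(default_row, fixed_row)
```

The resulting list of lists can be passed straight to np.array() when every column is numeric.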

EDIT: Okay, so it looks like your file isn't all floats. In that case, you can set convs as needed in the genfromtxt case, or create a vector of conversion functions in the csv.reader case:

from datetime import datetime

def parse_timestamp(s):
    # format taken from the sample rows, e.g. 04/13/2010 14:45:07.008
    return datetime.strptime(s, '%m/%d/%Y %H:%M:%S.%f')

reader = csv.reader(csvfile)
header = next(reader)  # the first row holds the column headings
converters = [parse_timestamp, float, int, float]
result = np.array([[conv(col) for col, conv in zip(row, converters)]
    for row in reader])

EDIT 2: Okay, variable column count... Your data source just wants to make life difficult. Luckily, we can just use magic...

reader = csv.reader(csvfile)
header = next(reader)  # skip the column headings
result = np.array([[magic(col) for col in row] for row in reader])

... where magic() is just a name I got off the top of my head for a function. (Psyche!)

At worst, it could be something like:

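from datetime import datetime

def magic(s):
    if '/' in s:
        # timestamps look like 04/13/2010 14:45:07.008
        return datetime.strptime(s, '%m/%d/%Y %H:%M:%S.%f')
    elif '.' in s:
        return float(s)
    else:
        return int(s)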
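A slightly more defensive variant tries each conversion in turn instead of looking for marker characters, and falls back to the raw string when nothing fits. This is only a sketch: the names infer_value, parse_timestamp and TIMESTAMP_FORMAT are hypothetical, and the timestamp format is assumed from the sample rows in the question.

```python
from datetime import datetime

TIMESTAMP_FORMAT = '%m/%d/%Y %H:%M:%S.%f'  # assumed from the sample rows

def parse_timestamp(s):
    return datetime.strptime(s, TIMESTAMP_FORMAT)

def infer_value(s):
    """Try int, then float, then timestamp; fall back to the raw string."""
    for conv in (int, float, parse_timestamp):
        try:
            return conv(s)
        except ValueError:
            continue
    return s

row = ['04/13/2010 14:45:07.008', '7.59484916392', '10', '6.552373']
values = [infer_value(col) for col in row]
print(values)
```

Trying int before float matters here, because float() would also accept '10' and the integer column would silently become floating point.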

Maybe NumPy has a function that takes a string and returns a single element with the right type. numpy.fromstring() looks close, but it might interpret the space in your timestamps as a column separator.
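On the type-inference point, genfromtxt already infers a type per column when dtype=None, and it accepts any iterable of lines, not only a filename. One possible approach, sketched here under the assumption that no field contains an embedded comma or quote, is to strip the quotes from each line before genfromtxt sees it. The sample data is hypothetical and io.StringIO stands in for an open file.

```python
import io
import numpy as np

# Hypothetical sample in the question's layout.
sample = io.StringIO(
    '"FieldName1","FieldName2","FieldName3","FieldName4"\n'
    '"04/13/2010 14:45:07.008","7.59484916392","10","6.552373"\n'
    '"04/13/2010 14:45:22.010","6.55478493312","9","3.5378543"\n'
)

# Generator that removes every double quote before NumPy parses the line.
# Assumption: quotes never appear inside a field value.
lines = (line.replace('"', '') for line in sample)

data = np.genfromtxt(lines, dtype=None, delimiter=',', names=True,
                     encoding='utf-8')

print(data.dtype)          # one inferred type per column
print(data['FieldName3'])  # integer column
```

The timestamp column still comes back as strings, since genfromtxt does not guess date formats; it would need a converter for that column.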

P.S. One downside I see with csv.reader is that it doesn't discard comments; real csv files don't have comments.
