Split csv column in subcolumns using numpy dtype and converters


Question

I have a csv file with some columns containing a measured value including error values. I want to import everything to python using numpy genfromtxt and format my array by using dtype. Let's assume I have a csv file in this format:

# Name, Time, Intensity
Sample1, 300, 1000+-5
Sample2, 300, 1500+-2

I want to parse the whole file and split value and uncertainty into two subcolumns of the column Intensity. I defined two dtypes:

import numpy as np
TypeValErr = np.dtype([("value", np.int32), ("error", np.int32)])
TypeCSV=np.dtype({"names": ["name", "time", "intensity"],
                  "formats": ["U32", np.int32, TypeValErr],
                  "titles": ["Name", "Time", "Intensity"]})

Using this dtypes, I first create just a test array by myself:

Intensity = np.array([(2000, 12)], dtype=TypeValErr)
CSVentry = np.array([("Sample3", 300, Intensity)], dtype=TypeCSV)

print(CSVentry)

This gives me the expected output:

[('Sample3', 300, (2000, 12))]

In the next step, I want to import the CSV using this dtype. As the Intensity column has the wrong format, I want to use a converter to convert the output into the right format:

def convertToValErrArr(txt):
    splitted = txt.split("+-")
    return np.array([(splitted[0], splitted[1])], dtype=TypeValErr)

print(np.array([("Sample3", 300, convertToValErrArr("1800+-7"))], dtype=TypeCSV))

The output again gives the expected result:

[('Sample3', 300, (1800, 7))]

But finally, the import itself throws an error. Here is my code:

ConvertFunc = lambda x: convertToValErrArr(x)

file = np.genfromtxt("test.csv",
                     delimiter=",",
                     autostrip=True,
                     dtype=TypeCSV,
                     skip_header=1,
                     converters={2: lambda x: convertToValErrArr(str(x))})

And here is the error:

Traceback (most recent call last):
  File "csvimport.py", line 28, in <module>
    converters={2: lambda x: convertToValErrArr(str(x))})
  File "/usr/lib/python3.6/site-packages/numpy/lib/npyio.py", line 1896, in genfromtxt
    rows = np.array(data, dtype=[('', _) for _ in dtype_flat])
ValueError: size of tuple must match number of fields.

I don't see the mistake. Is genfromtxt processing the data in a different way? I hope, somebody has an idea! Thanks a lot.

Answer

With your dtype, and 4 columns, it works (nested dtype and all)

In [58]: TypeValErr = np.dtype([("value", np.int32), ("error", np.int32)])
    ...: TypeCSV=np.dtype({"names": ["name", "time", "intensity"],
    ...:                   "formats": ["U32", np.int32, TypeValErr],
    ...:                   "titles": ["Name", "Time", "Intensity"]})
    ...: 
In [59]: txt=b"""# Name, Time, Intensity
    ...: Sample1, 300, 1000, 5
    ...: Sample2, 300, 1500, 2"""
In [60]: 
In [60]: data=np.genfromtxt(txt.splitlines(), dtype=TypeCSV, delimiter=',',skip_header=True)
In [61]: data
Out[61]: 
array([('Sample1', 300, (1000, 5)), ('Sample2', 300, (1500, 2))], 
      dtype=[(('Name', 'name'), '<U32'), (('Time', 'time'), '<i4'), (('Intensity', 'intensity'), [('value', '<i4'), ('error', '<i4')])])

So it is able to take a flat list of values, e.g. ['Sample1', 300, 1000, 5], and map them onto the nested tuples needed to store this dtype: ('Sample1', 300, (1000, 5)).

But the converter does not turn ['Sample1', '300', '1000+-5'] into ['Sample1', '300', (1000, 5)], or if it does it isn't the right thing for subsequent use.
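
For illustration, a quick check of what the converter from the question actually returns (this snippet is mine, reusing the convertToValErrArr defined above). It produces a 1-element structured array, i.e. a single object rather than two separate flat fields:

val = convertToValErrArr("1000+-5")
print(val)        # [(1000, 5)]
print(val.shape)  # (1,)
print(val.dtype)  # [('value', '<i4'), ('error', '<i4')]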

The dtype_flat is:

In [70]: np.lib.npyio.flatten_dtype(TypeCSV)
Out[70]: [dtype('<U32'), dtype('int32'), dtype('int32'), dtype('int32')]

So your nested dtype is produced with a sequence like this:

In [75]: rows=np.array(('str',1,2, 3),dtype=[('',_) for _ in np.lib.npyio.flatten_dtype(TypeCSV)])
In [76]: rows.view(TypeCSV)
Out[76]: 
array(('str', 1, (2, 3)), 
      dtype=[(('Name', 'name'), '<U32'), (('Time', 'time'), '<i4'), (('Intensity', 'intensity'), [('value', '<i4'), ('error', '<i4')])])

In fact there's a comment to that effect just before the error line:

    if len(dtype_flat) > 1:
        # Nested dtype, eg [('a', int), ('b', [('b0', int), ('b1', 'f4')])]
        # First, create the array using a flattened dtype:
        # [('a', int), ('b1', int), ('b2', float)]
        # Then, view the array using the specified dtype.
        if 'O' in (_.char for _ in dtype_flat):
        ...
        else:
            rows = np.array(data, dtype=[('', _) for _ in dtype_flat])
            output = rows.view(dtype)

data at this point is a list of `row` tuples, which have already been passed through the converters:

rows = list(
        zip(*[[conv._strict_call(_r) for _r in map(itemgetter(i), rows)]
              for (i, conv) in enumerate(converters)]))

Simplifying the conversion process:

In [84]: converters = [str, int, int, int]
In [85]: row = ['one','1','2','3']
In [86]: [conv(r) for conv, r in zip(converters, row)]
Out[86]: ['one', 1, 2, 3]

But in reality it is closer to:

In [87]: rows = [row,row]
In [88]: rows
Out[88]: [['one', '1', '2', '3'], ['one', '1', '2', '3']]
In [89]: from operator import itemgetter
In [90]: [[conv(r) for r in map(itemgetter(i), rows)] for (i, conv) in enumerate(converters)]
Out[90]: [['one', 'one'], [1, 1], [2, 2], [3, 3]]
In [91]: list(zip(*_))
Out[91]: [('one', 1, 2, 3), ('one', 1, 2, 3)]
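
Putting that together, here is a minimal sketch (my own reconstruction, not code taken from genfromtxt) of why the np.array call then fails: after the converters run, each row tuple has only 3 items, while dtype_flat has 4 fields:

import numpy as np

# the flattened dtype shown above: 4 fields
dtype_flat = [np.dtype("U32"), np.dtype(np.int32),
              np.dtype(np.int32), np.dtype(np.int32)]

# what one row looks like after the converter: only 3 items
row = ("Sample1", 300, (1000, 5))

try:
    np.array([row], dtype=[("", _) for _ in dtype_flat])
except ValueError as e:
    print(e)  # a message like "size of tuple must match number of fields."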

So the long and short of it is that converters cannot split one column into two or more columns. The splitting, converting, and mapping onto the dtype happen in the wrong order for that. What I demonstrated at the start is probably easiest: pass your file, line by line, through a small text-processing step that replaces the +- with the specified delimiter. Then the file has the correct number of columns to work with your dtype.
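
A minimal sketch of that idea, reusing the test.csv and the dtypes from the question (the generator-based replacement is just one way to do the preprocessing):

import numpy as np

# dtypes as defined in the question
TypeValErr = np.dtype([("value", np.int32), ("error", np.int32)])
TypeCSV = np.dtype({"names": ["name", "time", "intensity"],
                    "formats": ["U32", np.int32, TypeValErr],
                    "titles": ["Name", "Time", "Intensity"]})

with open("test.csv", "rb") as f:
    # replace "+-" with the delimiter so every line has four
    # comma-separated fields, matching the flattened dtype
    cleaned = (line.replace(b"+-", b",") for line in f)
    data = np.genfromtxt(cleaned, dtype=TypeCSV, delimiter=",",
                         autostrip=True, skip_header=1)

print(data)
# e.g. [('Sample1', 300, (1000, 5)) ('Sample2', 300, (1500, 2))]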
