numpy genfromtxt - missing data vs bad data


Question

I'm using numpy genfromtxt, and I need to identify both missing data and bad data. Depending on user input, I may want to drop the bad values or raise an error. Essentially, I want to treat missing and bad data as the same thing.

Say I have a file like this, where the columns have the data types date, int, and float:

date,id,value
2017-12-4,0,       # BAD. missing data
2017-12-4,1,XYZ    # BAD. value should be float, not string. 
2017-12-4,2,1.0    # good
2017-12-4,3,1.0    # good
2017-12-4,4,1.0    # good

I would like to detect both, so I do this:

dtype=(np.dtype('<M8[D]'), np.dtype('int64'), np.dtype('float64'))
result = np.genfromtxt(filename, delimiter=',', dtype=dtype, names=True, usemask=True, usecols=('date', 'id', 'value'))

The result looks like this:

masked_array(data=[(datetime.date(2017, 12, 4), 0, --),
               (datetime.date(2017, 12, 4), 1, nan),
               (datetime.date(2017, 12, 4), 2, 1.0),
               (datetime.date(2017, 12, 4), 3, 1.0),
               (datetime.date(2017, 12, 4), 4, 1.0)],
         mask=[(False, False,  True), (False, False, False),
               (False, False, False), (False, False, False),
               (False, False, False)],
   fill_value=('NaT', 999999, 1.e+20),
        dtype=[('date', '<M8[D]'), ('id', '<i8'), ('value', '<f8')])

I thought the whole point of masked_array is that it can handle missing data AND bad data. But here, it's only handling missing data.

result['value'].mask

returns

array([ True, False, False, False, False])

The "bad" data still got into the array, as nan. I was hoping the mask would give me True True False False False.

To notice that there is a bad value on the 2nd row, I need to do additional work, like checking for nan.

another_mask = np.isnan(result['value'])
good_result = result['value'][~another_mask]

which finally returns

masked_array(data=[1.0, 1.0, 1.0],
         mask=[False, False, False],
   fill_value=1e+20)

That works, but I feel like I'm doing something wrong. The whole point of a masked array is to find missing AND bad data, but I'm somehow only using it to find missing data, and I need my own check to find bad data. It feels ugly and un-Pythonic.

Is there a way to find both at the same time?

Recommended answer

Playing with a simple input:

In [143]: txt='''1,2
     ...: 3,nan
     ...: 4,foo
     ...: 5,
     ...: '''.splitlines()
In [144]: txt
Out[144]: ['1,2', '3,nan', '4,foo', '5,']

By specifying a specific string as 'missing' (it may be a list?), I can 'mask' it, along with blank:

In [146]: np.genfromtxt(txt,delimiter=',', missing_values='foo', 
       usemask=True, usecols=1)
Out[146]: 
masked_array(data=[2.0, nan, --, --],
             mask=[False, False,  True,  True],
       fill_value=1e+20)

It looks like it converted all values with float, but generated the mask based on the strings (or lack thereof):

In [147]: _.data
Out[147]: array([ 2., nan, nan, nan])
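
Since unparsable strings end up as nan in the data while only the recognized 'missing' strings reach the mask, one way to get a single mask covering both might be to pass the result through `np.ma.masked_invalid`. This is a minimal sketch, not the answer's own method. It assumes a float column and that a literal nan in the file should also count as bad; the sample `lines` are made up for illustration.

```python
import numpy as np

# Illustrative input: two good rows, one unparsable value, one blank value.
lines = ["1,2", "3,4.5", "4,foo", "5,"]

# Blank fields are masked by genfromtxt; 'foo' becomes nan but is not masked.
col = np.genfromtxt(lines, delimiter=",", usecols=1, usemask=True)

# masked_invalid additionally masks every nan/inf entry, so both kinds of
# problem end up in one mask.
clean = np.ma.masked_invalid(col)
bad = np.ma.getmaskarray(clean)

print(bad)                 # boolean mask over the four rows
print(clean.compressed())  # only the values that parsed cleanly
```

The caveat is that a value legitimately written as nan in the file is indistinguishable from a failed conversion with this approach.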

I can replace both types of 'missing' with a specific value. Since it's doing a float conversion, the fill has to be 100 or '100':

In [151]: np.genfromtxt(txt,delimiter=',', missing_values='foo', 
    usecols=1, filling_values=100)
Out[151]: array([  2.,  nan, 100., 100.])

In a more complex case I can imagine writing a converter for the column. I've only dabbled in that feature.
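
Short of writing a converter, the "raise an error" case from the question might be covered by `genfromtxt`'s `loose` parameter, which the NumPy documentation describes as controlling whether errors are raised for invalid values. The following is a hedged sketch: `load_values` is a hypothetical helper, the sample lines are made up, and the exact behaviour is worth confirming on your NumPy version.

```python
import numpy as np

def load_values(lines, on_bad="mask"):
    """Hypothetical helper: read column 1 as float, masking or rejecting bad values."""
    if on_bad == "raise":
        # loose=False asks genfromtxt to raise ValueError for strings it
        # cannot convert; blank fields are still treated as missing.
        return np.genfromtxt(lines, delimiter=",", usecols=1,
                             usemask=True, loose=False)
    # Default loose=True: unparsable strings become nan, which we then mask.
    col = np.genfromtxt(lines, delimiter=",", usecols=1, usemask=True)
    return np.ma.masked_invalid(col)

lines = ["1,2", "3,4.5", "4,foo", "5,"]

masked = load_values(lines)  # bad and missing values are both masked

try:
    load_values(lines, on_bad="raise")
    raised = False
except ValueError as exc:
    raised = True
    print("rejected:", exc)
```

Switching on a single argument keeps the user's choice (drop or raise) in one place, which is the behaviour the question asks for.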

The documentation for these parameters is slim, so figuring out which combinations work, and in what order, is a matter of trial and error (or a lot of code digging).

More details in the follow-up question: numpy genfromtxt - how to detect bad int input values
