Forcing non-numeric characters to NAs in numpy (when reading a csv to a pandas dataframe)


Problem description


I have records where two fields (called INDATUMA and UTDATUMA) are supposed to contain numbers between 20010101 and 20141231 (for the obvious reason: they are dates). To allow missing values but keep the full precision of each date, I want to store them as floats (np.float64). I was hoping this would force the occasional misformatted field (think of 2oo41oo9) to NA, but instead it breaks the import in both pandas 0.18.0 and IOPro 1.7.2.

Is there an undocumented option I could use? Or some other approach?

The key line for the pandas attempt is

import numpy as np
import pandas as pd
treatments = pd.read_table(filename,usecols=[0,3,4,6], engine='c', dtype={'LopNr':np.uint32,'INDATUMA':np.float64,'UTDATUMA':np.float64,'DIAGNOS':object})

This fails with the error ValueError: invalid literal for float(): 2003o730.

I tried the following in IOPro, just in case:

import iopro
adapter = iopro.text_adapter(filename, parser='csv',delimiter='\t',output='dataframe',infer_types=False)
adapter.set_field_types({0: 'u4',3:'f8', 4:'f8',6:'object'})
all_treatments.append(adapter[[0,3,4,6]][:])

But this also breaks with iopro.lib.errors.DataTypeError: Could not convert token "2003o730" at record 1 field 3 to float64.Reason: unknown

The datafile starts as

LopNr   SJUKHUS MVO INDATUMA    UTDATUMA    HDIA    DIAGNOS OP  PVARD   EKOD1   EKOD2   EKOD3   EKOD4   EKOD5   ICD
1562    21001   046 20030707    20030711    I489A   I489A I509      2                       10
1562    21001   046 2003o730    20030801    I501    I501 I489A  DG001   2                       10

Solution

You can use parameter converters in read_table:

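def converter(num):
    # return the field as a float, or NaN if it is not a valid number
    # (np.float was removed from recent NumPy releases, so use the builtin float)
    try:
        return float(num)
    except (TypeError, ValueError):
        return np.nan

# apply the converter to each date column
converters = {'INDATUMA': converter, 'UTDATUMA': converter}

df = pd.read_table(filename, converters=converters)
print(df)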
   LopNr  SJUKHUS  MVO  INDATUMA  UTDATUMA   HDIA DIAGNOS     OP  PVARD  \
0   1562    21001   46  20030707  20030711  I489A   I489A   I509      2   
1   1562    21001   46       NaN  20030801   I501    I501  I489A  DG001   

   EKOD1  EKOD2  EKOD3  EKOD4  EKOD5  ICD  
0     10    NaN    NaN    NaN    NaN  NaN  
1      2     10    NaN    NaN    NaN  NaN  
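To try the converters approach without the original file, here is a minimal self-contained sketch. It assumes a trimmed, four-column version of the sample data held in an in-memory string (SAMPLE is a stand-in, not the real file), and the helper name to_float_or_nan is made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real tab-separated file (only four columns kept).
SAMPLE = (
    "LopNr\tINDATUMA\tUTDATUMA\tDIAGNOS\n"
    "1562\t20030707\t20030711\tI489A I509\n"
    "1562\t2003o730\t20030801\tI501 I489A\n"
)


def to_float_or_nan(token):
    """Return the raw field as a float, or NaN when it is not a valid number."""
    try:
        return float(token)
    except (TypeError, ValueError):
        return np.nan


# Each converter receives the raw string of one field in the named column.
df = pd.read_csv(
    io.StringIO(SAMPLE),
    sep="\t",
    converters={"INDATUMA": to_float_or_nan, "UTDATUMA": to_float_or_nan},
)
print(df)
```

Converters run in Python once per field, so on large files this is likely slower than parsing the columns as strings and coercing them afterwards.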

Or post-processing with parameter errors='coerce' of to_numeric:

df['INDATUMA'] = pd.to_numeric(df['INDATUMA'], errors='coerce')
0    20030707
1         NaN
Name: INDATUMA, dtype: float64
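The post-processing route can be sketched end to end as follows. This assumes the two date columns are first read as plain strings so the parser cannot fail on them; SAMPLE is again a trimmed in-memory stand-in for the real file, and the bad mask is an illustrative extra for locating the fields that were coerced.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real tab-separated file (only four columns kept).
SAMPLE = (
    "LopNr\tINDATUMA\tUTDATUMA\tDIAGNOS\n"
    "1562\t20030707\t20030711\tI489A I509\n"
    "1562\t2003o730\t20030801\tI501 I489A\n"
)
date_cols = ["INDATUMA", "UTDATUMA"]

# Read the date columns as strings; nothing is parsed as a number yet.
raw = pd.read_csv(
    io.StringIO(SAMPLE),
    sep="\t",
    dtype={col: str for col in date_cols},
)

# Coerce each column: anything that is not a valid number becomes NaN.
cleaned = raw.copy()
for col in date_cols:
    cleaned[col] = pd.to_numeric(raw[col], errors="coerce")

# Fields that were present in the file but could not be converted.
bad = cleaned[date_cols].isna() & raw[date_cols].notna()
print(cleaned)
print(bad)
```

Keeping the raw strings around makes it possible to inspect or repair the misformatted values instead of silently discarding them.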
