Forcing non-numeric characters to NAs in numpy (when reading a csv to a pandas dataframe)
I have records in which the fields INDATUMA and UTDATUMA are supposed to contain numbers in the range 20010101 to 20141231 (they are dates in YYYYMMDD form). To allow missing values while keeping the values exact to the day, I store them as floats (np.float64). I was hoping this would force the occasional misformatted field (think of 2oo41oo9) to NA, but instead it breaks the import in both pandas 0.18.0 and IOPro 1.7.2.
Is there an undocumented option I could use? Or some other approach?
The key lines of the pandas attempt are
import numpy as np
import pandas as pd
treatments = pd.read_table(filename, usecols=[0, 3, 4, 6], engine='c',
                           dtype={'LopNr': np.uint32, 'INDATUMA': np.float64,
                                  'UTDATUMA': np.float64, 'DIAGNOS': object})
which fails with the error ValueError: invalid literal for float(): 2003o730.
I tried the following in IOPro, just in case:
import iopro
adapter = iopro.text_adapter(filename, parser='csv',delimiter='\t',output='dataframe',infer_types=False)
adapter.set_field_types({0: 'u4',3:'f8', 4:'f8',6:'object'})
all_treatments.append(adapter[[0,3,4,6]][:])
But this also fails, with iopro.lib.errors.DataTypeError: Could not convert token "2003o730" at record 1 field 3 to float64. Reason: unknown
The data file starts as
LopNr SJUKHUS MVO INDATUMA UTDATUMA HDIA DIAGNOS OP PVARD EKOD1 EKOD2 EKOD3 EKOD4 EKOD5 ICD
1562 21001 046 20030707 20030711 I489A I489A I509 2 10
1562 21001 046 2003o730 20030801 I501 I501 I489A DG001 2 10
You can use the converters parameter of read_table:
def converter(num):
    # Return the field as a float, or NaN if it cannot be parsed
    try:
        return float(num)
    except ValueError:
        return np.nan

# define a converter for each column
converters = {'INDATUMA': converter, 'UTDATUMA': converter}
df = pd.read_table(filename, converters=converters)
print(df)
LopNr SJUKHUS MVO INDATUMA UTDATUMA HDIA DIAGNOS OP PVARD \
0 1562 21001 46 20030707 20030711 I489A I489A I509 2
1 1562 21001 46 NaN 20030801 I501 I501 I489A DG001
EKOD1 EKOD2 EKOD3 EKOD4 EKOD5 ICD
0 10 NaN NaN NaN NaN NaN
1 2 10 NaN NaN NaN NaN
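To try the converter approach without the original file, here is a minimal sketch that reads a small in-memory sample. The sample text, the reduced set of columns, and the name to_float_or_nan are assumptions made for illustration; only the converters parameter itself comes from the answer above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory stand-in for the tab-separated data file
# (only a few of the original columns are reproduced here).
sample = (
    "LopNr\tINDATUMA\tUTDATUMA\tDIAGNOS\n"
    "1562\t20030707\t20030711\tI489A I509\n"
    "1562\t2003o730\t20030801\tI501 I489A\n"
)


def to_float_or_nan(token):
    """Return the raw field as a float, or NaN if it is not a valid number."""
    try:
        return float(token)
    except ValueError:
        return np.nan


df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    converters={"INDATUMA": to_float_or_nan, "UTDATUMA": to_float_or_nan},
)
print(df[["INDATUMA", "UTDATUMA"]])
```

Converters receive each field as a string, so an empty field also raises ValueError inside float() and ends up as NaN. Because the converter runs as a Python function on every field, this route is likely to be slower than a plain typed read on large files.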
Or post-process with to_numeric, passing errors='coerce':
df['INDATUMA'] = pd.to_numeric(df['INDATUMA'], errors='coerce')
0 20030707
1 NaN
Name: INDATUMA, dtype: float64
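The post-processing route can be sketched end to end in the same way. This assumes the two date columns are first read as strings so that the import itself cannot fail; the sample text and the date_cols name are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the tab-separated data file.
sample = (
    "LopNr\tINDATUMA\tUTDATUMA\tDIAGNOS\n"
    "1562\t20030707\t20030711\tI489A I509\n"
    "1562\t2003o730\t20030801\tI501 I489A\n"
)

date_cols = ["INDATUMA", "UTDATUMA"]

# Read the date columns as plain strings so malformed values do not break the import.
df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    dtype={col: str for col in date_cols},
)

# Coerce each column to numbers; anything unparseable becomes NaN.
df[date_cols] = df[date_cols].apply(pd.to_numeric, errors="coerce")

# Count how many values were lost to coercion in each column.
print(df[date_cols].isna().sum())
```

Counting the NaN values afterwards gives a quick check on how many fields were malformed, which the converter approach hides unless you add logging to it.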