Prevent pandas from interpreting 'NA' as NaN in a string

Problem description

The pandas read_csv() method interprets 'NA' as nan (not a number) instead of a valid string.

In the simple case below note that the output in row 1, column 2 (zero based count) is 'nan' instead of 'NA'.

sample.tsv (tab-separated)

PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
5d8b N P60490 1 146 1 146 1 146
5d8b NA P80377 1 126 1 126 1 126
5d8b O P60491 1 118 1 118 1 118

read_sample.py

import pandas as pd

df = pd.read_csv(
    'sample.tsv',
    sep='\t',
    encoding='utf-8',
)

for df_tuples in df.itertuples(index=True):
    print(df_tuples)

Output

(0, u'5d8b', u'N', u'P60490', 1, 146, 1, 146, 1, 146)
(1, u'5d8b', nan, u'P80377', 1, 126, 1, 126, 1, 126)
(2, u'5d8b', u'O', u'P60491', 1, 118, 1, 118, 1, 118)

Additional Information

Re-writing the file with quotes for data in the 'CHAIN' column and then using the quotechar parameter quotechar='\'' has the same result. And passing a dictionary of types via the dtype parameter dtype=dict(valid_cols) does not change the result.
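
The dtype observation can be reproduced in a few lines. The sketch below is an illustration rather than part of the original question: it inlines a shortened, hypothetical stand-in for sample.tsv through StringIO and passes dtype=str. NA detection appears to run independently of the requested dtype, so the 'NA' cell still comes back as missing.

```python
from io import StringIO

import pandas as pd

# Shortened, inlined stand-in for sample.tsv (tab-separated); illustrative only.
data = "PDB\tCHAIN\tSP_PRIMARY\n5d8b\tN\tP60490\n5d8b\tNA\tP80377\n5d8b\tO\tP60491\n"

# Request strings for every column; the default NA markers are still applied.
df = pd.read_csv(StringIO(data), sep="\t", dtype=str)

# Boolean mask of missing values in the CHAIN column, as a plain list.
chain_missing = df["CHAIN"].isna().tolist()
print(chain_missing)
```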

An old answer to Prevent pandas from automatically inferring type in read_csv suggests first using a numpy record array to parse the file, but given the ability to now specify column dtypes, this shouldn't be necessary.

Note that itertuples() is used to preserve dtypes as described in the iterrows documentation: "To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns tuples of the values and which is generally faster as iterrows."

Example was tested on Python 2 and 3 with pandas version 0.16.2, 0.17.0, and 0.17.1.

Is there a way to capture a valid string 'NA' instead of it being converted to nan?

Recommended answer

You can use the keep_default_na and na_values parameters to set all NA values by hand (see the read_csv docs):

import pandas as pd
from io import StringIO

data = """
PDB CHAIN SP_PRIMARY RES_BEG RES_END PDB_BEG PDB_END SP_BEG SP_END
5d8b N P60490 1 146 1 146 1 146
5d8b NA P80377 _ 126 1 126 1 126
5d8b O P60491 1 118 1 118 1 118
"""

# Turn off the built-in NA markers and treat only '_' as missing.
df = pd.read_csv(StringIO(data), sep=' ', keep_default_na=False, na_values=['_'])

In [130]: df
Out[130]:
    PDB CHAIN SP_PRIMARY  RES_BEG  RES_END  PDB_BEG  PDB_END  SP_BEG  SP_END
0  5d8b     N     P60490        1      146        1      146       1     146
1  5d8b    NA     P80377      NaN      126        1      126       1     126
2  5d8b     O     P60491        1      118        1      118       1     118

In [144]: df.CHAIN.apply(type)
Out[144]:
0    <class 'str'>
1    <class 'str'>
2    <class 'str'>
Name: CHAIN, dtype: object
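
If the file is known to contain no missing values at all, another option is to switch NA detection off entirely with na_filter=False, a documented read_csv parameter. This is a minimal sketch and not part of the original answer; the inlined data is a shortened, hypothetical stand-in for the sample file. With this setting, empty fields are kept as empty strings rather than converted to NaN.

```python
from io import StringIO

import pandas as pd

# Shortened, inlined stand-in for the sample data (tab-separated); illustrative only.
data = "PDB\tCHAIN\tRES_BEG\n5d8b\tN\t1\n5d8b\tNA\t1\n5d8b\tO\t1\n"

# na_filter=False disables all NA detection, so 'NA' is read as a plain string.
df = pd.read_csv(StringIO(data), sep="\t", na_filter=False)

chains = df["CHAIN"].tolist()
print(chains)
```

This trades away missing-value handling for the whole file, so it suits files like the sample, where every cell is populated.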

Edit

All default NA values, from the na_values entry in the read_csv documentation (as of pandas 1.0.0):

The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/A N/A', '#N/A', 'N/A', 'n/a', 'NA', '<NA>', '#NA', 'NULL', 'null', 'NaN', '-NaN', 'nan', '-nan', ''].
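
To keep the usual missing-value handling for everything except the literal string 'NA', one approach is to pass the default markers minus 'NA'. The sketch below is an assumption-laden illustration: DEFAULT_NA is a hand-copied list of the pandas 1.0.0 defaults (other versions may differ), and the inlined data is a hypothetical variant of the sample with one 'N/A' cell added.

```python
from io import StringIO

import pandas as pd

# Default NA markers as documented for pandas 1.0.0, copied by hand.
# Assumption: other pandas versions may use a slightly different list.
DEFAULT_NA = [
    "-1.#IND", "1.#QNAN", "1.#IND", "-1.#QNAN", "#N/A N/A", "#N/A",
    "N/A", "n/a", "NA", "<NA>", "#NA", "NULL", "null", "NaN", "-NaN",
    "nan", "-nan", "",
]

# Keep every default marker except the literal string "NA".
na_without_na = [value for value in DEFAULT_NA if value != "NA"]

# Hypothetical tab-separated data: 'NA' is a real chain ID, 'N/A' is missing.
data = "PDB\tCHAIN\tRES_BEG\n5d8b\tN\t1\n5d8b\tNA\tN/A\n5d8b\tO\t1\n"

df = pd.read_csv(
    StringIO(data),
    sep="\t",
    keep_default_na=False,    # drop the built-in list...
    na_values=na_without_na,  # ...and supply the trimmed one instead
)
print(df)
```

Here the CHAIN value 'NA' should survive as a string while the 'N/A' cell in RES_BEG is still treated as missing.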
