Is this correct behavior for read_csv and a data value of NA?


Question

(I have opened an issue at GitHub.)

The following behavior doesn't seem correct to me. It seems that if the default for read_csv is na_values=False, then no values, including 'NA', should be interpreted as NaN, but this does not appear to be the case.

This behavior was noticed in this post (see the comments to the answer by @JianxunLi), where 'NA' actually means 'North America'. I actually am unable to find a way to read this in without having it changed to NaN and there definitely should be some way to do this.

Here is the sample CSV.

%more foo.txt
x,y
"NA",NA
"foo",foo

I'm including 'NA' both in quotes and outside to see if that matters, but as you can see below it doesn't seem to.

pd.read_csv('foo.txt')
Out[56]: 
     x    y
0  NaN  NaN
1  foo  foo

pd.read_csv('foo.txt',na_values=False)
Out[57]: 
     x    y
0  NaN  NaN
1  foo  foo

pd.read_csv('foo.txt',na_values='foo')
Out[58]: 
    x   y
0 NaN NaN
1 NaN NaN

It appears that data values of 'NaN' are treated the same as 'NA'.

Edit to add: I think I am understanding this better based on @Marius's answer although it doesn't really seem right to me (the default behavior, that is, not Marius's answer which does seem to be a correct explanation of what is happening).

na_values=False    =>   NA and NaN are treated as NaN
na_values='foo'    =>   NA, NaN, and foo are treated as NaN
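
The summary above can be checked with a small sketch. It assumes the same two-row file, read from an in-memory string instead of foo.txt; the helper name count_missing is hypothetical and only used for illustration.

```python
import io

import pandas as pd

# Same content as foo.txt, kept in memory so the sketch is self-contained.
CSV_TEXT = 'x,y\n"NA",NA\n"foo",foo\n'


def count_missing(**read_csv_kwargs):
    """Read the sample CSV and return how many cells were parsed as missing."""
    df = pd.read_csv(io.StringIO(CSV_TEXT), **read_csv_kwargs)
    return int(df.isna().sum().sum())


# Default settings: the built-in NA strings (including 'NA') apply.
default_missing = count_missing()

# na_values='foo' is added on top of the built-in NA strings.
foo_missing = count_missing(na_values='foo')

print(default_missing, foo_missing)
```

If the second count is larger than the first, that matches the additive behavior described above: 'foo' becomes missing in addition to 'NA', not instead of it.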

I guess I can understand this being default behavior in a number column but it doesn't seem like this should be the default for a string column. I also would have really struggled to figure this out from the documentation without seeing Marius's answer.

Edit to add (2):

Also, for comparison, I read this into Stata and Excel, and in both cases they treat 'NA' as plain text, not as NaN/missing. Is there any other package or library that would have the same default behavior as pandas here?

Recommended answer

You need keep_default_na=False. By default, any strings you include in na_values are just added to the standard set of NA strings (e.g. NA, NaN):

pd.read_csv('foo.txt', keep_default_na=False)
Out[5]: 
     x    y
0   NA   NA
1  foo  foo
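
As a further sketch (not part of the original answer), keep_default_na=False can be combined with na_values to choose exactly which strings count as missing, and a dict restricts that choice to specific columns. The in-memory CSV_TEXT stands in for foo.txt, and treating 'foo' as the missing marker in column y is purely an assumption for illustration.

```python
import io

import pandas as pd

# Same content as foo.txt, kept in memory so the sketch is self-contained.
CSV_TEXT = 'x,y\n"NA",NA\n"foo",foo\n'

# Disable the built-in NA strings: 'NA' is kept as literal text.
as_text = pd.read_csv(io.StringIO(CSV_TEXT), keep_default_na=False)

# Disable the built-ins, then mark only 'foo' in column y as missing.
# Column x is not listed in the dict, so nothing in it is converted.
custom = pd.read_csv(
    io.StringIO(CSV_TEXT),
    keep_default_na=False,
    na_values={'y': ['foo']},
)

print(as_text)
print(custom)
```

This keeps 'NA' (for example, 'North America') as a real value while still letting you declare your own missing-value markers where they make sense.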
