Is this correct behavior for read_csv and a data value of NA?
Question
(I have opened an issue at GitHub.)
The following behavior doesn't seem correct to me. If the default for read_csv is na_values=False, then no values, including 'NA', should be interpreted as NaN, but this does not appear to be the case.
This behavior was noticed in this post (see the comments on the answer by @JianxunLi), where 'NA' actually means 'North America'. I have been unable to find a way to read this in without it being changed to NaN, and there definitely should be some way to do this.
Here is the example CSV.
%more foo.txt
x,y
"NA",NA
"foo",foo
I'm including 'NA' both inside quotes and outside them to see if that matters, but as you can see below it doesn't seem to.
pd.read_csv('foo.txt')
Out[56]:
x y
0 NaN NaN
1 foo foo
pd.read_csv('foo.txt',na_values=False)
Out[57]:
x y
0 NaN NaN
1 foo foo
pd.read_csv('foo.txt',na_values='foo')
Out[58]:
x y
0 NaN NaN
1 NaN NaN
It appears that data values of 'NaN' are treated the same as 'NA'.
Edit to add: I think I understand this better based on @Marius's answer, although the default behavior still doesn't seem right to me (Marius's answer itself does seem to be a correct explanation of what is happening).
na_values=False => NA and NaN are treated as NaN
na_values='foo' => NA, NaN, and foo are treated as NaN
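The two rules above can be checked with a small sketch. It assumes an in-memory copy of foo.txt built with io.StringIO (the name CSV_TEXT is only illustrative and not part of the original post), and it reproduces only the default call and the na_values='foo' call.

```python
import io

import pandas as pd

# In-memory stand-in for foo.txt (assumption: same contents as the file above).
CSV_TEXT = 'x,y\n"NA",NA\n"foo",foo\n'

# Default call: 'NA' is in the built-in set of NA strings, quoted or not.
df_default = pd.read_csv(io.StringIO(CSV_TEXT))
print(df_default.isna())

# Extra NA string: 'foo' is added to the built-in set rather than replacing it.
df_foo = pd.read_csv(io.StringIO(CSV_TEXT), na_values=["foo"])
print(df_foo.isna())
```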
I guess I can understand this being the default behavior in a numeric column, but it doesn't seem like it should be the default for a string column. I also would have really struggled to figure this out from the documentation without seeing Marius's answer.
Edit to add (2):
Also, for comparison, I read this into Stata and Excel, and in both cases they treat 'NA' as plain text, not as NaN/missing. Is there any other package or library that would have the same default behavior as pandas here?
Recommended answer
You need keep_default_na=False. By default, any strings you include in na_values are just added to the standard set of NA strings, e.g. NA, NaN:
pd.read_csv('foo.txt', keep_default_na=False)
Out[5]:
x y
0 NA NA
1 foo foo
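As a hedged sketch of how keep_default_na=False combines with na_values, the snippet below reads an in-memory copy of foo.txt three ways. The io.StringIO stand-in and the variable names are assumptions for illustration, and the per-column dict form of na_values is an addition not shown in the answer above; behavior is as I understand it for recent pandas versions.

```python
import io

import pandas as pd

# In-memory stand-in for foo.txt (assumption: same contents as the file above).
CSV_TEXT = 'x,y\n"NA",NA\n"foo",foo\n'

# 1. No NA strings at all: 'NA' stays a plain string in both columns.
df_plain = pd.read_csv(io.StringIO(CSV_TEXT), keep_default_na=False)

# 2. Only the strings listed in na_values count as missing: 'foo' becomes NaN,
#    'NA' is kept as text.
df_only_foo = pd.read_csv(
    io.StringIO(CSV_TEXT), keep_default_na=False, na_values=["foo"]
)

# 3. Per-column rule via a dict: treat 'NA' as missing in y only,
#    leaving column x untouched.
df_per_col = pd.read_csv(
    io.StringIO(CSV_TEXT), keep_default_na=False, na_values={"y": ["NA"]}
)

for frame in (df_plain, df_only_foo, df_per_col):
    print(frame, end="\n\n")
```

The dict form may be useful when 'NA' is a real value (such as 'North America') in one text column but still marks missing data elsewhere.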