What is the difference between NaN and None?
Question
I am reading two columns of a csv file using pandas read_csv() and then assigning the values to a dictionary. The columns contain strings of numbers and letters. Occasionally a cell is empty. In my opinion, the value read into that dictionary entry should be None, but nan is assigned instead. Surely None is more descriptive of an empty cell, since it is a null value, whereas nan just says that the value read is not a number.
Is my understanding correct? What is the difference between None and nan, and why is nan assigned instead of None?
Also, my dictionary check for any empty cells has been using numpy.isnan():
for k, v in my_dict.iteritems():
    if np.isnan(v):
But this gives me an error saying that I cannot use this check for v. I guess that is because the check expects an integer or float variable, not a string. If so, how can I check v for an "empty cell"/nan case?
Recommended answer
NaN is used consistently in pandas as the placeholder for missing data, and consistency is good. I usually read/translate NaN as "missing". Also see the "working with missing data" section in the docs.
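As a rough illustration of the behaviour the question describes, the sketch below reads a small CSV with one empty cell. The CSV text and the column names "id" and "code" are made up for this example and are not from the original question.

```python
import io

import pandas as pd

# Hypothetical two-column CSV; the second data row has an empty "code" cell.
csv_text = "id,code\n1,a1\n2,\n3,c3\n"
df = pd.read_csv(io.StringIO(csv_text))

# The empty cell is read as a missing value (NaN), not as None or "".
missing = df.loc[1, "code"]
print(missing)
print(pd.isnull(missing))

# The same check works column-wise.
mask = df["code"].isnull().tolist()
print(mask)
```

The empty cell shows up as missing, while the string cells around it are left untouched.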
From the docs, "choice of NA representation":
After years of production use [NaN] has proven, at least in my opinion, to be the best decision given the state of affairs in NumPy and Python in general. The special value NaN (Not-A-Number) is used everywhere as the NA value, and there are API functions isnull and notnull which can be used across the dtypes to detect NA values.
...
Thus, I have chosen the Pythonic "practicality beats purity" approach and traded integer NA capability for a much simpler approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when NAs must be introduced.
Note the "gotcha": an integer Series that contains missing data is upcast to float.
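The integer-to-float promotion can be seen in a small sketch like the one below. The Series contents are arbitrary, and this shows only the default NumPy-backed dtypes.

```python
import numpy as np
import pandas as pd

# A Series of plain integers keeps an integer dtype.
ints = pd.Series([1, 2, 3])
print(ints.dtype)

# Introducing a missing value at construction promotes the data to float64.
with_gap = pd.Series([1, 2, None])
print(with_gap.dtype)

# The same happens when a missing value is introduced later, e.g. by
# reindexing onto a label (3) that does not exist in the original.
reindexed = ints.reindex([0, 1, 2, 3])
print(reindexed.dtype)
```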
In my opinion the main reason to use NaN (over None) is that it can be stored with numpy's float64 dtype, rather than the less efficient object dtype; see NA type promotions.
import numpy as np
import pandas as pd

# without forcing dtype it changes None to NaN!
s_bad = pd.Series([1, None], dtype=object)
s_good = pd.Series([1, np.nan])
In [13]: s_bad.dtype
Out[13]: dtype('O')
In [14]: s_good.dtype
Out[14]: dtype('float64')
Jeff comments (below) on this:
np.nan allows for vectorized operations; it's a float value, while None, by definition, forces object type, which basically disables all efficiency in numpy.
So repeat 3 times: object == bad, float == good
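A minimal sketch of what this means at the NumPy level, using made-up arrays: a float64 array containing NaN supports elementwise arithmetic, whereas an object array containing None falls back to Python-level operations on each element and fails on the None.

```python
import numpy as np

# NaN is itself a float, so the array stays float64.
floats = np.array([1.0, np.nan, 3.0])
# None is not a number, so the array has to hold generic Python objects.
objects = np.array([1.0, None, 3.0], dtype=object)

# Elementwise arithmetic works on the float array; NaN simply propagates.
doubled = floats * 2
# NaN-aware reductions can skip the missing entry.
total = np.nansum(floats)

# On the object array the multiplication is applied per Python object,
# and None does not support it.
try:
    objects * 2
    object_math_failed = False
except TypeError:
    object_math_failed = True

print(floats.dtype, objects.dtype)
print(doubled, total, object_math_failed)
```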
That said, many operations may still work just as well with None as with NaN (but they are perhaps not supported, i.e. they may sometimes give surprising results):
In [15]: s_bad.sum()
Out[15]: 1
In [16]: s_good.sum()
Out[16]: 1.0
To answer the second question:
You should be using pd.isnull and pd.notnull to test for missing data (NaN).
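Applied to the dictionary loop from the question, the check might look like the sketch below. The contents of my_dict are invented for illustration, and items() is used here as the Python 3 counterpart of iteritems().

```python
import numpy as np
import pandas as pd

# Hypothetical dictionary mixing strings with missing values.
my_dict = {"a": "x12", "b": np.nan, "c": "7", "d": None}

# np.isnan only accepts numeric input, so a string value raises TypeError.
try:
    np.isnan(my_dict["a"])
    isnan_failed = False
except TypeError:
    isnan_failed = True

# pd.isnull accepts any scalar and treats both NaN and None as missing.
empty_keys = [k for k, v in my_dict.items() if pd.isnull(v)]
print(isnan_failed)
print(empty_keys)  # ['b', 'd']
```

pd.notnull is the complement, so the non-empty entries can be selected with the same pattern.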