Indexing Julia's DataArrays with included NA values

Question

I am wondering why indexing Julia's DataArrays with NA values is not possible. Executing the snippet below results in an error (NAException("cannot index an array with a DataArray containing NA values")):

dm = data([1 4 7; 2 5 8; 3 1 9])
dm[dm .== 5] = NA

dm[dm .< 3] = 1  #Error
dm[(!isna(dm)) & (dm .< 3)] = 1  #Working

There is a solution for ignoring NAs in a DataFrame with isna(), as answered here. At first glance it works as it should, and ignoring NAs in DataFrames is the same approach as for DataArrays, because each column of a DataFrame is a DataArray, as stated here. But in my opinion, ignoring missing values with !isna() in each condition is not the best solution.

For me it's not clear why the DataArrays module throws an error if NAs are included. If the Boolean array used for indexing contains NA values, those values should be converted to false, as MATLAB® or Python's Pandas does. In the DataArrays module's source code (shown below), in indexing.jl, there is an explicit function that throws the NAException:

# Indexing with NA throws an error
function Base.to_index(A::DataArray)
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end

If you change the snippet by setting the NAs to false ...

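# Modified version: NA entries of the index are set to false first
# (note that this mutates A in place), so the check below no longer fires
function Base.to_index(A::DataArray)
    A[A.na] = false
    any(A.na) && throw(NAException("cannot index an array with a DataArray containing NA values"))
    Base.to_index(A.data)
end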

... dm[dm .< 3] = 1 works as it should (as in MATLAB® or Pandas).

For me it makes no sense to automatically throw an error if NAs are included when indexing. There should at least be a parameter when creating the DataArray that lets the user choose whether NAs are ignored. There are two significant reasons: on the one hand, it is not very pleasant to write and read code when you have formulas with a lot of indexing and NA values (e.g. calculating meteorological grid models); on the other hand, there is a noticeable loss of performance, as this timing test shows:

@timeit dm[(!isna(dm)) & (dm .< 3)] = 1  #14.55 µs per loop  
@timeit dm[dm .< 3] = 1  #754.79 ns per loop
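
The opt-in behaviour asked for above could look roughly like the following. This is a minimal sketch in plain Python rather than Julia, with None standing in for NA. The names NAError, less_than, to_index, assign_where and the na_as_false flag are hypothetical and are not part of DataArrays; the list is the question's matrix flattened column by column.

```python
class NAError(ValueError):
    """Hypothetical stand-in for DataArrays' NAException."""


def less_than(values, bound):
    """Elementwise `value < bound`; a missing value (None) gives a missing result."""
    return [None if v is None else v < bound for v in values]


def to_index(mask, na_as_false=False):
    """Turn a mask that may contain None into a plain list of booleans.

    By default a missing entry raises NAError; with na_as_false=True the
    caller explicitly asks for missing entries to count as False.
    """
    if na_as_false:
        return [False if m is None else bool(m) for m in mask]
    if any(m is None for m in mask):
        raise NAError("cannot index an array with a mask containing NA values")
    return list(mask)


def assign_where(values, mask, new_value, na_as_false=False):
    """Return a copy of `values` with `new_value` wherever the mask is True."""
    keep = to_index(mask, na_as_false)
    return [new_value if k else v for v, k in zip(values, keep)]


# [1 4 7; 2 5 8; 3 1 9] flattened column by column, with 5 replaced by NA
dm = [1, 2, 3, 4, None, 1, 7, 8, 9]
mask = less_than(dm, 3)

# Opt-in: the missing mask entry is treated as False, so the NA element is kept
lenient = assign_where(dm, mask, 1, na_as_false=True)
```

Calling assign_where(dm, mask, 1) without the flag raises NAError, which mirrors the strict default the question complains about.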

Why do the developers use this exception, and is there a simpler approach than !isna() for ignoring NAs in DataArrays?

Recommended answer

Suppose you have three rabbits. You want to put the female rabbit(s) in a separate cage from the males. You look at the first rabbit, and it looks like a male, so you leave it where it is. You look at the second rabbit, and it looks like a female, so you move it to the separate cage. You can't really get a good look at the third rabbit. What should you do?

It depends. Maybe you're fine with leaving the rabbit of unknown sex behind. But if you're separating out the rabbits because you don't want them to make baby rabbits, then you might want your analysis software to tell you that it doesn't know the sex of the third rabbit.

Situations like this arise often when analyzing data. In the most pathological cases, data is missing systematically rather than at random. If you were to survey a bunch of people about how fluffy rabbits are and whether they should be eaten more, you could compare mean(fluffiness[should_be_eaten_more]) and mean(fluffiness[!should_be_eaten_more]). But, if people who really like rabbits are incensed that you're talking about eating them at all, they might leave that second question blank. If you ignore that, you will underestimate the mean fluffiness rating among people who don't think rabbits should be eaten more, which would be a grave mistake. This is why fluffiness[!should_be_eaten_more] will throw an error if there are missing values: It is a sign that whatever you are trying to do with your data may not give the right results. This situation is bad enough that people write entire papers about it, e.g. this one.
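
The size of that bias can be illustrated with a few numbers. The ratings below are invented purely for illustration, None stands in for NA, and the sketch is plain Python rather than DataArrays code. It assumes, as in the scenario above, that the people who skipped the second question are rabbit lovers who would have answered "no".

```python
from statistics import mean

# Invented survey: fluffiness rating and answer to "should rabbits be eaten more?"
fluffiness = [3, 4, 5, 6, 9, 10]
eat_more = [True, True, False, False, None, None]  # None: question left blank

# Treating a blank as False in the index silently drops those respondents
# from the "no" group (and from the "yes" group as well).
observed_no = [f for f, e in zip(fluffiness, eat_more) if e is False]

# What the "no" group would look like if the blanks had answered "no".
assumed_no = [f for f, e in zip(fluffiness, eat_more) if e is not True]

observed_mean = mean(observed_no)  # 5.5
assumed_mean = mean(assumed_no)    # 7.5
```

With these numbers the silently filtered mean is two points lower than the mean including the skipped respondents, and nothing in the result hints that a third of the sample was left out.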

Enough about rabbits. It is possible that there should be (and may someday be) a more concise way to drop/keep all missing values when indexing, but it will always be explicit rather than implicit for the reason described above. As far as performance goes, while there is a slowdown for !isna(x) & (x .< 3) vs x .< 3, the overhead of repeatedly indexing into an array is also high, and DataArrays adds additional overhead on top of that. The relative overhead decreases as the array gets larger. If this is a bottleneck in your code, your best bet is to write it differently.
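
The answer does not say what "differently" should look like. One possible reading is a single explicit pass over the data instead of building two intermediate masks and combining them. The sketch below shows that idea in plain Python with None standing in for NA; replace_below is a hypothetical helper, not a DataArrays function, and no speed-up is claimed for it.

```python
def replace_below(values, bound, new_value):
    """Replace every known value below `bound` in place, in one pass.

    Missing entries (None) are skipped explicitly, so no intermediate
    boolean masks are built and nothing is silently coerced.
    """
    for i, v in enumerate(values):
        if v is not None and v < bound:
            values[i] = new_value
    return values


# The question's matrix flattened column by column, with 5 replaced by NA
dm = [1, 2, 3, 4, None, 1, 7, 8, 9]
result = replace_below(dm, 3, 1)
```

The missing-value check stays visible in the loop body, which keeps the handling explicit in the sense the answer argues for.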
