Safely merge data frames by factor columns


Question

Factors can help prevent some kinds of programming errors in R: you cannot perform an equality check on factors that use different levels, and you are warned when performing greater/less-than checks on unordered factors.

a <- factor(letters[1:3])
b <- factor(letters[1:3], levels=letters[4:1])
a == b
## Error in Ops.factor(a, b) : level sets of factors are different
a < a
## [1] NA NA NA
## Warning message:
## In Ops.factor(a, a) : < not meaningful for factors

However, contrary to my expectation, this check is not performed when merging data frames:

ad <- data.frame(x=a, a=as.numeric(a))
bd <- data.frame(x=b, b=as.numeric(b))
merge(ad, bd)
##   x a b
## 1 a 1 4
## 2 b 2 3
## 3 c 3 2

The factors simply seem to be coerced to character vectors.

Is a "safe merge" available somewhere that would perform this check? Do you see specific reasons for not performing it by default?

Example (real-life use case): Assume two spatial data sets with very similar but not identical subdivisions into, say, communes. The data sets refer to slightly different points in time, and some of the communes have merged during that time span. Each data set has a "commune ID" column, perhaps even named identically. While the semantics of this column are very similar, I wouldn't want to (accidentally) merge the data sets over this commune ID column. Instead, I construct a matching table between "old" and "new" commune IDs. If the commune IDs are encoded as factors, a "safe merge" would provide a correctness check for the merge operation at no extra implementation cost and very little computational cost.
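The matching-table approach described above might look roughly like the sketch below. The question does not show any actual data, so every name and value here (`old`, `new`, `match_tab`, `old_id`, `new_id`, the population figures) is invented purely for illustration.

```r
# All names and values below are made up for illustration.
old <- data.frame(old_id = factor(c("A1", "A2", "A3")),
                  pop_old = c(100, 200, 300))
new <- data.frame(new_id = factor(c("B1", "B2")),
                  pop_new = c(310, 305))

# Matching table: communes A1 and A2 merged into B1, A3 became B2.
match_tab <- data.frame(
  old_id = factor(c("A1", "A2", "A3")),
  new_id = factor(c("B1", "B1", "B2"))
)

# Go through the matching table instead of joining the two ID columns directly.
step1  <- merge(old, match_tab, by = "old_id")
linked <- merge(step1, new, by = "new_id")
```

Because the two ID columns carry different names in this sketch, an accidental `merge(old, new)` would not join on them at all; the risk described in the question arises when both columns share a name.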

Recommended answer

The "safeguard" with merge is the by= parameter. It lets you set exactly which columns you think should match. If you match two factor columns, R will use the labels of the values to match them. So "a" will match "a" regardless of how the factor encodes those values internally. That is what a user sees, so that is how the data are merged. It is just like with numeric values: you can choose to merge on columns that have completely different ranges (the first column holds 1:10, the second 100:1000). When by is set, R will do what it is asked. And if you don't explicitly set by, R will find all column names shared by the two data frames and use those.
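As a small sketch of this behaviour, the block below rebuilds `ad` and `bd` from the question so that it runs on its own. `by`, `by.x` and `by.y` are arguments of base `merge`; the column names `key_a` and `key_b` are hypothetical and only serve to show keys with different names.

```r
a <- factor(letters[1:3])
b <- factor(letters[1:3], levels = letters[4:1])
ad <- data.frame(x = a, a = as.numeric(a))
bd <- data.frame(x = b, b = as.numeric(b))

# Name the key column explicitly instead of relying on shared column names.
m1 <- merge(ad, bd, by = "x")

# Rows are matched on the labels, not on the underlying integer codes:
# label "a" has code 1 in `a` but code 4 in `b`.
m1

# With differently named key columns (hypothetical names), use by.x / by.y.
ad2 <- data.frame(key_a = a, a = as.numeric(a))
bd2 <- data.frame(key_b = b, b = as.numeric(b))
m2 <- merge(ad2, bd2, by.x = "key_a", by.y = "key_b")
```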

Also, when merging, you don't always expect every row to match. Sometimes you use all.x or all.y specifically to keep unmatched records. In that case, depending on how the data frames were created, one of them may not know about the levels it does not contain. So it is not at all unreasonable to try to merge them.
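To make the unmatched-records case concrete, here is a minimal sketch with two invented data frames (`left`, `right`) whose key factors have only partly overlapping levels. `all`, `all.x` and `all.y` are arguments of base `merge`; everything else is an assumption for the example.

```r
# Two tables whose key factors know about different levels.
left  <- data.frame(id = factor(c("a", "b", "c")), v1 = 1:3)
right <- data.frame(id = factor(c("b", "c", "d")), v2 = 4:6)

# Inner join: only labels present on both sides survive.
inner <- merge(left, right, by = "id")

# Full outer join: unmatched rows are kept and padded with NA.
outer <- merge(left, right, by = "id", all = TRUE)

# Left join: every row of `left`, matched or not.
left_only <- merge(left, right, by = "id", all.x = TRUE)
```

A check that insisted on identical level sets would reject all three calls, even though keeping the unmatched rows is exactly what the outer and left joins are for.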

So basically R treats factors like character vectors during merging, because it assumes you already know that the two columns belong together.
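If the stricter behaviour asked for in the question is nevertheless wanted, one option is a thin wrapper around `merge`. The sketch below is only that: `safe_merge` is a hypothetical helper, not a function in base R or any package known to this thread, and it assumes the key columns have the same name in both data frames. It compares level sets while ignoring their order, which is the condition under which `==` on two factors raises its error.

```r
# Hypothetical helper, not part of base R: refuse to merge when factor key
# columns do not share the same set of levels.
safe_merge <- function(x, y, by = intersect(names(x), names(y)), ...) {
  for (col in by) {
    fx <- x[[col]]
    fy <- y[[col]]
    if (is.factor(fx) || is.factor(fy)) {
      # Both sides must be factors with the same level set (order ignored).
      same_levels <- is.factor(fx) && is.factor(fy) &&
        setequal(levels(fx), levels(fy))
      if (!same_levels) {
        stop("level sets of key column '", col, "' differ; refusing to merge")
      }
    }
  }
  merge(x, y, by = by, ...)
}

a <- factor(letters[1:3])
b <- factor(letters[1:3], levels = letters[4:1])
ad <- data.frame(x = a, a = as.numeric(a))
bd <- data.frame(x = b, b = as.numeric(b))

# The question's example is now rejected instead of silently merged.
res <- tryCatch(safe_merge(ad, bd), error = function(e) conditionMessage(e))

# Identical level sets pass through to merge() unchanged.
ok <- safe_merge(ad, data.frame(x = a, b = as.numeric(a)))
```

The trade-off is the one raised above: such a wrapper also rejects joins with `all.x` or `all.y` where differing level sets are expected, so it only suits cases where both sides are supposed to use the same set of IDs.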
