Find variable combinations that make a primary key in R


Problem description

Here is my toy data frame.

df <- tibble::tribble(
  ~var1, ~var2, ~var3, ~var4, ~var5, ~var6, ~var7,
    "A",   "C",    1L,    5L,  "AA",  "AB",    1L,
    "A",   "C",    2L,    5L,  "BB",  "AC",    2L,
    "A",   "D",    1L,    7L,  "AA",  "BC",    2L,
    "A",   "D",    2L,    3L,  "BB",  "CC",    1L,
    "B",   "C",    1L,    8L,  "AA",  "AB",    1L,
    "B",   "C",    2L,    6L,  "BB",  "AC",    2L,
    "B",   "D",    1L,    9L,  "AA",  "BC",    2L,
    "B",   "D",    2L,    6L,  "BB",  "CC",    1L)

How can I find the smallest combinations of variables that uniquely identify the observations in the data frame, i.e. which variables together can form a primary key?

My approach is to look for combinations of variables whose number of distinct value combinations equals the number of observations in the data frame; in this case, combinations that yield 8 distinct rows. Trying combinations by hand, I found a few:

library(dplyr)  # provides %>% and distinct()

df %>% distinct(var1, var2, var3)

df %>% distinct(var1, var2, var5)

df %>% distinct(var1, var3, var7)
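The manual check above can be wrapped in a small helper. This is a minimal sketch in base R, not part of the original question: `is_key()` is a hypothetical name, and the toy data is rebuilt with `data.frame()` only so the snippet runs without extra packages.

```r
# Toy data from the question, rebuilt with base R so the sketch has no dependencies.
df <- data.frame(
  var1 = c("A", "A", "A", "A", "B", "B", "B", "B"),
  var2 = c("C", "C", "D", "D", "C", "C", "D", "D"),
  var3 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  var4 = c(5L, 5L, 7L, 3L, 8L, 6L, 9L, 6L),
  var5 = c("AA", "BB", "AA", "BB", "AA", "BB", "AA", "BB"),
  var6 = c("AB", "AC", "BC", "CC", "AB", "AC", "BC", "CC"),
  var7 = c(1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L),
  stringsAsFactors = FALSE
)

# is_key() is a hypothetical helper: TRUE when no two rows share the same
# values in the columns named by `vars`.
is_key <- function(data, vars) {
  anyDuplicated(data[vars]) == 0L
}

is_key(df, c("var1", "var2", "var3"))  # TRUE: 8 distinct rows
is_key(df, c("var1", "var2"))          # FALSE: only 4 distinct rows
```

`anyDuplicated()` returns 0 when every row is unique, so the comparison is equivalent to checking that the distinct-row count equals `nrow(data)`.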

So (var1, var2, var3), (var1, var2, var5), and (var1, var3, var7) each qualify as a primary key here. How can I find such variable combinations programmatically in R? If possible, character, factor, date, and (maybe) integer variables should be preferred, since doubles should not be part of a primary key.

The output could be a list or a data frame stating the combinations "var1, var2, var3", "var1, var2, var5", "var1, var3, var7".
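One possible way to honour the type preference is to restrict the candidate columns before searching for keys. The sketch below is an assumption about how that could look, not something taken from the question or its answers: `example_df`, its `score` column, and `is_candidate()` are all hypothetical and exist only to show a double column being dropped.

```r
# Hypothetical example data: `score` is a double column added only to show the filter.
example_df <- data.frame(
  id    = c("a", "b", "c"),
  grp   = factor(c("x", "x", "y")),
  day   = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03")),
  n     = c(1L, 2L, 3L),
  score = c(0.5, 1.5, 2.5),
  stringsAsFactors = FALSE
)

# Keep character, factor, Date, and integer columns; plain doubles are excluded.
# Date is tested by class because dates are stored as doubles internally.
is_candidate <- function(col) {
  is.character(col) || is.factor(col) || inherits(col, "Date") || is.integer(col)
}

candidates <- names(example_df)[vapply(example_df, is_candidate, logical(1))]
candidates
```

Any key search can then be run over `example_df[candidates]` instead of the full data frame.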

Recommended answer

A bit of a variation on the other answers, but here's the requested tabular output:

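# Enumerate every combination of column names, from single columns up to all columns
nms <- unlist(lapply(seq_along(df), combn, x = names(df), simplify = FALSE), recursive = FALSE)
# For each combination, count the distinct rows it produces
out <- data.frame(
  vars = vapply(nms, paste, collapse = ",", FUN.VALUE = character(1)),
  counts = vapply(nms, function(x) nrow(unique(df[x])), FUN.VALUE = integer(1))
)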

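Then take the first combination whose distinct-row count equals the number of rows. Because the combinations are ordered from fewest to most variables, this is a key with the fewest variables: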

out[match(nrow(df), out$counts),]
#        vars counts
#12 var1,var6      8
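`match()` returns only the first qualifying row, whereas the question asks for every combination. A minimal sketch of one way to collect all keys of the smallest size, stopping as soon as a size yields at least one key, could look like this. It is a variation rather than the answerer's code, and the toy data is rebuilt with base R so the snippet is self-contained.

```r
# Toy data from the question, rebuilt with base R so the sketch has no dependencies.
df <- data.frame(
  var1 = c("A", "A", "A", "A", "B", "B", "B", "B"),
  var2 = c("C", "C", "D", "D", "C", "C", "D", "D"),
  var3 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  var4 = c(5L, 5L, 7L, 3L, 8L, 6L, 9L, 6L),
  var5 = c("AA", "BB", "AA", "BB", "AA", "BB", "AA", "BB"),
  var6 = c("AB", "AC", "BC", "CC", "AB", "AC", "BC", "CC"),
  var7 = c(1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L),
  stringsAsFactors = FALSE
)

n <- nrow(df)
keys <- list()
for (m in seq_along(df)) {
  # All combinations of exactly m column names
  combos <- combn(names(df), m, simplify = FALSE)
  # A combination is a key when it yields as many distinct rows as the data has rows
  unique_rows <- vapply(combos, function(v) nrow(unique(df[v])) == n, logical(1))
  if (any(unique_rows)) {
    keys <- combos[unique_rows]
    break  # stop at the smallest size that yields at least one key
  }
}

min_keys <- vapply(keys, paste, character(1), collapse = ",")
min_keys
# "var1,var6" "var4,var6" "var4,var7"
```

Searching size by size avoids building all 127 combinations when a small key exists. For this toy data the smallest keys have two variables, so the three-variable combinations listed in the question are keys but not minimal ones.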
