Compare every 2 rows and show mismatches in R


Question

I have searched a lot and tried on my own too, but couldn't find a solution for this particular problem.

For every 2 rows (each pair shares the same 'key'), I have to find the mismatches in every column and report them in an organized way, as shown below.

The output should be in the following format:

COLUMN_NAME is not matching for records below:
PRINT COMPLETE RECORDS
...
COLUMN_NAME is not matching for records below:
PRINT COMPLETE RECORDS
...
COLUMN_NAME is not matching for records below:
PRINT COMPLETE RECORDS
...

Input data (it is a data frame):

key V1  V2  V3  V4  V5
a1  1   2   3   4   5
a1  1   3   9   4   5
a5  2   1   4   7   5
a5  2   1   4   7   6
a6  7   6   8   9   6
a6  7   6   3   9   6
a9  7   6   8   9   4
a9  7   6   8   9   3

Output:

V2 is not matching for records below:
key V1  V2  V3  V4  V5
a1  1   2   3   4   5
a1  1   3   9   4   5


V3 is not matching for records below:
key V1  V2  V3  V4  V5
a1  1   2   3   4   5
a1  1   3   9   4   5
a6  7   6   8   9   6
a6  7   6   3   9   6


V5 is not matching for records below:
key V1  V2  V3  V4  V5
a5  2   1   4   7   5
a5  2   1   4   7   6
a9  7   6   8   9   4
a9  7   6   8   9   3

I'm a beginner in R, so please be nice :)

Recommended answer

First, split your data frame by key:

dfs <- split(df, df$key)  # presuming your data frame is named `df`
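
To follow along, the question's table can be rebuilt as a data frame first. This is only a sketch: the name `df` is the one presumed above, and reading the values as numeric columns via `read.table` is an assumption about how the data is actually stored.

```r
# Rebuild the example data from the question (column types are an assumption).
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
key V1 V2 V3 V4 V5
a1  1  2  3  4  5
a1  1  3  9  4  5
a5  2  1  4  7  5
a5  2  1  4  7  6
a6  7  6  8  9  6
a6  7  6  3  9  6
a9  7  6  8  9  4
a9  7  6  8  9  3
")

# One small data frame per key, each holding the two rows to compare.
dfs <- split(df, df$key)
names(dfs)          # the four keys: a1, a5, a6, a9
sapply(dfs, nrow)   # every piece has 2 rows
```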

Now write a function that takes a data frame and compares its first and second rows (for simplicity, we're not going to check whether the data frame actually has 2 rows - that's just taken for granted):

chk <- function(x) sapply(x, function(u) u[1]==u[2])
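
To see what `chk` returns, it can be tried on a single pair. The stricter variant underneath, `chk_strict`, is a hypothetical helper and not part of the answer: it adds the row-count check that was skipped above and treats a comparison involving `NA` as a mismatch, which is an assumption about what you would want.

```r
chk <- function(x) sapply(x, function(u) u[1] == u[2])

# A single two-row pair, made up for illustration.
pair <- data.frame(key = c("a1", "a1"), V1 = c(1, 1), V2 = c(2, 3),
                   stringsAsFactors = FALSE)
res <- chk(pair)
print(res)   # a named logical vector: TRUE for key and V1, FALSE for V2

# Hypothetical stricter variant (an assumption, not from the answer):
# fails loudly unless there are exactly 2 rows, and counts NA as "not matching".
chk_strict <- function(x) {
  stopifnot(nrow(x) == 2)
  vapply(x, function(u) isTRUE(u[1] == u[2]), logical(1))
}
res_strict <- chk_strict(pair)
```

`vapply` is used in the variant only so that the result is always a logical vector, even for odd inputs.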

And now apply that function to the split data:

matches <- sapply(dfs, chk)
## so `matches` is a matrix showing, for each variable (row) and each key
## (column), whether there is a match or not
apply(matches, 1, function(x) colnames(matches)[which(!x)])
## and this one takes each row of `matches` and extracts the column name
## (i.e. the key) for every FALSE-valued cell, that is, for every mismatch.
## the result is a list - note that some of the elements will be empty
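
If the shape of `matches` is hard to picture, running these steps on the question's data may help. This is a sketch that assumes the data frame is rebuilt with `read.table` as below; the variable `keys_by_var` is just an illustrative name.

```r
# Example data from the question (rebuilt here so the snippet runs on its own).
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
key V1 V2 V3 V4 V5
a1  1  2  3  4  5
a1  1  3  9  4  5
a5  2  1  4  7  5
a5  2  1  4  7  6
a6  7  6  8  9  6
a6  7  6  3  9  6
a9  7  6  8  9  4
a9  7  6  8  9  3
")

dfs <- split(df, df$key)
chk <- function(x) sapply(x, function(u) u[1] == u[2])

# Rows are the variables (key, V1..V5), columns are the keys (a1, a5, a6, a9).
matches <- sapply(dfs, chk)
print(matches)

# For each variable, the keys whose two rows differ.
keys_by_var <- apply(matches, 1, function(x) colnames(matches)[which(!x)])
keys_by_var$V3   # the keys a1 and a6
```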

The last line outputs, for each variable, the names (from the key column) of the non-matching pairs.

Now the final step:

mm_keys <- apply(matches, 1, function(x) colnames(matches)[which(!x)])
# mm_keys stands for mismatching keys
lapply(mm_keys, function(x) subset(df, key %in% x))
# this one, called `mm_lines` below, takes each element from mm_keys
# .. and extracts (via `subset`) the corresponding lines from the original data frame

OK, with this you already have all the information you wanted, just not formatted nicely. You can do that easily too.

mm_lines <- lapply(mm_keys, function(x) subset(df, key %in% x))
mm_lines <- mm_lines[sapply(mm_lines, nrow)>0]  
# leave out variables where there is no mismatch
# for understanding this, try what `sapply(mm_lines, nrow)` does
# and add labels the way you want:
names(mm_lines) <- paste(names(mm_lines), "IS NOT MATCHING FOR RECORDS BELOW:")

Now the output:

print(mm_lines)
#$`V2 IS NOT MATCHING FOR RECORDS BELOW:`
#  key V1 V2 V3 V4 V5
#1  a1  1  2  3  4  5
#2  a1  1  3  9  4  5
#
#$`V3 IS NOT MATCHING FOR RECORDS BELOW:`
#  key V1 V2 V3 V4 V5
#1  a1  1  2  3  4  5
#2  a1  1  3  9  4  5
#5  a6  7  6  8  9  6
#6  a6  7  6  3  9  6
#
#$`V5 IS NOT MATCHING FOR RECORDS BELOW:`
#  key V1 V2 V3 V4 V5
#3  a5  2  1  4  7  5
#4  a5  2  1  4  7  6
#7  a9  7  6  8  9  4
#8  a9  7  6  8  9  3

[edit]

Since you asked for it, here is something that does it all in one line and looks a bit more like magic:

boo <- (function(x) x[sapply(x, nrow)>0])(lapply(lapply(df, function(x) tapply(x, df$key, function(x) x[1]!=x[2])), function(x) subset(df, key %in% names(which(x)))))
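
For readers who find the one-liner hard to follow, here is the same computation unpacked into three steps. It is meant as a sketch rather than a replacement: the intermediate names `differs` and `records` are made up for illustration, and the data frame is rebuilt with `read.table` so the snippet runs on its own.

```r
# Example data from the question.
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
key V1 V2 V3 V4 V5
a1  1  2  3  4  5
a1  1  3  9  4  5
a5  2  1  4  7  5
a5  2  1  4  7  6
a6  7  6  8  9  6
a6  7  6  3  9  6
a9  7  6  8  9  4
a9  7  6  8  9  3
")

# Step 1: for each column, one flag per key: TRUE where the two rows differ.
differs <- lapply(df, function(col) {
  tapply(col, df$key, function(v) v[1] != v[2])
})

# Step 2: for each column, the complete records of the flagged keys.
records <- lapply(differs, function(flag) {
  subset(df, key %in% names(which(flag)))
})

# Step 3: keep only the columns that have at least one mismatch.
boo <- records[sapply(records, nrow) > 0]
names(boo)   # V2, V3 and V5 for this data
```

Note that `boo` is named by plain column names here; the descriptive label is only added when the result is written out.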

And for writing it to a text file ("out.txt") the way you wanted:

sink("out.txt")
for (iii in seq_along(boo)) {
  cat(names(boo)[iii], "IS NOT MATCHING FOR THE RECORDS BELOW:\n")
  print(boo[[iii]])
  cat("\n")
}
sink(NULL)
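
One thing to be aware of with `sink`: if an error occurs inside the loop, the diversion stays active until `sink()` is called again. A possible alternative, sketched below, writes through an explicit file connection that is closed on exit. The helper name `write_mismatches` and the small example list are hypothetical and not part of the answer.

```r
# Hypothetical helper (not from the answer): write each element of a named
# list of data frames to `path`, preceded by its label.
write_mismatches <- function(x, path) {
  con <- file(path, open = "wt")
  on.exit(close(con))   # the file is closed even if something fails
  for (nm in names(x)) {
    cat(nm, "IS NOT MATCHING FOR THE RECORDS BELOW:\n", file = con)
    writeLines(capture.output(print(x[[nm]])), con)
    cat("\n", file = con)
  }
  invisible(path)
}

# Usage with a tiny made-up result in the same shape as `boo`.
example <- list(V5 = data.frame(key = c("a5", "a5"), V5 = c(5, 6)))
out <- tempfile(fileext = ".txt")
write_mismatches(example, out)
written <- readLines(out)
```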
