Merge and compare different columns from different files

Problem description

I'm trying to automate a process I normally do in Excel. The process consists of merging and comparing different columns. For example:

df1: 
sp|P07437|TBB5_HUMAN
sp|P10809|CH60_HUMAN
sp|P424|LPPRC_HUMAN
sp|P474|LRC_HUMAN

df2: 
sp|P07437|TBB5_HUMAN
sp|P10809|CH60_HUMAN
sp|P42704|LPPRC_HUMAN

df3: 
sp|P07437|TBB5_HUMAN
sp|P10788|CH70_HUMAN
sp|P42704|LPPRC_HUMAN

The desired output looks like this:

sp|P07437|TBB5_HUMAN  | sp|P07437|TBB5_HUMAN | sp|P07437|TBB5_HUMAN
sp|P10809|CH60_HUMAN  | sp|P10809|CH60_HUMAN |
                      |                      | sp|P10788|CH70_HUMAN
sp|P424|LPPRC_HUMAN   |                      |
sp|P474|LRC_HUMAN     |                      |
                      | sp|P42704|LPPRC_HUMAN| sp|P42704|LPPRC_HUMAN

I tried the functions compare and merge, but neither gives me this result. Is there another function I can use in this case?

It is more or less like a Venn diagram, which is exactly what I draw afterwards to check that everything is correct.

Here is a reproducible example:

df1 = data.frame(TEST1=c("sp|P07437|TBB5_HUMAN","sp|P10809|CH60_HUMAN", "sp|P424|LPPRC_HUMAN"))

df2 = data.frame(TEST2=c("sp|P07437|TBB5_HUMAN","sp|P10809|CH60_HUMAN","   sp|P42704|LPPRC_HUMAN"))

df3 = data.frame(TEST3=c("sp|P07437|TBB5_HUMAN","sp|P10788|CH70_HUMAN",     "sp|P42704|LPPRC_HUMAN"))

Thank you very much.

Recommended answer

I'm using a slightly modified version of your data that avoids factors. I also trim the extra whitespace (with trimws), assuming it is a copy/paste mistake.

df1 = data.frame(TEST1=c("sp|P07437|TBB5_HUMAN","sp|P10809|CH60_HUMAN", "sp|P424|LPPRC_HUMAN"),
                 stringsAsFactors = FALSE)
df2 = data.frame(TEST2=c("sp|P07437|TBB5_HUMAN","sp|P10809|CH60_HUMAN","   sp|P42704|LPPRC_HUMAN"),
                 stringsAsFactors = FALSE)
df3 = data.frame(TEST3=c("sp|P07437|TBB5_HUMAN","sp|P10788|CH70_HUMAN",     "sp|P42704|LPPRC_HUMAN"),
                 stringsAsFactors = FALSE)

Since this kind of problem can easily grow beyond the initial number of data.frames, I usually prefer to work with a list of data.frames rather than separately named data.frames, if at all possible.

lst <- list(df1, df2, df3)
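
Since the question is about columns that come from different files, the same list can also be built directly from disk. The following is only a sketch under stated assumptions: the file names (TEST1.txt and so on), the temporary directory, and the one-ID-per-line layout without a header are all hypothetical and not part of the original question.

```r
# Hypothetical input files: one protein ID per line, no header row.
# They are written to a temporary directory only to keep the sketch self-contained.
dir <- tempfile("ids_")
dir.create(dir)
writeLines(c("sp|P07437|TBB5_HUMAN", "sp|P10809|CH60_HUMAN", "sp|P424|LPPRC_HUMAN"),
           file.path(dir, "TEST1.txt"))
writeLines(c("sp|P07437|TBB5_HUMAN", "sp|P10809|CH60_HUMAN", "   sp|P42704|LPPRC_HUMAN"),
           file.path(dir, "TEST2.txt"))
writeLines(c("sp|P07437|TBB5_HUMAN", "sp|P10788|CH70_HUMAN", "sp|P42704|LPPRC_HUMAN"),
           file.path(dir, "TEST3.txt"))

files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)

# One single-column data.frame per file; the column is named after the file.
lst <- lapply(files, function(f) {
  ids <- trimws(readLines(f))  # drop stray leading/trailing whitespace
  setNames(data.frame(ids, stringsAsFactors = FALSE),
           tools::file_path_sans_ext(basename(f)))
})

sapply(lst, names)
```

Adding a fourth file then needs no code change, which is the main reason for preferring a list here.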

Now here's one method to get your desired results:

alltests <- unique(trimws(unlist(lst, recursive = TRUE)))
as.data.frame(
  # for each data.frame: look up every ID in its (trimmed) column; NA where absent
  setNames(lapply(lst, function(a) { ids <- trimws(a[,1]); ids[ match(alltests, ids) ] }),
           sapply(lst, names)),
  stringsAsFactors = FALSE
)
#                  TEST1                 TEST2                 TEST3
# 1 sp|P07437|TBB5_HUMAN  sp|P07437|TBB5_HUMAN  sp|P07437|TBB5_HUMAN
# 2 sp|P10809|CH60_HUMAN  sp|P10809|CH60_HUMAN                  <NA>
# 3  sp|P424|LPPRC_HUMAN                  <NA>                  <NA>
# 4                 <NA> sp|P42704|LPPRC_HUMAN sp|P42704|LPPRC_HUMAN
# 5                 <NA>                  <NA>  sp|P10788|CH70_HUMAN
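
For the Venn-diagram style check mentioned in the question, a logical presence/absence matrix can be derived from the same pooled ID vector. This is a sketch, not part of the original answer; the names `present` and `shared` are introduced here for illustration only.

```r
lst <- list(
  data.frame(TEST1 = c("sp|P07437|TBB5_HUMAN", "sp|P10809|CH60_HUMAN", "sp|P424|LPPRC_HUMAN"),
             stringsAsFactors = FALSE),
  data.frame(TEST2 = c("sp|P07437|TBB5_HUMAN", "sp|P10809|CH60_HUMAN", "   sp|P42704|LPPRC_HUMAN"),
             stringsAsFactors = FALSE),
  data.frame(TEST3 = c("sp|P07437|TBB5_HUMAN", "sp|P10788|CH70_HUMAN", "sp|P42704|LPPRC_HUMAN"),
             stringsAsFactors = FALSE)
)

# All distinct IDs across every data.frame, whitespace trimmed.
alltests <- unique(trimws(unlist(lst)))

# Rows are IDs, columns are data.frames; TRUE where the ID occurs.
present <- sapply(lst, function(a) alltests %in% trimws(a[[1]]))
dimnames(present) <- list(alltests, sapply(lst, names))

colSums(present)                                    # number of IDs per data.frame
shared <- alltests[rowSums(present) == length(lst)] # IDs found in every data.frame
shared
```

The row sums give the overlap counts a Venn diagram would display, so the matrix can be compared against the diagram region by region.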

This relies on (1) single-column data.frames (though that can be remedied) and (2) unique column names. Your suggested output did not imply any ordering, so I chose not to sort here. It is easy enough to use alltests <- sort(unique(...)), but note that this is an alphabetic sort of the whole string, not a sort on the numeric portion of the IDs.
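
If ordering by the numeric portion is wanted, one option is to extract the digits of the accession (the second pipe-delimited field) and order on them. This is a minimal sketch that assumes every accession contains a single run of digits, as in the example data; accessions with letters between digits would need a different rule. The names `acc`, `num`, and `sorted` are illustrative.

```r
alltests <- c("sp|P07437|TBB5_HUMAN", "sp|P10809|CH60_HUMAN", "sp|P424|LPPRC_HUMAN",
              "sp|P42704|LPPRC_HUMAN", "sp|P10788|CH70_HUMAN")

# Second field of "sp|P07437|TBB5_HUMAN" is the accession "P07437".
acc <- sapply(strsplit(alltests, "|", fixed = TRUE), `[`, 2)

# Assumption: stripping all non-digits leaves one integer per accession.
num <- as.integer(gsub("[^0-9]", "", acc))

sorted <- alltests[order(num)]
sorted
# [1] "sp|P424|LPPRC_HUMAN"   "sp|P07437|TBB5_HUMAN"  "sp|P10788|CH70_HUMAN"
# [4] "sp|P10809|CH60_HUMAN"  "sp|P42704|LPPRC_HUMAN"
```

With a plain sort(), "sp|P424|LPPRC_HUMAN" would instead land after "sp|P10809|CH60_HUMAN", because the comparison is made character by character.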
