Merging files on the basis of columns

Problem description

I have multiple files, each with many rows and three columns, and I need to merge them on the basis of matching values in the first two columns.

File1

12   13   a
13   15   b
14   17   c
 4    9   d
 .    .   .
 .    .   .
 81   23  h

File2

12   13   e
3    10   b
14   17   c
 4    9   j
 .    .   .
 .    .   .
 1   2    k

File3

12   13   m
13   15   k
1     7   x
24    9   d
 .    .   .
 .    .   .
 1   2    h

and so on. I want to merge them to obtain the following result:

12   13   a  e  m
13   15   b  k
14   17   c  c
 4    9   d  j
 3   10   b
24    9   d
 .    .   .
 .    .   .
 81   23  h
  1    2  k
  1    7  x

Recommended answer

The first thing that usually comes to mind with these types of problems is merge, perhaps in conjunction with Reduce(function(x, y) merge(x, y, by = "somecols", all = TRUE), yourListOfDataFrames).

However, merge is not always the most efficient function, especially since it looks like you want to "collapse" all the values so that each row is filled in from left to right, which is not the default merge behavior.
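To make the difference concrete, here is a minimal base R sketch of the Reduce + merge idea applied to the three sample files. The renaming of the value columns to V3_1, V3_2, V3_3 is an assumption made for this illustration, so that merge does not have to disambiguate duplicate column names.

```r
# Sample data with the same values as the three example files.
file1 <- data.frame(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
                    V3 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
                    V3 = c("e", "b", "c", "j"), stringsAsFactors = FALSE)
file3 <- data.frame(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
                    V3 = c("m", "k", "x", "d"), stringsAsFactors = FALSE)

dfs <- list(file1, file2, file3)

# Hypothetical renaming step: give each value column a unique name.
for (i in seq_along(dfs)) names(dfs[[i]])[3] <- paste0("V3_", i)

# Full outer join of all data.frames on the two key columns.
merged <- Reduce(function(x, y) merge(x, y, by = c("V1", "V2"), all = TRUE), dfs)
merged
```

Each value stays in the column of the file it came from, so the key (13, 15) ends up with a gap in the middle column rather than having "b" and "k" side by side. That is why a plain merge does not give the requested layout.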

Instead, I suggest you stack everything into one long data.frame and reshape it after you have added an index variable.
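As a rough illustration of the stack-then-reshape idea without any add-on packages, the following base R sketch uses ave() to build the index and reshape() to go wide. It is not part of the original answer, and the column names V3.1, V3.2, V3.3 are simply what reshape() generates.

```r
# Sample data with the same values as the three example files.
file1 <- data.frame(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
                    V3 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
                    V3 = c("e", "b", "c", "j"), stringsAsFactors = FALSE)
file3 <- data.frame(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
                    V3 = c("m", "k", "x", "d"), stringsAsFactors = FALSE)

# Step 1: stack everything into one long data.frame.
long <- rbind(file1, file2, file3)

# Step 2: index variable, counting 1, 2, 3, ... within each (V1, V2) pair.
long$time <- ave(seq_len(nrow(long)), long$V1, long$V2, FUN = seq_along)

# Step 3: reshape from long to wide, one value column per index.
wide <- reshape(long, idvar = c("V1", "V2"), timevar = "time",
                direction = "wide")
wide
```

Because the index counts occurrences per key rather than recording the source file, the values fill each row from left to right. Keys with fewer values get NA in the remaining columns.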

Here are two approaches:

Option 1: "dplyr" + "tidyr"

  1. Use mget to put all of your data.frames into a list.
  2. Use rbind_all to convert that list into a single data.frame.
  3. Use group_by and then sequence(n()) inside mutate, both from "dplyr", to group the data and create an index.
  4. Use spread from "tidyr" to transform from a "long" format to a "wide" format.

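library(dplyr)
library(tidyr)

# Stack every data.frame whose name starts with "file" plus a digit.
# Note: rbind_all() was deprecated in later dplyr releases in favor of bind_rows().
combined <- rbind_all(mget(ls(pattern = "^file\\d")))
combined %>%
  group_by(V1, V2) %>%                  # one group per (V1, V2) key
  mutate(time = sequence(n())) %>%      # 1, 2, 3, ... within each key
  ungroup() %>% 
  spread(time, V3, fill = "")           # one column per index, "" where missing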
# Source: local data frame [7 x 5]
# 
#   V1 V2 1 2 3
# 1  1  7 x    
# 2  3 10 b    
# 3  4  9 d j  
# 4 12 13 a e m
# 5 13 15 b k  
# 6 14 17 c c  
# 7 24  9 d  
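
The code above dates from older dplyr and tidyr releases. The following is a sketch of the same four steps, assuming current versions of both packages, with bind_rows() in place of rbind_all(), row_number() in place of sequence(n()), and pivot_wider() in place of spread(). It is an adaptation, not the original answer's code. Rows appear in order of first occurrence rather than sorted by key.

```r
library(dplyr)
library(tidyr)

# Sample data with the same values as the three example files.
file1 <- data.frame(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L),
                    V3 = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
file2 <- data.frame(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L),
                    V3 = c("e", "b", "c", "j"), stringsAsFactors = FALSE)
file3 <- data.frame(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L),
                    V3 = c("m", "k", "x", "d"), stringsAsFactors = FALSE)

wide <- bind_rows(file1, file2, file3) %>%   # stack into one long table
  group_by(V1, V2) %>%                       # one group per key
  mutate(time = row_number()) %>%            # index within each key
  ungroup() %>%
  pivot_wider(names_from = time, values_from = V3,
              values_fill = list(V3 = ""))   # blanks where a key has no value
wide
```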

Option 2: "data.table"

  1. Use mget to put all of your data.frames into a list.
  2. Use rbindlist to convert that list into a single data.table.
  3. Use sequence(.N) to generate your sequence by your groups.
  4. Use dcast.data.table to convert the "long" data.table into a "wide" one.

library(data.table)

dcast.data.table(
  rbindlist(mget(ls(pattern = "^file\\d")))[, 
    time := sequence(.N), by = list(V1, V2)],
  V1 + V2 ~ time, value.var = "V3", fill = "")
#    V1 V2 1 2 3
# 1:  1  7 x    
# 2:  3 10 b    
# 3:  4  9 d j  
# 4: 12 13 a e m
# 5: 13 15 b k  
# 6: 14 17 c c  
# 7: 24  9 d    


Both of these answers assume we are starting with the following sample data:

file1 <- structure(
  list(V1 = c(12L, 13L, 14L, 4L), V2 = c(13L, 15L, 17L, 9L), 
       V3 = c("a", "b", "c", "d")), .Names = c("V1", "V2", "V3"), 
  class = "data.frame", row.names = c(NA, -4L))
file2 <- structure(
  list(V1 = c(12L, 3L, 14L, 4L), V2 = c(13L, 10L, 17L, 9L), 
       V3 = c("e", "b", "c", "j")), .Names = c("V1", "V2", "V3"), 
  class = "data.frame", row.names = c(NA, -4L))
file3 <- structure(
  list(V1 = c(12L, 13L, 1L, 24L), V2 = c(13L, 15L, 7L, 9L), 
       V3 = c("m", "k", "x", "d")), .Names = c("V1", "V2", "V3"), 
  class = "data.frame", row.names = c(NA, -4L))
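
The question talks about files on disk, while the answers start from data.frames already in the workspace. As a sketch of how to bridge that gap, the following reads every matching file into a list with read.table(). The directory, the file names (file1.txt and so on), and the whitespace-separated layout are assumptions made for this illustration. The demo writes its own small files to a temporary directory so that it runs on its own.

```r
# Hypothetical setup: write three small whitespace-separated files.
demo_dir <- file.path(tempdir(), "merge_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("12 13 a", "13 15 b", "14 17 c", " 4  9 d"),
           file.path(demo_dir, "file1.txt"))
writeLines(c("12 13 e", " 3 10 b", "14 17 c", " 4  9 j"),
           file.path(demo_dir, "file2.txt"))
writeLines(c("12 13 m", "13 15 k", " 1  7 x", "24  9 d"),
           file.path(demo_dir, "file3.txt"))

# Find the files and read each one; read.table() names the columns V1, V2, V3.
paths <- sort(list.files(demo_dir, pattern = "^file\\d+\\.txt$",
                         full.names = TRUE))
dfs <- lapply(paths, read.table, header = FALSE,
              colClasses = c("integer", "integer", "character"))
names(dfs) <- basename(paths)

# The list can be stacked directly, which replaces the mget(ls(...)) step.
long <- do.call(rbind, dfs)
str(long)
```

Working from a list like dfs avoids creating one workspace variable per file. The list can be passed to the stacking function of either option in place of mget(ls(pattern = "^file\\d")).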
