R - Concatenate cell in dataframe, by group, depending on another cell value


Problem description

I have a dataset of the following type (first row is the header):

  • content is always text
  • merge is always a logical

id1  id2  start_line end_line content           merge
A    B    1          1        "aaaa"            TRUE
A    B    4          4        "aa mm"           TRUE
A    B    5          5        "boool"           TRUE
A    B    6          6        "omw"             TRUE
C    D    6          6        "hear!"           TRUE
C    D    7          7        " me out!"        TRUE
C    D    21         21       "hello"           FALSE

Problem: I need to merge rows following very specific criteria:

  • Rows that have merge = FALSE must remain as is
  • Rows that have the same id1, the same id2 and consecutive start_line values:
    • Need to be appended on the column content
    • The end_line value needs to change to that of the last row

So, the expected outcome is:

id1  id2  start_line end_line content             merge
A    B    1          1        "aaaa"              TRUE
A    B    4          6        "aa mm boool omw"   TRUE
C    D    6          7        "hear!  me out!"    TRUE
C    D    21         21       "hello"             FALSE

Note in the example:

  • The minimum merge is two rows (example of ids C-D, originally lines 6 and 7)
  • More than two rows can be merged (example of ids A-B, originally rows 2, 3 and 4)

I have attempted a very large and inefficient series of loops that only merges two lines. That is why I am not posting my attempt here.

Recommended answer

Using dplyr you can try:

library(dplyr)

df %>%
  # grp starts a new run whenever start_line jumps by more than 1 from the previous row
  group_by(id1, id2, grp = cumsum(c(TRUE, diff(start_line) > 1))) %>%
  # collapse each run into a single row
  summarise(start_line = first(start_line),
            end_line = last(end_line),
            content = paste(content, collapse = " "),
            merge = any(merge))


#  id1   id2     grp start_line end_line content         merge
#  <chr> <chr> <int>      <int>    <int> <chr>           <lgl>
#1 A     B         1          1        1 aaaa            TRUE 
#2 A     B         2          4        6 aa mm boool omw TRUE 
#3 C     D         2          6        7 hear!  me out!  TRUE 
#4 C     D         3         21       21 hello           FALSE
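
The grouping key `grp` is the core of this approach. As a minimal base R sketch of how it is built, the block below applies the same expression to the start_line values of the A-B rows only; the intermediate names `gaps` and `new_run` are introduced here for illustration and do not appear in the answer.

```r
# Illustrative vector: the start_line values of the A-B rows
start_line <- c(1L, 4L, 5L, 6L)

# Gap between each value and the one before it
gaps <- diff(start_line)

# TRUE marks the start of a new run: the first element, plus every
# position where the gap to the previous value is larger than 1
new_run <- c(TRUE, gaps > 1)

# The running count of run starts gives one label per run
grp <- cumsum(new_run)
print(grp)
```

In the answer the expression is evaluated on the whole data frame before grouping, which is why the C-D rows carry the labels 2 and 3 in the printed output. The rows are assumed to be sorted by start_line within each id1/id2 pair.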
      

Data

df <- structure(list(id1 = c("A", "A", "A", "A", "C", "C", "C"), id2 = c("B", 
"B", "B", "B", "D", "D", "D"), start_line = c(1L, 4L, 5L, 6L, 
6L, 7L, 21L), end_line = c(1L, 4L, 5L, 6L, 6L, 7L, 21L), content = c("aaaa", 
"aa mm", "boool", "omw", "hear!", " me out!", "hello"), merge = c(TRUE, 
TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)), class = "data.frame", 
row.names = c(NA, -7L))
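
The grouping key in the answer looks only at start_line, so a row with merge = FALSE whose start_line directly follows the previous one would also be collapsed. One possible variant is sketched below, under the assumption that a row should join the previous row only when both are merge = TRUE. The names `df2`, `consecutive`, `joins_prev` and `res` are illustrative, and `df2` is the sample data with the "omw" row changed to merge = FALSE so the effect is visible. The `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Sample data with one change for illustration: "omw" is merge = FALSE
df2 <- data.frame(
  id1 = c("A", "A", "A", "A", "C", "C", "C"),
  id2 = c("B", "B", "B", "B", "D", "D", "D"),
  start_line = c(1L, 4L, 5L, 6L, 6L, 7L, 21L),
  end_line   = c(1L, 4L, 5L, 6L, 6L, 7L, 21L),
  content = c("aaaa", "aa mm", "boool", "omw", "hear!", " me out!", "hello"),
  merge = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

res <- df2 %>%
  group_by(id1, id2) %>%
  mutate(
    # TRUE when start_line is exactly one more than in the previous row
    consecutive = c(FALSE, diff(start_line) == 1),
    # a row joins the previous one only if both rows are merge = TRUE
    joins_prev = merge & lag(merge, default = FALSE) & consecutive,
    # every row that does not join its predecessor opens a new run
    grp = cumsum(!joins_prev)
  ) %>%
  group_by(id1, id2, grp) %>%
  summarise(start_line = first(start_line),
            end_line = last(end_line),
            content = paste(content, collapse = " "),
            merge = all(merge),
            .groups = "drop")

print(res)
```

With this data the "aa mm" and "boool" rows are still combined, while "omw" and "hello" stay as separate rows because they are flagged merge = FALSE.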
      
