R: split data frame rows by space, remove common elements, put unequal length columns in new df


Problem description

Suppose I have a df with two rows of strings that I need to split by space, unlist, then find the anti-intersection and reuse in a list. I can do it by brute force, working with each row individually. The problem is that there can be more than 2 rows. My working solution so far is below, but there must be a simpler way that does not access each row separately. Thanks!!

    df = structure(list(A = structure(1:2, .Label = c("R1", "R2"), class = "factor"), 
                        B = c("a b c d e f g o l", 
                              "b h i j k l m n o p q"
                        )), .Names = c("A", "B"), row.names = c(NA, -2L), class = "data.frame")

    dat1 = unlist(strsplit(df[1,2]," "))
    dat2 = unlist(strsplit(df[2,2]," "))

    f <- function (...) 
    {
      aux <- list(...)
      ind <- rep(1:length(aux), sapply(aux, length))
      x <- unlist(aux)
      boo <- !(duplicated(x) | duplicated(x, fromLast = TRUE))
      split(x[boo], ind[boo])
    }

    excl = (f(dat1, dat2))
    L <- list(excl[[1]],excl[[2]])

    cfun <- function(L) {
      pad.na <- function(x,len) {
        c(x,rep("",len-length(x)))
      }
      maxlen <- max(sapply(L,length))
      print(maxlen)
      do.call(data.frame,lapply(L,pad.na,len=maxlen))
    }

    a = cfun(L)

What I have:

    A   B
1   Food    a b c d e f g
2   HABA    b h i j k l m n o p q

What I get:

    c..a....c....d....e....f....g.......... c..h....i....j....k....m....n....p....q..
1   a   h
2   c   i
3   d   j
4   e   k
5   f   m
6   g   n
7       p
8       q

EDIT: The goal is to eliminate common elements from all columns, i.e., if "4" is present in row 1 and seen anywhere else, remove it. New test set:

df1 = structure(list(A = structure(1:3, .Label = c("R1", "R2", "R3"
), class = "factor"), B = c("1 4 78 5 4 6 7 0", "2 3 76 8 2 1 8 0", 
"4 7 1 2")), .Names = c("A", "B"), row.names = c(NA, -3L), class = "data.frame")

Current output of the suggested code:

    a   b   c
1   4   2   4
2   78  3   7
3   5   76  2
4   4   8   NA
5   6   2   NA
6   7   8   NA
7   0   0   NA

2, 4, and 7 should not be there, as they are seen in more than 1 column. Bottom line: the output should consist only of numbers/elements that appear in a single column. Thanks!!

Recommended answer

Here's one way using base R that avoids a lot of your current code:

## split column B on the space character
s <- strsplit(df$B, " ")
## find the intersection of all s
r <- Reduce(intersect, s)
## iterate over s, removing the intersection characters in r
l <- lapply(s, function(x) x[!x %in% r])
## reset the length of each vector in l to the length of the longest vector
## then create the new data frame
setNames(as.data.frame(lapply(l, "length<-", max(lengths(l)))), letters[seq_along(l)])
#      a b
# 1    a h
# 2    c i
# 3    d j
# 4    e k
# 5    f m
# 6    g n
# 7 <NA> p
# 8 <NA> q

I think this is what you're shooting for?
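
Note that `Reduce(intersect, s)` only finds elements present in every row, which is why the edited three-row test set still shows 2, 4, and 7. A minimal sketch for the edited requirement (drop anything seen in more than one row) might look like the following. It rebuilds `df1` with `data.frame()` for convenience, and the names `counts`, `shared`, `n`, and `out` are illustrative. It also assumes that repeats within a single row should be collapsed with `unique()`, which is one reading of "unique numbers/elements only".

```r
# the edited test set from the question, rebuilt with data.frame()
df1 <- data.frame(
  A = c("R1", "R2", "R3"),
  B = c("1 4 78 5 4 6 7 0", "2 3 76 8 2 1 8 0", "4 7 1 2"),
  stringsAsFactors = FALSE
)

## split column B on the space character
s <- strsplit(df1$B, " ")
## count in how many rows each element occurs
## (unique() first, so a repeat inside one row counts only once)
counts <- table(unlist(lapply(s, unique)))
## elements seen in more than one row
shared <- names(counts)[counts > 1]
## drop the shared elements; unique() is an assumption about within-row repeats
l <- lapply(s, function(x) unique(x[!x %in% shared]))
l <- setNames(l, letters[seq_along(l)])
## pad every vector with NA to the longest length, then build the data frame
n <- max(lengths(l))
out <- as.data.frame(lapply(l, "length<-", n), stringsAsFactors = FALSE)
out
# column a: 78 5 6; column b: 3 76 8; column c: all NA
```

Here 0, 1, 2, 4, and 7 are all removed because each occurs in at least two rows, so the third row ends up empty and its column is filled with `NA`.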

Note that lengths() is a new function in the base package as of R version 3.2.0; it is a faster, more efficient replacement for sapply(x, length) on a list.
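
As a quick illustration of that equivalence, the sketch below compares the two calls on a small list (`parts` is a made-up example, not data from the question):

```r
# a small list of character vectors, one of them empty
parts <- list(c("a", "c"), character(0), c("h", "i", "j"))

lengths(parts)
# 2 0 3
sapply(parts, length)
# 2 0 3

# both return the same integer vector for this list
identical(lengths(parts), sapply(parts, length))
# TRUE
```

One practical difference beyond speed: on an empty list, `sapply(list(), length)` returns an empty list, while `lengths(list())` returns `integer(0)`, so `lengths()` gives a predictable type.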
