Concatenate alternate characters from different columns in R programming

Question

I have a df with 2 columns. I need to combine Col1 and Col2 into Col3 by pairing the ">"-separated pieces in order (a1-b1; a2-b2; a3-b3; ...).

Example

|      Col1       |       Col2       |             Col3             |
| abcd > de > efg | ppppp > ppt > pp | abcd-ppppp > de-ppt > efg-pp |
| hij > kl > iiii | aaa > bbb > hhh  | hij-aaa > kl-bbb > iiii-hhh  |
| aa              | fff              | aa-fff                       |
| a > bbb         | pp > a           | a-pp > bbb-a                 |

....

How can I do that in R? Thanks.

Recommended answer

This was a pain to solve. In the future, for our sanity, please consider how you output your data. This could have been solved easily if downstream analysis had been considered when the data was generated. Anyway, enough whinging, here is the solution.

Let's generate your data:

Col1 <- c("abcd > de > efg", "hij > kl > iiii", "aa", "a > bbb")
Col2 <- c("ppppp > ppt > pp", "aaa > bbb > hhh", "fff", "pp > a")
dat <- data.frame(Col1, Col2, stringsAsFactors = FALSE)

Next, using apply, we strip, separate and flatten Col1 and Col2, and add the first separator "-":

l1 <- apply(dat, 2, function(x) trimws(unlist(strsplit(x, split = ">"))))
l2 <- apply(l1, 1, function(x) paste0(x[1], "-", x[2]))
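The step above relies on every row having the same number of ">"-separated pieces in Col1 and Col2; otherwise the flattened columns no longer line up. A minimal sketch of a sanity check, assuming the same sample data (the names n1 and n2 are only illustrative):

```r
# Same sample data as in the answer
Col1 <- c("abcd > de > efg", "hij > kl > iiii", "aa", "a > bbb")
Col2 <- c("ppppp > ppt > pp", "aaa > bbb > hhh", "fff", "pp > a")
dat <- data.frame(Col1, Col2, stringsAsFactors = FALSE)

# Number of pieces per row in each column (n1, n2 are illustrative names)
n1 <- lengths(strsplit(dat$Col1, ">", fixed = TRUE))
n2 <- lengths(strsplit(dat$Col2, ">", fixed = TRUE))

# Stop early if any row has a different number of pieces in the two columns
stopifnot(identical(n1, n2))
```

If the check fails, the pairing by position in l2 would silently combine pieces from different rows, so it seems worth running before the apply calls.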

The next part was surprisingly difficult. After much googling I found a solution (a hack) to split a list of characters by a numeric vector.

#thanks: https://techoverflow.net/2012/11/10/r-count-occurrences-of-character-in-string/
#gets occurrences of ">" for later use
countCharOccurrences <- function(char, s) {
  s2 <- gsub(char,"",s)
  return (nchar(s) - nchar(s2))
}

o <- countCharOccurrences(">", dat$Col1)+1
df <- as.data.frame(l2, stringsAsFactors = FALSE)
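As an aside, the count of pieces per row can also be obtained with base R's gregexpr and regmatches instead of a helper function. This is only a sketch of an alternative, not the method used in the answer; the names s, counts and o_alt are illustrative:

```r
# Col1 values from the sample data
s <- c("abcd > de > efg", "hij > kl > iiii", "aa", "a > bbb")

# Find every literal ">" in each string; rows without a match give an empty result
counts <- lengths(regmatches(s, gregexpr(">", s, fixed = TRUE)))

# Number of pieces is one more than the number of separators
o_alt <- counts + 1L
```

Using fixed = TRUE treats ">" as a literal character rather than a regular expression, which matters if the separator is ever changed to something like "." or "|".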

Split df by the occurrences of ">" (i.e. the values of o):

# Thanks to this SO answer:
# https://stackoverflow.com/questions/27132290/split-dataframe-by-row-number-in-r
l2a <- split(df, cumsum(c(TRUE,(1:nrow(df) %in% cumsum(o))[-nrow(df)])))
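The cumsum expression above builds a group index of 1 1 1 2 2 2 3 4 4 for the sample data. A possibly easier-to-read sketch of the same grouping uses rep; the values of o and the pairs are copied in by hand here so the snippet runs on its own, and grp, pieces and col3 are illustrative names:

```r
# Pieces per row and the "-"-joined pairs, as produced by the earlier steps
o <- c(3, 3, 1, 2)
pairs <- c("abcd-ppppp", "de-ppt", "efg-pp",
           "hij-aaa", "kl-bbb", "iiii-hhh",
           "aa-fff", "a-pp", "bbb-a")

# Repeat each row index once per piece: 1 1 1 2 2 2 3 4 4
grp <- rep(seq_along(o), times = o)

# Split the pairs by row and collapse each group with " > "
pieces <- split(pairs, grp)
col3 <- unname(vapply(pieces, paste, character(1), collapse = " > "))
```

Because vapply returns a plain character vector, the result can be assigned to a data frame column without a separate unlist step.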

Finally, we collapse the list of data frames and add the final separator ">":

l3 <- lapply(l2a, function(x) paste(x[,1], collapse = " > "))

Then combine with your starting data frame:

dat$Col3 <- l3

             Col1             Col2                         Col3
1 abcd > de > efg ppppp > ppt > pp abcd-ppppp > de-ppt > efg-pp
2 hij > kl > iiii  aaa > bbb > hhh  hij-aaa > kl-bbb > iiii-hhh
3              aa              fff                       aa-fff
4         a > bbb           pp > a                 a-pp > bbb-a

Tada!

Edit: I had forgotten that l3 is a list of objects. You need to use unlist to flatten it, like this:

dat$Col3 <- unlist(l3)
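For comparison, here is a more compact sketch that works row by row with mapply rather than flattening and re-splitting. It is an alternative under the same assumptions (matching piece counts in both columns), not the approach of the answer above; combine_pairs is a hypothetical helper name:

```r
Col1 <- c("abcd > de > efg", "hij > kl > iiii", "aa", "a > bbb")
Col2 <- c("ppppp > ppt > pp", "aaa > bbb > hhh", "fff", "pp > a")
dat <- data.frame(Col1, Col2, stringsAsFactors = FALSE)

# Hypothetical helper: pair up the pieces of one Col1 value and one Col2 value
combine_pairs <- function(a, b) {
  x <- trimws(strsplit(a, ">", fixed = TRUE)[[1]])
  y <- trimws(strsplit(b, ">", fixed = TRUE)[[1]])
  stopifnot(length(x) == length(y))  # both cells must have the same number of pieces
  paste(paste(x, y, sep = "-"), collapse = " > ")
}

# Apply the helper to each row; USE.NAMES = FALSE keeps the result unnamed
dat$Col3 <- mapply(combine_pairs, dat$Col1, dat$Col2, USE.NAMES = FALSE)
```

Working per row avoids the need to count separators and rebuild the groups afterwards, at the cost of one function call per row.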
