R Dataframe: aggregating strings within column, across rows, by group

Problem description

I have what seems like a very inefficient solution to a peculiar problem. I have text data which, for various reasons, is broken across rows of a dataframe at random intervals. However, certain subsets of rows are known to belong together based on unique combinations of other variables in the dataframe. See, for example, a MWE demonstrating the structure and my initial solution:

# Data
df <- read.table(text="page passage  person index text
1  123   A   1 hello      
1  123   A   2 my
1  123   A   3 name
1  123   A   4 is
1  123   A   5 guy
1  124   B   1 well
1  124   B   2 hello
1  124   B   3 guy", header=TRUE, stringsAsFactors=FALSE)

master<-data.frame()
for (i in 123:max(df$passage)) {
  print(paste0('passage ',i))
  tempset <- df[df$passage==i,]
  concat<-''
  for (j in 1:nrow(tempset)) {
    print(paste0('index ',j))
    concat<-paste(concat, tempset$text[j])
  }
  tempdf<-data.frame(tempset$page[1],tempset$passage[1], tempset$person[1], concat, stringsAsFactors = FALSE)
  master<-rbind(master, tempdf)
  rm(concat, tempset, tempdf)
}
master
> master
  tempset.page.1. tempset.passage.1. tempset.person.1.                concat
1               1                123                 A  hello my name is guy
2               1                124                 B        well hello guy

In this example, as in my real case, "passage" is the unique grouping variable, so it is not strictly necessary to carry the other columns along with it, although I'd like them available in my dataset.

My current estimate is that the procedure I have devised will take several hours for a dataset that is otherwise easily handled by R on my computer. Perhaps there are some efficiencies to be gained, either from other functions or packages, or from not creating and removing so many objects?

Thanks for any help here!

Solution

Here are two ways:

base R

aggregate(
    text ~ page + passage + person, 
    data=df, 
    FUN=paste, collapse=' '
)
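
To make the base R approach concrete, here is a minimal self-contained sketch that rebuilds the sample data from the question and runs the same aggregate() call. The explicit sort by index and the variable name result are additions for illustration: the sort assumes the pieces of each passage should be joined in index order, which the question implies but does not state.

```r
# Sample data from the question
df <- read.table(text = "page passage person index text
1 123 A 1 hello
1 123 A 2 my
1 123 A 3 name
1 123 A 4 is
1 123 A 5 guy
1 124 B 1 well
1 124 B 2 hello
1 124 B 3 guy", header = TRUE, stringsAsFactors = FALSE)

# Assumption: text pieces should be joined in index order within a passage,
# so sort first in case the rows arrive shuffled.
df <- df[order(df$passage, df$index), ]

# One row per (page, passage, person) group; collapse = " " is passed
# through aggregate() to paste(), which joins each group's text pieces.
result <- aggregate(text ~ page + passage + person, data = df,
                    FUN = paste, collapse = " ")
print(result)
```

This replaces both loops and the repeated rbind() of the original attempt with a single vectorized call, which is likely where most of the time saving comes from.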

dplyr

library(dplyr)
df %>% 
    group_by(page, passage, person) %>%
    summarize(text = paste(text, collapse = ' '))
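
Since the question notes that passage alone identifies a group, a possible variation is to group on passage only and carry the other columns along from the first row of each group. The sketch below does this in base R; the names texts and meta are hypothetical, and it assumes page and person are constant within a passage, as they are in the sample data.

```r
# Sample data from the question
df <- read.table(text = "page passage person index text
1 123 A 1 hello
1 123 A 2 my
1 123 A 3 name
1 123 A 4 is
1 123 A 5 guy
1 124 B 1 well
1 124 B 2 hello
1 124 B 3 guy", header = TRUE, stringsAsFactors = FALSE)

# Assumption: pieces are joined in index order within each passage
df <- df[order(df$passage, df$index), ]

# Split the text by passage and collapse each piece list into one string;
# the result is a character vector named by passage value.
texts <- vapply(split(df$text, df$passage), paste, character(1), collapse = " ")

# Keep the first row of each passage for the descriptive columns
# (assumes page and person do not vary within a passage).
meta <- df[!duplicated(df$passage), c("page", "passage", "person")]
meta$text <- unname(texts[as.character(meta$passage)])
rownames(meta) <- NULL
print(meta)
```

If page or person can vary within a passage, grouping on all three columns as in the answer above is the safer choice.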
