Data table - apply the same function on several columns to create new data table columns
Problem description
I am working with the data.table package. I have a data table that represents users' actions on a website. Each user can visit the website and perform multiple actions during a visit. My original data table holds actions (every row is one action), and I want to aggregate it into a new data table grouped by user visit (every visit has a unique ID). Some fields are shared by all actions of the same visit, for example the user name, the user status, and the visit number. At least one action of each visit contains this information, but not necessarily all of them. For each visit (that is, each group of actions with the same visit ID), I want to retrieve the value of each such field and store it in the new visits data table. For example, given the following original data table:
VisitID  ActionNum  UserName  UserStatus  VisitNum  ActionType
aaaaaaa  1          John      Active      5         x
aaaaaaa  2                    Active                y
aaaaaaa  3          John                  5         z
bbbbbbb  1                    NonActive             w
bbbbbbb  2          Dan                   7         t
I want to obtain a visits data table like the following:
VisitID  UserName  UserStatus  VisitNum
aaaaaaa  John      Active      5
bbbbbbb  Dan       NonActive   7
I created a function that takes a subset of the data table (only the rows of one visit) and a field name. It should be applied to several fields (UserName, UserStatus, VisitNum).
getGeneralField <- function(visitDT, field) {
  # Extract the column named by `field` as a vector
  vec <- visitDT[, get(field)]
  # Return the first distinct non-blank value
  return(unique(vec[vec != ""])[1])
}
The problem is that every attempt to apply this function to .SD with by = VisitID gives a result different from what I intended. What is the best way to do this? I used != "" to skip blank cells.
Recommended answer
We specify the columns of interest in .SDcols, group by 'VisitID', loop through those columns with lapply(.SD, ...), and take the first non-blank element of each:
dt[, lapply(.SD, function(x) x[nzchar(x)][1]), by = VisitID, .SDcols = 3:5]
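To make the one-liner above easier to try out, here is a minimal, self-contained sketch that rebuilds the example table from the question and applies the same grouping idea. It rests on a few assumptions that the original post does not confirm: blank cells are stored as empty strings, the three shared fields are character columns, and the helper name `first_non_blank` is made up for illustration. The helper also skips `NA` values, because `nzchar(NA)` is `TRUE` by default and an `NA` would otherwise count as non-blank.

```r
library(data.table)

# Example data from the question. Assumption: blank cells are empty strings
# and the shared fields (UserName, UserStatus, VisitNum) are character.
dt <- data.table(
  VisitID    = c("aaaaaaa", "aaaaaaa", "aaaaaaa", "bbbbbbb", "bbbbbbb"),
  ActionNum  = c(1L, 2L, 3L, 1L, 2L),
  UserName   = c("John", "", "John", "", "Dan"),
  UserStatus = c("Active", "Active", "", "NonActive", ""),
  VisitNum   = c("5", "", "5", "", "7"),
  ActionType = c("x", "y", "z", "w", "t")
)

# Hypothetical helper: first value that is neither NA nor an empty string.
# Returns NA if the group has no such value.
first_non_blank <- function(x) {
  x[!is.na(x) & nzchar(x)][1]
}

# Columns shared by all actions of a visit, selected by name
fields <- c("UserName", "UserStatus", "VisitNum")

# One row per visit, one column per shared field
visits <- dt[, lapply(.SD, first_non_blank), by = VisitID, .SDcols = fields]
print(visits)
```

Selecting the columns by name rather than by position (`3:5`) is a design choice in this sketch: it keeps working if the column order of the actions table changes.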