Update dataframe from another dataframe
I have two tables: "transactions", with over 500M rows, and "customers", with over 3M rows.
data <- data.frame(Trans = c(1,2,3,4,5),
                   Cust01 = c("A","B","C","D","F"),
                   Cust02 = c("S","E","","TE","F"),
                   Cust03 = c("F","","D","","F"))
cust_type <- data.frame(Cust = c("A","B","C","D"),
                        Type = c("1","2","3","4"))
dataresult <- data.frame(Trans = c(1,2,3,4,5),
                         Cust01 = c("A","B","C","D","F"),
                         Cust01Type = c("1","2","3","4","5"),
                         Cust02 = c("S","E","","TE","F"),
                         Cust02Type = c("","","","",""),
                         Cust03 = c("F","","D","","F"),
                         Cust03Type = c("","","4","",""))
I would like to add the customer type to the data in an efficient way. In SQL I would normally use multiple left joins. I tried that with dplyr, but it takes forever. I also tried using %in% to get a logical vector and then looping over only the TRUE values.
Does anyone know a better way to do this?
When you want fast performance, nothing beats the data.table package (yet). As your transaction data are currently in wide format, the first step is to convert them to long format. This makes them easier to process.
library(data.table) #v1.9.5
trans_data <- melt(setDT(data), id.vars = "Trans",
                   variable.name = "Cust",   # name of the variable column
                   variable.factor = TRUE,   # store it as a factor instead of a character
                   value.name = "Cvalue"     # name of the value column
                   )[!Cvalue==""]            # remove empty cases
Once that is done, you can join the two data.tables:
# set the keys by which you are joining
setDT(trans_data, key = "Cvalue")
setDT(cust_type, key = "Cust")
# join the customer type into the transaction data
trans_data[cust_type, Ctype:=Type]
This gives:
> trans_data
Trans Cust Cvalue Ctype
1: 1 Cust01 A 1
2: 2 Cust01 B 2
3: 3 Cust01 C 3
4: 4 Cust01 D 4
5: 3 Cust03 D 4
6: 2 Cust02 E NA
7: 5 Cust01 F NA
8: 5 Cust02 F NA
9: 1 Cust03 F NA
10: 5 Cust03 F NA
11: 1 Cust02 S NA
12: 4 Cust02 TE NA
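The keyed update join above can also be written without setting keys, using the `on=` argument that data.table offers from version 1.9.6 onwards. The following is only a sketch on the toy data from the question: the name `long` and the explicit `stringsAsFactors = FALSE` are assumptions made for illustration, not part of the original answer.

```r
library(data.table)

# Toy data from the question, with character columns made explicit
data <- data.frame(Trans = c(1, 2, 3, 4, 5),
                   Cust01 = c("A", "B", "C", "D", "F"),
                   Cust02 = c("S", "E", "", "TE", "F"),
                   Cust03 = c("F", "", "D", "", "F"),
                   stringsAsFactors = FALSE)
cust_type <- data.frame(Cust = c("A", "B", "C", "D"),
                        Type = c("1", "2", "3", "4"),
                        stringsAsFactors = FALSE)

# Wide to long, dropping the empty customer slots
long <- melt(setDT(data), id.vars = "Trans",
             variable.name = "Cust", value.name = "Cvalue")[Cvalue != ""]

# Update join: match long$Cvalue to cust_type$Cust and add Ctype by reference.
# Rows without a match in cust_type keep NA, as in a left join.
long[setDT(cust_type), on = c(Cvalue = "Cust"), Ctype := i.Type]
```

Because `:=` modifies `long` in place, no copy of the large transaction table is made, which is the main reason this pattern tends to scale better than repeated left joins.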
If you want to change the row order of the resulting data.table, you can do that with, for example:
setorder(trans_data, Trans, Cust)
or all at once with:
trans_data <- trans_data[cust_type, Ctype:=Type][order(Trans,Cust)]
which gives:
> trans_data
Trans Cust Cvalue Ctype
1: 1 Cust01 A 1
2: 1 Cust02 S NA
3: 1 Cust03 F NA
4: 2 Cust01 B 2
5: 2 Cust02 E NA
6: 3 Cust01 C 3
7: 3 Cust03 D 4
8: 4 Cust01 D 4
9: 4 Cust02 TE NA
10: 5 Cust01 F NA
11: 5 Cust02 F NA
12: 5 Cust03 F NA
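If the wide layout from the question is needed in the end, the long table can be cast back with `dcast`, which in data.table 1.9.6 and later accepts several `value.var` columns at once. This is a hedged sketch rather than part of the original answer: the long table is rebuilt by hand so the block runs on its own, and `wide` is an illustrative name. The resulting column names take the form `Cvalue_Cust01`, `Ctype_Cust01`, and so on, not the `Cust01`/`Cust01Type` names used in the question.

```r
library(data.table)

# Long table as produced by the steps above (rebuilt here so the sketch runs alone)
trans_data <- data.table(
  Trans  = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5),
  Cust   = c("Cust01", "Cust02", "Cust03", "Cust01", "Cust02", "Cust01",
             "Cust03", "Cust01", "Cust02", "Cust01", "Cust02", "Cust03"),
  Cvalue = c("A", "S", "F", "B", "E", "C", "D", "D", "TE", "F", "F", "F"),
  Ctype  = c("1", NA, NA, "2", NA, "3", "4", "4", NA, NA, NA, NA)
)

# One row per transaction, one Cvalue_* and one Ctype_* column per customer slot;
# slots that were empty in the original data come back as NA
wide <- dcast(trans_data, Trans ~ Cust, value.var = c("Cvalue", "Ctype"))
```

With 500M transactions it is probably worth staying in long format for as long as possible and casting only the subset that actually has to be reported in wide form.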
Note: I used the development version of data.table, with which it is no longer necessary to load the reshape2 package for the melt function.