Generating binary variables in Pig\R
Question
I am working on a design for generating dummy (binary) variables in a Pig script or an R script.
Problem: the input to the Pig script is an arbitrary relation, for example the table below:
A B C
a1 b1 c1
a2 b2 c2
a1 b1 c3
Suppose we have to generate binary columns based on B and C. The output should be:
A B C B.b1 B.b2 C.c1 C.c2 C.c3
a1 b1 c1 1 0 1 0 0
a2 b2 c2 0 1 0 1 0
a1 b1 c3 1 0 0 0 1
I think writing a UDF would be the right approach. However, I am not sure how to define the output schema for the UDF, because the column names are supplied by the user and we do not know in advance how many distinct columns need to be generated from the relation. Could somebody please guide me with a high-level design to achieve this? Is it feasible in R, and is there a ready-made solution available online for this statistical problem?
Recommended answer
You could try cSplit_e from the splitstackshape package in R:
library(splitstackshape)
# df holds the input table from the question (columns A, B, C)
cSplit_e(cSplit_e(df, 'B', mode='binary', type='character', fill=0),
         'C', mode='binary', type='character', fill=0)
# A B C B_b1 B_b2 C_c1 C_c2 C_c3
#1 a1 b1 c1 1 0 1 0 0
#2 a2 b2 c2 0 1 0 1 0
#3 a1 b1 c3 1 0 0 0 1
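When the list of columns to encode is supplied by the user, the same result can be sketched in base R without extra packages. The helper below, add_dummies, is a hypothetical name introduced here for illustration (it is not part of splitstackshape). It assumes each cell holds a single value and that the column names in cols exist in the data frame:

```r
# Hypothetical helper (not from any package): appends one 0/1 column per
# distinct value of every column named in `cols`.
add_dummies <- function(df, cols, sep = "_") {
  for (col in cols) {
    values <- as.character(df[[col]])
    # sort(unique(...)) fixes the column order and drops NA levels
    for (lv in sort(unique(values))) {
      df[[paste(col, lv, sep = sep)]] <- as.integer(values == lv)
    }
  }
  df
}

# The example relation from the question
df <- data.frame(
  A = c("a1", "a2", "a1"),
  B = c("b1", "b2", "b1"),
  C = c("c1", "c2", "c3"),
  stringsAsFactors = FALSE
)

result <- add_dummies(df, c("B", "C"))
print(result)
```

Because the new columns are discovered from the data at run time, the number of output columns does not have to be known in advance; only the names of the columns to encode are passed in.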