Generating binary variables in Pig\R

Question

I am working on a design for generating dummy (binary) variables in a Pig script or an R script.

Problem: the input to the Pig script is an arbitrary relation, for example the table below:

    A   B   C
    a1  b1  c1
    a2  b2  c2  
    a1  b1  c3

Suppose we have to generate binary columns based on B and C. The output should be:

    A   B   C   B.b1  B.b2  C.c1  C.c2  C.c3
    a1  b1  c1  1     0     1     0     0
    a2  b2  c2  0     1     0     1     0
    a1  b1  c3  1     0     0     0     1

I think writing a UDF would be the right approach. However, I am not sure how to define the output schema for the UDF, because the column names are supplied by the user and we don't know in advance how many distinct columns need to be generated for the relation. Could somebody please guide me toward a high-level design to achieve this? Is it feasible in R, and is there an existing solution for this problem?

Recommended answer

You could try cSplit_e from library(splitstackshape) in R:

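 library(splitstackshape)
 # Sample data from the question
 df <- data.frame(A = c('a1', 'a2', 'a1'), B = c('b1', 'b2', 'b1'),
                  C = c('c1', 'c2', 'c3'), stringsAsFactors = FALSE)
 # Expand B, then C, into 0/1 indicator columns (mode named explicitly)
 cSplit_e(cSplit_e(df, 'B', type = 'character', fill = 0, mode = 'binary'),
          'C', type = 'character', fill = 0, mode = 'binary')
 #   A  B  C B_b1 B_b2 C_c1 C_c2 C_c3
 #1 a1 b1 c1    1    0    1    0    0
 #2 a2 b2 c2    0    1    0    1    0
 #3 a1 b1 c3    1    0    0    0    1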
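
If installing a package is not an option, the same idea can be sketched in base R. This is only an illustration, not part of the original answer: `add_indicators` is a hypothetical helper name, and the `.` separator is chosen to match the column names in the question. It works in two passes per column, first collecting the distinct values (which fixes the set of output columns, the same unknown the question raises about the UDF schema), then filling one 0/1 column per value.

```r
# Sample data from the question
df <- data.frame(A = c("a1", "a2", "a1"),
                 B = c("b1", "b2", "b1"),
                 C = c("c1", "c2", "c3"),
                 stringsAsFactors = FALSE)

# Hypothetical helper (not from any package): for each user-supplied column,
# add one 0/1 indicator column per distinct value found in that column.
add_indicators <- function(data, cols) {
  for (col in cols) {
    # Pass 1: the distinct values determine which columns get created
    for (lv in sort(unique(data[[col]]))) {
      # Pass 2: 1 where the row holds this value, 0 otherwise
      data[[paste(col, lv, sep = ".")]] <- as.integer(data[[col]] == lv)
    }
  }
  data
}

result <- add_indicators(df, c("B", "C"))
print(result)
```

Because the column list is passed in as an argument, the caller decides which columns to expand, and the number of generated columns follows from the data rather than being fixed ahead of time.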
