Generating binary variables in Pig\R


Problem description

I am working out a design for generating dummy (binary) variables in a Pig script or an R script.

Problem: the input to the Pig script is an arbitrary relation, for example the table below:

    A   B   C
    a1  b1  c1
    a2  b2  c2  
    a1  b1  c3

Suppose we have to generate binary columns based on B and C. The output should be:

    A   B   C   B.b1  B.b2  C.c1  C.c2  C.c3
    a1  b1  c1  1     0     1     0     0
    a2  b2  c2  0     1     0     1     0
    a1  b1  c3  1     0     0     0     1

I think writing a UDF would be the right approach. However, I am not sure how to define the output schema for the UDF, because the column names are supplied by the user and we do not know in advance how many distinct columns need to be generated from the relation. Could somebody please guide me on a high-level design to achieve this? Is it feasible to do this in R, and is there an existing solution available online for this statistical problem?

Recommended answer

You can do this in R with cSplit_e from the splitstackshape package:

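    library(splitstackshape)
    df <- data.frame(A = c('a1', 'a2', 'a1'), B = c('b1', 'b2', 'b1'),
                     C = c('c1', 'c2', 'c3'), stringsAsFactors = FALSE)
    # Pass mode by name so it is not matched positionally to another argument
    cSplit_e(cSplit_e(df, 'B', mode = 'binary', type = 'character', fill = 0),
             'C', mode = 'binary', type = 'character', fill = 0)
    #   A  B  C B_b1 B_b2 C_c1 C_c2 C_c3
    #1 a1 b1 c1    1    0    1    0    0
    #2 a2 b2 c2    0    1    0    1    0
    #3 a1 b1 c3    1    0    0    0    1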
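
If adding a package dependency is not an option, the same encoding can probably be sketched in base R. The helper name `add_dummies` and its `sep` parameter below are illustrative assumptions and not part of the original answer. The point it tries to show is that the distinct values are discovered from the data at run time, so the number of generated columns does not have to be known in advance.

```r
# Hypothetical helper (not from the original answer): append one 0/1 column
# per distinct value of each requested column, using base R only.
add_dummies <- function(df, cols, sep = ".") {
  for (col in cols) {
    values <- as.character(df[[col]])
    # Distinct non-missing values, found at run time
    levels_seen <- sort(unique(values[!is.na(values)]))
    for (lv in levels_seen) {
      # New column is named <column><sep><value>, e.g. "B.b1"
      df[[paste(col, lv, sep = sep)]] <- as.integer(values == lv)
    }
  }
  df
}

# Example input from the question
df <- data.frame(A = c("a1", "a2", "a1"),
                 B = c("b1", "b2", "b1"),
                 C = c("c1", "c2", "c3"),
                 stringsAsFactors = FALSE)

out <- add_dummies(df, c("B", "C"))
print(out)
```

With this sketch the new columns are named with a dot (`B.b1`, `C.c3`), as in the question's desired output, rather than with the underscore that `cSplit_e` produces. Rows whose source value is missing get `NA` in the generated columns.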
