How to create dummy variable columns for thousands of categories in Google BigQuery?

Problem description

I have a simple table with 2 columns: UserID and Category, and each UserID can repeat with a few categories, like so:

UserID   Category
------   --------
1         A
1         B
2         C
3         A
3         C
3         B

I want to "dummify" this table: i.e. to create an output table that has a unique column for each Category consisting of dummy variables (0/1 depending on whether the UserID belongs to that particular Category):

UserID    A  B  C
------    -- -- --
1         1  1  0
2         0  0  1
3         1  1  1

My problem is that I have THOUSANDS of categories (not just 3 as in this example), so this cannot be done efficiently by hand-writing CASE WHEN statements.

So my questions are:

1) Is there a way to "dummify" the Category column in Google BigQuery without using thousands of CASE WHEN statements?

2) Is this a situation where the UDF functionality works well? It seems like it would be the case but I am not familiar enough with UDF in BigQuery to solve this problem. Would someone be able to help out?

Thanks.

Solution

You can use the technique below.

First run query #1. It produces the text of another query (query #2), which you then run to get the result you need. Please still consider Mosha's comments before going "wild" with thousands of categories :o)

Query #1:

SELECT 'select UserID, ' + 
   GROUP_CONCAT_UNQUOTED(
    'sum(if(category = "' + STRING(category) + '", 1, 0)) as ' + STRING(category)
   ) 
   + ' from YourTable group by UserID'
FROM (
  SELECT category 
  FROM YourTable  
  GROUP BY category
)

The result will look like the following (query #2):

SELECT
  UserID,
  SUM(IF(category = "A", 1, 0)) AS A,
  SUM(IF(category = "B", 1, 0)) AS B,
  SUM(IF(category = "C", 1, 0)) AS C
FROM
  YourTable
GROUP BY
  UserID

Of course, for three categories you could write this manually, but for thousands it will definitely make your day!
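The string-building step that query #1 performs can also be done client-side. Below is a minimal Python sketch, assuming the distinct category values have already been fetched (for example with the inner SELECT ... GROUP BY category of query #1). `build_dummy_query` and its parameters are hypothetical names used for illustration, not part of BigQuery, and the sketch assumes every category value is also a valid column name.

```python
import re

# Assumption: a category is usable as a column alias only if it looks like
# a plain identifier (letters, digits, underscore, not starting with a digit).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_dummy_query(categories, table="YourTable"):
    """Return the text of query #2 for the given category values (hypothetical helper)."""
    columns = []
    for category in sorted(set(categories)):  # de-duplicate, stable column order
        if not _IDENTIFIER.match(category):
            raise ValueError(f"category {category!r} is not a valid column name")
        # Same expression that query #1 generates for each category.
        columns.append(f'SUM(IF(category = "{category}", 1, 0)) AS {category}')
    return "SELECT UserID, " + ", ".join(columns) + f" FROM {table} GROUP BY UserID"


print(build_dummy_query(["A", "B", "C"]))
```

Rejecting values that are not plain identifiers is one possible design choice: with thousands of categories, some values are likely to contain spaces or punctuation, and those would need to be mapped to safe column names before being used as aliases.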

The result of query #2 looks as you expect:

UserID  A   B   C    
1       1   1   0    
2       0   0   1    
3       1   1   1    
