How to create dummy variable columns for thousands of categories in Google BigQuery?

Views: 70  Published: 2018/5/7 17:19:33  Tags: mysql sql google-bigquery dummy-variable

Problem description

I have a simple table with 2 columns: UserID and Category, and each UserID can repeat with a few categories, like so:
  UserID  Category
  ------  --------
  1       A
  1       B
  2       C
  3       A
  3       C
  3       B

I want to "dummify" this table: i.e. to create an output table that has a unique column for each Category consisting of dummy variables (0/1 depending on whether the UserID belongs to that particular Category):
  UserID  A  B  C
  ------  -- -- --
  1       1  1  0
  2       0  0  1
  3       1  1  1

My problem is that I have THOUSANDS of categories (not just 3 as in this example), so this cannot be done efficiently by writing CASE WHEN statements by hand.

So my questions are:

1) Is there a way to "dummify" the Category column in Google BigQuery without using thousands of CASE WHEN statements?

2) Is this a situation where the UDF functionality works well? It seems like it would be the case but I am not familiar enough with UDF in BigQuery to solve this problem. Would someone be able to help out?

Thanks.
Solution

You can use the technique below.

First run query #1. It produces the query (query #2) that you need to run to get the result you need. Please still consider Mosha's comments before going "wild" with thousands of categories :o)

Query #1:

  SELECT 'select UserID, ' +
    GROUP_CONCAT_UNQUOTED(
      'sum(if(category = "' + STRING(category) + '", 1, 0)) as ' + STRING(category)
    )
    + ' from YourTable group by UserID'
  FROM (
    SELECT category
    FROM YourTable
    GROUP BY category
  )

The result will look like the following (query #2):

  SELECT
    UserID,
    SUM(IF(category = "A", 1, 0)) AS A,
    SUM(IF(category = "B", 1, 0)) AS B,
    SUM(IF(category = "C", 1, 0)) AS C
  FROM
    YourTable
  GROUP BY
    UserID

Of course, for three categories you could do it manually, but for thousands it will definitely make your day!
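
If you would rather build the text of query #2 outside BigQuery, the same string concatenation can be sketched in Python. This is only an illustration, not part of the original answer: `build_dummy_query` is a hypothetical helper, and it assumes every category value is also usable as a column name (it raises an error otherwise, since a value such as "my category" would produce an invalid alias).

```python
import re

# Hypothetical helper, not part of the original answer: it does in Python
# what query #1 does in SQL, i.e. it builds the text of query #2.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_dummy_query(categories, table="YourTable"):
    """Return the text of query #2 for the given category values."""
    columns = []
    for category in sorted(set(categories)):
        # Assumption: each category value is also usable as a column name.
        if not _IDENTIFIER.match(category):
            raise ValueError(f"not a valid column name: {category!r}")
        columns.append(f'SUM(IF(category = "{category}", 1, 0)) AS {category}')
    return (
        "SELECT UserID, "
        + ", ".join(columns)
        + f" FROM {table} GROUP BY UserID"
    )


# Duplicate category values are collapsed before the columns are generated.
query = build_dummy_query(["A", "B", "C", "A"])
print(query)
```

The list of categories would come from the same `SELECT category FROM YourTable GROUP BY category` subquery used in query #1; how you fetch it is up to your client of choice.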

The result of query #2 will look as you expect:

  UserID  A  B  C
  1       1  1  0
  2       0  0  1
  3       1  1  1
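
To sanity-check the expected output on a small sample before running anything against the real table, the pivot can be reproduced locally. The sketch below is an assumption-laden illustration in plain Python (the `dummify` function is hypothetical, and the rows are the six sample rows from the question). One difference worth knowing: it uses set membership, so the flags stay 0/1 even if a (UserID, Category) pair appears more than once, whereas `SUM(IF(...))` in query #2 would count each duplicate row.

```python
from collections import defaultdict

# Sample rows from the question: (UserID, Category).
rows = [(1, "A"), (1, "B"), (2, "C"), (3, "A"), (3, "C"), (3, "B")]


def dummify(rows):
    """Return (sorted categories, {UserID: list of 0/1 flags})."""
    categories = sorted({category for _, category in rows})
    seen = defaultdict(set)
    for user_id, category in rows:
        seen[user_id].add(category)
    # One flag per category, in the same order as the categories list.
    table = {
        user_id: [1 if c in cats else 0 for c in categories]
        for user_id, cats in sorted(seen.items())
    }
    return categories, table


categories, table = dummify(rows)
print("UserID", *categories)
for user_id, flags in table.items():
    print(user_id, *flags)
```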