如何在Google Bigquery中将字符串转换为列名? [英] How to convert strings into column names in Google Bigquery?

查看:32
本文介绍了如何在Google Bigquery中将字符串转换为列名?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在Google Bigquery中有一个大数据集,其中有数百万行要清除的脏数据(应用程序跟踪).我的问题之一是,同一数据被发送到应用程序中触发的不同事件的不同列.我的意思是说,对于某些事件,可能会将国家/地区发送到自定义维度1,而对于其他事件,会将国家/地区发送到自定义维度147.我无法发布实际数据,但是 SELECT * FROM table_with_dirty_data 会生成如下内容:

I have a large data set in Google Bigquery with millions of rows of dirty data (App tracking) that I am trying to clean up. One of my problems is that the same data got sent to different columns for different events triggered in the App. By this I mean that maybe the country was sent to custom dimension 1 for some events, but to custom dimension 147 for other events. I can't post actual data, but a SELECT * FROM table_with_dirty_data would produce something like this:

date | session | eventAction | cd001 | cd002    | cd004    | cd005 
-----|---------|-------------|-------|----------|----------|-------
1    | 1       | 'event_1'   | '1'   | 'Pizza'  | null     | '21'
1    | 1       | 'event_2'   | '10'  | '25'     | 'Pizza'  | '14.56'
1    | 1       | 'event_3'   | '3.1' | null     | '15'     | 'France'
1    | 2       | 'event_1'   | '6'   | 'Burger' | null     | '21'
1    | 2       | 'event_2'   | '21'  | '25'     | 'Burger' | '12.6'

这里的最终目标是最终得到一个可以分析的干净表.像这样:

The ultimate goal here is to end up with a clean table that can be analyzed. Something like this:

date | session | eventAction | country  | vendor   | product  | price
-----|---------|-------------|----------|----------|----------|-------
1    | 1       | 'event_1'   | 'France' | '1'      | 'Pizza'  | '14.56'
1    | 1       | 'event_2'   | 'France' | '1'      | 'Pizza'  | '14.56'
1    | 1       | 'event_3'   | 'France' | '1'      | 'Pizza'  | '14.56'
1    | 2       | 'event_1'   | 'Spain'  | '25'     | 'Burger' | '12.6'
1    | 2       | 'event_2'   | 'Spain'  | '25'     | 'Burger' | '12.6'

我知道有些事件将需要某种程度的统计插补和数据类型转换,但是现在我只关心将每个变量放入其自己的列中.因此,我创建了一个辅助表(我们称其为 matrix ),如下所示:

I am aware that some events will need some degree of statistical imputation and data type casting, but for now I am only concerned with getting every variable into its own column. So I've created an auxiliary table (let's call it matrix) that looks like this:

event_name | variable_1 | variable_2 | variable_3
-----------|------------|------------|----------
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'
'event_2'  | 'cd020'    | 'cd146'    | 'cd056'
'event_3'  | 'cd001'    | 'cd020'    | 'cd035'

,依此类推,其中 variable _#列的每个单元格中的值是 table_with_dirty_data 中可找到信息的列名.也就是说,对于具有事件名称'event_1'和'even_2'的事件,可以在名为 cd020 的列中找到 variable_1 ,但是可以在名为 cd001 variable_1 >用于具有event_name'event_3'的事件.因此,基本上 matrix 所做的就是将每个事件的每个变量发送到哪个自定义维度.

and so on, where the value in each cell of column variable_# is the column name in the table_with_dirty_data where the information can be found. That is, variable_1 can be found in a column called cd020 for events with event_name 'event_1' and 'even_2', but in a column called cd001 for events with event_name 'event_3'. So basically what matrix does is to map to which custom dimension each variable was sent for each event.

现在...我在 table_with_dirty_data 上有数百个不同的事件,而 matrix 拥有所有200个GA自定义维度,所以做类似的事情

Now...I have hundreds of different events on table_with_dirty_data, and matrix holds all 200 GA custom dimensions, so doing something like

SELECT 
  CASE 
    WHEN event_name = 'event_1' THEN cd020
    WHEN event_name = 'event_2' THEN cd020
    WHEN event_name = 'event_3' THEN cd001
  END AS variable_1
  , CASE
    WHEN event_name = 'event_1' THEN cd035
    WHEN event_name = 'event_2' THEN cd146
    WHEN event_name = 'event_3' THEN cd020
    END AS variable_2
  , CASE
    WHEN event_name = 'event_1' THEN cd120
    WHEN event_name = 'event_2' THEN cd056
    WHEN event_name = 'event_3' THEN cd035
  END AS variable_3
FROM table_with_dirty_data

会让我花很多时间,而且很容易出错.我想做的是使用 SELECT 语句,该语句从 table_with_dirty_data 中返回信息所在的列名( cd ### )存储并使用 WHILE 遍历所有不同事件.因此,例如,在 event_name ='event_1'的情况下,将是这样的:

would take me a looooong time and is very error-prone. What I am trying to do is to use SELECT statements that return the column name (cd###) from the table_with_dirty_data where the information is stored and the use WHILE to loop over all different events. So, for example, with event_name = 'event_1' it would be something like this:

SELECT 
  CASE WHEN event_name = 'event_1' THEN (SELECT variable_1 FROM matrix WHERE event_name = 'event_1') END AS variable_1
  , CASE WHEN event_name = 'event_1' THEN (SELECT variable_2 FROM matrix WHERE event_name = 'event_1') END AS variable_2
  , CASE WHEN event_name = 'event_1' THEN (SELECT variable_3 FROM matrix WHERE event_name = 'event_1') END AS variable_3
FROM table_with_dirty_data
WHERE event_name = 'event_1'

这里的目标是,然后我可以遍历具有所有 event_name 的数组(可以轻松实现的东西).最终,我需要将所有事件都放在一个表中,但是我可以为每个事件准备一个表,只要可以通过编程即可完成(我个人不知道这在GBQ中是否可能...进行检查).

The goal here is that I can then loop over an array with all the event_names (something that can be easily achieved). Ultimately I need all events in one table, but I'm ok with having a table for curating each event, as long as it can be done programmatically (something I personally don't know if it's even possible in GBQ...would have to check on that).

问题是我正在使用的 SELECT 语句被评估为字符串,因此 CASE 子句中的查询结果是字符串文字.例如,如果我要为event ='event_1'运行它,

The issue is that the SELECT statements I am using get evaluated to a string, so the result of the queries inside the CASE clause are string literals. For example, if I were to run it for event = 'event_1',

SELECT variable_1 FROM matrix WHERE event_name = 'event_1'

评估为

'cd020'

然后导致外部查询变为

SELECT 
  CASE WHEN event_name = 'event_1' THEN 'cd020' END AS variable_1
  , CASE WHEN event_name = 'event_1' THEN 'cd035' END AS variable_2
  , CASE WHEN event_name = 'event_1' THEN 'cd120' END AS variable_3
FROM table_with_dirty_data
WHERE event_name = 'event_1'

产生这样的表格

event_name | variable_1 | variable_2 | variable_3
-----------|------------|------------|----------
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'

每次发现"event_1"事件时,都不会返回存储在 cd020 cd035 cd120 列中的值.

instead of returning the value stored in columns cd020, cd035 or cd120 for each time the event 'event_1' is found.

有人知道取消对内部查询结果的引用的方法,以便在执行外部查询时将其转换为列名(因此"cd020"变为 cd020 )吗?

Does anyone know a way to unquote the result of those inner queries so that when the outer query executes, they are transformed into column names (so 'cd020' becomes cd020)??

PS:如果有人知道,我也愿意采取一种完全不同的策略.

PS: I am also open to a totally different strategy if anyone knows one.

推荐答案

听起来像

Sounds like a case where EXECUTE IMMEDIATE might be helpful. I'm not sure I fully understand the mapping, but it should allow you to compose a dynamic statement based off your matrix table.

但是,根据描述并不清楚所生成的查询可能有多笨拙,因为您提到有数百种此类事件,并且可能开始遇到其他问题,例如查询复杂性或长度限制.

However, it is unclear based on the description how unwieldy the generated query might be, as you mention having hundreds of these kinds of events and you may start running into other problems such as query complexity or length limits.

这篇关于如何在Google Bigquery中将字符串转换为列名?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆