如何在Google Bigquery中将字符串转换为列名? [英] How to convert strings into column names in Google Bigquery?
问题描述
我在Google Bigquery中有一个大数据集,其中有数百万行要清除的脏数据(应用程序跟踪).我的问题之一是,同一数据被发送到应用程序中触发的不同事件的不同列.我的意思是说,对于某些事件,可能会将国家/地区发送到自定义维度1,而对于其他事件,会将国家/地区发送到自定义维度147.我无法发布实际数据,但是 SELECT * FROM table_with_dirty_data
会生成如下内容:
I have a large data set in Google Bigquery with millions of rows of dirty data (App tracking) that I am trying to clean up. One of my problems is that the same data got sent to different columns for different events triggered in the App. By this I mean that maybe the country was sent to custom dimension 1 for some events, but to custom dimension 147 for other events. I can't post actual data, but a SELECT * FROM table_with_dirty_data
would produce something like this:
date | session | eventAction | cd001 | cd002 | cd004 | cd005
-----|---------|-------------|-------|----------|----------|-------
1 | 1 | 'event_1' | '1' | 'Pizza' | null | '21'
1 | 1 | 'event_2' | '10' | '25' | 'Pizza' | '14.56'
1 | 1 | 'event_3' | '3.1' | null | '15' | 'France'
1 | 2 | 'event_1' | '6' | 'Burger' | null | '21'
1 | 2 | 'event_2' | '21' | '25' | 'Burger' | '12.6'
这里的最终目标是最终得到一个可以分析的干净表.像这样:
The ultimate goal here is to end up with a clean table that can be analyzed. Something like this:
date | session | eventAction | country | vendor | product | price
-----|---------|-------------|----------|----------|----------|-------
1 | 1 | 'event_1' | 'France' | '1' | 'Pizza' | '14.56'
1 | 1 | 'event_2' | 'France' | '1' | 'Pizza' | '14.56'
1 | 1 | 'event_3' | 'France' | '1' | 'Pizza' | '14.56'
1 | 2 | 'event_1' | 'Spain' | '25' | 'Burger' | '12.6'
1 | 2 | 'event_2' | 'Spain' | '25' | 'Burger' | '12.6'
我知道有些事件将需要某种程度的统计插补和数据类型转换,但是现在我只关心将每个变量放入其自己的列中.因此,我创建了一个辅助表(我们称其为 matrix
),如下所示:
I am aware that some events will need some degree of statistical imputation and data type casting, but for now I am only concerned with getting every variable into its own column. So I've created an auxiliary table (let's call it matrix
) that looks like this:
event_name | variable_1 | variable_2 | variable_3
-----------|------------|------------|----------
'event_1' | 'cd020' | 'cd035' | 'cd120'
'event_2' | 'cd020' | 'cd146' | 'cd056'
'event_3' | 'cd001' | 'cd020' | 'cd035'
,依此类推,其中 variable _#
列的每个单元格中的值是 table_with_dirty_data
中可找到信息的列名.也就是说,对于具有事件名称'event_1'和'even_2'的事件,可以在名为 cd020
的列中找到 variable_1
,但是可以在名为 cd001
variable_1 >用于具有event_name'event_3'的事件.因此,基本上 matrix
所做的就是将每个事件的每个变量发送到哪个自定义维度.
and so on, where the value in each cell of column variable_#
is the column name in the table_with_dirty_data
where the information can be found. That is, variable_1
can be found in a column called cd020
for events with event_name 'event_1' and 'even_2', but in a column called cd001
for events with event_name 'event_3'. So basically what matrix
does is to map to which custom dimension each variable was sent for each event.
现在...我在 table_with_dirty_data
上有数百个不同的事件,而 matrix
拥有所有200个GA自定义维度,所以做类似的事情
Now...I have hundreds of different events on table_with_dirty_data
, and matrix
holds all 200 GA custom dimensions, so doing something like
SELECT
CASE
WHEN event_name = 'event_1' THEN cd020
WHEN event_name = 'event_2' THEN cd020
WHEN event_name = 'event_3' THEN cd001
END AS variable_1
, CASE
WHEN event_name = 'event_1' THEN cd035
WHEN event_name = 'event_2' THEN cd146
WHEN event_name = 'event_3' THEN cd020
END AS variable_2
, CASE
WHEN event_name = 'event_1' THEN cd120
WHEN event_name = 'event_2' THEN cd056
WHEN event_name = 'event_3' THEN cd035
END AS variable_3
FROM table_with_dirty_data
会让我花很多时间,而且很容易出错.我想做的是使用 SELECT
语句,该语句从 table_with_dirty_data
中返回信息所在的列名( cd ###
)存储并使用 WHILE
遍历所有不同事件.因此,例如,在 event_name
='event_1'的情况下,将是这样的:
would take me a looooong time and is very error-prone. What I am trying to do is to use SELECT
statements that return the column name (cd###
) from the table_with_dirty_data
where the information is stored and the use WHILE
to loop over all different events. So, for example, with event_name
= 'event_1' it would be something like this:
SELECT
CASE WHEN event_name = 'event_1' THEN (SELECT variable_1 FROM matrix WHERE event_name = 'event_1') END AS variable_1
, CASE WHEN event_name = 'event_1' THEN (SELECT variable_2 FROM matrix WHERE event_name = 'event_1') END AS variable_2
, CASE WHEN event_name = 'event_1' THEN (SELECT variable_3 FROM matrix WHERE event_name = 'event_1') END AS variable_3
FROM table_with_dirty_data
WHERE event_name = 'event_1'
这里的目标是,然后我可以遍历具有所有 event_name
的数组(可以轻松实现的东西).最终,我需要将所有事件都放在一个表中,但是我可以为每个事件准备一个表,只要可以通过编程即可完成(我个人不知道这在GBQ中是否可能...进行检查).
The goal here is that I can then loop over an array with all the event_name
s (something that can be easily achieved). Ultimately I need all events in one table, but I'm ok with having a table for curating each event, as long as it can be done programmatically (something I personally don't know if it's even possible in GBQ...would have to check on that).
问题是我正在使用的 SELECT
语句被评估为字符串,因此 CASE
子句中的查询结果是字符串文字.例如,如果我要为event ='event_1'运行它,
The issue is that the SELECT
statements I am using get evaluated to a string, so the result of the queries inside the CASE
clause are string literals. For example, if I were to run it for event = 'event_1',
SELECT variable_1 FROM matrix WHERE event_name = 'event_1'
评估为
'cd020'
然后导致外部查询变为
SELECT
CASE WHEN event_name = 'event_1' THEN 'cd020' END AS variable_1
, CASE WHEN event_name = 'event_1' THEN 'cd035' END AS variable_2
, CASE WHEN event_name = 'event_1' THEN 'cd120' END AS variable_3
FROM table_with_dirty_data
WHERE event_name = 'event_1'
产生这样的表格
event_name | variable_1 | variable_2 | variable_3
-----------|------------|------------|----------
'event_1' | 'cd020' | 'cd035' | 'cd120'
'event_1' | 'cd020' | 'cd035' | 'cd120'
'event_1' | 'cd020' | 'cd035' | 'cd120'
每次发现"event_1"事件时,都不会返回存储在 cd020
, cd035
或 cd120
列中的值.
instead of returning the value stored in columns cd020
, cd035
or cd120
for each time the event 'event_1' is found.
有人知道取消对内部查询结果的引用的方法,以便在执行外部查询时将其转换为列名(因此"cd020"变为 cd020
)吗?
Does anyone know a way to unquote the result of those inner queries so that when the outer query executes, they are transformed into column names (so 'cd020' becomes cd020
)??
PS:如果有人知道,我也愿意采取一种完全不同的策略.
PS: I am also open to a totally different strategy if anyone knows one.
推荐答案
Sounds like a case where EXECUTE IMMEDIATE might be helpful. I'm not sure I fully understand the mapping, but it should allow you to compose a dynamic statement based off your matrix
table.
但是,根据描述并不清楚所生成的查询可能有多笨拙,因为您提到有数百种此类事件,并且可能开始遇到其他问题,例如查询复杂性或长度限制.
However, it is unclear based on the description how unwieldy the generated query might be, as you mention having hundreds of these kinds of events and you may start running into other problems such as query complexity or length limits.
这篇关于如何在Google Bigquery中将字符串转换为列名?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!