如何在Google Bigquery中将字符串转换为列名? [英] How to convert strings into column names in Google Bigquery?

查看：32 发布时间：2021/5/12 18:41:57 google-bigquery

本文介绍了如何在Google Bigquery中将字符串转换为列名?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我在Google Bigquery中有一个大数据集，其中有数百万行要清除的脏数据(应用程序跟踪).我的问题之一是，同一数据被发送到应用程序中触发的不同事件的不同列.我的意思是说，对于某些事件，可能会将国家/地区发送到自定义维度1，而对于其他事件，会将国家/地区发送到自定义维度147.我无法发布实际数据，但是 SELECT * FROM table_with_dirty_data 会生成如下内容:

I have a large data set in Google Bigquery with millions of rows of dirty data (App tracking) that I am trying to clean up. One of my problems is that the same data got sent to different columns for different events triggered in the App. By this I mean that maybe the country was sent to custom dimension 1 for some events, but to custom dimension 147 for other events. I can't post actual data, but a SELECT * FROM table_with_dirty_data would produce something like this:

date | session | eventAction | cd001 | cd002    | cd004    | cd005 
-----|---------|-------------|-------|----------|----------|-------
1    | 1       | 'event_1'   | '1'   | 'Pizza'  | null     | '21'
1    | 1       | 'event_2'   | '10'  | '25'     | 'Pizza'  | '14.56'
1    | 1       | 'event_3'   | '3.1' | null     | '15'     | 'France'
1    | 2       | 'event_1'   | '6'   | 'Burger' | null     | '21'
1    | 2       | 'event_2'   | '21'  | '25'     | 'Burger' | '12.6'

这里的最终目标是最终得到一个可以分析的干净表.像这样:

The ultimate goal here is to end up with a clean table that can be analyzed. Something like this:

date | session | eventAction | country  | vendor   | product  | price
-----|---------|-------------|----------|----------|----------|-------
1    | 1       | 'event_1'   | 'France' | '1'      | 'Pizza'  | '14.56'
1    | 1       | 'event_2'   | 'France' | '1'      | 'Pizza'  | '14.56'
1    | 1       | 'event_3'   | 'France' | '1'      | 'Pizza'  | '14.56'
1    | 2       | 'event_1'   | 'Spain'  | '25'     | 'Burger' | '12.6'
1    | 2       | 'event_2'   | 'Spain'  | '25'     | 'Burger' | '12.6'

我知道有些事件将需要某种程度的统计插补和数据类型转换，但是现在我只关心将每个变量放入其自己的列中.因此，我创建了一个辅助表(我们称其为 matrix )，如下所示:

I am aware that some events will need some degree of statistical imputation and data type casting, but for now I am only concerned with getting every variable into its own column. So I've created an auxiliary table (let's call it matrix) that looks like this:

event_name | variable_1 | variable_2 | variable_3
-----------|------------|------------|----------
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'
'event_2'  | 'cd020'    | 'cd146'    | 'cd056'
'event_3'  | 'cd001'    | 'cd020'    | 'cd035'

，依此类推，其中 variable _#列的每个单元格中的值是 table_with_dirty_data 中可找到信息的列名.也就是说，对于具有事件名称'event_1'和'even_2'的事件，可以在名为 cd020 的列中找到 variable_1 ，但是可以在名为 cd001 variable_1 >用于具有event_name'event_3'的事件.因此，基本上 matrix 所做的就是将每个事件的每个变量发送到哪个自定义维度.

and so on, where the value in each cell of column variable_# is the column name in the table_with_dirty_data where the information can be found. That is, variable_1 can be found in a column called cd020 for events with event_name 'event_1' and 'even_2', but in a column called cd001 for events with event_name 'event_3'. So basically what matrix does is to map to which custom dimension each variable was sent for each event.

现在...我在 table_with_dirty_data 上有数百个不同的事件，而 matrix 拥有所有200个GA自定义维度，所以做类似的事情

Now...I have hundreds of different events on table_with_dirty_data, and matrix holds all 200 GA custom dimensions, so doing something like

SELECT 
  CASE 
    WHEN event_name = 'event_1' THEN cd020
    WHEN event_name = 'event_2' THEN cd020
    WHEN event_name = 'event_3' THEN cd001
  END AS variable_1
  , CASE
    WHEN event_name = 'event_1' THEN cd035
    WHEN event_name = 'event_2' THEN cd146
    WHEN event_name = 'event_3' THEN cd020
    END AS variable_2
  , CASE
    WHEN event_name = 'event_1' THEN cd120
    WHEN event_name = 'event_2' THEN cd056
    WHEN event_name = 'event_3' THEN cd035
  END AS variable_3
FROM table_with_dirty_data

会让我花很多时间，而且很容易出错.我想做的是使用 SELECT 语句，该语句从 table_with_dirty_data 中返回信息所在的列名( cd ### )存储并使用 WHILE 遍历所有不同事件.因此，例如，在 event_name ='event_1'的情况下，将是这样的:

would take me a looooong time and is very error-prone. What I am trying to do is to use SELECT statements that return the column name (cd###) from the table_with_dirty_data where the information is stored and the use WHILE to loop over all different events. So, for example, with event_name = 'event_1' it would be something like this:

SELECT 
  CASE WHEN event_name = 'event_1' THEN (SELECT variable_1 FROM matrix WHERE event_name = 'event_1') END AS variable_1
  , CASE WHEN event_name = 'event_1' THEN (SELECT variable_2 FROM matrix WHERE event_name = 'event_1') END AS variable_2
  , CASE WHEN event_name = 'event_1' THEN (SELECT variable_3 FROM matrix WHERE event_name = 'event_1') END AS variable_3
FROM table_with_dirty_data
WHERE event_name = 'event_1'

这里的目标是，然后我可以遍历具有所有 event_name 的数组(可以轻松实现的东西).最终，我需要将所有事件都放在一个表中，但是我可以为每个事件准备一个表，只要可以通过编程即可完成(我个人不知道这在GBQ中是否可能...进行检查).

The goal here is that I can then loop over an array with all the event_names (something that can be easily achieved). Ultimately I need all events in one table, but I'm ok with having a table for curating each event, as long as it can be done programmatically (something I personally don't know if it's even possible in GBQ...would have to check on that).

问题是我正在使用的 SELECT 语句被评估为字符串，因此 CASE 子句中的查询结果是字符串文字.例如，如果我要为event ='event_1'运行它，

The issue is that the SELECT statements I am using get evaluated to a string, so the result of the queries inside the CASE clause are string literals. For example, if I were to run it for event = 'event_1',

SELECT variable_1 FROM matrix WHERE event_name = 'event_1'

评估为

'cd020'

然后导致外部查询变为

SELECT 
  CASE WHEN event_name = 'event_1' THEN 'cd020' END AS variable_1
  , CASE WHEN event_name = 'event_1' THEN 'cd035' END AS variable_2
  , CASE WHEN event_name = 'event_1' THEN 'cd120' END AS variable_3
FROM table_with_dirty_data
WHERE event_name = 'event_1'

产生这样的表格

event_name | variable_1 | variable_2 | variable_3
-----------|------------|------------|----------
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'
'event_1'  | 'cd020'    | 'cd035'    | 'cd120'

每次发现"event_1"事件时，都不会返回存储在 cd020 ， cd035 或 cd120 列中的值.

instead of returning the value stored in columns cd020, cd035 or cd120 for each time the event 'event_1' is found.

有人知道取消对内部查询结果的引用的方法，以便在执行外部查询时将其转换为列名(因此"cd020"变为 cd020 )吗?

Does anyone know a way to unquote the result of those inner queries so that when the outer query executes, they are transformed into column names (so 'cd020' becomes cd020)??

PS:如果有人知道，我也愿意采取一种完全不同的策略.

PS: I am also open to a totally different strategy if anyone knows one.

如何在Google Bigquery中将字符串转换为列名? [英] How to convert strings into column names in Google Bigquery?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

如何在Google Bigquery中将字符串转换为列名? [英] How to convert strings into column names in Google Bigquery?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭