How to transform an SQL table into a list of row sequences using BigQuery and Apache Beam?


Problem description

I have a very large table where each row represents an abstraction called a Trip. Trips consist of numeric columns such as vehicle id, trip id, start time, stop time, distance traveled, driving duration, etc. So each Trip is a 1D vector of floating point values.

I want to transform this table, or list of vectors, into a list of Trip sequences where Trips are grouped into sequences by vehicle id and are in order according to start time. The sequence length needs to be limited to a specific size such as 256 but there can / should be multiple sequences with the same VehicleId.

Example:
(sequence length = 4)

[  
(Vehicle1, [Trip1, Trip2, Trip3, Trip4]),  
(Vehicle1, [Trip5, Trip6, Trip7]),  
(Vehicle2, [Trip1, Trip2, Trip3, Trip4])  
]

I'm trying to model driving patterns based on these Trips using a sequence-based model such as an LSTM / Transformer. Imagine each Trip as a word embedding and each sequence of trips as a sentence. Somehow I need to construct these sentences through a combination of BigQuery / Apache Beam functions (or any other recommended tools) since we're talking about hundreds of gigabytes of data. I'm fairly new to both tools so any help would be greatly appreciated.

Solution

The following query is written for BigQuery Standard SQL:

#standardSQL
SELECT trip.vehicle_id, ARRAY_AGG(trip ORDER BY trip.start_time) trips
FROM (
  SELECT trip, DIV(ROW_NUMBER() OVER(PARTITION BY vehicle_id ORDER BY start_time) - 1, 4) grp   
  FROM `project.dataset.table` trip
)
GROUP BY trip.vehicle_id, grp

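The query above orders each vehicle's trips by start_time and uses a sequence length of 4; replace the 4 in DIV(...) with the length you need (for example 256). ROW_NUMBER() numbers each vehicle's trips starting from 1, so DIV(ROW_NUMBER() - 1, 4) puts trips 1 to 4 in group 0, trips 5 to 8 in group 1, and so on.
Because the whole row is aggregated, vehicle_id also appears inside each array element, as in the example output below: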

Row vehicle_id  trips.vehicle_id    trips.trip_id   trips.start_time    trips.stop_time  
1   Vehicle1    Vehicle1            Trip1           1                   2    
                Vehicle1            Trip2           2                   3    
                Vehicle1            Trip3           3                   4    
                Vehicle1            Trip4           4                   5    
2   Vehicle1    Vehicle1            Trip5           5                   6    
                Vehicle1            Trip6           6                   6    
                Vehicle1            Trip7           7                   6    
3   Vehicle2    Vehicle2            Trip1           2                   3    
                Vehicle2            Trip2           3                   4    
                Vehicle2            Trip3           4                   5    
                Vehicle2            Trip4           5                   6    
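
To make the bucketing step easier to follow, here is a minimal plain-Python sketch of what DIV(ROW_NUMBER() OVER(PARTITION BY vehicle_id ORDER BY start_time) - 1, 4) does. It is only an illustration of the logic, not a replacement for running the query on hundreds of gigabytes. The function name chunk_trips, the rows list, and the dictionary keys are assumptions made for this example; the sample values mirror the output above. Unlike the SQL, which guarantees no row order without an outer ORDER BY, the sketch sorts vehicles and groups so its result is deterministic.

```python
from collections import defaultdict
from operator import itemgetter


def chunk_trips(rows, seq_len=4):
    """Group trips per vehicle, ordered by start_time, into chunks of at most seq_len.

    Hypothetical helper: mirrors PARTITION BY vehicle_id ORDER BY start_time
    followed by GROUP BY vehicle_id, grp.
    """
    # PARTITION BY vehicle_id
    by_vehicle = defaultdict(list)
    for row in rows:
        by_vehicle[row["vehicle_id"]].append(row)

    sequences = []
    for vehicle_id in sorted(by_vehicle):
        # ORDER BY start_time within the partition
        trips = sorted(by_vehicle[vehicle_id], key=itemgetter("start_time"))
        buckets = defaultdict(list)
        for row_number, trip in enumerate(trips, start=1):
            # Same arithmetic as DIV(ROW_NUMBER() - 1, seq_len)
            grp = (row_number - 1) // seq_len
            buckets[grp].append(trip["trip_id"])
        # GROUP BY vehicle_id, grp
        sequences.extend((vehicle_id, buckets[g]) for g in sorted(buckets))
    return sequences


# Sample rows (assumed shape); Vehicle1 is deliberately listed in reverse order.
rows = [
    {"vehicle_id": "Vehicle1", "trip_id": f"Trip{i}", "start_time": i}
    for i in range(7, 0, -1)
] + [
    {"vehicle_id": "Vehicle2", "trip_id": f"Trip{i}", "start_time": i + 1}
    for i in range(1, 5)
]

for vehicle_id, trip_ids in chunk_trips(rows, seq_len=4):
    print(vehicle_id, trip_ids)
```

With seq_len=4, Vehicle1's seven trips fall into groups 0 and 1, which is why it produces one full sequence and one shorter one.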

To drop the redundant vehicle_id from the array elements, rebuild each array with SELECT AS STRUCT * EXCEPT(vehicle_id):

#standardSQL
SELECT vehicle_id,
  ARRAY( 
    SELECT AS STRUCT * EXCEPT(vehicle_id)
    FROM UNNEST(trips)
    ORDER BY start_time
  ) trips
FROM (
  SELECT trip.vehicle_id, ARRAY_AGG(trip ORDER BY trip.start_time) trips
  FROM (
    SELECT trip, DIV(ROW_NUMBER() OVER(PARTITION BY vehicle_id ORDER BY start_time) - 1, 4) grp   
    FROM `project.dataset.table` trip
  )
  GROUP BY trip.vehicle_id, grp
)


Row vehicle_id  trips.trip_id   trips.start_time    trips.stop_time  
1   Vehicle1    Trip1           1                   2    
                Trip2           2                   3    
                Trip3           3                   4    
                Trip4           4                   5    
2   Vehicle1    Trip5           5                   6    
                Trip6           6                   6    
                Trip7           7                   6    
3   Vehicle2    Trip1           2                   3    
                Trip2           3                   4    
                Trip3           4                   5    
                Trip4           5                   6    
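
Both queries hard-code a sequence length of 4, while the question asks for something like 256. One possible way to avoid editing the SQL by hand is to generate the query text from a small Python helper, sketched below. The function name build_sequence_query and its parameters are assumptions for illustration; the helper only builds the string and does not call BigQuery. The table name is interpolated directly into the text, so it should come from trusted configuration rather than user input.

```python
def build_sequence_query(table, seq_len=256):
    """Return the sequence-building query for `table` with the given sequence length.

    Hypothetical helper: it only formats SQL text; running it is left to
    whichever BigQuery client or Beam source you use.
    """
    # Reject anything that is not a positive integer before it reaches the SQL.
    if not isinstance(seq_len, int) or isinstance(seq_len, bool) or seq_len < 1:
        raise ValueError("seq_len must be a positive integer")

    return f"""
SELECT vehicle_id,
  ARRAY(
    SELECT AS STRUCT * EXCEPT(vehicle_id)
    FROM UNNEST(trips)
    ORDER BY start_time
  ) AS trips
FROM (
  SELECT trip.vehicle_id, ARRAY_AGG(trip ORDER BY trip.start_time) AS trips
  FROM (
    SELECT trip,
      DIV(ROW_NUMBER() OVER(PARTITION BY vehicle_id ORDER BY start_time) - 1, {seq_len}) AS grp
    FROM `{table}` trip
  )
  GROUP BY trip.vehicle_id, grp
)
""".strip()


query = build_sequence_query("project.dataset.table", seq_len=256)
print(query)
```

The printed text is the second query above with 4 replaced by 256 and explicit AS aliases; everything else is unchanged.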
