How to transform an SQL table into a list of row sequences using BigQuery and Apache Beam?

Problem description

I have a very large table where each row represents an abstraction called a Trip. Trips consist of numeric columns such as vehicle id, trip id, start time, stop time, distance traveled, driving duration, etc. So each Trip is a 1D vector of floating point values.

I want to transform this table, or list of vectors, into a list of Trip sequences where Trips are grouped into sequences by vehicle id and are in order according to start time. The sequence length needs to be limited to a specific size such as 256 but there can / should be multiple sequences with the same VehicleId.

Example:
(sequence length = 4)

[  
(Vehicle1, [Trip1, Trip2, Trip3, Trip4]),  
(Vehicle1, [Trip5, Trip6, Trip7]),  
(Vehicle2, [Trip1, Trip2, Trip3, Trip4])  
]

I'm trying to model driving patterns based on these Trips using a sequence-based model such as an LSTM / Transformer. Imagine each Trip as a word embedding and each sequence of trips as a sentence. Somehow I need to construct these sentences through a combination of BigQuery / Apache Beam functions (or any other recommended tools) since we're talking about hundreds of gigabytes of data. I'm fairly new to both tools so any help would be greatly appreciated.

Recommended answer

The following is for BigQuery Standard SQL:

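#standardSQL
-- Outer query: one output row per (vehicle_id, grp), with its trips ordered by start_time
SELECT trip.vehicle_id, ARRAY_AGG(trip ORDER BY trip.start_time) trips
FROM (
  -- grp is 0 for each vehicle's first 4 trips, 1 for the next 4, and so on
  SELECT trip, DIV(ROW_NUMBER() OVER(PARTITION BY vehicle_id ORDER BY start_time) - 1, 4) grp
  FROM `project.dataset.table` trip
)
GROUP BY trip.vehicle_id, grp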
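
The grouping key in the query comes from DIV(ROW_NUMBER() ... - 1, 4). As a minimal sketch of that arithmetic only (plain Python, not BigQuery; the function name bucket_index is hypothetical), the 1-based row number within a vehicle maps to a 0-based group like this:

```python
def bucket_index(row_number: int, seq_len: int) -> int:
    """Mirror DIV(ROW_NUMBER() - 1, seq_len) for 1-based row numbers.

    Floor division matches DIV here because row_number - 1 is never negative.
    """
    return (row_number - 1) // seq_len


# Seven trips for one vehicle, sequence length 4
buckets = [bucket_index(n, 4) for n in range(1, 8)]
print(buckets)  # [0, 0, 0, 0, 1, 1, 1]
```

Rows sharing a bucket value end up in the same sequence, so the last sequence for a vehicle may be shorter than the others.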

The query above orders trips by start_time and uses a sequence length of 4; replace 4 with the required length (for example, 256).
It also returns vehicle_id as part of each trip's fields in the array, as in the example output below:

Row vehicle_id  trips.vehicle_id    trips.trip_id   trips.start_time    trips.stop_time  
1   Vehicle1    Vehicle1            Trip1           1                   2    
                Vehicle1            Trip2           2                   3    
                Vehicle1            Trip3           3                   4    
                Vehicle1            Trip4           4                   5    
2   Vehicle1    Vehicle1            Trip5           5                   6    
                Vehicle1            Trip6           6                   6    
                Vehicle1            Trip7           7                   6    
3   Vehicle2    Vehicle2            Trip1           2                   3    
                Vehicle2            Trip2           3                   4    
                Vehicle2            Trip3           4                   5    
                Vehicle2            Trip4           5                   6    

To remove the redundant vehicle_id from the trips array, try the following:

#standardSQL
SELECT vehicle_id,
  -- Rebuild each trip struct without vehicle_id, keeping start_time order
  ARRAY( 
    SELECT AS STRUCT * EXCEPT(vehicle_id)
    FROM UNNEST(trips)
    ORDER BY start_time
  ) trips
FROM (
  -- Same grouping as the first query: sequences of at most 4 trips per vehicle
  SELECT trip.vehicle_id, ARRAY_AGG(trip ORDER BY trip.start_time) trips
  FROM (
    SELECT trip, DIV(ROW_NUMBER() OVER(PARTITION BY vehicle_id ORDER BY start_time) - 1, 4) grp   
    FROM `project.dataset.table` trip
  )
  GROUP BY trip.vehicle_id, grp
)


Row vehicle_id  trips.trip_id   trips.start_time    trips.stop_time  
1   Vehicle1    Trip1           1                   2    
                Trip2           2                   3    
                Trip3           3                   4    
                Trip4           4                   5    
2   Vehicle1    Trip5           5                   6    
                Trip6           6                   6    
                Trip7           7                   6    
3   Vehicle2    Trip1           2                   3    
                Trip2           3                   4    
                Trip3           4                   5    
                Trip4           5                   6    
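
To check the expected shape of the result on a small sample before running the query on the full table, the same logic can be sketched in plain Python. This is only an illustration under assumptions: the function name build_sequences, the dict-based trip records, and the sample values are hypothetical, and the sort on vehicle_id is there only to make the output deterministic (the SQL above does not guarantee row order).

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Tuple

Trip = Dict[str, Any]


def build_sequences(trips: Iterable[Trip], seq_len: int = 4) -> List[Tuple[Any, List[Trip]]]:
    """Group trips by vehicle_id, order by start_time, split into chunks of at most seq_len."""
    by_vehicle: Dict[Any, List[Trip]] = defaultdict(list)
    for trip in trips:
        by_vehicle[trip["vehicle_id"]].append(trip)

    sequences: List[Tuple[Any, List[Trip]]] = []
    for vehicle_id in sorted(by_vehicle):
        ordered = sorted(by_vehicle[vehicle_id], key=lambda t: t["start_time"])
        for start in range(0, len(ordered), seq_len):
            # Drop vehicle_id from each trip, like SELECT AS STRUCT * EXCEPT(vehicle_id)
            chunk = [
                {k: v for k, v in trip.items() if k != "vehicle_id"}
                for trip in ordered[start:start + seq_len]
            ]
            sequences.append((vehicle_id, chunk))
    return sequences


# Hypothetical sample: 7 trips for Vehicle1 (given out of order), 4 trips for Vehicle2
sample = [
    {"vehicle_id": "Vehicle1", "trip_id": f"Trip{i}", "start_time": i, "stop_time": i + 1}
    for i in range(7, 0, -1)
] + [
    {"vehicle_id": "Vehicle2", "trip_id": f"Trip{i}", "start_time": i + 1, "stop_time": i + 2}
    for i in range(1, 5)
]

for vehicle_id, seq in build_sequences(sample, seq_len=4):
    print(vehicle_id, [t["trip_id"] for t in seq])
```

This holds everything in memory, so it is suitable only for spot checks; for hundreds of gigabytes the grouping should stay in BigQuery as shown above.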
