"reduce" a set of rows in Hive to another set of rows


Question

I'm using Hive for batch-processing of my spatial database. My trace table looks something like this:

object | lat  | long  | timestamp
1      | X11  | X12   | T11
1      | X21  | X22   | T12
2      | X11  | X12   | T21
1      | X31  | X22   | T13
2      | X21  | X22   | T22

I want to map each lat/long of each object to a number (think of map-matching, for example), but the algorithm needs to consider a number of adjacent data points to get the result. For example, I need all 3 data points of object 1 in order to map each of those 3 data points to a number; they can't be processed one by one.

I'm thinking of using map-reduce with Hive using TRANSFORM, but I'm not sure how to do this. Can someone please help me out?

Answer

You can use the custom map-reduce functionality in Hive.

Something like the following:

add file /some/path/identity.pl;
add file /some/path/collect.pl;

from (
  from trace_input
  MAP id, lat, lon, ts
  USING './identity.pl'
  as id, lat, lon, ts
  CLUSTER BY id) map_output
REDUCE id, lat, lon, ts
USING './collect.pl' as id, list;

trace_input contains your trace data as described above:

create table trace_input(id string, lat string, lon string, ts string)
row format delimited
fields terminated by '\t'
stored as textfile;
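For completeness, sample data could then be loaded from a local tab-separated file; this statement is an illustration only, and the file path is an assumption, not from the original:

```sql
-- Hypothetical: load a local tab-separated trace file into the table.
-- The path '/some/path/trace.txt' is assumed for illustration.
load data local inpath '/some/path/trace.txt'
overwrite into table trace_input;
```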

identity.pl is a simple script that dumps out each line (it could also be a script that selects just the lat, long fields):

#!/usr/bin/perl
while (<STDIN>) {
    print;
}

collect.pl (sample here) is a simple script which collects consecutive lines with the same object id, saves the remainder of each line, and dumps out a line with the id and a comma-separated list (tab separator).
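The linked sample for collect.pl is not reproduced in the original text. As an illustration only (not the original Perl script), the same grouping logic can be sketched in Python, since a Hive TRANSFORM script can be any executable that reads tab-separated rows on stdin:

```python
#!/usr/bin/env python3
# Hypothetical stand-in for collect.pl: groups consecutive rows that share
# an object id and emits "id<TAB>comma-separated values of the rest".
# Assumes the input is already clustered by id, as CLUSTER BY guarantees.
import sys

def collect(lines):
    current_id, values = None, []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        obj_id, rest = fields[0], fields[1:]
        if current_id is not None and obj_id != current_id:
            # id changed: flush the accumulated group.
            yield f"{current_id}\t{','.join(values)}"
            values = []
        current_id = obj_id
        values.extend(rest)
    if current_id is not None:
        yield f"{current_id}\t{','.join(values)}"

if __name__ == "__main__":
    for out in collect(sys.stdin):
        print(out)
```

Because the grouping relies on equal ids arriving on consecutive lines, this only works downstream of the CLUSTER BY in the query above.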

The CLUSTER BY clause ensures the reducers get the correctly sorted input needed by the collect script.

The output of the user scripts consists of tab-separated STRING columns.

Running the query produces the following output:

1       X11,X12,T11,X21,X22,T12,X31,X22,T13
2       X11,X12,T21,X21,X22,T22

You can modify the map script to limit the columns, and/or modify the reduce script to add results or to separate the lat, lon from the ts, etc.

If this form is sufficient, you can insert directly into a result table by adding an INSERT before the REDUCE:

from (
  from trace_input
  MAP id, lat, lon, ts
  USING './identity.pl'
  as id, lat, lon, ts
  CLUSTER BY id) map_output
INSERT overwrite table trace_res
REDUCE id, lat, lon, ts
USING './collect.pl';

The fields will be converted from strings to match the schema of trace_res as necessary.
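The original does not show the definition of trace_res. A plausible schema matching the two-column output above might look like this (the table layout is an assumption for illustration):

```sql
-- Hypothetical schema for the result table; the list column holds the
-- comma-separated lat/lon/ts values emitted by collect.pl.
create table trace_res(id int, list string)
row format delimited
fields terminated by '\t'
stored as textfile;
```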

If you use collection types like I do, you can also do something like:

create table trace_res as
select sq.id, split(sq.list, ",") as list from
(
from (
  from trace_input
  MAP id, lat, lon, ts
  USING './identity.pl'
  as id, lat, lon, ts
  CLUSTER BY id) map_output
REDUCE id, lat, lon, ts
USING './collect.pl' as (id int, list string)
) sq;

The second field in the created table will be a list of all the lat, lon, and ts values; in practice you would probably use a more structured schema than that.
