"reduce" a set of rows in Hive to another set of rows


Problem description


I'm using Hive for batch-processing of my spatial database. My trace table looks something like this:

object | lat  | long  | timestamp
1      | X11  | X12   | T11
1      | X21  | X22   | T12
2      | X11  | X12   | T21
1      | X31  | X22   | T13
2      | X21  | X22   | T22

I want to map each lat/long pair of each object to a number (think of map-matching, for example), but the algorithm needs to consider a number of adjacent data points to get the result. For example, I need all 3 data points of object 1 in order to map each of those 3 points to a number; they can't be processed one by one.

I'm thinking of using map-reduce with Hive using transform, but I'm not sure how to do this. Can someone please help me out?

Solution

You can use the custom map reduce functionality in Hive.

With the following:

add file /some/path/identity.pl;
add file /some/path/collect.pl;

from (
  from trace_input
  MAP id, lat, lon, ts
  USING './identity.pl'
  as id, lat, lon, ts
 CLUSTER BY id) map_output
REDUCE id, lat, lon, ts
USING './collect.pl' as id, list;

trace_input contains your trace data as described above:

create table trace_input(id string, lat string, lon string, ts string)
row format delimited
fields terminated by '\t'
stored as textfile ;

identity.pl is a simple script to dump out each line (could also be a script to select just the lat, long fields):

#!/usr/bin/perl
while (<STDIN>) {
    print;
}
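
To illustrate the "select just some fields" variant, here is a minimal sketch written in Python instead of Perl (Hive pipes rows to any executable that reads stdin and writes stdout). The function name `select_columns` and the choice of columns are assumptions made for illustration; they are not part of the scripts above.

```python
#!/usr/bin/env python3
"""Hypothetical map script: keep only some tab-separated columns of each row."""


def select_columns(line, keep=(0, 1, 2)):
    """Return the chosen columns of one tab-separated row, joined by tabs.

    The default keeps id, lat, lon and drops ts (an assumed choice).
    """
    fields = line.rstrip("\n").split("\t")
    return "\t".join(fields[i] for i in keep)


if __name__ == "__main__":
    # Sample rows stand in for what Hive would stream on stdin.
    sample = ["1\tX11\tX12\tT11\n", "2\tX11\tX12\tT21\n"]
    for row in sample:
        print(select_columns(row))
```

To use something like this from Hive you would iterate over `sys.stdin` instead of `sample`, and the `as` clause after `USING` would have to list only the columns the script emits.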

collect.pl (sample here) is a simple script that collects consecutive lines with the same object id, saves the remainder of each line, and dumps out one line per id containing the id and a comma-separated list, separated by a tab.
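
In case the sample is not at hand, the behaviour described can be sketched roughly as follows. This is a Python stand-in written from the description above, not the original collect.pl; the function name `collect` is an assumption.

```python
#!/usr/bin/env python3
"""Hypothetical reduce script: one output line per run of rows sharing an id."""
from itertools import groupby


def collect(lines):
    """Yield 'id<TAB>v1,v2,...' for each run of consecutive rows with the same id."""
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for obj_id, group in groupby(rows, key=lambda fields: fields[0]):
        # Keep everything after the id from every row in the run.
        rest = [value for fields in group for value in fields[1:]]
        yield obj_id + "\t" + ",".join(rest)


if __name__ == "__main__":
    # Sample rows in the order a reducer would see them after CLUSTER BY id.
    sample = [
        "1\tX11\tX12\tT11\n",
        "1\tX21\tX22\tT12\n",
        "1\tX31\tX22\tT13\n",
        "2\tX11\tX12\tT21\n",
        "2\tX21\tX22\tT22\n",
    ]
    for out in collect(sample):
        print(out)
```

With Hive you would pass `sys.stdin` to `collect` instead of `sample`. Note that it only merges rows that arrive next to each other, so it relies on the input already being grouped by id.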

The cluster by clause ensures that each reducer receives its input sorted by id, which the collect script needs.
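
To see why the ordering matters for a script that groups consecutive lines, here is a small Python sketch using the object ids from the trace table in the question. Sorting the list stands in for what CLUSTER BY does on the Hive side (it distributes and sorts by the same column); the helper name `consecutive_runs` is an assumption.

```python
from itertools import groupby

# The object column of the trace table, in the order the rows appear there.
ids_in_table_order = ["1", "1", "2", "1", "2"]


def consecutive_runs(ids):
    """Return (id, run_length) for each run of equal adjacent ids."""
    return [(key, len(list(run))) for key, run in groupby(ids)]


unclustered = consecutive_runs(ids_in_table_order)
clustered = consecutive_runs(sorted(ids_in_table_order))

print(unclustered)  # object 1 shows up in two separate runs
print(clustered)    # exactly one run per object
```

CLUSTER BY id only sorts on id. If the script also needs each object's rows in time order, `DISTRIBUTE BY id SORT BY id, ts` is the usual alternative.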

The output of the user scripts is treated as tab-separated STRING columns.

Running the query results in the following output:

1       X11,X12,T11,X21,X22,T12,X31,X22,T13
2       X11,X12,T21,X21,X22,T22

You can modify the map script to limit the columns, and/or modify the reduce script to add results or separate the lat, lon from the ts, etc.
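
For the use case in the question (one number per point, computed from all points of the same object), the reduce script could look roughly like the Python sketch below. `match_points` is a purely hypothetical placeholder for the real algorithm: it just returns each point's position in timestamp order, and the names and output layout are assumptions.

```python
#!/usr/bin/env python3
"""Hypothetical reduce script: emit one row per point, with a per-point number."""
from itertools import groupby


def match_points(points):
    """Placeholder for map-matching: sees all (lat, lon, ts) points of one
    object and returns one number per point (here: rank in ts order)."""
    order = sorted(range(len(points)), key=lambda i: points[i][2])
    numbers = [0] * len(points)
    for rank, i in enumerate(order):
        numbers[i] = rank
    return numbers


def reduce_per_point(lines):
    """Yield 'id<TAB>lat<TAB>lon<TAB>ts<TAB>number' for every input row."""
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for obj_id, group in groupby(rows, key=lambda fields: fields[0]):
        points = [tuple(fields[1:4]) for fields in group]  # (lat, lon, ts)
        for (lat, lon, ts), number in zip(points, match_points(points)):
            yield "\t".join([obj_id, lat, lon, ts, str(number)])


if __name__ == "__main__":
    # Sample rows for object 1, deliberately not in time order.
    sample = ["1\tX21\tX22\tT12\n", "1\tX11\tX12\tT11\n", "1\tX31\tX22\tT13\n"]
    for out in reduce_per_point(sample):
        print(out)
```

On the Hive side the REDUCE clause would then end with something like `as id, lat, lon, ts, number` to match the five columns this script emits.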

If this form is sufficient, you could insert directly into a result table by adding an insert before the reduce:

from (
  from trace_input
  MAP id, lat, lon, ts
  USING './identity.pl'
  as id, lat, lon, ts
 CLUSTER BY id) map_output
INSERT overwrite table trace_res
REDUCE id, lat, lon, ts
USING './collect.pl';

The fields will be converted from string fields to match the schema of trace_res as necessary.

If you use collection types like I do, you can also do something like:

create table trace_res as
select sq.id, split(sq.list,",") from
(
from (
  from trace_input
  MAP id, lat, lon, ts
  USING './identity.pl'
  as id, lat, lon, ts
 CLUSTER BY id) map_output
REDUCE id, lat, lon, ts
USING './collect.pl' as (id int, list string)
) sq;

The second field in the created table will be an array of all the lat, lon, ts values; in practice you will probably want a more complex table than that.
