How to optimize scan of 1 huge file / table in Hive to confirm/check if lat long point is contained in a wkt geometry shape


Problem Description



I am currently trying to associate each lat/long ping from a device with its ZIP code.

I have de-normalized the lat/long device ping data and created a cross-product/Cartesian-product join table in which each row has the ST_Point(long, lat), the geometry_shape_of_ZIP, and the ZIP code associated with that geometry. For testing purposes I have around 45 million rows in the table, and in production it will grow to about 1 billion rows per day.

Even though the data is flattened and there are no join conditions, the query takes about 2 hours to complete. Is there a faster way to compute spatial queries, or how can I optimize the query below?

Below are some of the optimization steps I have already performed. With these optimizations, every other operation finishes in at most 5 minutes except for this one step. I am using an AWS cluster with 2 master nodes and 5 data nodes.

set hive.vectorized.execution.enabled=true;
set hive.execution.engine=tez;
set hive.enforce.sorting=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;

analyze table tele_us_zipmatch compute statistics for columns;

CREATE TABLE zipcheck (
  `long4` double,
  `lat4` double,
  state_name string,
  country_code string,
  country_name string,
  region string,
  zip int,
  countyname string)
PARTITIONED BY (state_id string)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY',
               'orc.create.index'='true',
               'orc.bloom.filter.columns'='');

INSERT OVERWRITE TABLE zipcheck PARTITION(state_id)
select long4, lat4, state_name, country_code, country_name, region, zip, countyname, state_id
from tele_us_zipmatch
where ST_Contains(wkt_shape, zip_point) = TRUE;

ST_Contains is the function from Esri's Spatial Framework for Hadoop (ref: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation#relationship-tests ).

Any help is greatly appreciated.

Thanks.

Solution

If the ZIP-code dataset can fit into memory, try a custom Map-Reduce application that uses a just-in-time in-memory quadtree index on the ZIP-code data, by adapting the sample in the GIS-Tools-for-Hadoop.

[collaborator]
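For illustration, here is a minimal sketch of the in-memory quadtree lookup that answer describes, loosely adapted from the point-in-polygon aggregation idea in GIS Tools for Hadoop. The class and variable names (ZipLookup, zipWkts, zipCodes) are illustrative, not from the original post, and it assumes the ZIP shapes are available as WKT strings; the geometry calls use esri/geometry-api-java, the library underneath the Esri Hive UDFs. In a Map-Reduce job, the constructor's work would typically run once in the mapper's setup(), and zipFor() would be called once per device ping.

import com.esri.core.geometry.Envelope2D;
import com.esri.core.geometry.Geometry;
import com.esri.core.geometry.GeometryEngine;
import com.esri.core.geometry.Point;
import com.esri.core.geometry.QuadTree;
import com.esri.core.geometry.SpatialReference;

import java.util.ArrayList;
import java.util.List;

public class ZipLookup {
    private final SpatialReference wgs84 = SpatialReference.create(4326);
    private final QuadTree quadTree;
    private final List<Geometry> zipShapes = new ArrayList<>();  // parsed ZIP polygons
    private final List<Integer> zipCodes;                        // parallel to zipShapes

    /** Build the index once (e.g. in a mapper's setup()) from WKT shapes and ZIP codes. */
    public ZipLookup(List<String> zipWkts, List<Integer> zipCodes) {
        this.zipCodes = zipCodes;
        Envelope2D extent = new Envelope2D();
        extent.setCoords(-180, -90, 180, 90);       // whole-world extent in lon/lat
        this.quadTree = new QuadTree(extent, 8);    // quadtree of depth 8

        Envelope2D bbox = new Envelope2D();
        for (int i = 0; i < zipWkts.size(); i++) {
            Geometry shape = GeometryEngine.geometryFromWkt(
                    zipWkts.get(i), 0, Geometry.Type.Unknown);
            zipShapes.add(shape);
            shape.queryEnvelope2D(bbox);
            quadTree.insert(i, bbox);               // index the polygon's bounding box
        }
    }

    /** Returns the ZIP code whose shape contains (lon, lat), or null if none does. */
    public Integer zipFor(double lon, double lat) {
        Point pt = new Point(lon, lat);
        // The quadtree narrows the search to polygons whose bounding box covers the
        // point; the exact ST_Contains-style test runs only on those few candidates.
        QuadTree.QuadTreeIterator it = quadTree.getIterator(pt, 0);
        for (int handle = it.next(); handle >= 0; handle = it.next()) {
            int i = quadTree.getElement(handle);
            if (GeometryEngine.contains(zipShapes.get(i), pt, wgs84)) {
                return zipCodes.get(i);
            }
        }
        return null;
    }
}

The design point: in the flattened cross-product table, every one of the ~45 million (eventually ~1 billion per day) rows pays for a full polygon containment test, whereas with a lookup like this each ping only runs the exact test against the handful of ZIP shapes whose bounding boxes overlap it.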
