How to optimize scanning one huge file/table in Hive to check whether a lat/long point is contained in a WKT geometry shape
Problem description
I am currently trying to associate each lat long ping from a device to its ZIP code.
I have de-normalized the lat/long device ping data and created a cross-product/Cartesian product join table in which each row has the ST_Point(long, lat), the geometry_shape_of_ZIP, and the ZIP code associated with that geometry. For testing purposes I have around 45 million rows in the table, and in production it will grow to about 1 billion rows per day.
Even though the data is flattened and there are no join conditions, the query takes about 2 hours to complete. Is there a faster way to compute spatial queries, or how can I optimize the following query?
Below are some of the optimization steps I have already performed. With these optimizations, all the other operations complete in at most 5 minutes, except for this one step. I am using an AWS cluster with 2 master nodes and 5 data nodes.
set hive.vectorized.execution.enabled=true;
set hive.execution.engine=tez;
set hive.enforce.sorting=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
analyze table tele_us_zipmatch compute statistics for columns;
CREATE TABLE zipcheck (
  `long4` double,
  `lat4` double,
  state_name string,
  country_code string,
  country_name string,
  region string,
  zip int,
  countyname string)
PARTITIONED BY (state_id string)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY',
  'orc.create.index'='true',
  'orc.bloom.filter.columns'='');
INSERT OVERWRITE TABLE zipcheck PARTITION(state_id)
select long4, lat4, state_name, country_code, country_name, region, zip, countyname, state_id from tele_us_zipmatch
where ST_Contains(wkt_shape, zip_point) = TRUE;
ST_Contains is the function from Esri's spatial framework for Hadoop (ref: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation#relationship-tests ).
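For intuition about the per-row cost: the predicate the query evaluates for every (point, polygon) row is a point-in-polygon test. A minimal pure-Python sketch of that test, using the classic even-odd ray-casting rule (this is an illustration, not Esri's actual ST_Contains implementation):

```python
# Illustrative point-in-polygon test via ray casting (even-odd rule).
# NOT the Esri ST_Contains implementation; just a sketch of the predicate
# the query has to evaluate once per (point, polygon) row.

def point_in_polygon(x, y, polygon):
    """polygon: list of (x, y) vertices; returns True if (x, y) is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count crossings of a horizontal ray going right from (x, y).
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # inside -> True
print(point_in_polygon(5, 2, square))  # outside -> False
```

The cost is linear in the number of polygon vertices, and ZIP boundaries can have thousands of vertices, which is why evaluating it against every Cartesian-product row is expensive.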
Any help is greatly appreciated.
Thanks.
If the ZIP-code dataset can fit into memory, try a custom MapReduce application that builds a just-in-time in-memory quadtree index over the ZIP-code data, by adapting the sample in GIS-Tools-for-Hadoop.
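To illustrate why the index helps: instead of testing each ping against every ZIP polygon, the index maps a point to a handful of candidate polygons, and only those run the exact containment test. The sketch below uses a simple uniform grid over polygon bounding boxes rather than a quadtree (the GIS-Tools-for-Hadoop sample uses a quadtree, but the access pattern is the same); the cell size and ZIP codes are made-up illustration values:

```python
# Sketch of an in-memory spatial index over ZIP polygons: bucket each
# polygon's bounding box into grid cells, then probe the point's cell to
# get candidate polygons. A quadtree refines this idea adaptively.
from collections import defaultdict

CELL = 1.0  # grid cell size in degrees (a tuning knob, chosen arbitrarily here)

def bbox(polygon):
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def build_index(polygons):
    """polygons: {zip_code: [(x, y), ...]} -> dict mapping grid cell to zips."""
    index = defaultdict(list)
    for zip_code, poly in polygons.items():
        x0, y0, x1, y1 = bbox(poly)
        for cx in range(int(x0 // CELL), int(x1 // CELL) + 1):
            for cy in range(int(y0 // CELL), int(y1 // CELL) + 1):
                index[(cx, cy)].append(zip_code)
    return index

def candidates(index, x, y):
    """ZIP polygons whose bounding box overlaps the point's grid cell."""
    return index.get((int(x // CELL), int(y // CELL)), [])

# Hypothetical data: two far-apart square "ZIP shapes".
polys = {"10001": [(0, 0), (1, 0), (1, 1), (0, 1)],
         "94105": [(10, 10), (11, 10), (11, 11), (10, 11)]}
idx = build_index(polys)
print(candidates(idx, 0.5, 0.5))  # only the nearby candidate: ['10001']
```

In the MapReduce setting, each mapper would build this index once from the ZIP dataset, then run the exact ST_Contains test only on the candidates for each point, turning a scan over all polygons per point into a near-constant-time lookup.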
[collaborator]