How to optimize scanning one huge file/table in Hive to check whether a lat/long point is contained in a WKT geometry shape
Problem description
I am currently trying to associate each lat long ping from a device to its ZIP code.
I have de-normalized the lat/long device ping data and created a cross-product/Cartesian product join table in which each row has the ST_Point(long, lat), the geometry_shape_of_ZIP, and the ZIP code associated with that geometry. For testing purposes I have around 45 million rows in the table, and in production it will grow to about 1 billion rows per day.
Even though the data is flattened and there are no join conditions, the query takes about 2 hours to complete. Is there a faster way to compute spatial queries, or how can I optimize the following query?
Below are some of the optimization steps I have already performed. With these optimizations, all the other operations complete in at most 5 minutes, except for this one step. I am using an AWS cluster with 2 master nodes and 5 data nodes.
set hive.vectorized.execution.enabled=true;
set hive.execution.engine=tez;
set hive.enforce.sorting=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
analyze table tele_us_zipmatch compute statistics for columns;
CREATE TABLE zipcheck (
  `long4` double,
  `lat4` double,
  state_name string,
  country_code string,
  country_name string,
  region string,
  zip int,
  countyname string)
PARTITIONED BY (state_id string)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY',
  'orc.create.index'='true',
  'orc.bloom.filter.columns'='');
INSERT OVERWRITE TABLE zipcheck PARTITION(state_id)
select long4, lat4, state_name, country_code, country_name, region, zip, countyname, state_id from tele_us_zipmatch
where ST_Contains(wkt_shape, zip_point) = TRUE;
ST_Contains is the function from Esri's spatial framework for Hadoop (ref: https://github.com/Esri/spatial-framework-for-hadoop/wiki/UDF-Documentation#relationship-tests ).
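For intuition about the per-row cost: the predicate the query evaluates for every (point, polygon) row is a point-in-polygon test. A minimal pure-Python sketch of that test, using the classic even-odd ray-casting rule (this is an illustration, not Esri's actual ST_Contains implementation):

```python
# Illustrative point-in-polygon test via ray casting (even-odd rule).
# NOT the Esri ST_Contains implementation; just a sketch of the predicate
# the query has to evaluate once per (point, polygon) row.

def point_in_polygon(x, y, polygon):
    """polygon: list of (x, y) vertices; returns True if (x, y) is inside."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count crossings of a horizontal ray going right from (x, y).
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # inside -> True
print(point_in_polygon(5, 2, square))  # outside -> False
```

The cost is linear in the number of polygon vertices, and ZIP boundaries can have thousands of vertices, which is why evaluating it against every Cartesian-product row is expensive.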
Any help is greatly appreciated.
Thanks.
If the ZIP-code dataset can fit into memory, try a custom MapReduce application that builds a just-in-time in-memory quadtree index over the ZIP-code data, by adapting the sample in GIS-Tools-for-Hadoop.
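To illustrate why the index helps: instead of testing each ping against every ZIP polygon, the index maps a point to a handful of candidate polygons, and only those run the exact containment test. The sketch below uses a simple uniform grid over polygon bounding boxes rather than a quadtree (the GIS-Tools-for-Hadoop sample uses a quadtree, but the access pattern is the same); the cell size and ZIP codes are made-up illustration values:

```python
# Sketch of an in-memory spatial index over ZIP polygons: bucket each
# polygon's bounding box into grid cells, then probe the point's cell to
# get candidate polygons. A quadtree refines this idea adaptively.
from collections import defaultdict

CELL = 1.0  # grid cell size in degrees (a tuning knob, chosen arbitrarily here)

def bbox(polygon):
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def build_index(polygons):
    """polygons: {zip_code: [(x, y), ...]} -> dict mapping grid cell to zips."""
    index = defaultdict(list)
    for zip_code, poly in polygons.items():
        x0, y0, x1, y1 = bbox(poly)
        for cx in range(int(x0 // CELL), int(x1 // CELL) + 1):
            for cy in range(int(y0 // CELL), int(y1 // CELL) + 1):
                index[(cx, cy)].append(zip_code)
    return index

def candidates(index, x, y):
    """ZIP polygons whose bounding box overlaps the point's grid cell."""
    return index.get((int(x // CELL), int(y // CELL)), [])

# Hypothetical data: two far-apart square "ZIP shapes".
polys = {"10001": [(0, 0), (1, 0), (1, 1), (0, 1)],
         "94105": [(10, 10), (11, 10), (11, 11), (10, 11)]}
idx = build_index(polys)
print(candidates(idx, 0.5, 0.5))  # only the nearby candidate: ['10001']
```

In the MapReduce setting, each mapper would build this index once from the ZIP dataset, then run the exact ST_Contains test only on the candidates for each point, turning a scan over all polygons per point into a near-constant-time lookup.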
[collaborator]