How to efficiently query a Hive table in Spark using a HiveContext?


Question


I have a 1.6 TB Hive table with time series data. I am using Hive 1.2.1 and Spark 1.6.1 with Scala.

Following is the query I have in my code, but I always get a Java out-of-memory error.

val sid_data_df = hiveContext.sql(s"SELECT time, total_field, sid, year, date FROM tablename WHERE sid = '$stationId' ORDER BY time LIMIT 4320000  ")

By iteratively selecting a few records at a time from the Hive table, I am trying to do a sliding window over the resulting dataframe.

I have a cluster of 4 nodes, each with 122 GB of memory and 44 vCores. I am using 425 GB of the 488 GB of memory available. I am invoking spark-submit with the following parameters:

--num-executors 16 --driver-memory 4g --executor-memory 22G --executor-cores 10 \
--conf "spark.sql.shuffle.partitions=1800" \
--conf "spark.shuffle.memory.fraction=0.6" \
--conf "spark.storage.memoryFraction=0.4" \
--conf "spark.yarn.executor.memoryOverhead=2600" \
--conf "spark.yarn.nodemanager.resource.memory-mb=123880" \
--conf "spark.yarn.nodemanager.resource.cpu-vcores=43"

Kindly give me suggestions on how to optimize this and successfully fetch the data from the Hive table.

Thanks

Solution

The problem is likely here:

LIMIT 4320000

You should avoid using LIMIT to take a subset of a large number of records. In Spark, LIMIT moves all rows to a single partition and is likely to cause serious performance and stability issues.
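As an illustration only, one way to avoid a global ORDER BY ... LIMIT is to restrict the scan with an explicit time-range predicate, ideally on a partition column, so Spark never has to pull every matching row into one partition. The table and column names follow the question; the time bounds here are hypothetical:

// Sketch only: filter by a bounded time range instead of ORDER BY ... LIMIT.
// `tablename`, `sid` and `time` come from the question; the bounds are made up.
val sid_data_df = hiveContext.sql(
  s"""SELECT time, total_field, sid, year, date
     |FROM tablename
     |WHERE sid = '$stationId'
     |  AND time >= '2017-01-01 00:00:00'
     |  AND time <  '2017-02-01 00:00:00'""".stripMargin)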

See for example How to optimize below spark code (scala)?

"I am trying to do a sliding window on this resultant dataframe iteratively by selecting a few records at a time."

This doesn't sound right. Sliding window operations can usually be achieved with some combination of window functions and timestamp-based window buckets.
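A minimal sketch of that idea with the Spark 1.6-era DataFrame API follows; the one-hour bucket size and window width are assumptions for illustration, not part of the original answer:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Sketch only: bucket rows by timestamp and compute a sliding aggregate with a
// range-based window, instead of iterating over small LIMITed selections.
// Column names (time, sid, total_field) follow the question; the bucket size
// and window width below are hypothetical.
val withTs = sid_data_df
  .withColumn("ts", unix_timestamp(col("time")))          // seconds since epoch
  .withColumn("bucket", (col("ts") / 3600).cast("long"))  // hypothetical 1-hour buckets

val w = Window
  .partitionBy("sid")
  .orderBy("ts")
  .rangeBetween(-3600, 0)  // trailing one-hour sliding window (hypothetical width)

val slid = withTs.withColumn("sliding_avg", avg("total_field").over(w))

This keeps the whole computation inside a single distributed job, letting Spark shuffle by sid once rather than re-querying Hive for each small slice.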
