Spark SQL "Limit"

Views: 786 | Posted: 2018/5/31 20:01:33 | Tags: hadoop, apache-spark, hive, hortonworks-data-platform

Problem description

Environment: Spark 1.6 on Hadoop, Hortonworks Data Platform 2.5.

I have a table with 10 billion records and I would like to get 300 million records and move them to a temporary table.

  sqlContext.sql("select .... from my_table limit 300000000")
    .repartition(50)
    .write.saveAsTable("temporary_table")

I noticed that the LIMIT keyword makes Spark use only one executor. That means all 300 million records are moved to a single node, which then writes them back to Hadoop. How can I avoid this reduce step but still get just 300 million records while using more than one executor? I would like all nodes to write to Hadoop.

Can sampling help me with that? If so, how?

Solution

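Sampling can be used in either of the following ways:

  select .... from my_table TABLESAMPLE(.3 PERCENT)

or

  select .... from my_table TABLESAMPLE(30M ROWS)

The percentage form returns an approximate row count. For a 10-billion-row table, .3 PERCENT yields about 30 million rows; 3 PERCENT yields about the 300 million asked for.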
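As a rough illustration of the percentage form, the following Python sketch (not part of the original answer) computes the percentage needed for a target row count and builds the matching query string. It also includes a simplified model of why sampling spreads the work: each partition keeps its own rows independently, so nothing is funneled through one node. The helper names `sample_percent`, `tablesample_query`, and `simulate_partition_sampling`, and the `*` column list, are assumptions made for this example; the simulation is a toy model of per-partition Bernoulli sampling, not Spark's actual code.

```python
import random


def sample_percent(target_rows, total_rows):
    """Percentage of the table to sample to get roughly target_rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    if not 0 < target_rows <= total_rows:
        raise ValueError("target_rows must be in (0, total_rows]")
    return 100.0 * target_rows / total_rows


def tablesample_query(columns, table, target_rows, total_rows):
    """Build a TABLESAMPLE(... PERCENT) query string (hypothetical helper)."""
    pct = sample_percent(target_rows, total_rows)
    # {:g} drops trailing zeros, so 3.0 is rendered as "3".
    return "select {} from {} TABLESAMPLE({:g} PERCENT)".format(
        columns, table, pct
    )


def simulate_partition_sampling(partition_sizes, fraction, seed=42):
    """Toy model: every partition keeps each of its rows with probability
    `fraction`, independently of the other partitions.
    Returns the number of rows kept per partition."""
    rng = random.Random(seed)
    return [
        sum(1 for _ in range(size) if rng.random() < fraction)
        for size in partition_sizes
    ]


if __name__ == "__main__":
    # Numbers from the question: 300 million out of 10 billion rows.
    query = tablesample_query("*", "my_table", 300000000, 10000000000)
    print(query)
    # With a live Spark 1.6 sqlContext (assumed, not created here), the
    # query would be used as in the question:
    #   sqlContext.sql(query).repartition(50) \
    #       .write.saveAsTable("temporary_table")

    # Scaled-down model: 50 partitions of 10,000 rows each, 3% sample.
    kept = simulate_partition_sampling([10000] * 50, 0.03)
    print("rows kept per partition (first 5):", kept[:5])
    print("total rows kept:", sum(kept))
```

The total kept in the simulation is close to, but not exactly, 3 percent of the input, which matches the approximate nature of percentage sampling. If an exact count of 300 million is required, sampling alone will not guarantee it.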
