In Pyspark HiveContext what is the equivalent of SQL OFFSET?


Problem description


Or, a more specific question: how can I process large amounts of data that don't fit into memory at once? With OFFSET I was trying to do hiveContext.sql("select ... limit 10 offset 10") while incrementing the offset to get all the data, but OFFSET doesn't seem to be valid within HiveContext. What is the alternative usually used to achieve this goal?

For some context, the PySpark code starts with:

from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("select ... limit 10 offset 10").show()

Solution

Your code will look something like this:

from pyspark.sql import HiveContext
hiveContext = HiveContext(sc)
hiveContext.sql("""
    WITH result AS (
        SELECT column1, column2, column3,
               ROW_NUMBER() OVER (ORDER BY columnname) AS RowNum
        FROM tablename)
    SELECT column1, column2, column3 FROM result
    WHERE RowNum >= OFFSETvalue AND RowNum < (OFFSETvalue + LIMITvalue)
""").show()

Note: replace the placeholders column1, column2, column3, columnname, tablename, OFFSETvalue, and LIMITvalue according to your requirements.
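
Putting that together, the usual way to page through a table whose rows don't all fit in driver memory is to re-run the ROW_NUMBER query with a moving window. Below is a minimal sketch of that loop; the table my_table, the ordering column id, and the page size are illustrative assumptions, not part of the original answer:

from pyspark.sql import HiveContext

hiveContext = HiveContext(sc)
page_size = 10  # assumed page size; pick one that fits in driver memory
offset = 0
while True:
    # Number the rows by a stable ordering column, then take one page.
    page = hiveContext.sql("""
        WITH result AS (
            SELECT t.*, ROW_NUMBER() OVER (ORDER BY t.id) AS RowNum
            FROM my_table t)
        SELECT * FROM result
        WHERE RowNum > {0} AND RowNum <= {0} + {1}
    """.format(offset, page_size))
    rows = page.collect()  # safe only because each page is small
    if not rows:
        break              # no rows left, the table is exhausted
    # ... process `rows` here ...
    offset += page_size

Note that each iteration re-scans and re-numbers the whole table, so this is expensive on large data; when possible it is better to express the processing as DataFrame transformations and let Spark handle data that exceeds memory, rather than pulling pages to the driver.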
