Reading DataFrame from partitioned parquet file

Question

How can I read a partitioned parquet file into a DataFrame with a condition?

This works fine:

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")

Partitions exist for day=1 through day=30. Is it possible to read only a subset, something like (day = 5 to 6) or day=5,day=6?

val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*")

If I put *, it reads all 30 days of data, and that is too big.

Answer

sqlContext.read.parquet can take multiple paths as input. If you want just day=5 and day=6, you can simply pass both paths:

val dataframe = sqlContext
  .read
  .parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",
           "file:///your/path/data=jDD/year=2015/month=10/day=6/")

If you have folders under day=X, such as country=XX, country will automatically be added as a column in the DataFrame.
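
For example (a sketch assuming a hypothetical nested layout with country=XX directories under each day folder):

// Reading a single day whose subdirectories are country=XX partitions;
// "country" shows up in the schema as an automatically discovered partition column.
val df = sqlContext.read.parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/")
df.printSchema()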

As of Spark 1.6, you need to provide a "basePath" option for Spark to generate the partition columns automatically. In Spark 1.6.x the above has to be rewritten like this to create a DataFrame with the columns "data", "year", "month" and "day":

val dataframe = sqlContext
  .read
  .option("basePath", "file:///your/path/")
  .parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",
           "file:///your/path/data=jDD/year=2015/month=10/day=6/")
