Reading DataFrame from partitioned parquet file
Question
How to read a partitioned parquet file with a condition as a dataframe?

This works fine:
val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")
Partitions exist for day=1 to day=30. Is it possible to read something like (day = 5 to 6), or day=5, day=6?
val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*")
If I put * it gives me all 30 days of data, and that is too big.
Answer
sqlContext.read.parquet can take multiple paths as input. If you want just day=5 and day=6, you can simply pass two paths, like:
val dataframe = sqlContext
.read.parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",
"file:///your/path/data=jDD/year=2015/month=10/day=6/")
If you have folders under day=X, like say country=XX, then country will automatically be added as a column in the dataframe.
As of Spark 1.6 one needs to provide a "basePath" option in order for Spark to generate the partition columns automatically. In Spark 1.6.x the above would have to be rewritten like this to create a dataframe with the columns "data", "year", "month" and "day":
val dataframe = sqlContext
.read
.option("basePath", "file:///your/path/")
.parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",
"file:///your/path/data=jDD/year=2015/month=10/day=6/")
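An alternative, not shown in the answer above, is to read the whole partitioned tree under basePath and filter on the partition column. Spark applies partition pruning to filters on partition columns, so only the matching day= directories are scanned rather than all 30. A sketch under the same assumed layout:

```scala
// Sketch, assuming the same layout and a Spark 1.6-style sqlContext.
// Filtering on the partition column "day" triggers partition pruning,
// so only day=5 and day=6 are actually read.
val pruned = sqlContext
  .read
  .option("basePath", "file:///your/path/")
  .parquet("file:///your/path/data=jDD/year=2015/month=10/")
  .filter("day >= 5 AND day <= 6")
```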