Reading DataFrame from partitioned parquet file
Question
How to read a partitioned parquet file with a condition as a dataframe?

This works fine:
val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")
Partitions exist for day=1 to day=30. Is it possible to read something like (day = 5 to 6), or day=5, day=6?
val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*")
If I put * it gives me all 30 days of data, and that is too big.
Answer
sqlContext.read.parquet can take multiple paths as input. If you want just day=5 and day=6, you can simply pass two paths, like:
val dataframe = sqlContext
.read.parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",
"file:///your/path/data=jDD/year=2015/month=10/day=6/")
If you have folders under day=X, like say country=XX, then country will automatically be added as a column in the dataframe.
As of Spark 1.6 one needs to provide a "basePath" option in order for Spark to generate the partition columns automatically. In Spark 1.6.x the above would have to be rewritten like this to create a dataframe with the columns "data", "year", "month" and "day":
val dataframe = sqlContext
.read
.option("basePath", "file:///your/path/")
.parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",
"file:///your/path/data=jDD/year=2015/month=10/day=6/")
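An alternative, not shown in the answer above, is to read the whole partitioned tree under basePath and filter on the partition column. Spark applies partition pruning to filters on partition columns, so only the matching day= directories are scanned rather than all 30. A sketch under the same assumed layout:

```scala
// Sketch, assuming the same layout and a Spark 1.6-style sqlContext.
// Filtering on the partition column "day" triggers partition pruning,
// so only day=5 and day=6 are actually read.
val pruned = sqlContext
  .read
  .option("basePath", "file:///your/path/")
  .parquet("file:///your/path/data=jDD/year=2015/month=10/")
  .filter("day >= 5 AND day <= 6")
```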