Partition a Spark DataFrame based on column value?


Problem description

I have a DataFrame from a SQL source which looks like:

User(id: Long, fname: String, lname: String, country: String)

[1, Fname1, Lname1, Belarus]
[2, Fname2, Lname2, Belgium]
[3, Fname3, Lname3, Austria]
[4, Fname4, Lname4, Australia]

I want to partition this data and write it to CSV files, where each partition is based on the initial letter of the country. Belarus and Belgium should end up together in one output file, and Austria and Australia in another.

Recommended answer

Here is what you can do:

import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF and the $"..." syntax (spark-shell imports it automatically)

// Create a DataFrame with demo data
val df = spark.sparkContext.parallelize(Seq(
  (1, "Fname1", "Lname1", "Belarus"),
  (2, "Fname2", "Lname2", "Belgium"),
  (3, "Fname3", "Lname3", "Austria"),
  (4, "Fname4", "Lname4", "Australia")
)).toDF("id", "fname", "lname", "country")

// Add a new column holding the first letter of the country
val result = df.withColumn("countryFirst", split($"country", "")(0))

// Write the data partitioned by that first letter: one subdirectory per distinct value
// (on Spark 2.0+ the built-in "csv" format can be used instead of the Databricks package name)
result.write.partitionBy("countryFirst").format("com.databricks.spark.csv").save("outputpath")
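To see what partitionBy actually produces, the output can be written to a scratch directory and inspected. The following is a minimal sketch, not part of the original answer: it assumes a local SparkSession and Spark 2.0+, and the object name, the helper partitionDirs, and the temporary path are all hypothetical. Spark writes one subdirectory per distinct value, named countryFirst=<value>, and stores the partition value in the directory name rather than inside the CSV files.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

object InspectPartitionedOutput {
  // Hypothetical helper: names of the partition subdirectories under an output path.
  def partitionDirs(path: String): Seq[String] =
    new File(path).listFiles().filter(_.isDirectory).map(_.getName).sorted.toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("inspect-partitioned-output")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    // Hypothetical scratch location; replace with the real output path.
    val out = Files.createTempDirectory("users_csv").resolve("out").toString

    Seq((1, "Fname1", "Lname1", "Belarus"), (3, "Fname3", "Lname3", "Austria"))
      .toDF("id", "fname", "lname", "country")
      .withColumn("countryFirst", substring(col("country"), 1, 1))
      .write.option("header", "true").partitionBy("countryFirst").csv(out)

    // One subdirectory per distinct first letter.
    partitionDirs(out).foreach(println)

    // Reading the root path restores countryFirst as a column (partition discovery),
    // so a single letter can be selected with an ordinary filter.
    spark.read.option("header", "true").csv(out)
      .filter(col("countryFirst") === "B")
      .show()

    spark.stop()
  }
}
```

Reading a single subdirectory directly (for example the countryFirst=B folder) also works, but the countryFirst column is then absent from the result because it only exists in the path.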

Edit: as Raphel suggested, you can also use substring, which can improve performance:

substring(Column str, int pos, int len): Substring starts at pos and is of length len when str is String type, or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.

val result = df.withColumn("countryFirst", substring($"country", 1, 1))

and then use partitionBy with write, as in the first example.
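Putting the substring variant together end to end might look like the sketch below. It is an illustration under stated assumptions, not the answerer's code: the object name, the helper withFirstLetter, and the output path are hypothetical, and a local SparkSession on Spark 2.0+ is assumed. The repartition call is an addition: partitionBy alone can leave several part files in one directory when rows for the same letter sit in different in-memory partitions, whereas repartitioning on the same column first normally yields a single part file per letter, which is what the question asks for.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, substring}

object PartitionByFirstLetter {
  // Hypothetical helper: adds the first character of `country` as `countryFirst`.
  // substring positions are 1-based, so (1, 1) means one character from position 1.
  def withFirstLetter(df: DataFrame): DataFrame =
    df.withColumn("countryFirst", substring(col("country"), 1, 1))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-by-first-letter")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1L, "Fname1", "Lname1", "Belarus"),
      (2L, "Fname2", "Lname2", "Belgium"),
      (3L, "Fname3", "Lname3", "Austria"),
      (4L, "Fname4", "Lname4", "Australia")
    ).toDF("id", "fname", "lname", "country")

    withFirstLetter(df)
      // Co-locate all rows that share a first letter before writing.
      .repartition(col("countryFirst"))
      .write
      .mode("overwrite")
      .option("header", "true")
      .partitionBy("countryFirst")
      .csv("/tmp/users_by_first_letter") // hypothetical output path

    spark.stop()
  }
}
```

If the source data is very large, a single file per letter can become a bottleneck, since one task writes each file; in that case dropping the repartition step trades the one-file guarantee for more write parallelism.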

Hope this solves your problem!
