How to assign and use column headers in Spark?
Question
I am reading a dataset as below:
f = sc.textFile("s3://test/abc.csv")
My file contains 50+ fields and I want to assign a column header to each field so I can reference them later in my script.
How do I do that in PySpark? Is DataFrame the way to go here?
PS - Newbie to Spark.
Solution

Here is how to add column names using a DataFrame:
Assume your CSV uses ',' as the delimiter. Prepare the data as follows before converting it to a DataFrame:
f = sc.textFile("s3://test/abc.csv")
data_rdd = f.map(lambda line: line.split(','))  # split each line on ',' into a list of fields
Suppose the data has 3 columns:
data_rdd.take(1)
[[u'1.2', u'red', u'55.6']]
Now, you can specify the column names when converting this RDD to a DataFrame using toDF():
df_withcol = data_rdd.toDF(['height','color','width'])
df_withcol.printSchema()
root
|-- height: string (nullable = true)
|-- color: string (nullable = true)
|-- width: string (nullable = true)
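
Once the columns are named, you can reference them by name later in the script, which is what the question asks for. A minimal sketch of such usage (the filter value 'red' is just an illustration):

# Select columns by name instead of by position.
df_withcol.select('height', 'color').show()

# Filter on a named column ('red' is an illustrative value).
df_withcol.filter(df_withcol['color'] == 'red').show()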
If you don't specify column names, you get a DataFrame with default column names '_1', '_2', ...:
df_default = data_rdd.toDF()
df_default.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
|-- _3: string (nullable = true)
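
With 50+ fields, typing the names inline gets unwieldy. One common pattern, sketched here under the assumption that the first line of the CSV is a header row (if it is not, build the list of names some other way), is to derive the names from the file itself:

f = sc.textFile("s3://test/abc.csv")
header = f.first()             # the first line, e.g. 'height,color,width,...'
col_names = header.split(',')

# Drop the header line, then split the remaining rows into fields.
data_rdd = f.filter(lambda line: line != header).map(lambda line: line.split(','))

df = data_rdd.toDF(col_names)
df.printSchema()

As in the examples above, every column comes out as a string; cast columns explicitly if you need numeric types.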