How to assign and use column headers in Spark?


Problem description

I am reading a dataset as below.

 f = sc.textFile("s3://test/abc.csv")

My file contains 50+ fields, and I want to assign a column header to each field so that I can reference it later in my script.

How do I do that in PySpark? Is a DataFrame the way to go here?

PS - Newbie to Spark.

Solution

Here is how to add column names using DataFrame:

Assume your CSV uses ',' as the delimiter. Prepare the data as follows before converting it to a DataFrame:

# Read the file as an RDD of lines, then split each line into a list of fields.
f = sc.textFile("s3://test/abc.csv")
data_rdd = f.map(lambda line: line.split(','))
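Note that a plain split on ',' treats every comma as a separator, so a quoted field that itself contains a comma would be broken into two columns. If the file might contain such fields, one possible alternative is to parse each line with Python's standard csv module. This is only a sketch: the helper name parse_csv_line and the sample line are hypothetical, not part of the original answer.

```python
import csv


def parse_csv_line(line, delimiter=","):
    """Split one line of CSV text into fields, honoring double-quoted fields."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line], delimiter=delimiter))


# Hypothetical sample line whose second field contains a comma inside quotes.
sample = '1.2,"red, dark",55.6'

naive = sample.split(",")        # splits inside the quoted field as well
parsed = parse_csv_line(sample)  # keeps the quoted field as one value
```

The helper could then replace the lambda, as in data_rdd = f.map(parse_csv_line). Because sc.textFile yields one record per physical line, this still would not handle quoted fields that span several lines.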

Suppose the data has 3 columns:

data_rdd.take(1)
[[u'1.2', u'red', u'55.6']]

Now you can specify the column names when converting this RDD to a DataFrame with toDF():

df_withcol = data_rdd.toDF(['height','color','width'])

df_withcol.printSchema()

    root
     |-- height: string (nullable = true)
     |-- color: string (nullable = true)
     |-- width: string (nullable = true)

If you don't specify column names, you get a DataFrame with default column names '_1', '_2', ...:

df_default = data_rdd.toDF()

df_default.printSchema()

    root
     |-- _1: string (nullable = true)
     |-- _2: string (nullable = true)
     |-- _3: string (nullable = true)
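Both schemas above show every column as a string, because the values come from splitting text. To use the named columns numerically later in a script, they would typically need to be cast first. The following is a minimal sketch under stated assumptions: the type mapping COLUMN_TYPES and the helpers cast_expressions and apply_types are hypothetical names, and the types chosen for height, color and width are guesses based on the sample row.

```python
# Hypothetical mapping from column name to the Spark SQL type it should have.
COLUMN_TYPES = {"height": "double", "color": "string", "width": "double"}


def cast_expressions(column_types):
    """Build one SQL expression per column that casts it and keeps its name."""
    return [
        "CAST(`{0}` AS {1}) AS `{0}`".format(name, dtype)
        for name, dtype in column_types.items()
    ]


def apply_types(df, column_types):
    """Return a new DataFrame with each listed column cast to its target type."""
    # DataFrame.selectExpr accepts SQL expression strings as separate arguments.
    return df.selectExpr(*cast_expressions(column_types))
```

With the DataFrame from the answer, typed = apply_types(df_withcol, COLUMN_TYPES) followed by typed.filter(typed.height > 1.0).select('color', 'width') would then reference the columns by name. For a file with 50+ fields, the mapping could live in one place at the top of the script so that the names and types are defined only once.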
