How to change dataframe column names in pyspark?
Question
I come from a pandas background and am used to reading data from CSV files into a dataframe and then simply changing the column names to something useful with a single command:
df.columns = new_column_name_list
However, the same doesn't work for pyspark dataframes created using sqlContext. The only way I could figure out to do this easily is the following:
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', inferschema='true', delimiter='\t').load("data.txt")
oldSchema = df.schema
for i, k in enumerate(oldSchema.fields):
    k.name = new_column_name_list[i]
df = sqlContext.read.format("com.databricks.spark.csv").options(header='false', delimiter='\t').load("data.txt", schema=oldSchema)
This basically defines the variable twice: it first infers the schema, then renames the columns, then loads the dataframe again with the updated schema.
Is there a better and more efficient way to do this, like we do in pandas?
My Spark version is 1.5.0.
Answer
There are many ways to do this:
Option 1. Using selectExpr.
data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                                  ["Name", "askdaosdka"])
data.show()
data.printSchema()
# Output
#+-------+----------+
#| Name|askdaosdka|
#+-------+----------+
#|Alberto| 2|
#| Dakota| 2|
#+-------+----------+
#root
# |-- Name: string (nullable = true)
# |-- askdaosdka: long (nullable = true)
df = data.selectExpr("Name as name", "askdaosdka as age")
df.show()
df.printSchema()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
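When there are many columns, the `"old as new"` strings for selectExpr can be built from two name lists instead of being written by hand. A minimal sketch of that string-building step (plain Python, no Spark required; the column names here are the ones from the example above):

```python
old_columns = ["Name", "askdaosdka"]
new_columns = ["name", "age"]

# Build one "old as new" expression per column pair
exprs = ["{} as {}".format(old, new)
         for old, new in zip(old_columns, new_columns)]
print(exprs)  # ['Name as name', 'askdaosdka as age']
```

On a real DataFrame you would then call `data.selectExpr(*exprs)`.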
Option 2. Using withColumnRenamed; notice that this method allows you to "overwrite" the same column. (On Python 2, range here was xrange.)
from functools import reduce
oldColumns = data.schema.names
newColumns = ["name", "age"]
df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx], newColumns[idx]),
            range(len(oldColumns)), data)  # use xrange on Python 2
df.printSchema()
df.show()
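The reduce call can be hard to parse: it threads the DataFrame through successive withColumnRenamed calls, one per column index, and each call returns a new DataFrame. The same folding pattern can be shown on plain Python data (a sketch, no Spark required; the dict here just stands in for a DataFrame):

```python
from functools import reduce

old_columns = ["Name", "askdaosdka"]
new_columns = ["name", "age"]

# Simulate a "DataFrame" as a dict of column -> values; each step returns
# a new dict with one column renamed, mirroring withColumnRenamed.
def rename_column(data, idx):
    return {new_columns[idx] if key == old_columns[idx] else key: values
            for key, values in data.items()}

data = {"Name": ["Alberto", "Dakota"], "askdaosdka": [2, 2]}
renamed = reduce(rename_column, range(len(old_columns)), data)
print(sorted(renamed))  # ['age', 'name']
```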
Option 3. Using alias; in Scala you can also use as.
from pyspark.sql.functions import col
data = data.select(col("Name").alias("name"), col("askdaosdka").alias("age"))
data.show()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
Option 4. Using sqlContext.sql, which lets you use SQL queries on DataFrames registered as tables.
sqlContext.registerDataFrameAsTable(data, "myTable")
df2 = sqlContext.sql("SELECT Name AS name, askdaosdka as age from myTable")
df2.show()
# Output
#+-------+---+
#| name|age|
#+-------+---+
#|Alberto| 2|
#| Dakota| 2|
#+-------+---+
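As with selectExpr, the SELECT clause for this query can be generated from the two name lists when there are many columns. A sketch of that query-building step (plain Python; "myTable" is the table name registered above):

```python
old_columns = ["Name", "askdaosdka"]
new_columns = ["name", "age"]

# Join "old AS new" pairs into a single SELECT clause
select_list = ", ".join("{} AS {}".format(old, new)
                        for old, new in zip(old_columns, new_columns))
query = "SELECT {} FROM myTable".format(select_list)
print(query)  # SELECT Name AS name, askdaosdka AS age FROM myTable
```

The resulting string would then be passed to `sqlContext.sql(query)`.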