Usage of Spark DataFrame "as" method


Question


I am looking at the spark.sql.DataFrame documentation.

There is

def as(alias: String): DataFrame
    Returns a new DataFrame with an alias set.
    Since: 1.3.0

What is the purpose of this method? How is it used? Can there be an example?

I have not managed to find anything about this method online and the documentation is pretty non-existent. I have not managed to make any kind of alias using this method.

Solution

Spark <= 1.5

It is more or less equivalent to SQL table aliases:

SELECT *
FROM table AS alias;

Example usage adapted from PySpark alias documentation:

import org.apache.spark.sql.functions.col

case class Person(name: String, age: Int)

val df = sqlContext.createDataFrame(
    Person("Alice", 2) :: Person("Bob", 5) :: Nil)

// Alias the same DataFrame twice so each side of the self-join
// can be referenced unambiguously through its alias:
val df_as1 = df.as("df1")
val df_as2 = df.as("df2")

val joined_df = df_as1.join(
    df_as2, col("df1.name") === col("df2.name"), "inner")

// Qualify each column with the alias it belongs to:
joined_df.select(
    col("df1.name"), col("df2.name"), col("df2.age")).show

Output:

+-----+-----+---+
| name| name|age|
+-----+-----+---+
|Alice|Alice|  2|
|  Bob|  Bob|  5|
+-----+-----+---+

The same thing using a SQL query:

df.registerTempTable("df")
sqlContext.sql("""SELECT df1.name, df2.name, df2.age
                  FROM df AS df1 JOIN df AS df2
                  ON df1.name == df2.name""")

What is the purpose of this method?

Pretty much avoiding ambiguous column references.
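
For contrast, here is a minimal sketch of the problem aliases solve, reusing df from above (the single-column join overload used here requires Spark 1.4+):

// Joining df to itself on "name" leaves two columns named "age",
// so an unqualified reference to one of them is ambiguous:
val no_alias = df.join(df, "name")
// no_alias.select(col("age"))  // AnalysisException: Reference 'age' is ambiguous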

Spark 1.6+

There is also a new as[U](implicit arg0: Encoder[U]): Dataset[U], which is used to convert a DataFrame to a Dataset of a given type. For example:

df.as[Person]
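
A minimal sketch of that conversion, assuming the Person case class from above and a Spark 1.6 sqlContext (the implicits import supplies the Encoder for case classes):

import sqlContext.implicits._  // brings Encoder[Person] into scope

val ds: org.apache.spark.sql.Dataset[Person] = df.as[Person]
ds.filter(_.age > 3).show()  // typed operations work directly on Person fields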
