Spark dataframe: collect() vs select()

Problem Description

Calling collect() on an RDD will return the entire dataset to the driver which can cause out of memory and we should avoid that.

Will collect() behave the same way if called on a dataframe?
What about the select() method?
Does it also work the same way as collect() if called on a dataframe?

Solution

Actions vs Transformations

  • Collect (Action) - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
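
To make that concrete, here is a minimal PySpark sketch of the "filter first, then collect" pattern the quote describes. The session setup and the sample data are illustrative, not part of the original answer.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-example").getOrCreate()

df = spark.createDataFrame([("Alice", 2), ("Bob", 5), ("Carol", 40)],
                           ["name", "age"])

# filter() is a transformation: no job runs yet
small = df.filter(df.age < 10)

# collect() is an action: the (hopefully small) result is materialized
# and shipped to the driver as a list of Row objects
rows = small.collect()
print(rows)  # [Row(name='Alice', age=2), Row(name='Bob', age=5)]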

spark-sql doc

select(*cols) (transformation) - Projects a set of expressions and returns a new DataFrame.

Parameters: cols – list of column names (string) or expressions (Column). If one of the column names is ‘*’, that column is expanded to include all columns in the current DataFrame.

>>> df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
>>> df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
>>> df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]

Executing the select(column-name1, column-name2, etc.) method on a dataframe returns a new dataframe that holds only the columns selected in the select() function.

e.g. assume df has several columns, including "name", "value", and some others.

df2 = df.select("name","value")

df2 will hold only two columns ("name" and "value") out of all the columns of df.

df2, being the result of a select (a transformation), will stay on the executors and will not be brought to the driver (as would happen with collect()).
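
A quick way to see the difference, reusing the hypothetical df with "name" and "value" columns from above: select() returns another DataFrame and triggers no job, while collect() runs a job and pulls Rows into driver memory.

df2 = df.select("name", "value")
print(type(df2))       # <class 'pyspark.sql.dataframe.DataFrame'> - still distributed
rows = df2.collect()   # action: a Spark job runs now
print(type(rows))      # <class 'list'> - the Rows now live in driver memory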

sql-programming-guide

df.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)

# Select only the "name" column
df.select("name").show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+

You can also run collect() on a dataframe (spark docs):

>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> spark.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]

spark docs

To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).
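
In PySpark the same advice reads roughly as follows; a sketch that reuses the spark session from the earlier snippet, with an illustrative RDD:

rdd = spark.sparkContext.parallelize(range(1000))

# Unsafe on large data: collect() pulls the entire RDD to the driver
# rows = rdd.collect()

# Safer: bring only the first 100 elements to the driver
for x in rdd.take(100):
    print(x)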
