Spark dataframe: collect() vs select()


Question

Calling collect() on an RDD will return the entire dataset to the driver, which can cause an out-of-memory error, so we should avoid that.

Will collect() behave the same way if called on a dataframe?
What about the select() method?
Does it also work the same way as collect() if called on a dataframe?

Answer

Actions vs Transformations

  • Collect (Action) - Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
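The distinction above can be sketched in plain Python (this is a conceptual analogy using generators, not Spark's API): a transformation only builds a lazy recipe, while an action forces evaluation and materializes the result locally.

```python
# Conceptual sketch of "transformation vs action" using plain Python.
# The generator plays the role of a lazy transformation; list() plays
# the role of an action like collect().
data = range(1_000_000)

# "transformation": nothing is computed yet, just a lazy pipeline
small_subset = (x for x in data if x < 5)

# "action": forces evaluation; safe here because the filter already
# shrank the data to a small subset before materializing it
result = list(small_subset)
print(result)  # [0, 1, 2, 3, 4]
```

As in Spark, collecting is only safe when the preceding transformations have reduced the data to something that fits in one process's memory.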

spark-sql docs

select(*cols) (transformation) - Projects a set of expressions and returns a new DataFrame.

Parameters: cols – list of column names (string) or expressions (Column). If one of the column names is '*', that column is expanded to include all columns in the current DataFrame.

df.select('*').collect()
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
df.select('name', 'age').collect()
[Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
df.select(df.name, (df.age + 10).alias('age')).collect()
[Row(name=u'Alice', age=12), Row(name=u'Bob', age=15)]

Executing select() on a dataframe

Executing the select(column-name1, column-name2, etc.) method on a dataframe returns a new dataframe which holds only the columns selected in the select() function.

e.g. assuming df has several columns, including "name" and "value", among others.

df2 = df.select("name","value")

df2 will hold only the "name" and "value" columns of df.

df2, as the result of select, will be in the executors and not in the driver (as in the case of using collect()).
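A toy model (plain Python, not Spark's real implementation) can make this concrete: partitions stay spread across "executors", select projects columns within each partition without moving data, and only collect gathers everything to the "driver".

```python
# Toy model of distributed column projection. The names `partitions`,
# `select`, and `collect` here are illustrative stand-ins, not Spark APIs.
partitions = [
    [{"name": "Alice", "age": 2, "city": "X"}],   # lives on "executor 1"
    [{"name": "Bob", "age": 5, "city": "Y"}],     # lives on "executor 2"
]

def select(parts, *cols):
    # "transformation": projects columns inside each partition;
    # the result is still distributed, nothing is gathered
    return [[{c: row[c] for c in cols} for row in part] for part in parts]

def collect(parts):
    # "action": flattens every partition into one local list on the "driver"
    return [row for part in parts for row in part]

projected = select(partitions, "name", "age")
print(collect(projected))
```

The design point this mirrors is that projection is per-partition work with no data movement; only the action at the end concentrates rows on a single machine.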

sql programming guide

df.printSchema()
# root
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)

# Select only the "name" column
df.select("name").show()
# +-------+
# |   name|
# +-------+
# |Michael|
# |   Andy|
# | Justin|
# +-------+

You can run collect() on a dataframe (spark docs):

>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> spark.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]

spark docs

To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node, thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use take(): rdd.take(100).foreach(println).
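The take-vs-collect trade-off can also be sketched in plain Python (a conceptual analogy, not Spark's API): pulling only the first n elements from a huge lazy source keeps memory bounded, whereas materializing the whole source would not.

```python
from itertools import islice

def rows():
    # Stands in for a very large distributed dataset; yields rows lazily,
    # one at a time, so nothing is materialized up front.
    for i in range(10**9):
        yield i

# collect()-style: list(rows()) would try to materialize a billion rows.
# take(n)-style: pull only the first n rows, bounded memory.
first_five = list(islice(rows(), 5))
print(first_five)  # [0, 1, 2, 3, 4]
```

Like rdd.take(100), islice stops consuming the source as soon as it has the requested number of elements.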

