What is the Spark DataFrame method `toPandas` actually doing?


Problem description

I'm a beginner of Spark-DataFrame API.

I use this code to load csv tab-separated into Spark Dataframe

from pyspark.sql.types import StructType, StructField, StringType

lines = sc.textFile('tail5.csv')
parts = lines.map(lambda l: l.strip().split('\t'))
fnames = *some name list*
schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
ddf = sqlContext.createDataFrame(parts, schemaData)

Suppose I create DataFrame with Spark from new files, and convert it to pandas using built-in method toPandas(),

  • Does it store the Pandas object in local memory?
  • Are the underlying Pandas computations all handled by Spark?
  • Does it expose all of the Pandas DataFrame functionality? (I'd guess so)
  • Can I just convert it to Pandas and be done with it, without touching the DataFrame API much?

Answer

Using spark to read in a CSV file to pandas is quite a roundabout method for achieving the end goal of reading a CSV file into memory.

It seems like you might be misunderstanding the use cases of the technologies in play here.

Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.

In your example, the sc.textFile method will simply give you a spark RDD that is effectively a list of text lines. This likely isn't what you want. No type inference will be performed, so if you want to sum a column of numbers in your CSV file, you won't be able to because they are still strings as far as Spark is concerned.
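To make that concrete, here is a plain-Python sketch of what those text lines look like and why summing fails until you cast explicitly (the sample rows are made up for illustration):

```python
# What sc.textFile hands back: effectively a list of raw text lines.
# (hypothetical sample rows, tab-separated like the question's CSV)
rows = ["alice\t30", "bob\t25"]
parts = [r.strip().split("\t") for r in rows]
ages = [p[1] for p in parts]           # ['30', '25'] -- still strings
try:
    total = sum(ages)                  # TypeError: can't add str to int 0
except TypeError:
    total = sum(int(a) for a in ages)  # an explicit cast is required
print(total)                           # 55
```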

Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.
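A minimal sketch of the pandas alternative, using an in-memory string in place of the actual `tail5.csv` file:

```python
import io
import pandas as pd

# In-memory stand-in for the real tab-separated CSV file
csv_text = "name\tage\nalice\t30\nbob\t25\n"
df = pd.read_csv(io.StringIO(csv_text), sep="\t")
print(df["age"].dtype)  # int64 -- pandas inferred the numeric type
print(df["age"].sum())  # 55, no manual casting needed
```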

Now to answer your questions:

Does it store the Pandas object in local memory:

Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.

Are the underlying Pandas computations all handled by Spark:

No. Pandas runs its own computations, there's no interplay between spark and pandas, there's simply some API compatibility.

Does it expose all of the Pandas DataFrame functionality?

No. For example, Series objects have an interpolate method which isn't available in PySpark Column objects. There are many many methods and functions that are in the pandas API that are not in the PySpark API.
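As a small illustration of that gap, here is `interpolate` on a pandas Series, which has no direct PySpark Column equivalent:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
filled = s.interpolate()  # linear interpolation fills the missing value
print(filled.tolist())    # [1.0, 2.0, 3.0]
```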

Can I just convert it to Pandas and be done with it, without touching the DataFrame API much?

Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.

Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.

