How is a Spark DataFrame better than a Pandas DataFrame in performance?


Problem description

Can anyone explain how Spark DataFrames compare with Pandas DataFrames in execution time? I'm dealing with data of moderate volume and applying transformations driven by Python functions.

For example, I have a column with numbers from 1 to 100,000 in my dataset and want to perform a basic numeric operation: creating a new column that is the cube of the existing numeric column.

from datetime import datetime

import numpy as np
import pandas as pd

def cube(num):
    return num ** 3

array_of_nums = np.arange(0, 100000)

dataset = pd.DataFrame(array_of_nums, columns=["numbers"])

start_time = datetime.now()
# Some complex transformations...
dataset["cubed"] = [cube(x) for x in dataset.numbers]
end_time = datetime.now()

print("Time taken :", end_time - start_time)

The output is

Time taken : 0:00:00.109349

If I use a Spark DataFrame with 10 worker nodes, can I expect the following result (1/10th of the time taken by the Pandas DataFrame)?

Time taken : 0:00:00.010935

Recommended answer

1) A Pandas DataFrame is not distributed, while Spark's DataFrame is distributed. Hence you won't get the benefit of parallel processing with a Pandas DataFrame, and its processing speed will drop for large volumes of data.

2) A Spark DataFrame gives you fault tolerance (it is resilient), while a Pandas DataFrame does not. Hence, if your data processing is interrupted or fails partway through, Spark can regenerate the failed result set from its lineage (the DAG). Fault tolerance is not supported in Pandas; you would need to implement your own framework to provide it.
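A rough sketch of what "implement your own framework" could look like in pandas: a hypothetical helper (all names illustrative) that processes the data in chunks and checkpoints each finished chunk to disk, so a restarted run skips work that already completed:

```python
import os
import tempfile

import numpy as np
import pandas as pd

def process_with_checkpoints(df, chunk_size, checkpoint_dir):
    """Apply the cube transform chunk by chunk, persisting each finished
    chunk so a crashed run can resume without redoing completed chunks."""
    parts = []
    for start in range(0, len(df), chunk_size):
        path = os.path.join(checkpoint_dir, f"chunk_{start}.pkl")
        if os.path.exists(path):
            # This chunk finished in a previous run: reload instead of recompute.
            parts.append(pd.read_pickle(path))
            continue
        chunk = df.iloc[start:start + chunk_size].copy()
        chunk["cubed"] = chunk["numbers"] ** 3
        chunk.to_pickle(path)  # checkpoint the finished chunk
        parts.append(chunk)
    return pd.concat(parts, ignore_index=True)

dataset = pd.DataFrame(np.arange(0, 100000), columns=["numbers"])
with tempfile.TemporaryDirectory() as d:
    result = process_with_checkpoints(dataset, chunk_size=25000, checkpoint_dir=d)
print(result["cubed"].iloc[:3].tolist())
```

This only restores completed chunks after a crash; Spark's lineage goes further by recomputing lost partitions automatically, with no bookkeeping code on your side.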
