Spark Dataframe not distributed

Problem description

I can't understand why my dataframe is only on one node. I have a Spark standalone cluster of 14 machines, each with 4 physical CPUs.

I am connected through a notebook and create my Spark context:
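The question does not show the actual notebook code, but the setup might look like the following sketch. The master URL and app name are placeholders, not values from the question:

```python
from pyspark.sql import SparkSession

# Placeholder master URL for a standalone cluster; substitute your own.
spark = (
    SparkSession.builder
    .master("spark://master-host:7077")
    .appName("distribution-check")
    .getOrCreate()
)

# Total number of cores the standalone master granted to this application.
print(spark.sparkContext.defaultParallelism)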

I expect a parallelism of 8 partitions, but when I create a dataframe I get only one partition:

What am I missing?

Thanks to the answer from user8371915, I repartitioned my dataframe (I was reading a compressed file (.csv.gz), so I understand it is not splittable).

But when I do a "count" on it, I see it being executed on only one executor: here, on executor n°1, even though the file is 700 MB and spans 6 blocks on HDFS. As far as I understand, the computation should run over 10 cores, over 5 nodes... but everything is calculated on only one node :-(

Answer

There are two possibilities:

- the input data is small enough that Spark puts it into a single partition, or
- the file is compressed with a non-splittable codec (such as gzip), so Spark cannot split it across workers.

In the first case you may consider adjusting the parameters, but if you are using the defaults, the input is simply already small.

In the second case it is best to unpack the file before loading it into Spark. If you cannot do that, repartition after loading, but it will be slow.
