Spark Dataframe not distributed

Problem description

I can't understand why my dataframe is only on one node. I have a Spark standalone cluster of 14 machines, each with 4 physical CPUs.

I am connected through a notebook and create my Spark context:
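The question does not show the actual notebook code, but the setup might look like the following sketch. The master URL and app name are placeholders, not values from the question:

```python
from pyspark.sql import SparkSession

# Placeholder master URL for a standalone cluster; substitute your own.
spark = (
    SparkSession.builder
    .master("spark://master-host:7077")
    .appName("distribution-check")
    .getOrCreate()
)

# Total number of cores the standalone master granted to this application.
print(spark.sparkContext.defaultParallelism)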

I expect a parallelism of 8 partitions, but when I create a dataframe I get only one partition:

What am I missing?

Thanks to the answer from user8371915, I repartitioned my dataframe (I was reading a compressed file (.csv.gz), so I understand it is not splittable).

But when I do a "count" on it, I see it being executed on only one executor: here, on executor n°1, even though the file is 700 MB and spans 6 blocks on HDFS. As far as I understand, the computation should run over 10 cores, over 5 nodes... but everything is calculated on only one node :-(

Answer

There are two possibilities:

- the input data is small enough that Spark puts it into a single partition, or
- the file is compressed with a non-splittable codec (such as gzip), so Spark cannot split it across workers.

In the first case you may consider adjusting the parameters, but if you are using the defaults, the input is simply already small.

In the second case it is best to unpack the file before loading it into Spark. If you cannot do that, repartition after loading, but it will be slow.
