Spark CSV reading speed is very slow although I increased the number of nodes


Problem Description

I created two clusters on Google Compute Engine, and those clusters read 100 GB of data.

Cluster I: 1 master (15 GB memory, 250 GB disk), 10 nodes (7.5 GB memory, 200 GB disk each)

Cluster II: 1 master (15 GB memory, 250 GB disk), 150 nodes (1.7 GB memory, 200 GB disk each)

I am reading the file with:

val df = spark.read.format("csv")
    .option("inferSchema", true)
    .option("maxColumns",900000)
    .load("hdfs://master:9000/tmp/test.csv")

The dataset contains 55k rows and 850k columns.

Q1: I did not see a significant increase in reading speed although I increased the number of machines. What is wrong, or what should I do to make this process faster? Should I increase the number of nodes further?

Q2: For Spark, is it more important to increase the number of machines or the amount of memory to get better speed? Is there a performance relationship between nodes, memory, and speed?

Q3: Also, Hadoop copy and move commands are very slow. The data is only 100 GB. How do big companies deal with terabytes of data? I was not able to get any increase in data reading speed.

Thanks for your answers.

Recommended Answer

TL;DR Spark SQL (as well as Spark in general and other projects sharing a similar architecture and design) is primarily designed to handle long and (relatively) narrow data. This is the exact opposite of your data, where the input is wide and (relatively) short.

Remember that although Spark uses columnar formats for caching, its core processing model handles rows (records) of data. If data is wide but short, this not only limits the ability to distribute the data but, more importantly, leads to the initialization of very large objects. That has a detrimental impact on overall memory management and the garbage collection process (see What is a large object for JVM GC).

Using very wide data with Spark SQL causes additional problems. Different optimizer components have non-linear complexity in the number of expressions used in a query. This is usually not a problem when the data is narrow (< 1K columns), but it can easily become a bottleneck with wider datasets.

Additionally, you are using an input format that is not well suited for high-performance analytics, together with expensive reader options (schema inference).
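For example, a minimal sketch of skipping the inference pass by supplying the schema up front (the column names and types below are assumptions, since the real ones are not shown in the question):

import org.apache.spark.sql.types._

// Assumed layout: one string id column followed by 850k numeric columns.
// Building the schema programmatically avoids the extra pass over the
// 100 GB input that inferSchema = true would trigger.
val fields = StructField("id", StringType, nullable = true) +:
  (1 until 850000).map(i => StructField(s"c$i", DoubleType, nullable = true))

val df = spark.read.format("csv")
    .schema(StructType(fields))          // explicit schema, no inference pass
    .option("maxColumns", 900000)
    .load("hdfs://master:9000/tmp/test.csv")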

Depending on what you know about the data and how you plan to process it later, you can try to address some of these issues, for example by converting to long format on read, or by encoding the data directly using some sparse representation (if applicable); a rough sketch of the long-format idea follows below.
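The sketch assumes a leading string id column and double-typed value columns (both assumptions), and keeps only populated, non-zero cells:

import spark.implicits._

// Assumed layout: first column is an id, the remaining columns hold doubles.
val valueColumns = df.columns.drop(1)

val longDf = df.rdd.flatMap { row =>
  val id = row.getString(0)
  valueColumns.indices.iterator.collect {
    // keep only non-null, non-zero cells for a sparse, long representation
    case i if !row.isNullAt(i + 1) && row.getDouble(i + 1) != 0.0 =>
      (id, valueColumns(i), row.getDouble(i + 1))
  }
}.toDF("id", "column", "value")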

Other than that, your best option is careful memory and GC tuning based on runtime statistics.
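As a purely illustrative starting point (the values below are placeholders to be adjusted against what the Spark UI and GC logs actually show), such tuning is usually expressed through standard Spark configuration properties:

import org.apache.spark.sql.SparkSession

// Placeholder values only; tune them against observed GC time, spill and
// task-level skew rather than copying them as-is.
val spark = SparkSession.builder()
  .appName("wide-csv-read")
  .config("spark.executor.memory", "6g")                     // executor heap
  .config("spark.executor.memoryOverhead", "1g")             // off-heap headroom
  .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC") // GC choice
  .getOrCreate()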
