HDFS performance for small files


Problem Description



I'm new to Hadoop. Recently I've been trying to process (only read) many small files on HDFS/Hadoop. The average file size is about 1 KB and the number of files is more than 10M. The program must be written in C++ due to some limitations.

This is just a performance evaluation, so I'm only using 5 machines as data nodes. Each data node has 5 data disks.

I wrote a small C++ program that reads the files directly from the hard disks (not from HDFS) to establish a performance baseline. The program creates 4 reading threads per disk. The result is about 14 MB/s per disk, so the total throughput is about 14 MB/s * 5 disks * 5 machines = 350 MB/s.
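For reference, a minimal sketch of the kind of direct-disk baseline reader described above. The mount points, file layout, and file counts are hypothetical placeholders, and error handling is trimmed:

```cpp
#include <atomic>
#include <cstdio>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

constexpr int kDisks = 5;
constexpr int kThreadsPerDisk = 4;
constexpr int kFilesPerDisk = 100000;  // hypothetical count

std::atomic<long long> g_bytes{0};

// Hypothetical layout: disk d is mounted at /data/diskD and holds files
// f0..fN; each of the disk's 4 threads takes a round-robin slice.
std::vector<std::string> files_for(int disk, int thread_id) {
    std::vector<std::string> out;
    for (int i = thread_id; i < kFilesPerDisk; i += kThreadsPerDisk)
        out.push_back("/data/disk" + std::to_string(disk) +
                      "/f" + std::to_string(i));
    return out;
}

// Sequentially read every file in this thread's slice, counting bytes.
void read_slice(std::vector<std::string> files) {
    std::vector<char> buf(4096);
    for (const auto& path : files) {
        std::ifstream in(path, std::ios::binary);
        while (in.read(buf.data(), buf.size()) || in.gcount() > 0)
            g_bytes += in.gcount();
    }
}

int main() {
    std::vector<std::thread> pool;
    for (int d = 0; d < kDisks; ++d)
        for (int t = 0; t < kThreadsPerDisk; ++t)
            pool.emplace_back(read_slice, files_for(d, t));
    for (auto& th : pool) th.join();
    std::printf("total bytes read: %lld\n", g_bytes.load());
    return 0;
}
```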

However, when this program (still C++, dynamically linked against libhdfs.so, creating 4 * 5 * 5 = 100 threads) reads the same files from the HDFS cluster, the throughput is only about 55 MB/s.
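The libhdfs side of such a reader looks roughly like this; a sketch assuming one connection per reader thread, with the file-list plumbing from the baseline version reused. The calls are the standard libhdfs C API, but everything else is a placeholder:

```cpp
#include <fcntl.h>   // O_RDONLY
#include <string>
#include <vector>
#include "hdfs.h"    // libhdfs C API; link against libhdfs.so

// One reader thread's loop against HDFS. "default" makes libhdfs use the
// filesystem named in the loaded Hadoop configuration; an explicit
// ("namenode-host", port) pair would work too.
long long read_slice_hdfs(const std::vector<std::string>& files) {
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) return -1;

    long long total = 0;
    std::vector<char> buf(4096);
    for (const auto& path : files) {
        // bufferSize/replication/blocksize of 0 mean "use defaults".
        hdfsFile f = hdfsOpenFile(fs, path.c_str(), O_RDONLY, 0, 0, 0);
        if (!f) continue;  // skip unreadable files in this sketch
        tSize n;
        while ((n = hdfsRead(fs, f, buf.data(),
                             static_cast<tSize>(buf.size()))) > 0)
            total += n;
        hdfsCloseFile(fs, f);
    }
    hdfsDisconnect(fs);
    return total;
}
```

Note that every 1 KB file still costs a NameNode lookup plus a DataNode connection, which is where much of the throughput difference against raw disk reads comes from.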

If this program is launched through MapReduce (Hadoop streaming, 5 jobs, each with 20 threads, so still 100 threads in total), the throughput drops to about 45 MB/s. (I guess it is slowed down by some bookkeeping process.)

I'm wondering what performance HDFS can reasonably provide. As you can see, compared with the native code, the data throughput is only about 1/7. Is it a problem with my configuration? An HDFS limitation? A Java limitation? What's the best approach for my scenario? Would sequence files help (much)? What is a reasonable throughput to expect compared with native I/O reads?

Here's some of my config:

NameNode heap size: 32 GB

Job/Task node heap size: 8 GB

NameNode Handler Count: 128

DataNode Handler Count: 8

DataNode Maximum Number of Transfer Threads: 4096

1 Gbps Ethernet.

Thanks.

Solution

Let's try to understand our limits and see when we hit them:
a) We need the NameNode to give us information about where the files are located. I assume it can handle around a few thousand requests per second; more information here: https://issues.apache.org/jira/browse/HADOOP-2149. Assuming it serves about 10,000 lookups per second, then with 1 KB files we should be able to fetch about 10 MB per second. (Somehow you get more...)
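Spelling out the arithmetic behind (a), using the assumed 10,000 lookups/s and the 1 KB average file size from the question:

```latex
10{,}000 \;\tfrac{\text{lookups}}{\text{s}} \times 1 \;\tfrac{\text{KB}}{\text{lookup}} \approx 10 \;\tfrac{\text{MB}}{\text{s}}
```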
b) The overhead of HDFS. This overhead shows up mostly as latency, not as reduced throughput. HDFS can be tuned to serve a lot of files in parallel; HBase does exactly this, and we can borrow settings from HBase tuning guides. The real question here is how many DataNodes you need.
c) Your LAN. You move the data over the network, so you might hit the 1 Gbps Ethernet throughput limit. (I think that's what you got.)

I also have to agree with Joe - HDFS is not built for this scenario, and you should either use another technology (like HBase, if you like the Hadoop stack) or pack the files together - for example, into sequence files.
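The SequenceFile API itself is Java, but the packing idea is independent of it. A hypothetical C++ sketch of a naive container (not SequenceFile's actual on-disk format) that appends length-prefixed name/contents records to one large file:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Append one small file as a record: [name len][name][data len][data].
// Reading the pack back becomes one large sequential scan, and storing it
// as a single HDFS file costs one NameNode entry instead of millions.
void append_record(std::ofstream& pack,
                   const std::string& name,        // key: original file name
                   const std::vector<char>& data)  // value: file contents
{
    std::uint32_t klen = static_cast<std::uint32_t>(name.size());
    std::uint32_t vlen = static_cast<std::uint32_t>(data.size());
    pack.write(reinterpret_cast<const char*>(&klen), sizeof klen);
    pack.write(name.data(), klen);
    pack.write(reinterpret_cast<const char*>(&vlen), sizeof vlen);
    pack.write(data.data(), vlen);
}
```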

Regarding reading bigger files from HDFS: run the DFSIO benchmark, and that will be your number.
At the same time, an SSD on a single host can also perfectly well be a solution.
