Hadoop: Processing large serialized objects


Problem description

I am developing an application to process (and merge) several large Java serialized objects (on the order of GBs in size) using the Hadoop framework. Hadoop distributes the blocks of a file across different hosts. But since deserialization requires all the blocks to be present on a single host, this will hurt performance drastically. How can I deal with this situation, where the different blocks cannot be processed individually, unlike text files?

Recommended answer

There are two issues: one is that each file must (in the initial stage) be processed in whole: the mapper that sees the first byte must handle all the rest of that file. The other problem is locality: for best efficiency, you'd like all the blocks for each such file to reside on the same host.

Processing files in whole:

One simple trick is to have the first-stage mapper process a list of filenames, not their contents. If you want 50 map jobs to run, make 50 files, each with that fraction of the filenames. This is easy and works with Java or streaming Hadoop.
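As a rough illustration of that trick (not code from the original answer), a first-stage mapper in the mapreduce API could treat each input line as an HDFS path and deserialize that file in full; the class name and output types below are placeholders:

    import java.io.IOException;
    import java.io.ObjectInputStream;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical first-stage mapper: each input line is an HDFS path, so the
    // whole serialized object is read by exactly one mapper, regardless of how
    // its blocks are laid out across the cluster.
    public class FilenameListMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Path path = new Path(value.toString().trim());        // one filename per input line
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            try (ObjectInputStream in = new ObjectInputStream(fs.open(path))) {
                Object obj = in.readObject();                     // deserialize the whole file here
                context.write(new Text(path.getName()), new Text(obj.toString()));
            } catch (ClassNotFoundException e) {
                throw new IOException("Cannot deserialize " + path, e);
            }
        }
    }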

Alternatively, use a non-splittable input format such as NonSplitableTextInputFormat.
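If that class is not available in your Hadoop version, one common way to get the same behaviour (a sketch, not the answer's own code) is to subclass TextInputFormat and refuse to split, so each file goes to exactly one mapper:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // Never split files: one file -> one InputSplit -> one mapper.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }

Register it on the job with job.setInputFormatClass(WholeFileTextInputFormat.class).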

For more details, see "How do I process files, one per map?" and "How do I get each of my maps to work on one complete input-file?" on the Hadoop wiki.

Locality:

This leaves a problem, however: the blocks you are reading from are distributed all across the HDFS, which is normally a performance gain but here a real problem. I don't believe there's any way to chain certain blocks to travel together in the HDFS.

Is it possible to place the files in each node's local storage? This is actually the most performant and easiest way to solve this: have each machine start jobs to process all the files in e.g. /data/1/**/*.data (being as clever as you care to be about efficiently using local partitions and number of CPU cores). A standalone sketch of that per-node approach follows.
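This sketch assumes the data really is on each node's local disk under /data/1, and processFile stands in for whatever the actual deserialize-and-merge logic would be:

    import java.io.IOException;
    import java.nio.file.FileSystems;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.PathMatcher;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class LocalDataWalker {
        public static void main(String[] args) throws IOException {
            // Match local files like /data/1/some/dir/part-0007.data on this node.
            PathMatcher matcher =
                    FileSystems.getDefault().getPathMatcher("glob:/data/1/**/*.data");
            try (Stream<Path> paths = Files.walk(Paths.get("/data/1"))) {
                paths.filter(Files::isRegularFile)
                     .filter(matcher::matches)
                     .parallel()                      // crude way to keep the local cores busy
                     .forEach(LocalDataWalker::processFile);
            }
        }

        private static void processFile(Path p) {
            // placeholder: deserialize and merge the object stored at p
            System.out.println("processing " + p);
        }
    }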

If the files originate from a SAN or from say s3 anyway, try just pulling from there directly: it's built to handle the swarm.

A note on using the first trick: If some of the files are much larger than others, put them alone in the earliest-named listing, to avoid issues with speculative execution. You might turn off speculative execution for such jobs anyway if the tasks are dependable and you don't want some batches processed multiple times.
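For reference, turning speculative execution off amounts to a couple of configuration flags; this driver skeleton is only a sketch (older clusters use the mapred.*.tasks.speculative.execution property names instead):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class NoSpeculationJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setBoolean("mapreduce.map.speculative", false);      // no duplicate map attempts
            conf.setBoolean("mapreduce.reduce.speculative", false);   // no duplicate reduce attempts

            Job job = Job.getInstance(conf, "merge-serialized-objects");
            // ... set mapper, input format (e.g. the filename lists above) and paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }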

