How to share a global sequential number generator in Hadoop?


Problem description

I am using Hadoop to process data that will ultimately be loaded into a single table. I need a shared sequential number generator to produce an id for each row. Currently I generate the unique numbers as follows:

1) Create a text file in HDFS, e.g. test.seq, that stores the current sequential number.

2) Use a lock file, ".lock", to control concurrency. Suppose two tasks process the data in parallel. When task1 wants a number, it checks whether the lock file exists. If it does, task2 is currently accessing the number in test.seq, so task1 has to wait. Once task2 has acquired the number, it overwrites the old number with the value increased by 1 and deletes the lock file. When task1 sees that ".lock" has disappeared, it first creates a ".lock" file of its own and then obtains the sequential number in the same way.
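As a rough illustration of steps 1 and 2, here is a minimal sketch that runs against a local temporary directory standing in for HDFS; it does not use any Hadoop API. The function name `next_sequence` and its retry parameters are made up for this example. One deliberate difference from the description: instead of checking for the lock file and then creating it in two steps, the sketch creates it exclusively in a single call, because two tasks could otherwise both see "no lock" before either one creates it.

```python
import os
import tempfile
import time


def next_sequence(seq_path, lock_path, retry_delay=0.01, max_retries=1000):
    """Return the next number stored in seq_path, guarded by a lock file.

    Hypothetical helper for illustration; a local filesystem stands in for HDFS.
    """
    for _ in range(max_retries):
        try:
            # O_CREAT | O_EXCL fails if the lock file already exists, so the
            # existence check and the creation happen as one atomic step.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            time.sleep(retry_delay)  # another task holds the lock; wait
            continue
        os.close(fd)
        try:
            with open(seq_path, "r") as f:
                current = int(f.read().strip() or "0")
            with open(seq_path, "w") as f:
                f.write(str(current + 1))  # overwrite the old number
            return current + 1
        finally:
            os.remove(lock_path)  # release the lock even if an error occurred
    raise TimeoutError("could not acquire " + lock_path)


# Set up test.seq with an initial value of 0, then draw three numbers.
workdir = tempfile.mkdtemp()
seq_file = os.path.join(workdir, "test.seq")
lock_file = os.path.join(workdir, ".lock")
with open(seq_file, "w") as f:
    f.write("0")

ids = [next_sequence(seq_file, lock_file) for _ in range(3)]
print(ids)
```

Every caller has to wait its turn for the single lock, which is the kind of central coordination the answer below warns about.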

However, I am not sure whether this approach is practical. Because I keep the .lock and test.seq files in HDFS, a change that task1 makes to test.seq might not be visible to task2 immediately. Other tasks get their information about data in HDFS through the namenode, so the datanode would first notify the namenode of the change, and only then would the other tasks learn of it. Is that correct?

Another idea is to run a helper ("trojan") program on the master, so that tasks obtain the sequential number from it via RPC. But how do I run such a program on the master?

Could anybody give me some advice? Thanks!

Solution

The main problem is that you chose Hadoop for its horizontal scalability.
Every form of horizontal scalability suffers greatly when you include something that has to be coordinated from a central point.

So you have two options:

  1. You accept the scaling limitations and go for one of the solutions proposed by others (like the ZooKeeper option).
  2. You choose a solution that does not require any form of central coordination, at the expense of some properties of the key.

I would try to see if the latter would be enough for your purposes. One such solution could be to take the id of the current tracker instance and append a local counter value. This way the value is unique, and it is sequential per tracker and over multiple runs of the same job, but not sequential across the job as a whole.
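The tracker-id-plus-counter idea could look roughly like the sketch below. It is only one possible reading of "append": the counter is packed into the low bits of an integer, with the task's id in the high bits. The class name `LocalIdGenerator`, the integer `task_id`, and the 40-bit counter width are all assumptions made for this example; in a real job the task id would have to be derived from whatever identifier Hadoop assigns to the running task.

```python
class LocalIdGenerator:
    """Generate ids as (task_id, local counter) with no shared state.

    Hypothetical helper: task_id is assumed to be a non-negative integer
    that is unique per task; counter_bits is an arbitrary choice.
    """

    def __init__(self, task_id, counter_bits=40):
        if task_id < 0:
            raise ValueError("task_id must be non-negative")
        self._task_id = task_id
        self._counter_bits = counter_bits
        self._next = 0  # local counter, private to this task

    def next_id(self):
        if self._next >= (1 << self._counter_bits):
            raise OverflowError("local counter exhausted for this task")
        # High bits identify the task, low bits hold the local counter.
        value = (self._task_id << self._counter_bits) | self._next
        self._next += 1
        return value


# Two tasks generate ids independently, without any coordination.
task_a = LocalIdGenerator(task_id=1)
task_b = LocalIdGenerator(task_id=2)
ids_a = [task_a.next_id() for _ in range(3)]
ids_b = [task_b.next_id() for _ in range(3)]
print(ids_a)
print(ids_b)
```

Ids from one task increase by 1, and ids from different tasks never collide as long as the task ids differ, but the combined set has gaps, which is the property of the key that is given up.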
