Using Hadoop to run a jar file - Python


Problem Description

I have an existing Python program that has a sequence of operations that goes something like this:

  1. Connect to MySQL DB and retrieve files into local FS.
  2. Run a program X that operates on these files, something like: java -jar X.jar <folder_name>. This opens every file in the folder, performs some operation on each, and writes an equal number of transformed files into another folder.
  3. Then, run a program Y that operates on these files as: java -jar Y.jar <folder_name>. This creates multiple files of one line each, which are then merged into a single file using a merge function.
  4. This merged file is then the input for some further operations and analyses that are not really important for this question. (A rough sketch of such a driver follows this list.)
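
For illustration, a minimal sketch of a driver along these lines; the MySQL fetch is omitted and the folder names and output locations are placeholders, not the actual program:

    import glob
    import subprocess

    INPUT_DIR = "input_files"    # step 1: files retrieved from MySQL (fetch omitted here)
    X_OUT_DIR = "x_output"       # placeholder for the folder X.jar writes into
    Y_OUT_DIR = "y_output"       # placeholder for the folder Y.jar writes into

    subprocess.check_call(["java", "-jar", "X.jar", INPUT_DIR])   # step 2
    subprocess.check_call(["java", "-jar", "Y.jar", X_OUT_DIR])   # step 3

    # merge the one-line files produced by Y.jar into a single file
    with open("merged.txt", "w") as merged:
        for path in sorted(glob.glob(Y_OUT_DIR + "/*")):
            with open(path) as f:
                merged.write(f.read().rstrip("\n") + "\n")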

I'd like to make use of Hadoop to speed up operation Y, as it takes very long to complete when there are a) a larger number of files or b) large input files to be operated upon.

What I'd like to know is if it is a good idea to go with Hadoop in the first place to do something of this nature and if threads would make more sense in this case. Bear in mind that X and Y are things that cannot be replaced or changed in any way.

I came up with this idea:

  1. After step 2 above, copy the files into HDFS, run the jar file from within a mapper (at which point the results are written back into HDFS), then copy the results back out to the local file system and send them on for further processing.
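
A rough sketch of that copy-in / run / copy-out flow, driving the HDFS shell from Python (all paths here are hypothetical):

    import subprocess

    LOCAL_IN = "x_output"            # files produced by X.jar in step 2 (placeholder)
    HDFS_IN = "/user/me/job_in"      # hypothetical HDFS input location
    HDFS_OUT = "/user/me/job_out"    # hypothetical HDFS output location

    # copy the step-2 output into HDFS
    subprocess.check_call(["hdfs", "dfs", "-put", LOCAL_IN, HDFS_IN])

    # ... launch the MapReduce job that wraps Y.jar here ...

    # copy the results back out to the local file system for further processing
    subprocess.check_call(["hdfs", "dfs", "-get", HDFS_OUT, "local_results"])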

I would like to know if this makes sense at all, and especially, given that the mapper expects a (key, value) pair, would I even have a k-v pair in this scenario?

I know this sounds like a project, and that's because it is, but I'm not looking for code, just some guidance about whether or not this would even work and, if it would, what the right way of going about it is if my proposed solution is not accurate (enough).

Thank you!

Solution

You absolutely can use the Hadoop MapReduce framework to complete your work, but the answer to whether it's a good idea could be "it depends". It depends on the number and sizes of the files you want to process.

Keep in mind that HDFS is not very good at dealing with small files; it can be a disaster for the namenode if you have a large number (say 10 million) of small files (less than 1 KB each). On the other hand, if the files are very large but only a few of them need to be processed, it is not great to just wrap step #2 directly in a mapper, because the job won't be spread widely and evenly. (In this situation I guess the key-value can only be "file no. - file content" or "file name - file content", given that you mentioned X can't be changed in any way; actually, "line no. - line" would be more suitable.)
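
One common way to keep "file name - file content" as the key-value while sidestepping the small-files issue is to pack the files into a single text file, one record per line, before uploading to HDFS. A small sketch; the folder name and the escaping scheme are assumptions:

    import glob
    import os

    # pack every small file into one "file name<TAB>file content" record per line
    with open("packed_input.txt", "w") as out:
        for path in sorted(glob.glob("x_output/*")):        # assumed output folder of step 2
            with open(path) as f:
                content = f.read().replace("\n", "\\n")     # keep each record on one line
            out.write("%s\t%s\n" % (os.path.basename(path), content))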

BTW, there are two ways to utilize the Hadoop MapReduce framework. One way is to write the mapper/reducer in Java, compile them into a jar, and then run the MapReduce job with hadoop jar your_job.jar. The other way is Streaming; you can write the mapper/reducer in Python this way.
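
A minimal sketch of the Streaming route, assuming each input record is one "file name<TAB>file content" line (as packed above), that Y.jar is shipped to the workers with -files, and that Y.jar writes its one-line results into the folder it is given; none of this is the asker's actual setup:

    #!/usr/bin/env python
    # mapper.py -- hypothetical Hadoop Streaming mapper wrapping Y.jar
    import os
    import subprocess
    import sys
    import tempfile

    for record in sys.stdin:
        record = record.rstrip("\n")
        if not record:
            continue
        name, content = record.split("\t", 1)

        work_dir = tempfile.mkdtemp()                      # local folder for Y.jar to read
        with open(os.path.join(work_dir, name), "w") as f:
            f.write(content.replace("\\n", "\n"))          # undo the one-line packing

        subprocess.check_call(["java", "-jar", "Y.jar", work_dir])

        # emit whatever one-line output files Y.jar produced, keyed by the input name
        for out_name in os.listdir(work_dir):
            if out_name == name:
                continue
            with open(os.path.join(work_dir, out_name)) as f:
                print("%s\t%s" % (name, f.read().strip()))

Such a job could then be launched with something along the lines of hadoop jar hadoop-streaming.jar -files mapper.py,Y.jar -input packed_input.txt -output y_results -mapper mapper.py -numReduceTasks 0, where the exact path to the streaming jar depends on the installation.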
