Reading a file in JavaScript via Apache Pig UDF


Problem description



I have some (very simplified) nodejs code here:

var fs = require('fs');

var derpfile = String(fs.readFileSync( './derp.txt', 'utf-8' ));
var derps    = derpfile.split( '\n' );
for (var i = 0; i < derps.length; ++i) {
    // do something with my derps here
}

The problem is, I cannot use Node in Pig UDFs (that I am aware of; if I can do this, please let me know!). When I look at 'file io' in JavaScript, all the tutorials I see are about the browser sandbox. I need to read a file off the filesystem, like hdfs:///foo/bar/baz/jane/derps.txt, which I cannot guarantee will be in the CWD, but which I will have permissions to get at. All these tutorials also seem to involve asynchronous reads. I really need a blocking call here, as the Pig job cannot begin until this file is read. There are also lots of explanations of how to pull down a URL from another site.

This is incredibly frustrating, as using Java for this task is horrific overkill, and JavaScript is really The Right Tool For The Job (well, okay, Perl is, but I don't get to choose that…), and I'm hamstrung by something as simple as basic file IO. :(

Solution

I can't speak to your use of JavaScript, since I've never written a UDF with it, but in general file access is not done inside of a UDF, especially if you are trying to access something on HDFS. Files on HDFS are accessed via the NameNode, so once you are executing on a DataNode, you are out of luck. You need to place the files in the distributed cache.

Pig can do this for you by doing a JOIN. If the file fits in memory, you can do a replicated join, which will leverage the distributed cache. I would use Pig to load the file into a relation, use GROUP relation ALL to get it into a single bag, and then CROSS this bag with all records in your relation of interest. Then you can pass this bag to any UDFs you like. Something like:

a = LOAD 'a' AS ...;
f = LOAD '/the/file/you/want' AS ...;

/* Put everything into a single bag */
f_bag = FOREACH (GROUP f ALL) GENERATE f;
/* Now you have a relation with one record;
   that record has one field: the bag, f */
a2 = CROSS a, f_bag;
/* Now you have duplicated a and appended
   the bag f to each record */

b = FOREACH a2 GENERATE yourUDF(field1, field2, f);
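The replicated join mentioned above is the other route, and it is not shown in the answer. A minimal sketch, assuming both relations actually share a join key (the key and field names here are hypothetical; the script above does not define them):

```pig
a = LOAD 'a' AS (key:chararray, field1:chararray, field2:chararray);
f = LOAD '/the/file/you/want' AS (key:chararray, val:chararray);

/* 'replicated' tells Pig to load f into memory on every task via the
   distributed cache; the replicated relation must be listed last and
   must be small enough to fit in memory */
j = JOIN a BY key, f BY key USING 'replicated';
```

This avoids the GROUP ALL/CROSS step entirely, but only applies when a per-record key lookup is what you want rather than handing the whole file to the UDF as a bag.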

