Reading a file in JavaScript via Apache Pig UDF


Question

I have some (very simplified) Node.js code here:

var fs = require('fs');

// Blocking read: the whole file is in memory before the loop runs.
var derpfile = String(fs.readFileSync('./derp.txt', 'utf-8'));
var derps    = derpfile.split('\n');
for (var i = 0; i < derps.length; ++i) {
    // do something with my derps here
}

The problem is, I cannot use Node in Pig UDFs (that I am aware of; if I can, please let me know!). When I look up 'file io' in JavaScript, all the tutorials I find are about the browser sandbox. I need to read a file off the filesystem, like hdfs:///foo/bar/baz/jane/derps.txt, which I cannot guarantee will be in the CWD, but which I will have permission to access. All these tutorials also seem to involve asynchronous reads. I really need a blocking call here, as the Pig job cannot begin until this file is read. There are also lots of explanations of how to pull down a URL from another site, none of which helps here.

This is incredibly frustrating, as using Java for this task is horrific overkill, and JavaScript is really The Right Tool For The Job (well, okay, Perl is, but I don't get to choose that…), and I'm hamstrung by something as simple as basic file IO. :(

Answer

I can't speak to your use of JavaScript, since I've never written a UDF with it, but in general file access is not done inside a UDF, especially if you are trying to access something on HDFS. Files on HDFS are accessed via the NameNode, so once you are executing on a DataNode, you are out of luck. You need to place the files in the distributed cache.

Pig can do this for you with a JOIN. If the file fits in memory, you can do a replicated join, which will leverage the distributed cache. I would use Pig to load the file into a relation, use GROUP relation ALL to get it into a single bag, and then CROSS that bag with every record in your relation of interest. You can then pass this bag to any UDFs you like. Something like:

a = LOAD 'a' AS ...;
f = LOAD '/the/file/you/want' AS ...;

/* Put everything into a single bag */
f_bag = FOREACH (GROUP f ALL) GENERATE f;
/* Now you have a relation with one record;
   that record has one field: the bag, f */
a2 = CROSS a, f_bag;
/* Now you have duplicated a and appended
   the bag f to each record */

b = FOREACH a2 GENERATE yourUDF(field1, field2, f);
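
If the small file shares a join key with your main relation, the replicated join mentioned above is the more direct route. A minimal sketch, assuming a hypothetical key field k present in both relations:

a = LOAD 'a' AS (k, field1, field2);
f = LOAD '/the/file/you/want' AS (k, val);

/* 'replicated' ships f (listed last) through the
   distributed cache and holds it in memory on each
   map task, avoiding a reduce phase entirely */
j = JOIN a BY k, f BY k USING 'replicated';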

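As an aside on the JavaScript angle: Pig can register UDFs written in JavaScript via its embedded Rhino engine (no Node.js, so no require('fs'), but the language itself is available), so yourUDF above need not be Java. The registration syntax and the outputSchema convention below follow Pig's documented JavaScript UDF support; how Rhino presents the bag argument is an assumption you would need to verify on your cluster:

/* derps.js -- registered from the Pig script with:
     REGISTER 'derps.js' USING javascript AS myfuncs;
     b = FOREACH a2 GENERATE myfuncs.yourUDF(field1, field2, f);
*/

// Pig reads the UDF's return schema from this property.
yourUDF.outputSchema = "result:chararray";

function yourUDF(field1, field2, derpBag) {
    // Assumption: the bag arrives as an indexable collection of
    // tuples; inspect what Rhino actually hands you before
    // relying on this shape.
    var out = [];
    for (var i = 0; i < derpBag.length; ++i) {
        out.push(derpBag[i]);  // do something with each derp here
    }
    return out.join(',');
}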
