Result of transformation on an Empty RDD


Problem description

I have an RDD (combinerRDD) to which I applied the transformation below:

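    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.api.java.function.PairFunction;
    import scala.Tuple2;

    JavaPairRDD<String, Integer> counts = combinerRDD.mapToPair(
            new PairFunction<Tuple2<LongWritable, Text>, String, Integer>() {
                // How `message` is populated from the input tuple is not shown in the post.
                Message message;

                @Override
                public Tuple2<String, Integer> call(Tuple2<LongWritable, Text> tuple) throws Exception {
                    // The post mixed `count` and `xlhrCount`; a single local counter is used here.
                    int count = 0;
                    String filename = "New_File";

                    // Count the job steps named NEW_STEP.
                    for (JobStep js : message.getJobStep()) {
                        if (js.getStepName().equals(StepName.NEW_STEP)) {
                            count += 1;
                        }
                    }

                    return new Tuple2<String, Integer>(filename, count);
                }
            }).reduceByKey(new Function2<Integer, Integer, Integer>() {
                @Override
                public Integer call(Integer count1, Integer count2) throws Exception {
                    // Sum the per-record counts for each filename.
                    return count1 + count2;
                }
            });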

My question: when combinerRDD has some data in it, I get the right result. But when combinerRDD is empty, the result written to HDFS is only an empty _SUCCESS file. I was expecting two files for a transformation on an empty RDD, i.e. _SUCCESS and an empty part-00000 file. Am I right? How many output files should I get?

The reason I am asking is that I got different results on two clusters: the code run on cluster 1 produced only a _SUCCESS file, while cluster 2 produced _SUCCESS and an empty part-00000. I am confused now. Does the result depend on any cluster setup?

Note: I am doing a left join, newRDD.leftOuterJoin(combinerRDD), which gives me no result when combinerRDD's output has only _SUCCESS, even though newRDD contains values.

Solution

OK, so I found a solution. I am using Spark 1.3.0, which has the following issue: a left outer join with an emptyRDD gives an empty result.

https://issues.apache.org/jira/browse/SPARK-9236
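For reference, a left outer join is expected to keep every record from the left side, pairing it with an absent value when the right side has no match. The following is a minimal plain-Java sketch of those expected semantics, not Spark code: `LeftOuterJoinSketch` and its `leftOuterJoin` helper are hypothetical names, and maps are used as a simplification (one value per key, unlike an RDD).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class LeftOuterJoinSketch {
    // Hypothetical helper modelling the expected left-outer-join result:
    // every left entry survives, and a missing right value becomes Optional.empty().
    static <K, V, W> List<String> leftOuterJoin(Map<K, V> left, Map<K, W> right) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<K, V> e : left.entrySet()) {
            Optional<W> match = Optional.ofNullable(right.get(e.getKey()));
            rows.add(e.getKey() + " -> (" + e.getValue() + ", " + match + ")");
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Integer> newData = new LinkedHashMap<>();
        newData.put("New_File", 3);
        Map<String, Integer> combinerData = new LinkedHashMap<>(); // empty right side

        // The left record is kept even though the right side is empty.
        System.out.println(leftOuterJoin(newData, combinerData));
        // prints [New_File -> (3, Optional.empty)]
    }
}
```

With an empty right side the result should still contain one row per left record, which is what the join in the question failed to produce.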

I was creating the empty pair RDD like this:

    JavaRDD<Tuple2<LongWritable, Text>> emptyRDD = context.emptyRDD();
    myRDD = JavaPairRDD.fromJavaRDD(emptyRDD);

I now use this instead:

    // Build the RDD from an empty list instead of calling emptyRDD().
    List<Tuple2<LongWritable, Text>> data = Arrays.asList();
    JavaRDD<Tuple2<LongWritable, Text>> emptyRDD = context.parallelize(data);
    myRDD = JavaPairRDD.fromJavaRDD(emptyRDD);

It works now, i.e. the result is no longer empty. The fix is available in versions 1.3.2, 1.4.2 and 1.5.0 (see the link above).
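On the original question of how many files to expect, a likely explanation is that a text-file save writes one part file per partition plus the _SUCCESS marker, so an RDD with zero partitions leaves only _SUCCESS. An RDD from emptyRDD() has no partitions, whereas parallelize over an empty list still creates (empty) partitions. The toy model below illustrates that rule in plain Java; it is not Spark code, and `OutputFilesSketch` and `expectedOutputFiles` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class OutputFilesSketch {
    // Hypothetical helper: file names expected in the output directory
    // for an RDD with the given number of partitions (assumed rule:
    // one part-NNNNN file per partition, plus the _SUCCESS marker).
    static List<String> expectedOutputFiles(int numPartitions) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            files.add(String.format("part-%05d", i));
        }
        files.add("_SUCCESS");
        return files;
    }

    public static void main(String[] args) {
        System.out.println(expectedOutputFiles(0)); // prints [_SUCCESS]
        System.out.println(expectedOutputFiles(1)); // prints [part-00000, _SUCCESS]
    }
}
```

Under that assumption, the two listings correspond to the outputs seen on cluster 1 and cluster 2, which suggests the RDDs being saved had different partition counts on the two clusters.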
