Too many filter matches in Pig
Question
I have a list of filter keywords (about 1000 of them), and I need to filter one field of a relation in Pig using this list.
Initially, I declared these keywords like this:
%declare p1 '.keyword1.';
....
%declare p1000 '.keyword1000.';
I then filter like this:
Filtered = FILTER SRC BY (not $0 matches '$p1') and (not $0 matches '$p2') and ...... (not $0 matches '$p1000');
DUMP Filtered;
Assume that my source relation is SRC and that the filter applies to the first field, i.e. $0.
If I reduce the number of filters to 100-200, it works fine. But when the number of filters increases to 1000, it doesn't work.
Can somebody suggest a workaround to get the right results?
Thanks in advance.
Recommended answer
You can write a simple filter UDF that performs all the checks, something like:
package myudfs;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

public class MYFILTER extends FilterFunc {
    // Keywords to exclude, shared by every instance of the UDF.
    static List<String> filterList = new ArrayList<String>();

    static {
        // load all filters into filterList
    }

    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        try {
            String str = (String) input.get(0);
            // Keep the row only if the field is not in the keyword list (exact match).
            return !filterList.contains(str);
        } catch (Exception e) {
            throw new IOException("Caught exception processing input row ", e);
        }
    }
}
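Note that `filterList.contains(str)` in the UDF above tests whether the whole field equals one of the keywords, whereas the `matches` patterns in the question look for a keyword inside the field. If the intent is to drop rows whose field contains any keyword, the check could be sketched roughly as below. This is only an illustration in plain Java with no Pig dependency: the class name `KeywordMatcher` and the method `keep` are hypothetical, keywords are assumed to be literal strings rather than regular expressions, and a null field is assumed to be dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Pig): decides whether a field value
// should be kept, i.e. whether it contains none of the keywords.
final class KeywordMatcher {
    private final Pattern anyKeyword;

    KeywordMatcher(List<String> keywords) {
        List<String> quoted = new ArrayList<>();
        for (String keyword : keywords) {
            // Assumption: keywords are literals, so regex metacharacters are escaped.
            quoted.add(Pattern.quote(keyword));
        }
        // With no keywords nothing should be rejected; "(?!)" never matches.
        String regex = quoted.isEmpty() ? "(?!)" : String.join("|", quoted);
        // One compiled alternation replaces the long chain of "not ... matches" tests.
        this.anyKeyword = Pattern.compile(regex);
    }

    // Returns true when the value contains none of the keywords.
    boolean keep(String value) {
        if (value == null) {
            return false; // assumption: rows with a null field are dropped
        }
        return !anyKeyword.matcher(value).find();
    }

    public static void main(String[] args) {
        KeywordMatcher matcher = new KeywordMatcher(Arrays.asList("keyword1", "keyword2"));
        System.out.println(matcher.keep("some clean text"));
        System.out.println(matcher.keep("text with keyword2 inside"));
    }
}
```

Inside the UDF, `exec` could return the result of `keep(str)` instead of `!filterList.contains(str)`. In the Pig script, the compiled UDF would then typically be made available with `REGISTER` (the jar name is up to you) and used as `Filtered = FILTER SRC BY myudfs.MYFILTER($0);`.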