Pig Latin Word Count

Problem description

I am trying to count, in a Pig script, the number of lines that contain each of the following words: 'jack', 'hack', 'mat', 'throttle'. I am using the Cloudera QuickStart VM.

The input file is:

09-jack-17,5:00PM;#slowmotion,Tribune Logic hack: how is life in temrs of money Creative hack

14-June-18,7:15PM;#Indiacalling,Horton-NJ Strategic/Halloween One World at Application Deployment

12-jack-16,jfh:er;#temporary, accomodation, osteoporosis, juxtapose, don't misinterpret this awaiting throttle jack

The output should be:

hack 2
jack 2
throttle 1
mat 0

I am unable to extract those words and calculate their counts. What should I do? I tried the following script, which was given by inquisitive_mind:

A = LOAD 'Input.txt'AS(line: chararray);
SPLIT A INTO M IF line matches'hackathon,N IF line matches'dec', O IF line matches'chicago',P IF line matches'java';
M1 = GROUP M ALL;
M2 = FOR EACH M1 GENERATE COUNT(M);
M3 = FOREACH M2 GENERATE CONCAT('hackathon',(chararray)M2.$0);
N1 = GROUP N ALL;
N2 = FOREACH N1 GENERATE COUNT(N);
N3 = FOREACHN2 GENERATE CONCAT('dec',(chararray)N2.$0);
O1 = GROUP O ALL;
O2 = FOREACH O1 GENERATE COUNT(O);
O3 = FOR EACH O2 GENERATE CONCAT('chicago',(chararray)O2.$0);
P1 = GROUP P ALL;
P2 = FOR EACH P1 GENERATE COUNT(P);
P3 = FOREACH P2 GENERATE CONCAT('java',(chararray)P2.$0);
DUMP M3;
DUMP N3;
DUMP O3;
DUMP P3;

But when I run it in mapreduce or local mode, I get the following error:

2016-10-11 09:43:44,084 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 19, column 0. Encountered: <EOF> after : ""
Details at logfile: /home/cloudera/pig_1476204218406.log

This is the log file:

ERROR 1000: Error during parsing. Lexical error at line 19, column 0.  Encountered: <EOF> after : ""

org.apache.pig.tools.pigscript.parser.TokenMgrError: Lexical error at line 19, column 0.  Encountered: <EOF> after : ""
    at org.apache.pig.tools.pigscript.parser.PigScriptParserTokenManager.getNextToken(PigScriptParserTokenManager.java:3326)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.jj_ntk(PigScriptParser.java:1379)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:106)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
    at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
    at org.apache.pig.Main.run(Main.java:613)
    at org.apache.pig.Main.main(Main.java:158)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)

Recommended answer

If the words are fixed and few in number, you can use SPLIT to route the lines into one relation per word and then count the lines in each relation.

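A = LOAD 'data.txt' AS (line:chararray);
-- matches tests the whole line against the regex, so wrap each word in '.*'
-- to find it anywhere in the line. A line can land in more than one relation.
SPLIT A INTO M IF line matches '.*jack.*', N IF line matches '.*hack.*', O IF line matches '.*throttle.*', P IF line matches '.*mat.*';

-- Count the lines per word. A word that matches no line produces no output row,
-- because GROUP ... ALL on an empty relation yields nothing.
M1 = GROUP M ALL;
M2 = FOREACH M1 GENERATE COUNT(M);
M3 = FOREACH M2 GENERATE CONCAT('jack ', (chararray)$0);

N1 = GROUP N ALL;
N2 = FOREACH N1 GENERATE COUNT(N);
N3 = FOREACH N2 GENERATE CONCAT('hack ', (chararray)$0);

O1 = GROUP O ALL;
O2 = FOREACH O1 GENERATE COUNT(O);
O3 = FOREACH O2 GENERATE CONCAT('throttle ', (chararray)$0);

P1 = GROUP P ALL;
P2 = FOREACH P1 GENERATE COUNT(P);
P3 = FOREACH P2 GENERATE CONCAT('mat ', (chararray)$0);

DUMP M3;
DUMP N3;
DUMP O3;
DUMP P3;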
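
The question also expects a row such as 'mat 0' for a word that appears in no line, which the SPLIT approach may not give you. One possible variation, shown here only as a sketch, is to turn each word into a 0/1 flag per line and sum the flags in a single pass. The use of TextLoader, the aliases B, C and D, and the field names are assumptions made for illustration and are not part of the original answer. Note that a pattern like '.*mat.*' also matches longer words that merely contain 'mat', and that this counts matching lines, not occurrences within a line.

```pig
-- Load each input line as a single chararray field.
A = LOAD 'data.txt' USING TextLoader() AS (line:chararray);

-- One 0/1 flag per word and per line; matches must cover the whole line,
-- hence the '.*' on both sides of each word.
B = FOREACH A GENERATE
        (line matches '.*jack.*'     ? 1 : 0) AS jack,
        (line matches '.*hack.*'     ? 1 : 0) AS hack,
        (line matches '.*throttle.*' ? 1 : 0) AS throttle,
        (line matches '.*mat.*'      ? 1 : 0) AS mat;

-- Collapse everything into one group and sum the flags,
-- so a word that never matches still appears with a sum of 0.
C = GROUP B ALL;
D = FOREACH C GENERATE
        SUM(B.jack)     AS jack,
        SUM(B.hack)     AS hack,
        SUM(B.throttle) AS throttle,
        SUM(B.mat)      AS mat;

DUMP D;
```

This prints a single tuple with the four counts in the order jack, hack, throttle, mat, instead of one labelled row per word. If the labelled format matters, the CONCAT step from the answer above could be applied to each field of D.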
