使用 Pig 从数据中删除单引号 [英] Remove single quotes from data using Pig

查看:30
本文介绍了使用 Pig 从数据中删除单引号的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

这就是我的数据的样子

(10, 'ACCOUNTING', 'NEW YORK')
(20, 'RESEARCH', 'DALLAS')
(30, 'SALES', 'CHICAGO')
(40, 'OPERATIONS', 'BOSTON')

我想使用 Pig Script 从此数据中删除 (, )' .我希望我的数据看起来像这样-

I want to remove (, ) and ' from this data using Pig Script. I want my data to look like this-

10, ACCOUNTING, NEW YORK
20, RESEARCH, DALLAS
30, SALES, CHICAGO
40, OPERATIONS, BOSTON

我很长一段时间都被困在这个问题上.请帮忙.提前致谢.

I am stuck on this from quite long time. Please help. Thanks in advance.

推荐答案

你能用下面的正则表达式试试 REPLACE 函数吗?

Can you try REPLACE function with the below regex?

解释:
在正则表达式中,很少有字符具有特殊含义 \ ^ $ ., |?* + ( ) [ {.这些特殊字符称为metacharacters".如果您想将这些字符中的任何一个用作正则表达式的一部分,那么您需要用一个反斜杠将它们转义.在我们的例子中,Pig 使用基于 Java 的正则表达式引擎,所以所有 特殊字符都需要用双反斜杠转义(Java 使用 \\ 双反斜杠来区分特殊字符).

Explanation:
In Regex there are few characters have special meanings \ ^ $ . , | ? * + ( ) [ {. These special characters are called as "metacharacters". If you want to use any of these characters as part of your regex, then you need to escape them with a single backslash. In our case Pig uses Java based regex engine so all the specials characters needs be escaped with double backslash (Java uses \\ double backslash to differentiate the special characters ).

从您的输入中删除 '(' ')' 和 '(单引号) 字符.
1. 只需将 () 替换为双反斜杠 \\(\\).
2. '(单引号)是Pig中的特殊字符(默认字符串字面量),所以这也需要双反斜杠去掉特殊含义,但是双反斜杠没有't 说服猪解析器(你会得到双反斜杠的错误),这就是我使用三个反斜杠作为单引号\\\'的原因 去掉特殊含义.
3. [] 是字符类,这将只匹配几个字符中的一个.只需将要匹配的字符放在方括号内即可.在我们的例子中是 [()'].
4、+符号用于匹配一个或多个字符.

To remove '(' ')' and '(single quote) characters from your input.
1. Just Replace () with double backslash \\(\\).
2. '(single quote) is special character in Pig(default string literal), so this also required double backslash to remove the special meaning but double backslash doesn't convince pig parser(you will get error for double backslash) that is the reason i used three backslash for single quote \\\' to remove the special meaning.
3. [] is character class, this will match only one out of several characters. Simply place the characters inside the square bracket that you want to match ie. in our case its [()'].
4. + symbol is for matching one or more characters.

输入

(10, 'ACCOUNTING', 'NEW YORK')
(20, 'RESEARCH', 'DALLAS')
(30, 'SALES', 'CHICAGO')
(40, 'OPERATIONS', 'BOSTON')

PigScript1:

A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE REPLACE(line,'[\\\'\\(\\)]+','');
STORE B INTO 'output';

Pigscript2:

A = LOAD 'input' USING PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray);
B = FOREACH A GENERATE REPLACE(col1,'[\\(]+',''),REPLACE(col2,'[\\\']',''),REPLACE(col3,'[\\)\\\']+','');
STORE B into 'output1' USING PigStorage(',');

输出:将存储在 output/part-m-00000 文件中

10, ACCOUNTING, NEW YORK
20, RESEARCH, DALLAS
30, SALES, CHICAGO
40, OPERATIONS, BOSTON

这篇关于使用 Pig 从数据中删除单引号的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆