Remove single quotes from data using Pig


Question

This is what my data looks like:

(10, 'ACCOUNTING', 'NEW YORK')
(20, 'RESEARCH', 'DALLAS')
(30, 'SALES', 'CHICAGO')
(40, 'OPERATIONS', 'BOSTON')

I want to remove (, ) and ' from this data using a Pig script. I want my data to look like this:

10, ACCOUNTING, NEW YORK
20, RESEARCH, DALLAS
30, SALES, CHICAGO
40, OPERATIONS, BOSTON

I have been stuck on this for quite a long time. Please help. Thanks in advance.

Recommended answer

Can you try the REPLACE function with the regex below?

Explanation:
In a regex, a few characters have special meanings: \ ^ $ . | ? * + ( ) [ {. These special characters are called "metacharacters". If you want to match any of them literally as part of your regex, you need to escape it with a single backslash. In our case Pig uses a Java-based regex engine, so every special character needs to be escaped with a double backslash in the script (as in a Java string literal, \\ is how a single backslash is written so that it reaches the regex engine).
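The escaping rule can be checked directly against the Java regex engine. The following is a minimal sketch in plain Java rather than Pig; the class and method names (`EscapeDemo`, `compiles`, `dropOpenParens`) are made up for illustration and are not part of Pig.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper class, used only to illustrate metacharacter escaping.
class EscapeDemo {
    // Returns true when the given regex source compiles without error.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Removes every literal '(' from the input.
    static String dropOpenParens(String input) {
        // The Java source text "\\(" is the two-character regex \( ,
        // which matches a literal opening parenthesis.
        return input.replaceAll("\\(", "");
    }

    public static void main(String[] args) {
        System.out.println(compiles("("));    // false: a bare ( opens a group that is never closed
        System.out.println(compiles("\\("));  // true: the escaped form is a literal (
        System.out.println(dropOpenParens("(10, 'ACCOUNTING'")); // 10, 'ACCOUNTING'
    }
}
```

The double backslash appears only in the source text of the string literal; the regex engine itself receives a single backslash followed by the parenthesis.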

To remove the (, ) and ' (single quote) characters from your input:
1. Escape ( and ) with a double backslash: \\( and \\).
2. ' (single quote) is a special character in Pig (it is the default string literal delimiter), so it also needs a double backslash to remove its special meaning. A double backslash alone does not convince the Pig parser (you will get an error), which is why I used three backslashes, \\\', for the single quote.
3. [] is a character class; it matches any one of the characters listed inside it. Simply place the characters you want to match inside the square brackets, in our case [()'].
4. The + symbol matches one or more occurrences of what precedes it, here the character class.
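The character class built up in the steps above can be tried against the sample rows in plain Java. This is a sketch under the assumption that Pig hands the pattern to the Java regex engine unchanged once the script's string literal has been unescaped; `CharClassDemo` and `strip` are hypothetical names.

```java
// Hypothetical helper class, used only to try the regex outside Pig.
class CharClassDemo {
    // Deletes every run of ( ) or ' characters from the line.
    static String strip(String line) {
        // Regex seen by the engine: [\'\(\)]+
        // Java string literals are delimited by double quotes, so the single
        // quote needs only the regex-level backslash here, not the third
        // backslash that the Pig script uses for its own quoting.
        return line.replaceAll("[\\'\\(\\)]+", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("(10, 'ACCOUNTING', 'NEW YORK')")); // 10, ACCOUNTING, NEW YORK
        System.out.println(strip("(40, 'OPERATIONS', 'BOSTON')"));   // 40, OPERATIONS, BOSTON
    }
}
```

Inside a character class the parentheses are already literal, so escaping them is harmless rather than required.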

Input:

(10, 'ACCOUNTING', 'NEW YORK')
(20, 'RESEARCH', 'DALLAS')
(30, 'SALES', 'CHICAGO')
(40, 'OPERATIONS', 'BOSTON')

PigScript1:

-- Load each row as a single string
A = LOAD 'input' AS (line:chararray);
-- Delete every run of ' ( ) characters from the whole line
B = FOREACH A GENERATE REPLACE(line,'[\\\'\\(\\)]+','');
STORE B INTO 'output';

PigScript2:

-- Split each row into three columns on the comma
A = LOAD 'input' USING PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray);
-- Clean each column: ( from col1, ' from col2, ) and ' from col3
B = FOREACH A GENERATE REPLACE(col1,'[\\(]+',''),REPLACE(col2,'[\\\']',''),REPLACE(col3,'[\\)\\\']+','');
STORE B INTO 'output1' USING PigStorage(',');
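The per-column approach of the second script can be imitated in plain Java to see what each REPLACE call contributes. This is only an illustration of the regexes, not of how Pig executes the script; `PerColumnDemo` and `clean` are hypothetical names, and the sketch assumes every row has exactly three comma-separated fields.

```java
// Hypothetical helper class mirroring the three per-column replacements.
class PerColumnDemo {
    static String clean(String line) {
        // Split on the comma, as a comma delimiter would; spaces after commas stay in the fields.
        String[] cols = line.split(",", -1);
        if (cols.length != 3) {
            throw new IllegalArgumentException("expected 3 columns, got " + cols.length);
        }
        String col1 = cols[0].replaceAll("[\\(]+", "");     // drop the opening parenthesis
        String col2 = cols[1].replaceAll("[\\']", "");      // drop the single quotes
        String col3 = cols[2].replaceAll("[\\)\\']+", "");  // drop quotes and the closing parenthesis
        // Join with the same delimiter used when storing.
        return String.join(",", col1, col2, col3);
    }

    public static void main(String[] args) {
        System.out.println(clean("(20, 'RESEARCH', 'DALLAS')")); // 20, RESEARCH, DALLAS
    }
}
```

The space after each comma in the result comes from the input fields themselves, since none of the three patterns touches whitespace.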

Output: stored in the file output/part-m-00000 (PigScript2 stores into the output1 directory instead)

10, ACCOUNTING, NEW YORK
20, RESEARCH, DALLAS
30, SALES, CHICAGO
40, OPERATIONS, BOSTON
