Remove single quotes from data using Pig
Question
This is what my data looks like:
(10, 'ACCOUNTING', 'NEW YORK')
(20, 'RESEARCH', 'DALLAS')
(30, 'SALES', 'CHICAGO')
(40, 'OPERATIONS', 'BOSTON')
I want to remove the characters ( , ) and ' from this data using a Pig script. I want my data to look like this:
10, ACCOUNTING, NEW YORK
20, RESEARCH, DALLAS
30, SALES, CHICAGO
40, OPERATIONS, BOSTON
I have been stuck on this for quite a long time. Please help. Thanks in advance.
Recommended answer
Can you try the REPLACE function with the regex below?
Explanation:
In a regex, a handful of characters have special meanings: \ ^ $ . | ? * + ( ) [ {
These special characters are called "metacharacters".
If you want to use any of these characters literally in your regex, you need to escape it with a single backslash. In our case, Pig uses the Java-based regex engine, so every special character needs to be escaped with a double backslash (as in Java, the string literal consumes one backslash, so \\ is what hands a single backslash to the regex engine).
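To see the escaping rule in isolation, here is a minimal sketch in plain Java rather than Pig, on the assumption that Pig's REPLACE hands the pattern to the same java.util.regex engine. The class and method names (EscapeDemo, compiles, stripOpenParen) are made up for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class EscapeDemo {
    // Returns true if the regex compiles, false if the engine rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // "\\(" in Java source is the two-character regex \( ,
    // which matches a literal opening parenthesis.
    static String stripOpenParen(String s) {
        return s.replaceAll("\\(", "");
    }

    public static void main(String[] args) {
        System.out.println(compiles("("));         // unescaped: an unclosed group
        System.out.println(compiles("\\("));       // escaped: a literal parenthesis
        System.out.println(stripOpenParen("(10"));
    }
}
```

The unescaped "(" is rejected because the engine reads it as the start of a group, while the escaped form is accepted and removes the character.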
To remove the ( , ) and ' (single quote) characters from your input:
1. Escape ( and ) with a double backslash: \\( and \\)
2. ' (single quote) is a special character in Pig (it delimits string literals), so it also needs escaping to remove its special meaning. A double backslash doesn't convince the Pig parser (you will get an error for a double backslash), which is why I used three backslashes for the single quote: \\\'
3. [] is a character class; it matches exactly one character out of the ones listed. Simply place the characters you want to match inside the square brackets, in our case [()']
4. The + symbol matches one or more occurrences of the preceding element, here the character class.
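The four points above can be checked outside Pig with a small Java sketch. It assumes that, after Pig unescapes its string literal, the regex engine receives [\'\(\)]+ ; the class name StripQuotesDemo and the method clean are hypothetical.

```java
class StripQuotesDemo {
    // Java source for the regex [\'\(\)]+ : a character class holding
    // the single quote and both parentheses, repeated one or more times.
    static final String PATTERN = "[\\'\\(\\)]+";

    // Deletes every run of quotes and parentheses from one input line.
    static String clean(String line) {
        return line.replaceAll(PATTERN, "");
    }

    public static void main(String[] args) {
        String[] lines = {
            "(10, 'ACCOUNTING', 'NEW YORK')",
            "(20, 'RESEARCH', 'DALLAS')"
        };
        for (String line : lines) {
            System.out.println(clean(line));
        }
    }
}
```

Inside a character class the parentheses would also match without a backslash, so the escapes there are harmless rather than required.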
Input:
(10, 'ACCOUNTING', 'NEW YORK')
(20, 'RESEARCH', 'DALLAS')
(30, 'SALES', 'CHICAGO')
(40, 'OPERATIONS', 'BOSTON')
PigScript1:
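-- Load each record as a single chararray (assumes the data contains no tabs)
A = LOAD 'input' AS (line:chararray);
-- Delete every run of single quotes and parentheses
B = FOREACH A GENERATE REPLACE(line,'[\\\'\\(\\)]+','');
STORE B INTO 'output';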
PigScript2:
-- Split each record on commas into three columns
A = LOAD 'input' USING PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray);
-- Clean each column with only the characters it can contain
B = FOREACH A GENERATE REPLACE(col1,'[\\(]+',''),REPLACE(col2,'[\\\']',''),REPLACE(col3,'[\\)\\\']+','');
STORE B INTO 'output1' USING PigStorage(',');
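The per-column idea of the second script can be mirrored in plain Java as a rough sketch. It assumes every record has exactly three comma-separated fields and that no field contains a comma; the class name PerColumnDemo and the method clean are invented for this example.

```java
class PerColumnDemo {
    // Splits a record on commas and cleans each field with its own pattern,
    // then joins the fields back with commas.
    static String clean(String line) {
        String[] cols = line.split(",");
        String col1 = cols[0].replaceAll("[\\(]+", "");     // opening parenthesis
        String col2 = cols[1].replaceAll("[\\']", "");      // single quotes
        String col3 = cols[2].replaceAll("[\\)\\']+", "");  // quotes and closing parenthesis
        return String.join(",", col1, col2, col3);
    }

    public static void main(String[] args) {
        System.out.println(clean("(30, 'SALES', 'CHICAGO')"));
    }
}
```

The space after each comma survives because it stays attached to the second and third fields when the record is split.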
Output (PigScript1 stores it in the file output/part-m-00000):
10, ACCOUNTING, NEW YORK
20, RESEARCH, DALLAS
30, SALES, CHICAGO
40, OPERATIONS, BOSTON