Pig - Remove embedded newlines and commas in gzip files

Question

I have a gzip file whose data fields are separated by commas. I currently load the file with PigStorage, as shown below:

A = load 'myfile.gz' USING PigStorage(',') AS (id,date,text);

The data in the gzip file contains embedded newlines and commas. These characters occur in all three fields (id, date, and text), and they always appear inside double quotes ("").

I would like to replace or remove these characters using Pig before doing any further processing.

I think I first need to find the double-quoted sections, then search the string inside each pair of quotes for commas and newline characters, and either replace them with a space or remove them.

How can I achieve this in Pig?
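For reference, the quote-scanning idea described in the question can be sketched outside Pig. The following is only an illustration in Python, not Pig code; the function name `clean_quoted` and its `replacement` parameter are hypothetical. It assumes quotes are balanced and that a literal quote inside a field is written as a doubled `""`.

```python
def clean_quoted(text: str, replacement: str = " ") -> str:
    """Replace commas and line breaks that occur inside double quotes."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # A doubled "" toggles the flag twice, so the state is preserved.
            in_quotes = not in_quotes
            out.append(ch)
        elif in_quotes and ch in ",\r\n":
            # Delimiter or line break inside a quoted field: neutralise it.
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out)


sample = '1,"2014-01-01","hello, wor\nld"\n2,"a","b"\n'
cleaned = clean_quoted(sample)
```

Commas and newlines outside the quotes are left alone, so the cleaned text keeps one record per line and could then be split on commas.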

Recommended answer

Try this:

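REGISTER piggybank.jar;
-- The no-argument CSVExcelStorage() defaults to NO_MULTILINE; YES_MULTILINE lets a quoted field span line breaks.
A = LOAD 'myfile.gz' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE') AS (id:chararray, date:chararray, text:chararray);
-- Strip the newlines and commas that remain inside each parsed field.
B = FOREACH A GENERATE REPLACE(REPLACE(id,'\n',''),',','') AS id, REPLACE(REPLACE(date,'\n',''),',','') AS date, REPLACE(REPLACE(text,'\n',''),',','') AS text;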

Either org.apache.pig.piggybank.storage.CSVExcelStorage or org.apache.pig.piggybank.storage.CSVLoader can be used as the loader.
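To make the intent of the two steps concrete (parse the file with a quote-aware CSV reader, then strip newlines and commas from each field), here is a rough Python equivalent. It is a sketch for checking expectations on a small sample, not a replacement for the Pig script; the function name `clean_rows` is hypothetical.

```python
import csv
import io


def clean_rows(stream):
    """Yield CSV rows with newlines and commas removed from every field."""
    # csv.reader understands quoted fields, including ones that span lines.
    for row in csv.reader(stream):
        yield [field.replace("\n", "").replace(",", "") for field in row]


sample = '1,"2014-01-01","hello, wor\nld"\n2,"2014-01-02","plain"\n'
rows = list(clean_rows(io.StringIO(sample)))

# For a real file, a text-mode gzip handle could be passed instead, e.g.
# gzip.open('myfile.gz', 'rt', newline='') from the standard gzip module.
```

Each yielded row corresponds to one tuple of relation B in the Pig script.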

Refer to the API links below for details:

  1. http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
  2. http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html