Pentaho Kettle split CSV into multiple records


Question

I'm new to Kettle, but I've been getting on well with it so far. However, I can't figure out how to do this.

I have a CSV which looks something like this:

a, col1, col2, col3
a, col1, col2, col3
a, col1, col2, col3
b, col1, col2, col3
b, col1, col2, col3
c, col1, col2, col3
c, col1, col2, col3



The first column holds a key (a, b, c), and the rest of the columns follow. What I want to do is read in the CSV (got that covered) and then split it based on the key, so that I have 3 chunks/groups of data, and then convert each of those chunks into a separate JSON file, which I think I can manage.

What I can't get my head around is grouping the data and then performing a separate action (convert to JSON) on each of those groups. It's not creating the JSON that I have an issue with.

The data comes from a sensor network of many environmental sensors, so there are many keys (hundreds), and new ones get added. I've used MapReduce to process this data before, and the concept of partitioning is what I'm trying to replicate here, without using the Hadoop elements of Kettle, as the deployment is different. Once I've partitioned the data, it needs to be loaded into different places as separate records. The key is the unique ID (serial number) of a sensor.

Any ideas?

Thanks

Recommended answer

If I have understood your question correctly, you can use the "Group by" step to group on the key column (i.e. the first header in your data set) and then store the result in memory.
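For readers who want to see the grouping idea outside of Kettle, here is a minimal Python sketch. It is not how the "Group by" step is implemented; it only illustrates partitioning rows by the first column and keeping the groups in memory. The function name `group_rows_by_key` and the `SAMPLE` text are made up for illustration, with `SAMPLE` mirroring the data in the question.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample mirroring the CSV from the question.
SAMPLE = """a, col1, col2, col3
a, col1, col2, col3
a, col1, col2, col3
b, col1, col2, col3
b, col1, col2, col3
c, col1, col2, col3
c, col1, col2, col3
"""


def group_rows_by_key(csv_text):
    """Partition CSV rows by the value in their first column."""
    groups = defaultdict(list)
    # skipinitialspace drops the blank that follows each comma in the sample.
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for row in reader:
        if not row:
            continue  # ignore empty lines
        key, *values = row
        groups[key].append(values)
    return dict(groups)


groups = group_rows_by_key(SAMPLE)
for key in sorted(groups):
    print(key, len(groups[key]))
```

Because the groups are built in a dictionary, new sensor keys need no configuration change: any key that appears in the input simply becomes a new entry.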

Once this is done, use parameter looping to get the variables and dynamically generate multiple JSON outputs. Check the image below:

In the JSON Output step, use variables like header1 in the file name to generate multiple files. The changes I made in the JSON Output step are highlighted below.
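The same "one output file per key" idea can be sketched in plain Python, with the group key substituted into the file name in roughly the way a variable such as header1 would be in the JSON Output step. This is only an illustration under assumptions: the function name `write_group_files`, the `{"key": ..., "rows": ...}` layout, and the `<key>.json` naming are invented here, and the sketch assumes each key is safe to use as a file name.

```python
import json
import tempfile
from pathlib import Path


def write_group_files(groups, out_dir):
    """Write one JSON file per key; the key becomes the file name."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for key, rows in groups.items():
        # Assumption: the key (sensor serial number) is a valid file name.
        path = out_dir / f"{key}.json"
        with path.open("w", encoding="utf-8") as fh:
            json.dump({"key": key, "rows": rows}, fh, indent=2)
        written.append(path)
    return written


# Hypothetical pre-grouped data: key -> list of remaining column values.
example_groups = {
    "a": [["col1", "col2", "col3"], ["col1", "col2", "col3"]],
    "b": [["col1", "col2", "col3"]],
}

output_dir = tempfile.mkdtemp()
files = write_group_files(example_groups, output_dir)
for path in files:
    print(path.name)
```

Each file can then be loaded into its own destination independently, which matches the requirement that the partitions end up as separate records.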

In case you find it confusing, I have uploaded sample code here.

Hope it helps :)
