Filling data gaps in a stream in Pentaho Data Integration, is it possible?

Problem description

I have a CSV file with currency exchanges EUR-USD. The file was downloaded from the Bank of Canada. I downloaded the CSV with data since Oct 10th, 2013 onwards.

There are, nevertheless, gaps in the data, i.e., days without conversion rates.

I've been fighting (first day with Spoon/Kettle) to find a simple but general way to fill the gaps, say, with the last non-null value. The only way I've managed to do this is by chaining four "Get previous row fields" steps and then using NVL in a Calculator step to take the first non-null value. But that only works if the gaps in the stream are no bigger than 4 rows.

This image shows the transformation:

My first question reduces to: Is there a general way to do interpolation/extrapolation in a stream with gaps?

I tried to use the "Modified JavaScript Value" step, but the API still escapes me. Moreover, it seems that this step only has the Map part of a MapReduce combo, and I'd probably need both.

So, my second question is: Is there a way to program a MapReduce combo in a language that is not Java (Scala, Clojure, Jython or JS)?

Recommended answer

You can use a combination of the following steps:

1) Analytical query - allows you to fetch the value of a field N rows before or after the current row; In your case, you will want to fetch the date 1 row ahead (the next available date)

2) Calculator - having determined the next date for the row, use it to calculate the days between dates (dbd);

3) Calculate a field number_of_clones as dbd-1 (the number of days missing);

4) Use that field in the Clone Rows step to multiply a row as many times as necessary; add a clone_number field;

5) Add the clone_number as days to the date and you get the day it refers to.

Moreover, the Analytical query step allows you to specify a field as the "group by" field, so that if you have x-rates for USD followed by x-rates for GBP, the final USD row will retrieve null as its next value instead of picking up the first GBP date.

Here's a sample KTR file:

The Data Grid step generates a few rows with some gaps in the data:

The Analytical query step fetches the next date within the same currency.
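To make the "next date within the same currency" idea concrete, here is a minimal sketch in plain Python, outside PDI. It is not the Analytical query step itself, only an imitation of its behavior; the function name `add_next_date`, the sample rows, and the exchange rates are all hypothetical.

```python
from datetime import date

def add_next_date(rows):
    """Attach the date of the following row in the same currency (a lead of 1).

    Assumes rows are already sorted by currency, then by date.
    """
    out = []
    for i, row in enumerate(rows):
        nxt = rows[i + 1] if i + 1 < len(rows) else None
        # Only look ahead within the same group; otherwise the value is None.
        same_group = nxt is not None and nxt["currency"] == row["currency"]
        out.append({**row, "next_date": nxt["date"] if same_group else None})
    return out

# Hypothetical sample data with gaps (rates are made up).
rows = [
    {"currency": "USD", "date": date(2014, 1, 1), "exchange_rate": 1.10},
    {"currency": "USD", "date": date(2014, 1, 2), "exchange_rate": 1.20},
    {"currency": "USD", "date": date(2014, 1, 5), "exchange_rate": 1.30},
    {"currency": "GBP", "date": date(2014, 1, 1), "exchange_rate": 0.80},
    {"currency": "GBP", "date": date(2014, 1, 3), "exchange_rate": 0.90},
]

with_next = add_next_date(rows)
for r in with_next:
    print(r["currency"], r["date"], r["next_date"])
```

The last row of each currency gets `None` as its next date, which is the null the following step has to deal with.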

Then the Calculator step calculates how many rows are missing. Note that the last day of each currency will have null as its value, so we need to tweak that and use 0 instead (NVL(A,B) returns B if A is null, and A otherwise).
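The arithmetic in this step can be sketched in Python as follows. This is only an illustration of the logic, assuming the field names from the answer (dbd, number_of_clones); the helper names `nvl` and `number_of_clones` are hypothetical and are not PDI functions.

```python
from datetime import date

def nvl(a, b):
    """Return b if a is None, otherwise a (same contract as NVL(A,B))."""
    return b if a is None else a

def number_of_clones(current_date, next_date):
    """Number of missing days between a row and the next row of its group."""
    # dbd: days between dates; None when there is no next row in the group.
    dbd = None if next_date is None else (next_date - current_date).days
    missing = None if dbd is None else dbd - 1
    # The last row of each currency has no next date, so fall back to 0.
    return nvl(missing, 0)

print(number_of_clones(date(2014, 1, 2), date(2014, 1, 5)))
print(number_of_clones(date(2014, 1, 1), date(2014, 1, 2)))
print(number_of_clones(date(2014, 1, 5), None))
```

A row followed three days later by the next row needs two clones, consecutive days need none, and the final row of a group also needs none.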

Clone rows: takes a row and creates copies.

The clone_number field allows us to calculate the actual date the row refers to.
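As a rough Python illustration of cloning plus the date shift, the sketch below emits the original row followed by its copies and adds clone_number days to each. It assumes the original row carries clone_number 0 and the copies are numbered from 1; the function name `clone_rows` and the sample row are hypothetical.

```python
from datetime import date, timedelta

def clone_rows(row, number_of_clones):
    """Yield the original row plus number_of_clones copies.

    Each output row gets a clone_number (0 for the original, assumed)
    and a new_date equal to the original date plus clone_number days.
    """
    for clone_number in range(number_of_clones + 1):
        yield {
            **row,
            "clone_number": clone_number,
            "new_date": row["date"] + timedelta(days=clone_number),
        }

# Hypothetical row: the next known rate is on 2014-01-05, so 2 days are missing.
row = {"currency": "USD", "date": date(2014, 1, 2), "exchange_rate": 1.20}
filled = list(clone_rows(row, 2))
for r in filled:
    print(r["new_date"], r["currency"], r["exchange_rate"])
```

Every copy keeps the exchange rate of the row it was cloned from, which is what carries the last known value forward into the gap.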

Finally, here's the data. The fields you want are new_date, currency and exchange_rate. Use a Select values step to reorder the field list and get rid of the fields you no longer need.

As you can see, now we have data for 2014-01-03 and 2014-01-04, using the previous known value.
