Eliminate duplicates and insert unique records having the maximum number of column values present through Talend


Problem description

I have an Excel file that is updated daily, so the data is different every time.

I am pulling the data from the Excel sheet into a table using Talend. The table has a primary key, Company_ID.

The problem I am facing is that the Excel sheet contains a few duplicate Company_ID values. It will also pick up more duplicates in the future, as the Excel file is updated daily.

I want to choose the first record where the Company_ID field is 1 and the record doesn't have null in the rest of the columns. Also, for a Company_ID of 3 there is a null value in one column, which is OK since it is the only record for that Company_ID.

How do I choose, in Talend, a unique row which has the maximum number of column values present, e.g. in the case of a Company_ID of 1?

Recommended answer

tUniqRow is usually the easiest way to handle duplicates. If you are worried that the first row reaching tUniqRow may not be the row you want to keep, you can sort your rows first, so that they enter tUniqRow in your preferred order:


(used components: tFileInputExcel, tJavaRow, tSortRow, tUniqRow, tFilterColumns)

In your particular case, the tJavaRow could look like this:

// Code generated according to input schema and output schema
output_row.company_id = input_row.company_id;
output_row.name       = input_row.name;
output_row.et_cetera  = input_row.et_cetera;
// End of pre-generated code

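// Count the null columns in this row; fewer nulls means a more complete row.
int i = 0;
if (input_row.company_id == null) { i++; }
if (input_row.name       == null) { i++; }
if (input_row.et_cetera  == null) { i++; }
// Extra output column holding the null count, for the sort step to use.
output_row.priority = i;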
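
To make the rest of the flow concrete, here is a minimal plain-Java sketch of what the components after the tJavaRow would be doing. It is not Talend-generated code. It assumes that tSortRow sorts ascending on the priority column, that tUniqRow uses company_id as its key and keeps the first row per key, and that tFilterColumns then drops the helper column. The class DedupSketch, the Row type and the dedupe method are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {

    // Hypothetical row type mirroring the schema used in the tJavaRow snippet.
    static final class Row {
        final Integer companyId;
        final String name;
        final String etCetera;

        Row(Integer companyId, String name, String etCetera) {
            this.companyId = companyId;
            this.name = name;
            this.etCetera = etCetera;
        }

        // Same idea as the tJavaRow code: count the null columns.
        int priority() {
            int i = 0;
            if (companyId == null) { i++; }
            if (name == null) { i++; }
            if (etCetera == null) { i++; }
            return i;
        }
    }

    static List<Row> dedupe(List<Row> input) {
        // Sort step (assumed tSortRow setting): fewest nulls first.
        // List.sort is stable, so rows with equal priority keep their file order.
        List<Row> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparingInt(Row::priority));

        // Uniqueness step (assumed tUniqRow key: company_id):
        // keep only the first row seen for each key.
        Map<Integer, Row> firstPerCompany = new LinkedHashMap<>();
        for (Row r : sorted) {
            firstPerCompany.putIfAbsent(r.companyId, r);
        }
        // The priority value is never stored on the row, which plays the
        // role of dropping the helper column with tFilterColumns.
        return new ArrayList<>(firstPerCompany.values());
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(1, null, "x"));    // duplicate of company 1 with a null
        rows.add(new Row(1, "Acme", "x"));  // complete row for company 1
        rows.add(new Row(3, "Beta", null)); // only row for company 3
        for (Row r : dedupe(rows)) {
            System.out.println(r.companyId + " " + r.name + " " + r.etCetera);
        }
    }
}
```

Because the sort is stable, two rows of the same company with the same number of nulls stay in their original file order, so the earlier one is kept.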
