Count the number of rows for each file along with the file name in Talend


Question

I have built a job that reads data from a file and, based on the unique values of a particular column, splits the data set into many files.

I was able to achieve this with the job below:

To this job, which splits the output into multiple files, I now want to add a subjob that gives me two columns.

The first column should hold the names of the files created in my main job, and the second column should hold the number of rows in each of those output files.

To achieve this I used tFlowMeter, and to capture the counts I used tFlowMeterCatcher. This gives me the correct row count for each output file, but it reports the name of the last file for every one of the generated files.

How can I get the correct file names together with the corresponding row counts?

Recommended answer

If you follow the directions below, your job will end up with additional components like so:

Use a tJavaFlex directly after the tFileOutputDelimited on the main flow. It should look like this:

Start Code: int countRows = 0;
Main Code:  countRows = countRows + 1;
End Code:   globalMap.put("rowCount", countRows);
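To make the three tJavaFlex sections easier to picture, here is a minimal plain-Java sketch of the same lifecycle. It is not the code Talend generates: the class name FlexCounterSketch, the method countInto and the sample rows are assumptions made for illustration, and globalMap is modelled as an ordinary Map<String, Object>.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the tJavaFlex lifecycle; not Talend-generated code.
class FlexCounterSketch {

    // Counts the rows of one flow and stores the total under "rowCount",
    // mirroring the Start / Main / End sections shown above.
    static int countInto(Map<String, Object> globalMap, List<String> rows) {
        int countRows = 0;                      // Start Code: runs once, before the first row
        for (String row : rows) {
            countRows = countRows + 1;          // Main Code: runs once per incoming row
        }
        globalMap.put("rowCount", countRows);   // End Code: runs once, after the last row
        return countRows;
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        // Three made-up rows standing in for the data written to one output file.
        countInto(globalMap, List.of("a;1", "b;2", "c;3"));
        System.out.println(globalMap.get("rowCount"));
    }
}
```

Because the counter is declared in the Start Code, it is reset each time the component starts again, which is what makes the count per file rather than cumulative.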

Connect this component with an OnComponentOk trigger to the first component of a new subjob. This subjob holds a tFixedFlowInput, a tJavaRow and a tBufferOutput.

The tFixedFlowInput is only there so that the OnComponentOk trigger has something to connect to; nothing in it has to be altered. In the tJavaRow, put the following:

output_row.filename = (String)globalMap.get("row7.newColumn"); 
//or whatever is your row variable where the filename is located

output_row.rowCount = (Integer)globalMap.get("rowCount");
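The tJavaRow lines above read two values back out of globalMap and cast them to the schema types. The sketch below shows the same pattern in plain Java, assuming globalMap behaves like a Map<String, Object>. The OutputRow class is a hypothetical stand-in for the output_row struct that Talend generates from the schema, and the key "row7.newColumn" is only the placeholder used in the answer.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: mimics what the tJavaRow code does with globalMap.
class JavaRowSketch {

    // Hypothetical stand-in for the generated output_row struct.
    static class OutputRow {
        String filename;
        Integer rowCount;
    }

    // Reads the file name and the row count back from the map, casting each
    // Object value to the type declared in the schema.
    static OutputRow buildRow(Map<String, Object> globalMap, String filenameKey) {
        OutputRow outputRow = new OutputRow();
        outputRow.filename = (String) globalMap.get(filenameKey);
        outputRow.rowCount = (Integer) globalMap.get("rowCount");
        return outputRow;
    }

    public static void main(String[] args) {
        Map<String, Object> globalMap = new HashMap<>();
        globalMap.put("row7.newColumn", "fileblerb1.txt"); // placeholder key from the answer
        globalMap.put("rowCount", 27);
        OutputRow row = buildRow(globalMap, "row7.newColumn");
        System.out.println(row.filename + "|" + row.rowCount);
    }
}
```

If a key is missing, Map.get returns null and both fields stay null, so a wrong key name shows up as empty values rather than as an exception.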

In the schema, add the two columns that the code above fills: filename (String) and rowCount (Integer).

Now simply add a tBufferOutput at the end of the first subjob.

Now create another new subjob with a tBufferInput and whatever components you need to process and store the data. Connect the very first component of your job to the tBufferInput component with an OnSubjobOk trigger. I used a tLogRow to show the result (with fake data that I created at random):

.---------------+--------.
|      LogFileData       |
|=--------------+-------=|
|filename       |rowCount|
|=--------------+-------=|
|fileblerb1.txt |27      |
|fileblerb29.txt|14      |
|fileblerb44.txt|20      |
'---------------+--------'
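The buffer idea can be sketched in plain Java as follows: one (filename, rowCount) pair is appended per output file at the moment that file is finished, and the whole list is read afterwards. This is only an illustration under assumptions; BufferSketch, summarize and the sample file names are invented here and are not Talend APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative model of tBufferOutput / tBufferInput using an ordinary list.
class BufferSketch {

    // Appends one "filename|rowCount" entry per file, in the order the files
    // are processed, the way one buffered row is added per finished file.
    static List<String> summarize(Map<String, List<String>> rowsPerFile) {
        List<String> buffer = new ArrayList<>();
        for (Map.Entry<String, List<String>> file : rowsPerFile.entrySet()) {
            // The name is captured together with its own count, so a later
            // file cannot overwrite it.
            buffer.add(file.getKey() + "|" + file.getValue().size());
        }
        return buffer;
    }

    public static void main(String[] args) {
        // Made-up data; LinkedHashMap keeps insertion order.
        Map<String, List<String>> rowsPerFile = new LinkedHashMap<>();
        rowsPerFile.put("out_A.txt", List.of("r1", "r2", "r3"));
        rowsPerFile.put("out_B.txt", List.of("r1"));
        for (String line : summarize(rowsPerFile)) {
            System.out.println(line);
        }
    }
}
```

Pairing the name with the count at the time each file is completed is what avoids the symptom in the question, where every count was reported with the last file name.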

NOTE: Keep in mind that if you add a header to the file (Include Header checked in tFileOutputDelimited), the job might need to be changed (for example, set int countRows = 1; or whatever start value you need). I did not test this case.
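As a rough sketch of that adjustment, the starting value of the counter can depend on whether a header line is written. This assumes the header is written by the output component and does not pass through the flow as a row, which, like the note itself, is untested here. HeaderOffsetSketch and linesInFile are hypothetical names.

```java
// Illustrative only: shows how the start value changes the reported total.
class HeaderOffsetSketch {

    // Returns the number of physical lines expected in the file: the data
    // rows, plus one if a header line is written.
    static int linesInFile(int dataRows, boolean includeHeader) {
        int countRows = includeHeader ? 1 : 0;  // Start Code with or without the header offset
        for (int i = 0; i < dataRows; i++) {
            countRows = countRows + 1;          // Main Code: once per data row
        }
        return countRows;
    }

    public static void main(String[] args) {
        System.out.println(linesInFile(27, false));
        System.out.println(linesInFile(27, true));
    }
}
```

Whether the offset is wanted depends on what the count should mean: data rows only (start at 0) or lines in the file including the header (start at 1).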
