PDI - Read CSV Files, if missing field/data then move to the next file


Question


I'm new to PDI and still learning about it. I'm trying to create a transformation that will read all the CSV files from one folder, check that the data in each file is correct, meaning there are no rows with missing/erroneous/wrongly formatted values, and then store it in a database.

What I have tried is:

  1. Use Text File Input to access the CSV files on FTP using Apache Commons VFS.
  2. Validate and set conditions to check the data in the CSV (the filename, and whether fields exist) using Filter Rows.
  3. Output into a PostgreSQL table using Synchronize after merge. I used this because I also join the CSV data with data from another table.

The result of my second step is not what I want. Currently it checks after all the CSVs are read and passes all the data to the next step, but what I want is to check while reading the data, so that only correct data is passed on. How can I do that? Any suggestions? (need brainstorming)

And if that is impossible to implement in PDI, then it's okay to read all the data and pass it to the next step, but then validate it again before inserting the data.
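That fallback (read everything first, then re-validate a whole file just before inserting it) can be sketched in plain Python, outside PDI, just to illustrate the logic. The `insert` callback and the required-field list are hypothetical stand-ins for the Synchronize after merge step and the real schema:

```python
import csv

def validate_rows(rows, required_fields):
    # A file is acceptable only if every required field is non-empty in every row.
    return all(row.get(field, "").strip()
               for row in rows
               for field in required_fields)

def load_file(path, required_fields, insert):
    # Read the whole file first; only insert if the entire file is valid.
    with open(path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    if not validate_rows(rows, required_fields):
        return False          # skip the whole file, move on to the next one
    for row in rows:
        insert(row)           # e.g. write into the PostgreSQL table
    return True
```

A rejected file is skipped as a whole, which matches the "move to the next file" behaviour asked about in the title.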

Solution

You can only validate a file after all its data has been completely read and checked.

The good way to do this is a job orchestrating several transformations (one to read the directory, one to check whether the files are valid, one to load the data of the validated files).

Now, writing a job may seem a daunting task until you have written half a dozen of them. So you can do it in one transformation. In fact, it is a pattern for making decisions or computations based on indicators defined over the whole input data.

  1. Get the list of files.
  2. Read them, keeping track of the filename (in the Additional output fields tab).
  3. Make the check line by line as you did.
  4. Make a summary that rejects a file if it contains at least one error.
  5. Take back the main stream of step 2, and for each row look up whether its filename was rejected. (The lookup stream is the result of the Group by.)
  6. Filter out the rows with a rejected filename.
  7. Put them into Postgres (after enriching the data with other files or tables).

Just a remark: in your specific case, I would change the flow a bit, testing for the accepted filenames in the first filter and removing the Group by and the second filter. But I thought it would be more useful for you to have the standard pattern.

But, again, for various reasons, good practice would be to do it with a master job.
