How to join two CSVs with Apache NiFi


Question

I'm looking into ETL tools (like Talend) and investigating whether Apache NiFi could be used instead. Could NiFi be used to perform the following:

  1. Pick up two CSV files placed on local disk
  2. Join the CSVs on a common column
  3. Write the joined CSV file to disk

I've tried setting up a job in NiFi, but couldn't see how to join two separate CSV files. Is this task possible in Apache NiFi?

It looks like the QueryDNS processor could be used to enrich one CSV file using the other, but that seems over-complicated for this use case.

Here's an example of the input CSVs, which need to be joined on state_id:

customers.csv

id | name | address      | state_id
---|------|--------------|---------
1  | John | 10 Blue Lane | 100
2  | Bob  | 15 Green St. | 200

states.csv

state_id | state
---------|---------
100      | Alabama
200      | New York

Output file

output.csv

id | name | address      | state
---|------|--------------|---------
1  | John | 10 Blue Lane | Alabama
2  | Bob  | 15 Green St. | New York

Recommended answer

Apache NiFi is more of a dataflow tool and is not really designed to perform arbitrary joins of streaming data. Those operations are typically better suited to stream processing systems such as Storm, Flink, or Apex, or to ETL tools.

The kind of join that NiFi can do well is an enrichment lookup: there is a fixed-size lookup dataset, and for each record in the incoming data you use that dataset to retrieve some value. For example, in your case there could be a processor called LookUpState with a property "State Data" that points to a file containing all the states; customers.csv would then be the input to this processor.
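To make the enrichment-lookup idea concrete, here is a minimal sketch in plain Python. It is not NiFi code and not the LookUpState processor mentioned above. It assumes the two files are ordinary comma-separated CSVs with the columns shown in the question. The function names (`load_lookup`, `enrich`, `join_csv_text`) and the choice to emit an empty string for an unknown state_id are assumptions made for illustration.

```python
import csv
import io

# Assumed comma-separated contents of the two files from the question.
CUSTOMERS_CSV = """id,name,address,state_id
1,John,10 Blue Lane,100
2,Bob,15 Green St.,200
"""

STATES_CSV = """state_id,state
100,Alabama
200,New York
"""


def load_lookup(states_file):
    """Read the fixed-size lookup dataset into a dict: state_id -> state."""
    return {row["state_id"]: row["state"] for row in csv.DictReader(states_file)}


def enrich(customers_file, lookup, out_file):
    """Write each customer record with state_id replaced by the looked-up state."""
    writer = csv.DictWriter(
        out_file,
        fieldnames=["id", "name", "address", "state"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in csv.DictReader(customers_file):
        writer.writerow({
            "id": row["id"],
            "name": row["name"],
            "address": row["address"],
            # Assumption: an unknown state_id yields an empty state value.
            "state": lookup.get(row["state_id"], ""),
        })


def join_csv_text(customers_text, states_text):
    """Convenience wrapper (hypothetical) that runs the lookup on in-memory text."""
    lookup = load_lookup(io.StringIO(states_text))
    out = io.StringIO()
    enrich(io.StringIO(customers_text), lookup, out)
    return out.getvalue()


if __name__ == "__main__":
    print(join_csv_text(CUSTOMERS_CSV, STATES_CSV), end="")
```

The lookup side is loaded fully into memory once, while the customer records are streamed one at a time. This is why the approach suits a small, fixed lookup dataset rather than an arbitrary join of two large streams.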

A community member started a project to build a generic lookup service for NiFi: https://github.com/jfrazee/nifi-lookup-service
