How to join two CSVs with Apache NiFi
Question
I'm looking into ETL tools (like Talend) and investigating whether Apache NiFi could be used instead. Could NiFi perform the following:
- Pick up two CSV files from the local disk
- Join the CSVs on a common column
- Write the joined CSV to disk
I've tried setting up a job in NiFi, but couldn't see how to join two separate CSV files. Is this task possible in Apache NiFi?
It looks like the QueryDNS processor could be used to enrich one CSV file using the other, but that seems over-complicated for this use case.
Here's an example of the input CSVs, which need to be joined on state_id:
customers.csv
id | name | address      | state_id
---|------|--------------|---------
1  | John | 10 Blue Lane | 100
2  | Bob  | 15 Green St. | 200
states.csv
state_id | state
---------|---------
100      | Alabama
200      | New York
Output file:
output.csv
id | name | address      | state
---|------|--------------|---------
1  | John | 10 Blue Lane | Alabama
2  | Bob  | 15 Green St. | New York
Recommended answer
Apache NiFi is more of a dataflow tool and is not really designed to perform arbitrary joins of streaming data. Those operations are typically better suited to stream processing systems such as Storm, Flink, or Apex, or to ETL tools.
The kind of join that NiFi can do well is an enrichment lookup: there is a fixed-size lookup dataset, and for each record in the incoming data you use that dataset to retrieve some value. In your case, for example, there could be a processor called LookUpState with a property "State Data" that points to a file containing all the states; customers.csv would then be the input to this processor.
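To make the enrichment-lookup idea concrete, here is a minimal sketch in plain Python rather than NiFi. It is not the LookUpState processor described above (which is itself only an example), and the function names `load_lookup`, `enrich`, and `join_example` are hypothetical. It assumes the files are comma-separated and that the states data is small enough to hold in memory.

```python
import csv
import io


def load_lookup(states_file, key_column, value_column):
    """Read the fixed-size lookup dataset into a dict keyed by key_column."""
    reader = csv.DictReader(states_file)
    return {row[key_column]: row[value_column] for row in reader}


def enrich(customers_file, output_file, lookup, key_column, value_column):
    """Stream incoming records, swapping key_column for the looked-up value."""
    reader = csv.DictReader(customers_file)
    # Output header: same columns, with the key column replaced by the value column.
    fieldnames = [value_column if name == key_column else name
                  for name in reader.fieldnames]
    writer = csv.DictWriter(output_file, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        key = row.pop(key_column)
        # Assumption: unmatched keys produce an empty value instead of an error.
        row[value_column] = lookup.get(key, "")
        writer.writerow(row)


# Sample data from the question, written as comma-separated text.
CUSTOMERS = (
    "id,name,address,state_id\n"
    "1,John,10 Blue Lane,100\n"
    "2,Bob,15 Green St.,200\n"
)
STATES = "state_id,state\n100,Alabama\n200,New York\n"


def join_example():
    """Run the lookup join on the sample data and return the output CSV text."""
    lookup = load_lookup(io.StringIO(STATES), "state_id", "state")
    out = io.StringIO()
    enrich(io.StringIO(CUSTOMERS), out, lookup, "state_id", "state")
    return out.getvalue()


if __name__ == "__main__":
    print(join_example(), end="")
```

The states file plays the role of the fixed lookup dataset and is loaded once, while customer records are processed one at a time. That is why this kind of join suits a record-by-record flow, whereas a general join of two unbounded streams does not.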
A community member started a project to build a generic lookup service for NiFi: https://github.com/jfrazee/nifi-lookup-service