How to join two CSVs with Apache NiFi

Question

I'm looking into ETL tools (like Talend) and investigating whether Apache Nifi could be used. Could Nifi be used to perform the following:

  1. Ingest two CSV files located on the local disk
  2. Join the CSVs on a common column
  3. Write the joined CSV to disk

I've tried setting up a job in Nifi, but couldn't see how to perform the join of two separate CSV files. Is this task possible in Apache Nifi?

It looks like the QueryDNS processor could be used to perform enrichment of one CSV file using the other, but that seems to be over-complicated for this use case.

Here's an example of the input CSVs, which need to be joined on state_id:

customers.csv

id | name | address      | state_id
---|------|--------------|---------
1  | John | 10 Blue Lane | 100
2  | Bob  | 15 Green St. | 200

states.csv

state_id | state
---------|---------
100      | Alabama
200      | New York

Expected output file:

output.csv

id | name | address      | state
---|------|--------------|---------
1  | John | 10 Blue Lane | Alabama
2  | Bob  | 15 Green St. | New York

Recommended answer

Apache NiFi is more of a dataflow tool and not really made to perform arbitrary joins of streaming data. Typically those types of operations are better suited to stream processing systems like Storm, Flink, Apex, etc, or ETL tools.

The types of joins that NiFi can do well are enrichment look ups where there is a fixed size lookup dataset, and for each record in the incoming data you use the lookup dataset to retrieve some value. For example, in your case there could be a processor called LookUpState which has a property "State Data" which points to a file containing all the states, then the customers.csv could be the input to this processor.
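To make the enrichment-lookup idea concrete, here is a minimal sketch in plain Python, outside NiFi. It is not NiFi code and not the API of any NiFi processor. It only illustrates the pattern described above: load the small, fixed-size lookup dataset into memory once, then stream the incoming records through it one at a time. The function names (`load_lookup`, `enrich`, `join_csv_text`) are made up for illustration, and the sketch assumes the files are comma-separated with a header row, using the column names from the question.

```python
import csv
import io


def load_lookup(states_file, key="state_id", value="state"):
    """Read the small, fixed-size lookup dataset fully into memory."""
    reader = csv.DictReader(states_file)
    return {row[key]: row[value] for row in reader}


def enrich(customers_file, lookup, out_file, key="state_id", value="state"):
    """Stream customer records, replacing the key column with the looked-up value."""
    reader = csv.DictReader(customers_file)
    # Output keeps every input column except the key, then appends the looked-up column.
    fields = [f for f in reader.fieldnames if f != key] + [value]
    writer = csv.DictWriter(out_file, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Assumption: an unknown state_id yields an empty value rather than an error.
        row[value] = lookup.get(row.pop(key), "")
        writer.writerow(row)


def join_csv_text(customers_text, states_text):
    """Convenience wrapper that works on in-memory CSV text."""
    lookup = load_lookup(io.StringIO(states_text))
    out = io.StringIO()
    enrich(io.StringIO(customers_text), lookup, out)
    return out.getvalue()


if __name__ == "__main__":
    customers = (
        "id,name,address,state_id\n"
        "1,John,10 Blue Lane,100\n"
        "2,Bob,15 Green St.,200\n"
    )
    states = "state_id,state\n100,Alabama\n200,New York\n"
    print(join_csv_text(customers, states), end="")
```

Only the lookup side is held in memory; the customer records are processed one by one. That is why this kind of join fits a dataflow tool, while a join between two unbounded streams does not. When reading real files instead of in-memory text, open them with `newline=""` as the `csv` module documentation recommends.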

A community member started a project to make a generic lookup service for NiFi: https://github.com/jfrazee/nifi-lookup-service
