How to efficiently move data from Kafka to an Impala table?


Problem Description

Here are the steps to the current process (a SQL sketch of steps 3-5 follows the list):

1. Flafka writes logs to a 'landing zone' on HDFS.
2. A job, scheduled by Oozie, copies complete files from the landing zone to a staging area.
3. The staging data is 'schema-ified' by a Hive table that uses the staging area as its location.
4. Records from the staging table are added to a permanent Hive table (e.g. insert into permanent_table select * from staging_table).
5. The data from the Hive table is made available in Impala by executing refresh permanent_table in Impala.
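
To make steps 3-5 concrete, they boil down to statements along the lines of the sketch below. Only the table names staging_table and permanent_table, the INSERT, and the REFRESH come from the steps above; the staging location and the columns are hypothetical placeholders.

    -- Step 3 (Hive): "schema-ify" the staging area with an external table.
    -- The location and columns are made-up placeholders.
    CREATE EXTERNAL TABLE staging_table (
      event_time STRING,
      payload    STRING
    )
    LOCATION '/user/etl/staging';

    -- Step 4 (Hive): append the staged records to the permanent table.
    INSERT INTO TABLE permanent_table SELECT * FROM staging_table;

    -- Step 5 (Impala, e.g. via impala-shell): make the new rows visible to Impala.
    REFRESH permanent_table;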

I look at the process I've built and it "smells" bad: there are too many intermediate steps that impair the flow of data.

About 20 months ago, I saw a demo where data was being streamed from an Amazon Kinesis pipe and was queryable, in near real-time, by Impala. I don't suppose they did something quite so ugly/convoluted. Is there a more efficient way to stream data from Kafka to Impala (possibly a Kafka consumer that can serialize to Parquet)?

I imagine that "streaming data to low-latency SQL" must be a fairly common use case, and so I'm interested to know how other people have solved this problem.

Solution

If you need to dump your Kafka data as-is to HDFS, the best option is to use Kafka Connect with the Confluent HDFS connector.

You can dump the data to Parquet files on HDFS, which you can then load in Impala. I think you'll want to use the TimeBasedPartitioner partitioner to create Parquet files every X milliseconds (tuning the partition.duration.ms configuration parameter).

Adding something like this to your Kafka Connect configuration might do the trick:

    # Commit files to HDFS in batches of 1000 records
    flush.size=1000

    # Dump to Parquet files
    format.class=io.confluent.connect.hdfs.parquet.ParquetFormat

    # Partition the output directories by time
    partitioner.class=io.confluent.connect.hdfs.partitioner.TimeBasedPartitioner

    # Start a new time-based partition every hour. If you change this,
    # remember to change path.format to reflect the change
    partition.duration.ms=3600000

    # Directory layout for the time-based partitions
    path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH/'minute'=mm

    # TimeBasedPartitioner also needs a locale and timezone
    locale=en
    timezone=UTC
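
To make the "load in Impala" part concrete, one option is to point an external Impala table at the connector's output directory. This is only a sketch: the table name kafka_events, the columns, and the path /topics/my_topic are hypothetical, and the column types must match the Parquet schema the connector actually writes. The partition columns mirror path.format above.

    -- Hypothetical Impala DDL over the connector's output directory.
    -- Partition columns mirror path.format ('year'=YYYY/'month'=MM/...).
    CREATE EXTERNAL TABLE kafka_events (
      event_time STRING,
      payload    STRING
    )
    PARTITIONED BY (`year` STRING, `month` STRING, `day` STRING, `hour` STRING, `minute` STRING)
    STORED AS PARQUET
    LOCATION '/topics/my_topic';

    -- Discover newly written partition directories, then refresh file metadata
    -- (the refresh is what step 5 of the old pipeline did by hand).
    ALTER TABLE kafka_events RECOVER PARTITIONS;
    REFRESH kafka_events;

With hourly partitions from partition.duration.ms, a scheduled RECOVER PARTITIONS/REFRESH replaces the Oozie copy, staging table, and insert steps of the original pipeline.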
    
