
Read uncompressed Thrift files in Spark

Views: 95 | Published: 2021/4/8 19:53:10 | Tags: apache-spark, thrift, hadoop-lzo


Question

I'm trying to get Spark to read uncompressed Thrift files from S3. So far it has not worked.

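The data is loaded into S3 as uncompressed Thrift files. The source is AWS Kinesis Firehose.
I have a tool that deserializes the files without any problem, so I know the Thrift serialization/deserialization itself works.
In Spark, I am using newAPIHadoopFile.
With Elephant Bird's LzoThriftBlockInputFormat, I can successfully read LZO-compressed Thrift files.
I can't figure out which InputFormat I should use to read uncompressed Thrift files.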

Is this possible with any of the existing InputFormats, or do I have to implement my own?

Recommended answer

I ended up writing my own custom Thrift deserializer.

I needed to implement a custom InputFormat and a custom RecordReader. I am still surprised that such classes don't already exist in some library. The two classes have been tested and work, but since I stopped working on the project soon after solving this, the code has not been cleaned up.

https://github.com/mklosi/thrift-deserializer
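
To illustrate what a custom RecordReader has to do for this kind of file, here is a minimal sketch of the record-boundary logic. It is not the code from the repository above. It assumes the records were serialized with Thrift's binary protocol (TBinaryProtocol) and written back to back with no delimiter or framing; the question does not state either. The function names (`split_records`, `skip_struct`, and so on) are hypothetical. The sketch is in plain Python so it can be run without Hadoop or Spark.

```python
import io
import struct

# Type codes used by Thrift's binary protocol (TBinaryProtocol).
STOP, BOOL, BYTE, DOUBLE, I16, I32, I64 = 0, 2, 3, 4, 6, 8, 10
STRING, STRUCT, MAP, SET, LIST = 11, 12, 13, 14, 15
FIXED_SIZES = {BOOL: 1, BYTE: 1, DOUBLE: 8, I16: 2, I32: 4, I64: 8}


def read_exact(stream, n):
    """Read exactly n bytes, or raise EOFError if the record is truncated."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("truncated Thrift record")
    return data


def read_size(stream):
    """Read a big-endian i32 used as a string length or element count."""
    size = struct.unpack(">i", read_exact(stream, 4))[0]
    if size < 0:
        raise ValueError("negative size in Thrift record")
    return size


def skip_value(stream, ttype):
    """Advance the stream past one value of the given Thrift type."""
    if ttype in FIXED_SIZES:
        read_exact(stream, FIXED_SIZES[ttype])
    elif ttype == STRING:
        read_exact(stream, read_size(stream))
    elif ttype == STRUCT:
        skip_struct(stream)
    elif ttype == MAP:
        key_type, value_type = read_exact(stream, 2)
        for _ in range(read_size(stream)):
            skip_value(stream, key_type)
            skip_value(stream, value_type)
    elif ttype in (SET, LIST):
        elem_type = read_exact(stream, 1)[0]
        for _ in range(read_size(stream)):
            skip_value(stream, elem_type)
    else:
        raise ValueError(f"unknown Thrift type code {ttype}")


def skip_struct(stream):
    """Advance past one struct: repeated (type, field id, value), then STOP."""
    while True:
        ttype = read_exact(stream, 1)[0]
        if ttype == STOP:
            return
        read_exact(stream, 2)  # field id (i16); not needed to find the boundary
        skip_value(stream, ttype)


def split_records(data):
    """Yield the raw bytes of each back-to-back struct found in data."""
    stream = io.BytesIO(data)
    while stream.tell() < len(data):
        start = stream.tell()
        skip_struct(stream)
        yield data[start:stream.tell()]
```

Each yielded byte slice could then be passed to a Thrift deserializer for the generated class. In a Hadoop RecordReader the same walk would presumably run once per call to nextKeyValue(). Under these assumptions the file has no sync markers, so such an InputFormat would likely have to treat each file as non-splittable.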
