Unzip files using hadoop streaming
Problem description
I have many files in HDFS, each of which is a zip file containing a single CSV file. I'm trying to uncompress the files so I can run a streaming job on them.
I tried:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.reduce.tasks=0 \
-mapper /bin/zcat -reducer /bin/cat \
-input /path/to/files/ \
-output /path/to/output
However, I get an error (subprocess failed with code 1).
I also tried running it on a single file and got the same error.
Any advice?
Recommended answer
The root cause of the problem is that you get many lines of (text) information from hadoop before you can receive the actual data.
For example, hdfs dfs -cat hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz | zcat | wc -l will NOT work either; it fails with a "gzip: stdin: not in gzip format" error message.
Therefore you should skip this "unnecessary" information. In my case I had to skip 86 lines.
Therefore my one-line command (for counting the records) is: hdfs dfs -cat hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz | tail -n +86 | zcat | wc -l
Note: this is a workaround (not a real solution) and it is very ugly - because of the hardcoded "86" - but it works fine :)
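
To avoid the hardcoded 86, one option (my own sketch, not from the original answer, assuming bash and GNU grep on the client machine) is to locate the gzip magic bytes (1f 8b) in the stream and skip everything before them:

# Find the byte offset of the first gzip header in the stream.
# grep -abo prints the 0-based byte offset of each match; keep the first.
SRC=hdfs://hdm1.gphd.local:8020/hive/gphd/warehouse/my.db/my/part-m-00000.gz
OFFSET=$(hdfs dfs -cat "$SRC" | grep -abo $'\x1f\x8b' | head -n 1 | cut -d: -f1)

# tail -c is 1-based, so skip OFFSET bytes by starting at OFFSET+1,
# then decompress and count the records as before.
hdfs dfs -cat "$SRC" | tail -c +$((OFFSET + 1)) | zcat | wc -l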
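
Separately, note that the question describes zip archives rather than gzip files, and zcat only understands the gzip format. A minimal sketch for the zip case (again my own assumption, not from the thread): funzip, from the Info-ZIP unzip package, extracts the first member of an archive read from stdin, which matches the one-CSV-per-zip layout described in the question.

# Stream each zip out of HDFS, unpack its single CSV, and write it back.
# The -C flag (print paths only) is assumed available (Hadoop 2.x+).
for f in $(hdfs dfs -ls -C /path/to/files); do
  hdfs dfs -cat "$f" | funzip | hdfs dfs -put - "/path/to/output/$(basename "$f" .zip).csv"
done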