Hive -- split data across files

Question

Is there a way to instruct Hive to split data into multiple output files? Or maybe cap the size of the output files.

I'm planning to use Redshift, which recommends splitting data into multiple files to allow parallel loading: http://docs.aws.amazon.com/redshift/latest/dg/t_splitting-data-files.html

We preprocess all our data in Hive, and I'm wondering if there's a way to create, say, ten 1 GB files, which might make copying to Redshift faster.

I was looking at https://cwiki.apache.org/Hive/adminmanual-configuration.html and https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties, but I can't find anything.

Answer

There are a couple of ways you could go about splitting Hive output. The first and easiest way is to set the number of reducers. Since each reducer writes to its own output file, the number of reducers you specify will correspond to the number of output files written. Note that some Hive queries will not result in the number of reducers you specify (for example, SELECT COUNT(*) FROM some_table always results in one reducer). To specify the number of reducers, run this before your query:

set mapred.reduce.tasks=10;
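
As a sketch of how that plays out end to end (the export directory and the DISTRIBUTE BY column are assumptions, not part of the original answer): the reducer setting only takes effect when the query actually has a reduce stage, so a map-only SELECT can be forced through one with DISTRIBUTE BY, leaving roughly one output file per reducer in the target directory.

-- Sketch only: hypothetical export path. DISTRIBUTE BY forces a shuffle,
-- so the reducer count applies and each reducer writes its own file.
set mapred.reduce.tasks=10;

INSERT OVERWRITE DIRECTORY '/tmp/redshift_export'
SELECT id, value
FROM some_table
DISTRIBUTE BY id;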

Another way you could split into multiple output files would be to have Hive insert the results of your query into a partitioned table. This would result in at least one file per partition. For this to make sense you must have some reasonable column to partition on. For example, you wouldn't want to partition on a unique id column, or you would have one file for each record. This approach will guarantee at least one output file per partition, and at most numPartitions * numReducers. Here's an example (don't worry too much about hive.exec.dynamic.partition.mode; it just needs to be set for this query to work).

set hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE table_to_export_to_redshift (
  id INT,
  value INT
)
PARTITIONED BY (country STRING);

INSERT OVERWRITE TABLE table_to_export_to_redshift
PARTITION (country)
SELECT id, value, country
FROM some_table;
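
If the number of files inside each partition also matters, the two approaches above can be combined; this is a rough sketch rather than something from the original answer. DISTRIBUTE BY country sends all rows for a given country to the same reducer, so each partition is typically written as a single file instead of up to numReducers files.

-- Sketch only: combines the reducer setting with the dynamic-partition insert.
-- Older Hive versions may also require: set hive.exec.dynamic.partition=true;
set mapred.reduce.tasks=10;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE table_to_export_to_redshift
PARTITION (country)
SELECT id, value, country
FROM some_table
DISTRIBUTE BY country;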

To get more fine-grained control, you can write your own reduce script to pass to Hive and have that reduce script write to multiple files. Once you are writing your own reducer, you can do pretty much whatever you want.
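
A minimal sketch of the wiring for that (my_reducer.py is a hypothetical script, not part of the original answer): ADD FILE ships the script to the cluster, each reducer pipes its share of the clustered rows to the script's stdin as tab-separated text, and whatever the script prints to stdout is what Hive writes out for that reducer, so the script fully controls the contents of each output file.

-- Sketch only: hypothetical script and export path. One copy of the script
-- runs per reducer; its stdout becomes that reducer's output file.
ADD FILE /path/to/my_reducer.py;

FROM (
  FROM some_table
  SELECT id, value
  CLUSTER BY id
) clustered
INSERT OVERWRITE DIRECTORY '/tmp/redshift_export_custom'
REDUCE clustered.id, clustered.value
USING 'python my_reducer.py'
AS id, value;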

Finally, you can forgo trying to maneuver Hive into outputting your desired number of files and just break them apart yourself once Hive is done. By default, Hive stores its tables uncompressed and in plain text in its warehouse directory (e.g., /apps/hive/warehouse/table_to_export_to_redshift). You can use Hadoop shell commands, a MapReduce job, Pig, or pull them into Linux and break them apart however you like.

I don't have any experience with Redshift, so some of my suggestions may not be appropriate for consumption by Redshift for whatever reason.

A couple of notes: splitting files into more, smaller files is generally bad for Hadoop. You might get a speed increase for Redshift, but if the files are consumed by other parts of the Hadoop ecosystem (MapReduce, Hive, Pig, etc.) you might see a performance loss if the files are too small (though 1 GB would be fine). Also make sure that the extra processing/developer time is worth the time savings you get from parallelizing your Redshift data load.
