How to split a CSV or JSON file for optimal Snowflake ingestion?

Posted: 2021/12/28 12:22:40. Tags: split, command-line-interface, gzip, snowflake-cloud-data-platform

Question

Snowflake recommends splitting large files before ingesting:

To optimize the number of parallel operations for a load, we recommend aiming to produce data files roughly 100-250 MB (or larger) in size compressed. https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare.html

What's the best way to split my large files, and compress them?

Recommended answer

This is the best command line sequence I could come up with:

cat bigfile.json | split -C 1000000000 -d -a4 - output_prefix --filter='gzip > $FILE.gz'

Replace the first step with any command that writes JSON or CSV to stdout, depending on the source file. For a plain file, cat will do; for a .gz file, use gzcat (or zcat or gzip -dc on systems without gzcat); for a .zst file, use unzstd --long=31 -c file.zst; and so on.
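One way to sketch that choice is a small shell function that picks the decompressor from the file extension. The function name to_stdout and the sample file names are hypothetical, and the .zst branch assumes the zstd tools are installed.

```shell
# Hypothetical helper: write a file's uncompressed contents to stdout,
# choosing the tool from the file extension.
to_stdout() {
  case "$1" in
    *.gz)  gzip -dc "$1" ;;              # same effect as gzcat / zcat
    *.zst) unzstd --long=31 -c "$1" ;;   # assumes zstd is installed
    *)     cat "$1" ;;                   # plain CSV or JSON
  esac
}

# Small demo with generated sample files (names are made up).
tmp=$(mktemp -d)
printf 'a,b\n1,2\n3,4\n' > "$tmp/sample.csv"
gzip -c "$tmp/sample.csv" > "$tmp/sample.csv.gz"

to_stdout "$tmp/sample.csv"
to_stdout "$tmp/sample.csv.gz"
```

The output of to_stdout can then be piped into split in place of cat bigfile.json.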

Then, for split:

-C 1000000000 puts at most 1 GB (10^9 bytes) of complete lines into each output file, so no row is cut in half.
-d gives each file a numeric suffix (I prefer this to the default letters).
-a4 makes the numeric suffix 4 digits long (instead of the default 2).
- makes split read from standard input, that is, the output of the previous cat in the pipeline.
output_prefix is the base name for all output files.
--filter='gzip > $FILE.gz' (a GNU split option) compresses each 1 GB chunk on the fly with gzip. With data that compresses about 10:1, each final file ends up around 100 MB.
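To see the options in action without a multi-gigabyte input, here is a scaled-down sketch of the same pipeline. It assumes GNU split (BSD split has no --filter), generates a hypothetical stand-in for bigfile.json, and uses 10000-byte chunks instead of 1 GB.

```shell
# Work in a scratch directory so nothing is left behind in the current one.
workdir=$(mktemp -d)

# Generate 1000 one-line JSON records as a stand-in for bigfile.json.
i=1
while [ "$i" -le 1000 ]; do
  printf '{"id": %d, "name": "row-%d"}\n' "$i" "$i"
  i=$((i + 1))
done > "$workdir/bigfile.json"

# Same flags as above, but each chunk holds at most 10000 bytes of whole lines.
# split sets $FILE to the name of the chunk it is about to write.
cat "$workdir/bigfile.json" |
  split -C 10000 -d -a4 --filter='gzip > "$FILE.gz"' - "$workdir/output_prefix"

# List the compressed chunks: output_prefix0000.gz, output_prefix0001.gz, ...
ls "$workdir"/output_prefix*.gz

# Sanity check: decompressing all chunks in order reproduces the input.
gzip -dc "$workdir"/output_prefix*.gz | cmp - "$workdir/bigfile.json" &&
  echo "round trip OK"
```

Because -C only breaks at line boundaries, every chunk is a valid newline-delimited file on its own.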

Snowflake can ingest .gz files, so this final compression step also helps when moving the files across the network.
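Since the recommendation is about compressed size, the right -C value depends on how well your data compresses. A rough sketch for estimating it: gzip a sample of the file, measure the ratio, and scale the target up. The 150 MB target, the sample size, and the generated stand-in file are assumptions for illustration, and the first lines of a real file may not be representative of the rest.

```shell
# Hypothetical input: generate a small stand-in so the sketch runs as-is.
tmp=$(mktemp -d)
seq 1 20000 | sed 's/.*/{"id": &, "status": "ok"}/' > "$tmp/bigfile.json"

# Measure a sample (the first 10000 lines) before and after gzip.
sample_raw=$(head -n 10000 "$tmp/bigfile.json" | wc -c | tr -d ' ')
sample_gz=$(head -n 10000 "$tmp/bigfile.json" | gzip -c | wc -c | tr -d ' ')

# Target compressed chunk size in bytes (150 MB, inside the 100-250 MB range).
target_gz=150000000

# Uncompressed bytes per chunk = target * raw / compressed (integer math;
# assumes the shell does 64-bit arithmetic, as bash and dash do on 64-bit Linux).
chunk_bytes=$((target_gz * sample_raw / sample_gz))

echo "sample: $sample_raw bytes raw, $sample_gz bytes gzipped"
echo "suggested: split -C $chunk_bytes ..."
```

The suggested value can then replace 1000000000 in the split command.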
