S3: how to get a fast line count of a file? wc -l is too slow

Question

Does anyone have a quick way of getting the line count of a file hosted in S3? Preferably using the CLI (s3api), but I am open to Python/boto as well. Note: the solution must run non-interactively, i.e. in an overnight batch.

Right now I am doing this. It works, but it takes around 10 minutes for a 20 GB file:

 aws s3 cp s3://foo/bar - | wc -l

Recommended answer

Here are two methods that might work for you...

Amazon S3 has a new feature called S3 Select that allows you to query files stored on S3.

You can count the number of records (lines) in a file, and this even works on GZIP files. Results may vary depending on your file format.
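
As a rough illustration of the S3 Select approach, here is a minimal Python sketch. It assumes boto3 is installed, that S3 Select is available for your account, and that the object is plain text or CSV with one record per line. The function name count_records and the bucket/key values are hypothetical. The client is passed in, so the function itself does not depend on how credentials are configured.

```python
def count_records(s3_client, bucket, key, compression="NONE"):
    """Count records in a text/CSV object with S3 Select.

    s3_client   -- a boto3 S3 client (or any object exposing the same
                   select_object_content method)
    compression -- "NONE", "GZIP" or "BZIP2"
    """
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression="SELECT COUNT(*) FROM s3object",
        InputSerialization={
            # FileHeaderInfo "NONE" treats the first line as a record,
            # which matches what `wc -l` would count.
            "CSV": {"FileHeaderInfo": "NONE"},
            "CompressionType": compression,
        },
        OutputSerialization={"CSV": {}},
    )

    # The response payload is an event stream; the count arrives in one
    # or more "Records" events as raw bytes.
    chunks = []
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])

    return int(b"".join(chunks).decode("utf-8").strip())


# Hypothetical usage (requires AWS credentials):
#   import boto3
#   print(count_records(boto3.client("s3"), "foo", "bar"))
```

Because the counting happens on the S3 side, only the result crosses the network instead of the whole 20 GB object. Note that S3 Select counts records as parsed by its CSV reader, so the figure may differ from `wc -l` for files with unusual quoting or a missing final newline.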

Amazon Athena is a similar option that might also be suitable. It can query files stored in Amazon S3.
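
For the Athena route, a comparable sketch might look like the following. It assumes an Athena table has already been defined over the S3 location and that the client passed in is a boto3 Athena client. The function name and the database, table, and output location values are all hypothetical placeholders.

```python
import time


def athena_row_count(athena_client, database, table, output_location,
                     poll_seconds=2.0):
    """Run SELECT COUNT(*) on an existing Athena table and return the count.

    output_location is the s3:// path where Athena writes query results.
    """
    started = athena_client.start_query_execution(
        QueryString=f'SELECT COUNT(*) FROM "{table}"',
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena queries are asynchronous, so poll until a terminal state.
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query ended in state {state}")
        time.sleep(poll_seconds)

    results = athena_client.get_query_results(QueryExecutionId=query_id)
    rows = results["ResultSet"]["Rows"]
    # Row 0 holds the column header; row 1 holds the count itself.
    return int(rows[1]["Data"][0]["VarCharValue"])


# Hypothetical usage (requires AWS credentials and an existing table):
#   import boto3
#   n = athena_row_count(boto3.client("athena"), "my_db", "my_table",
#                        "s3://my-athena-results/")
```

The polling loop has no prompts, so it fits the non-interactive overnight batch requirement. A production job would probably also want an overall timeout.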
