How to get a list of all columns from a Parquet file using S3 Select?


Question

I have a Parquet file stored in an S3 bucket. I want to get the list of all columns of the Parquet file. I am using S3 Select, but it just gives me a list of all rows without any column headers.

Is there any way to get all column names from this Parquet file without downloading it completely? Since a Parquet file can be very large, I do not want to download the entire file, which is why I am using S3 Select to pick the first few rows using

select * from S3Object LIMIT 10

I also tried

SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'S3Object'

but it does not work, as AWS S3 doesn't support this yet.

Is there any other way to achieve the same?

Recommended answer

I had the same problem, and unfortunately my Google-Kung-Fu was not strong enough this time.

I found the following workaround which I don't really like but it works for me:

import json
import boto3

s3 = boto3.client('s3')

r = s3.select_object_content(Bucket='...your bucket...',
                             Key='...your key...',
                             ExpressionType='SQL',
                             Expression="select s.* from S3Object s limit 1",
                             InputSerialization={'Parquet': {}},
                             OutputSerialization={'JSON': {}})
# Join all Records events first: a single row can be split across several events.
payload = b''.join(rec['Records']['Payload'] for rec in r['Payload'] if 'Records' in rec)
row = json.loads(payload.decode('utf-8'))

print("Columns: ", list(row.keys()))

That is, the code requests the first row of the data, extracts the payload, and loads the returned JSON object. The resulting JSON object has the structure {"Column name": "value", ....}, so one only has to extract the keys of the JSON object (last line).
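The steps described above can be separated from the network call, which makes them easier to test. The following is a minimal sketch, not part of the original answer: the helper name `columns_from_events` is hypothetical, and it assumes the events have the same shape as those iterated over in `r['Payload']` with JSON output and the default newline record delimiter.

```python
import json


def columns_from_events(events):
    """Return the column names of the first record in an S3 Select event stream.

    `events` is any iterable of event dicts shaped like those yielded by
    r['Payload'] (assumption: JSON output, newline record delimiter).
    """
    # Concatenate every Records payload before decoding: one JSON row may be
    # split across several events, so decoding chunks separately can fail.
    payload = b"".join(
        event["Records"]["Payload"] for event in events if "Records" in event
    )
    lines = [line for line in payload.decode("utf-8").split("\n") if line.strip()]
    if not lines:
        return []  # no rows came back, so there are no keys to read
    # Each non-empty line is one record; the first one is enough for the keys.
    return list(json.loads(lines[0]).keys())


# Hand-written events standing in for a real response (illustrative data only).
fake_events = [
    {"Records": {"Payload": b'{"id": 1, "na'}},
    {"Records": {"Payload": b'me": "a"}\n'}},
    {"End": {}},
]
print(columns_from_events(fake_events))
```

With a real response, the call would be `columns_from_events(r['Payload'])`.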

One additional problem is that this does not return the types of the columns. This is something that I could not solve yet.
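As a partial stopgap, the values in the returned row at least reveal their JSON-level types. The sketch below is an assumption-laden illustration, not a solution to the problem: the helper name `json_type_hints` is hypothetical, the result reflects how the value was serialized to JSON rather than the Parquet type, and a null value reveals nothing about its column.

```python
def json_type_hints(row):
    """Map each column name to the Python type name of its decoded JSON value.

    This is only a rough hint: it does not recover the Parquet type, and a
    null in the sampled row shows up as 'NoneType'.
    """
    return {name: type(value).__name__ for name, value in row.items()}


# Illustrative row, shaped like the `row` dict decoded from the JSON payload.
sample_row = {"id": 1, "name": "a", "price": 1.5, "active": True, "note": None}
print(json_type_hints(sample_row))
```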

UPDATE: I observed that in some situations some column names were reported incorrectly. Instead of the real names, something like _18, _19 was returned. I have no idea how to deal with it.
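Names such as _18 look like positional placeholders (S3 Select lets a query refer to columns by position as _1, _2, and so on). The sketch below does not fix the problem; it only flags such names so that the result is not trusted blindly. The helper name `positional_names` is hypothetical, and a real column that happens to be named like `_18` would be flagged as well.

```python
import re

# An underscore followed only by digits, e.g. "_18".
_POSITIONAL = re.compile(r"_\d+")


def positional_names(columns):
    """Return the column names that look like positional placeholders."""
    return [name for name in columns if _POSITIONAL.fullmatch(name)]


# Illustrative column list, as it might come from list(row.keys()).
print(positional_names(["id", "name", "_18", "_19"]))
```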
