SQL-style GROUP BY aggregate functions in jq (COUNT, SUM, etc.)
Similar questions asked here before:
Count items for a single key: jq count the number of items in json by a specific key
Calculate the sum of object values: How do I sum the values in an array of maps in jq?
Question
How can the COUNT aggregate function be emulated in jq so that it behaves like its SQL original? Let's extend the question to include other common SQL functions:
- COUNT
- SUM / MAX/ MIN / AVG
- ARRAY_AGG
The last one is not a standard SQL function - it's from PostgreSQL but is quite useful.
The input is a stream of valid JSON objects. For demonstration, let's use a simple story of owners and their pets.
Model and data
Base relation: Owner
id  name   age
1   Adams  25
2   Baker  55
3   Clark  40
4   Davis  31
Base relation: Pet
id  name   litter  owner_id
10  Bella  4       1
20  Lucy   2       1
30  Daisy  3       2
40  Molly  4       3
50  Lola   2       4
60  Sadie  4       4
70  Luna   3       4
Source
From the above we get a derived relation Owner_Pet (the result of an SQL JOIN of the two relations), presented in JSON format for our jq queries (the source data):
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 10, "pet": "Bella", "litter": 4 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 20, "pet": "Lucy", "litter": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pet_id": 30, "pet": "Daisy", "litter": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pet_id": 40, "pet": "Molly", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 50, "pet": "Lola", "litter": 2 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 60, "pet": "Sadie", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 70, "pet": "Luna", "litter": 3 }
Requests
Here are sample requests and their expected output:
- COUNT the number of pets per owner:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets_count": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets_count": 1 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets_count": 1 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets_count": 3 }
- SUM up the number of whelps per owner and get their MAX (MIN/AVG):
{ "owner_id": 1, "owner": "Adams", "age": 25, "litter_total": 6, "litter_max": 4 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "litter_total": 3, "litter_max": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "litter_total": 4, "litter_max": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "litter_total": 9, "litter_max": 4 }
- ARRAY_AGG pets per owner:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets": [ "Bella", "Lucy" ] }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets": [ "Daisy" ] }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets": [ "Molly" ] }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets": [ "Lola", "Sadie", "Luna" ] }
This is a nice exercise, but SO is not a programming service, so I will focus here on some key concepts for generic solutions in jq that are efficient, even for very large collections.
GROUPS_BY
The key to efficiency here is avoiding the built-in group_by, as it requires sorting. Since jq is fundamentally stream-oriented, the following definition of GROUPS_BY is likewise stream-oriented. It takes advantage of the efficiency of key-based lookups, while avoiding calling tojson on strings:
# Emit a stream of the groups defined by f
def GROUPS_BY(stream; f):
  # Flatten the two-level {type: {key: [items]}} index into a stream of groups
  def unwind:
    to_entries[] | .value | to_entries[] | .value;
  reduce stream as $x ({};
      ($x|f) as $s
    | ($s|type) as $t
    | (if $t == "string" then $s else ($s|tojson) end) as $y
    | .[$t][$y] += [$x])
  | unwind;
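As a minimal sketch of how GROUPS_BY behaves, the following runs it on a trimmed-down version of the source data. It assumes jq 1.5 or later is on the PATH; the three input rows and the `groups` shell variable are illustrative only and are not part of the original programs.

```shell
# Assumption: jq 1.5+ is installed; the rows below are a hypothetical subset of the data.
groups=$(jq -nc '
  def GROUPS_BY(stream; f):
    def unwind: to_entries[] | .value | to_entries[] | .value;
    reduce stream as $x ({};
        ($x|f) as $s
      | ($s|type) as $t
      | (if $t == "string" then $s else ($s|tojson) end) as $y
      | .[$t][$y] += [$x])
    | unwind;
  # Each group is an array of rows; keep only the pet names for brevity
  GROUPS_BY(inputs; .owner_id) | map(.pet)
' <<'EOF'
{"owner_id": 1, "pet": "Bella"}
{"owner_id": 2, "pet": "Daisy"}
{"owner_id": 1, "pet": "Lucy"}
EOF
)
echo "$groups"
```

Note that the rows for owner 1 are not adjacent in the input, yet they end up in the same group without any sorting step, and each group keeps its rows in input order.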
distinct and count_distinct
# Emit an array of the distinct entities in `stream`, without sorting
def distinct(stream):
  reduce stream as $x ({};
      ($x|type) as $t
    | (if $t == "string" then $x else ($x|tojson) end) as $y
    # `has` raises an error on null, so default to {} until the type has been seen
    | if ((.[$t] // {}) | has($y)) then . else .[$t][$y] += [$x] end)
  | [.[][]] | add;

# Emit the number of distinct items in the given stream
def count_distinct(stream):
  def sum(s): reduce s as $x (0; . + $x);
  reduce stream as $x ({};
      ($x|type) as $t
    | (if $t == "string" then $x else ($x|tojson) end) as $y
    | .[$t][$y] = 1)
  | sum(.[][]);
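To illustrate why the index is keyed by type first, here is a small sketch that counts distinct items in a stream mixing numbers and strings. It assumes jq 1.5 or later is on the PATH; the sample stream and the `n` shell variable are made up for this illustration.

```shell
# Assumption: jq 1.5+ is installed; the stream (1, "1", 1, "a") is a hypothetical example.
n=$(jq -nc '
  def count_distinct(stream):
    def sum(s): reduce s as $x (0; . + $x);
    reduce stream as $x ({};
        ($x|type) as $t
      | (if $t == "string" then $x else ($x|tojson) end) as $y
      | .[$t][$y] = 1)
    | sum(.[][]);
  # The number 1 and the string "1" land under different type keys
  count_distinct(1, "1", 1, "a")
')
echo "$n"
```

The result is 3: the repeated number 1 is counted once, while the number 1 and the string "1" stay separate because they are stored under "number" and "string" respectively. In SQL terms this corresponds to COUNT(DISTINCT ...) rather than a plain COUNT.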
Convenience function
def owner: {owner_id,owner,age};
Example: "COUNT the number of pets per owner"
GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets_count: count_distinct(.[]|.pet_id)}
Invocation: jq -nc -f program1.jq input.json
Output:
{"owner_id":1,"owner":"Adams","age":25,"pets_count":2}
{"owner_id":2,"owner":"Baker","age":55,"pets_count":1}
{"owner_id":3,"owner":"Clark","age":40,"pets_count":1}
{"owner_id":4,"owner":"Davis","age":31,"pets_count":3}
Example: "SUM up the number of whelps per owner and get their MAX"
GROUPS_BY(inputs; .owner_id)
| (.[0] | owner)
+ {litter_total: (map(.litter) | add)}
+ {litter_max: (map(.litter) | max)}
Invocation: jq -nc -f program2.jq input.json
Output: as given in the question.
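The question also mentions MIN and AVG, which follow the same pattern. The sketch below applies them to a single group, written out by hand as an array of rows (the litter sizes of owner 4). The field names `litter_min` and `litter_avg` are assumptions chosen to match the naming used above, and jq 1.5 or later is assumed to be on the PATH.

```shell
# Assumption: jq 1.5+ is installed; litter_min and litter_avg are hypothetical field names.
stats=$(jq -c '
  # The input plays the role of one group as emitted by GROUPS_BY
  {litter_min: (map(.litter) | min),
   litter_avg: (map(.litter) | add / length)}
' <<'EOF'
[{"litter": 2}, {"litter": 4}, {"litter": 3}]
EOF
)
echo "$stats"
```

In a full program these two keys could be added to the per-owner object in the same way as litter_total and litter_max. Dividing by length is safe here because a group always contains at least one row.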
Example: "ARRAY_AGG pets per owner"
GROUPS_BY(inputs; .owner_id)
| (.[0] | owner) + {pets: distinct(.[]|.pet)}
Invocation: jq -nc -f program3.jq input.json
Output:
{"owner_id":1,"owner":"Adams","age":25,"pets":["Bella","Lucy"]}
{"owner_id":2,"owner":"Baker","age":55,"pets":["Daisy"]}
{"owner_id":3,"owner":"Clark","age":40,"pets":["Molly"]}
{"owner_id":4,"owner":"Davis","age":31,"pets":["Lola","Sadie","Luna"]}