Correct use of a fold or reduce function to long-to-wide data in Python or JavaScript?
Problem description
I'm trying to learn to think a little more like a functional programmer: I'd like to transform a data set with what I think is either a fold or a reduce operation. In R, I would think of this as a reshape operation, but I'm not sure how to translate that thinking.
My data is a json string that looks like this:
s = '''[
{"query":"Q1", "detail" : "cool", "rank":1,"url":"awesome1"},
{"query":"Q1", "detail" : "cool", "rank":2,"url":"awesome2"},
{"query":"Q1", "detail" : "cool", "rank":3,"url":"awesome3"},
{"query":"Q#2", "detail" : "same", "rank":1,"url":"newurl1"},
{"query":"Q#2", "detail" : "same", "rank":2,"url":"newurl2"},
{"query":"Q#2", "detail" : "same", "rank":3,"url":"newurl3"}
]'''
I'd like to turn it into something like this, where query is the master key defining the 'row', nesting the unique "rows" corresponding to the "rank" values and "url" fields:
'[
{ "query" : "Q1",
"results" : [
{"rank" : 1, "url": "awesome1"},
{"rank" : 2, "url": "awesome2"},
{"rank" : 3, "url": "awesome3"}
]},
{ "query" : "Q#2",
"results" : [
{"rank" : 1, "url": "newurl1"},
{"rank" : 2, "url": "newurl2"},
{"rank" : 3, "url": "newurl3"}
]}
]'
I know I can iterate through, but I suspect there is a functional operation that does this transformation, right?
Would also be curious to know how to get to something more like this, Version2:
'[
{ "query" : "Q1",
"Common to all results" : [
{"detail" : "cool"}
],
"results" : [
{"rank" : 1, "url": "awesome1"},
{"rank" : 2, "url": "awesome2"},
{"rank" : 3, "url": "awesome3"}
]},
{ "query" : "Q#2",
"Common to all results" : [
{"detail" : "same"}
],
"results" : [
{"rank" : 1, "url": "newurl1"},
{"rank" : 2, "url": "newurl2"},
{"rank" : 3, "url": "newurl3"}
]}
]'
In this second version, I'd like to take all data repeating under the same query, and shove it into an "other stuff" container, where all the items unique under "rank" would be in the "results" container.
I'm working on json objects in mongodb, and can use either python or javascript to try out this transform.
Any advice is appreciated, such as the proper name for this transformation or the fastest way to do this on a large data set.
EDIT
Incorporating @abarnert's excellent solution below, trying to get my Version2 above for anyone else working on the same kind of problem, requiring bifurcating some keys under one level, other keys under another...
Here's what I tried:
import itertools
import operator
from functools import partial

groups = itertools.groupby(initial, operator.itemgetter('query'))

def filterkeys(d, mylist):
    return {k: v for k, v in d.items() if k in mylist}

results = ((key, map(partial(filterkeys, mylist=['rank', 'url']), group)) for key, group in groups)
other_stuff = ((key, map(partial(filterkeys, mylist=['detail']), group)) for key, group in groups)
???
Oh no!
I know this isn't the fold-style solution you were asking for, but I would do this with itertools, which is just as functional (unless you think Haskell is less functional than Lisp…), and also probably the most Pythonic way to solve this.
The idea is to think of your sequence as a lazy list, and apply a chain of lazy transformations to it until you get the list you want.
The key step here is groupby:
>>> import functools, itertools, json, operator
>>> initial = json.loads(s)
>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> print([(key, list(group)) for key, group in groups])
[('Q1',
[{'detail': 'cool', 'query': 'Q1', 'rank': 1, 'url': 'awesome1'},
{'detail': 'cool', 'query': 'Q1', 'rank': 2, 'url': 'awesome2'},
{'detail': 'cool', 'query': 'Q1', 'rank': 3, 'url': 'awesome3'}]),
('Q#2',
[{'detail': 'same', 'query': 'Q#2', 'rank': 1, 'url': 'newurl1'},
{'detail': 'same', 'query': 'Q#2', 'rank': 2, 'url': 'newurl2'},
{'detail': 'same', 'query': 'Q#2', 'rank': 3, 'url': 'newurl3'}])]
You can see how close we are already, in just one step.
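One caveat worth spelling out: itertools.groupby only merges adjacent items that share a key, so the one-step result above relies on the rows for each query already sitting next to each other. The following is a minimal sketch of what happens otherwise, using hypothetical interleaved rows (the names rows and by_query are made up for this illustration):

```python
import itertools
import operator

# Hypothetical rows in which the two queries are interleaved.
rows = [
    {"query": "Q1", "rank": 1, "url": "awesome1"},
    {"query": "Q#2", "rank": 1, "url": "newurl1"},
    {"query": "Q1", "rank": 2, "url": "awesome2"},
    {"query": "Q#2", "rank": 2, "url": "newurl2"},
]

by_query = operator.itemgetter("query")

# Without sorting, every change of key starts a new group.
unsorted_keys = [key for key, _ in itertools.groupby(rows, by_query)]

# sorted() is stable, so rows keep their relative order inside each query.
sorted_rows = sorted(rows, key=by_query)
sorted_keys = [key for key, _ in itertools.groupby(sorted_rows, by_query)]

print(unsorted_keys)  # four groups, one per row
print(sorted_keys)    # two groups, one per query
```

If the data comes straight out of a database query, sorting on the grouping field there would serve the same purpose.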
To restructure each (key, group) pair into the dict format you want:
>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> print([{"query": key, "results": list(group)} for key, group in groups])
[{'query': 'Q1',
'results': [{'detail': 'cool',
'query': 'Q1',
'rank': 1,
'url': 'awesome1'},
{'detail': 'cool',
'query': 'Q1',
'rank': 2,
'url': 'awesome2'},
{'detail': 'cool',
'query': 'Q1',
'rank': 3,
'url': 'awesome3'}]},
{'query': 'Q#2',
'results': [{'detail': 'same',
'query': 'Q#2',
'rank': 1,
'url': 'newurl1'},
{'detail': 'same',
'query': 'Q#2',
'rank': 2,
'url': 'newurl2'},
{'detail': 'same',
'query': 'Q#2',
'rank': 3,
'url': 'newurl3'}]}]
But wait, there are still those extra fields you want to get rid of. Easy:
>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> def filterkeys(d):
...     return {k: v for k, v in d.items() if k in ('rank', 'url')}
...
>>> filtered = ((key, map(filterkeys, group)) for key, group in groups)
>>> print([{"query": key, "results": list(group)} for key, group in filtered])
[{'query': 'Q1',
'results': [{'rank': 1, 'url': 'awesome1'},
{'rank': 2, 'url': 'awesome2'},
{'rank': 3, 'url': 'awesome3'}]},
{'query': 'Q#2',
'results': [{'rank': 1, 'url': 'newurl1'},
{'rank': 2, 'url': 'newurl2'},
{'rank': 3, 'url': 'newurl3'}]}]
The only thing left to do is to call json.dumps instead of print.
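Put end to end, the first version might look like the sketch below. The function name long_to_wide and its key and keep parameters are hypothetical, introduced only for this illustration, and the sample string is a shortened copy of the question's data. It assumes the rows for each query are already adjacent, as they are in the question.

```python
import itertools
import json
import operator

# Shortened copy of the question's data (three rows instead of six).
s = """[
 {"query": "Q1", "detail": "cool", "rank": 1, "url": "awesome1"},
 {"query": "Q1", "detail": "cool", "rank": 2, "url": "awesome2"},
 {"query": "Q#2", "detail": "same", "rank": 1, "url": "newurl1"}
]"""

def long_to_wide(json_text, key="query", keep=("rank", "url")):
    """Group rows by `key` and keep only the `keep` fields in each result.

    Assumes rows with the same key are adjacent in the input.
    """
    rows = json.loads(json_text)
    groups = itertools.groupby(rows, operator.itemgetter(key))
    wide = [
        {key: k, "results": [{f: row[f] for f in keep} for row in group]}
        for k, group in groups
    ]
    return json.dumps(wide)

wide_json = long_to_wide(s)
print(wide_json)
```

Each group is consumed inside the list comprehension before groupby advances to the next key, which is what keeps the lazy pipeline safe here.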
For your followup, you want to take all values that are identical across every row with the same query and group them into otherstuff, and then list whatever remains in the results.
So, for each group, first we want to get the common keys. We can do this by iterating the keys of any member of the group (anything that's not in the first member can't be in all members), so:
def common_fields(group):
    def in_all_members(key, value):
        # a key that is missing from any later member cannot be common
        return all(key in member and member[key] == value for member in group[1:])
    return {key: value for key, value in group[0].items() if in_all_members(key, value)}
Or, alternatively… if we turn each member into a set of key-value pairs, instead of a dict, we can then just intersect them all. And this means we finally get to use reduce, so let's try that:
def common_fields(group):
    return dict(functools.reduce(set.intersection, (set(d.items()) for d in group)))
I think the conversion back and forth between dict and set may make this less readable, and it also means that your values have to be hashable (not a problem for your sample data, since the values are all strings or integers)… but it's certainly more concise.
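To make the trade-off concrete, here is a small sketch that runs both approaches side by side. The names common_fields_loop and common_fields_reduce are hypothetical labels for the two variants, and the nested rows with a list-valued tags field are invented to show the hashability limit.

```python
import functools

group = [
    {"query": "Q1", "detail": "cool", "rank": 1, "url": "awesome1"},
    {"query": "Q1", "detail": "cool", "rank": 2, "url": "awesome2"},
    {"query": "Q1", "detail": "cool", "rank": 3, "url": "awesome3"},
]

def common_fields_loop(group):
    """Compare every field of the first member against all later members."""
    first, rest = group[0], group[1:]
    return {
        key: value
        for key, value in first.items()
        if all(key in member and member[key] == value for member in rest)
    }

def common_fields_reduce(group):
    """Intersect the (key, value) pairs of all members with a fold."""
    return dict(functools.reduce(set.intersection, (set(d.items()) for d in group)))

loop_result = common_fields_loop(group)
reduce_result = common_fields_reduce(group)
print(loop_result)

# Invented rows with a list value: lists are unhashable, so the pairs
# cannot be put into a set and the reduce variant raises TypeError.
nested = [{"query": "Q1", "tags": ["a"]}, {"query": "Q1", "tags": ["a"]}]
try:
    common_fields_reduce(nested)
    reduce_failed = False
except TypeError:
    reduce_failed = True

nested_loop_result = common_fields_loop(nested)
```

On the sample data the two variants agree; only the loop variant copes with unhashable values.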
This will, of course, always include query as a common field, but we'll deal with that later. (Also, you wanted otherstuff to be a list with one dict, so we'll throw an extra pair of brackets around it.)
Meanwhile, results is the same as above, except that filterkeys filters out all of the common fields, instead of filtering out everything but rank and url. Putting it together:
def process_group(group):
    # groupby yields one-shot iterators, so materialise the group first
    group = list(group)
    common = dict(functools.reduce(set.intersection, (set(d.items()) for d in group)))
    def filterkeys(member):
        return {k: v for k, v in member.items() if k not in common}
    results = list(map(filterkeys, group))
    # query is always common within a group; move it out of otherstuff
    query = common.pop('query')
    return {'query': query,
            'otherstuff': [common],
            'results': results}
So, now we just use that function:
>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> print([process_group(group) for key, group in groups])
[{'otherstuff': [{'detail': 'cool'}],
'query': 'Q1',
'results': [{'rank': 1, 'url': 'awesome1'},
{'rank': 2, 'url': 'awesome2'},
{'rank': 3, 'url': 'awesome3'}]},
{'otherstuff': [{'detail': 'same'}],
'query': 'Q#2',
'results': [{'rank': 1, 'url': 'newurl1'},
{'rank': 2, 'url': 'newurl2'},
{'rank': 3, 'url': 'newurl3'}]}]
This obviously isn't as trivial as the original version, but hopefully it all still makes sense. There are only two new tricks. First, we have to iterate over each group multiple times (once to find the common keys, and then again to extract the remaining keys), which is why process_group starts by converting the group iterator into a list.
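The reason a group has to be turned into a list before it can be traversed twice is that itertools.groupby hands out one-shot iterators. That is also why building two generators over the same groups object, as in the attempt in the question, leaves the second one empty. A minimal sketch with hypothetical rows:

```python
import itertools
import operator

rows = [
    {"query": "Q1", "detail": "cool", "rank": 1},
    {"query": "Q1", "detail": "cool", "rank": 2},
    {"query": "Q#2", "detail": "same", "rank": 1},
]
by_query = operator.itemgetter("query")

# A groupby object can be walked only once.
groups = itertools.groupby(rows, by_query)
first_pass = [(key, len(list(group))) for key, group in groups]
second_pass = [(key, len(list(group))) for key, group in groups]  # already exhausted

# Materialising each group once makes repeated traversal safe.
materialised = [(key, list(group)) for key, group in itertools.groupby(rows, by_query)]
ranks = [(key, [row["rank"] for row in group]) for key, group in materialised]
details = [(key, {row["detail"] for row in group}) for key, group in materialised]

print(first_pass, second_pass)
print(ranks, details)
```

Materialising costs memory proportional to the largest group, which is usually acceptable when groups are small relative to the whole data set.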