Correct use of a fold or reduce function to long-to-wide data in python or javascript?


Problem description

Trying to learn to think like a functional programmer a little more---I'd like to transform a data set with what I think is either a fold or a reduce operation. In R, I would think of this as a reshape operation, but I'm not sure how to translate that thinking.

My data is a json string that looks like this:

s = '''[
{"query":"Q1", "detail" : "cool", "rank":1,"url":"awesome1"},
{"query":"Q1", "detail" : "cool", "rank":2,"url":"awesome2"},
{"query":"Q1", "detail" : "cool", "rank":3,"url":"awesome3"},
{"query":"Q#2", "detail" : "same", "rank":1,"url":"newurl1"},
{"query":"Q#2", "detail" : "same", "rank":2,"url":"newurl2"},
{"query":"Q#2", "detail" : "same", "rank":3,"url":"newurl3"}
]'''

I'd like to turn it into something like this, where query is the master key defining the 'row', nesting the unique "rows" corresponding to the "rank" values and "url" fields:

'[
{ "query" : "Q1",
    "results" : [
        {"rank" : 1, "url": "awesome1"},
        {"rank" : 2, "url": "awesome2"},
        {"rank" : 3, "url": "awesome3"}        
    ]},
{ "query" : "Q#2",
    "results" : [
        {"rank" : 1, "url": "newurl1"},
        {"rank" : 2, "url": "newurl2"},
        {"rank" : 3, "url": "newurl3"}
    ]}
]'

I know I can iterate through, but I suspect there is a functional operation that does this transformation, right?

Would also be curious to know how to get to something more like this, Version2:

'[
{ "query" : "Q1",
    "Common to all results" : [
        {"detail" : "cool"}
    ],
    "results" : [
        {"rank" : 1, "url": "awesome1"},
        {"rank" : 2, "url": "awesome2"},
        {"rank" : 3, "url": "awesome3"}        
    ]},
{ "query" : "Q#2",
    "Common to all results" : [
        {"detail" : "same"}
    ],
    "results" : [
        {"rank" : 1, "url": "newurl1"},
        {"rank" : 2, "url": "newurl2"},
        {"rank" : 3, "url": "newurl3"}        
    ]}
]'

In this second version, I'd like to take all data repeating under the same query, and shove it into an "other stuff" container, where all the items unique under "rank" would be in the "results" container.

I'm working on json objects in mongodb, and can use either python or javascript to try out this transform.

Any advice, such as the proper name for this transformation or what might be the fastest way to do this on a large data set, is appreciated!

EDIT

Incorporating @abarnert's excellent solution below, trying to get my Version2 above for anyone else working on the same kind of problem, requiring bifurcating some keys under one level, other keys under another...

Here's what I tried:

from functools import partial
groups = itertools.groupby(initial, operator.itemgetter('query'))
def filterkeys(d,mylist):
    return {k: v for k, v in d.items() if k in mylist}

results = ((key, map(partial(filterkeys, mylist=['rank','url']),group)) for key, group in groups)
other_stuff = ((key, map(partial(filterkeys, mylist=['detail']),group)) for key, group in groups)

???

Oh no!

Solution

I know this isn't the fold-style solution you were asking for, but I would do this with itertools, which is just as functional (unless you think Haskell is less functional than Lisp…), and also probably the most pythonic way to solve this.

The idea is to think of your sequence as a lazy list, and apply a chain of lazy transformations to it until you get the list you want.

The key step here is groupby:

>>> import itertools, json, operator
>>> initial = json.loads(s)
>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> print([(key, list(group)) for key, group in groups])
[('Q1',
  [{'detail': 'cool', 'query': 'Q1', 'rank': 1, 'url': 'awesome1'},
   {'detail': 'cool', 'query': 'Q1', 'rank': 2, 'url': 'awesome2'},
   {'detail': 'cool', 'query': 'Q1', 'rank': 3, 'url': 'awesome3'}]),
 ('Q#2',
  [{'detail': 'same', 'query': 'Q#2', 'rank': 1, 'url': 'newurl1'},
   {'detail': 'same', 'query': 'Q#2', 'rank': 2, 'url': 'newurl2'},
   {'detail': 'same', 'query': 'Q#2', 'rank': 3, 'url': 'newurl3'}])]

You can see how close we are already, in just one step.
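One caveat worth keeping in mind: itertools.groupby only groups consecutive items that share a key. The sample data already has all rows for a query next to each other, so nothing else is needed there. The sketch below uses a small hypothetical list (not the data from the question) to show what happens when that is not the case, and how sorting by the same key first avoids it.

```python
import itertools
import operator

# Hypothetical input where rows for the same query are NOT adjacent.
rows = [
    {"query": "Q1", "rank": 1},
    {"query": "Q#2", "rank": 1},
    {"query": "Q1", "rank": 2},
]

by_query = operator.itemgetter("query")

# Without sorting, "Q1" shows up as two separate groups.
unsorted_keys = [key for key, _ in itertools.groupby(rows, by_query)]

# Sorting by the same key first makes each query a single group.
sorted_rows = sorted(rows, key=by_query)
sorted_keys = [key for key, _ in itertools.groupby(sorted_rows, by_query)]

print(unsorted_keys)  # ['Q1', 'Q#2', 'Q1']
print(sorted_keys)    # ['Q#2', 'Q1']
```

Sorting costs O(n log n) and changes the order of the groups, so it is only worth doing when the input is not already ordered by query.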

To restructure each (key, group) pair into the dict format you want:

>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> print([{"query": key, "results": list(group)} for key, group in groups])
[{'query': 'Q1',
  'results': [{'detail': 'cool',
               'query': 'Q1',
               'rank': 1,
               'url': 'awesome1'},
              {'detail': 'cool',
               'query': 'Q1',
               'rank': 2,
               'url': 'awesome2'},
              {'detail': 'cool',
               'query': 'Q1',
               'rank': 3,
               'url': 'awesome3'}]},
 {'query': 'Q#2',
  'results': [{'detail': 'same',
               'query': 'Q#2',
               'rank': 1,
               'url': 'newurl1'},
              {'detail': 'same',
               'query': 'Q#2',
               'rank': 2,
               'url': 'newurl2'},
              {'detail': 'same',
               'query': 'Q#2',
               'rank': 3,
               'url': 'newurl3'}]}]

But wait, there are still those extra fields you want to get rid of. Easy:

>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> def filterkeys(d):
...     return {k: v for k, v in d.items() if k in ('rank', 'url')}
>>> filtered = ((key, map(filterkeys, group)) for key, group in groups)
>>> print([{"query": key, "results": list(group)} for key, group in filtered])
[{'query': 'Q1',
  'results': [{'rank': 1, 'url': 'awesome1'},
              {'rank': 2, 'url': 'awesome2'},
              {'rank': 3, 'url': 'awesome3'}]},
 {'query': 'Q#2',
  'results': [{'rank': 1, 'url': 'newurl1'},
              {'rank': 2, 'url': 'newurl2'},
              {'rank': 3, 'url': 'newurl3'}]}]

The only thing left to do is to call json.dumps instead of print.
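Put together as a script rather than an interactive session, the steps above might look like the following sketch. The function name long_to_wide and its key and keep parameters are made up for illustration, the input is a shortened version of the sample data, and it assumes every row contains the fields listed in keep.

```python
import itertools
import json
import operator

# Shortened version of the sample data from the question.
s = '''[
{"query": "Q1", "detail": "cool", "rank": 1, "url": "awesome1"},
{"query": "Q1", "detail": "cool", "rank": 2, "url": "awesome2"},
{"query": "Q#2", "detail": "same", "rank": 1, "url": "newurl1"}
]'''

def long_to_wide(json_text, key="query", keep=("rank", "url")):
    """Group rows by `key` and keep only the `keep` fields in each result.

    Hypothetical helper: assumes rows with the same key are adjacent and
    that every row has all of the `keep` fields.
    """
    rows = json.loads(json_text)
    groups = itertools.groupby(rows, operator.itemgetter(key))
    nested = [
        {key: k, "results": [{f: row[f] for f in keep} for row in group]}
        for k, group in groups
    ]
    return json.dumps(nested)

wide = long_to_wide(s)
print(wide)
```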


For your followup, you want to take all values that are identical across every row with the same query and group them into otherstuff, and then list whatever remains in the results.

So, for each group, first we want to get the common keys. We can do this by iterating the keys of any member of the group (anything that's not in the first member can't be in all members), so:

def common_fields(group):
    def in_all_members(key, value):
        return all(member[key] == value for member in group[1:])
    return {key: value for key, value in group[0].items() if in_all_members(key, value)}

Or, alternatively… if we turn each member into a set of key-value pairs, instead of a dict, we can then just intersect them all. And this means we finally get to use reduce, so let's try that:

def common_fields(group):
    return dict(functools.reduce(set.intersection, (set(d.items()) for d in group)))

I think the conversion back and forth between dict and set may make this less readable, and it also means that your values have to be hashable (not a problem for your sample data, since the values are all strings and integers)… but it's certainly more concise.
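To make the hashability point concrete, here is a small sketch that runs the reduce-based common_fields on two rows from the sample data, and then on a hypothetical pair of rows whose tags field holds a list. The tags field is invented for illustration and is not part of the question's data.

```python
import functools

def common_fields(group):
    # Intersect the sets of (key, value) pairs of every member.
    return dict(functools.reduce(set.intersection, (set(d.items()) for d in group)))

group = [
    {"query": "Q1", "detail": "cool", "rank": 1, "url": "awesome1"},
    {"query": "Q1", "detail": "cool", "rank": 2, "url": "awesome2"},
]
common = common_fields(group)
print(common == {"query": "Q1", "detail": "cool"})  # True

# Hypothetical rows with a list value: a (key, list) pair cannot go in a set.
unhashable = [
    {"query": "Q1", "tags": ["a"]},
    {"query": "Q1", "tags": ["a"]},
]
try:
    common_fields(unhashable)
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```

If the real documents can contain lists or nested dicts, the first version of common_fields, which compares values with ==, does not have this restriction.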

This will, of course, always include query as a common field, but we'll deal with that later. (Also, you wanted otherstuff to be a list with one dict, so we'll throw an extra pair of brackets around it).

Meanwhile, results is the same as above, except that filterkeys filters out all of the common fields, instead of filtering out everything but rank and url. Putting it together:

def process_group(group):
    group = list(group)
    common = dict(functools.reduce(set.intersection, (set(d.items()) for d in group)))
    def filterkeys(member):
        return {k: v for k, v in member.items() if k not in common}
    results = list(map(filterkeys, group))
    query = common.pop('query')
    return {'query': query,
            'otherstuff': [common],
            'results': results}
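The group = list(group) line at the top of process_group matters because each group that groupby yields is a one-shot iterator. A minimal sketch with made-up rows shows the difference between iterating the raw group twice and iterating a list copy twice:

```python
import itertools
import operator

rows = [
    {"query": "Q1", "rank": 1},
    {"query": "Q1", "rank": 2},
]
by_query = operator.itemgetter("query")

# Iterating the raw group twice: the second pass finds nothing left.
for key, group in itertools.groupby(rows, by_query):
    first_pass = [row["rank"] for row in group]
    second_pass = [row["rank"] for row in group]

print(first_pass)   # [1, 2]
print(second_pass)  # []

# Materializing the group as a list allows any number of passes.
for key, group in itertools.groupby(rows, by_query):
    members = list(group)
    list_first = [row["rank"] for row in members]
    list_second = [row["rank"] for row in members]

print(list_second)  # [1, 2]
```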

So, now we just use that function:

>>> groups = itertools.groupby(initial, operator.itemgetter('query'))
>>> print([process_group(group) for key, group in groups])
[{'otherstuff': [{'detail': 'cool'}],
  'query': 'Q1',
  'results': [{'rank': 1, 'url': 'awesome1'},
              {'rank': 2, 'url': 'awesome2'},
              {'rank': 3, 'url': 'awesome3'}]},
 {'otherstuff': [{'detail': 'same'}],
  'query': 'Q#2',
  'results': [{'rank': 1, 'url': 'newurl1'},
              {'rank': 2, 'url': 'newurl2'},
              {'rank': 3, 'url': 'newurl3'}]}]

This obviously isn't as trivial as the original version, but hopefully it all still makes sense. There are only two new tricks. First, we have to iterate over each group multiple times (once to find the common keys, and then again to extract the remaining keys), which is why process_group starts by converting the one-shot group iterator into a list.
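For comparison, the fold the question originally asked about can also be written directly with functools.reduce. This is only a sketch of the first target format, not part of the itertools approach above: fold_rows is a made-up name, the rows are a shortened and deliberately unordered version of the sample data, and the reducer mutates its accumulator for simplicity rather than building a new dict at each step.

```python
import functools

def fold_rows(acc, row):
    """Reducer: add `row` to the entry for its query, creating it if needed."""
    entry = acc.setdefault(row["query"], {"query": row["query"], "results": []})
    entry["results"].append({"rank": row["rank"], "url": row["url"]})
    return acc

rows = [
    {"query": "Q1", "detail": "cool", "rank": 1, "url": "awesome1"},
    {"query": "Q#2", "detail": "same", "rank": 1, "url": "newurl1"},
    {"query": "Q1", "detail": "cool", "rank": 2, "url": "awesome2"},
]

# Fold every row into a dict keyed by query, then keep just the entries.
wide = list(functools.reduce(fold_rows, rows, {}).values())
print(wide)
```

Because the accumulator is keyed by query, this version does not need rows for the same query to be adjacent, at the cost of holding every group in memory until the fold finishes.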
