Creating a matrix from Pandas dataframe to display connectedness

Problem description

I have my data in this format in a pandas dataframe:

Customer_ID  Location_ID
Alpha             A
Alpha             B
Alpha             C
Beta              A
Beta              B
Beta              D

I want to study the mobility patterns of the customers. My goal is to determine the clusters of locations that are most frequented by customers. I think the following matrix can provide such information:

   A  B  C  D
A  0  2  1  1
B  2  0  1  1
C  1  1  0  0
D  1  1  0  0

How do I do so in Python?

My dataset is quite large (hundreds of thousands of customers and about a hundred locations).

Recommended answer

Here is one approach that takes the multiplicity of visits into account (e.g. if Customer X visits both LocA and LocB twice, they contribute 2 to the corresponding position in the final matrix).

Idea:

  1. For each location, count visits by customer.
  2. For each location pair, take each customer who visited both locations, keep the smaller of that customer's two visit counts, and sum these minima.
  3. Use unstack to pivot the pair counts into a matrix, then clean up.

Counter plays nicely here because a counter returns 0 for a missing key and supports natural multiset operations such as addition (+), element-wise minimum (&) and element-wise maximum (|).
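As a small illustration of those counter operations, here is a minimal sketch. The customer names and counts are made up for the example and are not part of the answer's data. The `&` operator keeps, for each key, the smaller of the two counts, which is the per-customer quantity that step 2 sums.

```python
from collections import Counter

# Hypothetical visit counts per customer at two locations.
visits_a = Counter({'Alpha': 2, 'Beta': 1})
visits_b = Counter({'Alpha': 1, 'Gamma': 3})

# A missing key counts as 0 instead of raising KeyError.
missing = visits_a['Gamma']

# Intersection keeps the element-wise minimum and drops customers
# that are absent from either counter.
shared = visits_a & visits_b

# Total overlap between the two locations.
overlap = sum(shared.values())
print(shared, overlap)  # Counter({'Alpha': 1}) 1
```

The answer's code below computes the same sum with an explicit `min(ctr1[k], ctr2[k])` rather than `&`; the two forms give the same total.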

import pandas as pd
from collections import Counter

df = pd.DataFrame({
    'Customer_ID': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta'],
    'Location_ID': ['A', 'B', 'C', 'A', 'B'],
    })


# Step 1: for each location, count visits per customer.
ctrs = {location: Counter(gp.Customer_ID) for location, gp in df.groupby('Location_ID')}


# In [7]: ctrs
# Out[7]:
# {'A': Counter({'Alpha': 1, 'Beta': 1}),
#  'B': Counter({'Alpha': 1, 'Beta': 1}),
#  'C': Counter({'Alpha': 1})}


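# Step 2: for each unordered pair of distinct locations, sum the smaller
# visit count over customers (a Counter returns 0 for a missing customer).
ctrs = list(ctrs.items())
overlaps = [(loc1, loc2, sum(min(ctr1[k], ctr2[k]) for k in ctr1))
    for i, (loc1, ctr1) in enumerate(ctrs, start=1)
    for (loc2, ctr2) in ctrs[i:] if loc1 != loc2]
# Mirror each pair so that the final matrix is symmetric.
overlaps += [(l2, l1, c) for l1, l2, c in overlaps]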


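# Step 3: pivot the (Loc1, Loc2, Count) triples into a square matrix;
# pairs that were never listed (the diagonal) become 0.
df2 = pd.DataFrame(overlaps, columns=['Loc1', 'Loc2', 'Count'])
df2 = df2.set_index(['Loc1', 'Loc2'])
df2 = df2.unstack().fillna(0).astype(int)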


#      Count
# Loc2     A  B  C
# Loc1
# A        0  2  1
# B        2  0  1
# C        1  1  0
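For a dataset of the size mentioned in the question, the pure-Python loop over location pairs may be slow. The following is a hedged sketch of the same multiplicity-aware count using NumPy on a customer-by-location table. It is a swapped-in technique, not the answer's code; the function name `overlap_matrix` and the sample rows are made up for illustration, and it assumes the dense table fits in memory.

```python
import numpy as np
import pandas as pd

def overlap_matrix(df):
    """Hypothetical helper: sum of per-customer minimum visit counts per location pair."""
    # visits[c, l] = number of times customer c visited location l.
    visits = pd.crosstab(df['Customer_ID'], df['Location_ID'])
    counts = visits.to_numpy()
    n = counts.shape[1]
    result = np.zeros((n, n), dtype=int)
    for j in range(n):
        # Compare every location column with column j, customer by customer,
        # keep the smaller count, then sum over customers.
        result[j] = np.minimum(counts, counts[:, [j]]).sum(axis=0)
    # A location is not paired with itself.
    np.fill_diagonal(result, 0)
    return pd.DataFrame(result, index=visits.columns, columns=visits.columns)

# Made-up sample: Alpha visits A and B twice each and C once; Beta visits A once.
sample = pd.DataFrame({
    'Customer_ID': ['Alpha', 'Alpha', 'Alpha', 'Alpha', 'Alpha', 'Beta'],
    'Location_ID': ['A', 'A', 'B', 'B', 'C', 'A'],
})
matrix = overlap_matrix(sample)
print(matrix)
```

With about a hundred locations the loop runs about a hundred times, and each iteration is a single vectorized pass over the table.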

If you would rather disregard multiplicities, replace Counter(gp.Customer_ID) with Counter(set(gp.Customer_ID)).
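When multiplicities are disregarded, the count for a location pair is simply the number of customers seen at both locations. One way to sketch that, as an alternative to the Counter-based code and not part of the original answer, is a 0/1 customer-by-location table multiplied by its own transpose. The example below uses the six rows from the question; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# The six rows shown in the question.
df = pd.DataFrame({
    'Customer_ID': ['Alpha', 'Alpha', 'Alpha', 'Beta', 'Beta', 'Beta'],
    'Location_ID': ['A', 'B', 'C', 'A', 'B', 'D'],
})

# visited[c, l] is 1 if customer c was seen at location l at least once, else 0.
visited = pd.crosstab(df['Customer_ID'], df['Location_ID']).clip(upper=1)

# (visited.T @ visited)[l1, l2] counts the customers seen at both l1 and l2.
shared = visited.T @ visited

# Zero the diagonal on a copy of the underlying array.
values = shared.to_numpy().copy()
np.fill_diagonal(values, 0)
matrix = pd.DataFrame(values, index=shared.index, columns=shared.columns)
print(matrix)
```

On the question's data this reproduces the matrix the asker wrote down (A and B share two customers, C and D share none).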
