How can I efficiently create a user graph based on transaction data using Python?

Problem description

I'm attempting to create a graph of users in Python using the networkx package. My raw data is individual payment transactions, where the payment data includes a user, a payment instrument, an IP address, etc. My nodes are users, and I am creating edges if any two users have shared an IP address.

From that transaction data, I've created a Pandas dataframe of unique [user, IP] pairs. To create edges, I need to find [user_a, user_b] pairs where both users share an IP. Let's call this DataFrame 'df' with columns 'user' and 'ip'.

I keep running into memory problems, and have tried a few different solutions, outlined below. For reference, the raw transaction list has ~500,000 rows and includes ~130,000 users and ~30,000 IPs, with likely ~30,000,000 links.

  1. Join df to itself, sort pairs and remove duplicates (so that [X, Y] and [Y, X] don't both show up as unique pairs).

import numpy as np
import pandas as pd

# Self-join on 'ip' so every pair of users sharing an IP lands on one row
df_pairs = df.merge(df, on='ip', suffixes=('l', 'r'))
df_sorted_pairs = [np.sort([df_pairs['userl'][i], df_pairs['userr'][i]]) for i in range(len(df_pairs))]
edges = np.asarray(pd.DataFrame(df_sorted_pairs).drop_duplicates())

This works pretty well, but gives me a Memory Error fairly quickly, as joining a table to itself grows very quickly.

  2. Create a matrix, where users are the rows, IPs are the columns, and matrix elements are 1 if that user transacted on the IP and 0 otherwise. Then X.dot(X.transpose()) is a square matrix whose element (i, j) represents how many IPs were shared by user i and user j.

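user_list = df['user'].unique()
ip_list = df['ip'].unique()
# Dense user x IP indicator matrix, initialised to 0
df_x = pd.DataFrame(0, index=user_list, columns=ip_list)
for row in range(len(df)):
    # .loc avoids chained indexing, which may write to a temporary copy
    df_x.loc[df['user'][row], df['ip'][row]] = 1
# Entry (i, j) counts the IPs shared by user i and user j
df_links = df_x.dot(df_x.transpose())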

This works extremely well unless len(ip_list) > 5000. Just creating the empty dataframe of say, 500,000 rows x 200,000 columns gives a Memory Error.

  3. Brute force. Iterate across the users one by one. For each user, find the distinct IPs. For each IP, find the distinct users. Those resulting users are therefore linked to the user in the current iteration. Add that [User1, User2] list to the master list of links.

user_list = df['user'].unique()
ip_list = df['ip'].unique()
links=[]
for user in user_list:
    related_ip_list = df[df['user'] == user]['ip'].unique()
    for ip in related_ip_list:
        related_user_list = df[df['ip'] == ip]['user'].unique()
        for related_user in related_user_list:
            if related_user != user:
                links.append([user, related_user])

This works, but is extremely slow. It ran for 3 hours and finally gave me a Memory Error. Because links was being saved along the way, I could check how big it got: about 23,000,000 links.

Any advice would be appreciated. Have I simply gone too far into "Big Data" where traditional methods like the above aren't going to cut it? I didn't think having 500,000 transactions qualified as "Big Data" but I guess storing a 130,000 x 30,000 matrix or creating a list with 30,000,000 elements is pretty large?

Recommended answer

I think your problem is that a matrix representation is not going to cut it:

Note that, memory-wise, you are doing some very inefficient things. For example, you create a matrix full of zeros that all need to be allocated in RAM. It would be far more efficient to have no object in RAM at all for a connection that does not exist, rather than a zero float. You are "abusing" linear algebra to solve your problem, and that is what makes you use so much RAM. (The number of elements in your matrix is 130k * 30k = 3.9 billion, but you "only" have the 30m links that you actually care about.)
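
To put rough numbers on that comparison, here is a back-of-the-envelope sketch. The counts come from the question; the byte sizes are my own assumptions (one 8-byte float per matrix cell, and two 8-byte integer ids per link in some compact encoding), so treat the output as an order-of-magnitude estimate only.

```python
# Back-of-the-envelope memory estimate; the byte sizes are assumptions.
n_users = 130_000
n_ips = 30_000
n_links = 30_000_000

BYTES_PER_CELL = 8    # assumed: one float64 per dense matrix cell
BYTES_PER_LINK = 16   # assumed: two 8-byte integer ids per link

dense_cells = n_users * n_ips
dense_bytes = dense_cells * BYTES_PER_CELL
link_bytes = n_links * BYTES_PER_LINK

print(f"dense matrix: {dense_cells:,} cells, {dense_bytes / 1e9:.1f} GB")
print(f"link list:    {n_links:,} links, {link_bytes / 1e9:.2f} GB")
```

A plain Python list of lists costs considerably more per link than the 16 bytes assumed here, which is consistent with the brute-force attempt running out of memory, but the gap to the dense matrix remains large either way.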

I truly feel for you, because pandas was the first library I learned and I was trying to solve almost every problem with pandas. I noticed over time though that the matrix approach is not optimal for a lot of problems.

There are "sparse matrices" as well (in SciPy's scipy.sparse rather than in NumPy itself), but let's not go there.

Let me suggest another approach:

Use a simple defaultdict:

from collections import defaultdict

# a dict that creates an empty set the first time a missing key is accessed
shared_ips = defaultdict(set)

# for each ip, collect the set of users seen on it
for _, row in unique_user_ip_pairs.iterrows():
    shared_ips[row['ip']].add(row['user'])

# keep only the ips that have more than 1 user
shared_ips = {k: v for k, v in shared_ips.items() if len(v) > 1}

I'm not sure if this is going to solve your use case 100%, but note the efficiency:

This will at most double the RAM usage of your initial unique user-IP pairs object, but you get the information about which IP was shared among which users.
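
If you then need the actual [user_a, user_b] edges, one possible way to derive them from the per-IP user sets is sketched below. The sample pairs and names such as `user_ip_pairs` are hypothetical stand-ins for your data, and this is only a sketch of the idea, not a tuned implementation.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample data: (user, ip) pairs standing in for the rows of df
user_ip_pairs = [
    ("alice", "10.0.0.1"),
    ("bob", "10.0.0.1"),
    ("carol", "10.0.0.1"),
    ("bob", "10.0.0.2"),
    ("dave", "10.0.0.2"),
    ("erin", "10.0.0.3"),  # nobody else uses this ip, so it yields no edge
]

# Group users by ip, as in the defaultdict approach above
shared_ips = defaultdict(set)
for user, ip in user_ip_pairs:
    shared_ips[ip].add(user)

# A set of sorted tuples stores each undirected edge once,
# so (X, Y) and (Y, X) cannot both appear.
edges = set()
for users in shared_ips.values():
    edges.update(combinations(sorted(users), 2))

print(sorted(edges))
```

With real data, `zip(df['user'], df['ip'])` could feed the same loop, and the resulting edges could be handed to networkx via `Graph.add_edges_from`. Keep in mind that an IP shared by n users still produces n * (n - 1) / 2 edges, so a few heavily shared IPs can dominate the total.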

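If most of the cells in a matrix carry the same piece of information, don't use a matrix approach when you are running into memory problems.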

I've seen so many pandas solutions for problems that could have been solved with simple use of Python's builtin types like dict, set, frozenset and Counter. People coming to Python from statistical toolboxes like MATLAB and R, or from Excel, are especially prone to this (they sure like them tables). I suggest not making pandas your personal builtin library that you resort to first...
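
As one illustration of that point, a `Counter` can give you the same information as the off-diagonal entries of X.dot(X.transpose()) from the question, namely how many IPs each pair of users shares, without ever allocating the zeros. The sample data and variable names below are hypothetical; this is a minimal sketch under those assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical sample data: (user, ip) pairs
user_ip_pairs = [
    ("alice", "10.0.0.1"),
    ("bob", "10.0.0.1"),
    ("alice", "10.0.0.2"),
    ("bob", "10.0.0.2"),
    ("carol", "10.0.0.2"),
]

users_by_ip = defaultdict(set)
for user, ip in user_ip_pairs:
    users_by_ip[ip].add(user)

# Count, for each unordered user pair, how many ips they share.
# Only pairs that share at least one ip ever get an entry.
shared_ip_counts = Counter()
for users in users_by_ip.values():
    shared_ip_counts.update(combinations(sorted(users), 2))

for pair, n_shared in sorted(shared_ip_counts.items()):
    print(pair, n_shared)
```

The counts could serve as edge weights if you want the graph to reflect how strongly two users are linked.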
