Memory problems while code is running (Python, NetworkX)
Problem Description
I wrote some code to generate a graph with 379613734 edges.
But the code couldn't finish because it ran out of memory: it was already using about 97% of the server's memory after going through 62 million lines, so I killed it.
Does anyone have an idea how to solve this problem?
My code looks like this:
import os, sys
import time
import networkx as nx

G = nx.Graph()
ptime = time.time()
j = 1

for line in open("./US_Health_Links.txt", 'r'):
# for line in open("./test_network.txt", 'r'):
    # each line holds one edge: "follower followee"
    follower = line.strip().split()[0]
    followee = line.strip().split()[1]
    G.add_edge(follower, followee)
    if j % 1000000 == 0:
        # progress report every million lines
        print j * 1.0 / 1000000, "million lines done", time.time() - ptime
        ptime = time.time()
    j += 1

DG = G.to_directed()
# P = nx.path_graph(DG)
Nn_G = G.number_of_nodes()
N_CC = nx.number_connected_components(G)
LCC = nx.connected_component_subgraphs(G)[0]
n_LCC = LCC.nodes()
Nn_LCC = LCC.number_of_nodes()
inDegree = DG.in_degree()
outDegree = DG.out_degree()
Density = nx.density(G)
# Diameter = nx.diameter(G)
# Centrality = nx.betweenness_centrality(PDG, normalized=True, weighted_edges=False)
# Clustering = nx.average_clustering(G)

print "number of nodes in G\t" + str(Nn_G) + '\n' + "number of CC in G\t" + str(N_CC) + '\n' + "number of nodes in LCC\t" + str(Nn_LCC) + '\n' + "Density of G\t" + str(Density) + '\n'

# sys.exit()
# j += 1
The edge data looks like this:
1000 1001
1000245 1020191
1000 10267352
1000653 10957902
1000 11039092
1000 1118691
10346 11882
1000 1228281
1000 1247041
1000 12965332
121340 13027572
1000 13075072
1000 13183162
1000 13250162
1214 13326292
1000 13452672
1000 13844892
1000 14061830
12340 1406481
1000 14134703
1000 14216951
1000 14254402
12134 14258044
1000 14270791
1000 14278978
12134 14313332
1000 14392970
1000 14441172
1000 14497568
1000 14502775
1000 14595635
1000 14620544
1000 14632615
10234 14680596
1000 14956164
10230 14998341
112000 15132211
1000 15145450
100 15285998
1000 15288974
1000 15300187
1000 1532061
1000 15326300
Lastly, does anybody have experience analyzing Twitter link data? I'm finding it quite hard to take a directed graph and calculate the average/median in-degree and out-degree of its nodes. Any help or ideas?
Recommended Answer
First, you should consider whether you can simply add more RAM. Make some estimates of memory usage, either by calculating from the data you have or by reading in subsamples of the data at various sizes and measuring how memory use scales. The modest cost of a few GB of RAM might save you a lot of time and trouble.
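A minimal sketch of that subsample measurement, assuming a Unix-like server (the checkpoint sizes are hypothetical choices of mine; the file name is the one from the question, and the standard-library resource module reports peak memory):

import resource
import networkx as nx

G = nx.Graph()
checkpoints = set([1000000, 2000000, 4000000, 8000000])  # hypothetical sample sizes

for i, line in enumerate(open("./US_Health_Links.txt"), 1):
    follower, followee = line.split()[:2]
    G.add_edge(follower, followee)
    if i in checkpoints:
        # ru_maxrss is the peak resident set size (kilobytes on Linux, bytes on macOS)
        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print("%d edges loaded, peak memory so far: %d" % (i, peak))
    if i >= max(checkpoints):
        break

Plotting peak memory against edge count for these subsamples gives a rough per-edge cost, which you can extrapolate to all 379613734 edges to see how much RAM the full graph would need.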
Second, consider whether you actually need to build the whole graph. For example, you could determine the number of vertices and their degrees just by iterating through the file and counting: you would only need to keep one line at a time in memory, plus the counts, which will be much smaller than the graph. Knowing the degrees, you could omit vertices of degree one from the graph when finding the largest connected component, and then correct for the omitted nodes afterwards. You are doing data analysis, not implementing a general-purpose algorithm: learn simple things about the data first to enable the more complicated analyses.
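As a sketch of that streaming idea, here is one way to get the counts, plus the average/median in-degree asked about above, without ever holding the graph in memory (the two-column follower/followee format is taken from the sample data; collections.Counter is from the standard library, and duplicate lines, if any, would each be counted):

from collections import Counter

in_deg = Counter()
out_deg = Counter()

# One pass over the file: count degrees without storing any edges.
for line in open("./US_Health_Links.txt"):
    follower, followee = line.split()[:2]
    out_deg[follower] += 1  # one more outgoing edge for the follower
    in_deg[followee] += 1   # one more incoming edge for the followee

nodes = set(in_deg) | set(out_deg)
n = len(nodes)
m = sum(in_deg.values())  # every edge contributes exactly one in-degree

# Average in-degree is just edges / nodes; the median needs the full list,
# padded with zeros for nodes that have no incoming edges at all.
avg_in = float(m) / n
sorted_in = sorted(in_deg.get(v, 0) for v in nodes)
median_in = sorted_in[n // 2]  # for even n, average the two middle values instead

print("%d nodes, %d edges" % (n, m))
print("average in-degree %.4f, median in-degree %d" % (avg_in, median_in))

The same pass gives the out-degree statistics from out_deg, and the two Counters are all you need to decide which low-degree vertices can safely be left out of the in-memory graph.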