Optimising iteration and substitution over large dataset
Question
I've made a post here, but as I've had no answer so far I thought I'd also try it here, since I found it relevant.
I have the following code:
import pandas as pd
import numpy as np
import itertools
from pprint import pprint

# Importing the data
df = pd.read_csv('./GPr.csv', sep=',', header=None)
data = df.values
res = np.array([[i for i in row if i == i] for row in data.tolist()], dtype=object)

# This function will make the subsets of a list
def subsets(m, n):
    z = []
    for i in m:
        z.append(list(itertools.combinations(i, n)))
    return z

# Make the subsets of size 2
l = subsets(res, 2)
l = [val for sublist in l for val in sublist]
Pairs = list(dict.fromkeys(l))

# Modify the pairs:
mod = [':'.join(x) for x in Pairs]

# Define new lists
t0 = res.tolist()
t0 = list(map(tuple, t0))  # materialise: a bare map() iterator would be exhausted after the first pair
t1 = Pairs
t2 = mod

# Make substitutions
result = []
for v1, v2 in zip(t1, t2):
    out = []
    for i in t0:
        common = set(v1).intersection(i)
        if set(v1) == common:
            out.append(tuple(list(set(i) - common) + [v2]))
        else:
            out.append(tuple(i))
    result.append(out)
pprint(result, width=200)

# Delete duplicates
d = {tuple(x): x for x in result}
remain = list(d.values())
What it does is as follows: first, we import the CSV file we want to work with here. You can see that it is a list of elements; for each element we find the subsets of size two. We then write a modification of the subsets and call it mod. What it does is take, say, ('a','b') and convert it to 'a:b'. Then, for each pair, we go through the original data and substitute the pair wherever we find it. Finally, we delete all the duplicates this produces.
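For a concrete illustration, the pairing, joining, and substitution steps on a toy row look like this (a minimal sketch with made-up element names):

```python
import itertools

row = ('a', 'b', 'c')

# Subsets of size 2, as produced by subsets(res, 2)
pairs = list(itertools.combinations(row, 2))   # [('a','b'), ('a','c'), ('b','c')]

# The "mod" step joins each pair with a colon
mods = [':'.join(p) for p in pairs]            # ['a:b', 'a:c', 'b:c']

# Substitute one pair back into the row: drop its members, append the joined form
pair, joined = pairs[0], mods[0]
substituted = tuple(sorted(set(row) - set(pair))) + (joined,)
print(substituted)  # ('c', 'a:b')
```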
The code works fine for a small set of data. The problem is that my file has 30082 pairs, and for each pair the list of ~49000 lists should be scanned and the pair substituted. I run this in Jupyter, and after some time the kernel dies. I wonder how one can optimise this?
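The nested loop does roughly pairs × rows set constructions. One way to cut that cost (an illustrative sketch, not the recommended answer below; all names here are hypothetical) is to build each row's set once up front, so each pair-vs-row test is two O(1) membership checks instead of fresh set intersections:

```python
import itertools

rows = [('a', 'b', 'c'), ('b', 'c', 'd')]
pairs = [('a', 'b'), ('b', 'c')]

# Precompute each row as a frozenset once, instead of rebuilding sets per pair
row_sets = [frozenset(r) for r in rows]

result = []
for pair in pairs:
    joint = ':'.join(pair)
    a, b = pair
    out = []
    for r, rs in zip(rows, row_sets):
        if a in rs and b in rs:  # two O(1) membership checks
            # keep the row's original order, minus the pair, plus the joined form
            out.append(tuple(x for x in r if x not in pair) + (joint,))
        else:
            out.append(r)
    result.append(out)
print(result)
```

This also preserves the original element order of each row, whereas the `set(i) - common` approach above scrambles it.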
Answer
Tested on entire file.
Here you go:
=^..^=
import pandas as pd
import numpy as np
import itertools

# Importing the data
df = pd.read_csv('./GPr_test.csv', sep=',', header=None)

# set new data frame
df2 = pd.DataFrame()
pd.options.display.max_colwidth = 200

for index, row in df.iterrows():
    # clean data
    clean_list = [x for x in list(row.values) if str(x) != 'nan']
    # create combinations
    items_combinations = list(itertools.combinations(clean_list, 2))
    # create set combinations
    joint_items_combinations = [':'.join(x) for x in items_combinations]
    # collect the rest of the item names
    # handle the first element: it has no previous row, so take the next one
    if index == 0:
        additional_names = list(df.loc[1].values)
    else:
        additional_names = list(df.loc[index - 1].values)
    additional_names = [x for x in additional_names if str(x) != 'nan']
    # get set data
    result = []
    for combination, joint_combination in zip(items_combinations, joint_items_combinations):
        set_data = [item for item in clean_list if item not in combination] + [joint_combination]
        result.append((set_data, additional_names))
    # add data to data frame
    data = pd.DataFrame({"result": result})
    df2 = df2.append(data)

df2 = df2.reset_index().drop(columns=['index'])
For the rows:
chicken cinnamon ginger onion soy_sauce
cardamom coconut pumpkin
Output:
result
0 ([ginger, onion, soy_sauce, chicken:cinnamon], [cardamom, coconut, pumpkin])
1 ([cinnamon, onion, soy_sauce, chicken:ginger], [cardamom, coconut, pumpkin])
2 ([cinnamon, ginger, soy_sauce, chicken:onion], [cardamom, coconut, pumpkin])
3 ([cinnamon, ginger, onion, chicken:soy_sauce], [cardamom, coconut, pumpkin])
4 ([chicken, onion, soy_sauce, cinnamon:ginger], [cardamom, coconut, pumpkin])
5 ([chicken, ginger, soy_sauce, cinnamon:onion], [cardamom, coconut, pumpkin])
6 ([chicken, ginger, onion, cinnamon:soy_sauce], [cardamom, coconut, pumpkin])
7 ([chicken, cinnamon, soy_sauce, ginger:onion], [cardamom, coconut, pumpkin])
8 ([chicken, cinnamon, onion, ginger:soy_sauce], [cardamom, coconut, pumpkin])
9 ([chicken, cinnamon, ginger, onion:soy_sauce], [cardamom, coconut, pumpkin])
10 ([pumpkin, cardamom:coconut], [chicken, cinnamon, ginger, onion, soy_sauce])
11 ([coconut, cardamom:pumpkin], [chicken, cinnamon, ginger, onion, soy_sauce])
12 ([cardamom, coconut:pumpkin], [chicken, cinnamon, ginger, onion, soy_sauce])
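The per-row logic of this answer can be checked in isolation by running the combination-and-join step on the first sample row on its own (a minimal sketch; `other` stands in for `additional_names`):

```python
import itertools

row = ['chicken', 'cinnamon', 'ginger', 'onion', 'soy_sauce']
other = ['cardamom', 'coconut', 'pumpkin']

results = []
for combo in itertools.combinations(row, 2):
    joint = ':'.join(combo)
    # keep every item not in the pair, then append the joined pair
    set_data = [item for item in row if item not in combo] + [joint]
    results.append((set_data, other))

print(results[0])
# (['ginger', 'onion', 'soy_sauce', 'chicken:cinnamon'], ['cardamom', 'coconut', 'pumpkin'])
```

A row of five items yields C(5, 2) = 10 result rows, matching rows 0-9 of the output above.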