Optimising iteration and substitution over a large dataset

Views: 67 · Published: 2021/2/9 19:44:58 · Tags: python pandas performance numpy itertools


Question

I have already posted this question elsewhere, but as I have had no answer so far I thought I would also try here, since it seems relevant.

I have the following code:

import pandas as pd
import numpy as np
import itertools 
from pprint import pprint

# Importing the data
df=pd.read_csv('./GPr.csv', sep=',',header=None)
data=df.values
res = np.array([[i for i in row if i == i] for row in data.tolist()], dtype=object)

# This function will make the subsets of a list 
def subsets(m,n):
    z = []
    for i in m:
        z.append(list(itertools.combinations(i, n)))
    return(z)

# Make the subsets of size 2 
l=subsets(res,2)
l=[val for sublist in l for val in sublist]
Pairs=list(dict.fromkeys(l)) 

# Modify the pairs: 
mod=[':'.join(x) for x in Pairs]

# Define new lists
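# materialise the rows as a list of tuples: in Python 3 a bare map() is a
# one-shot iterator and would be exhausted after the first pair
t0=[tuple(row) for row in res.tolist()]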
t1=Pairs
t2=mod

# Make substitutions
result = []
for v1, v2 in zip(t1, t2):
    out = []
    for i in t0:
        common = set(v1).intersection(i)
        if set(v1) == common:
            out.append(tuple(list(set(i) - common) + [v2]))
        else:
            out.append(tuple(i))
    result.append(out)

pprint(result, width=200)  

# Delete duplicates
d = {tuple(x): x for x in result} 
remain= list(d.values())

The code works as follows. First, we import the CSV file we want to work with. Each row is a list of elements, and for each row we find all subsets of size two. We then build a modified version of each subset and call it mod: for example, ('a','b') becomes 'a:b'. Next, for each pair, we go through the original data and substitute the pair wherever both of its elements occur in a row. Finally, we delete all duplicates.
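The steps described above can be traced on a few toy rows. This is only a minimal sketch: the rows and the helper name `substitute` are made up for illustration, and unlike the original loop it keeps the remaining items in their original order instead of going through a set.

```python
import itertools

# Toy rows standing in for the cleaned CSV rows (hypothetical data).
rows = [("a", "b", "c"), ("a", "b"), ("c", "d")]

# All size-2 subsets of every row, de-duplicated while keeping first-seen order.
pairs = list(dict.fromkeys(
    pair for row in rows for pair in itertools.combinations(row, 2)
))

def substitute(row, pair):
    """Return row with both members of pair replaced by 'x:y', or row unchanged."""
    if set(pair) <= set(row):
        rest = [item for item in row if item not in pair]
        return tuple(rest + [":".join(pair)])
    return tuple(row)

# One substituted copy of the whole dataset per pair.
result = [[substitute(row, pair) for row in rows] for pair in pairs]

print(pairs)      # [('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'd')]
print(result[0])  # [('c', 'a:b'), ('a:b',), ('c', 'd')]
```

Note that `result` holds one full copy of the dataset per pair, which is what makes the approach expensive on large inputs.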

The code works fine for a small set of data. The problem is that my file has 30082 pairs, and for each pair the list of ~49000 lists has to be scanned and the pair replaced. I run this in Jupyter and after some time the kernel dies. How can this be optimised?
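With these numbers the nested loop performs roughly 30082 × 49000 ≈ 1.47 billion row checks and keeps a full copy of the data for every pair. One possible way to cut this down, sketched here under the assumption that only the rows that actually change are of interest, is to index the rows by item so that each pair only visits rows containing both of its members, and to yield results lazily instead of storing them all. The names `build_index` and `changed_rows` and the toy data are hypothetical, not part of the original code.

```python
import itertools
from collections import defaultdict

def build_index(rows):
    """Map each item to the set of indices of the rows that contain it."""
    index = defaultdict(set)
    for i, row in enumerate(rows):
        for item in row:
            index[item].add(i)
    return index

def changed_rows(rows, pairs, index):
    """Lazily yield (pair, row_index, new_row) for rows containing both items."""
    for a, b in pairs:
        # Only rows that contain both a and b can change for this pair.
        for i in sorted(index[a] & index[b]):
            rest = [item for item in rows[i] if item not in (a, b)]
            yield (a, b), i, tuple(rest + [f"{a}:{b}"])

rows = [("a", "b", "c"), ("a", "b"), ("c", "d")]  # hypothetical toy data
pairs = list(dict.fromkeys(
    p for row in rows for p in itertools.combinations(row, 2)
))
index = build_index(rows)
hits = list(changed_rows(rows, pairs, index))

for pair, i, new_row in hits:
    print(pair, i, new_row)
```

Because `changed_rows` is a generator, the results can be written to disk or aggregated one at a time on the real data rather than collected in a list as in this small demonstration.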

Recommended answer

Tested on the entire file.

Here you go:

=^..^=

import pandas as pd
import numpy as np
import itertools

# Importing the data
df=pd.read_csv('./GPr_test.csv', sep=',',header=None)

pd.options.display.max_colwidth = 200

# collect every result tuple here and build the data frame once at the end
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
all_results = []

for index, row in df.iterrows():
    # clean data: drop the NaN padding of short rows
    clean_list = [x for x in list(row.values) if str(x) != 'nan']
    # create combinations
    items_combinations = list(itertools.combinations(clean_list, 2))
    # create joined combinations, e.g. ('a', 'b') -> 'a:b'
    joint_items_combinations = [':'.join(x) for x in items_combinations]

    # collect rest of item names
    # handle first element: row 0 uses row 1, every other row uses the previous row
    other_index = 1 if index == 0 else index - 1
    additional_names = [x for x in list(df.loc[other_index].values) if str(x) != 'nan']

    # get set data
    for combination, joint_combination in zip(items_combinations, joint_items_combinations):
        set_data = [item for item in clean_list if item not in combination] + [joint_combination]
        all_results.append((set_data, additional_names))

# build the data frame in one step
df2 = pd.DataFrame({"result": all_results})

For the rows:

chicken cinnamon    ginger  onion   soy_sauce
cardamom    coconut pumpkin

Output:

                                                                      result
0   ([ginger, onion, soy_sauce, chicken:cinnamon], [cardamom, coconut, pumpkin])
1   ([cinnamon, onion, soy_sauce, chicken:ginger], [cardamom, coconut, pumpkin])
2   ([cinnamon, ginger, soy_sauce, chicken:onion], [cardamom, coconut, pumpkin])
3   ([cinnamon, ginger, onion, chicken:soy_sauce], [cardamom, coconut, pumpkin])
4   ([chicken, onion, soy_sauce, cinnamon:ginger], [cardamom, coconut, pumpkin])
5   ([chicken, ginger, soy_sauce, cinnamon:onion], [cardamom, coconut, pumpkin])
6   ([chicken, ginger, onion, cinnamon:soy_sauce], [cardamom, coconut, pumpkin])
7   ([chicken, cinnamon, soy_sauce, ginger:onion], [cardamom, coconut, pumpkin])
8   ([chicken, cinnamon, onion, ginger:soy_sauce], [cardamom, coconut, pumpkin])
9   ([chicken, cinnamon, ginger, onion:soy_sauce], [cardamom, coconut, pumpkin])
10  ([pumpkin, cardamom:coconut], [chicken, cinnamon, ginger, onion, soy_sauce])
11  ([coconut, cardamom:pumpkin], [chicken, cinnamon, ginger, onion, soy_sauce])
12  ([cardamom, coconut:pumpkin], [chicken, cinnamon, ginger, onion, soy_sauce])
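The first element of each output tuple can be checked without the CSV file by isolating the per-row step. The sketch below uses a hypothetical helper name, `row_variants`, and the first sample row from above; a row of k items yields k(k-1)/2 variants, which is 10 for the five-item row and 3 for the three-item row, matching the 13 output rows.

```python
import itertools

def row_variants(items):
    """For each 2-combination of items, return the remaining items plus 'x:y'."""
    variants = []
    for pair in itertools.combinations(items, 2):
        rest = [item for item in items if item not in pair]
        variants.append(rest + [":".join(pair)])
    return variants

row = ["chicken", "cinnamon", "ginger", "onion", "soy_sauce"]
variants = row_variants(row)

print(len(variants))  # 10
print(variants[0])    # ['ginger', 'onion', 'soy_sauce', 'chicken:cinnamon']
```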
