How to build a pandas dataframe (or dict) in an efficient way by selecting some lists of data from another bigger dataframe?


Question

I need to create a DataFrame or a dictionary. If N = 3 (the number of lists inside each outer list), the expected output is:

d = {
    'xs0': [[7.0, 986.0], [17.0, 6.0], [7.0, 67.0]],
    'ys0': [[79.0, 69.0], [179.0, 169.0], [729.0, 69.0]],
    'xs1': [[17.0, 166.0], [17.0, 116.0], [17.0, 126.0]],
    'ys1': [[179.0, 169.0], [179.0, 1169.0], [1729.0, 169.0]],
    'xs2': [[27.0, 276.0], [27.0, 216.0], [27.0, 226.0]],
    'ys2': [[279.0, 269.0], [279.0, 2619.0], [2579.0, 2569.0]]
}

To do this I wrote the following code, but I need it to run faster:

import numpy as np
import pandas as pd

df_dict = {
    'X1': [1, 2, 3, 4, 5, 6, 7, 8, np.nan],
    'Y1': [9, 29, 39, 49, np.nan, 69, 79, 89, 99],
    'X2': [11, 12, 13, 14, 15, 16, 17, 18, np.nan],
    'Y2': [119, 129, 139, 149, np.nan, 169, 179, 189, 199],
    'X3': [21, 22, 23, 24, 25, 26, 27, 28, np.nan],
    'Y3': [219, 229, 239, 249, np.nan, 269, 279, 289, 299],
    'S': [123, 11, 123, 11, 123, 123, 123, 35, 123],
    'C': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'F': [1, 1, 1, 1, 2, 3, 3, 3, 3],
    'OTHER': [10, 20, 30, 40, 50, 60, 70, 80, 90],
}
bigger_df = pd.DataFrame(df_dict)

plots = [
    { 'x': 'X1', 'y': 'Y1', },
    { 'x': 'X2', 'y': 'Y2', },
    { 'x': 'X3', 'y': 'Y3', }
]

N = 3
d = {}
s_list = [123, 145, 35]
n = 0
for p in plots:
    # INITIALIZES THE DICTIONARY ELEMENTS
    d['xs{}'.format(n)] = [[] for x in range(N)]
    d['ys{}'.format(n)] = [[] for x in range(N)]

    # BUILDS THE LISTS FOR THOSE ELEMENTS
    for index in range(3):
        df = bigger_df.filter([p['x'], p['y'], 'S', 'F', 'C'])        # selects the minimum of columns needed
        df = df[df['F'].isin([2, 3, 4, 9]) & df[p['x']].notnull() & df[p['y']].notnull() & (df.S == s_list[index])]
        df = df.sort_values(['C'], ascending=[True])                  # sorts the filtered rows by C

        d['xs{}'.format(n)][index] = list(df[p['x']])
        d['ys{}'.format(n)][index] = list(df[p['y']])
    n += 1
print(d)

I am wondering whether, instead of building the dictionary in a loop, I could use some pandas or NumPy trick. Getting a pandas DataFrame rather than a dictionary would also be fine for me, or even better, but I do not know whether it would be more efficient.

Any ideas?

Recommended answer

Depending on your input and your expected output (the same pair of values repeated three times in the list for each key?), you can at least replace your for p in plots loop with:

for p in plots:
    # Select the data you want
    df = bigger_df.filter([p['x'], p['y'], 'S', 'F', 'C'])        # selects the minimum of columns needed
    df = df[df['F'].isin([2, 3, 4, 9]) & df[p['x']].notnull() & df[p['y']].notnull() & (df.S == 123)]   # I have used 123 to simplify, actually the value is an integer variable
    df = df.sort_values(['C'], ascending=[True])
    # fill the dictionary
    d['xs{}'.format(n)] = [list(df[p['x']]) for x in range(N)]
    d['ys{}'.format(n)] = [list(df[p['y']]) for x in range(N)]
    n += 1

This at least removes the inner for index in range(3) loop, so the same operation on bigger_df is no longer repeated three times. With timeit, the run time dropped from 210 ms with your code to 70.5 ms (around a third) with this version.
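If you want to check this on your own machine, a comparison along the following lines could be used. It is only a sketch: it assumes every S value is 123 (as in the simplified snippet above), and the helper names select, nested_loops and single_loop are made up for the illustration. The absolute timings will differ from the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd

bigger_df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5, 6, 7, 8, np.nan],
    'Y1': [9, 29, 39, 49, np.nan, 69, 79, 89, 99],
    'X2': [11, 12, 13, 14, 15, 16, 17, 18, np.nan],
    'Y2': [119, 129, 139, 149, np.nan, 169, 179, 189, 199],
    'X3': [21, 22, 23, 24, 25, 26, 27, 28, np.nan],
    'Y3': [219, 229, 239, 249, np.nan, 269, 279, 289, 299],
    'S': [123, 11, 123, 11, 123, 123, 123, 35, 123],
    'C': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'F': [1, 1, 1, 1, 2, 3, 3, 3, 3],
})
plots = [{'x': 'X1', 'y': 'Y1'}, {'x': 'X2', 'y': 'Y2'}, {'x': 'X3', 'y': 'Y3'}]
N = 3


def select(p, s_value):
    """Hypothetical helper: filter and sort the rows for one plot and one S value."""
    df = bigger_df.filter([p['x'], p['y'], 'S', 'F', 'C'])
    df = df[df['F'].isin([2, 3, 4, 9]) & df[p['x']].notnull()
            & df[p['y']].notnull() & (df['S'] == s_value)]
    return df.sort_values('C')


def nested_loops():
    """Filters once per plot AND per inner index, like the code in the question."""
    d = {}
    for n, p in enumerate(plots):
        xs, ys = [], []
        for _ in range(N):
            df = select(p, 123)
            xs.append(list(df[p['x']]))
            ys.append(list(df[p['y']]))
        d['xs{}'.format(n)], d['ys{}'.format(n)] = xs, ys
    return d


def single_loop():
    """Filters once per plot and repeats the resulting lists N times."""
    d = {}
    for n, p in enumerate(plots):
        df = select(p, 123)
        d['xs{}'.format(n)] = [list(df[p['x']]) for _ in range(N)]
        d['ys{}'.format(n)] = [list(df[p['y']]) for _ in range(N)]
    return d


# Both variants must build the same dictionary before timing means anything
assert nested_loops() == single_loop()

t_nested = timeit.timeit(nested_loops, number=20)
t_single = timeit.timeit(single_loop, number=20)
print('nested: {:.4f} s, single: {:.4f} s'.format(t_nested, t_single))
```

The equality check matters more than the numbers: the shortcut is only valid as long as the filter does not depend on the inner index.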

EDIT: given the way you have redefined your question, I think this might do the job you want:

# put this code after the definition of plots
s_list = [123, 145, 35]
# create an empty DataFrame, indexed by s_list, to hold the results of the loop
df_output = pd.DataFrame(index=s_list, columns=['xs0', 'ys0', 'xs1', 'ys1', 'xs2', 'ys2'])
n = 0
for p in plots:
    # select the rows you want and sort them in the same statement
    df_p = bigger_df[
        bigger_df['F'].isin([2, 3, 4, 9])
        & bigger_df[p['x']].notnull()
        & bigger_df[p['y']].notnull()
        & bigger_df['S'].isin(s_list)
    ].sort_values(['C'], ascending=[True])
    # if the isin conditions on F and S are the same for the three plots, I would
    # build a df_select_FS outside of the loop beforehand (might be faster)

    # now group by S, build a list from the values in column p['x'] (and the same for p['y']),
    # and store it in the matching column of df_output
    df_output['xs{}'.format(n)] = df_p.groupby('S')[p['x']].apply(list)
    df_output['ys{}'.format(n)] = df_p.groupby('S')[p['y']].apply(list)
    n += 1

Two notes: first, if s_list contains the same value twice, this might not work the way you want; second, where the conditions are not met (like 145 in S in your example), you get NaN in df_output.
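Both caveats can probably be worked around with reindex. The following is a minimal sketch, assuming that an empty list is an acceptable stand-in for NaN and that a dictionary is wanted in the end. The names base, valid and grouped are made up for the illustration, and the filter on F and S is moved out of the loop as suggested in the code comment above.

```python
import numpy as np
import pandas as pd

bigger_df = pd.DataFrame({
    'X1': [1, 2, 3, 4, 5, 6, 7, 8, np.nan],
    'Y1': [9, 29, 39, 49, np.nan, 69, 79, 89, 99],
    'X2': [11, 12, 13, 14, 15, 16, 17, 18, np.nan],
    'Y2': [119, 129, 139, 149, np.nan, 169, 179, 189, 199],
    'X3': [21, 22, 23, 24, 25, 26, 27, 28, np.nan],
    'Y3': [219, 229, 239, 249, np.nan, 269, 279, 289, 299],
    'S': [123, 11, 123, 11, 123, 123, 123, 35, 123],
    'C': [9, 8, 7, 6, 5, 4, 3, 2, 1],
    'F': [1, 1, 1, 1, 2, 3, 3, 3, 3],
})
plots = [{'x': 'X1', 'y': 'Y1'}, {'x': 'X2', 'y': 'Y2'}, {'x': 'X3', 'y': 'Y3'}]
s_list = [123, 145, 35]

# Filter on F and S and sort by C once, outside the loop (assumption: these
# conditions are identical for every plot)
base = bigger_df[bigger_df['F'].isin([2, 3, 4, 9])
                 & bigger_df['S'].isin(s_list)].sort_values('C')

d = {}
for n, p in enumerate(plots):
    # Drop the rows where either coordinate of this plot is missing
    valid = base.dropna(subset=[p['x'], p['y']])
    for axis in ('x', 'y'):
        # One list of values per S value that survived the filters
        grouped = valid.groupby('S')[p[axis]].apply(list)
        # reindex follows the order of s_list, repeats an entry if s_list
        # contains it twice, and yields NaN for S values without any row
        aligned = grouped.reindex(s_list)
        # Replace the NaN placeholders with empty lists
        d['{}s{}'.format(axis, n)] = [v if isinstance(v, list) else [] for v in aligned]

print(d)
```

With the sample data, d['xs0'] comes out as [[7.0, 6.0], [], [8.0]]: the middle list is empty because no row has S equal to 145.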
