How do I use .loc with groupby so that creating a new column based on grouped data won't be considered a copy?


Question

I have a CSV file with groups of data and am using the groupby() method to separate them. Each group is processed with some simple math: min() and max() on a couple of columns, plus a bit of subtraction and multiplication, to create a new column of data. I then graph each group. This mostly works, but I have two complaints about my code: the graphs are individual rather than combined as I would prefer, and I get a "SettingWithCopyWarning" for each group. From my searching, I believe the solution is either to use .loc or to use a better split-apply (and possibly combine) method. I can do this in Excel, but I am trying to learn Python and, while my code is functioning, I'd like to improve it.

import os.path
import sys
import pandas as pd

filename = "data/cal_data.csv"
df = pd.read_csv(filename, header=0) #one line of headers

# Combine several columns into a single column; this makes grouping straightforward
# and simplifies the graph titles. Not strictly necessary.
df['Test'] = ("Model " + df['Model No'] + ", SN " + df['Serial No'].values.astype(str)
              + ", Test time " + df['Test time'].values.astype(str))

# Drop the rows after each group's maximum test point.
df = df[df.index <= df.groupby('Test')['Test Point'].transform('idxmax')]

for title, group in df.groupby('Test'):
    x1, x2 = min(group["Test Reading"]),max(group["Test Reading"])
    x4, x3 = max(group["Test Point"]),min(group["Test Point"]) #min is usually zero
    R=(x2-x1)/(x4-x3) #linearize
    
    group['Test Point Error']=100*(group['Test Reading']- (group['Test Point']*R+x1))
    
    ax=group.plot(x='Test Point', y='Test Point Error', title=title, grid=True)
    # Error_Limit is defined elsewhere in the script (not shown here).
    ax.set_ylabel("% error (+/-"+str(Error_Limit)+"% limit)")

Output error:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Edit: added the output of df.head(20) and an image of a couple of plots:

    Test Point  Test Reading  Test
0            0       0.10453  Model LC-500, SN 937618, Test time 17:20:10
1           20       0.17271  Model LC-500, SN 937618, Test time 17:20:10
2           50       0.27838  Model LC-500, SN 937618, Test time 17:20:10
3          100       0.45596  Model LC-500, SN 937618, Test time 17:20:10
4          150       0.63435  Model LC-500, SN 937618, Test time 17:20:10
5          200       0.81323  Model LC-500, SN 937618, Test time 17:20:10
6          250       0.99252  Model LC-500, SN 937618, Test time 17:20:10
7          300       1.17222  Model LC-500, SN 937618, Test time 17:20:10
8          350       1.35219  Model LC-500, SN 937618, Test time 17:20:10
9          400       1.53260  Model LC-500, SN 937618, Test time 17:20:10
10         450       1.71312  Model LC-500, SN 937618, Test time 17:20:10
11         500       1.89382  Model LC-500, SN 937618, Test time 17:20:10
14           0       0.10468  Model LC-500, SN 937618, Test time 17:31:46
15          20       0.17284  Model LC-500, SN 937618, Test time 17:31:46
16          50       0.27856  Model LC-500, SN 937618, Test time 17:31:46
17         100       0.45609  Model LC-500, SN 937618, Test time 17:31:46
18         150       0.63457  Model LC-500, SN 937618, Test time 17:31:46
19         200       0.81341  Model LC-500, SN 937618, Test time 17:31:46
20         250       0.99277  Model LC-500, SN 937618, Test time 17:31:46
21         300       1.17237  Model LC-500, SN 937618, Test time 17:31:46

Edit/update 7/23/2020: I made a couple of workarounds that make this work, but I would still appreciate any help. Here is the revised for-loop code. It writes each group to a new CSV file to read back later (this way I can add the new column created here), and removes the temporary file first if it already exists:

if os.path.exists("data/temp.csv"):
    os.remove("data/temp.csv")
for title, group in df.groupby('Test'):

    x1 = min(group["Test Reading"].head(1))
    x2 = max(group["Test Reading"].tail(1))
    x3 = min(group["Test Point"].head(1))
    x4 = max(group["Test Point"].tail(1))
    R=(x2-x1)/(x4-x3) #linearization scalar
    group['Test Point Error'] =100*(group['Test Reading']- (group['Test Point']*R+x1))/(x2-x1)
    # to_csv opens and closes the file itself, so no separate open()/close() is needed.
    # columns is a list of column names defined elsewhere in the script (not shown here).
    group.to_csv('data/temp.csv', mode="a", index=False, columns=columns, header=False)

Then, reading the temporary CSV back, I used seaborn (import seaborn as sns and import matplotlib.pyplot as plt) to plot multiple groups together, grouped by serial number, with 4 subplots per row.

df = pd.read_csv('data/temp.csv', header=0)
df['Model/SN']=df['Model No']+" / "+df['Serial No'].values.astype(str)
g = sns.FacetGrid(df, col='Model/SN', hue='Test', col_wrap=4, sharey=False, sharex=False)

g.map(plt.axhline, y=Error_Limit, ls='--', c='red')
g.map(plt.axhline, y=-Error_Limit, ls='--', c='red')

g = g.map(sns.lineplot, 'Test Point', 'Test Point Error', ci=None)

To sum up: these fixes are not ideal. They are workarounds, and I still get the "SettingWithCopyWarning".

Recommended answer

So you are asking two things:


  1. How to stop setting values on a copy.

  2. How to create a figure with a subplot for each group in matplotlib.

The "SettingWithCopyWarning" happens because you are creating a column and setting values on each group, and each group is itself a copy of some rows of the DataFrame, so the assignment never reaches the original df. Instead of setting the values inside each loop iteration, I would store 'Test_Point_Error' in a list of Series, call pd.concat(list) after exiting the for loop, and then add the result to the DataFrame.
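Since the title asks about .loc specifically, here is a minimal sketch of that route. It is not part of the answer above. The toy values are made up and only the column names come from the question. The idea is to take each group's row labels from the groupby and then read and write through df.loc, so nothing is ever assigned to a group object.

```python
import pandas as pd

# Made-up miniature of the data; only the column names come from the question.
df = pd.DataFrame({
    "Test": ["A", "A", "A", "B", "B", "B"],
    "Test Point": [0, 50, 100, 0, 50, 100],
    "Test Reading": [0.10, 0.29, 0.46, 0.11, 0.30, 0.45],
})

# Create the target column up front so the loop only fills in values.
df["Test Point Error"] = float("nan")

# .groups maps each group name to the index labels of its rows.
for title, idx in df.groupby("Test").groups.items():
    reading = df.loc[idx, "Test Reading"]
    point = df.loc[idx, "Test Point"]
    x1, x2 = reading.min(), reading.max()
    x3, x4 = point.min(), point.max()
    R = (x2 - x1) / (x4 - x3)  # slope of the line through the end points

    # Assign on the original frame, selecting this group's rows by label.
    df.loc[idx, "Test Point Error"] = 100 * (reading - (point * R + x1))

print(df)
```

Because every assignment targets df itself, pandas has no copy to warn about, and the new column is available on df once the loop ends.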

---Edit--- Try replacing:

group['Test Point Error']=100*(group['Test Reading']- (group['Test Point']*R+x1))

with

error_list.append(100 * (group['Test Reading']- (group['Test Point']*R+x1)))

This appends one Series per group (create error_list = [] before the loop), each with an index matching the corresponding rows of df.index. When the loop finishes, the list holds exactly one error value for each row in df. Therefore, after you exit the for loop:

df = df.assign(test_point_error=pd.concat(error_list))

This matches each row exactly, regardless of how df is sorted, because assign aligns the values on the index.
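Put together, the list-and-concat approach might look like the sketch below. The data is made up and deliberately interleaved so that the groups are not contiguous, which shows that the concatenated Series still lines up with df by index label. The column name test_point_error follows the assign call above.

```python
import pandas as pd

# Made-up data with the two tests interleaved on purpose.
df = pd.DataFrame({
    "Test": ["B", "A", "B", "A", "B", "A"],
    "Test Point": [0, 0, 50, 50, 100, 100],
    "Test Reading": [0.11, 0.10, 0.30, 0.29, 0.45, 0.46],
})

error_list = []  # one Series per group, each keeping its original row labels
for title, group in df.groupby("Test"):
    x1, x2 = group["Test Reading"].min(), group["Test Reading"].max()
    x3, x4 = group["Test Point"].min(), group["Test Point"].max()
    R = (x2 - x1) / (x4 - x3)  # slope of the line through the end points

    # Only read from the group; nothing is assigned to it, so no warning.
    error_list.append(100 * (group["Test Reading"] - (group["Test Point"] * R + x1)))

# assign() returns a new DataFrame, so rebind the name.
# The concatenated Series is aligned to df on the index, not on position.
df = df.assign(test_point_error=pd.concat(error_list))
print(df)
```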

---End of edit---

The subplotting issue is similar: you are plotting each group separately while looping. If you plot after exiting the for loop, then

df.groupby('Test').plot(subplots=True)

should give you what you want.
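If the grouped plot call still produces one figure per group, one alternative is to create the grid of axes yourself with matplotlib and pass each axes object to the group's plot call. This is a sketch under assumptions, not the answerer's method: the data and error values are made up, two columns per row is an arbitrary choice, and the Agg backend is selected only so the example runs without a display.

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for on-screen plots
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data that already contains the computed error column.
df = pd.DataFrame({
    "Test": ["A"] * 3 + ["B"] * 3 + ["C"] * 3,
    "Test Point": [0, 50, 100] * 3,
    "Test Point Error": [0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, -1.5, 0.0],
})

groups = list(df.groupby("Test"))
ncols = 2  # subplots per row (arbitrary choice for this sketch)
nrows = math.ceil(len(groups) / ncols)

# squeeze=False always returns a 2-D array of axes; flatten it for easy pairing.
fig, axes = plt.subplots(nrows, ncols, squeeze=False, figsize=(5 * ncols, 3 * nrows))
axes = axes.ravel()

for ax, (title, group) in zip(axes, groups):
    # ax= tells pandas to draw on an existing subplot instead of a new figure.
    group.plot(x="Test Point", y="Test Point Error", title=title,
               grid=True, legend=False, ax=ax)
    ax.set_ylabel("% error")

# Hide any unused subplot in the last row.
for ax in axes[len(groups):]:
    ax.set_visible(False)

fig.tight_layout()
```

Call fig.savefig(...) or plt.show() afterwards, depending on whether you want a file or a window.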

On a separate topic, I would do away with the string concatenation for 'Test' and instead do:

df.groupby(['Model No', 'Serial No', 'Test time'])

This might make your code a lot faster if there are many rows.
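As an illustration of grouping on the three columns directly, the sketch below also swaps the for loop for groupby(...).transform, which goes a step beyond what the answer suggests. transform broadcasts each group's min and max back onto that group's rows, so the error column can be computed for the whole frame in one expression. The readings are made up. The model, serial number and test times are taken from the df.head(20) output in the question.

```python
import pandas as pd

# Made-up readings; identifiers copied from the question's sample output.
df = pd.DataFrame({
    "Model No": ["LC-500"] * 6,
    "Serial No": [937618] * 6,
    "Test time": ["17:20:10"] * 3 + ["17:31:46"] * 3,
    "Test Point": [0, 50, 100, 0, 50, 100],
    "Test Reading": [0.10, 0.29, 0.46, 0.11, 0.30, 0.45],
})

keys = ["Model No", "Serial No", "Test time"]
grouped = df.groupby(keys)

# Each transform result has the same index as df, one value per row.
x1 = grouped["Test Reading"].transform("min")
x2 = grouped["Test Reading"].transform("max")
x3 = grouped["Test Point"].transform("min")
x4 = grouped["Test Point"].transform("max")
R = (x2 - x1) / (x4 - x3)  # per-row copy of each group's slope

# Plain column assignment on df itself: no group copies are involved.
df["Test Point Error"] = 100 * (df["Test Reading"] - (df["Test Point"] * R + x1))

# With several keys, each group name is a tuple; build titles only when needed.
titles = [
    f"Model {model}, SN {serial}, Test time {test_time}"
    for (model, serial, test_time), _ in df.groupby(keys)
]
print(df)
print(titles)
```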
