嵌套查询/在Pandas中有效比较多个数据集 [英] Nested queries / comparing multiple datasets efficiently in Pandas

查看:106
本文介绍了嵌套查询/在Pandas中有效比较多个数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用Pandas(第一次)来确定人员上课时是否满足前提条件.下面的代码返回了预期的结果,但是我相信有更好的方法可以达到相同的结果.

I am using Pandas (first time) to determine whether personnel meet prerequisites when it comes to course attendance. The code below returns the desired results however I am sure there are much better ways to achieve the same outcome.

确定您是否可以从事物理学的标准如下;

The criteria to determine if you can undertake physics is as follows;

(Math_A OR Math_B OR Math_C) AND (Eng_A OR Eng_B) AND NOT (Physics)

我的问题是,可以采用什么效率或替代方法来完成这项任务.

My question is what efficiencies or alternate methods can be applied to achieve the task.

在阅读嵌套查询之类的内容时,我无法提出一种在一个查询中比较多个查询的方法.理想情况下,我想发表一条声明来检查该人是否满足前提条件,但是到目前为止我还没有通过.

Reading up on nested queries and alike, I am not able to come up with a way to compare multiple queries in the one query. Ideally, I'd like to have one statement that checks if the person satisfies the prerequisites however I have failed at this so far.

数据集-通常将包含> 20,000条记录

Dataset - Will normally contain > 20,000 records

Emplid,Name,CourseId,CourseName
123,David,P12,Mathematics A
123,David,P13,Mathematics B
123,David,P14,Mathematics C
123,David,P32,Engineering A
456,Sally,P33,Engineering B
789,Roger,P99,Physics

代码

经过修订以简化可读性-Thx Boud.

import pandas as pd

def physics_meets_prereqs():

    df = pd.DataFrame({'Emplid':['123','123', '123', '123', '456', '789'],
                   'Name':['David','David','David','David','Sally','Roger'],
                   'CourseId':['P12','P13','P14','P32','P33','P99'],
                   'CourseName':['Mathematics A','Mathematics B','Mathematics C','Engineering A','Engineering B', 'Physics']
                    })

    # Get datasets of individually completed courses
    has_math = df.query('CourseId == "P12" or CourseId == "P13" or CourseId == "P14"')
    has_eng = df.query('CourseId == "P32" or CourseId == "P33"')
    has_physics = df.query('CourseId == "P99"')

    # Get personnel who have completed math and engineering
    has_math_and_eng = has_math[(has_math['Emplid'].isin(has_eng['Emplid']))]

    # Remove personnel who have completed physics
    has_math_and_eng_no_physics = has_math_and_eng[~(has_math_and_eng['Emplid'].isin(has_physics['Emplid']))]

    print(has_math_and_eng_no_physics)

physics_meets_prereqs()

输出

  CourseId     CourseName Emplid   Name
0      P12  Mathematics A    123  David
1      P13  Mathematics B    123  David
2      P14  Mathematics C    123  David

输出结果导致David被确定为符合物理课程的前提条件.它确实列出了他3次,但我还没有弄清楚如何限制.我实现这一目标的方式当然可以改善.

The output is resulting in David being identified as meeting the prerequisites for the physics course. It does list him 3 times which I have not figured out how to limit as yet. The way I am achieving this though can definitely be improved.

简而言之

向我显示至少完成一门数学课程,至少一门工程课程并且尚未完成物理课程的人员的列表.

Show me a list of people who have completed at least one of the maths courses, at least one of the engineering courses and have not yet completed the physics course.

推荐答案

pivoted = df.groupby(['Name', df.CourseName.str.split().str[0]]) \
            .CourseId.size().gt(0).unstack(fill_value=False)

pivoted

matches = pivoted.query('Engineering & Mathematics & ~Physics')
matches

df.query('Name in @matches.index')

这篇关于嵌套查询/在Pandas中有效比较多个数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆