How to use different combinations of group by while trying to get the top most viewed

Problem description

I am trying to get the top rating using a groupby over multiple columns. If a particular combination of group keys does not exist, it throws an error. How can I handle multiple combinations?

Data:

maritalstatus   gender age_range occ    rating
ma                 M    young   student  PG
ma                 F    adult   teacher  R
sin                M    young   student  PG
sin                M    adult   teacher  R
ma                 M    young   student  PG
sin                F    adult   teacher  R

Code:

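def get_top(maritalstatus, gender, age_range, occ):
    # Most frequent rating for every (maritalstatus, gender, age_range, occ) group
    m = (df.groupby(['maritalstatus', 'gender', 'age_range', 'occ'])['rating']
           .apply(lambda x: x.value_counts().index[0]))
    # Raises KeyError when the requested combination is not in the data
    mpaa = m[maritalstatus][gender][age_range][occ]
    return mpaa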

Input:

get_top('ma', 'M', 'young', 'teacher')

Output: throws an error because there is no such combination.

If there is no such combination, my function should fall back to married, male and young, leaving out teacher.

Recommended answer

pandas is definitely the go-to library for handling detailed tabular data. For those seeking a non-pandas option, you can build your own mapping and reduction functions. I use these terms to mean the following:

  • mapping: reorganize the data, grouped by the column being queried
  • reduction: an aggregation function, used to tally or condense many values into one

pandas has analogous groupby/aggregation concepts.

Given

Cleaned data in which runs of multiple spaces have been replaced with a single delimiter, e.g. ",".

%%file "test.txt"
status,gender,age_range,occ,rating
ma,M,young,student,PG
ma,F,adult,teacher,R
sin,M,young,student,PG
sin,M,adult,teacher,R
ma,M,young,student,PG
sin,F,adult,teacher,R
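
One possible way to produce such a cleaned file from the whitespace-aligned table in the question is sketched below. The helper name `to_csv` is hypothetical, and the sketch assumes that no field contains spaces. It keeps the question's original header (`maritalstatus`), whereas the file above uses `status`.

```python
import csv
import io
import re

# The whitespace-aligned table from the question
raw = """\
maritalstatus   gender age_range occ    rating
ma                 M    young   student  PG
ma                 F    adult   teacher  R
sin                M    young   student  PG
sin                M    adult   teacher  R
ma                 M    young   student  PG
sin                F    adult   teacher  R
"""


def to_csv(text, delimiter=","):
    """Collapse each run of whitespace in every non-empty line into one delimiter."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(re.sub(r"\s+", delimiter, line) for line in lines if line)


cleaned = to_csv(raw)
rows = list(csv.DictReader(io.StringIO(cleaned)))
print(cleaned.splitlines()[1])  # ma,M,young,student,PG
print(len(rows))                # 6
```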

Code

import csv
import collections as ct

Step 1: Read data

def read_file(fname):
    with open(fname, "r") as f:
        reader = csv.DictReader(f)
        for line in reader:
            yield line


iterable = [line for line in read_file("test.txt")]
iterable

Output

[OrderedDict([('status', 'ma'),
              ('gender', 'M'),
              ('age_range', 'young'),
              ('occ', 'student'),
              ('rating', 'PG')]),
 OrderedDict([('status', 'ma'),
              ('gender', 'F'),
              ('age_range', 'adult'),
              ...]
 ...
] 

Step 2: Remap data

def mapping(data, column):
    """Return a dict of regrouped data."""
    dd = ct.defaultdict(list)
    for d in data:
        key = d[column]
        value = {k: v for k, v in d.items() if k != column}
        dd[key].append(value)
    return dict(dd)


mapping(iterable, "gender")

Output

{'M': [
   {'age_range': 'young', 'occ': 'student', 'rating': 'PG', ...},
   ...]
 'F': [
   {'status': 'ma', 'age_range': 'adult', ...},
   ...]
} 

Step 3: Reduce data

def reduction(data):
    """Return a reduced mapping of Counters."""
    final = {}
    for key, val in data.items():
        agg = ct.defaultdict(ct.Counter)
        for d in val:
            for k, v in d.items():
                agg[k][v] += 1
        final[key] = dict(agg)
    return final

reduction(mapping(iterable, "gender"))

Output

{'F': {
   'age_range': Counter({'adult': 2}),
   'occ': Counter({'teacher': 2}),
   'rating': Counter({'R': 2}),
   'status': Counter({'ma': 1, 'sin': 1})},
 'M': {
   'age_range': Counter({'adult': 1, 'young': 3}),
   'occ': Counter({'student': 3, 'teacher': 1}),
   'rating': Counter({'PG': 3, 'R': 1}),
   'status': Counter({'ma': 2, 'sin': 2})}
 }

Demo

With these tools in place, you can build a data pipeline to query the data, feeding results from one function into another:

# Find the top age range among males
pipeline = reduction(mapping(iterable, "gender"))
pipeline["M"]["age_range"].most_common(1)
# [('young', 3)]

# Find the top ratings among teachers
pipeline = reduction(mapping(iterable, "occ"))
pipeline["teacher"]["rating"].most_common()
# [('R', 3)]

# Find the number of married people
pipeline = reduction(mapping(iterable, "gender"))
sum(v["status"]["ma"] for k, v in pipeline.items())
# 3

Overall, you can tailor the output by how you define your reduction function.
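
As an illustration of tailoring the reduction, here is a minimal sketch of an alternative reducer that keeps only the most common value per column instead of a full `Counter`. The name `top_reduction` is hypothetical and not part of the answer; the rows are written inline (mirroring test.txt) and `mapping` is repeated so the block runs on its own. Ties are resolved by whichever value was seen first.

```python
import collections as ct

# Rows mirroring test.txt, written inline so the sketch is self-contained
fields = ("status", "gender", "age_range", "occ", "rating")
rows = [dict(zip(fields, line.split(","))) for line in (
    "ma,M,young,student,PG",
    "ma,F,adult,teacher,R",
    "sin,M,young,student,PG",
    "sin,M,adult,teacher,R",
    "ma,M,young,student,PG",
    "sin,F,adult,teacher,R",
)]


def mapping(data, column):
    """Return a dict of rows regrouped by the values in `column`."""
    dd = ct.defaultdict(list)
    for d in data:
        dd[d[column]].append({k: v for k, v in d.items() if k != column})
    return dict(dd)


def top_reduction(data):
    """Hypothetical reducer: keep only the most common value of each column."""
    final = {}
    for key, group in data.items():
        counters = ct.defaultdict(ct.Counter)
        for d in group:
            for k, v in d.items():
                counters[k][v] += 1
        # most_common(1) returns [(value, count)]; keep just the value
        final[key] = {k: c.most_common(1)[0][0] for k, c in counters.items()}
    return final


result = top_reduction(mapping(rows, "gender"))
print(result["M"]["rating"])  # PG
print(result["F"]["occ"])     # teacher
```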

Note that the code for this generalized process is more verbose than a former example, despite its powerful application to many data columns. pandas succinctly encapsulates these concepts. Although its learning curve may initially be steeper, it can greatly expedite data analysis.

Details

  1. Read data - we parse each line of a cleaned file using csv.DictReader, which keeps the header names as the keys of a dictionary. This structure makes it easier to access columns by name.
  2. Remap data - we group the data as a dictionary.
    • The keys are the items in the selected/queried column, e.g. "M", "F".
    • Each value is a list of dictionaries. Each dictionary represents one row of the remaining column data (excluding the key).

Application

Pipelines are optional. Here we build a single function that processes serial requests:

def serial_reduction(iterable, val_queries):
    """Return a `Counter` that is reduced after serial queries."""
    # Map every value seen in any row to the name of its column
    val_to_key = {v: k for row in iterable for k, v in row.items()}

    rows = list(iterable)
    # Narrow the rows down one query value at a time
    for q in val_queries:
        rows = [r for r in rows if q in val_to_key and r[val_to_key[q]] == q]
        if not rows:
            raise ValueError("'{}' not found. Try a new query.".format(q))

    # Tally the values of the columns that were not queried
    queried = {val_to_key[q] for q in val_queries}
    counter = ct.Counter()
    for row in rows:
        for k, v in row.items():
            if k not in queried:
                counter[v] += 1
    return counter


c = serial_reduction(iterable, "ma M young".split())
c.most_common()
# [('student', 2), ('PG', 2)]
serial_reduction(iterable, "ma M young teacher".split())
# ValueError: 'teacher' not found. Try a new query.
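
The question asks for a fallback rather than an error when the full combination is missing. A minimal sketch of one way to do that follows: drop the last criterion and retry until some rows match. The function name `top_rating_with_fallback` and its `criteria` parameter are hypothetical, the rows are written inline (mirroring test.txt), and the sketch assumes the criteria are listed from most to least important.

```python
import collections as ct

# Rows mirroring test.txt, written inline so the sketch is self-contained
fields = ("status", "gender", "age_range", "occ", "rating")
rows = [dict(zip(fields, line.split(","))) for line in (
    "ma,M,young,student,PG",
    "ma,F,adult,teacher,R",
    "sin,M,young,student,PG",
    "sin,M,adult,teacher,R",
    "ma,M,young,student,PG",
    "sin,F,adult,teacher,R",
)]


def top_rating_with_fallback(rows, criteria, target="rating"):
    """Return (criteria_used, top_value), dropping trailing criteria until rows match.

    `criteria` is an ordered list of (column, value) pairs.
    """
    criteria = list(criteria)
    while criteria:
        matches = [r for r in rows if all(r[col] == val for col, val in criteria)]
        if matches:
            counts = ct.Counter(r[target] for r in matches)
            return criteria, counts.most_common(1)[0][0]
        criteria.pop()  # no match: drop the last criterion and retry
    raise ValueError("no rows match any prefix of the criteria")


query = [("status", "ma"), ("gender", "M"), ("age_range", "young"), ("occ", "teacher")]
used, rating = top_rating_with_fallback(rows, query)
print(used)    # [('status', 'ma'), ('gender', 'M'), ('age_range', 'young')]
print(rating)  # PG
```

Returning the criteria that were actually used lets the caller see how far the query had to be relaxed.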
