How to plot categorical and continuous data in pandas/matplotlib/seaborn

Problem description

I am trying to figure out how to plot this data:

column 1 ['genres']: These are the value counts for all the genres in the table

Drama              2453
Comedy             2319
Action             1590
Horror              915
Adventure           586
Thriller            491
Documentary         432
Animation           403
Crime               380
Fantasy             272
Science Fiction     214
Romance             186
Family              144
Mystery             125
Music               100
TV Movie             78
War                  59
History              44
Western              42
Foreign               9
Name: genres, dtype: int64

column 2 ['release_year']: These are the value counts for all the release years across the different genres

2014    699
2013    656
2015    627
2012    584
2011    540
2009    531
2008    495
2010    487
2007    438
2006    408
2005    363
2004    307
2003    281
2002    266
2001    241
2000    226
1999    224
1998    210
1996    203
1997    192
1994    184
1993    178
1995    174
1988    145
1989    136
1992    133
1991    133
1990    132
1987    125
1986    121
1985    109
1984    105
1981     82
1982     81
1983     80
1980     78
1978     65
1979     57
1977     57
1971     55
1973     55
1976     47
1974     46
1966     46
1975     44
1964     42
1970     40
1967     40
1972     40
1968     39
1965     35
1963     34
1962     32
1960     32
1969     31
1961     31
Name: release_year, dtype: int64

I need to answer questions like: which genre is most popular from year to year?

What kind of plots can be used, and what is the best way to do this, since there would be a lot of bins in a single chart?

Is seaborn better for plotting such variables?

Should I split the year data into two periods (the 1900s and the 2000s)?

Sample of the table:
   id      popularity  runtime  genres     vote_count  vote_average  release_year
0  135397  32.985763   124      Action     5562        6.5           2015
1  76341   28.419936   120      Action     6185        7.1           1995
2  262500  13.112507   119      Adventure  2480        6.3           2015
3  140607  11.173104   136      Thriller   5292        7.5           2013
4  168259  9.335014    137      Action     2947        7.3           2005

Solution

You could do something like this:

Related: Plotting histogram using seaborn for a dataframe

Personally, I prefer seaborn for this kind of plot because it's easier, but you can use matplotlib too.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# sample data
samples = 300
ids = range(samples)
gind = np.random.randint(0, 5, samples)  # upper bound is exclusive, so this draws keys 0-4
years = np.random.randint(1990, 2000, samples)

# create sample dataframe
gkeys = {1: 'Drama', 2: 'Comedy', 3: 'Action', 4: 'Adventure', 0: 'Thriller'}
df = pd.DataFrame(zip(ids, gind, years),
                  columns=['ID', 'Genre', 'Year'])
df['Genre'] = df['Genre'].replace(gkeys)

# count the year groups
res = df.groupby(['Year', 'Genre']).count()
res = res.reset_index()

# only the max values (uncomment to keep just the top genre per year)
# res_ind = res.groupby('Year')['ID'].idxmax()
# res = res.loc[res_ind]

# viz: one group of bars per year, one bar per genre
sns.set(style="white")
g = sns.catplot(x='Year',
                y='ID',
                hue='Genre',
                data=res,
                kind='bar',
                errorbar=None,  # seaborn >= 0.12; older versions use ci=None
                )
g.set_axis_labels("Year", "Count")
plt.show()
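
The commented-out "only the max values" lines in the example hint at how to answer "which genre is most popular each year". Here is a minimal sketch of that step. It assumes randomly generated sample data (not the asker's table) and the column names ID, Genre and Year from the example. The names counts, top_idx and top_per_year are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, similar in shape to the example above
rng = np.random.default_rng(0)
genres = ['Drama', 'Comedy', 'Action', 'Adventure', 'Thriller']
df = pd.DataFrame({
    'ID': range(300),
    'Genre': rng.choice(genres, size=300),
    'Year': rng.integers(1990, 2000, size=300),
})

# Number of films per (Year, Genre) pair
counts = df.groupby(['Year', 'Genre']).size().reset_index(name='Count')

# For each year, find the row label of the largest Count,
# then keep only those rows (one row per year)
top_idx = counts.groupby('Year')['Count'].idxmax()
top_per_year = counts.loc[top_idx].reset_index(drop=True)

print(top_per_year)
```

If two genres tie within a year, idxmax keeps the first matching row, so one of the tied genres is dropped. Whether that is acceptable depends on the question being asked.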

If there are too many bins in one plot, just split it up.
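
One possible way to split the plot is to draw one panel per decade. The sketch below assumes hypothetical random data covering 1980 to 2009 (not the asker's table) and introduces a Decade column that is not in the original example. It uses the col and col_wrap parameters of sns.catplot to facet the bars.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical sample data spanning three decades
rng = np.random.default_rng(1)
genres = ['Drama', 'Comedy', 'Action', 'Adventure', 'Thriller']
df = pd.DataFrame({
    'ID': range(600),
    'Genre': rng.choice(genres, size=600),
    'Year': rng.integers(1980, 2010, size=600),
})

# Map each year to the first year of its decade, e.g. 1995 -> 1990
df['Decade'] = (df['Year'] // 10) * 10

# Number of films per (Decade, Year, Genre)
counts = (df.groupby(['Decade', 'Year', 'Genre'])
            .size()
            .reset_index(name='Count'))

# One panel per decade, stacked vertically;
# sharex=False lets each panel keep its own x axis
g = sns.catplot(data=counts, x='Year', y='Count', hue='Genre',
                col='Decade', col_wrap=1, kind='bar',
                sharex=False, height=3, aspect=3)
g.set_axis_labels("Year", "Count")
```

Call plt.show() from matplotlib.pyplot (or g.savefig with a file name) to display or save the figure. The same idea works with any other grouping column, for example a coarser split into two periods.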
