How to annotate a stacked bar chart with word count and column name


Question

My question is about plotting word frequencies in a stacked bar plot, with each bar segment labelled by its word rather than only by a number. Suppose I have these words:

Date        Text       Count
01/01/2020  cura          25
            destra        24
            fino          18
            guerra        13
            americani     13
02/01/2020  italia       137
            turismo      112
            nuovi        109
            pizza         84
            moda          79

The table was created by grouping by Date, counting the values in Text, and keeping the top 5 words per date with head(5).

Attempt (this generates a stacked plot, but the colours and labels are not what I want):

data.groupby('Date').agg({'Text': 'value_counts'}).rename(columns={'Text': 'Count'}).groupby('Date').head(5).unstack().plot(kind='bar', stacked=True)

Request: my expected output is a bar chart with the dates on the x-axis and the word frequencies on the y-axis. Each word on the same date should have its own colour, as in a stacked plot, and each bar segment should show the word and its frequency.

Example: please see the example stacked plot below, which should help explain what I would like to do (if it is possible). In the bars, instead of the numbers (340, 226, ...), I would like the names of the top words selected by the code above, together with their frequencies. The x-axis should show the dates from my table, not the years (I could not find a better plot on the web). The first bar shows the top 4 words (there should be 5, but I only found a bar chart with 4 groups) and how I would like to visualise the results. For the size of the chart, please keep in mind that I have 200 dates.

If you can show me how to do this, even with another dataset, that would be great. Thank you in advance for your time.

Recommended answer

Create the dataframe

import pandas as pd
import matplotlib.pyplot as plt

# data and dataframe
data = {'Date': ['01/01/2020', '01/01/2020', '01/01/2020', '02/01/2020', '02/01/2020', '02/01/2020'],
        'Text': [['cura']*25, ['destra']*24, ['fino']*18, ['italia']*137, ['turismo']*112, ['nuovi']*109]}

df = pd.DataFrame(data)

df = df.explode('Text')

df.Date = pd.to_datetime(df.Date)
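
If the counts are already aggregated, as in the question's Date/Text/Count table, the explode step may not be needed. The following is a minimal sketch under that assumption: the names counts and wide are illustrative, and the explicit month-first date format is an assumption chosen to match the dates printed later in this answer. Note that wide has single-level columns, so a cols list built from it would use the column names directly rather than x[1].

```python
import pandas as pd

# long table with one row per (date, word) pair, as in the question
counts = pd.DataFrame({
    'Date': ['01/01/2020'] * 3 + ['02/01/2020'] * 3,
    'Text': ['cura', 'destra', 'fino', 'italia', 'turismo', 'nuovi'],
    'Count': [25, 24, 18, 137, 112, 109],
})

# assumption: month-first dates; use '%d/%m/%Y' if the data is day-first
counts['Date'] = pd.to_datetime(counts['Date'], format='%m/%d/%Y')

# one row per date, one column per word; NaN where a word is absent on a date
wide = counts.pivot(index='Date', columns='Text', values='Count')
print(wide)
```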

groupby and plot

  • To plot the words, note that each date row has every word as a column.
  • Even where a word has a count of 0 for a date, the plotting API still includes that entry.
  • The API plots the first column for all dates, then the next column for all dates, and so on.
  • As such, the cols list, used for the text annotations, must repeat each word once for every date in df_gb.
  • If you need to use head(), swap the df_gb line in the code below for:
    • df_gb = df.groupby('Date').agg({'Text': 'value_counts'}).rename(columns={'Text': 'Count'}).groupby('Date').head(2).unstack()
df_gb = df.groupby(['Date']).agg({'Text': 'value_counts'}).rename(columns={'Text': 'Count'}).unstack('Text')

print(df_gb)

           Count                                   
Text        cura destra  fino italia  nuovi turismo
Date                                               
2020-01-01  25.0   24.0  18.0    NaN    NaN     NaN
2020-02-01   NaN    NaN   NaN  137.0  109.0   112.0

# create list of words of appropriate length; all words repeat for each date
cols = [x[1] for x in df_gb.columns for _ in range(len(df_gb))]

# plot df_gb
ax = df_gb.plot.bar(stacked=True)

# annotate the bars
for i, rect in enumerate(ax.patches):
    # Find where everything is located
    height = rect.get_height()
    width = rect.get_width()
    x = rect.get_x()
    y = rect.get_y()

    # The height of the bar is the count value and can be used as the label
    label_text = f'{height:.0f}: {cols[i]}'

    # centre of the bar segment
    label_x = x + width / 2
    label_y = y + height / 2

    # don't include label if it's equivalently 0
    if height > 0.001:
        ax.text(label_x, label_y, label_text, ha='center', va='center', fontsize=8)

# rename xtick labels; remove time
ticks, labels = plt.xticks(rotation=90)
labels = [label.get_text()[:10] for label in labels]
plt.xticks(ticks=ticks, labels=labels)

ax.get_legend().remove()
plt.show()
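
The ordering described in the bullets (first column for all dates, then the next column) can be checked directly. This is a small sketch on a hand-built table, not part of the original answer; the name wide and the Agg backend are assumptions made so it runs on its own without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; assumption for running headless
import numpy as np
import pandas as pd

# one row per date, one column per word; NaN where a word is absent on a date
wide = pd.DataFrame(
    {'cura': [25, None], 'destra': [24, None], 'italia': [None, 137]},
    index=pd.to_datetime(['2020-01-01', '2020-02-01']),
)

ax = wide.plot.bar(stacked=True, legend=False)

# heights in the order matplotlib stored the bar patches (missing counts become 0)
heights = np.nan_to_num([p.get_height() for p in ax.patches]).tolist()

# the same values read column by column (column-major order)
expected = wide.fillna(0).to_numpy().flatten(order='F').tolist()

# each word repeated once per date, matching the patch order
cols = [word for word in wide.columns for _ in range(len(wide))]
print(list(zip(cols, heights)))
print(heights == expected)
```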
          

See the Stack Overflow question "How to annotate each segment of a stacked bar chart?" for another example.
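
As an alternative to looping over ax.patches, newer matplotlib versions offer Axes.bar_label, which works per bar container and so avoids building the repeated cols list. This is a sketch rather than the answer's method, and it assumes matplotlib 3.4 or later. The table wide is hand-built for illustration, and the figure width of 0.4 inch per date is an arbitrary assumption aimed at the question's 200 dates.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; assumption for running headless
import pandas as pd

# one row per date, one column per word; NaN where a word is absent on a date
wide = pd.DataFrame(
    {'cura': [25, None], 'destra': [24, None],
     'italia': [None, 137], 'turismo': [None, 112]},
    index=pd.to_datetime(['2020-01-01', '2020-02-01']),
)

# scale the width with the number of dates (0.4 inch per date is a guess)
fig_width = max(6, 0.4 * len(wide))
ax = wide.plot.bar(stacked=True, legend=False, figsize=(fig_width, 6))

annotations = []
# ax.containers holds one BarContainer per column, in column order
for word, container in zip(wide.columns, ax.containers):
    # empty label for missing or zero counts, otherwise "count: word"
    labels = [f'{v:.0f}: {word}' if v > 0 else '' for v in container.datavalues]
    annotations += ax.bar_label(container, labels=labels,
                                label_type='center', fontsize=8)

# show only the date part on the x-axis
ax.set_xticklabels([d.strftime('%Y-%m-%d') for d in wide.index], rotation=90)
```

Calling ax.figure.savefig('words.png', bbox_inches='tight') afterwards would write the chart to a file; the file name is only an example.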
