Plot many categories

Question

I have data like the following. Each experiment leads to the appearance of a composition, and each composition belongs to one or more categories. I want to plot the number of occurrences of each composition:

DF <- read.table(text = " Comp         Category

Comp1             1
Comp2             1   
Comp3             4,2
Comp4             1,3
Comp1             1,2
Comp3             3 ", header = TRUE)

barplot(table(DF$Comp))

This works perfectly for me.

Next, since each composition belongs to one or more categories (multiple categories are separated by commas), I want a bar plot with the compositions on the X axis and the number of occurrences of each composition on the Y axis, where each bar shows the percentage of each category.

My idea was to duplicate each row that contains commas, repeating it N+1 times, where N is the number of commas.

DF = table(DF$Category,DF$Comp)
cats <- strsplit(rownames(DF), ",", fixed = TRUE)
DF <- DF[rep(seq_len(nrow(DF)), sapply(cats, length)),]
DF <- as.data.frame(unclass(DF))
DF$cat <- unlist(cats)
DF <- aggregate(. ~ cat, DF, FUN = sum)

For Comp1, for example, it will give me:

          1     2     3     4
Comp1     2     1     0     0

But if I apply this method, the total over the categories (3) won't correspond to the total number of compositions (Comp1 = 2).

How should I proceed in such a case? Is the solution to divide by the number of commas + 1? If so, how do I do this in my code, and is there a simpler way?

Thank you very much!

Recommended answer

Producing your plot requires two steps, as you already noticed. First, one needs to prepare the data; then one can create the plot.

You have already shown your efforts to bring the data into a suitable form, but let me propose an alternative way.

First, I have to make sure that the Category column of the data frame is a character vector and not a factor. I also store a vector of all the categories that appear in the data frame:

DF$Category <- as.character(DF$Category)
cats <- unique(unlist(strsplit(DF$Category, ",")))

I then need to summarise the data. For this purpose, I need a function that gives, for each value of Comp, the share of each category, scaled such that the values sum to the number of rows in the original data with that Comp.

The following function returns this information for the entire data frame in the form of another data frame (the output needs to be a data frame, because I want to use the function with do() later).

cat_perc <- function(cats, vec) {
  # percentages
  nums <- sapply(cats, function(cat) sum(grepl(cat, vec)))
  perc <- nums/sum(nums)
  final <- perc * length(vec)
  df <- as.data.frame(as.list(final))
  names(df) <- cats
  return(df)
}

Running the function on the complete data frame gives:

cat_perc(cats, DF$Category)
##          1         4        2        3
## 1 2.666667 0.6666667 1.333333 1.333333

The values sum to six, which is indeed the total number of rows in the original data frame.
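One caveat worth sketching: grepl() performs pattern matching, so a category label such as "1" would also match an entry like "11". That does not matter for the data shown here, but a variant that compares whole labels is more robust. The function name cat_perc_exact below is hypothetical and not part of the original answer; it is a minimal sketch under the assumption that categories are always separated by plain commas.

```r
# Hypothetical variant of cat_perc() that compares whole category labels
# instead of relying on grepl() pattern matching.
cat_perc_exact <- function(cats, vec) {
  # Split every entry into its individual category labels
  tokens <- strsplit(as.character(vec), ",", fixed = TRUE)
  # Count the rows in which each category appears as a complete label
  nums <- sapply(cats, function(cat) {
    sum(sapply(tokens, function(tk) cat %in% tk))
  })
  # Rescale so that the values sum to the number of rows
  final <- nums / sum(nums) * length(vec)
  df <- as.data.frame(as.list(final))
  names(df) <- cats
  df
}

# Same Category values as in the question
vec <- c("1", "1", "4,2", "1,3", "1,2", "3")
cats <- unique(unlist(strsplit(vec, ",", fixed = TRUE)))
res <- cat_perc_exact(cats, vec)
print(res)
```

On the question's data this should agree with cat_perc(); the two only diverge once one label is a substring of another.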

Now we want to run that function for each value of Comp, which can be done using the dplyr package:

library(dplyr)
plot_data <-
group_by(DF, Comp) %>%
  do(cat_perc(cats, .$Category))
plot_data
## plot_data
## Source: local data frame [4 x 5]
## Groups: Comp [4]
## 
##     Comp        1         4         2         3
##   (fctr)    (dbl)     (dbl)     (dbl)     (dbl)
## 1  Comp1 1.333333 0.0000000 0.6666667 0.0000000
## 2  Comp2 1.000000 0.0000000 0.0000000 0.0000000
## 3  Comp3 0.000000 0.6666667 0.6666667 0.6666667
## 4  Comp4 0.500000 0.0000000 0.0000000 0.5000000

This first groups the data by Comp and then applies the function cat_perc to only the subset of the data frame with a given Comp.
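In current dplyr releases do() is marked as superseded, so it may help to see that the same split-apply-combine step does not depend on it. The following is a minimal base R sketch, not the answer's own method; the names per_comp and plot_data_base are made up for illustration, and cat_perc is repeated from above so that the block is self-contained.

```r
DF <- read.table(text = "Comp Category
Comp1 1
Comp2 1
Comp3 4,2
Comp4 1,3
Comp1 1,2
Comp3 3", header = TRUE, stringsAsFactors = FALSE)

cats <- unique(unlist(strsplit(DF$Category, ",")))

# Same function as in the answer
cat_perc <- function(cats, vec) {
  nums <- sapply(cats, function(cat) sum(grepl(cat, vec)))
  perc <- nums / sum(nums)
  final <- perc * length(vec)
  df <- as.data.frame(as.list(final))
  names(df) <- cats
  return(df)
}

# Split the Category values by Comp and build one one-row data frame per group
per_comp <- lapply(split(DF$Category, DF$Comp), function(v) cat_perc(cats, v))

# Stack the rows and add the group label as a regular column
plot_data_base <- cbind(Comp = names(per_comp), do.call(rbind, per_comp))
rownames(plot_data_base) <- NULL
print(plot_data_base)
```

The result is an ordinary (ungrouped) data frame with one row per composition, which can be reshaped and plotted in the same way as the dplyr result.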

I will plot the data with the ggplot2 package, which requires the data to be in the so-called long format. This means that each data point to be plotted should correspond to one row in the data frame. (As it is now, each row contains 4 data points.) This can be done with the tidyr package as follows:

library(tidyr)
plot_data <-  gather(plot_data, Category, value, -Comp)
head(plot_data)
## Source: local data frame [6 x 3]
## Groups: Comp [4]
## 
##     Comp Category    value
##   (fctr)    (chr)    (dbl)
## 1  Comp1        1 1.333333
## 2  Comp2        1 1.000000
## 3  Comp3        1 0.000000
## 4  Comp4        1 0.500000
## 5  Comp1        4 0.000000
## 6  Comp2        4 0.000000

As you can see, there is now a single data point per row, characterised by Comp, Category and the corresponding value.
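To make the wide-to-long step concrete, here is a small base R sketch of what the reshaping does. It uses a two-row, two-column subset of the table above; the object names wide, value_cols and long are chosen only for this illustration.

```r
# A small wide table: one row per Comp, one column per category
wide <- data.frame(
  Comp = c("Comp1", "Comp2"),
  `1` = c(4 / 3, 1),
  `4` = c(0, 0),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Every column except Comp holds values that should become rows
value_cols <- setdiff(names(wide), "Comp")

# Long format: one row per (Comp, Category) pair
long <- data.frame(
  Comp = rep(wide$Comp, times = length(value_cols)),
  Category = rep(value_cols, each = nrow(wide)),
  value = unlist(wide[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
print(long)
```

Each of the four cells of the wide table ends up as its own row, ordered column by column, which is the same ordering seen in the gather() output above.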

Now that everything is ready, we can plot the data using ggplot:

library(ggplot2)
ggplot(plot_data, aes(x = Comp, y = value, fill = Category)) +
  geom_bar(stat = "identity")
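For comparison, the weighting raised in the question (dividing each row by its number of commas plus one) can be sketched in base R. This is a different scaling from cat_perc above: here a row with k categories contributes 1/k to each of them, so Comp1 yields 1.5 and 0.5 instead of 1.33 and 0.67. Both variants keep the bar height equal to the number of rows for each composition. The names cat_list, n_cats, long and weighted are illustrative assumptions, not part of the original answer.

```r
DF <- read.table(text = "Comp Category
Comp1 1
Comp2 1
Comp3 4,2
Comp4 1,3
Comp1 1,2
Comp3 3", header = TRUE, stringsAsFactors = FALSE)

# Split each row's categories; a row with k categories contributes 1/k to each
cat_list <- strsplit(as.character(DF$Category), ",", fixed = TRUE)
n_cats <- lengths(cat_list)

# One row per (original row, category) pair, carrying its fractional weight
long <- data.frame(
  Comp = rep(DF$Comp, n_cats),
  Category = unlist(cat_list),
  weight = rep(1 / n_cats, n_cats)
)

# Rows are categories, columns are compositions;
# each column sums to the number of rows with that Comp
weighted <- xtabs(weight ~ Category + Comp, data = long)
print(weighted)

# Stacked bar plot with base graphics, one bar per composition
barplot(weighted, legend.text = TRUE, xlab = "Comp", ylab = "Count")
```

Which scaling is preferable depends on whether a row with several categories should count equally towards each of them (this sketch) or whether categories should be weighted by how often they occur within the composition (cat_perc).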
