Why is my stacked area graph in ggplot2 empty?
Question
I am trying to generate a stacked area graph in R using the command below:
ggplot(p_ash_r_100,aes(x=SMPL_TIME,y=SMPL_CNT,col=EVENT,group=1))+ geom_area()
Here EVENT is the third variable, which I want to chart against time and sample counts from the Oracle DB.
But the graph produced by the above command is empty.
My questions are:
1. How do I fix the empty graph problem?
2. How do I keep only the top 10 variables, based on the amount of data, either when displaying or earlier? This is easy to do in Excel, as I show in the image file.
My data set is as follows:
> p_ash_r_100
SMPL_TIME SQL_ID MODULE EVENT SMPL_CNT
1 11-APR-17 09:00 03d5x9busf1d8 SQL*Plus CPU 1
2 11-APR-17 09:00 2pb7bzzadj0pn OGG-RCASI004-OPEN_DATA_SOURCE db file sequential read 1
3 11-APR-17 09:00 NO_SQL GoldenGate CPU 1
4 11-APR-17 09:00 NO_SQL MMON_SLAVE CPU 1
5 11-APR-17 09:00 NO_SQL NO_SQL Log archive I/O 1
6 11-APR-17 09:00 NO_SQL XStream CPU 1
7 11-APR-17 09:00 acuzxh557cq81 GoldenGate db file sequential read 1
8 11-APR-17 09:00 cqtby4bsrmxzh GoldenGate CPU 1
9 11-APR-17 09:00 dgzp3at57cagd GoldenGate db file sequential read 2
10 11-APR-17 09:00 fjp9t92a5yx1v GoldenGate db file sequential read 1
11 11-APR-17 09:00 guh1sva39p9db GoldenGate db file sequential read 1
12 11-APR-17 09:01 0hz0dhgwk12cd GoldenGate direct path write 1
13 11-APR-17 09:01 2jafq5d4n0akv GoldenGate CPU 1
14 11-APR-17 09:01 37cspa0acgqxp GoldenGate db file sequential read 2
15 11-APR-17 09:01 79rugrngrvpt1 OGG-RADDR025-OPEN_DATA_SOURCE db file sequential read 1
16 11-APR-17 09:01 7k6zp92kbv28m GoldenGate CPU 1
17 11-APR-17 09:01 7nvtkfc0bt8vv GoldenGate db file sequential read 1
18 11-APR-17 09:01 7pvpzvd1g769d GoldenGate CPU 1
19 11-APR-17 09:01 9gduk46rmt5jy GoldenGate db file sequential read 1
20 11-APR-17 09:01 NO_SQL GoldenGate CPU 7
Adding an image of the data set below for ease of understanding.
The end graph I want is something like this one from Excel:
Value filters in Excel used to get the top 10 events:
Recommended answer
I'll start with the second question, which is easier. Using the dplyr package, you can use top_n to get the n largest rows for a given column. For example (p_ash_r_100a is the copy of the data with SMPL_TIME converted to a date-time, created further down):
> top_n(p_ash_r_100a, 3, SMPL_CNT) %>% arrange(desc(SMPL_CNT))
# A tibble: 3 × 5
SMPL_TIME SQL_ID MODULE EVENT SMPL_CNT
<dttm> <chr> <chr> <chr> <int>
1 2017-04-11 09:01:00 NO_SQL GoldenGate CPU 7
2 2017-04-11 09:00:00 dgzp3at57cagd GoldenGate db file sequential read 2
3 2017-04-11 09:01:00 37cspa0acgqxp GoldenGate db file sequential read 2
Note that you will get more than n rows if there are ties for nth place. Thus top_n(p_ash_r_100, 10, SMPL_CNT) will return the entire sample data set because of the 17-way tie for 4th place.
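Since the question asks for the top 10 events rather than the top 10 rows, one possible approach is to total SMPL_CNT per EVENT first and then keep only the rows for the highest-ranked events. The following is a minimal sketch, not part of the original answer: the toy data, the names ash, n_top, top_events and ash_top are invented for illustration, and slice_max assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy stand-in for p_ash_r_100 (values invented for illustration)
ash <- tibble(
  EVENT    = c("CPU", "CPU", "db file sequential read",
               "Log archive I/O", "direct path write"),
  SMPL_CNT = c(7, 1, 2, 1, 3)
)

n_top <- 2  # would be 10 for the real data

# Total samples per event, then keep the n_top largest totals
top_events <- ash %>%
  count(EVENT, wt = SMPL_CNT, name = "total") %>%
  slice_max(total, n = n_top, with_ties = FALSE)

# Keep only the rows that belong to those events
ash_top <- ash %>%
  semi_join(top_events, by = "EVENT")
```

Setting with_ties = FALSE caps the result at exactly n_top events, which avoids the tie behaviour described above at the cost of breaking ties arbitrarily.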
As for the first question, the documentation for geom_area provides a clue:
An area plot is the continuous analog of a stacked bar chart (see geom_bar), and can be used to show how composition of the whole varies over the range of x.
This suggests that geom_area expects the column mapped to x to be continuous (numeric or date-time). Based on the listing for p_ash_r_100, SMPL_TIME appears to be a character vector. With the lubridate package, we can convert SMPL_TIME to a date-time with dmy_hm:
# Parse strings such as "11-APR-17 09:00" into date-times (needs dplyr and lubridate)
p_ash_r_100a <- p_ash_r_100 %>%
  mutate_at(vars(SMPL_TIME), dmy_hm)
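To check the conversion on a small scale, the sketch below parses two timestamps in the same format and confirms the result is a date-time. It is an illustration only: the toy tibble and the name toy_a are invented, it assumes an English locale for the month abbreviation, and it uses mutate in place of mutate_at, which should be equivalent here.

```r
library(dplyr)
library(lubridate)

# Toy stand-in with SMPL_TIME stored as character, as in the question
toy <- tibble(
  SMPL_TIME = c("11-APR-17 09:00", "11-APR-17 09:01"),
  SMPL_CNT  = c(1, 7)
)

# dmy_hm() reads day-month-year hour:minute strings into POSIXct values
toy_a <- toy %>%
  mutate(SMPL_TIME = dmy_hm(SMPL_TIME))

class(toy_a$SMPL_TIME)
```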
However, this isn't enough to get the plot you want, since there are multiple values of y for each combination of x and fill (fill, not col, is the correct aesthetic for geom_area). We need to summarise the data before plotting:
p_ash_r_100a %>%
  group_by(SMPL_TIME, EVENT) %>%
  summarise(total = sum(SMPL_CNT)) %>%
  ggplot(aes(SMPL_TIME, total, fill = EVENT)) +
  geom_area()
Yet the plot is still not correct. This is because not every combination of SMPL_TIME and EVENT is represented in the data set. We need to explicitly tell geom_area that y is equal to zero for those missing rows. One way is to use the handy fill argument in tidyr::spread, then reshape back to long form with gather:
group_by(p_ash_r_100a, SMPL_TIME, EVENT) %>%
  summarise(smpl_sum = sum(SMPL_CNT)) %>%
  spread(EVENT, smpl_sum, fill = 0) %>%
  gather(EVENT, smpl_sum, CPU, `db file sequential read`,
         `direct path write`,
         `Log archive I/O`) %>%
  ggplot(aes(x = SMPL_TIME, y = smpl_sum, fill = EVENT)) +
  geom_area()
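An alternative that avoids listing every event name by hand is tidyr::complete, which adds the missing SMPL_TIME and EVENT combinations directly and fills them with zero. This is a sketch rather than the original answer's method: the toy data and the names toy, filled and p are invented, and the .groups argument of summarise assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for p_ash_r_100a (values invented for illustration)
toy <- tibble(
  SMPL_TIME = as.POSIXct(c("2017-04-11 09:00", "2017-04-11 09:00",
                           "2017-04-11 09:01", "2017-04-11 09:02"),
                         tz = "UTC"),
  EVENT     = c("CPU", "Log archive I/O", "CPU", "CPU"),
  SMPL_CNT  = c(1, 1, 7, 2)
)

filled <- toy %>%
  group_by(SMPL_TIME, EVENT) %>%
  summarise(smpl_sum = sum(SMPL_CNT), .groups = "drop") %>%
  # Add a zero row for every time/event pair absent from the data
  complete(SMPL_TIME, EVENT, fill = list(smpl_sum = 0))

p <- ggplot(filled, aes(x = SMPL_TIME, y = smpl_sum, fill = EVENT)) +
  geom_area()
```

Dropping the grouping before complete matters: on a grouped data frame, complete works within each group and would not add the missing pairs.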