Making ggplot2 plot density histograms as lines


Problem description

I have a simple table of ratings I collected from 3 websites (let's say OpenTable, Yelp, TripAdvisor). The ratings go from 1 to 5, so Rating is a factor column, and Website is another factor column (only 3 values allowed). These 2 columns hold all my observations. The structure is a data frame named all containing the aforementioned columns. Example:

Website           Rating
_________________________
Yelp                 1
TripAdvisor          2
Yelp                 3
OpenTable            2

What I would like to do is to have a colored density plot.

My problem looks EXACTLY the same as the one posted in this thread: Create a density plot with ggplot2 using a factor

However, that solution is not working for me. I tried it by just substituting my variable names, using

ggplot(all, aes(rating, colour=website, group=website)) + geom_density()

but it does not work. Instead of giving me an interpolated curve, here is what I get:

It looks to me as though I have the same data structure as the OP in the linked thread: a data frame (all) with two factor columns (website and rating).

> mode(all)
[1] "list"
> head(all$website)
[1] TripAdvisor TripAdvisor TripAdvisor TripAdvisor TripAdvisor TripAdvisor
Levels: TripAdvisor OpenTable Yelp
> head(all$rating)
[1] 1 2 1 4 5 2
Levels: 1 2 3 4 5
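
The console output above reports rating as a factor, whereas the dput() sample further down stores it as a plain numeric vector, so it may be worth checking what the plotting call actually receives. A minimal sketch, assuming a small made-up data frame named ratings_demo (hypothetical, not the original data): a factor has to be converted through its labels, because as.numeric() on a factor returns the internal level codes rather than the rating values.

```r
# Hypothetical stand-in for the real data frame `all`
ratings_demo <- data.frame(
  website = factor(c("Yelp", "TripAdvisor", "Yelp", "OpenTable")),
  rating  = factor(c(5, 2, 3, 2))
)

# Inspect the column types; both columns are factors here
str(ratings_demo)

# as.numeric() on a factor gives the level codes, not the ratings
codes <- as.numeric(ratings_demo$rating)

# Going through the labels recovers the actual rating values
values <- as.numeric(as.character(ratings_demo$rating))

print(codes)
print(values)
```

If the real column turns out to be a factor, passing the converted values to aes() is one way to give geom_density() a continuous variable to smooth.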

My question is: why does my plot behave differently, and what can I do to get the same plot? As a bonus/different solution, I would also like to try interpolating my points with straight lines instead of using more complex kernels, but I need to keep densities, since I have many more observations for one website than for the other 2 combined.

Data sample:

> dput(all[sample(nrow(all), 200),])
structure(list(website = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 
3L, 3L, 1L, 3L, 2L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 
2L, 3L, 3L, 3L, 1L, 3L, 1L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 1L, 3L, 
3L, 3L, 3L, 3L, 2L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 1L, 3L, 
1L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 
3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L, 2L, 1L, 3L, 3L, 3L, 
1L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 3L, 3L, 1L, 3L, 3L, 
3L, 3L, 3L, 3L, 1L, 1L, 3L, 3L, 3L, 3L, 1L, 2L, 3L, 1L, 3L, 3L, 
3L, 3L), .Label = c("TripAdvisor", "OpenTable", "Yelp"), class = "factor"), 
    rating = c(2, 4, 5, 3, 5, 3, 2, 4, 4, 5, 5, 2, 5, 5, 4, 2, 
    5, 4, 5, 5, 4, 4, 3, 5, 3, 2, 4, 4, 4, 2, 4, 5, 3, 4, 5, 
    4, 4, 3, 5, 4, 5, 2, 5, 5, 4, 3, 1, 5, 5, 5, 5, 2, 4, 1, 
    1, 4, 4, 4, 3, 1, 5, 4, 4, 5, 4, 4, 5, 4, 1, 1, 3, 4, 5, 
    5, 5, 4, 5, 2, 3, 4, 2, 4, 4, 4, 3, 2, 4, 4, 4, 4, 5, 4, 
    5, 3, 1, 5, 2, 3, 5, 1, 5, 4, 4, 5, 5, 4, 4, 4, 4, 5, 4, 
    4, 4, 3, 3, 5, 2, 4, 3, 5, 3, 3, 3, 5, 4, 1, 3, 3, 5, 4, 
    4, 2, 2, 4, 3, 2, 5, 5, 5, 4, 5, 1, 2, 5, 2, 4, 2, 5, 3, 
    4, 4, 3, 4, 5, 3, 3, 5, 4, 2, 4, 5, 4, 1, 4, 5, 1, 5, 1, 
    2, 5, 3, 3, 4, 5, 4, 4, 3, 3, 4, 4, 3, 3, 4, 3, 4, 3, 4, 
    5, 3, 2, 5, 3, 4, 4, 1, 5, 4, 3, 5, 3)), .Names = c("website", 
"rating"), row.names = c(2736944L, 3701156L, 4217688L, 5350640L, 
3600261L, 2944052L, 3522393L, 5443298L, 3965562L, 490821L, 4706825L, 
1694078L, 3395609L, 2220568L, 2886121L, 4329867L, 3414341L, 4911507L, 
2629607L, 2547491L, 5254750L, 5089579L, 922864L, 643065L, 1797579L, 
782480L, 686194L, 5035633L, 998745L, 553929L, 888404L, 730158L, 
4357257L, 1824206L, 4941425L, 2910113L, 2006209L, 643302L, 1534660L, 
3489947L, 202175L, 2483374L, 820339L, 3411547L, 4792406L, 1379214L, 
3900503L, 1000939L, 3823518L, 5340233L, 1330743L, 5333146L, 3638755L, 
2445636L, 1057389L, 5092709L, 5092040L, 3841598L, 3739264L, 1482807L, 
1314908L, 2522682L, 1757427L, 723017L, 4809829L, 4636027L, 1728575L, 
2974897L, 3485658L, 2592565L, 3207974L, 2721825L, 4295506L, 4953206L, 
3325724L, 4706765L, 455090L, 5386094L, 612504L, 3483673L, 881132L, 
1715784L, 4478951L, 1995026L, 1640553L, 4213693L, 925338L, 4541407L, 
3602299L, 5233082L, 727017L, 4954392L, 270757L, 3436121L, 3793314L, 
824985L, 1558576L, 3659425L, 2131835L, 1721671L, 32696L, 3405602L, 
2736827L, 4403647L, 2171731L, 2954043L, 976434L, 3680791L, 30799L, 
4833704L, 3895171L, 4469617L, 2517017L, 4236947L, 733711L, 1480361L, 
255671L, 4847331L, 355851L, 2933805L, 5470569L, 3045714L, 3423394L, 
475428L, 4460007L, 4668961L, 1560070L, 3314368L, 2150067L, 4480758L, 
781676L, 3659111L, 4799721L, 3509779L, 5320687L, 5179115L, 852931L, 
4141898L, 4768793L, 1356381L, 3881247L, 1685112L, 2232222L, 315374L, 
1721551L, 1464571L, 2472040L, 3198238L, 4719488L, 2763751L, 2999152L, 
2042160L, 1374928L, 1703496L, 1805583L, 5192311L, 3558389L, 925026L, 
5497787L, 2464617L, 1850617L, 1047932L, 186007L, 3168546L, 1433736L, 
1548105L, 5450L, 5288180L, 2476807L, 997242L, 4693332L, 5107109L, 
3338800L, 2722363L, 58422L, 3408902L, 4537803L, 2780976L, 2129998L, 
376274L, 1773109L, 5138810L, 2364642L, 1087043L, 3318862L, 1567254L, 
418564L, 726387L, 4128160L, 4669905L, 1194602L, 2315020L, 211234L, 
818018L, 3378122L, 462827L, 1516313L, 3120210L, 4257323L, 5214034L
), class = "data.frame")

Solution

As @joran pointed out in his comment, it all seems to be a matter of bandwidth. If I plot your sample data with a low bandwidth (adjust is a multiplier on the default bandwidth, so 0.1 means a tenth of it), it looks like the image you provided:

ggplot(all, aes(rating, colour=website, group=website)) + geom_density(adjust=0.1)

But with a high bandwidth, it looks quite different:

ggplot(all, aes(rating, colour=website, group=website)) + geom_density(adjust=2)
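
To see why the two plots differ, it can help to look at the bandwidth numbers directly. The sketch below assumes that geom_density() passes adjust on to the same kind of kernel estimate as base R's density(), where adjust multiplies the default bandwidth. The vector x is made-up integer rating data, not the original sample, and the helper at() is only for illustration.

```r
set.seed(1)
# Hypothetical ratings: integers from 1 to 5, skewed towards high scores
x <- sample(1:5, 200, replace = TRUE, prob = c(0.1, 0.1, 0.2, 0.3, 0.3))

default_bw <- bw.nrd0(x)            # rule-of-thumb bandwidth used by default
d_low  <- density(x, adjust = 0.1)  # bandwidth = 0.1 * default
d_high <- density(x, adjust = 2)    # bandwidth = 2 * default

print(c(default = default_bw, low = d_low$bw, high = d_high$bw))

# Illustrative helper: read the estimated density at a given point
at <- function(d, x0) approx(d$x, d$y, xout = x0)$y

# Halfway between two ratings there is no data: the narrow kernel drops
# to about zero there, while the wide kernel fills the gap
print(c(low = at(d_low, 3.5), high = at(d_high, 3.5)))
```

Because the ratings only take the values 1 to 5, a narrow kernel leaves a separate spike at each value, while a wide kernel blends neighbouring spikes into one smooth curve.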

If you just want to plot your relative frequencies connected with lines, I think you must compute them beforehand. For example:

all.prop <- data.frame(prop.table(table(website=all$website, rating=all$rating),1))
ggplot(all.prop, aes(x=rating, y=Freq)) + geom_line(aes(group=website, color=website))
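
To make the prop.table() step concrete, here is a small sketch on made-up data (the data frame demo and its values are hypothetical). With margin 1 the proportions are computed within each row of the table, that is within each website, so every website's frequencies sum to 1 regardless of how many observations it has.

```r
# Hypothetical data: one website has far more observations than the other
demo <- data.frame(
  website = c(rep("Yelp", 8), rep("OpenTable", 2)),
  rating  = c(5, 5, 4, 4, 4, 3, 2, 1, 5, 3)
)

# Cross-tabulate, then convert counts to per-website proportions
counts    <- table(website = demo$website, rating = demo$rating)
demo.prop <- data.frame(prop.table(counts, 1))  # columns: website, rating, Freq

# Each website's relative frequencies add up to 1
sums <- tapply(demo.prop$Freq, demo.prop$website, sum)
print(sums)

# Draw the lines only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(demo.prop, aes(x = rating, y = Freq)) +
    geom_line(aes(group = website, colour = website))
  print(p)
}
```

Because rating comes back from data.frame() as a factor, the x axis is discrete, and the group aesthetic is what tells geom_line() which points to connect.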
