Plotting the frequency of string matches over time in R


Problem description


I've compiled a corpus of tweets sent over the past few months, which looks something like this (the actual corpus has many more columns and rows, but you get the idea):

id      when            time        day month   year    handle  what
UK1.1   Sat Feb 20 2016 12:34:02    20  2       2016    dave    Great goal by #lfc
UK1.2   Sat Feb 20 2016 15:12:42    20  2       2016    john    Can't wait for the weekend 
UK1.3   Sat Mar 01 2016 12:09:21    1   3       2016    smith   Generic boring tweet

Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.

I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.

The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.

So basically, I'd like to make plots where the frequency of string x is plotted against time.

Thanks!

Solution

Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:

library(dplyr)
library(lubridate)

# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000), 
                 what = sample(LETTERS, 10000, replace=TRUE))

tweet.summary = dat %>% group_by(day = date(time)) %>%  # To summarise by month: group_by(month = month(time, label=TRUE))
  summarise(total.tweets = n(),
            A.tweets = sum(grepl("A", what)),
            pct.A = A.tweets/total.tweets,
            B.tweets = sum(grepl("B", what)),
            pct.B = B.tweets/total.tweets)            

tweet.summary 

          day total.tweets A.tweets      pct.A B.tweets      pct.B
1  2016-01-01           28        3 0.10714286        0 0.00000000
2  2016-01-02           27        0 0.00000000        1 0.03703704
3  2016-01-03           28        4 0.14285714        1 0.03571429
4  2016-01-04           27        2 0.07407407        2 0.07407407
...
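
The same idea can be applied to real tweet text rather than single letters. The following is a minimal base R sketch, assuming a hypothetical three-row data frame `tweets` modelled on the sample in the question; the column name `has_lfc` is made up for illustration. It relies on the fact that the mean of a logical vector is the proportion of TRUE values.

```r
# Hypothetical miniature of the question's corpus (not the real data)
tweets <- data.frame(
  day  = as.Date(c("2016-02-20", "2016-02-20", "2016-03-01")),
  what = c("Great goal by #lfc",
           "Can't wait for the weekend",
           "Generic boring tweet"),
  stringsAsFactors = FALSE
)

# TRUE where the tweet text contains the hashtag, ignoring case
tweets$has_lfc <- grepl("#lfc", tweets$what, ignore.case = TRUE)

# Proportion of matching tweets per day (mean of a logical = share of TRUE)
lfc_by_day <- aggregate(has_lfc ~ day, data = tweets, FUN = mean)
lfc_by_day
```

Note that the pattern `"#lfc"` would also match longer hashtags that merely start with those characters; a pattern such as `"#lfc\\b"` is one way to require a word boundary after the tag.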

Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:

library(ggplot2)
library(reshape2)
library(scales)

ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
         summarise(A = sum(grepl("A", what))/n(),
                   B = sum(grepl("B", what))/n()) %>%
         melt(id.vars="Month"),
       aes(Month, value, colour=variable, group=variable)) +
  geom_line() +
  theme_bw() +
  scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
  labs(colour="", y="")
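
The question also mentions hourly resolution. As a rough sketch, the same normalised frequency can be computed per hour in base R by truncating each timestamp to the start of its hour. The data frame `dat_h` and the column `is_A` are hypothetical names used only for this illustration, and the data are simulated.

```r
# Simulated data: 500 "tweets" spread over three days (UTC)
set.seed(1)
start <- as.POSIXct("2016-01-01 00:00:00", tz = "UTC")
dat_h <- data.frame(
  time = start + sort(sample(0:(3 * 86400 - 1), 500)),
  what = sample(LETTERS, 500, replace = TRUE),
  stringsAsFactors = FALSE
)

# Truncate every timestamp to the start of its hour
dat_h$hour <- as.POSIXct(trunc(dat_h$time, units = "hours"))

# Flag matches, then take the share of matching tweets within each hour
dat_h$is_A <- grepl("A", dat_h$what)
hourly <- aggregate(is_A ~ hour, data = dat_h, FUN = mean)
head(hourly)
```

Hours with no tweets at all simply do not appear in `hourly`, so a line plot of this summary will connect across any such gaps.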

Regarding your date formatting issue, here's how to get numeric dates: you can turn the day, month, and year columns into a date using as.Date, and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.

# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016, 
                  time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))

# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-", 
                                         sprintf("%02d",month),"-", 
                                         sprintf("%02d", day)," ", 
                                         time)))

# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-", 
                                      sprintf("%02d",month),"-", 
                                      sprintf("%02d", day))))

dat2

   day month year  time          posix.date       date
1   28    10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2   22     6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3    3     4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4   15     8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5    6     2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6    2    12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7    4    11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8   12     3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9   24     5 2016 08:47 2016-05-24 08:47:00 2016-05-24 
10  27     1 2016 04:22 2016-01-27 04:22:00 2016-01-27
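
If you would rather parse the 'when' and 'time' columns from the question directly, something along these lines may work. This is a sketch under assumptions: the data frame `corpus` is a hypothetical copy of the three sample rows, the time zone is assumed to be UTC, and `%b` expects English month abbreviations, so it depends on the session locale.

```r
# Hypothetical copy of the sample rows from the question
corpus <- data.frame(
  when = c("Sat Feb 20 2016", "Sat Feb 20 2016", "Sat Mar 01 2016"),
  time = c("12:34:02", "15:12:42", "12:09:21"),
  stringsAsFactors = FALSE
)

# Drop the leading weekday name; it is not needed to identify the date
date_part <- sub("^[A-Za-z]+ ", "", corpus$when)   # e.g. "Feb 20 2016"

# Combine date and time text and parse it (assumes UTC and English month names)
corpus$posix.date <- as.POSIXct(paste(date_part, corpus$time),
                                format = "%b %d %Y %H:%M:%S", tz = "UTC")

# A plain Date column for day-level grouping
corpus$date <- as.Date(corpus$posix.date)
corpus
```

Rows that fail to parse come back as NA, so checking `anyNA(corpus$posix.date)` after the conversion is a cheap sanity test.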

You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
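
This also addresses the '2.13' versus '2.7' problem from the question. A small illustration, using two made-up variable names (`feb07`, `feb13`):

```r
feb07 <- as.Date("2016-02-07")
feb13 <- as.Date("2016-02-13")

# The decimal encoding sorts February 13th before February 7th
2.13 < 2.7                  # TRUE

# Date objects compare chronologically
feb13 < feb07               # FALSE

# Underlying values are days since 1970-01-01, so differences are day counts
as.numeric(feb07)
as.numeric(feb13)
as.numeric(feb13 - feb07)   # 6
```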
