Plotting the frequency of string matches over time in R
Problem description
I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea):
id     when             time      day  month  year  handle  what
UK1.1  Sat Feb 20 2016  12:34:02  20   2      2016  dave    Great goal by #lfc
UK1.2  Sat Feb 20 2016  15:12:42  20   2      2016  john    Can't wait for the weekend
UK1.3  Sat Mar 01 2016  12:09:21  1    3      2016  smith   Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Recommended answer
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"), as.POSIXct("2016-12-31"), length.out=10000),
                 what = sample(LETTERS, 10000, replace=TRUE))

# To summarise by month: group_by(month = month(time, label=TRUE))
tweet.summary = dat %>%
  group_by(day = date(time)) %>%
  summarise(total.tweets = n(),
            A.tweets = sum(grepl("A", what)),
            pct.A = A.tweets/total.tweets,
            B.tweets = sum(grepl("B", what)),
            pct.B = B.tweets/total.tweets)

tweet.summary
         day total.tweets A.tweets      pct.A B.tweets      pct.B
1 2016-01-01           28        3 0.10714286        0 0.00000000
2 2016-01-02           27        0 0.00000000        1 0.03703704
3 2016-01-03           28        4 0.14285714        1 0.03571429
4 2016-01-04           27        2 0.07407407        2 0.07407407
...
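The same pattern carries over to real tweet text. Below is a minimal sketch, assuming a data frame with a `date` column and a `what` column as in the question; the `tweets` data frame and the `lfc.*` column names are made up for illustration. It counts tweets per day that mention the #lfc hashtag and normalises by the day's total.

```r
library(dplyr)

# Hypothetical mini-corpus; column names follow the layout in the question
tweets <- data.frame(
  date = as.Date(c("2016-02-20", "2016-02-20", "2016-03-01", "2016-03-01")),
  what = c("Great goal by #lfc", "Can't wait for the weekend",
           "Generic boring tweet", "Come on #LFC"),
  stringsAsFactors = FALSE
)

# Per-day totals, number of matching tweets, and the matching share
lfc.summary <- tweets %>%
  group_by(date) %>%
  summarise(total.tweets = n(),
            lfc.tweets   = sum(grepl("#lfc", what, ignore.case = TRUE)),
            pct.lfc      = lfc.tweets / total.tweets)

lfc.summary
```

Note that the pattern `"#lfc"` would also match longer hashtags that start with the same characters, so a stricter regular expression may be needed depending on the corpus.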
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
         summarise(A = sum(grepl("A", what))/n(),
                   B = sum(grepl("B", what))/n()) %>%
         melt(id.var="Month"),
       aes(Month, value, colour=variable, group=variable)) +
  geom_line() +
  theme_bw() +
  scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
  labs(colour="", y="")
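If monthly bins are too coarse and daily ones too noisy, weekly bins on a true date axis are one possible middle ground. The sketch below is an illustration rather than part of the original answer: it rebuilds the same fake data (with an explicit UTC time zone, which is my assumption), bins it with lubridate's `floor_date()`, and plots the weekly share of matches with `scale_x_date()`. The names `weekly` and `p` are made up.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Same fake data as above, with the time zone pinned to UTC (an assumption)
set.seed(485)
dat <- data.frame(time = seq(as.POSIXct("2016-01-01", tz = "UTC"),
                             as.POSIXct("2016-12-31", tz = "UTC"),
                             length.out = 10000),
                  what = sample(LETTERS, 10000, replace = TRUE))

# Share of tweets matching "A" in each week; the week is stored as a Date
weekly <- dat %>%
  group_by(week = as.Date(floor_date(time, "week"))) %>%
  summarise(pct.A = mean(grepl("A", what)))

# A Date column on the x-axis lets ggplot2 place and label the ticks as dates
p <- ggplot(weekly, aes(week, pct.A)) +
  geom_line() +
  scale_x_date(date_labels = "%b %Y") +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_bw() +
  labs(x = "", y = "Share of tweets matching \"A\"")
p
```

Taking the `mean()` of the logical vector returned by `grepl()` gives the same proportion as `sum(...)/n()`.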
Regarding your date formatting issue, here's how to get numeric dates: you can turn the day, month, and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
                  time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))

# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
                                               sprintf("%02d",month),"-",
                                               sprintf("%02d", day)," ",
                                               time)))

# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
                                      sprintf("%02d",month),"-",
                                      sprintf("%02d", day))))

dat2
   day month year  time          posix.date       date
1   28    10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2   22     6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3    3     4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4   15     8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5    6     2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6    2    12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7    4    11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8   12     3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9   24     5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10  27     1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).
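To make the ordering problem from the question concrete, here is a small base-R sketch with made-up values for 13 February and 7 February 2016 (the data frame `d` is hypothetical). It shows that the decimal encoding sorts the two days the wrong way round, while a Date built from the same columns sorts chronologically.

```r
# Hypothetical day/month/year values: 13 Feb 2016 and 7 Feb 2016
d <- data.frame(day = c(13L, 7L), month = c(2L, 2L), year = 2016L)

# The decimal encoding from the question: 2.13 and 2.7
d$decimal <- as.numeric(paste0(d$month, ".", d$day))
d$decimal[1] < d$decimal[2]        # TRUE: the 13th is treated as "earlier"

# A Date built from the same columns compares in calendar order
d$date <- as.Date(sprintf("%d-%02d-%02d", d$year, d$month, d$day))
d$date[1] > d$date[2]              # TRUE: the 13th comes after the 7th
as.numeric(d$date[1] - d$date[2])  # 6 (days between the two dates)
```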