Not sure why dcast() this data set results in dropping variables
I have a data frame that looks like:
   id fromuserid touserid from_country to_country length
 1  1   54525953 47195889           US         US      2
 2  2   54525953 54361607           US         US      1
 3  3   54525953 53571081           US         US      2
 4  4   41943048 55379244           US         US      1
 5  5   47185938 53140304           US         PR      1
 6  6   47185938 54121387           US         US      1
 7  7   54525974 50928645           GB         GB      1
 8  8   54525974 53495302           GB         GB      1
 9  9   51380247 45214216           SG         SG      2
10 10   51380247 43972484           SG         US      2
Each row describes a number of messages (length) sent from one user to another user.
What I would like to do is create a visualization (via a chord diagram in D3) of the messages sent between each country.
There are almost 200 countries. I use the function dcast as follows:
countries <- dcast(chats,from_country ~ to_country,drop=FALSE,fill=0)
This worked before for me when I had a smaller data set and fewer variables, but this data set is over 3M rows, and not easy to debug, so to speak.
At any rate, what I am getting now is a matrix that is not square, and I can't figure out why not. What I am expecting to get is essentially a matrix where the (i,j)th cell represents the messages sent from country i to country j. What I end up with is something very close to this, but with some rows and columns obviously missing, which is easy to spot because US->US messages show up shifted by one row or column.
So here's my question. Is there anything I'm doing that is obviously wrong? If not, is there something "strange" I should be looking for in the data set to sort this out?
Be sure that your "from_country" and "to_country" variables are factors, and that they share the same levels. Using the example data you shared:
chats$from_country <- factor(chats$from_country,
                             levels = unique(c(chats$from_country,
                                               chats$to_country)))
chats$to_country <- factor(chats$to_country,
                           levels = levels(chats$from_country))
dcast(chats, from_country ~ to_country, drop = FALSE, fill = 0)
# Using length as value column: use value.var to override.
# Aggregation function missing: defaulting to length
#   from_country US GB SG PR
# 1           US  5  0  0  1
# 2           GB  0  2  0  0
# 3           SG  1  0  1  0
# 4           PR  0  0  0  0
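Note that the cast above counts rows per country pair (see the message "Aggregation function missing: defaulting to length") rather than summing the length column. If the goal is the total number of messages, a minimal sketch in base R could look like the following. It swaps in xtabs() for dcast() so that it runs without reshape2; the data frame is retyped from the sample in the question, and all_countries and msg_totals are illustrative names, not part of the original code.

```r
# Sample data retyped from the question (country columns are plain character).
chats <- data.frame(
  from_country = c("US", "US", "US", "US", "US", "US", "GB", "GB", "SG", "SG"),
  to_country   = c("US", "US", "US", "US", "PR", "US", "GB", "GB", "SG", "US"),
  length       = c(2, 1, 2, 1, 1, 1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# One shared level set, so rows and columns cover the same countries.
all_countries <- union(chats$from_country, chats$to_country)
chats$from_country <- factor(chats$from_country, levels = all_countries)
chats$to_country   <- factor(chats$to_country,   levels = all_countries)

# Sum the length column for every (from, to) pair.
# xtabs() keeps unused factor levels by default, so the result stays square.
msg_totals <- xtabs(length ~ from_country + to_country, data = chats)
print(msg_totals)
```

With reshape2, the same totals should be obtainable by passing value.var = "length" and fun.aggregate = sum to the dcast() call shown above, keeping drop = FALSE and fill = 0.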
If your "from_country" and "to_country" variables are already factors, but not with the same levels, you can do something like this for the first step:
chats$from_country <- factor(chats$from_country,
                             levels = unique(c(levels(chats$from_country),
                                               levels(chats$to_country))))
Why is this necessary? If they are already factors, then c(chats$from_country, chats$to_country) will coerce the factors to their underlying integer codes (this is the behavior in R versions before 4.1.0; since 4.1.0, c() combines factors into a factor). Those integers don't match any of the character levels of the factors, so the result is <NA>.
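To check whether mismatched levels are what makes the result non-square, and to build the shared level set in a way that does not depend on how c() treats factors, something like the following sketch may help. The vectors from and to are small made-up stand-ins for the two country columns, and every variable name here is illustrative.

```r
# Two small factors standing in for from_country and to_country.
from <- factor(c("US", "US", "GB", "SG"))
to   <- factor(c("US", "PR", "GB", "US"))

# Levels present on one side only: these are the rows or columns
# that go missing from a cast.
only_in_from <- setdiff(levels(from), levels(to))
only_in_to   <- setdiff(levels(to), levels(from))

# A factor is stored as integer codes; this is what gets combined
# when a factor is coerced instead of converted to character.
codes <- as.integer(from)

# Working from the level names (character) avoids the coercion issue.
shared <- union(levels(from), levels(to))
from2  <- factor(as.character(from), levels = shared)
to2    <- factor(as.character(to),   levels = shared)

# table() keeps unused levels, so the cross-tabulation is square.
counts <- table(from2, to2)
print(counts)
```

If only_in_from or only_in_to is non-empty on the real data, the two columns do not share a level set, which is the situation the answer describes.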