Not sure why dcast() this data set results in dropping variables


Problem description

I have a data frame that looks like:

   id fromuserid touserid from_country to_country length
1   1   54525953 47195889           US         US      2
2   2   54525953 54361607           US         US      1
3   3   54525953 53571081           US         US      2
4   4   41943048 55379244           US         US      1
5   5   47185938 53140304           US         PR      1
6   6   47185938 54121387           US         US      1
7   7   54525974 50928645           GB         GB      1
8   8   54525974 53495302           GB         GB      1
9   9   51380247 45214216           SG         SG      2
10 10   51380247 43972484           SG         US      2

Each row describes a number of messages (length) sent from one user to another user.

What I would like to do is create a visualization (via a chord diagram in D3) of the messages sent between each country.

There are almost 200 countries. I use the function dcast as follows:

countries <- dcast(chats,from_country ~ to_country,drop=FALSE,fill=0)

This worked before for me when I had a smaller data set and fewer variables, but this data set is over 3M rows, and not easy to debug, so to speak.

At any rate, what I am getting now is a matrix that is not square, and I can't figure out why not. What I am expecting to get is essentially a matrix where the (i,j)th cell represents the messages sent from country i to country j. What I end up with is something very close to this, but with some rows and columns obviously missing, which is easy to spot because US->US messages show up shifted by one row or column.

So here's my question. Is there anything I'm doing that is obviously wrong? If not, is there something "strange" I should be looking for in the data set to sort this out?

Solution

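Be sure that your "from_country" and "to_country" variables are factors, and that they share the same levels. Otherwise a country that appears only as a sender (or only as a recipient) gets a row but no column (or the reverse), and the result is not square. Using the example data you shared: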

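# Assumes dcast() comes from reshape2: library(reshape2)
# Build one shared set of levels from both columns (works when they are character vectors).
chats$from_country <- factor(chats$from_country, 
                             levels = unique(c(chats$from_country, 
                                               chats$to_country)))
# Reuse exactly the same levels for the destination column.
chats$to_country <- factor(chats$to_country, 
                           levels = levels(chats$from_country))
# drop = FALSE keeps missing combinations, so the result is square.
dcast(chats, from_country ~ to_country, drop = FALSE, fill = 0)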
# Using length as value column: use value.var to override.
# Aggregation function missing: defaulting to length
#   from_country US GB SG PR
# 1           US  5  0  0  1
# 2           GB  0  2  0  0
# 3           SG  1  0  1  0
# 4           PR  0  0  0  0
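
Note that the call above counts rows per country pair, because the aggregation defaults to length. If each cell should instead hold the total number of messages, the length column has to be summed. Here is a minimal base-R sketch of that idea, using xtabs() as a stand-in for dcast() and rebuilding the ten sample rows by hand (the names all_countries and msg_matrix are only illustrative):

```r
# Rebuild the sample rows from the question (only the columns needed here).
chats <- data.frame(
  from_country = c("US", "US", "US", "US", "US", "US", "GB", "GB", "SG", "SG"),
  to_country   = c("US", "US", "US", "US", "PR", "US", "GB", "GB", "SG", "US"),
  length       = c(2, 1, 2, 1, 1, 1, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# One shared set of levels for both columns.
all_countries <- unique(c(chats$from_country, chats$to_country))
chats$from_country <- factor(chats$from_country, levels = all_countries)
chats$to_country   <- factor(chats$to_country,   levels = all_countries)

# xtabs() sums the left-hand side within each from/to pair and keeps
# unused levels, so the result is square.
msg_matrix <- xtabs(length ~ from_country + to_country, data = chats)
dim(msg_matrix)
```

With reshape2, the corresponding change would be to pass value.var = "length" and fun.aggregate = sum to dcast().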

If your "from_country" and "to_country" variables are already factors, but not with the same levels, you can do something like this for the first step:

chats$from_country <- factor(chats$from_country, 
                             levels = unique(c(levels(chats$from_country), 
                                               levels(chats$to_country))))

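Why is this necessary? If they are already factors, then in R versions before 4.1.0, c(chats$from_country, chats$to_country) drops the factor attributes and returns the underlying integer codes. Those codes don't match any of the character values of the factors, so every value becomes <NA>. From R 4.1.0 onward, c() on factors returns a factor whose levels are the union of the inputs' levels, but combining levels() explicitly works in every version.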
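
To see the integer codes behind a factor, and to check which countries occur on only one side (one "strange" thing worth looking for in the full data set), something like the following can help. It is a small base-R sketch on made-up vectors; from, to, only_from, only_to, and all_levels are illustrative names, not part of the original data.

```r
from <- factor(c("US", "GB", "SG"))
to   <- factor(c("US", "PR", "SG"))

# A factor stores integer codes that index into its (sorted) levels.
levels(from)      # "GB" "SG" "US"
as.integer(from)  # 3 1 2

# Countries that appear on one side only; each one costs a row or a column.
only_from <- setdiff(levels(from), levels(to))
only_to   <- setdiff(levels(to), levels(from))

# Taking the union of the levels avoids the integer codes entirely.
all_levels <- union(levels(from), levels(to))
from <- factor(from, levels = all_levels)
to   <- factor(to,   levels = all_levels)

identical(levels(from), levels(to))  # both factors now share one level set
anyNA(from)                          # no values were lost in the conversion
```

If only_from or only_to is non-empty on the real data, those are the countries responsible for the missing rows or columns.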
