CSV of Distances to Dist Object R
Question
I have a very large CSV file of similarities between keywords (large enough that a for loop takes too long in R). When I read it into a data.frame, it looks like this:
> df
kwd1 kwd2 similarity
a b 1
b a 1
c a 2
a c 2
I would like to convert this to a dist object, like this:
> dObject
  a b
b 1
c 2 0
Related post: Convert a dataframe to an object of class "dist" without actually calculating distances in R
Another idea I had was to create a sparse matrix using Matrix(), but I am unsure how to populate the matrix efficiently because my CSV is fairly large. Maybe an apply function?
Maybe reshape()?
---- Update ----
This seems to work on the toy dataset above: http://stats.stackexchange.com/questions/6827/efficient-way-to-populate-matrix-in-r
However, that example uses a dense matrix(), whereas I would like to use a sparse Matrix() for memory reasons.
---- Furthermore ----
There is a similar earlier post. However, I don't think its advice works for this case, in which there isn't a link between every pair of elements in the dataset: the CSV doesn't contain the pairwise similarities between all keywords, as it does in the previous post: Convert a dataframe to an object of class "dist" without actually calculating distances in R
Recommended answer
Try this:
# Generate some dummy data (since you didn't provide your data)
df <- data.frame(V1=sample(letters, 10, TRUE),
V2=sample(letters, 10, TRUE),
V3=sample(200, 10, TRUE))
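df$V1 and df$V2 are now factors (in R versions before 4.0.0, where data.frame() converts strings to factors by default), possibly with different levels, so we need to make them comparable, e.g. make sure "a" in V1 is the same as "a" in V2. The as.character() calls below also work unchanged if the columns are already character vectors.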
# Convert letters to integers
my.objects <- sort(unique(c(as.character(df$V1), as.character(df$V2))))
df$V1 <- match(df$V1, my.objects)
df$V2 <- match(df$V2, my.objects)
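To make the effect of this step concrete, here is a small sketch on the toy data from the question. The names toy, objects, idx1 and idx2 are made up for illustration, and stringsAsFactors = TRUE is set only to mimic the factor columns discussed above.

```r
# Toy data from the question, with keyword columns stored as factors
toy <- data.frame(kwd1 = c("a", "b", "c", "a"),
                  kwd2 = c("b", "a", "a", "c"),
                  similarity = c(1, 1, 2, 2),
                  stringsAsFactors = TRUE)

# One shared, sorted list of all keywords seen in either column
objects <- sort(unique(c(as.character(toy$kwd1), as.character(toy$kwd2))))

# match() compares factors by their labels, so both columns map onto
# the same integer positions in 'objects'
idx1 <- match(toy$kwd1, objects)  # 1 2 3 1
idx2 <- match(toy$kwd2, objects)  # 2 1 1 3
```

After this, the same keyword always gets the same row/column index, regardless of which column it came from.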
Create an empty distance matrix and populate it with the values in V3 at the locations specified by V1 and V2. Finally, we convert it to a proper dist object.
# Create an empty distance matrix
n <- length(my.objects)
dist.mat <- matrix(NA, n, n)
i <- as.matrix(df[-3])
dist.mat[i] <- dist.mat[i[,2:1]] <- df$V3
my.dist <- as.dist(dist.mat)
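Since the question asks about a sparse Matrix() for memory reasons, the following is a possible sketch using sparseMatrix() from the Matrix package, not part of the original answer. It assumes the column names from the question's toy data (kwd1, kwd2, similarity); the names toy, keys, lo, hi, keep and sim are made up for illustration. Because symmetric = TRUE expects entries from one triangle only, each unordered pair is first moved to the upper triangle and de-duplicated.

```r
library(Matrix)  # recommended package, normally installed alongside R

# Toy data from the question; a real run would use the data.frame read from the CSV
toy <- data.frame(kwd1 = c("a", "b", "c", "a"),
                  kwd2 = c("b", "a", "a", "c"),
                  similarity = c(1, 1, 2, 2))

# Shared index for keywords from both columns
keys <- sort(unique(c(as.character(toy$kwd1), as.character(toy$kwd2))))
i <- match(toy$kwd1, keys)
j <- match(toy$kwd2, keys)

# Put every pair in the upper triangle (row < column) and keep it once;
# self-pairs are dropped because a dist object has no diagonal
lo <- pmin(i, j)
hi <- pmax(i, j)
keep <- !duplicated(cbind(lo, hi)) & lo != hi

# Only one triangle is stored; missing pairs are implicit zeros
sim <- sparseMatrix(i = lo[keep], j = hi[keep], x = toy$similarity[keep],
                    dims = c(length(keys), length(keys)),
                    dimnames = list(keys, keys),
                    symmetric = TRUE)
```

Note that a dist object is itself dense (it stores n(n-1)/2 values), so as.dist(as.matrix(sim)) gives up the memory savings, and pairs missing from the CSV come out as 0 here rather than NA as in the dense version above. The sparse form mainly helps if later steps can work on the Matrix object directly.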