Create a hash value for each row of data in a data frame in R
Problem description
I am exploring how to compare two data frames in R more efficiently, and I came up with hashing.
My plan is to create a hash for each row of data in two data frames that have the same columns, using the digest function from the digest package. I expect the hash to be the same for any two identical rows of data.
I tried to give each row of data its own hash, using the code below:
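for (loop.ssi in (1:nrow(ssi.10q3.v1))) {
  # Hash the values of the current row and store the result in a "hash" column
  ssi.10q3.v1[loop.ssi, "hash"] <- digest(as.character(ssi.10q3.v1[loop.ssi, ]))
  # Report progress as "current row / total rows"
  print(paste(loop.ssi, nrow(ssi.10q3.v1), sep = "/"))
  flush.console()
}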
But this is very slow.
Is my approach to comparing data frames correct? If so, are there any suggestions for speeding up the code above? Thanks.
UPDATE
I have updated the code as follows:
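# ddply comes from the plyr package
ssi.10q3.v1[, "uid"] <- 1:nrow(ssi.10q3.v1)
ssi.10q3.v1.hash <- ddply(ssi.10q3.v1,
                          c("uid"),
                          function(df) {
                            # Drop the helper column so it does not affect the hash
                            df[, "uid"] <- NULL
                            hash <- digest(as.character(df))
                            data.frame(hash = hash)
                          },
                          .progress = "text")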
I generated the uid column myself so that each row has a unique key to group by.
Answer
If I understand what you want correctly, digest works directly with apply:
library(digest)
ssi.10q3.v1.hash <- data.frame(uid = 1:nrow(ssi.10q3.v1), hash = apply(ssi.10q3.v1, 1, digest))
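To connect the row hashes back to the original goal of comparing two data frames, here is a minimal sketch. The data frames df1 and df2 and the helper row_hash are hypothetical names used only for illustration; they do not come from the question. The sketch also swaps apply for pasting each row into a single string before hashing. The reason is that apply first coerces the data frame to a matrix, which may format numeric columns differently in the two data frames, so pasting the fields looks like the safer choice when hashes from two separate tables have to match. Treat it as one possible approach under these assumptions, not a definitive implementation.

```r
library(digest)

# Hypothetical toy data frames standing in for the two tables being compared
df1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"), stringsAsFactors = FALSE)
df2 <- data.frame(id = c(1, 2, 4), name = c("a", "b", "d"), stringsAsFactors = FALSE)

# Hypothetical helper: build one string per row, then hash each string.
# The "\r" separator is an assumption; choose one that cannot occur in the data.
row_hash <- function(df) {
  keys <- do.call(paste, c(unname(df), sep = "\r"))
  vapply(keys, digest, character(1), USE.NAMES = FALSE)
}

h1 <- row_hash(df1)
h2 <- row_hash(df2)

# Rows of df1 whose hash does not appear among the hashes of df2
only_in_df1 <- df1[!(h1 %in% h2), ]
print(only_in_df1)
```

Both data frames need the same columns in the same order with the same types for the hashes to be comparable.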