ngram representation and distance matrix in R
Problem description
Suppose we have this data:
a <- c("ham","bamm","comb")
For 1-grams, this is the matrix representation of the above vector (one row per string, one column per character):
# h a m b c o
# 1 1 1 0 0 0
# 0 1 2 1 0 0
# 0 0 1 1 1 1
I know that table(strsplit(a,split = "")[i]) for i in 1:length(a)
gives the separate counts for each string. But I don't know how to use rbind
to combine them into one matrix, since the lengths and column names are different.
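One way around the mismatched column names is to count every string against the same set of factor levels, so each count vector has the same length and order. The following is a minimal base R sketch, not part of the original answer; the names chars, lev and m are chosen here for illustration only.

```r
a <- c("ham", "bamm", "comb")

# Split each string into single characters (1-grams)
chars <- strsplit(a, split = "")

# Shared column set: every distinct character, in order of first appearance
lev <- unique(unlist(chars))

# Count each string against the same levels so all count vectors line up
counts <- vapply(
  chars,
  function(x) as.integer(table(factor(x, levels = lev))),
  integer(length(lev))
)

# vapply returns one column per string, so transpose to one row per string
m <- t(counts)
dimnames(m) <- list(a, lev)
m
```

Because every row is built from the same levels, no rbind over differently shaped tables is needed.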
After that, I want to use either Euclidean or Manhattan distance to compute the pairwise distance (similarity) matrix between the strings:
# ham bamm comb
# ham 0 3 5
# bamm 3 0 4
# comb 5 4 0
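The values in the table above match the Manhattan distance between the rows of the 1-gram count matrix (Euclidean distance would give sqrt(3) for ham versus bamm, not 3). As a rough base R sketch, assuming the count matrix is typed in by hand as m, dist() reproduces the table:

```r
# 1-gram count matrix from the question, entered by hand for illustration
m <- matrix(
  c(1, 1, 1, 0, 0, 0,
    0, 1, 2, 1, 0, 0,
    0, 0, 1, 1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ham", "bamm", "comb"),
                  c("h", "a", "m", "b", "c", "o"))
)

# Pairwise Manhattan distance between rows, expanded to a full square matrix
d <- as.matrix(dist(m, method = "manhattan"))
d
```

Switching to method = "euclidean" gives the Euclidean variant on the same matrix.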
Recommended answer
You can also use the stringdist package.
library(stringdist)
a <- c("ham","bamm","comb")
# stringdistmatrix with qgram calculations
stringdistmatrix(a, a, method = 'qgram')
[,1] [,2] [,3]
[1,] 0 3 5
[2,] 3 0 4
[3,] 5 4 0
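If labelled rows and columns are wanted, as in the question, a possible variation is sketched below. It assumes a stringdist version in which the useNames argument accepts "strings", and it spells out q = 1 (the default) to make the 1-gram choice explicit.

```r
library(stringdist)

a <- c("ham", "bamm", "comb")

# q-gram distance with q = 1; useNames = "strings" labels rows and columns
# with the input strings (assumes a stringdist version supporting this value)
d <- stringdistmatrix(a, a, method = "qgram", q = 1, useNames = "strings")
d
```

With q = 1 the q-gram distance is the sum of absolute differences between character counts, which is why it agrees with the Manhattan distances in the question.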
Recreating the 1-gram counts with stringdist
# creates the total count of the 1-gram
qgrams(a, q = 1L)
h m o a b c
V1 1 4 1 2 2 1
# create a named vector if you want a nice table
names(a) <- a
qgrams(a, .list = a, q = 1L)
# V1 is the row of total counts across all strings
h m o a b c
V1 1 4 1 2 2 1
ham 1 1 0 1 0 0
bamm 0 2 0 1 1 0
comb 0 1 1 0 1 1