ngram representation and distance matrix in R


Question

Suppose we have this data:

a <- c("ham","bamm","comb")

For 1-grams, this is the matrix representation of the vector above, with one row per string and one column per letter:

#  h a m b c o
#  1 1 1 0 0 0
#  0 1 2 1 0 0 
#  0 0 1 1 1 1

I know that table(strsplit(a,split = "")[i]) for i in 1:length(a) gives the letter counts for each string separately. But I don't know how to use rbind to combine them into one matrix, since the lengths and column names differ.

After that, I want to use either Euclidean or Manhattan distance to get the pairwise distance matrix between the strings, like this:

#     ham  bamm comb  
# ham  0    3    5
# bamm 3    0    4
# comb 5    4    0 

Recommended answer

You can also use the stringdist package.

library(stringdist)
a <- c("ham","bamm","comb")

# stringdistmatrix with qgram calculations
stringdistmatrix(a, a, method = 'qgram')

     [,1] [,2] [,3]
[1,]    0    3    5
[2,]    3    0    4
[3,]    5    4    0
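For comparison, the count matrix from the question can also be built in base R. The following is a minimal sketch, not part of the original answer: it assumes each string is split into single characters and counted over a shared alphabet, so that every row has the same columns. The names chars, alphabet and counts are illustrative.

```r
a <- c("ham", "bamm", "comb")

# Split each string into single characters.
chars <- strsplit(a, split = "")

# Shared alphabet, in order of first appearance: h a m b c o.
alphabet <- unique(unlist(chars))

# Count letters per string over the shared alphabet; using a factor with
# fixed levels keeps zero counts, so all rows have identical columns.
counts <- t(sapply(chars, function(x) table(factor(x, levels = alphabet))))
rownames(counts) <- a
counts

# Manhattan distance between the rows of the count matrix.
manhattan <- as.matrix(dist(counts, method = "manhattan"))
manhattan
```

Fixing the factor levels sidesteps the rbind problem from the question, because every per-string table then has the same length and the same names. For this data the Manhattan distances agree with the matrix shown in the question.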

Recreating the 1-gram table with stringdist

# creates the total count of the 1-gram
qgrams(a, q = 1L)
   h m o a b c
V1 1 4 1 2 2 1

# create a named vector if you want a nice table
names(a) <- a
qgrams(a, .list = a, q = 1L)

# V1 is the row of totals across all strings
     h m o a b c
V1   1 4 1 2 2 1
ham  1 1 0 1 0 0
bamm 0 2 0 1 1 0
comb 0 1 1 0 1 1
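The per-string rows of this table can be passed to base R's dist() if Euclidean distance is wanted instead. This is a sketch under the assumption that the table has the row names shown above (V1, ham, bamm, comb); the names tab and per_string are illustrative.

```r
library(stringdist)

a <- c("ham", "bamm", "comb")
names(a) <- a

# Same call as above: one total row (V1) plus one row per string.
tab <- qgrams(a, .list = a, q = 1L)

# Keep only the per-string rows, dropping the V1 total row.
per_string <- tab[c("ham", "bamm", "comb"), , drop = FALSE]

# Pairwise distances between the 1-gram count vectors.
d_manhattan <- as.matrix(dist(per_string, method = "manhattan"))
d_euclidean <- as.matrix(dist(per_string, method = "euclidean"))
d_manhattan
d_euclidean
```

Dropping the total row first matters, since dist() would otherwise treat V1 as a fourth string.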
