How to implement a spectrum kernel function in MATLAB?

Question

A spectrum kernel function operates on strings by counting the n-grams that two strings have in common. For example, 'tool' has three 2-grams ('to', 'oo', and 'ol'), and the similarity between 'tool' and 'fool' is 2 (they share 'oo' and 'ol').

How can I write a MATLAB function that calculates this metric?

Recommended answer

The first step is to create a function that generates the n-grams of a given string. One way to do this in a vectorized fashion is with some clever indexing.

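function [subStrings, counts] = n_gram(fullString, N)
% N_GRAM  Unique N-grams of a string and the number of times each occurs.
  if (N == 1)
    % Simple case: every character is its own 1-gram
    [subStrings, ~, index] = unique(cellstr(fullString.'));
  else
    nString = numel(fullString);
    % Row k of the index matrix holds the positions k:(k+N-1) of the k-th N-gram
    index = hankel(1:(nString-N+1), (nString-N+1):nString);
    [subStrings, ~, index] = unique(cellstr(fullString(index)));
  end
  counts = accumarray(index(:), 1);  % occurrences of each unique N-gram
end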

This uses the function HANKEL to first create a matrix of indices, where each row selects one N-length substring of the given string. Indexing the string with this matrix creates a character array with one N-length substring per row. The function CELLSTR then places each row of the character array into a cell of a cell array. The function UNIQUE removes repeated substrings, and the function ACCUMARRAY counts the occurrences of each unique substring (in case the counts are needed).
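To make these steps concrete, the following sketch traces the intermediate values for the string 'tool' with N = 2. The variable names `subStringMatrix` and `position` are introduced here for illustration only and do not appear in the answer's function.

```matlab
fullString = 'tool';
N = 2;
nString = numel(fullString);

% Each row of the index matrix holds the character positions of one 2-gram.
index = hankel(1:(nString-N+1), (nString-N+1):nString);
% index is [1 2; 2 3; 3 4]

% Indexing with the matrix gives one substring per row: ['to'; 'oo'; 'ol'].
subStringMatrix = fullString(index);

% UNIQUE sorts the substrings and also returns, for every row,
% its position in the sorted list of unique substrings.
[subStrings, ~, position] = unique(cellstr(subStringMatrix));
% subStrings is {'ol'; 'oo'; 'to'}

% ACCUMARRAY adds 1 for every occurrence of each position.
counts = accumarray(position(:), 1);
% counts is [1; 1; 1]
```

Note that CELLSTR drops trailing whitespace from each row, so an n-gram that ends in a space comes out shorter than N characters.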

With the above function, you can then count the number of n-grams shared between two strings using the INTERSECT function:

subStrings1 = n_gram('tool',2);
subStrings2 = n_gram('fool',2);
sharedStrings = intersect(subStrings1,subStrings2);
nShared = numel(sharedStrings);
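The snippet above counts each shared n-gram once, however often it occurs. The spectrum kernel is usually defined as the inner product of the two n-gram count vectors, so repeated n-grams carry more weight. A minimal sketch of that variant follows, assuming it is saved as its own file; `spectrum_kernel` and `local_ngrams` are hypothetical names, not part of the original answer, and the helper repeats the HANKEL indexing so that the file is self-contained.

```matlab
function k = spectrum_kernel(s1, s2, N)
% SPECTRUM_KERNEL  Count-weighted number of N-grams shared by two strings.
%   Illustrative sketch: sums count1 * count2 over the shared N-grams.
  [grams1, counts1] = local_ngrams(s1, N);
  [grams2, counts2] = local_ngrams(s2, N);
  if isempty(grams1) || isempty(grams2)
    k = 0;  % at least one string is shorter than N
    return
  end
  % i1 and i2 index the shared N-grams within each list of unique N-grams
  [~, i1, i2] = intersect(grams1, grams2);
  k = sum(counts1(i1) .* counts2(i2));
end

function [grams, counts] = local_ngrams(s, N)
% Unique N-grams of s and their occurrence counts (hypothetical helper).
  s = s(:).';                      % force a row vector
  nGrams = numel(s) - N + 1;
  if nGrams < 1
    grams = {};
    counts = [];
    return
  end
  index = hankel(1:nGrams, nGrams:numel(s));
  % RESHAPE keeps one N-gram per row even when index is a vector
  [grams, ~, pos] = unique(cellstr(reshape(s(index), nGrams, N)));
  counts = accumarray(pos(:), 1);
end
```

For 'tool' and 'fool' with N = 2 this returns 2, the same as the INTERSECT count, because no 2-gram repeats. For 'aaa' and 'aa' it also returns 2 (counts 2 and 1 for 'aa'), whereas counting unique shared n-grams would give 1.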
