XML Distance Measure


Problem description

Hi, can anyone share code for calculating the distance between XML documents?

Thanks in advance.

Recommended answer



Hi,

You can try this link.


See Document Distance Problem Definition for the algorithm.
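
For reference, the quantity computed by the code in this answer can be written as the angle between the two word-frequency vectors. The symbols $f_A(w)$ and $f_B(w)$ (the number of times word $w$ occurs in document A or B) are notation introduced here for illustration and are not taken from the linked page:

```latex
d(A, B) = \arccos\left(
  \frac{\sum_{w} f_A(w)\, f_B(w)}
       {\sqrt{\sum_{w} f_A(w)^2}\;\sqrt{\sum_{w} f_B(w)^2}}
\right)
```

The numerator corresponds to CalculateInnerProduct and the two square roots to CalculateNorm. The result is 0 when the two frequency vectors point in the same direction and pi/2 when the documents share no words.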

From that, it should be easy to write a C# program that calculates the distance, e.g.:
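using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// The wrapper class and its name are only there so the methods compile.
static class DocumentDistanceCalculator
{
    // Angle in radians between the word-frequency vectors of the two texts.
    // Returns NaN if either text contains no words.
    public static double DocumentDistance(string textA, string textB)
    {
        Dictionary<string, int> binsA = CalculateWordFrequencies(textA);
        Dictionary<string, int> binsB = CalculateWordFrequencies(textB);
        double innerProduct = CalculateInnerProduct(binsA, binsB);
        double normA = CalculateNorm(binsA);
        double normB = CalculateNorm(binsB);
        // Clamp so rounding errors cannot push the argument of Acos above 1.
        return Math.Acos(Math.Min(1.0, innerProduct / (normA * normB)));
    }

    // Counts how often each word occurs in the text.
    static Dictionary<string, int> CalculateWordFrequencies(string text)
    {
        Dictionary<string, int> bins = new Dictionary<string, int>();
        foreach (string word in GetWords(text))
        {
            if (bins.ContainsKey(word)) bins[word]++;
            else bins.Add(word, 1);
        }
        return bins;
    }

    // Splits the text into runs of word characters (letters, digits, underscore).
    static IEnumerable<string> GetWords(string text)
    {
        return Regex.Matches(text, @"\b\w+\b").Cast<Match>().Select(m => m.Value);
    }

    // Sum over all words of frequencyA * frequencyB.
    static double CalculateInnerProduct(Dictionary<string, int> binsA, Dictionary<string, int> binsB)
    {
        double product = 0.0;
        foreach (string word in binsA.Keys.Concat(binsB.Keys).Distinct())
        {
            int frequencyA = binsA.ContainsKey(word) ? binsA[word] : 0;
            int frequencyB = binsB.ContainsKey(word) ? binsB[word] : 0;
            // Convert to double before multiplying to avoid int overflow.
            product += (double)frequencyA * frequencyB;
        }
        return product;
    }

    // Euclidean length of the frequency vector.
    static double CalculateNorm(Dictionary<string, int> bins)
    {
        double sum = 0.0;
        foreach (int frequency in bins.Values)
        {
            sum += (double)frequency * frequency;
        }
        return Math.Sqrt(sum);
    }
}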



As far as I understand, this works for plain text files as well as for XML files: the word-splitting step also treats tag and attribute names as words. If two documents match 100%, the distance is 0. If some elements or attributes differ, the distance is greater than 0. Note that only word counts are compared, so the order of the words does not affect the result.
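
To illustrate how the word splitting treats markup, here is a minimal sketch that applies the same regular expression as GetWords to a small XML string. The class name TokenizeDemo and the sample document are made up for illustration and are not part of the original answer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizeDemo
{
    // Same regular expression as GetWords in the answer: runs of word characters.
    public static List<string> GetWords(string text)
    {
        return Regex.Matches(text, @"\b\w+\b")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    public static void Main()
    {
        // Hypothetical sample document, not taken from the question.
        string xml = "<book id=\"1\"><title>XML</title></book>";
        Console.WriteLine(string.Join(", ", GetWords(xml)));
        // prints: book, id, 1, title, XML, title, book
    }
}
```

Element names are counted once for the opening tag and once for the closing tag, and attribute names and values are counted like any other word.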

Cheers
Andi

PS: The code above follows the description in the referenced document. Optimization is left as an exercise for you. For example, the inner product can be computed over the intersection of the two bins' keys, with no need to check each bin for existence. Reason: words that occur in only one of the bins contribute nothing to the product, because their term is 0.
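
The optimization mentioned in the PS could look roughly like the following. This is only a sketch of one way to do it. The class name InnerProductSketch and the sample data are assumptions for illustration, and it relies on both dictionaries using the default string comparer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class InnerProductSketch
{
    // Only words present in BOTH dictionaries give a non-zero term,
    // so the loop runs over the intersection of the two key sets.
    public static double CalculateInnerProduct(
        Dictionary<string, int> binsA, Dictionary<string, int> binsB)
    {
        double product = 0.0;
        foreach (string word in binsA.Keys.Intersect(binsB.Keys))
        {
            // Convert to double before multiplying to avoid int overflow.
            product += (double)binsA[word] * binsB[word];
        }
        return product;
    }

    public static void Main()
    {
        // Hypothetical word counts for two small documents.
        var binsA = new Dictionary<string, int> { { "xml", 2 }, { "book", 1 } };
        var binsB = new Dictionary<string, int> { { "xml", 3 }, { "title", 4 } };
        Console.WriteLine(CalculateInnerProduct(binsA, binsB)); // prints 6
    }
}
```

Only "xml" occurs in both dictionaries, so the result is 2 * 3 = 6. The words "book" and "title" are skipped without any lookup in the other dictionary.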

