How to best implement K-nearest neighbours in C# for large number of dimensions?

Question

I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.

There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation :

// testSamples and trainSamples consists of about 20k vectors each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int> trainClasses, int K)
{
    Console.WriteLine("Performing KNN with K = "+K);

    var testResults = new int[testSamples.Count()]; 

    var testNumber = testSamples.Count();
    var trainNumber = trainSamples.Count();
    // Declaring these here so that I don't have to 'new' them over and over again in the main loop, 
    // just to save some overhead
    var distances = new double[trainNumber][]; 
    for (var i = 0; i < trainNumber; i++)
    {
       distances[i] = new double[2]; // Will store both distance and index in here
    }

    // Performing KNN ...
    for (var tst = 0; tst < testNumber; tst++)
    {
        // For every test sample, calculate distance from every training sample
        Parallel.For(0, trainNumber, trn =>
        {
            var dist = GetDistance(testSamples[tst], trainSamples[trn]);
            // Storing distance as well as index 
            distances[trn][0] = dist;
            distances[trn][1] = trn;
        });

        // Sort distances and take top K (?What happens in case of multiple points at the same distance?)
        var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);

        // Do a 'majority vote' to classify test sample
        var yea = 0.0;
        var nay = 0.0;

        foreach (var voter in votingDistances)
        {
            if (trainClasses[(int)voter[1]] == 1)  
               yea++;
            else
               nay++;
        }
        if (yea > nay)
            testResults[tst] = 1;
        else
            testResults[tst] = 0;

    }

    return testResults;
}

// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
    var distance = 0.0;
    // assume sample1 and sample2 are valid i.e. same length 

    for (var i = 0; i < sample1.Count; i++)
    {   
        var temp = sample1[i] - sample2[i];
        distance += temp * temp;
    }
    return distance;
}

This takes quite a bit of time to execute. On my system it takes about 80 seconds to complete. How can I optimize this, while ensuring that it would also scale to larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without these, it was taking about 120 seconds). What else can I do?

I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.

I also found this stackoverflow discussion about this, but it seems like this is 3 years old, and I was hoping that someone would know about better solutions to this problem by now.

I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.

Edited to add - I cannot reduce the number of dimensions using PCA or something. For this particular model, 25 dimensions are required.

Recommended answer

Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where your code is spending time, method-by-method or even line-by-line.
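As a rough first measurement before running a full profiler, the phases of the main loop can be timed with System.Diagnostics.Stopwatch. The sketch below is only an illustration under assumptions: the `PhaseTimer` class and its method names are hypothetical and not part of the question's code, and wall-clock timing is far coarser than what dotTrace or the Visual Studio profiler reports.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper (not part of the question's code): accumulates
// wall-clock time per named phase so the dominant phase can be spotted.
public sealed class PhaseTimer
{
    private readonly Dictionary<string, TimeSpan> _totals =
        new Dictionary<string, TimeSpan>();

    // Runs the action and adds its elapsed time to the named phase.
    public void Measure(string phase, Action action)
    {
        var sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        _totals.TryGetValue(phase, out var soFar); // soFar is zero if absent
        _totals[phase] = soFar + sw.Elapsed;
    }

    // Total time recorded for a phase, or zero if it was never measured.
    public TimeSpan Total(string phase) =>
        _totals.TryGetValue(phase, out var t) ? t : TimeSpan.Zero;

    // Prints one line per phase with its accumulated time.
    public void Report()
    {
        foreach (var pair in _totals)
            Console.WriteLine($"{pair.Key}: {pair.Value.TotalMilliseconds:F1} ms");
    }
}
```

In TestKnnCase, one might wrap the distance pass and the sort pass in separate `Measure` calls (for example under the phase names "distances" and "sort") and call `Report()` at the end. Because LINQ queries are lazy, the sorted result would need to be materialized (for instance with `ToList()`) inside the measured action, otherwise the sorting cost is attributed to the voting loop.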

That being said, a few things come to mind in reading your implementation:

  1. You are parallelizing some inner loops. Could you parallelize the outer loop instead? There is a small but nonzero cost associated with a delegate call, which may be hitting you in the "Parallel.For" callback.

  2. Similarly, there is a small performance penalty for indexing through an array using its IList interface. You might consider declaring the parameters of "GetDistance()" explicitly as arrays.

  3. How large is K as compared to the size of the training array? You are completely sorting the "distances" array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance by using a SortedSet and removing the largest element when the set size exceeds K.
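The three suggestions above can be combined in one sketch. This is a minimal illustration under assumptions, not a measured or definitive implementation: the names `KnnSketch` and `Classify` are hypothetical, the inputs are assumed to be plain arrays with `1 <= k <= trainSamples.Length`, and the value-tuple syntax assumes C# 7 or later. Whether it actually runs faster on your data is something to confirm with a profiler.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class KnnSketch
{
    // Squared Euclidean distance on plain arrays (no IList interface dispatch).
    static double GetDistance(double[] a, double[] b)
    {
        var sum = 0.0;
        for (var i = 0; i < a.Length; i++)
        {
            var d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Assumes trainClasses[i] is 0 or 1 and 1 <= k <= trainSamples.Length.
    public static int[] Classify(double[][] trainSamples, int[] trainClasses,
                                 double[][] testSamples, int k)
    {
        var results = new int[testSamples.Length];

        // Parallelize over test samples: one delegate call per test sample
        // rather than one per (test, train) pair.
        Parallel.For(0, testSamples.Length, tst =>
        {
            // Holds the k nearest (distance, index) pairs seen so far.
            // Tuples compare by distance first, then index, so points at
            // the same distance are kept as distinct entries.
            var nearest = new SortedSet<(double Dist, int Index)>();

            for (var trn = 0; trn < trainSamples.Length; trn++)
            {
                var dist = GetDistance(testSamples[tst], trainSamples[trn]);
                if (nearest.Count < k)
                {
                    nearest.Add((dist, trn));
                }
                else if (dist < nearest.Max.Dist)
                {
                    nearest.Remove(nearest.Max); // drop the current farthest
                    nearest.Add((dist, trn));
                }
            }

            var yea = 0;
            foreach (var n in nearest)
                if (trainClasses[n.Index] == 1) yea++;

            // Same rule as the question's code: class 1 only on a strict majority.
            results[tst] = yea > nearest.Count - yea ? 1 : 0;
        });

        return results;
    }
}
```

Each insertion into the bounded set costs O(log K), so selecting the neighbours for one test sample is roughly O(N log K) instead of the O(N log N) of a full sort. Each parallel iteration writes only to its own `results[tst]` slot and uses its own set, so no locking is needed.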
