How to best implement K-nearest neighbours in C# for a large number of dimensions?

Views: 73 | Posted: 2020/4/26 11:02:43 | Tags: c# optimization classification knn

Problem description

I'm implementing the K-nearest neighbours classification algorithm in C# for a training set and a testing set of about 20,000 samples each, with 25 dimensions.

There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// testSamples and trainSamples each consist of about 20k vectors with 25 dimensions
// trainClasses contains 0 or 1, the class of the corresponding sample in trainSamples
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int> trainClasses, int K)
{
    Console.WriteLine("Performing KNN with K = "+K);

    var testResults = new int[testSamples.Count()]; 

    var testNumber = testSamples.Count();
    var trainNumber = trainSamples.Count();
    // Declaring these here so that I don't have to 'new' them over and over again in the main loop, 
    // just to save some overhead
    var distances = new double[trainNumber][]; 
    for (var i = 0; i < trainNumber; i++)
    {
       distances[i] = new double[2]; // Will store both distance and index in here
    }

    // Performing KNN ...
    for (var tst = 0; tst < testNumber; tst++)
    {
        // For every test sample, calculate distance from every training sample
        Parallel.For(0, trainNumber, trn =>
        {
            var dist = GetDistance(testSamples[tst], trainSamples[trn]);
            // Storing distance as well as index 
            distances[trn][0] = dist;
            distances[trn][1] = trn;
        });

        // Sort distances and take top K (?What happens in case of multiple points at the same distance?)
        var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);

        // Do a 'majority vote' to classify test sample
        var yea = 0.0;
        var nay = 0.0;

        foreach (var voter in votingDistances)
        {
            if (trainClasses[(int)voter[1]] == 1)  
               yea++;
            else
               nay++;
        }
        if (yea > nay)
            testResults[tst] = 1;
        else
            testResults[tst] = 0;

    }

    return testResults;
}

// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
    var distance = 0.0;
    // assume sample1 and sample2 are valid i.e. same length 

    for (var i = 0; i < sample1.Count; i++)
    {   
        var temp = sample1[i] - sample2[i];
        distance += temp * temp;
    }
    return distance;
}

This takes quite a while to execute: about 80 seconds on my system. How can I optimize this while ensuring that it also scales to a larger number of data samples? As you can see, I've tried PLINQ and parallel for loops, which did help (without them, it took about 120 seconds). What else can I do?
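For reference, here is one direction that stays with brute force. It is only a sketch, not something from the original post, and it has not been timed against the 80-second figure. It makes three changes: it parallelises over test samples instead of over training samples, it uses plain `double[]` arrays rather than `IList<double>`, and it keeps only the K smallest distances while scanning instead of sorting all of them. The names `KnnSketch`, `Classify`, `SquaredDistance` and `IndexOfMax` are made up for illustration. When several points tie at the K-th distance, this version keeps whichever one came first in the training set, which is an arbitrary choice.

```csharp
using System;
using System.Threading.Tasks;

static class KnnSketch
{
    // Classifies each test sample by majority vote among its k nearest training samples.
    // Assumes two classes labelled 0 and 1 and vectors of equal length.
    public static int[] Classify(double[][] train, int[] trainClasses, double[][] test, int k)
    {
        if (k < 1) throw new ArgumentOutOfRangeException(nameof(k));
        var results = new int[test.Length];

        // Each iteration handles one test sample and writes only to its own slot
        // in results, so no locking is needed.
        Parallel.For(0, test.Length, t =>
        {
            // The k best candidates seen so far, kept unordered.
            var bestDist = new double[k];
            var bestLabel = new int[k];
            var count = 0;
            var worst = -1; // index of the largest kept distance, valid once count == k

            for (var i = 0; i < train.Length; i++)
            {
                var d = SquaredDistance(test[t], train[i]);
                if (count < k)
                {
                    bestDist[count] = d;
                    bestLabel[count] = trainClasses[i];
                    count++;
                    if (count == k) worst = IndexOfMax(bestDist, count);
                }
                else if (d < bestDist[worst])
                {
                    // Replace the current worst candidate, then find the new worst.
                    bestDist[worst] = d;
                    bestLabel[worst] = trainClasses[i];
                    worst = IndexOfMax(bestDist, count);
                }
            }

            // Majority vote: labels are 0 or 1, so the sum is the number of '1' votes.
            var yea = 0;
            for (var j = 0; j < count; j++) yea += bestLabel[j];
            results[t] = 2 * yea > count ? 1 : 0;
        });

        return results;
    }

    // Squared Euclidean distance; assumes a and b have the same length.
    private static double SquaredDistance(double[] a, double[] b)
    {
        var sum = 0.0;
        for (var i = 0; i < a.Length; i++)
        {
            var diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the largest value among the first count entries.
    private static int IndexOfMax(double[] values, int count)
    {
        var max = 0;
        for (var i = 1; i < count; i++)
            if (values[i] > values[max]) max = i;
        return max;
    }
}
```

It would be called as `var predicted = KnnSketch.Classify(trainArray, trainLabels, testArray, K);` after copying the lists into arrays. Keeping an unordered buffer of K candidates costs roughly O(n·K) per test sample, which suits small K. For large K a heap or a full sort may be the better choice.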

I've read that KD-trees are efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.

I also found this Stack Overflow discussion on the topic, but it is 3 years old, and I was hoping that someone would know of better solutions to this problem by now.

I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.

Edited to add: I cannot reduce the number of dimensions using PCA or a similar technique. For this particular model, all 25 dimensions are required.
