CUDA推力 - 计数匹配子阵 [英] CUDA Thrust - Counting matching subarrays

查看:158
本文介绍了CUDA推力 - 计数匹配子阵的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图找出是否有可能有效地计算一组使用CUDA数字的条件熵。您可以通过将一个数组到Windows,然后计算为匹配不同长度的子阵列/串的数量计算条件熵。对于每个子阵列的长度,可以通过添加计算熵一起匹配子阵计算时间的计数的日志。然后,不管你拿到的最低熵是条件熵。

I'm trying to figure out if it's possible to efficiently calculate the conditional entropy of a set of numbers using CUDA. You can calculate the conditional entropy by dividing an array into windows, then counting the number of matching subarrays/substrings for different lengths. For each subarray length, you calculate the entropy by adding together the matching subarray counts times the log of those counts. Then, whatever you get as the minimum entropy is the conditional entropy.

为了让我的意思更明显的例子,这里充满了计算:

To give a more clear example of what I mean, here is full calculation:


  1. 最初的数组为[1,2,3,5,1,2,5]。假定窗口尺寸是3,这必须被分成五个窗口:[1,2,3],[2,3,5],[3,5,1],[5,1,2]和[1 1,2,5]。

  1. The initial array is [1,2,3,5,1,2,5]. Assuming the window size is 3, this must be divided into five windows: [1,2,3], [2,3,5], [3,5,1], [5,1,2], and [1,2,5].

接下来,看着每个窗口,我们要找到每个长度匹配的子阵。

Next, looking at each window, we want to find the matching subarrays for each length.

长度为1的子阵列是[1],[2],[3],[5],[1]。有两个1,并且彼此号码中的一个。所以熵是log(2)的 2 + 4 的(对数(1)* 1)= 0.6

The subarrays of length 1 are [1],[2],[3],[5],[1]. There are two 1s, and one of each other number. So the entropy is log(2)2 + 4(log(1)*1) = 0.6.

长度为2的子阵列是[1,2],[2,3],[3,5],[5,1]和[1,2]。有两种[1,2] s和四个独特的子阵列。熵是相同的长度1,登录(2)的 2 + 4 的(日志(1)* 1)= 0.6

The subarrays of length 2 are [1,2], [2,3], [3,5], [5,1], and [1,2]. There are two [1,2]s, and four unique subarrays. The entropy is the same as length 1, log(2)2 + 4(log(1)*1) = 0.6.

长度为3的子阵列是完整的窗口:[1,2,3],[2,3,5],[3,5,1],[5,1,2]和[ 1,2,5]。所有五个窗口是唯一的,因此熵是5 *(日志(1)* 1)= 0。

The subarrays of length 3 are the full windows: [1,2,3], [2,3,5], [3,5,1], [5,1,2], and [1,2,5]. All five windows are unique, so the entropy is 5*(log(1)*1) = 0.

最小熵为0,这意味着它是这个数组的条件熵。

The minimum entropy is 0, meaning it is the conditional entropy for this array.

这也可以作为psented一棵树,其中每个节点重新present计数多少匹配存在$ P $。对于每个子阵列长度的熵是等效于熵为树的每个级别。

This can also be presented as a tree, where the counts at each node represent how many matches exist. The entropy for each subarray length is equivalent to the entropy for each level of the tree.

如果可能的话,我想立刻在许多阵列执行该计算,并且还并行执行计算本身。有没有人对如何做到这一点建议吗?难道推力是有用的?请让我知道,如果有,我应该提供的其他信息。

If possible, I'd like to perform this calculation on many arrays at once, and also perform the calculation itself in parallel. Does anyone have suggestions on how to accomplish this? Could thrust be useful? Please let me know if there is any additional information I should provide.

推荐答案

我试着用推力解决您的问题。它的工作原理,但它造成的推力调用的很多
由于您输入的大小是相当小的,你应该并行处理多个阵列。
但是,这样做会导致大量的簿记努力,你会看到这个在下面code。

I tried solving your problem using thrust. It works, but it results in a lot of thrust calls. Since your input size is rather small, you should process multiple arrays in parallel. However, doing this results in a lot of book-keeping effort, you will see this in the following code.

您输入范围仅限于 [1,5] ,相当于 [0,4] 。总的想法是,(理论上)的元组不在此范围内(如 {1,2,3} 可重新在基地4(psented多项$ P $例如: 1 + 2 * 4 + 3 * 16 = 57 )。
在实践中,我们通过整数类型的大小的限制。对于32位无符号整数,这将导致最大规模的元组 16 。这也是最大窗口大小以下code可以处理(更改为64位无符号整数会导致一个最大的元组大小 32 )。

Your input range is limited to [1,5], which is equivalent to [0,4]. The general idea is that (theoretically) any tuple out of this range (e.g. {1,2,3} can be represented as a number in base 4 (e.g. 1+2*4+3*16 = 57). In practice we are limited by the size of the integer type. For a 32bit unsigned integer this will lead to a maximum tuple size of 16. This is also the maximum window size the following code can handle (changing to a 64bit unsigned integer will lead to a maximum tuple size of 32).

假设输入数据的结构是这样的:
我们有我们想要并行处理 2 阵列,每个阵列尺寸 5 和窗口大小为 3

Let's assume the input data is structured like this: We have 2 arrays we want to process in parallel, each array is of size 5 and window size is 3.

{{0,0,3,4,4},{0,2,1,1,3}}

我们现在可以生成所有窗口:

We can now generate all windows:

{{0,0,3},{0,3,4},{3,4,4}},{{0,2,1},{2,1,1},{1,1,3}}

使用每个元组的preFIX总和与施加每个元组的前述重新presentation作为一个单一的基-4-数,我们得到:

Using a per tuple prefix sum and applying the aforementioned representation of each tuple as a single base-4 number, we get:

{{0,0,48},{0,12,76},{3,19,83}},{{0,8,24},{2,6,22},{1,5,53}}

现在我们重新排序的值,使我们有重新present号码特定长度的彼此相邻子阵列

Now we reorder the values so we have the numbers which represent a subarray of a specific length next to each other:

{{0,0,3},{0,12,19},{48,76,83}},{0,2,1},{8,6,5},{24,22,53}}

我们再排序每组中:

{{0,0,3},{0,12,19},{48,76,83}},{0,1,2},{5,6,8},{22,24,53}}

现在,我们可以指望多久每个组中某一个数字:

Now we can count how often a number occurs in each group:

2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1

应用对数公式的结果。

Applying the log-formula results in

0.60206,0,0,0,0,0

现在我们获取每个阵列中的最小值:

Now we fetch the minimum value per array:

0,0


#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <iostream>
#include <thrust/tuple.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/gather.h>
#include <thrust/sort.h>
#include <math.h>

#include <chrono>

#ifdef PRINT_ENABLED
#define PRINTER(name) print(#name, (name))
#else
#define PRINTER(name)
#endif

template <template <typename...> class V, typename T, typename ...Args>
void print(const char* name, const V<T,Args...> & v)
{
    std::cout << name << ":\t";
    thrust::copy(v.begin(), v.end(), std::ostream_iterator<T>(std::cout, "\t"));
    std::cout << std::endl;
}


template <typename Integer, Integer Min, Integer Max>
struct random_filler
{
    __device__
    Integer operator()(std::size_t index) const
    {
        thrust::default_random_engine rng;
        thrust::uniform_int_distribution<Integer> dist(Min, Max);
        rng.discard(index);
        return dist(rng);
    }
};

template <std::size_t ArraySize, 
          std::size_t ArrayCount,
          std::size_t WindowSize,
          typename T,
          std::size_t WindowCount = ArraySize - (WindowSize-1),
          std::size_t PerArrayCount = WindowSize * WindowCount>

__device__ __inline__          
thrust::tuple<T,T,T,T> calc_indices(const T& i0)
{
    const T i1 = i0 / PerArrayCount;
    const T i2 = i0 % PerArrayCount;
    const T i3 = i2 / WindowSize;
    const T i4 = i2 % WindowSize;
    return thrust::make_tuple(i1,i2,i3,i4);
}

template <typename Iterator, 
          std::size_t ArraySize, 
          std::size_t ArrayCount,
          std::size_t WindowSize,
          std::size_t WindowCount = ArraySize - (WindowSize-1),
          std::size_t PerArrayCount = WindowSize * WindowCount,
          std::size_t TotalCount =  PerArrayCount * ArrayCount
          >
class sliding_window
{
    public:

    typedef typename thrust::iterator_difference<Iterator>::type difference_type;

    struct window_functor : public thrust::unary_function<difference_type,difference_type>
    {
        __host__ __device__
        difference_type operator()(const difference_type& i0) const
        { 
            auto t = calc_indices<ArraySize, ArrayCount,WindowSize>(i0);

            return thrust::get<0>(t) * ArraySize + thrust::get<2>(t)  + thrust::get<3>(t);
        }
    };

    typedef typename thrust::counting_iterator<difference_type>                   CountingIterator;
    typedef typename thrust::transform_iterator<window_functor, CountingIterator>   TransformIterator;
    typedef typename thrust::permutation_iterator<Iterator,TransformIterator>     PermutationIterator;

    typedef PermutationIterator iterator;

    sliding_window(Iterator first) : first(first){}

    iterator begin(void) const
    {
        return PermutationIterator(first, TransformIterator(CountingIterator(0), window_functor()));
    }

    iterator end(void) const
    {
        return begin() + TotalCount;
    }

    protected:
    Iterator first;
};

template <std::size_t ArraySize, 
          std::size_t ArrayCount,
          std::size_t WindowSize,
          typename Iterator>
sliding_window<Iterator, ArraySize, ArrayCount, WindowSize>
make_sliding_window(Iterator first)
{
    return sliding_window<Iterator, ArraySize, ArrayCount, WindowSize>(first);
}


template <typename KeyType, 
          std::size_t ArraySize, 
          std::size_t ArrayCount,
          std::size_t WindowSize> 
struct key_generator : thrust::unary_function<KeyType, thrust::tuple<KeyType,KeyType> >
{
    __device__
    thrust::tuple<KeyType,KeyType> operator()(std::size_t i0) const
    {        

        auto t = calc_indices<ArraySize, ArrayCount,WindowSize>(i0);
        return thrust::make_tuple(thrust::get<0>(t),thrust::get<2>(t));
    }
};


template <typename Integer,
          std::size_t Base,
          std::size_t ArraySize, 
          std::size_t ArrayCount,
          std::size_t WindowSize>
struct base_n : thrust::unary_function<thrust::tuple<Integer, Integer>, Integer>
{
    __host__ __device__ 
    Integer operator()(const thrust::tuple<Integer, Integer> t) const
    {    
        const auto i = calc_indices<ArraySize, ArrayCount, WindowSize>(thrust::get<0>(t));

        // ipow could be optimized by precomputing a lookup table at compile time
        const auto result =  thrust::get<1>(t)*ipow(Base, thrust::get<3>(i));
        return result;
    }

    // taken from http://stackoverflow.com/a/101613/678093
    __host__ __device__ __inline__
    Integer ipow(Integer base, Integer exp) const
    {
        Integer result = 1;
        while (exp)
        {
            if (exp & 1)
                result *= base;
            exp >>= 1;
            base *= base;
        }

        return result;
    }
};

template <std::size_t ArraySize, 
          std::size_t ArrayCount,
          std::size_t WindowSize,
          typename T,
          std::size_t WindowCount = ArraySize - (WindowSize-1),
          std::size_t PerArrayCount = WindowSize * WindowCount>

__device__ __inline__          
thrust::tuple<T,T,T,T> calc_sort_indices(const T& i0)
{
    const T i1 = i0 % PerArrayCount;
    const T i2 = i0 / PerArrayCount;
    const T i3 = i1 % WindowCount;
    const T i4 = i1 / WindowCount;
    return thrust::make_tuple(i1,i2,i3,i4);
}

template <typename Integer,
          std::size_t ArraySize, 
          std::size_t ArrayCount,
          std::size_t WindowSize,
          std::size_t WindowCount = ArraySize - (WindowSize-1),
          std::size_t PerArrayCount = WindowSize * WindowCount>
struct pre_sort : thrust::unary_function<Integer, Integer>
{
    __device__
    Integer operator()(Integer i0) const
    {

        auto t = calc_sort_indices<ArraySize, ArrayCount,WindowSize>(i0);


        const Integer i_result = ( thrust::get<2>(t)  * WindowSize  + thrust::get<3>(t) ) + thrust::get<1>(t) * PerArrayCount;
        return i_result;
    }
};


template <typename Integer,
          std::size_t ArraySize, 
          std::size_t ArrayCount,
          std::size_t WindowSize,
          std::size_t WindowCount = ArraySize - (WindowSize-1),
          std::size_t PerArrayCount = WindowSize * WindowCount>
struct generate_sort_keys : thrust::unary_function<Integer, Integer>
{
    __device__
    thrust::tuple<Integer,Integer> operator()(Integer i0) const
    {

        auto t = calc_sort_indices<ArraySize, ArrayCount,WindowSize>(i0);

        return thrust::make_tuple( thrust::get<1>(t), thrust::get<3>(t));
    }
};


template<typename... Iterators>
__host__ __device__
thrust::zip_iterator<thrust::tuple<Iterators...>> zip(Iterators... its)
{
    return thrust::make_zip_iterator(thrust::make_tuple(its...));
}


struct calculate_log : thrust::unary_function<std::size_t, float>
{
  __host__ __device__
  float operator()(std::size_t i) const
  {
    return i*log10f(i);
  }
};


int main()
{
    typedef int Integer;
    typedef float Real;

    const std::size_t array_count = ARRAY_COUNT;
    const std::size_t array_size = ARRAY_SIZE;

    const std::size_t window_size = WINDOW_SIZE;

    const std::size_t window_count = array_size - (window_size-1);

    const std::size_t input_size = array_count * array_size;

    const std::size_t base = 4;

    thrust::device_vector<Integer> input_arrays(input_size);

    thrust::counting_iterator<Integer> counting_it(0);

    thrust::transform(counting_it,
                      counting_it + input_size,
                      input_arrays.begin(),
                      random_filler<Integer,0,base>());

    PRINTER(input_arrays);

    const int runs = 100;

    auto start = std::chrono::high_resolution_clock::now();
    for (int k = 0 ; k < runs; ++k)
    {
      auto sw = make_sliding_window<array_size, array_count, window_size>(input_arrays.begin());

      const std::size_t total_count = window_size * window_count * array_count;

      thrust::device_vector<Integer> result(total_count);

      thrust::copy(sw.begin(), sw.end(), result.begin());

      PRINTER(result);


      auto ti_begin = thrust::make_transform_iterator(counting_it, key_generator<Integer, array_size, array_count, window_size>());

      auto base_4_ti = thrust::make_transform_iterator(zip(counting_it, sw.begin()), base_n<Integer, base, array_size, array_count, window_size>());

      thrust::inclusive_scan_by_key(ti_begin, ti_begin+total_count, base_4_ti, result.begin());
      PRINTER(result);

      thrust::device_vector<Integer> result_2(total_count);

      auto ti_pre_sort = thrust::make_transform_iterator(counting_it, pre_sort<Integer, array_size, array_count, window_size>());

      thrust::gather(ti_pre_sort,
                    ti_pre_sort+total_count,
                    result.begin(),
                    result_2.begin());

      PRINTER(result_2);


      thrust::device_vector<Integer> sort_keys_1(total_count);
      thrust::device_vector<Integer> sort_keys_2(total_count);

      auto zip_begin = zip(sort_keys_1.begin(),sort_keys_2.begin());

      thrust::transform(counting_it,
                        counting_it+total_count,
                        zip_begin,
                        generate_sort_keys<Integer, array_size, array_count, window_size>());

      thrust::stable_sort_by_key(result_2.begin(), result_2.end(), zip_begin);
      thrust::stable_sort_by_key(zip_begin, zip_begin+total_count, result_2.begin());

      PRINTER(result_2);

      thrust::device_vector<Integer> key_counts(total_count);
      thrust::device_vector<Integer> sort_keys_1_reduced(total_count);
      thrust::device_vector<Integer> sort_keys_2_reduced(total_count);

      // count how often each sub array occurs
      auto zip_count_begin = zip(sort_keys_1.begin(), sort_keys_2.begin(), result_2.begin());
      auto new_end = thrust::reduce_by_key(zip_count_begin,
                                          zip_count_begin + total_count,
                                          thrust::constant_iterator<Integer>(1),
                                          zip(sort_keys_1_reduced.begin(), sort_keys_2_reduced.begin(), thrust::make_discard_iterator()),
                                          key_counts.begin()
                                          );

      std::size_t new_size = new_end.second - key_counts.begin();
      key_counts.resize(new_size);
      sort_keys_1_reduced.resize(new_size);
      sort_keys_2_reduced.resize(new_size);
      PRINTER(key_counts);
      PRINTER(sort_keys_1_reduced);
      PRINTER(sort_keys_2_reduced);


      auto log_ti = thrust::make_transform_iterator (key_counts.begin(), calculate_log());

      thrust::device_vector<Real> log_result(new_size);
      auto zip_keys_reduced_begin = zip(sort_keys_1_reduced.begin(), sort_keys_2_reduced.begin());
      auto log_end = thrust::reduce_by_key(zip_keys_reduced_begin,
                                          zip_keys_reduced_begin + new_size,
                                          log_ti,
                                          zip(sort_keys_1.begin(),thrust::make_discard_iterator()),
                                          log_result.begin()
                            );
      std::size_t final_size = log_end.second - log_result.begin();
      log_result.resize(final_size);
      sort_keys_1.resize(final_size);
      PRINTER(log_result);

      thrust::device_vector<Real> final_result(final_size);
      auto final_end = thrust::reduce_by_key(sort_keys_1.begin(),
                                            sort_keys_1.begin() + final_size,
                                            log_result.begin(),
                                            thrust::make_discard_iterator(),
                                            final_result.begin(),
                                            thrust::equal_to<Integer>(),
                                            thrust::minimum<Real>()
                            );

      final_result.resize(final_end.second-final_result.begin());

      PRINTER(final_result);
    }

    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now() - start);
    std::cout << "took " << duration.count()/runs << "milliseconds" << std::endl;

    return 0;
}

编译使用

nvcc -std=c++11 conditional_entropy.cu -o benchmark -DARRAY_SIZE=1000 -DARRAY_COUNT=1000 -DWINDOW_SIZE=10 && ./benchmark

此配置对我的GPU(GTX 680)133毫秒,因此每个阵列约为0.1毫秒。

This configuration takes 133 milliseconds on my GPU (GTX 680), so around 0.1 milliseconds per array.

的实施绝对可以得到优化,例如使用precomputed查找表为基-4-转换和也许一些推力呼叫可避免

The implementation can definitely be optimized, e.g. using a precomputed lookup table for the base-4 conversion and maybe some of the thrust calls can be avoided.

这篇关于CUDA推力 - 计数匹配子阵的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆