Counting pixels using tensorflow and gpu


Problem description


I have mask images with size (N, 256, 256), where N is a value between 1000 and 10000.
Each pixel has an integer value between 0 and 2 (0 is just background).
Unfortunately, the mask image is not encoded as (N, 256, 256, 2).
I have a few thousand of these masks. My goal is to find the quickest way to count the pixels per frame for each label (1 and 2).
Running the code below on one mask image with roughly 6000 frames using numpy takes < 2 s.

np.sum(ma==1, axis=(1,2))
np.sum(ma==2, axis=(1,2))


I expect it will take a few hours to run on the entire data set if I use a single process, and maybe less than an hour with multiprocessing (CPU). I'm curious whether I can make it even faster on a GPU. The part that sums a tensor over axes seems easy to implement, but I can't find how to implement the ma==1 part in TensorFlow.
I thought about converting the input to the encoded shape (N, 256, 256, 2) first and passing it to a tensor placeholder, but realized that building an array with that shape would take even longer than the above. Or is there a better way to implement pixel counting on this mask data using TensorFlow?
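
For reference, the ma==1 part maps directly to tf.equal, and the per-frame sum to tf.reduce_sum; the following is only a minimal sketch of that mapping (assuming TensorFlow 2.x with eager execution and randomly generated stand-in data), not a claim that it beats the numpy version:

import numpy as np
import tensorflow as tf

# stand-in data; in practice `ma` would be one loaded mask image of shape (N, 256, 256)
ma = np.random.randint(0, 3, size=(6000, 256, 256), dtype=np.uint8)
t = tf.constant(ma)

# tf.equal builds the boolean mask, tf.cast + tf.reduce_sum count the True entries per frame
count1 = tf.reduce_sum(tf.cast(tf.equal(t, 1), tf.int32), axis=[1, 2])
count2 = tf.reduce_sum(tf.cast(tf.equal(t, 2), tf.int32), axis=[1, 2])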

Recommended answer


Think about what's going on in the background

Roughly the following steps are done twice in your original implementation:

  • Load the whole array from memory and check whether the values equal the wanted value
  • Write the result back to memory (the temporary array is as large as the input array, assumed np.uint8)
  • Load the whole array into memory again and sum up the results
  • Write the result back to memory
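
Spelled out, each of the two numpy one-liners is roughly equivalent to the following two-pass decomposition (a sketch only, to make the memory traffic visible):

import numpy as np

ma = np.random.randint(0, 3, size=(6000, 256, 256), dtype=np.uint8)  # stand-in for one mask image
tmp = (ma == 1)                 # pass 1: read ma, write a temporary boolean array as large as ma
count1 = tmp.sum(axis=(1, 2))   # pass 2: read that temporary again, write the per-frame counts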


It should be clear that this is a quite suboptimal implementation, parallelized or not. I could not do any better in a pure vectorized numpy way, but there are tools available (Numba, Cython) where you can implement this task in a more direct and parallelized way.

Example

import numpy as np
import numba as nb
import time

#Create some data
N=10000
images=np.random.randint(0, high=3, size=(N,256,256), dtype=np.uint8)

def sum_orig(ma):
  A=np.sum(ma==1,axis=(1,2))
  B=np.sum(ma==2,axis=(1,2))
  return A,B

@nb.njit(fastmath=True,parallel=True)
def sum_mod(ma):
  A=np.zeros(ma.shape[0],dtype=np.uint32)
  B=np.zeros(ma.shape[0],dtype=np.uint32)

  #parallel loop
  for i in nb.prange(ma.shape[0]):
    AT=0
    BT=0
    for j in range(ma.shape[1]):
      for k in range(ma.shape[2]):
        if (ma[i,j,k]==1):
          AT+=1
        if (ma[i,j,k]==2):
          BT+=1

    A[i]=AT
    B[i]=BT

  return A,B

#Warm up
#The function is compiled at the first call
[A,B]=sum_mod(images)
t1=time.time()
[A,B]=sum_mod(images)
print(time.time()-t1)
t1=time.time()
[A_,B_]=sum_orig(images)
print(time.time()-t1)

#check if it works correctly
print(np.allclose(A,A_))
print(np.allclose(B,B_))
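
The main reason for the speedup is that sum_mod reads every element of ma exactly once and counts both labels in the same pass, instead of materializing two (N, 256, 256)-sized temporaries, while nb.prange additionally spreads the frames over the available CPU cores.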

Performance

improved_version: 0.06s
original_version: 2.07s
speedup: 33x

