Distributed Tensorflow: Internal Error - Blas GEMM launch failed

Problem description

I am experimenting with distributed Tensorflow and started with two processes on localhost (Windows 10, Python 3.6.6, Tensorflow 1.8.0). Each process runs a replica of a simple neural network (one hidden layer), modeled for a subset of the UrbanSounds dataset (5268 samples with 193 features each).

Following this well-written post: https://learningtensorflow.com/lesson11/ I could reproduce their basic example, which calculates a mean from the results of two distinct processes. For my dataset, I modified the code as follows, dividing the total samples into two halves and letting two distinct processes compute the cost function separately. But after the RPC server starts successfully, both processes end up with the following error:


InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193

[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

It appears to me that there is some basic mistake in the neural network configuration or in preparing the datasets for feed_dict, but I am unable to see it, so I need another pair of eyes. Another observation during this experiment is that GPU usage mostly shot to its maximum and the code aborted. Could you point out any mistake in the code, or in the strategy used to distribute Tensorflow?

Thanks.

### ERROR TRACE (removed duplicate rows ...) ###
train_data, train_labels (528, 193) (528, 10)
test_data, test_labels (22, 193) (22, 10)
2018-08-27 14:35:29.096572: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-08-27 14:35:29.330127: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.63GiB
...
2018-08-27 14:35:33.982347: E T:\src\github\tensorflow\tensorflow\stream_executor\cuda\cuda_blas.cc:462] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
Traceback (most recent call last):
  File "C:\Users\shakeel\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
2018-08-27 14:35:33.989312: W T:\src\github\tensorflow\tensorflow\stream_executor\stream.cc:2001] attempting to perform BLAS operation using StreamExecutor without BLAS support
    return fn(*args)
...
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
...
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

Caused by op 'MatMul', defined at:
  File "tf_dis_audio_test.py", line 78, in <module>
    z = tf.nn.tanh(tf.matmul(X, w1) + b1)
  File "C:\Users\shakeel\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2122, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
...
InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(263, 193), b.shape=(193, 200), m=263, n=200, k=193
         [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:local/replica:0/task:0/device:GPU:0"](_recv_Placeholder_0_G7, w1/read)]]

### CODE SAMPLE ###
import sys
from datetime import datetime

import numpy as np
import tensorflow as tf

# selected UrbanSounds dataset (train/test arrays are prepared earlier, not shown)
print("train_data, train_labels", train_data.shape, train_labels.shape)
print("test_data, test_labels", test_data.shape, test_labels.shape)

# neural network configurations
cost = 0.0
n_tasks = 2
n_epochs = 10
n_classes = 10
n_features = 193
n_hidden_1 = 200
learning_rate = 0.1
sd = 1/np.sqrt(n_features)
cost_history = np.empty(shape=[1], dtype=float)
datetime_format = "%Y-%m-%d %H:%M:%S.%f"  # assumed format string; its value is not shown in the original snippet

# task#0 is set as rpc host process
rpc_server = "grpc://localhost:2001"

# run two separate python shells, each with its task number (0,1), as:
#>python this_script.py  0
#>python this_script.py  1
task_number = int(sys.argv[1])

# cluster specs with two localhosts on different ports (2001, 2002)
cluster = tf.train.ClusterSpec({"local": ["localhost:2001", "localhost:2002"]})
server = tf.train.Server(cluster, job_name="local", task_index=task_number)
server.start()

graph = tf.Graph()
with graph.as_default():    
    X = tf.placeholder(tf.float32, [None, n_features])
    Y = tf.placeholder(tf.float32, [None, n_classes])

    w1 = tf.Variable(tf.random_normal([n_features, n_hidden_1], mean = 0, stddev=sd), name="w1")
    b1 = tf.Variable(tf.random_normal([n_hidden_1], mean=0, stddev=sd), name="b1")
    w2 = tf.Variable(tf.random_normal([n_hidden_1, n_classes], mean = 0, stddev=sd), name="w2")
    b2 = tf.Variable(tf.random_normal([n_classes], mean=0, stddev=sd), name="b2")
    
    z = tf.nn.tanh(tf.matmul(X, w1) + b1)
    _y = tf.nn.softmax(tf.matmul(z, w2) + b2)
    
    cost_function = tf.reduce_mean(tf.square(Y - _y))
    train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_function)
    prediction = tf.equal(tf.argmax(Y, 1), tf.argmax(_y, 1))
    accuracy = tf.reduce_mean(tf.cast(prediction, tf.float32)) * 100.0
    # build the initializer inside this graph so the session below can run it
    init = tf.global_variables_initializer()
    print("#2: {}".format(datetime.utcnow().strftime(datetime_format)[:-3]))

# hack to cap GPU memory and work around the out-of-memory issue,
# but it does not do any good, the GPU still shoots to max :(
gpuops = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
config = tf.ConfigProto(gpu_options=gpuops)

with tf.Session(rpc_server, graph=graph, config=config) as ss:
    # the session above is already bound to the RPC host, the graph and the GPU config
    ss.run(init)

    for epoch in range(n_epochs):
        batch_size = int(len(train_labels) / n_tasks)

        # run session for task#0 on the first half of the training data
        if (task_number == 0):
            _, cost = ss.run([train_step, cost_function], feed_dict={X:train_data[:batch_size-1], Y:train_labels[:batch_size-1]})

        # run session for task#1 on the second half of the training data
        elif (task_number == 1):
            _, cost = ss.run([train_step, cost_function], feed_dict={X:train_data[batch_size:-1], Y:train_labels[batch_size:-1]})

        # recording the running cost of both processes
        cost_history = np.append(cost_history, cost)
        print(" epoch {}: task {}: history {:.3f}".format(epoch, task_number, cost_history[-1]))

    print("Accuracy SGD ({}): {:.3f}".format(
        epoch, round(ss.run(accuracy, feed_dict={X: test_data, Y: test_labels}), 3)))

Recommended answer

Simply moving the given code to Ubuntu 16.04.4 LTS solved the problem for me.

I am not sure, but this seems to be something related to gRPC plus firewall behaviour on Windows 10.

If anybody comes across this BLAS error on Windows and manages to solve it there, please post the solution for the rest of us.
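
One thing worth trying on Windows (I have not verified it in this setup) is to keep TensorFlow from pre-allocating the whole GPU, since the trace shows CUBLAS_STATUS_ALLOC_FAILED while two processes share a single card. Below is only a minimal sketch of that session configuration, reusing the grpc://localhost:2001 target from the question; it is a commonly reported mitigation for this class of error, not a confirmed fix here.

import tensorflow as tf

# Assumption: two worker processes share one GPU; let each of them allocate
# memory on demand instead of grabbing (nearly) the whole card at startup.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap each process explicitly:
# config.gpu_options.per_process_gpu_memory_fraction = 0.4

# The same config must be passed to the session that actually runs the graph
# in every worker process, otherwise the option has no effect.
with tf.Session("grpc://localhost:2001", config=config) as sess:
    pass  # build and run the replica's graph here, as in the code sample above

If it still aborts, watching nvidia-smi while both workers start up shows which process is holding the GPU memory.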

Cheers.
