Multi-GPU in Keras

How can I program in the keras library (or tensorflow) so that training is split across multiple GPUs? Say you are on an Amazon ec2 instance that has 8 GPUs and you want to use all of them to train faster, but your code is written for a single CPU or GPU.

— Hector Blandin

Did you check the tensorflow docs?

— n1tk

@sb0709: I started reading them this morning, but I was wondering how to do it in keras.

— Hector Blandin

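I don't know about keras, but for tensorflow: tf uses the GPU by default for computation, even for ops that could run on the CPU (as long as a supported GPU is present). So you can just write a for loop, "for d in ['/gpu:0', '/gpu:1', '/gpu:2', ... '/gpu:7']:", with "with tf.device(d):" inside it, and that should cover all of your instance's GPU resources (device indices start at 0). So tf.device() is what actually gets used.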

— n1tk

Like this?? "for d in ['/gpu:0', '/gpu:1', '/gpu:2', ... '/gpu:7']:" with "with tf.device(d):" inside, and that's it? I'll try it that way :)

— Hector Blandin

As far as I know, yes: you can place any operation on a different device.

— n1tk

Answers:

From the Keras FAQ:

https://keras.io/getting-started/faq/#how-can-i-run-a-keras-model-on-multiple-gpus

Below is code copied from there that enables 'data parallelism', i.e. each GPU independently processes a different subset of the data.

from keras.utils import multi_gpu_model

# Replicates `model` on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

# This `fit` call will be distributed on 8 GPUs.
# Since the batch size is 256, each GPU will process 32 samples.
parallel_model.fit(x, y, epochs=20, batch_size=256)

Note that, at the time of writing, this appears to be valid only for the Tensorflow backend.
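To make the "each GPU will process 32 samples" comment concrete, here is a framework-free sketch of data parallelism: a batch is cut into one contiguous slice per replica, each replica processes its slice, and the outputs are concatenated in order. The names `split_batch` and `data_parallel_apply` are hypothetical helpers for illustration, and this is not Keras's actual implementation.

```python
def split_batch(batch, n_replicas):
    """Cut a list of samples into n_replicas contiguous, near-equal slices."""
    base, extra = divmod(len(batch), n_replicas)
    slices, start = [], 0
    for i in range(n_replicas):
        # The first `extra` replicas take one more sample when the
        # batch size is not a multiple of the replica count.
        size = base + (1 if i < extra else 0)
        slices.append(batch[start:start + size])
        start += size
    return slices


def data_parallel_apply(batch, replicas):
    """Run each slice through its own replica and concatenate the outputs."""
    outputs = []
    for replica, chunk in zip(replicas, split_batch(batch, len(replicas))):
        outputs.extend(replica(chunk))
    return outputs


# A batch of 256 samples spread over 8 replicas.
batch = list(range(256))
slice_sizes = [len(s) for s in split_batch(batch, 8)]
print(slice_sizes)
```

With a batch size of 256 and 8 replicas, every slice holds 32 samples, which matches the comment in the FAQ snippet.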

Update (Feb 2018):

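Keras now allows automatic GPU selection with multi_gpu_model, so you no longer have to hardcode the number of gpus. Details are in the corresponding pull request. In other words, this enables code like the following: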

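try:
    model = multi_gpu_model(model)
except ValueError:
    # multi_gpu_model raises ValueError when fewer than two GPUs are
    # available; in that case keep using the plain single-device model.
    pass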

But to be more explicit, you can stick with something like:

parallel_model = multi_gpu_model(model, gpus=None)

Bonus:

To check that you really are using all of your GPUs (NVIDIA ones specifically), you can monitor their usage in a terminal with:

watch -n0.5 nvidia-smi
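If you would rather log utilization from a script than watch the terminal, the sketch below parses the CSV output of nvidia-smi's query mode. The function names are hypothetical, the sample text is made up for illustration, and the parser assumes every field is a plain integer (some GPUs report "[N/A]" for certain fields, which this sketch does not handle).

```python
import subprocess

# nvidia-smi query mode: one CSV row per GPU, no header, no unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Turn 'index, util, mem' rows into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = [field.strip() for field in line.split(",")]
        stats.append({
            "index": int(index),
            "util_percent": int(util),
            "memory_used_mib": int(mem),
        })
    return stats


def read_gpu_stats():
    """Call nvidia-smi (must be on PATH) and parse its output."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_stats(output)


# Made-up sample output for two GPUs, used here instead of a live call.
sample = "0, 97, 10531\n1, 95, 10498\n"
parsed = parse_gpu_stats(sample)
print(parsed)
```

If one of the GPUs stays near 0% utilization during training, the model is probably not being replicated onto it.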


— weiji14

Does multi_gpu_model(model, gpus=None) work when there is only 1 GPU? It would be cool if it automatically adapted to the number of GPUs available.

— CMCDragonkai

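Yes, I think it works with 1 GPU; see github.com/keras-team/keras/pull/9226#issuecomment-361692460. But you might need to be careful that your code is adapted to run on a multi_gpu_model instead of a simple model. In most cases it probably won't matter, but if you are going to do something like take the output of an intermediate layer, you will need to code accordingly.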

— weiji14

Do you have any references on the differences of a multi-GPU model?

— CMCDragonkai

You mean something like github.com/rossumai/keras-multi-gpu/blob/master/blog/docs/…?

— weiji14

That reference was great, @weiji14. However, I'm also interested in how this works for inference. Does keras somehow split batches equally across the available model replicas, or does it schedule them round robin?

— CMCDragonkai

For TensorFlow:

TensorFlow: Using GPUs

Here is sample code showing how it is used: each group of operations is placed on a device taken from a list of devices.

import tensorflow as tf  # TensorFlow 1.x API (tf.Session, tf.ConfigProto)

# Creates a graph.
c = []
for d in ['/gpu:2', '/gpu:3']:
  with tf.device(d):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c.append(tf.matmul(a, b))
# Adds up the per-GPU results on the CPU.
with tf.device('/cpu:0'):
  total = tf.add_n(c)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(total))

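tf uses the GPU by default for computation, even for ops that could run on the CPU (as long as a supported GPU is present). So you can just write a for loop, "for d in ['/gpu:0', '/gpu:1', '/gpu:2', ... '/gpu:7']:", with "with tf.device(d):" inside it, and that should cover all of your instance's GPU resources (device indices start at 0). So tf.device() is what actually gets used.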
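TensorFlow numbers GPUs from zero, so an 8-GPU instance exposes '/gpu:0' through '/gpu:7'. Rather than typing the list by hand, it can be generated; `gpu_device_names` below is a hypothetical helper, not part of TensorFlow.

```python
def gpu_device_names(n_gpus):
    """Build the device strings '/gpu:0' ... '/gpu:<n_gpus - 1>'."""
    return ["/gpu:%d" % i for i in range(n_gpus)]


devices = gpu_device_names(8)
print(devices)
```

The resulting list can be used directly as the iterable in the "for d in ...:" loop, with "with tf.device(d):" inside it.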

Scaling Keras Model Training to Multiple GPUs

Keras

For Keras with the MXNet backend, use args.num_gpus, where num_gpus is the number of GPUs you want to use.

import keras

def backend_agnostic_compile(model, loss, optimizer, metrics, args):
    # The `context` argument is specific to the MXNet backend of Keras.
    if keras.backend.backend() == 'mxnet':
        gpu_list = ["gpu(%d)" % i for i in range(args.num_gpus)]
        model.compile(loss=loss,
                      optimizer=optimizer,
                      metrics=metrics,
                      context=gpu_list)
    else:
        if args.num_gpus > 1:
            print("Warning: num_gpus > 1 but not using MxNet backend")
        model.compile(loss=loss,
                      optimizer=optimizer,
                      metrics=metrics)

horovod.tensorflow

On top of all this, Uber recently open-sourced Horovod, and I think it is great:

Horovod

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model…
loss = …
opt = tf.train.AdagradOptimizer(0.01)

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir="/tmp/train_logs",
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
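Conceptually, the distributed optimizer above averages the gradients computed by all processes before each update, so every process applies the same step and the model copies stay identical. The toy simulation below illustrates that idea in plain Python. It is only a sketch of synchronous gradient averaging with made-up numbers and hypothetical function names; it does not use Horovod or reflect how Horovod implements the communication.

```python
def allreduce_average(per_worker_grads):
    """Average gradient vectors element-wise across workers."""
    n_workers = len(per_worker_grads)
    n_params = len(per_worker_grads[0])
    return [
        sum(grads[i] for grads in per_worker_grads) / n_workers
        for i in range(n_params)
    ]


def sgd_step(weights, grads, lr):
    """Apply one plain SGD update."""
    return [w - lr * g for w, g in zip(weights, grads)]


# Every worker starts from the same weights (in the snippet above, this is
# what broadcasting the variables from rank 0 achieves).
weights = [1.0, -2.0]

# Made-up gradients, one vector per worker, each from a different data shard.
per_worker_grads = [[0.5, 1.0], [1.5, 3.0], [1.0, 2.0], [1.0, 2.0]]

avg_grads = allreduce_average(per_worker_grads)

# Each worker applies the same averaged gradient, so all copies stay in sync.
updated = [sgd_step(weights, avg_grads, lr=0.1) for _ in per_worker_grads]
print(avg_grads, updated[0])
```

In a real run, one process is started per GPU, and each process executes the same training script.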

— n1tk

Basically, you can take the following as an example. All you need to do is specify the CPU and GPU consumption values after importing keras.

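import keras
import tensorflow as tf  # TensorFlow 1.x API (tf.Session, tf.ConfigProto)

# device_count caps how many devices of each type the session may use.
config = tf.ConfigProto(device_count={'GPU': 1, 'CPU': 56})
sess = tf.Session(config=config)
keras.backend.set_session(sess)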

Then you fit the model:

model.fit(x_train, y_train, epochs=epochs, validation_data=(x_test, y_test))

Finally, you can lower these consumption values so that the job does not run at the ceiling of the available resources.

— John Casey