# GPU vs CPU benchmarks with Flux.jl

This post compares the training time of a simple convolutional neural network on a GPU and CPU. The data, network architecture, and training loops are based on those provided in the fluxml.ai tutorial on deep learning.

I am using a Dell XPS 13 laptop with an Intel i7-7500U CPU, connected to an Nvidia GTX 1080 via a Razer Core X external GPU dock. Neither the CPU nor the GPU is particularly fast by 2021 standards, so you can expect much faster results on modern machines.

### Preliminaries

As usual, we start by loading the necessary packages. Unless you already have it stored locally, the CIFAR10 image data we will train the neural networks on also needs to be downloaded (it is provided through Metalhead.jl).

```julia
# Load necessary packages
using Statistics
using Flux, Flux.Optimise
using Flux: onehotbatch, onecold
using Flux: crossentropy, Momentum
using Base.Iterators: partition
using CUDA
using Metalhead               # provides trainimgs(CIFAR10) used below
using Images.ImageCore
using BenchmarkTools: @btime
```



### Data preparation

Prior to training, the downloaded image data needs to be prepared, batched, and moved to the respective device. I don't specify a validation sample here, as the focus is entirely on benchmarking training time. (Note that the explicit transfer to the CPU via `|> cpu` is not strictly necessary; I do it here solely for symmetry.)

```julia
# Prepare data
X = trainimgs(CIFAR10)
labels = onehotbatch([X[i].ground_truth.class for i in 1:50000], 1:10)
getarray(img) = float.(permutedims(channelview(img), (2, 3, 1)))
imgs = [getarray(X[i].img) for i in 1:50000]

# Batch the data and move it to the gpu and cpu
train_gpu = [(cat(imgs[i]..., dims = 4), labels[:, i])
             for i in partition(1:50000, 1000)] |> gpu
train_cpu = [(cat(imgs[i]..., dims = 4), labels[:, i])
             for i in partition(1:50000, 1000)] |> cpu
```
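As an optional sanity check, you can confirm that the batches actually landed on the GPU by inspecting the array type of the first batch; with a functional CUDA setup, the image arrays should be `CuArray`s rather than plain `Array`s:

```julia
# Optional sanity check: inspect where the first training batch lives
x1, y1 = first(train_gpu)
typeof(x1)   # a CuArray subtype when CUDA is available; plain Array for train_cpu
```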


### Neural network construction

The constructed convolutional neural network has a total of 39,558 trainable parameters. Since a model's parameters live on the device the model was moved to, we define one instance for the GPU and one for the CPU.

```julia
# Define identical neural networks on the gpu and cpu
m_gpu = Chain(
    Conv((5, 5), 3 => 16, relu),
    MaxPool((2, 2)),
    Conv((5, 5), 16 => 8, relu),
    MaxPool((2, 2)),
    x -> reshape(x, :, size(x, 4)),
    Dense(200, 120),
    Dense(120, 84),
    Dense(84, 10),
    softmax) |> gpu

m_cpu = Chain(
    Conv((5, 5), 3 => 16, relu),
    MaxPool((2, 2)),
    Conv((5, 5), 16 => 8, relu),
    MaxPool((2, 2)),
    x -> reshape(x, :, size(x, 4)),
    Dense(200, 120),
    Dense(120, 84),
    Dense(84, 10),
    softmax) |> cpu
```
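The parameter count quoted above can be verified directly from the model: `Flux.params` collects all trainable arrays, so summing their lengths gives the total. A quick check against the layer-by-layer arithmetic:

```julia
# Parameter count by layer (weights + biases):
# Conv((5,5), 3=>16): 5*5*3*16 + 16 =  1,216
# Conv((5,5), 16=>8): 5*5*16*8 + 8  =  3,208
# Dense(200, 120):    200*120 + 120 = 24,120
# Dense(120, 84):     120*84  + 84  = 10,164
# Dense(84, 10):      84*10   + 10  =    850
# Total:                              39,558
sum(length, params(m_cpu))   # 39558
```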


Finally, the loss functions and optimizers need to be defined. I again define these separately for the GPU and CPU so that the two benchmarks do not share any state.

```julia
# Define loss and optimizer for each device
loss_gpu(x, y) = sum(crossentropy(m_gpu(x), y))
opt_gpu = Momentum(0.01)

loss_cpu(x, y) = sum(crossentropy(m_cpu(x), y))
opt_cpu = Momentum(0.01)
```


### Benchmark Results

The training loops defined below run for a single epoch to avoid overly long runtimes on the CPU. Of course, feel free to increase `epochs` for more extensive benchmarking.

```julia
# Set number of training iterations
epochs = 1

# GPU benchmark
@btime for epoch = 1:epochs
    for d in train_gpu
        gs = gradient(params(m_gpu)) do
            loss_gpu(d...)
        end
        update!(opt_gpu, params(m_gpu), gs)
    end
end
```

```
828.756 ms (1407620 allocations: 56.20 MiB)
```

```julia
# CPU benchmark
@btime for epoch = 1:epochs
    for d in train_cpu
        gs = gradient(params(m_cpu)) do
            loss_cpu(d...)
        end
        update!(opt_cpu, params(m_cpu), gs)
    end
end
```