Hello,
Really enjoyed looking at your code.
My first suggestion is to use this NN structure (unorthodox but efficient):
Input := FNN.AddLayer(TNNetInput.Create(FNNConfig.Width, FNNConfig.Height, FNNConfig.Depth)); // input volume
FNN.AddLayer(TNNetConvolutionReLU.Create(16, 5, 0, 1));  // 16 filters, 5x5, no padding, stride 1
FNN.AddLayer(TNNetMaxPool.Create(4));                    // 4x4 max pooling
FNN.AddLayer(TNNetConvolutionReLU.Create(64, 3, 1, 1));  // 64 filters, 3x3, padding 1, stride 1
FNN.AddLayer(TNNetConvolutionReLU.Create(64, 3, 1, 1));  // 64 filters, 3x3, padding 1, stride 1
FNN.AddLayer(TNNetFullConnectReLU.Create(64));           // fully connected, 64 ReLU neurons
FNN.AddLayer(TNNetFullConnectReLU.Create(32));           // fully connected, 32 ReLU neurons
FNN.AddLayer(TNNetFullConnectLinear.Create(FNNConfig.NumClasses)); // one linear output per class
FNN.AddLayer(TNNetSoftMax.Create());                     // softmax over the class outputs
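If you want to sanity-check the sizes after swapping in this structure, printing the layer shapes helps. As far as I recall, DebugStructure in CAI does exactly that; treat this one-liner as a quick check, not part of the required changes:

// Prints each layer and its output dimensions, so you can see how the
// 5x5 convolution plus the 4x4 max pooling shrink the input early on.
FNN.DebugStructure();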
Then, use a bigger learning rate together with a smoother decay schedule:
LearningRate := 0.001;
MinLearningRate := 0.00001;
LearningRateDecay := 0.99;
StaircaseEpochs := 1;
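With StaircaseEpochs at 1, the decay is applied after every single epoch, so the rate slides down smoothly instead of dropping in big steps. Roughly, the schedule these values describe looks like the sketch below; it assumes a plain multiplicative decay clamped at the minimum (Max comes from the Math unit), so adapt it to however your own training loop updates the rate:

// Run once at the end of every epoch (StaircaseEpochs = 1):
// 0.001 -> 0.00099 -> 0.0009801 -> ... but never below 0.00001.
LearningRate := Max(LearningRate * LearningRateDecay, MinLearningRate);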
You can compare the attachments to your code and apply whichever changes you like. I also made the saving of the NN less resource-hungry: on bigger machines (such as 64 or 96 cores), saving the NN might otherwise be too intensive. I might benchmark it over the weekend on a high-core-count computer.
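If you want to trim it further on your side, one cheap option is to save only when the validation accuracy actually improves instead of every epoch. This is just a sketch, not necessarily what the attachment does; BestAccuracy, ValidationAccuracy and FNNConfig.FileName are illustrative names:

// Save the network only on improvement, which keeps serialization and
// disk work from dominating the epoch time on many-core machines.
if ValidationAccuracy > BestAccuracy then
begin
  BestAccuracy := ValidationAccuracy;
  FNN.SaveToFile(FNNConfig.FileName);
end;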
On my dual-core notebook, each epoch takes from 50 to 65 seconds (no video card, no OpenCL).
About benchmarking: we can't fairly compare 3 convolutional layers + 3 fully connected layers with "simpler methods".
Have a look at how the first epoch goes in the attached image (96% accuracy).
BTW, really well done. Congrats.