Navigating Neural Networks: Exploring State-of-the-Art Activation Functions – OMSCS 7641: Machine Learning

ericjmorey · 1 year ago

Newtra@pawb.social · edit-2 1 year ago

Note: For this guide, we’ll focus on functions that operate on the scalar preactivations at each neuron individually.

It's frustrating to see this, since large models have shown that the choice of scalar activation function makes only a tiny difference once the model is wide enough.

https://arxiv.org/abs/2002.05202v1 shows GLU-based activation functions (2 inputs->1 output) almost universally beat their equivalent scalar functions. IMO there needs to be more work around these kinds of multi-input constructions, as there are much bigger potential gains.
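For anyone who hasn't seen the construction, here is a minimal NumPy sketch of the "2 inputs -> 1 output" idea: two separate linear projections of the same input, one passed through a gate function and multiplied elementwise with the other. The function names, shapes, and random weights are made up for illustration, biases are omitted, and the GELU uses the common tanh approximation. It is not the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(z):
    # Swish with beta = 1 (also known as SiLU)
    return z * sigmoid(z)

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def glu_variant(x, W, V, gate_fn=sigmoid):
    """Gated unit: gate_fn(x @ W) multiplied elementwise with (x @ V)."""
    return gate_fn(x @ W) * (x @ V)

# Hypothetical sizes: batch of 4, 8 input features, 16 hidden units
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 16))  # gate projection
V = rng.normal(size=(8, 16))  # value projection

glu_out = glu_variant(x, W, V, sigmoid)   # GLU
geglu_out = glu_variant(x, W, V, gelu)    # GEGLU
swiglu_out = glu_variant(x, W, V, swish)  # SwiGLU
```

Note that each hidden unit now depends on two preactivations instead of one, so a gated layer carries two weight matrices where a plain scalar-activation layer would carry one.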

E.g. even for cases where the network only needs static routing (tabular data), transformers sometimes perform magically better than MLPs. This suggests there’s something special about self-attention as an “activation function”. If that magic can be extracted and made sub-quadratic, it could be a paradigm shift in NN design.

ericjmorey · 1 year ago

The authors of the blog post seem aware of the limitations of their focus:

In contrast, ReLU and its variants are often preferred for the hidden layers on large datasets and deeper models as they accelerate training. CNNs frequently benefit from the ReLU variants and the Swish activation function. When training a DNN, Leaky ReLU is generally a good starting point. Alternatively, one can choose ReLU activations and inspect the percentage of dead neurons, switching to LeakyReLU or PReLU if required. GeLU shines in NLP tasks despite its computational cost. Swish, while promising, is relatively new and requires further exploration, interpretability and testing.

The activation function landscape is rich and diverse, offering a spectrum of choices to cater to various neural network needs. I hope this guide served as a good starting point for more exploration based on your requirements and network design.
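The "inspect the percentage of dead neurons" step in the quoted passage could look roughly like the sketch below. It assumes a unit counts as dead when its preactivation is non-positive for every sample in a batch. The layer sizes, the helper name `dead_fraction`, and the large negative bias used to force some dead units are all invented for the demonstration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):
    # Small negative slope keeps a nonzero output (and gradient) for z < 0
    return np.where(z > 0, z, alpha * z)

def dead_fraction(preacts):
    """Fraction of units whose preactivation is <= 0 for every sample.

    preacts has shape (batch, units).
    """
    return float(np.mean(np.all(preacts <= 0.0, axis=0)))

# Hypothetical layer: 256 samples, 10 inputs, 32 hidden units
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 10))
W = rng.normal(size=(10, 32))
b = np.zeros(32)
b[:8] = -100.0  # artificially push the first 8 units far into the negative range

pre = x @ W + b
frac = dead_fraction(pre)

relu_out = relu(pre)         # the 8 forced units output exactly 0 for all samples
leaky_out = leaky_relu(pre)  # the same units still pass a small negative signal
```

In a real training run the preactivations would come from the network itself, for example collected over a validation batch, and the threshold for "too many" dead units is a judgment call.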

ericjmorey · edit-2 1 year ago

Thank you for highlighting this research! At first glance it’s interesting that sigmoid functions re-emerge as more useful using the approaches evaluated in that article.
