Hyena Hierarchy seems to be intended as a drop-in replacement for attention: https://arxiv.org/pdf/2302.10866.pdf
It looks good on paper, but I haven't been able to find anybody using it in a model. Does anyone have example code or an implementation? Is there really a big improvement at long context lengths?
My research area is time series forecasting and unsupervised anomaly detection, which is somewhat related to NLP.
Papers With Code lists a few candidate implementations: https://paperswithcode.com/paper/hyena-hierarchy-towards-larger-convolutional
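If you just want to see the core idea rather than a full training repo, here is a minimal sketch of what the Hyena operator boils down to: long convolutions computed via FFT in O(L log L), alternated with elementwise gating. This is not the official implementation; the function names are mine, the filters and gates here are random placeholders, and in the real model the gates are learned projections of the input and the filters come from a small FFN over positional encodings.

```python
import numpy as np

def fft_long_conv(u, k):
    """Causal long convolution via FFT: O(L log L) instead of O(L^2).
    u: (L, D) input sequence, k: (L, D) per-channel filter."""
    L = u.shape[0]
    n = 2 * L  # zero-pad so the circular FFT convolution becomes a linear one
    U = np.fft.rfft(u, n=n, axis=0)
    K = np.fft.rfft(k, n=n, axis=0)
    return np.fft.irfft(U * K, n=n, axis=0)[:L]

def hyena_operator_sketch(x, filters, gates):
    """Order-N Hyena recurrence (sketch): alternate elementwise gating
    with implicit long convolutions. In the paper, gates are linear
    projections of the input and filters are parameterized implicitly."""
    v = x * gates[0]
    for k, g in zip(filters, gates[1:]):
        v = g * fft_long_conv(v, k)
    return v

# toy example with random placeholder filters/gates (assumption, not the paper's init)
rng = np.random.default_rng(0)
L, D, order = 64, 8, 2
x = rng.standard_normal((L, D))
filters = [rng.standard_normal((L, D)) * 0.1 for _ in range(order)]
gates = [rng.standard_normal((L, D)) for _ in range(order + 1)]
y = hyena_operator_sketch(x, filters, gates)
print(y.shape)  # (64, 8)
```

The FFT trick is where the claimed long-context advantage comes from: a dense attention map costs O(L^2), while each Hyena convolution costs O(L log L), so the gap widens as context grows.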
I am always skeptical of papers like this. The results could be real, but how much did they tune the experiments to look good on paper?