IIRC, DETR generates a sequence to predict object boxes. I think this paradigm can be applied to such models. “Think before you locate” could be a new path to explore.
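For context, here is a minimal sketch of the DETR-style idea: a fixed set of decoder query outputs is mapped to class logits and normalized boxes. This is a simplified illustration, not DETR's actual code, and the module/head names are my own.

```python
import torch
import torch.nn as nn

# Hedged sketch of DETR-style box prediction (simplified; the real model
# runs a full transformer encoder-decoder first; names are illustrative).
class BoxHeadSketch(nn.Module):
    def __init__(self, dim=256, num_classes=91):
        super().__init__()
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 4),  # (cx, cy, w, h), normalized to [0, 1]
        )

    def forward(self, decoder_out):
        # decoder_out: (B, num_queries, dim) from the transformer decoder
        return self.class_head(decoder_out), self.box_head(decoder_out).sigmoid()
```

“Think before you locate” would presumably mean generating some reasoning tokens before emitting these box predictions.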
https://github.com/Jamie-Stirling/RetNet (unofficial implementation)
I know we are moving away from Reddit. However, if I don’t link, I feel like we may miss out on good threads from r/machinelearning. Moreover, the authors don’t only post arXiv links; they also post other stuff such as summaries, key points, etc. (e.g. this).
So can I at least put them in the posts instead of posting in a comment?
The idea is similar to BLIP-2: both papers use learnable tokens as queries for a transformer decoder, which queries the vision feature space based on the trainable queries and the prompt.
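A rough sketch of that shared mechanism: a small set of learnable query tokens cross-attends over frozen image features (optionally also over prompt embeddings). This is my own simplified illustration of the idea, not either paper's actual module; all names here are made up.

```python
import torch
import torch.nn as nn

class QueryDecoderSketch(nn.Module):
    """Hedged sketch: learnable queries cross-attending to vision features."""

    def __init__(self, num_queries=32, dim=256, num_heads=8):
        super().__init__()
        # Learnable query tokens, shared across all inputs
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats, prompt_embeds=None):
        # image_feats: (B, N, dim) patch features from a vision encoder
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        if prompt_embeds is not None:
            # optionally condition on prompt tokens as extra keys/values
            image_feats = torch.cat([image_feats, prompt_embeds], dim=1)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return out  # (B, num_queries, dim), fed onward to the language model
```

The output is a fixed-length set of vision-conditioned tokens, which is what makes it cheap to plug into a frozen LLM.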
I also want to share some resources.
For PyTorch,
For TPU,
Indeed, it would be great if the authors did so. I personally found some unofficial implementations: