VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

KingsmanVince@kbin.social · 1 year ago

The idea is similar to BLIP-2. Both papers use learnable tokens as queries for a transformer decoder. The decoder queries the vision space based on the trainable queries and the prompt.
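
To make the "learnable tokens as queries" idea concrete, here is a rough, dependency-free sketch of single-head cross-attention in which a small set of query vectors attends over image patch features. It is not the code from either paper: the sizes, the random stand-in values, and in particular the way the prompt is mixed in (adding one prompt embedding to every query) are assumptions for illustration only, and the usual learned projection matrices are left out.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, image_features):
    """Scaled dot-product attention with the image features as keys and values.

    Learned query/key/value projections are omitted to keep the sketch short.
    """
    dim = len(queries[0])
    scores = matmul(queries, transpose(image_features))
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, image_features), weights


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes and random stand-ins for trained parameters and encoder output.
rng = random.Random(0)
NUM_QUERIES, NUM_PATCHES, DIM = 4, 16, 8
learnable_queries = rand_matrix(NUM_QUERIES, DIM, rng)  # would be trained
prompt_embedding = rand_matrix(1, DIM, rng)[0]          # would come from the text prompt
image_features = rand_matrix(NUM_PATCHES, DIM, rng)     # would come from a vision encoder

# Assumption: condition each query on the prompt by simple addition.
conditioned = [[q + p for q, p in zip(row, prompt_embedding)] for row in learnable_queries]

outputs, weights = cross_attention(conditioned, image_features)
print(len(outputs), len(outputs[0]))  # one output vector per query
```

Each query ends up with a weighted mix of patch features, so the number of outputs is fixed by the number of queries rather than by the image resolution.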

nsa@kbin.social · 1 year ago

Also reminds me of this ICLR paper: Linearly Mapping from Image to Text Space.