IIRC, DETR generates a sequence to predict object boxes. I think this paradigm can be applied to such models. “Think before you locate” could be a new path to explore.
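For context, here is a minimal sketch of the DETR-style idea: a fixed set of decoder query outputs is mapped to class logits and normalized boxes. This is a simplified illustration, not DETR's actual code, and the module/head names are my own.

```python
import torch
import torch.nn as nn

# Hedged sketch of DETR-style box prediction (simplified; the real model
# runs a full transformer encoder-decoder first; names are illustrative).
class BoxHeadSketch(nn.Module):
    def __init__(self, dim=256, num_classes=91):
        super().__init__()
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, 4),  # (cx, cy, w, h), normalized to [0, 1]
        )

    def forward(self, decoder_out):
        # decoder_out: (B, num_queries, dim) from the transformer decoder
        return self.class_head(decoder_out), self.box_head(decoder_out).sigmoid()
```

“Think before you locate” would presumably mean generating some reasoning tokens before emitting these box predictions.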
https://github.com/Jamie-Stirling/RetNet (unofficial implementation)
I know we are moving away from Reddit. However, if I don’t link, I feel like we may miss out on good threads from r/machinelearning. Moreover, the authors don’t only post arXiv links; they also post other stuff such as summaries, key points, etc. (e.g. this).
So can I at least put them in the posts instead of posting in a comment?
The idea is similar to BLIP-2: both papers use learnable tokens as queries for a transformer decoder, which queries the vision feature space based on the trainable queries and the prompt.
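A rough sketch of that shared mechanism: a small set of learnable query tokens cross-attends over frozen image features (optionally also over prompt embeddings). This is my own simplified illustration of the idea, not either paper's actual module; all names here are made up.

```python
import torch
import torch.nn as nn

class QueryDecoderSketch(nn.Module):
    """Hedged sketch: learnable queries cross-attending to vision features."""

    def __init__(self, num_queries=32, dim=256, num_heads=8):
        super().__init__()
        # Learnable query tokens, shared across all inputs
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feats, prompt_embeds=None):
        # image_feats: (B, N, dim) patch features from a vision encoder
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        if prompt_embeds is not None:
            # optionally condition on prompt tokens as extra keys/values
            image_feats = torch.cat([image_feats, prompt_embeds], dim=1)
        out, _ = self.cross_attn(q, image_feats, image_feats)
        return out  # (B, num_queries, dim), fed onward to the language model
```

The output is a fixed-length set of vision-conditioned tokens, which is what makes it cheap to plug into a frozen LLM.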
I also want to share some resources.
For PyTorch,
For TPU,
Indeed, it would be great if the authors did so. I personally found some unofficial implementations: