Hello everyone!

TL;DR: I want to propose a community-driven effort to research and improve 1-bit LLMs, for use within and outside the Horde. I think having access to such models would be very useful for the overall project, as you do not need a lot of compute to run “bigger” models if they are compressed well. Relevant paper.

We currently live in very interesting times for AI development. The big companies in the US still seem to be ahead, but groups from China working on open-weights models are making very impressive strides. However, the focus has been, and still is, on developing really big models. Most of the impressive models that keep coming out are bigger and bigger, leaving most people to pay for API tokens if they want something useful. This is probably also incentivised, as Nvidia wants to make more money from sales to data centers.

I have been keeping an eye on the AI Horde project, and I always wanted to help out, but never got around to setting it up properly due to my AMD GPU and running Windows. In any case, it is a really cool project, and I wanted to congratulate everyone involved!

In my opinion, the goal of the AI Horde also positions its community as one of the very few capable of bringing forward LLMs that can be “smart” on consumer-level hardware. That is, models that can handle long-horizon tasks well without paying providers for API access.

Since the first paper on 1.58-bit LLMs, there have been quite a few other suggestions thrown around (recent example). No one really uses these models much, as the quality is seriously degraded. Similarly, no one is really trying to improve them further, as there is no incentive to do so when you can just pay someone $3 per million output tokens from an existing open-weight model. For example, to the best of my knowledge, no one has tried to introduce latent reasoning (example) in this context, or specifically training/fine-tuning models at 1-bit precision.

So, to get to the point, would it make sense to get some community-driven research in this area? I believe that we could all pool together compute, good training data, ideas for fine-tuning / RL-training, etc. If it works out, we could have a method that makes existing larger models (say, up to 200B) available on a single 24GB GPU.

First thing I would try is to expand on the recent NanoQuant paper:

  1. Wait for weights to be released.
  2. If no weights come out, quantize Qwen 3 32B and try a more diverse dataset with more tokens, to see if fidelity can improve. I could get some access to GPUs for this myself. Another option for 1-bit models would be using other existing ones (e.g., Unsloth), but performance degradation is much bigger in those versions, from what I have seen. Furthermore, the compression of those models is not as efficient, and you would not be able to fully run a 70B-parameter model (as described in the paper) with only 8 GB of VRAM.
  3. Get an LLM (or human volunteers) to determine behaviour on various tasks: find limitations, strengths, etc. Get some human preferences for RLHF, or use a bigger LLM to grade output quality. Preference here on logic tasks.
  4. Perform fine-tuning of 1-bit model based on the gathered data, and deploy for use. Return to step 2 after a while.
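
As a rough illustration of what the quantization in step 2 involves, here is a minimal numpy sketch of absmean ternary quantization in the style of BitNet b1.58 (function names and the per-tensor scale are my own; NanoQuant’s exact recipe may differ):

```python
import numpy as np

def absmean_ternary_quantize(W):
    """Map a weight matrix to {-1, 0, +1} plus one per-tensor scale,
    following the absmean scheme from the BitNet b1.58 paper."""
    scale = np.mean(np.abs(W)) + 1e-8          # per-tensor scaling factor
    W_q = np.clip(np.round(W / scale), -1, 1).astype(np.int8)
    return W_q, scale

def dequantize(W_q, scale):
    """Recover an approximate full-precision matrix for matmuls."""
    return W_q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)).astype(np.float32)
W_q, scale = absmean_ternary_quantize(W)
W_hat = dequantize(W_q, scale)                 # lossy reconstruction of W
```

With only three possible weight values plus one scale, each parameter needs about 1.58 bits, which is where the memory savings come from.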

Fine-tuning the 1-bit model might get a bit hairy, as the binary operations are not differentiable. We also wouldn’t be able to up-cast back to FP32 for regular training, as this would completely invalidate the consumer-driven access to these models. The simplest idea would be to train a LoRA head, or do some stochastically-driven training (e.g., flipping bits). However, the latter would probably be very unstable and not work out, so LoRA might be the only option. I’m not a mathematician, so I am open to suggestions here :)
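
To make the LoRA option concrete: the ternary base weights stay frozen (so their non-differentiability never matters), and only two small full-precision adapter matrices receive gradients. A minimal numpy sketch, with an illustrative scale value and made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                                  # model width, LoRA rank

# Frozen ternary base weights, as produced by a 1-bit quantizer.
W_q = rng.integers(-1, 2, size=(d, d)).astype(np.int8)
scale = 0.05                                 # hypothetical per-tensor scale

# Small trainable full-precision adapters; only these would get gradients.
A = rng.normal(0.0, 0.01, size=(r, d)).astype(np.float32)
B = np.zeros((d, r), dtype=np.float32)       # zero-init: adapter starts as a no-op

def forward(x):
    base = (W_q.astype(np.float32) * scale) @ x   # frozen, non-differentiable part
    return base + B @ (A @ x)                     # differentiable low-rank update

x = rng.normal(size=d).astype(np.float32)
y = forward(x)
```

Since B starts at zero, the adapted model is initially identical to the quantized base, and training only ever touches the small adapters rather than the bit-packed weights.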

Past the initial prototype, I would consider the following stuff that’s already implemented in other quants:

  • Try to re-calibrate the LayerNorm / RMSNorm parameters based on the original model’s activations.
  • Perform some regular KL-divergence distillation.
  • Other than using an LLM as a judge for RL, perhaps one could also fine-tune using a semantic similarity metric while aligning with the output of the original model. This could ensure that the intent is the same, even if the style differs.
  • Depending on token complexity, look into reducing compression just for difficult tokens and compressing further for easy ones (à la FlexQuant)
  • Mixture-of-Experts-style quantization, where we increase compression for experts that are not important, and reduce it for higher-frequency ones.
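
For the KL-divergence distillation bullet, the per-token objective is just the KL divergence between the original (“teacher”) model’s next-token distribution and the quantized (“student”) model’s. A minimal numpy sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                            # numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

teacher_logits = np.array([2.0, 1.0, 0.1])     # original full-precision model
student_logits = np.array([1.8, 1.1, 0.2])     # quantized model
loss = kl_divergence(softmax(teacher_logits), softmax(student_logits))
```

Minimising this loss over a distillation corpus pushes the quantized model to match the original’s token distributions, without needing any labelled data.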

Curious what everyone thinks!

  • hendrik@palaver.p3x.de · 8 days ago

    Have a nice weekend as well!

    By the way, in the meantime I rediscovered one of the distributed AI projects I remembered reading about: https://github.com/learning-at-home/hivemind

    They have some links to related projects and citations of scientific papers at the bottom.
    It covers approaches to both distributed inference and distributed training.

    • andrew0@lemmy.dbzer0.comOP · 8 days ago

      Oh, I remember this! Thanks for sharing, very nice find! Could be a worthwhile approach once we have the data :)