Community-driven efficient LLM development - Possible?

andrew0@lemmy.dbzer0.com · edit-2 20 days ago

andrew0@lemmy.dbzer0.com · 21 days ago

The models themselves would indeed be costly to train if you were to go for the regular approach. You would have to “upscale” the weights from binary to fp32, which would make the models only trainable on the usual number of GPUs. That is because the training process relies on back-propagation, which only makes sense if your operations are differentiable. Since rounding the weights to binary values is not differentiable (its gradient is zero almost everywhere), your binary weights would only be updated by 0, so no change.
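To make the zero-update problem concrete, here is a minimal toy sketch with a single weight. It assumes the binarization is a plain `sign()` function and compares the true gradient with a straight-through estimator (STE), a common workaround that is my own addition and not something named in this thread. The toy loss, learning rate, and all names are illustrative assumptions.

```python
def sign(x: float) -> float:
    """Binarize a latent weight to -1 or +1."""
    return 1.0 if x >= 0 else -1.0


def numeric_grad(f, x: float, eps: float = 1e-4) -> float:
    """Central finite-difference estimate of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


# Toy one-weight "model": prediction = sign(w) * x, loss = (prediction - target)^2
x, target = 2.0, -2.0
w_latent = 0.3  # full-precision latent weight kept alongside the binary one


def loss(w: float) -> float:
    return (sign(w) * x - target) ** 2


# The true gradient through sign() is zero away from w = 0,
# so a plain gradient step never moves the weight.
true_grad = numeric_grad(loss, w_latent)

# Straight-through estimator (assumption, not from the thread):
# treat d sign(w)/dw as 1 in the backward pass only.
ste_grad = 2 * (sign(w_latent) * x - target) * x

lr = 0.05
w_plain = w_latent - lr * true_grad  # unchanged
w_ste = w_latent - lr * ste_grad     # crosses zero, so the binary weight flips
```

The latent weight is what needs the fp32 (or bf16) storage, which is why training this way costs as much memory as training a regular model.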

However, LoRA (16-bit) or QLoRA (4/8-bit) fine-tuning can be done on a single GPU, assuming you can fit the model on it. Everything is frozen, except for a separate small network, which is updated during training. This can have BF16 or F32 precision, and would be trained as you would a regular network.
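As a rough illustration of why the trainable part is so small, here is a minimal pure-Python sketch of a LoRA-style layer: a frozen matrix `W` plus a low-rank update `B @ A`. The layer size (4096) and rank (8) are hypothetical values picked for the example, not numbers from any paper mentioned here.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)               # frozen (possibly quantized) path
    delta = matvec(B, matvec(A, x))   # low-rank trainable path
    return [b + scale * d for b, d in zip(base, delta)]


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


# Hypothetical layer: 4096 x 4096 frozen weights, rank-8 adapter.
frozen = 4096 * 4096
trainable = lora_trainable_params(4096, 4096, rank=8)
fraction = trainable / frozen  # well under 1% of the layer

# Tiny forward pass: with B initialised to zeros the adapter changes nothing.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]            # rank 1
B_zero = [[0.0], [0.0]]
B_trained = [[1.0], [2.0]]
y_start = lora_forward(W, A, B_zero, [3.0, 4.0])
y_after = lora_forward(W, A, B_trained, [3.0, 4.0])
```

Starting `B` at zero is the usual convention so that fine-tuning begins from the frozen model's exact behaviour.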

What I am suggesting is to actually leverage bigger models that come out, and attempt to compress them using the proposed algorithm (if it actually scales to bigger models). From there, we could employ some tricks to improve performance, think latent reasoning, community-driven RLHF only on the (Q)LoRA layers, etc. With time, we would be able to pool together a dataset and a pipeline that can be applied to any open-weight model that is released.

But it does sound a bit easier than it would be in practice. This heavily relies on re-purposing the Horde to also store training data (with user consent, of course), user scores, and later introduce a training queue.

hendrik@palaver.p3x.de · edit-2 20 days ago

Sure. I mean I know we can do it for 4bit or 8bit quantizations. Question is just if it can be done for 1bit. As per Microsoft’s first Bitnet paper, the answer was a clear: No. Seems I’ve missed the NanoQuant paper from a few days ago. They did post-training quantization. But the numbers don’t look impressive to me? I mean I’m not an expert and have just skimmed it. But the benchmarks in table 3 look like some very old model from a few years ago. Every tiny modern model can reach better scores. And the perplexity in table 2 doesn’t look great to me either. Sure, you could run a 1-bit version of a 70b parameter model… But that’s just worse than a 1b parameter model. So you can probably just skip the entire effort with the 1-bit quantization, download the smallest variant there is, run it straight out of the box, and it’d be both more “intelligent” and at least 4x faster.

I mean ultimately, you’re right. I’d like to know the benchmark results for something like a 70b model. Seems they only list those for small models. And it’s a shame they didn’t release any weights.

But I don’t see why we need a collaborative Horde effort to find out… They write a 70b model can be compressed on an H100 in 13 hours. And for example runpod.io charges $2.39 per hour for such an instance. So all we need is a bit more than $30 to find out?!
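A quick sanity check of that estimate, using only the two figures quoted above (13 hours, $2.39 per hour). It ignores storage, setup time, and failed runs, so treat it as a lower bound.

```python
hours_on_h100 = 13          # compression time for a 70b model, as quoted above
usd_per_hour = 2.39         # hourly H100 price, as quoted above

rental_cost_usd = hours_on_h100 * usd_per_hour
print(f"Estimated rental cost: ${rental_cost_usd:.2f}")
```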

andrew0@lemmy.dbzer0.com · 20 days ago

Indeed, the quantization described in the Microsoft paper (and even in this NanoQuant paper) severely messes up the behaviour of the model. Even in this newer paper, you’d still incur ~2x performance loss (which is better than what was reported by the 1.58-bit paper, if true), in terms of perplexity. However, as per the other paper I have added in the edited post, it is possible to further align a quantized model with the original one. In the end, LLMs are just fancy math that seeks to maximize human preferences, and most of the bigger models were just better trained at doing that. With this approach, all we would have to do is just further refine the LoRA weights until we can match the behaviour of the unquantized model, which wouldn’t be that expensive if all we have to do is fine-tune a few million parameters. It might be that at the beginning we’re seeing worse performance compared to a 3B parameter model, but with more refinement we can further unlock some of the original performance.
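One way to picture “matching the behaviour of the unquantized model” is as a distillation objective: measure how far the quantized model's next-token distribution is from the original's, and train only the LoRA weights to shrink that gap. The sketch below uses a plain KL divergence over logits. That is a generic choice on my part and not necessarily the loss used in the paper linked in the post; the example logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) over one next-token distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up logits over a 3-token vocabulary.
original_model = [2.0, 0.5, -1.0]        # unquantized "teacher"
quantized_plus_lora = [1.0, 1.0, -0.5]   # "student" before refinement

gap_before = kl_divergence(original_model, quantized_plus_lora)
gap_if_matched = kl_divergence(original_model, original_model)
```

In a real setup this would be averaged over many prompts and token positions, with gradients flowing only into the adapter.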

Regarding the use of the Horde, I believe that behaviour alignment can’t be done without actually using it. Just like corpo-AI are giving away their models so that they can further get data, we could have a similar, but much more compute-efficient, community-driven approach. Models by the people, for the people, if you will. Furthermore, as I mentioned, I think this would be the only community that has the compute and desire to push improvements on such an idea long-term, as it isn’t profit-driven.

Let’s say that this whole experiment starts with an extreme case, the MiniMax M2.5 model, and we abstract away from any architectural fancy stuff. At ~230B parameters, we would have a 1-bit model size of ~28.75 GB, and, as per Table 2 of NanoQuant, ~23 GB if we were to prune 20% of the weights. This would be enough to fully fit it on a 24GB VRAM GPU. Following this, we could get a well-balanced list (i.e., easy, medium, hard) of reasoning tasks, and fine-tune the LoRA layer to match the output. Heck, we could even tailor this to specific tasks, such as role-playing, coding, etc. It will be a long-term experiment where we might serve two answers (depending on Horde availability), one generated by the quantized model + LoRA and another that is regularly deployed. The user could then choose the model they prefer, and use that information later for further training.
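The size estimate above works out as follows. This is only weight storage in decimal gigabytes; it assumes exactly 1 bit per remaining weight and leaves out the LoRA weights, KV cache, activations, and any bookkeeping for the pruning, so the real headroom on a 24GB card would be tighter than it looks.

```python
def weight_size_gb(n_params: int, bits_per_weight: float,
                   keep_fraction: float = 1.0) -> float:
    """Storage for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * keep_fraction * bits_per_weight / 8 / 1e9


n_params = 230_000_000_000  # ~230B parameters, as stated above

one_bit_gb = weight_size_gb(n_params, bits_per_weight=1)
pruned_gb = weight_size_gb(n_params, bits_per_weight=1, keep_fraction=0.8)

print(f"1-bit weights:            {one_bit_gb:.2f} GB")
print(f"1-bit weights, 20% pruned: {pruned_gb:.2f} GB")
print(f"Under 24 GB: {pruned_gb < 24}")
```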

This would indeed be quite cumbersome to set up, and could very well be wasted time. Users might even opt out from this because it could take too much time to help. But hey, I still think it would be a cool experiment to see if consumers could actually use these larger models on regular hardware, and get close to the original performance without paying for all the compute that is needed.

hendrik@palaver.p3x.de · edit-2 19 days ago

Indeed. That looks nice. It’s about 2bit quantization. So not sure if it translates to the other paper. I had a quick glance at their code, and it’s specific to the Llama2 and Llama3 architectures. So, it’d need to be enhanced for other models. And what might be a bummer: they load the model at full precision to calculate the activations. That means you’re looking at a system with ~480GB of (V)RAM. And we don’t have machines of that size show up on the AI horde. (As far as I know.)

I think we’re looking more at crowdfunding research here. I mean sorry for being overly negative. I’d like to see 1bit models as well. And I always love to see community projects and independent people push the limits. I just think the hard part is coming up with the research, the math… or even the engineering to combine two papers and adapt an approach to something. So we somehow need to crowdfund that.
In these two examples, it seems the compute power isn’t really the issue. I mean the 1bit training was doable on a single H100. And this LoRA isn’t very complex either, and they’re not using that many samples.

It just wouldn’t fit on any of the 38 LLM workers currently online on the AI Horde. Not even remotely. So this and the Horde are kind of a bad / impossible fit. However, I still think compute power wouldn’t be the biggest issue, we can rent that by the hour. And it’s not even hard to set up or that expensive. I think the main issue is coming up with the math and the code to produce something useful. So maybe we need a research community. And these things already happen. I mean the llama.cpp community has long been working on quantization and pioneered some things. There are people on Reddit discovering new things. We’ve had random(?) individuals contribute substantial advancements to image and video generation. There have been communities/projects like RedPajama, who trained a model from the ground up (and assembled the dataset)… Seems very low precision quantization is just a tough nut to crack.

Seems to me Bitnet needs a pile of money in compute, plus a team of bright researchers to improve upon. NanoQuant doesn’t perform as well as any 4bit or 8bit model with a similar resource footprint, so it’d need way more research as well. And RILQ is a bit specific, it’d need further research as well. It’s not entirely clear whether that happens. There’s something like publication bias. Sometimes researchers don’t publish negative results. So maybe they tried to apply it to lower bit widths, failed, and didn’t write a paper about how they failed. So I’m not sure where to go with this. There isn’t anything we could run or just apply as is.

And the AI Horde does inference with fixed scripts, on something like gaming GPUs and Apple silicon. People who bought a few old 3090s. But that’s inference only. What’s needed for general research is a new project. It’d need to provide you with cloud GPUs and launch Docker containers for arbitrary workloads. And it’d need expensive enterprise GPUs, or InfiniBand clusters of them. So the entire software needs to be scrapped and replaced, and the hardware improved as well for cutting edge research. We maybe can call this new thing AI Horde as well. But it’d be an entirely new thing.

And I guess alignment, harvesting user data and preferences from the user’s interactions could be done as well. At least from the technological perspective. I don’t really know if the audience likes that. Depends a bit on how it’s done.

andrew0@lemmy.dbzer0.com · 8 days ago

Apologies for the late reply! Busy days :D

I agree with you. Crowd-sourcing this type of research would be a completely different goal than what the AI Horde was built for, and would probably not be sustainable with part-time / volunteer researchers. Perhaps it’s best for us to just wait until others make more substantial progress.

The goal would still have been inference for the Horde, but with sharing of feedback based on the model’s outputs, to align it more with the original one. However, after considering this approach more, I am afraid that the maths behind it makes it impossible to “reconstruct” the original model’s manifold, or at least capture the same behaviour in all use cases.

I came here to propose this idea because, to the best of my knowledge, this is the only LLM community that actually pushes for sharing of resources. However, a few days ago I saw a post on the LocalLlama community advocating for sharing of OpenCode sessions in order to crowd-source a fine-tuning dataset, so it seems that more people are having the same thoughts! :)

I will keep an eye out for other advancements, and if I actually end up having some time, perhaps I’ll return with some contributions. I agree with you that such a project mostly relies on inference, in which case the AI Horde is not the only one that can provide that capability. What we would need is to deploy such a model on HuggingFace, and create an API endpoint for sharing training data for people that are interested in contributing.
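To give an idea of what such an endpoint might accept, here is a minimal sketch of one shared feedback record, serialised as a JSON line. The schema, field names, and helper are entirely hypothetical; nothing like this exists in the AI Horde or on HuggingFace as far as I know. It only reflects the idea from earlier in the thread: two answers, one user choice, and explicit consent.

```python
import json
from dataclasses import dataclass, asdict

ALLOWED_CHOICES = ("quantized", "reference")  # hypothetical labels


@dataclass
class PreferenceRecord:
    prompt: str
    answer_quantized: str   # from the quantized model + LoRA
    answer_reference: str   # from the regularly deployed model
    preferred: str          # one of ALLOWED_CHOICES
    consent: bool           # user explicitly agreed to share this record


def to_jsonl_line(record: PreferenceRecord) -> str:
    """Validate a record and serialise it as one JSON line."""
    if not record.consent:
        raise ValueError("record cannot be shared without user consent")
    if record.preferred not in ALLOWED_CHOICES:
        raise ValueError(f"preferred must be one of {ALLOWED_CHOICES}")
    return json.dumps(asdict(record), ensure_ascii=False)


example = PreferenceRecord(
    prompt="Explain LoRA in one sentence.",
    answer_quantized="LoRA trains a small low-rank add-on.",
    answer_reference="LoRA fine-tunes a model via low-rank weight updates.",
    preferred="reference",
    consent=True,
)
line = to_jsonl_line(example)
```

Records like this could later be turned into pairs for preference-based fine-tuning of the adapter only.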

Thanks a lot for offering your thoughts, and taking the time to write such lengthy responses to me! I hope you have a nice weekend!

hendrik@palaver.p3x.de · 8 days ago

Have a nice weekend as well!

By the way, in the meantime I rediscovered one of the distributed AI projects I remembered reading about: https://github.com/learning-at-home/hivemind

They have some links to related projects and citations of scientific papers at the bottom.
That is about approaches to both distributed inference, and distributed training.

andrew0@lemmy.dbzer0.com · 8 days ago

Oh, I remember this! Thanks for sharing, very nice find! Could be a worthwhile approach once we have the data :)