I have a 3-node HA cluster of enterprise hardware which runs a small home business and a bunch of personal VMs backed by iSCSI ZFS LUNs. It’s extremely underutilized at this point, but I’m looking to expand my usage.
I have a bit of money to spend on upgrades, and I am looking into the feasibility of procuring some GPUs. My budget is in the neighborhood of $600-1000. The main complicating factor is that I have 3 hosts. Because I have HA (and have experimented with ProxLB, which I am not using right now), all of my VMs are built as if they can freely migrate, and that’s been fine so far, but all my reading about adding GPUs has left me more confused than when I started. As for starting goals, I want to add GPU transcoding to my Jellyfin instance, dedicate a GPU to Immich for doing ML on my library, add a GPU to a VDI VM I use as a virtual workspace, and potentially run local language models for some development work that isn’t defined yet.
I’ve read a ton on using NVIDIA GPUs with vGPU, but I have a lot of concerns. I do not have the funds to buy an actual vGPU license, and the fastapi-dls workaround feels a little too unreliable to justify a large spend. A P40 would be in the budget range, but without tensor cores, and with the licensing complexity, I’m not sure it’s a good investment. Additionally, the documentation on how that would work with HA hasn’t clicked; I don’t even see HA mentioned here: https://pve.proxmox.com/wiki/NVIDIA_vGPU_on_Proxmox_VE . Most threads on using non-NVIDIA GPUs have led me to believe it isn’t really worth it.
Another option I thought of was using multiple lower-cost GPUs (like a Tesla P4) and just assigning full cards to the VMs; my reading, though, is that I’m going to break things with HA, and I’ll have to use resource mapping, which is more complex and outside my comfort zone: https://pve.proxmox.com/wiki/QEMU/KVM_Virtual_Machines#resource_mapping . I also wouldn’t have tensor cores with this option.
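For reference, my rough understanding from the wiki is that on current Proxmox the setup would look something like the sketch below (the node names, PCI addresses, device ID, and VM ID are all made-up placeholders, and I haven’t actually tried this), and it’s this layer I’m not confident about:

```
# Define a cluster-wide PCI mapping so "the P4" resolves to the right card on each node
pvesh create /cluster/mapping/pci --id p4-gpu \
  --map node=pve1,path=0000:81:00.0,id=10de:1bb3 \
  --map node=pve2,path=0000:81:00.0,id=10de:1bb3 \
  --map node=pve3,path=0000:81:00.0,id=10de:1bb3

# The VM then references the mapping instead of a raw PCI address
# (pcie=1 needs a q35 machine type)
qm set 201 -hostpci0 mapping=p4-gpu,pcie=1
```

The GUI under Datacenter → Resource Mappings looks like it does the same thing, just with a different front end.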
It seems like I have a lot of options, but at this point I’m not sure what the best way to proceed is. I don’t want to spend $600 and end up with something that causes more headaches than it brings value. I’d like to hear people’s advice, experience, and recommendations, especially if there’s something I’m missing that my reading hasn’t uncovered.


Yes, you have noticed correctly: HA with GPUs is very, very difficult. My understanding is that most people have given up on it and instead use something like Kubernetes to kill and restart the application on another machine, rather than truly failing over a virtual machine/container.
Now, you’ve stated that you can’t afford enterprise-grade GPUs. That’s okay. You should know that there are projects that unlock vGPU features on non-enterprise/server-grade NVIDIA GPUs.
This is the main one: https://github.com/DualCoder/vgpu_unlock — but it only supports NVIDIA GPUs from the 2080 (Turing) generation and below. It looks like there is some work on getting 30/40-series GPUs working, but I cannot find a public guide. It should be noted that the 20-series and newer do have tensor cores, unlike the Pascal cards (P40/P4) you’re considering!
Now, even if you cannot get vGPU working, you should also know that although you cannot share a GPU between virtual machines, you can share a GPU between LXC containers, which Proxmox supports managing.
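As a rough sketch of what that usually looks like (the container ID, card, and render node are placeholders; check /dev/dri on the host for yours), the classic approach is to allow the DRM devices and bind-mount them into the container’s config:

```
# /etc/pve/lxc/110.conf — allow the DRM devices (major 226) and bind them in
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.mount.entry: /dev/dri/card0 dev/dri/card0 none bind,optional,create=file
lxc.mount.entry: /dev/dri/renderD128 dev/dri/renderD128 none bind,optional,create=file
```

Several containers can point at the same render node at once, which is what makes this nicer than full passthrough for Jellyfin-style workloads; newer Proxmox releases also expose a per-container device passthrough option in the GUI that accomplishes the same thing, if I recall correctly.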
I also want to clarify something about the way HA in Proxmox works. Live migration only works when the relevant Proxmox hosts are online. Failover, which is a form of high availability, happens when a host goes offline, and the virtual machine is restarted on another host from the shared storage in use.
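So, as I understand it, a passthrough-GPU VM can still be registered as an HA resource; it just can’t live-migrate, and a failover restart only succeeds on a node that actually has the device it references (which is exactly the problem the resource mappings you linked are meant to solve). Registering it is just something like this (VM ID is a placeholder):

```
# Make the VM an HA resource; on host failure it gets restarted on another node
ha-manager add vm:201 --state started --max_restart 1
ha-manager status
```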
My honest recommendation is to give up on HA for this. Going for minimal cost, I would buy one cheaper GPU (Intel Arc offers the best bang for the buck right now, iirc) dedicated to VDI and Jellyfin hardware transcoding (but this only works if your VDI and Jellyfin are in LXC containers, since there is also no vGPU there), and one more expensive NVIDIA GPU with tensor cores for machine learning and AI workflows.
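Once the container has the device, a quick sanity check from inside it (assuming a VAAPI-capable card like an Arc and the libva-utils package installed) is:

```
# Inside the Jellyfin/VDI container: confirm the render node is visible
ls -l /dev/dri
# and that VAAPI actually reports decode/encode profiles for it
vainfo --display drm --device /dev/dri/renderD128
```
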
A big part of my HA strategy was for evaluation and dev of ProxLB as a DRS equivalent, since we plan to offboard from VMware at work and this lab is my test case for Proxmox as our landing spot. All of my apps are VMs, not containers. I have 3 open PCIe slots per host, and the hosts are R450s, which I find tricky for adding cards that need external power; that’s a point for the P4 even though I lose the tensor cores.
I just have a hard time committing a bunch of money to something that depends on a hack for hardware enablement. I’m kind of at the same conclusion: tossing a P4 in each host and assigning those 3 VMs as non-HA is probably the best path forward. I can wait, though, to see if some new solution comes forward.