Just wanted to put this out there for anyone else in the same position, as I’d spent some time banging on this to find a functioning combination and would have appreciated seeing success reports myself.
Running Debian Trixie, current as of July 22, 2023.
I see 512x512 speeds of about 2.2 it/s, which is significantly slower than a lower-end Nvidia card that I’d used, and significantly slower (about 1/8th the speed) than what other people have reported getting from the same RX 7900 XTX card on Linux, so there is probably more work for me to do. But it’s definitely running on the GPU and is much faster than running on the CPU, so I know that this combination (vanilla system Python, vanilla system drivers, torch nightly in a venv) does at least work, which was something that I’d been unsure of up until now.
Running on the host, no Docker containers. Using a venv. Automatic1111 web UI, in-repository drivers, torch 2.1.0.dev20230715+rocm5.5 installed via pip in a venv, standard system Python 3.11 (i.e. I did not need to set up Python 3.8, as I’ve seen some people do). Needs the non-free-firmware apt repo component enabled; I have firmware-amd-graphics-20230515-3. ROCm 5.6 is out from AMD as of this writing, but Debian Trixie presently only has 5.5 packaged in its repos.
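For anyone wanting to confirm that the torch build in their venv is a ROCm one and actually sees the card, here is a minimal sketch, to be run with the venv's Python. The function name `describe_torch_backend` is made up for illustration; it relies on ROCm builds of torch exposing the GPU through the usual `torch.cuda` API and reporting a HIP version in `torch.version.hip`.

```python
import importlib


def describe_torch_backend():
    """Return a dict describing the torch build, or None if torch is not installed."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return None
    gpu_ok = torch.cuda.is_available()
    return {
        # e.g. a nightly ROCm wheel carries a "+rocm..." suffix in its version
        "torch_version": torch.__version__,
        # None on builds without HIP support (CPU-only or CUDA builds)
        "hip_version": getattr(torch.version, "hip", None),
        "gpu_available": gpu_ok,
        "device_name": torch.cuda.get_device_name(0) if gpu_ok else None,
    }


if __name__ == "__main__":
    print(describe_torch_backend())
```

If `hip_version` comes back as `None` or `gpu_available` is `False`, generation is most likely falling back to the CPU.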
I did need to install libstdc++-13-dev; with only libstdc++-12-dev installed, Automatic1111 bailed out with an error about not being able to find the limits C++ header when building some C++ code at runtime. Some users had run into a similar error and resolved it by installing libstdc++-12-dev, which was a bit confusing. I have both clang and g++ installed. I am not terribly familiar with the AMD ROCm stack, but my understanding is that part of it (libamdhip64?) performs some compilation at runtime; it apparently caches the binaries it has compiled, since after I removed libstdc++-13-dev following a successful run, it continued to work.
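If you hit the missing limits header, it can help to see which installed libstdc++ versions actually ship it. The following is a rough sketch, assuming the Debian layout where libstdc++ headers live under /usr/include/c++/<version>/; the helper name `find_cxx_header` is made up for illustration.

```python
from pathlib import Path


def find_cxx_header(name="limits", root="/usr/include/c++"):
    """List every <root>/<version>/<name> header file that exists."""
    root_path = Path(root)
    if not root_path.is_dir():
        # No libstdc++ development headers installed at this location.
        return []
    return sorted(str(p) for p in root_path.glob(f"*/{name}") if p.is_file())


if __name__ == "__main__":
    hits = find_cxx_header()
    if hits:
        print("Found:", *hits, sep="\n  ")
    else:
        print("No 'limits' header found; a libstdc++-*-dev package may be missing.")
```

This only tells you which versions are present on disk, not which one the runtime compiler will pick.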
The user running the Automatic1111 frontend needed to be added to the render and video groups to have access to the requisite device files.
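A quick way to check the group membership is sketched below, using only the Python standard library. The helper names are made up for illustration. Note that it inspects the groups of the running process, so a user who was just added to a group will need to log in again before the change shows up.

```python
import grp
import os


def missing_groups(required, current):
    """Return the required group names that are not in the current set."""
    return [g for g in required if g not in current]


def current_group_names():
    """Names of all groups the current process belongs to."""
    names = set()
    for gid in os.getgroups():
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # gid without an entry in the group database; skip it
            pass
    return names


if __name__ == "__main__":
    missing = missing_groups(("render", "video"), current_group_names())
    print("Missing groups:", ", ".join(missing) if missing else "none")
```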
I did not need to have HSA_OVERRIDE_GFX_VERSION set.
As for options being passed in COMMANDLINE_ARGS, just --medvram and --api.
--xformers does not work with AMD cards; Stable Diffusion (or Automatic1111, unsure about responsibility in the stack) apparently just ignores it there, so passing it doesn’t break anything.
Some --opt-sdp options, like --opt-sdp-attention, cause a dramatic slowdown; I assume they cause the generation to run on the CPU instead of the GPU. I’d suggest that anyone trying to get a similar environment running not start including optimization flags until they have things working without them; this had complicated things for me.
I see 2.59 it/s, so something like 20% higher performance, without --medvram being passed in COMMANDLINE_ARGS.
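For reference, the arithmetic behind that comparison, using the two it/s figures reported above (the function name `relative_speedup` is made up for illustration):

```python
def relative_speedup(baseline_its, new_its):
    """Fractional speedup of new_its over baseline_its (0.10 means 10% faster)."""
    return (new_its - baseline_its) / baseline_its


if __name__ == "__main__":
    with_medvram = 2.2      # it/s with --medvram
    without_medvram = 2.59  # it/s without --medvram
    gain = relative_speedup(with_medvram, without_medvram)
    print(f"{gain * 100:.1f}% faster without --medvram")
```

This works out to a bit under 18%, in line with the rough 20% figure.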
I have not done extensive testing to see whether any issues show up elsewhere with Stable Diffusion.
Thank you for sharing! Doesn’t fit my use case, but I’m glad to see awesome resources out there for others!