- cross-posted to:
- [email protected]
I’ve been following the development of the next Stable Diffusion model, and I’ve seen this approach mentioned.
Seems like this is a way in which AI training is analogous to human learning - we learn quite a lot from fiction, games, simulations and apply this to the real world. I’m sure the same pitfalls apply as well.
Synthetic data was used here with impressive results: https://programming.dev/post/133153
There is a lot of potential in this approach, but the idea of using it for training AI systems in MRI/CT/etc. diagnostic methods, as mentioned in the article, is a bit scary to me.
Yeah, you’d better have a thorough way to check for any systematic distortions that could adversely affect its operation. I do get the privacy rationale for using synthesized data, though.
I guess if they pretrain the model on the synthetic dataset and then “align” it with real data in a separate training phase, it could work. Just like how ChatGPT was pretrained on an internet-scale dataset and then went through an RLHF phase to make it behave like an assistant rather than a generic text-completion model. (Not sure if I’m using the correct terms.)
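The two-phase idea above can be sketched numerically. This is a toy illustration I put together, not the actual pipeline from the article or from ChatGPT: a tiny linear model is first fit to plentiful synthetic data, whose generating process is assumed to be slightly off from reality (a deliberate systematic distortion), and then fine-tuned on a small “real” dataset. All names, values, and hyperparameters here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(X, y, w, b, lr, epochs):
    """Plain gradient descent on mean-squared error for y ~ w*x + b."""
    for _ in range(epochs):
        err = X * w + b - y
        w -= lr * np.mean(err * X)
        b -= lr * np.mean(err)
    return w, b

# Phase 1: "pretrain" on abundant synthetic data.
# Hypothetical synthetic generator: y = 2.0*x + 1.0, plus noise.
X_syn = rng.uniform(-1, 1, size=1000)
y_syn = 2.0 * X_syn + 1.0 + rng.normal(0, 0.1, size=1000)
w, b = fit(X_syn, y_syn, w=0.0, b=0.0, lr=0.1, epochs=200)

# Phase 2: "align" on a small real dataset whose true relationship
# differs a bit (y = 2.2*x + 0.9), standing in for the gap between
# the synthetic distribution and reality.
X_real = rng.uniform(-1, 1, size=50)
y_real = 2.2 * X_real + 0.9 + rng.normal(0, 0.1, size=50)
w, b = fit(X_real, y_real, w, b, lr=0.05, epochs=100)

print(w, b)  # should land near the real-data parameters
```

The point of the sketch is that the pretraining phase does most of the work (the model starts phase 2 already close to correct), while the small real dataset only has to nudge the parameters toward reality. Whether that nudge is enough to wash out a systematic distortion in the synthetic data is exactly the question raised above.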