(sorry if anyone got this post twice. I posted while Lemmy.World was down for maintenance, and it was acting weird, so I deleted and reposted)
Huh, it didn’t actually tell the steps
close though xD
Sadly, almost all these loopholes are gone :( I bet they've had to add specific protections against the words "grandma" and "bedtime story" after how heavily they were overused.
I wonder if there are tons of loopholes that humans wouldn’t think of, ones you could derive with access to the model’s weights.
Years ago, there were some ML/security papers about "single pixel attacks": an early, famous example convinced a stop sign detector that an image of a stop sign was definitely not a stop sign, simply by changing a single pixel that had an outsized influence on the classifier's output.
In that vein, I wonder whether there are some token sequences that are extremely improbable in human language, but would convince GPT-4 to cast off its safety protocols and do your bidding.
(I am not an ML expert, just an internet nerd.)
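In case anyone wants to see the gist in code, here's a rough Python sketch of the single-pixel idea as a plain random search (the actual papers used fancier optimization like differential evolution, this is just the concept). The `model` function here is a stand-in for any image classifier that returns class probabilities, not a real API:

```python
import numpy as np

def single_pixel_attack(image, model, target_class, n_trials=1000, seed=0):
    """Try changing one pixel at a time; keep whichever change makes the
    classifier least confident that the image belongs to `target_class`.

    `image` is an HxWx3 float array in [0, 1]; `model` is a placeholder for
    any classifier that maps an image to a vector of class probabilities."""
    rng = np.random.default_rng(seed)
    h, w, _ = image.shape
    best_img = image.copy()
    best_conf = model(best_img)[target_class]
    for _ in range(n_trials):
        candidate = best_img.copy()
        y, x = rng.integers(h), rng.integers(w)
        candidate[y, x] = rng.random(3)  # overwrite one pixel with a random colour
        conf = model(candidate)[target_class]
        if conf < best_conf:  # the detector is now less sure it's a stop sign
            best_img, best_conf = candidate, conf
    return best_img, best_conf
```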
There are! Look up "glitch tokens" for more on this, and here's a Computerphile video about them:
https://youtu.be/WO2X3oZEJOA?si=LTNPldczgjYGA6uT
Wow, it’s a real thing! Thanks for giving me the name, these are fascinating.
Just download an uncensored model and run the AI software locally. That way your information isn't being harvested for profit, plus the bot you get will be far more obedient.
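If you go that route, a minimal sketch with the Hugging Face transformers library looks something like this (the `./my-uncensored-model` path is a placeholder for whatever model directory you actually downloaded):

```python
# Minimal sketch: load a locally downloaded model with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Placeholder path: point this at the model directory you actually downloaded.
model_path = "./my-uncensored-model"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

generate = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(generate("Tell me a bedtime story about", max_new_tokens=100)[0]["generated_text"])
```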
https://github.com/Original-2/ChatGPT-exploits/tree/main
I just got it to work… Scroll for meth and Xanax
I managed to get "Grandma" to tell me a lewd story just the other day, so clearly they haven't been able to completely fix it
This is gold