ChatGPT's o3 Model Found Remote Zeroday in Linux Kernel Code

Kissaki · 6 days ago

drspod@lemmy.ml · edited 5 days ago

From the researcher’s blog post (https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-2025-37899-a-remote-zeroday-vulnerability-in-the-linux-kernels-smb-implementation/):

My experiment harness executes this N times (N=100 for this particular experiment) and saves the results. […]

o3 finds the kerberos authentication vulnerability in the benchmark in 8 of the 100 runs. In another 66 of the runs o3 concludes there is no bug present in the code (false negatives), and the remaining 28 reports are false positives.
…
Combining the code for all of the handlers with the connection setup and teardown code, as well as the command handler dispatch routines, ends up at about 12k LoC (~100k input tokens), and as before I ran the experiment 100 times.

o3 finds the kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so a clear drop in performance, but it does still find it. More interestingly however, in the output from the other runs I found a report for a similar, but novel, vulnerability that I did not previously know about.

A practical demonstration of “even a stopped clock is right twice a day.”

Kissaki · 4 days ago

“even a stopped clock is right twice a day.”

Code analysis is a bit more complex than a clock.

onlinepersona · 5 days ago

Initially embarking on a manual audit of ksmbd to benchmark o3’s potential, Heelan quickly realized that the model was able to autonomously identify a complex use-after-free vulnerability in the handler for the SMB ‘logoff’ command—an issue Heelan himself had not previously detected.

nebulaone@lemmy.world · 5 days ago

Uh oh, that means AI will be used to find countless zero-days for hacking purposes.

wizardbeard@lemmy.dbzer0.com · 5 days ago

If by countless you mean 8 valid identifications of the same single issue in 100 runs, with a false positive rate of almost 30%, then sure.

I’m far more worried about the false positives drowning out the real findings.
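
To put rough numbers on that worry, here is a minimal sketch of the arithmetic, using only the counts quoted from the blog post above (8 runs that found the known bug and 28 false positive reports out of 100 runs). The function name `report_precision` and the constants are made up for illustration and are not part of the researcher's harness.

```python
def report_precision(true_reports: int, false_reports: int) -> float:
    """Share of bug reports that point at the real bug."""
    total = true_reports + false_reports
    if total == 0:
        raise ValueError("no reports to evaluate")
    return true_reports / total


# Counts quoted in the thread for the smaller benchmark (assumed accurate).
RUNS = 100
TRUE_REPORTS = 8     # runs that found the kerberos authentication bug
FALSE_REPORTS = 28   # runs that reported a bug that is not there

precision = report_precision(TRUE_REPORTS, FALSE_REPORTS)
false_positive_share = FALSE_REPORTS / RUNS

print(f"{precision:.1%} of bug reports were the real bug")
print(f"{false_positive_share:.0%} of all runs produced a false positive")
```

On those figures, someone triaging the output would read roughly four or five bogus reports for every valid one, which is the "drowning out" problem in concrete terms.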