A much lower success rate than I suspected; I guess a lot of the scoreboard times were probably legit?

  • Deebster@infosec.pub · 3 days ago

    I was most of the way through the article before I realised that jerpint was the author (added as a control/target), not a custom model.

    I feel he stopped too early in the process: the article ends by saying that there are improvements (e.g. to the prompt) that could improve the results, but he doesn’t appear to have tried any of them.

    I wonder if any of the LLM cheaters have written up their methods. I’d expect someone with a more complex setup, plus local AI hardware, to be able to get far more stars.

    • CameronDev (OP) · 3 days ago

      It was a bit disappointing that they didn’t complete all 50 stars, which made them a rather poor control.

    • Acters@lemmy.world · 3 days ago

      I found that feeding the straight text from the website to GPT o1 can solve it, but sometimes o1 produces code that is too inefficient. So challenges where part 2 blows up in scale (the blinking rocks and lanternfish kind of puzzle) or that otherwise make it hard to write a fast solution (like the towel one and day 22) are places where it struggles a lot. (The standard counting trick for those scaling puzzles is sketched below.)

      Day 12, with the perimeter and all the extra minute details, also gives GPT trouble. So does day 14, especially the easter egg where you need to step through the simulation to find it; GPT can’t really solve that one because there isn’t enough context about what the tree should look like unless you do some digging. (One common detection heuristic is also sketched below.)

      These were some casual observations, and clearly there is more testing to do, but they show where the big struggle points are. If we are talking about getting on the leaderboard, then I am glad AoC has challenges like these where not relying on an LLM is better.
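
      For reference, here is a minimal sketch of the bucket-counting trick that makes those exponential part 2s tractable. The split/multiply rules below are my recollection of the blinking-rocks puzzle and may not match every detail; the point is counting stones by value instead of simulating each one, which is the same idea as lanternfish.

        # Count stones by value instead of keeping one list entry per stone;
        # the number of *distinct* values stays small even as the total count
        # explodes. Rules follow the blinking-rocks puzzle as I recall it.
        from collections import Counter

        def blink(counts):
            new = Counter()
            for stone, n in counts.items():
                if stone == 0:
                    new[1] += n
                elif len(str(stone)) % 2 == 0:
                    s = str(stone)
                    half = len(s) // 2
                    new[int(s[:half])] += n   # left half of the digits
                    new[int(s[half:])] += n   # right half (int() drops leading zeros)
                else:
                    new[stone * 2024] += n
            return new

        def count_stones(stones, blinks=75):
            counts = Counter(stones)
            for _ in range(blinks):
                counts = blink(counts)
            return sum(counts.values())

      A naive list-based simulation roughly doubles in size every step, which matches the “correct but too slow” failure mode described above.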
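
      And a sketch of one common heuristic for the day 14 easter egg: step the robots forward and report the first second where no two robots share a cell, since the tree picture reportedly has all-unique positions for many inputs. The grid size and input layout are assumptions based on the 2024 puzzle, and the whole thing is a guess about what the picture looks like, which is exactly the context an LLM is missing.

        # robots: list of ((px, py), (vx, vy)) pairs on a 101x103 wrapping grid.
        W, H = 101, 103

        def first_candidate_tree_step(robots, max_steps=W * H):
            for t in range(max_steps):
                # Positions after t steps; velocities are constant, so we can
                # jump straight to step t with modular arithmetic.
                positions = {((px + vx * t) % W, (py + vy * t) % H)
                             for (px, py), (vx, vy) in robots}
                # Heuristic: the tree frame has every robot on a distinct cell.
                if len(positions) == len(robots):
                    return t
            return None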