Writing an EBNF -> Scheme parser generator using AI for machine-unfriendly descriptions, and further enhancements?

ChubakPDP11+TakeWithGrainOfSalt · 3 months ago

Writing an EBNF -> Scheme parser generator using AI for machine-unfriendly descriptions, and further enhancements?

@[email protected] · 3 months ago

Now my compiler can hallucinate, what a time to be alive.

ChubakPDP11+TakeWithGrainOfSalt · 3 months ago

@armchair_progamer · edit-2 3 months ago

I find writing the parser by hand (recursive descent) to be easiest. Sometimes I use a lexer generator, or if there isn’t a good one (idk for Scheme), write the lexer by hand as well. Define a few helper functions and macros to remove most of the boilerplate (you really benefit from Scheme here), and you almost end up writing the rules out directly.

Yes, you need to manually implement choice and figure out what/when to lookahead. Yes, the final parser won’t be as readable as a BNF specification. But I find writing a hand-rolled parser generator for most grammars, even with the manual choice and lookahead, is surprisingly easy and fast.

The problem with parser generators is that, when they work they work well, but when they don’t work (fail to generate, the generated parser tries to parse the wrong node, the generated parser is very inefficient) it can be really hard to figure out why. A hand-rolled parser is much easier to debug, so when your grammar inevitably has problems, it ends up taking less time in total to go from spec to working hand-rolled vs. spec to working parser-generator-generated.

The hand-rolled rules may look something like (with user-defined macros and functions define-parse, parse, peek, next, and some simple rules like con-id and name-id as individual tokens):

; pattern			::= [ con-id ] factor "begin" expr-list "end"
(define-parse pattern
  (mk-pattern
    (if (con-id? (peek)) (next))
    (parse factor)
    (do (parse “begin”) (parse expr-list) (parse “end”))))

; factor 			::= name-id 
; 			 | symbol-literal
; 			 | long-name-id 
; 			 | numeric-literal 
;	 		 | text-literal 
; 			 | list-literal 
; 			 | function-lambda 
; 			 | tacit-arg
; 			 | '(' expr ')' 
(define-parse factor
  (case (peek)
    [name-id? (if (= “.” (peek2)) (mk-long-name-id …) (next))]
    [literal? (next)]
    …))

Since you’re using Scheme, you can almost certainly optimize the above to reduce even more boilerplate.

Regarding LLMs: if you start to write the parser with the EBNF comments above each rule like above, you can paste the EBNF in and Copilot will generate rules for you. Alternatively, you can feed a couple EBNF/code examples to ChatGPT and ask it to generate more.

In both cases the AI will probably make mistakes on tricky cases, but that’s practically inevitable. An LLM implemented in an error-proof code synthesis engine would be a breakthrough; and while there are techniques like fine-tuning, I suspect they wouldn’t improve the accuracy much, and certainly would amount to more effort in total (in fact most LLM “applications” are just a custom prompt on plain ChatGPT or another baseline model).

ChubakPDP11+TakeWithGrainOfSalt · 3 months ago

You cannot believe how thankful I am of this. Thanks man!

@Corbin · 3 months ago

I’m not sure I understand your reasoning. Here are the highlights for me.

But this time I wanna use Scheme, Cyclone Scheme to be exact.

Why? Scheme is a fine choice in general, but you should only tie yourself to a specific flavor of Scheme if you need specific features…

because Cyclone has a strong FFI to C.

That’s a good reason! Keep in mind that there are usually multiple possible Schemes. In this case, I’d be aware of CHICKEN Scheme as well.

I want it to translate EBNF into Scheme.

Scheme what? What sorts of programs? Presumably a lexer and parser and AST manipulation kit? In a certain sense, that’s all that one can do with grammars. Don’t get me wrong – if you can parse Antlr grammars, then you have hundreds of popular languages at your fingers. But I think that you need to clarify your three languages: your source, your target, and your implementation. I like to use tombstone diagrams for this.

I think an LLM can help.

No, I’m afraid not. Even if all the popular claims about them were true, LLMs wouldn’t help you understand compilers; at best, they might drop some phrases that can be looked up on Wikipedia.

I need an LLM that would be used in an EBNF -> Scheme parser generator.

Wants are not needs. The parser generator could be handwritten. This was a painful lesson for me when I was younger; I faffed around with a parser generator for a couple weeks, and then one of the old hands wrote two thousand lines over a week, lexer and parser and test cases.

ChubakPDP11+TakeWithGrainOfSalt · 3 months ago

Yes I realized as much that a hand-roller LP is best. Especially after I read Design Concepts [2008, MIT] and learned how to represent ASTs in S-Expressions. Thanks.