Hey. I have made some DSLs before, and for all of them, most of which were in C, I have used Flex and Bison. But this time I wanna use Scheme, Cyclone Scheme to be exact. I can potentially use Flex/Bison this time too, because Cyclone has a strong FFI to C.

But I’d rather innovate. I have been writing down the language’s grammar in EBNF:

https://gist.github.com/Chubek/bd54df78fe1f71f46cb262ba990a209b

And my thinking is, why not turn this into a parser? Not something like BNFC that translates BNF (not EBNF) into several parser and lexer specifications plus an AST, I want it to translate EBNF into Scheme.

But if you read the grammar, you will realize that there are some places where it’s not very descriptive and machine-friendly. For that reason, I think an LLM can help.

Now, I need your help. I am mostly active in systems programming. Like Assembly, C stuff. I don’t know much about LLMs and this whole AI revolution. I did some work as an ‘ML-engineer guy’ (not an ML engineer, an ML-engineer guy, there’s a difference!), so I know how this whole thing works. I have also read MITs standard book on mathematical optimization.

But I definitely need to use a pretrained model here. My knowledge of mathematical optimization is useless when you need like 28 million to train a model that would aide you with this?

I don’t want to use an API. I wanna own my software. I do use ChatGPT as a search engine, but that’s about it, I never owned Google anyways!

I know about HuggingFace. What model there do you think would help me?

Also, how do these weights work? If I bind one DNN framework to Cyclone, will the weights trained by another DNN framework work in it too? Do people use frameworks not written in C, so I would have to like triple-bind it? I know both Google’s and Facebook’s are in C. However they are in ‘garbage c’. Well let’s deal with that later.

Anyways, thanks for your help.

TL; DR:

I need an LLM that would be used in an EBNF -> Scheme parser generator.

  • @armchair_progamerM
    link
    3
    edit-2
    2 months ago

    I find writing the parser by hand (recursive descent) to be easiest. Sometimes I use a lexer generator, or if there isn’t a good one (idk for Scheme), write the lexer by hand as well. Define a few helper functions and macros to remove most of the boilerplate (you really benefit from Scheme here), and you almost end up writing the rules out directly.

    Yes, you need to manually implement choice and figure out what/when to lookahead. Yes, the final parser won’t be as readable as a BNF specification. But I find writing a hand-rolled parser generator for most grammars, even with the manual choice and lookahead, is surprisingly easy and fast.

    The problem with parser generators is that, when they work they work well, but when they don’t work (fail to generate, the generated parser tries to parse the wrong node, the generated parser is very inefficient) it can be really hard to figure out why. A hand-rolled parser is much easier to debug, so when your grammar inevitably has problems, it ends up taking less time in total to go from spec to working hand-rolled vs. spec to working parser-generator-generated.

    The hand-rolled rules may look something like (with user-defined macros and functions define-parse, parse, peek, next, and some simple rules like con-id and name-id as individual tokens):

    ; pattern			::= [ con-id ] factor "begin" expr-list "end"
    (define-parse pattern
      (mk-pattern
        (if (con-id? (peek)) (next))
        (parse factor)
        (do (parse “begin”) (parse expr-list) (parse “end”))))
    
    ; factor 			::= name-id 
    ; 			 | symbol-literal
    ; 			 | long-name-id 
    ; 			 | numeric-literal 
    ;	 		 | text-literal 
    ; 			 | list-literal 
    ; 			 | function-lambda 
    ; 			 | tacit-arg
    ; 			 | '(' expr ')' 
    (define-parse factor
      (case (peek)
        [name-id? (if (= “.” (peek2)) (mk-long-name-id …) (next))]
        [literal? (next)]
        …))
    

    Since you’re using Scheme, you can almost certainly optimize the above to reduce even more boilerplate.

    Regarding LLMs: if you start to write the parser with the EBNF comments above each rule like above, you can paste the EBNF in and Copilot will generate rules for you. Alternatively, you can feed a couple EBNF/code examples to ChatGPT and ask it to generate more.

    In both cases the AI will probably make mistakes on tricky cases, but that’s practically inevitable. An LLM implemented in an error-proof code synthesis engine would be a breakthrough; and while there are techniques like fine-tuning, I suspect they wouldn’t improve the accuracy much, and certainly would amount to more effort in total (in fact most LLM “applications” are just a custom prompt on plain ChatGPT or another baseline model).