Look, 0% of my work involves HTML (well, maybe 1-2 percent does); however, about 60% of my work involves regular expressions, grammars, lexical scanning and syntactic parsing, so it still irks me, and will irk me beyond my grave, when people say shit like ‘Don’t parse HTML/Markdown/etc. with regex! Use a parser generator!’

So this is stupid, because most people know that HTML and Markdown are not the type of languages that require a push-down parser, or even a simple LL(1) recursive-descent parser! Unless by ‘parser generator’ they mean ‘lexer generator’ or ‘PEG generator’, they are wrong, or at least, partly incorrect.

Like my diabetes, they are not grammatically Type 2 (Chomsky-wise, Context-Free); rather, they are Type 3 (Chomsky-wise, Regular).

It’s preferred if you don’t do a syntax-directed lexical translation of Markdown or HTML, and it’s best if you build a tree. I learned that while making Mukette, and I am currently using my implementation of ASDL to build a tree. But the truth is, unlike Context-Free languages (that is, pretty much any non-markup language), it is ENTIRELY possible to do a syntax-directed translation of HTML and Markdown using pre-compiled or runtime-compiled regex.

You will have to introduce states to make it a proper automaton, but even that is not required. I once did a syntax-directed translation of Markdown to HTML in AWK! With just one extra state.
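A minimal sketch of what "regex plus one extra state" can look like, written in C with POSIX `regex.h` rather than AWK. It is not the AWK script mentioned above: the subset (headings, `-`/`*` list items, paragraphs), the function names `match` and `md_line_to_html`, and the `in_list` flag are all assumptions made for illustration, and HTML escaping is left out.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* The one extra state: are we currently inside a <ul>? */
static int in_list = 0;

/* Compile `pattern` at run time and match it against `s`.
 * Returns 1 on a match and fills m[0..nm-1], 0 otherwise. */
static int match(const char *pattern, const char *s, regmatch_t *m, size_t nm)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;
    int ok = (regexec(&re, s, nm, m, 0) == 0);
    regfree(&re);
    return ok;
}

/* Translate one line (without its trailing newline) of a tiny
 * Markdown subset into HTML, writing the result into `out`. */
void md_line_to_html(const char *line, char *out, size_t outsz)
{
    regmatch_t m[3];
    const char *close = "";

    assert(line != NULL && out != NULL && outsz > 0);

    /* List item: open the list only if we are not already in one. */
    if (match("^[-*] +(.*)$", line, m, 2)) {
        snprintf(out, outsz, "%s<li>%.*s</li>", in_list ? "" : "<ul>",
                 (int)(m[1].rm_eo - m[1].rm_so), line + m[1].rm_so);
        in_list = 1;
        return;
    }

    /* Any other line ends a pending list. */
    if (in_list) {
        close = "</ul>";
        in_list = 0;
    }

    if (match("^(#{1,6}) +(.*)$", line, m, 3)) {
        /* Heading level is the number of leading '#' characters. */
        int level = (int)(m[1].rm_eo - m[1].rm_so);
        snprintf(out, outsz, "%s<h%d>%.*s</h%d>", close, level,
                 (int)(m[2].rm_eo - m[2].rm_so), line + m[2].rm_so, level);
    } else if (strlen(line) == 0) {
        snprintf(out, outsz, "%s", close);          /* blank line */
    } else {
        snprintf(out, outsz, "%s<p>%s</p>", close, line);
    }
}
```

A caller would feed it one line at a time and, at end of input, emit a final `</ul>` if `in_list` is still set. Nested lists are exactly where this single flag stops being enough.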

I don’t remember the copypasta that was the talk of the town 10 years ago; I was a kid back then (17), and I could not dig it up. But it’s a troll that has stuck with me ever since.

Maybe, just maybe, a PEG parser generator could have been what they meant. But even then, PEG generators generate a recursive-descent parser most of the time.

In fact, I dare you to use Byacc, Btacc, Bison, Racc, PYLR, ANTLR, peg(1), leg(1), PackCC or any of these LALR or LL parser generators to parse a markup language. You’ll have a very bad time. It is not impossible; it’s just overkill.

TL;DR: Most markup languages, like HTML or Markdown, are best lexed, not parsed! Even if you wish to make a tree out of it. But for syntax-directed translations, REs would do.

Thanks.

PS: If you translate a markup language into a tree, you can translate that tree into other markup languages. That’s what Pandoc does. Pandoc is hands-down the best tool I have laid my hands on.

  • ChubakPDP11+TakeWithGrainOfSaltOP
    33 months ago

    So I’ll answer your question and ask a question back from anyone who can help me.

    RE the nesting, I was under the impression that they can’t be combined when I made it. Then I read CommonMark’s specs and it seems like it’s possible. It would be miserable to do this with a syntax-directed translation. I used ASDL to write up a tree, and added some features to asdl(1) so they would be handled more properly. I am not sure whether I should use a parser generator for this, but the nesting can be handled by Lex’s start conditions — if I fail to do so, I may use a PEG generator.

    Now my question.

    I think nesting and recursion are a good case for using a push-down parser here — I will still try and find a solution before I use an LR parser.

    I avoid using Yacc because I honestly have no clue how to use it with a language like Markdown.

    So my thinking is, I would just use a start-condition stack with Lex (I use Flex). It’s pretty simple. Let’s use a linked list so there are no limits.

    
    struct stack { int state; struct stack *next; };
    
    struct stack *top, *bottom;
    
    void push(int state);  /* body elided */
    
    int pop(void);         /* body elided */
    
    

    (I usually use typedefs though).
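One way the elided `push` and `pop` could be filled in, as a sketch only. It keeps just a `top` pointer (a singly linked stack does not need `bottom`), and it assumes that state values are never negative so that `-1` can signal an empty stack. Aborting on allocation failure is also an assumption chosen to keep `push` returning `void`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct stack {
    int state;            /* a saved lexer state */
    struct stack *next;   /* the entry pushed before this one */
};

static struct stack *top = NULL;   /* NULL means the stack is empty */

/* Save `state` on top of the stack. Grows until memory runs out. */
void push(int state)
{
    assert(state >= 0);   /* -1 is reserved for "empty" in pop() */

    struct stack *node = malloc(sizeof *node);
    if (node == NULL) {
        perror("malloc");
        exit(EXIT_FAILURE);
    }
    node->state = state;
    node->next = top;
    top = node;
}

/* Remove and return the most recently pushed state, or -1 if empty. */
int pop(void)
{
    if (top == NULL)
        return -1;

    struct stack *node = top;
    int state = node->state;
    top = node->next;
    free(node);
    return state;
}
```

Inside a Flex action this would typically be used as `push(YY_START); BEGIN(SOME_STATE);` on entry and `BEGIN(pop());` on exit, where `SOME_STATE` is a placeholder. Flex also ships its own start-condition stack: with `%option stack`, the functions `yy_push_state()`, `yy_pop_state()` and `yy_top_state()` become available, which may make the hand-rolled version unnecessary.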

    So now we have a pseudo-pushdown parser. What are these called?

    I am still a beginner at this but one thing that worries me is, how would I handle the tree with this method?

    With Yacc or PEG parser generators, it is easy to assign a value to a reduction/closure. But with this method, Flex won’t allow me. Unless I use too many variables.
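On the tree question, one possible approach is to let the tree itself act as the stack: keep a pointer to the innermost open node, attach each new node to it, and step back to the parent when a construct ends. The sketch below assumes hypothetical names (`struct node`, `open_node`, `close_node`) and string tags for node kinds; it is not tied to ASDL or to any Flex API.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
    const char *kind;      /* e.g. "doc", "quote", "list" */
    struct node *parent;   /* doubles as the "stack" link */
    struct node *first;    /* first child */
    struct node *last;     /* last child, for O(1) append */
    struct node *next;     /* next sibling */
};

/* Innermost construct that has been opened but not yet closed. */
static struct node *current = NULL;

/* Begin a construct: create a node, append it to the innermost open
 * node (if any), and make it the new innermost node. */
struct node *open_node(const char *kind)
{
    assert(kind != NULL);

    struct node *n = calloc(1, sizeof *n);
    if (n == NULL)
        abort();
    n->kind = kind;
    n->parent = current;

    if (current != NULL) {
        if (current->last != NULL)
            current->last->next = n;
        else
            current->first = n;
        current->last = n;
    }
    current = n;
    return n;
}

/* End the innermost construct and return its node, which plays the
 * role of the value a Yacc reduction would produce. Returns NULL if
 * nothing is open. */
struct node *close_node(void)
{
    struct node *n = current;
    if (n != NULL)
        current = n->parent;
    return n;
}
```

A lexer action that enters a nested construct would call `open_node()`, and the action that leaves it would call `close_node()`, so no per-rule variables are needed beyond `current`. Because the parent pointers form the linked list, there is no fixed depth limit.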

    I think I may use peg(1). I can even do the same stack thingy with PEG.

    Any help is welcome.

    • @[email protected]
      43 months ago

      HTML is not even a tree (XHTML is; XML is a Type 2 grammar). SGML languages like HTML are more similar to tree-adjoining grammars.

      For example <b>This<i>is perfectly</b>valid</i> html.
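One way to see the point of that example: a strict stack-based nesting check, which is roughly what a pushdown parser for a tree-shaped grammar enforces, rejects it. The sketch below uses a hypothetical helper, `strictly_nested`, that only understands bare tags such as `<b>` and `</b>` (no attributes, comments, or void elements), and it says nothing about how real HTML parsers recover from such input.

```c
#include <assert.h>
#include <string.h>

enum { MAX_OPEN = 32, MAX_NAME = 16 };

/* Returns 1 if every closing tag matches the most recently opened
 * tag and nothing is left open at the end, 0 otherwise. */
int strictly_nested(const char *s)
{
    char open[MAX_OPEN][MAX_NAME];   /* names of currently open tags */
    int depth = 0;

    assert(s != NULL);

    while ((s = strchr(s, '<')) != NULL) {
        int closing = (s[1] == '/');
        const char *name = s + 1 + closing;
        const char *end = strchr(name, '>');
        if (end == NULL)
            return 0;                /* unterminated tag */

        size_t len = (size_t)(end - name);
        if (len == 0 || len >= MAX_NAME)
            return 0;                /* empty or oversized name */

        if (closing) {
            /* Must match the tag on top of the stack exactly. */
            if (depth == 0 ||
                strlen(open[depth - 1]) != len ||
                strncmp(open[depth - 1], name, len) != 0)
                return 0;
            depth--;
        } else {
            if (depth == MAX_OPEN)
                return 0;            /* too deep for this sketch */
            memcpy(open[depth], name, len);
            open[depth][len] = '\0';
            depth++;
        }
        s = end + 1;
    }
    return depth == 0;
}
```

Given the markup from the example, the check fails at `</b>` because `i` is on top of the stack, while the same text with the two closing tags swapped passes.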