Look 0 of my work involves HTML, well maybe 1-2 percent does; however, about 60% of my work involves regular expressions, grammar, lexical scanning and syntactic parsing, so it still irks me, and will irk me beyond my grave, when people say shit like ‘Don’t parse HTML/Markdown/etc with regex! Use a parser generator!’

So this is stupid, because most people know that HTML and Markdown are not the type of languages that require a push-down parser, or even a simple LL(1) recursive-descent parser! Unless by ‘parser generator’ they mean ‘lexer generator’ or ‘PEG generator’, they are wrong, or at least, partly incorrect.

Like my diabetes, they are not grammatically Type 2 (Chomsky-wise, Context-Free); rather, they are Type 3 (Chomsky-wise, Regular).

It’s preferred if you don’t do a syntax-directed lexical translation of Markdown or HTML, and it’s best if you build a tree. I learned that making Mukette and I am currently using my implementation of ASDL to build a tree. But truth is, unlike Context-Free languages, like any non-markup language, it is ENTIRELY possible to do a syntax-directed translation of HTML and Markdown, using pre-compiled, or runtime-compiled regex.

You will have to introduce states to make it a proper Automata, but even that is not required. I once did a syntax-directed translation of Markdown to HTML in AWK! With just one extra state.

I don’t remember the copypasta that was talk of the town 10 years ago, I was a kid back then (17) and I could not dig it up. But it’s a troll that has stuck with me ever since.

Maybe, just maybe, a PEG paser generator could have been what they meant. But even then, PEG generators generate a recursive-descent parser most of the times.

In fact, I dare you to use Byacc, Btacc, Bison, Racc, PYLR, ANTLR, peg(1), leg(1), PackCC or any of these LALR or LL parser generators to parse a markup language. You’ll have a very bad time, it is not impossible, it’s just an overkill.

TL;DR: Most markup languages, like HTML or Markdown, are best lexed, not parsed! Even if you wish to make a tree out of it. But for syntax-directed translations, REs would do.

Thanks.

PS: If you translate a markup language into a tree, you can translate that tree into other markup languages. That’s what Pandoc does. Pandoc is hands-down the best piece of tool I have laid my hands on.

  • @Corbin
    link
    English
    173 months ago

    [HTML and Markdown] are not grammatically Type 2 (Chomsky-wise, Context-Free); rather, they are Type 3 (Chomsky-wise, Regular).

    This is at least half-wrong, in that HTML is clearly not regular. The proof is simple: HTML structurally embeds a language of balanced parentheses (a Dyck language), and such languages are context-free and not regular. I don’t know Markdown well and there are several flavors; it’s quite possible that some flavors are regular. However, traditional Markdown embeds HTML, so if HTML is not regular than neither is Markdown.

    I once did a syntax-directed translation of Markdown to HTML in AWK!

    Sure. The original Markdown implementation was in Perl and operated similarly. However, note that this doesn’t imply that either language is regular, only that a translation is possible assuming the input is valid Markdown. Punting on recognition means that invalid parse trees have undefined translations, presumably at least sometimes generating invalid HTML.

    • ChubakPDP11+TakeWithGrainOfSaltOP
      link
      12 months ago

      I remember reading something about recursive languages in Aho’s Automata Book. I will have to check it again. So Regular Languages can’t be recursive? Is that what ‘recursive’ language even means? I have to read the book again.

      Thanks for your help man.

      • @Corbin
        link
        English
        12 months ago

        The definition of recursive language can be read on Wikipedia too. Quoting from that page:

        All regular, context-free and context-sensitive languages are recursive.

        But not all recursive languages are regular. As for “recursive”, it’s a historical term that might not be sufficiently descriptive; today, I’d usually call such languages decidable, to emphasize that they are languages which we can write (always-halting) computer programs to detect.

        • ChubakPDP11+TakeWithGrainOfSaltOP
          link
          12 months ago

          Thanks! You know last night I fed a model several books and asked it a lot of stuff. This did not come up.

          What i do is I glance at the book, then ask the model to explain it to me. I call the model ‘Urkel the Wiz Kid’.

          I fed both CS books and CE books. Like Sipsers, A Quantative Approach, Automata of Aho, Harris and Harris, etc.