'Don't parse markup languages with Regex' is an annoying trollpost and it should die... right?

ChubakPDP11+TakeWithGrainOfSalt · edit-2 3 months ago

So I’ll answer your question and ask a question back from anyone who can help me.

RE the nesting, I was under the impression that they can’t be combined when I made it. Then I read CommonMark’s specs and it seems like it’s possible. It would be miserable to do this with a syntax-directed translation. I used ASDL to write up a tree, and added some features to asdl(1) so they would be handled more properly. I am not sure whether I should use a parser generator for this, but the nesting can be handled by Lex’s start conditions — if I fail to do so, I may use a PEG generator.

Now my question.

I think nesting and recursion are a good case for using a push-down parser here — I will still try and find a solution before I use an LR parser.

I avoid using Yacc because I honestly have no clue how to use it with a language like Markdown.

So my thinking is, I would just use a start condition stack with Lex (I use Flex). It’s pretty simple. Let’s use a linked list so there is no fixed depth limit.


struct stack { int state; struct stack *next; };

struct stack *top, *bottom;

void push ...

int pop ...

(I usually use typedefs though).
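For reference, here is a minimal sketch of how the elided push and pop might be filled in. The `int` return from `push`, the use of -1 as the "empty" value, and dropping `bottom` (a LIFO only needs `top`) are assumptions made for illustration, not part of the original outline. Flex also ships a built-in start condition stack (`%option stack` with `yy_push_state`, `yy_pop_state` and `yy_top_state`), so the hand-rolled version is only needed if you want the state numbers under your own control.

```c
#include <assert.h>
#include <stdlib.h>

/* One saved start condition. In a Flex scanner, `state` would hold a
   start-condition number such as INITIAL (assumed non-negative). */
struct stack {
    int state;
    struct stack *next;
};

/* Top of the stack; NULL means empty. A LIFO needs no `bottom` pointer. */
struct stack *top = NULL;

/* Push a state. Returns 0 on success, -1 if malloc fails. */
int push(int state)
{
    struct stack *n;

    assert(state >= 0);   /* -1 is reserved as pop()'s "empty" value */
    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->state = state;
    n->next = top;
    top = n;
    return 0;
}

/* Pop the most recently pushed state, or return -1 if the stack is empty. */
int pop(void)
{
    struct stack *n = top;
    int state;

    if (n == NULL)
        return -1;
    state = n->state;
    top = n->next;
    free(n);
    return state;
}
```

In a Flex action this could be used as `push(YY_START); BEGIN(SOME_STATE);` when a nested construct opens and `BEGIN(pop());` when it closes, where `SOME_STATE` is a hypothetical start condition and the empty-stack case (-1) would need to be checked before calling `BEGIN`.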

So now we have a pseudo-pushdown parser. What are these called?

I am still a beginner at this but one thing that worries me is, how would I handle the tree with this method?

With Yacc or PEG parser generators, it is easy to assign a value to a reduction/closure. But with this method, Flex gives me nowhere to attach such a value, unless I use too many variables.
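One possible way around the missing semantic values is to let the tree itself act as the stack: keep a single pointer to the innermost open node, descend when a nested construct opens, and climb back through the parent link when it closes. The sketch below assumes this approach; the node kinds and the `tree_*` function names are hypothetical and are not taken from ASDL output or from Flex.

```c
#include <assert.h>
#include <stdlib.h>

enum kind { DOC, STRONG, EMPH, TEXT };  /* hypothetical node kinds */

struct node {
    enum kind kind;
    struct node *parent;        /* also serves as the stack link */
    struct node *first, *last;  /* child list */
    struct node *next;          /* next sibling */
};

struct node *current = NULL;    /* innermost node that is still open */

/* Allocate a node with no parent, children, or siblings. */
static struct node *new_node(enum kind kind)
{
    struct node *n = malloc(sizeof *n);

    if (n == NULL)
        abort();
    n->kind = kind;
    n->parent = n->first = n->last = n->next = NULL;
    return n;
}

/* Link `child` in as the last child of `parent`. */
static void append(struct node *parent, struct node *child)
{
    child->parent = parent;
    if (parent->last != NULL)
        parent->last->next = child;
    else
        parent->first = child;
    parent->last = child;
}

/* Call once before scanning; returns the root. */
struct node *tree_begin(void)
{
    current = new_node(DOC);
    return current;
}

/* Action for an opening delimiter: add a child and descend into it. */
void tree_open(enum kind kind)
{
    struct node *n = new_node(kind);

    assert(current != NULL);
    append(current, n);
    current = n;
}

/* Action for a leaf such as a run of text: add it without descending. */
void tree_leaf(enum kind kind)
{
    assert(current != NULL);
    append(current, new_node(kind));
}

/* Action for a closing delimiter: climb back to the parent. */
void tree_close(void)
{
    assert(current != NULL && current->parent != NULL);
    current = current->parent;
}
```

Each scanner action then needs only one call, so the only global besides the start condition stack is `current`. A real tree would also carry the matched text and would need a cleanup routine, both omitted here.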

I think I may use peg(1). I can even do the same stack thingy with PEG.

Any help is welcome.

@[email protected] · 3 months ago

HTML is not even a tree (XHTML is; XML is a type 2 grammar). SGML languages like HTML are more similar to tree-adjoining grammars.

For example, <b>This<i>is perfectly</b>valid</i> HTML.
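Whatever grammar class one assigns to it, the example above is not properly nested, and that is exactly the property a single stack checks. The sketch below is a deliberately simplified checker, not an HTML parser: the function name and size limits are made up for illustration, and it only understands bare tags like `<b>` and `</b>` with no attributes, comments, or void elements.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 64   /* arbitrary limits for this sketch */
#define MAX_NAME  16

/* Returns 1 if every </x> closes the most recently opened <x> and
   nothing is left open at the end; returns 0 otherwise. */
int properly_nested(const char *s)
{
    char open[MAX_DEPTH][MAX_NAME];
    int depth = 0;

    assert(s != NULL);
    while ((s = strchr(s, '<')) != NULL) {
        int closing = (s[1] == '/');
        const char *name = s + 1 + closing;
        const char *end = strchr(name, '>');
        size_t len;

        if (end == NULL)
            return 0;                       /* unterminated tag */
        len = (size_t)(end - name);
        if (len == 0 || len >= MAX_NAME)
            return 0;                       /* empty or oversized name */

        if (closing) {
            /* The closing tag must match the tag on top of the stack. */
            if (depth == 0 ||
                strlen(open[depth - 1]) != len ||
                strncmp(open[depth - 1], name, len) != 0)
                return 0;
            depth--;
        } else {
            if (depth == MAX_DEPTH)
                return 0;
            memcpy(open[depth], name, len);
            open[depth][len] = '\0';
            depth++;
        }
        s = end + 1;
    }
    return depth == 0;
}
```

Given the overlapping example, the checker fails at `</b>` because `i` is on top of the stack. Real HTML parsers recover from this kind of input with dedicated error-handling rules rather than a plain stack.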