'Don't parse markup languages with Regex' is an annoying trollpost and it should die... right?

ChubakPDP11+TakeWithGrainOfSalt · edit-2 3 months ago


@mcmodknower · 3 months ago

How many combinations and levels of *italics*, **bold** and ~~strikethrough~~, combined with escaped chars like \* can your program handle?

ChubakPDP11+TakeWithGrainOfSalt · 3 months ago

So I’ll answer your question and ask a question back from anyone who can help me.

Re the nesting: when I wrote it, I was under the impression that these styles can’t be combined. Then I read the CommonMark spec, and it seems they can. It would be miserable to do this with a syntax-directed translation. I used ASDL to write up a tree, and added some features to asdl(1) so they would be handled more properly. I am not sure whether I should use a parser generator for this, but the nesting can be handled by Lex’s start conditions; if I fail to do so, I may use a PEG generator.

Now my question.

I think nesting and recursion are a good case for using a push-down parser here — I will still try and find a solution before I use an LR parser.

I avoid using Yacc because I honestly have no clue how to use it with a language like Markdown.

So my thinking is, I would just use a start condition stack with Lex (I use Flex). It’s pretty simple. Let’s use a linked list so there are no limits.


struct stack { int state; struct stack *next; };

struct stack *top, *bottom;

void push ...

int pop ...

(I usually use typedefs though).
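To make the idea concrete, here is a minimal sketch of that linked-list stack in plain C. The struct and the names `push`, `pop`, and `top` follow the outline above; the function bodies, the return conventions, and the `stack_empty` helper are my assumptions, and the `bottom` pointer is left out because nothing in this sketch needs it.

```c
#include <assert.h>
#include <stdlib.h>

/* One saved start condition. The list grows on demand, so nesting depth
   is bounded only by available memory. */
struct stack {
    int state;
    struct stack *next;
};

static struct stack *top = NULL;

/* Save a start condition. Returns 0 on success, -1 if malloc fails. */
static int push(int state)
{
    struct stack *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->state = state;
    node->next = top;
    top = node;
    return 0;
}

/* Remove and return the most recently saved start condition.
   Precondition: the stack is not empty. */
static int pop(void)
{
    assert(top != NULL);
    struct stack *node = top;
    int state = node->state;
    top = node->next;
    free(node);
    return state;
}

/* Hypothetical helper: nonzero when nothing is saved. */
static int stack_empty(void)
{
    return top == NULL;
}
```

In a Flex action this would be used roughly as `push(YY_START); BEGIN(ITALIC);` when a delimiter opens and `BEGIN(pop());` when it closes, where `ITALIC` is a made-up start condition name. Flex also ships a built-in version of the same thing: with `%option stack` you get `yy_push_state()`, `yy_pop_state()`, and `yy_top_state()`.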

So now we have a pseudo-pushdown parser. What are these called?

I am still a beginner at this but one thing that worries me is, how would I handle the tree with this method?

With Yacc or PEG parser generators, it is easy to assign a value to a reduction/closure. But with this method, Flex gives me no way to do that, unless I use too many variables.
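One possible way around this, sketched below under my own assumptions, is to keep a pointer to the innermost node that is still open and let the parent links play the role of the stack. An opening delimiter creates a child and descends into it, a closing delimiter climbs back to the parent, and text is appended to whatever is currently open. All names here (`struct node`, `open_node`, `close_node`, `add_text`, `tree_begin`) are hypothetical and are not part of Flex or of the ASDL-generated code.

```c
#include <assert.h>
#include <stdlib.h>

enum kind { ROOT, ITALIC, BOLD, STRIKE, TEXT };

struct node {
    enum kind kind;
    const char *text;           /* used by TEXT nodes only */
    struct node *first, *last;  /* children, in source order */
    struct node *next;          /* next sibling */
    struct node *parent;        /* enclosing node; NULL for the root */
};

/* Innermost node that is still open. Parent links act as the stack. */
static struct node *open_top = NULL;

static struct node *node_new(enum kind kind, const char *text)
{
    struct node *n = calloc(1, sizeof *n);
    assert(n != NULL);  /* sketch only: real code should report the failure */
    n->kind = kind;
    n->text = text;
    return n;
}

/* Append child to the innermost open node. */
static void attach(struct node *child)
{
    assert(open_top != NULL);
    child->parent = open_top;
    if (open_top->last != NULL)
        open_top->last->next = child;
    else
        open_top->first = child;
    open_top->last = child;
}

/* Start a new tree and return its root. */
static struct node *tree_begin(void)
{
    open_top = node_new(ROOT, NULL);
    return open_top;
}

/* Opening delimiter: create a child and descend into it. */
static void open_node(enum kind kind)
{
    struct node *n = node_new(kind, NULL);
    attach(n);
    open_top = n;
}

/* Closing delimiter: climb back to the enclosing node. */
static void close_node(void)
{
    assert(open_top != NULL && open_top->parent != NULL);
    open_top = open_top->parent;
}

/* Plain text. The caller keeps the string alive; with Flex that means
   copying yytext first, because the scanner overwrites it. */
static void add_text(const char *text)
{
    attach(node_new(TEXT, text));
}
```

Each Flex action then needs a single call next to its start condition switch, so no per-rule variables are required. The sketch never frees nodes; an arena allocator would fit naturally here.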

I think I may use peg(1). I can even do the same stack thingy with PEG.

Any help is welcome.

@[email protected] · 3 months ago

HTML is not even a tree (XHTML is; XML is a type 2 grammar). SGML languages like HTML are more similar to tree-adjoining grammars.

For example <b>This<i>is perfectly</b>valid</i> html.
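To show why that example is not tree-shaped, here is a small sketch of a strict stack-based nesting check of the kind an XML-style parser would apply. It is a toy under stated assumptions: it only understands bare tags such as `<b>` and `</b>`, with no attributes, comments, or void elements, and the name `is_properly_nested` and both limits are made up for illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 64   /* arbitrary limit for this sketch */
#define MAX_NAME  16   /* arbitrary limit for this sketch */

/* Returns 1 if every closing tag matches the most recently opened tag
   and nothing is left open at the end, 0 otherwise. */
static int is_properly_nested(const char *s)
{
    char open[MAX_DEPTH][MAX_NAME];
    int depth = 0;

    assert(s != NULL);
    while ((s = strchr(s, '<')) != NULL) {
        int closing = (s[1] == '/');
        const char *name = s + 1 + closing;
        const char *end = strchr(name, '>');
        if (end == NULL)
            return 0;                       /* unterminated tag */
        size_t len = (size_t)(end - name);
        if (len == 0 || len >= MAX_NAME)
            return 0;                       /* empty or oversized name */

        if (closing) {
            if (depth == 0)
                return 0;                   /* close without an open */
            depth--;
            if (strlen(open[depth]) != len ||
                memcmp(open[depth], name, len) != 0)
                return 0;                   /* closes the wrong element */
        } else {
            if (depth == MAX_DEPTH)
                return 0;                   /* too deep for this sketch */
            memcpy(open[depth], name, len);
            open[depth][len] = '\0';
            depth++;
        }
        s = end + 1;
    }
    return depth == 0;
}
```

The overlapping example above fails this check at `</b>`, because `<i>` is still the innermost open element, while `<b>This<i>is</i>fine</b>` passes.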

ChubakPDP11+TakeWithGrainOfSalt · 3 months ago

btw this is the ASDL I wrote:

%{

#include "mukette.h"


static Arena *ast_scratch = NULL;


#define ALLOC(size) arena_alloc(ast_scratch, size)


%}

md_linkage = Hyperlink(md_compound? name, string url)
	  | Image(string? alt, string path)
	  ;


md_inline = Italic(md_compound italic)
	 | Bold(md_compound bold)
	 | BoldItalic(md_compound bold_italic)
	 | Strikethrough(md_compound strike_through)
	 | InlineCode(string inline_code)
	 | Linkage(md_linkage linkage)
	 | RegularText(string regular_text)
	 ;


md_header_level = H1 | H2 | H3 | H4 | H5 | H6 ;


md_line = Header(md_compound text, md_header_level level)
	| Indented(md_compound text, usize num_indents)
	| LinkReference(identifier name, string url, string title)
	| HorizontalLine
	;


md_compound = (md_inline* compound) ;


md_table_row = (md_compound cells, size num_cell) ;


md_table = (md_table_row* rows, size num_rows) ;


md_ol_item = (int bullet_num, md_list_item item) ;


md_ul_item = (char bullet_char, md_list_item item) ;


md_list_item = TextItem(string text)
	    | OrderedItem(md_ol_item ordered_item)
	    | UnorderedItem(md_ul_item unordered_item)
	    | NestedList(md_list nested_list)
	    ;



md_list = (md_list_item* items) ;



md_block = Paragraph(md_compound* paragraph)
	| BlockQuote(md_compound* block_quote)
	| CodeListing(identifier? label, string code)
	| Table(md_table table)
	| List(md_list list)
	| Line(md_line line)
	;


markdown = (md_block* blocks) ;


%%


static inline void init_tree_scratch(void) { ast_scratch = arena_init(AST_ARENA_SIZE); }
static inline void free_tree_scratch(void) { arena_free(ast_scratch); }
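For readers who have not used ASDL, here is a rough, hand-written idea of what the `md_inline` / `md_compound` pair could look like as C types. This is not the output of ZephyrASDL, and every name in it is an assumption of mine. It covers only four of the constructors, and the `md_depth` walk is there just to show the two types recursing through each other.

```c
#include <assert.h>
#include <stddef.h>

struct md_inline;

/* md_compound = (md_inline* compound): a sequence of inline nodes. */
struct md_compound {
    struct md_inline **items;
    size_t count;
};

enum md_inline_kind {
    MD_ITALIC, MD_BOLD, MD_STRIKETHROUGH, MD_REGULAR_TEXT
};

/* One constructor of md_inline, selected by kind (a tagged union). */
struct md_inline {
    enum md_inline_kind kind;
    union {
        struct md_compound compound;  /* Italic, Bold, Strikethrough */
        const char *text;             /* RegularText */
    } u;
};

static size_t md_depth(const struct md_compound *c);

/* Depth of one inline node: 1 for text, 1 + its contents otherwise. */
static size_t md_inline_depth(const struct md_inline *n)
{
    assert(n != NULL);
    if (n->kind == MD_REGULAR_TEXT)
        return 1;
    return 1 + md_depth(&n->u.compound);
}

/* Depth of the deepest inline node in c; 0 for an empty compound. */
static size_t md_depth(const struct md_compound *c)
{
    size_t max = 0;
    for (size_t i = 0; i < c->count; i++) {
        size_t d = md_inline_depth(c->items[i]);
        if (d > max)
            max = d;
    }
    return max;
}
```

The sum type becomes an enum tag plus a union, and the `*` sequence becomes a pointer and a count. A real generator would also emit constructor functions and would presumably allocate through the `ALLOC` macro from the prologue.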

I had an easier time parsing ASDL with Yacc. I still can’t tell whether a grammar is LR, LL or RE, but I can tell that Markdown is not context-free.

I just updated ASDL: https://github.com/Chubek/ZephyrASDL

Apologies if I am too late on the documentation. I am still trying to improve it by using it myself. I also wish to add an OCaml target.

@mcmodknower · 2 months ago

md_inline and md_compound refer to each other, and not only at the very end or only at the very beginning of a rule, so this is not a type 3 (regular) grammar.
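A small sketch of what that mutual recursion means in practice: the two functions below call each other, one per rule, and the C call stack remembers which delimiter is still waiting to be closed. The grammar is a toy I made up, with `*` and `~` standing in for italics and strikethrough. Real CommonMark emphasis rules are considerably more involved, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static const char *parse_compound(const char *s, char closer,
                                  int depth, int *max_depth);

/* inline := '*' compound '*' | '~' compound '~' | any other character
   Returns the position after the inline, or NULL on error. */
static const char *parse_inline(const char *s, int depth, int *max_depth)
{
    if (*s == '*' || *s == '~') {
        char delim = *s;
        if (depth + 1 > *max_depth)
            *max_depth = depth + 1;
        s = parse_compound(s + 1, delim, depth + 1, max_depth);
        if (s == NULL || *s != delim)
            return NULL;          /* missing closing delimiter */
        return s + 1;
    }
    return s + 1;                 /* one character of plain text */
}

/* compound := inline*
   Stops at the enclosing closer or at the end of the input. */
static const char *parse_compound(const char *s, char closer,
                                  int depth, int *max_depth)
{
    while (s != NULL && *s != '\0' && *s != closer)
        s = parse_inline(s, depth, max_depth);
    return s;
}

/* Returns the deepest nesting level, or -1 if a delimiter is unclosed. */
static int nesting_depth(const char *s)
{
    int max_depth = 0;
    assert(s != NULL);
    s = parse_compound(s, '\0', 0, &max_depth);
    return (s != NULL && *s == '\0') ? max_depth : -1;
}
```

Because `compound` appears in the middle of the `inline` rule, the parser has to remember an unbounded number of pending closers, which is exactly what a finite automaton cannot do. Here `nesting_depth("*a~b~*")` reports two levels, while the overlapping `"*~a*~"` is rejected.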

Sorry for the late response, I wanted to write a better one but don’t have the time for that currently.

ChubakPDP11+TakeWithGrainOfSalt · 2 months ago

Thanks. I actually have a parse-related question which I will post somewhere soon (as in 2-3 minutes).