Edit

My question was very badly written, but the new title reflects the actual question. Thanks to 3 very friendly and dedicated users (@harsh3466 @tuna @learnbyexample) I was able to find a solution for my files, so thank you guys !!!

For those who randomly come across this post, here are 3 possible ways to achieve the desired result.

Solution 1 (https://lemmy.ml/post/25346014/16383487)

#! /bin/bash
files="/home/USER/projects/test.md"

mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"

while IFS= read -r line; do
	#Converts 1.2 to 1-2 (a third-level heading such as 1.2.3 would need an additional [0-9] group)
	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
	sed -i "s/$line/${dashlink}/" "$files"

	#Puts everything to lowercase after a hashtag
	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
	sed -i "s/$dashlink/${lowercaselink}/" "$files"

	#Removes spaces (%20) from markdown links after a hashtag
	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
	sed -i "s/$lowercaselink/${spacelink}/" "$files"

done <<<"$mdlinks2"

Solution 2 (https://lemmy.ml/post/25346014/16453351)

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Solution 3 (https://lemmy.ml/post/25346014/16453161)

perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'

Relevant links

https://mike.bailey.net.au/notes/software/apps/obsidian/issues/markdown-heading-anchors/#background


Hi everyone !

I need some help with string manipulation using sed and regex. I spent a whole day on trial & error and searching the web for a solution, but it’s way beyond my capabilities. Maybe there are some sed/regex gurus here who are willing to give me a helping hand !

From everything I gathered around the web, it seems to be a rather complicated regex and sed substitution. Here we go !

What am I trying to achieve?

I have a lot of markdown guides I want to host on a self-hosted, Forgejo-based git instance. However, the classic markdown links are not the same as the ones on GitHub/Forgejo…

Convert the following string:

[Some text](#Header%20Linking%20MARKDOWN.md)

Into

[Some text](#header-linking-markdown.md)

As you can see, these are the requirements:

  • Pattern: [Some text](#link%20to%20header.md)
  • Only edit what’s between parentheses
  • Replace space (%20) with -
  • Everything as lowercase
  • Links are sometimes in nested parentheses
    • e.g. (look here [Some text](#link%20to%20header.md))
  • Do not change a line that begins with https (external links)

While everything is probably a bit complex as a whole, the trickiest part is probably the nested parentheses :/

What I tried

The furthest I got was the following:

sed -Ei 's|\(([^\)]+)\)|\L&|g' test3.md #make everything between parentheses lowercase

sed -i '/https/ ! s/%20/-/g' test3.md #change every %20 occurrence to -

These sed/regex substitutions are what I put together while roaming the web, but they have a lot of flaws and don’t work with nested parentheses. Also, this would change every %20 occurrence in the file.
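
To make the over-matching concrete, here is a small sketch that runs the same two substitutions as a pipeline on one made-up line instead of editing test3.md in place. It assumes GNU sed (for `\L`); the function name `first_attempt` is only for illustration.

```shell
# The two substitutions from above, applied to stdin instead of a file.
first_attempt() {
  sed -E 's|\(([^\)]+)\)|\L&|g' | sed '/https/ ! s/%20/-/g'
}

# Only the link target should change, but the plain "(TEXT)" and the
# stray %20 outside the link are rewritten as well.
printf '%s\n' 'Plain (TEXT) and CHANGE%20ME near [A](#Some%20Header)' | first_attempt
```

With GNU sed this prints `Plain (text) and CHANGE-ME near [A](#some-header)`: the link is fixed, but so is everything else on the line.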

The closest solution I found on Stack Overflow looks similar, but I wasn’t able to fit it to my needs. My lack of regex/sed understanding makes it impossible for me to adapt it to my requirements.


I would appreciate any help, even if a change of tool is needed. However, I’m more interested in the learning process, so a script or CLI alternative would be very much appreciated :) Actually, any help is appreciated :D !

Thanks in advance.

  • tuna@discuss.tchncs.de · 5 upvotes · 5 days ago

    This is very close

    sed ':loop;/\[[^]]*\](http/! s/\(\[[^]]*\]\)\(([^)]*\)%20\([^)]*)\)/\1\2-\3/g;t loop;/\[[^]]*\](http/! s/\(\[[^]]*\]\)\(([^)]*)\)/\1\L\2/g'
    

    example file

    [Some text](#Header%20Linking%20MARKDOWN.md)
    (#Should%20stay%20as%20is.md)
    Text surrounding [a link](readme.md#Other%20Page). Cool
    Multiple [links](#Links.md) in (%20) [a](#An%20A.md) SINGLE [line](#Lines.md)
    Do [NOT](https://example.com/URL%20Should%20Be%20Untouched.html) CHANGE%20 [hyperlinks](http://example.com/No%20Touchy.html)
    

    but it doesn’t work if you have a http link and markdown link in the same line, and doesn’t work with [escaped \] square brackets](#and-escaped-\)-parenthesis) in the link
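
    A quick way to see the same-line limitation, as a sketch assuming GNU sed and two made-up input lines (`first_try` is just a hypothetical wrapper around the command above):

    ```shell
    first_try() {
      sed ':loop;/\[[^]]*\](http/! s/\(\[[^]]*\]\)\(([^)]*\)%20\([^)]*)\)/\1\2-\3/g;t loop;/\[[^]]*\](http/! s/\(\[[^]]*\]\)\(([^)]*)\)/\1\L\2/g'
    }

    # Line 1 has only an internal link and is converted.
    # Line 2 also contains an http link, so the whole line is skipped.
    printf '%s\n' \
      '[int](#Some%20Header)' \
      '[int](#Some%20Header) and [ext](https://example.com/A%20B)' | first_try
    ```

    The first line comes out as [int](#some-header), while the second line is printed unchanged, internal link included.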

    but!! it was fun!

    • N0x0n@lemmy.ml (OP) · 2 upvotes · 3 days ago

      Hello :) Sorry for the very late response !

      Your regex is indeed very close for a one-liner, I’m pretty impressed ! :0 However, I forgot to mention something in my post (I only thought about it after working on it with another user in the comments…). There are 2 things missing from your beautiful and complex regex:

      1. Numbering with dots also needs a dash in between (actually, I think every special character, like a space or a dot, is converted to a dash)
      FROM
      ---------------
      [Link with numbers](readme.md#1.3%20this%20is%20another%20test)
      
      TO
      ---------------
      [Link with numbers](readme.md#1-3-this-is-another-test)
      
      2. The part before the hashtag needs to keep its original form (it links to a real file)
      FROM
      ---------------
      [Link with numbers](Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test.md)
      
      TO
      ---------------
      [Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test.md)
      

      Sorry for the trouble, I wasn’t aware of all the GitHub-Flavored Markdown syntax :/. With another user I got a very cool script that works perfectly, but if you want to modify your regex and try to solve the issue in pure regex, feel free :) I’m very curious what it could look like (god, regex is so obscure and at the same time it has some beauty in it !)

      #! /bin/bash
      
      files="/home/USER/projects/test.md"
      
      mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
      mdlinks2="$(grep -Po '#.*' <<<"$mdlinks")"
      
      while IFS= read -r line; do
      	#Converts 1.2 to 1-2 (a third-level heading such as 1.2.3 would need an additional [0-9] group)
      	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
      	sed -i "s/$line/${dashlink}/" "$files"
      
      	#Puts everything to lowercase after a hashtag
      	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
      	sed -i "s/$dashlink/${lowercaselink}/" "$files"
      
      	#Removes spaces (%20) from markdown links after a hashtag
      	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
      	sed -i "s/$lowercaselink/${spacelink}/" "$files"
      
      done <<<"$mdlinks2"
      
      • tuna@discuss.tchncs.de · 2 upvotes · 2 days ago (edited)

        I did it!! It also handles the case where an external link and internal link are on the same line :D

        sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
        

        Here is my annotated file

        # Begin loop
        :l;
        
        # Bisect first link in pattern space into pattern space and append to hold space
        # Example: `text [label](file#fragment)'
        #   Pattern space: `file#fragment)'
        #   Hold space: `text [label]('
        # Steps:
        #   1. Strategically insert \n
        #       1a. If this fails, branch out
        #   2. Append to hold space (this creates two \n's. It feels weird for the
        #      first iteration, but that's ok)
        #   3. Copy hold space to pattern space, remove first \n, then trim off
        #      everything past the second \n
        #   4. Swap pattern/hold, and trim off everything up to and incl the last \n
        s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;
        Te;
        H;
        g; s/\n//; s/\n.*//;
        x; s/.*\n//;
        
        # Modify only if it is an internal link
        /^https?:/! {
            # Add hyphens
            :h;
            s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;
            th;
            # Make lowercase
            s/(#[^)]*\))/\L\1/;
        };
        
        # "conditional" branch so it checks the next conditional again
        tl;
        
        # Exit: join pattern space to hold space, then move to pattern space.
        # Since the loop uses H instead of h, have to make sure hold space is empty
        :e;
        H;
        z;
        x; s/\n//;
        
        • N0x0n@lemmy.ml (OP) · 2 upvotes · 2 days ago

          Wow ! Thank you ! I did a quick test on a test-file.md

          [Just a test](#just-a-test)
          [Just a link](https://mylink/%20with%20space.com)
          [External link](readme.md#just-a-test)
          [Link with numbers](readme.md#1-3-this-is-another-test)
          [Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test)
          

          Great job ! Thank you very much !!! I’m really impressed by what someone with proper knowledge can do ! However, I really do not want to mess around with your regex… That would only invite disaster xD ! I will carefully keep your regex and annotated file in my knowledge base. I’m sure that some time in the future I will come back to it and try to break it down as a learning process.

          Thank you very much !!! 👍

          • tuna@discuss.tchncs.de · 1 upvote · 2 days ago

            No problem. I think this is a great “final boss” question for learning sed, because it turns out it is deceptively hard!! You have to understand not only a lot about regex, but about sed to get it right. I learned a lot about sed just by tackling this problem!

            I really do not want to mess around with your regex

            It is very delicate for sure, but one part you can safely change is the # Add hyphens part. In the regex you can see (%20|\.). This is a list of “characters” which get converted to hyphens. For example, you could modify it to (%20|\.|\+) and it will convert +s to -s as well!
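
            For instance, the one-liner with that extended group would look like the following sketch (GNU sed assumed; `fix_links` is a hypothetical wrapper name and the sample link is made up):

            ```shell
            fix_links() {
              # Same script as the one-liner above; only the hyphen group
              # changed from (%20|\.) to (%20|\.|\+)
              sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.|\+)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
            }

            # The +, the %20 and the dot after the # all become hyphens;
            # the file name before the # is left alone.
            printf '%s\n' '[Ref](notes.md#Pros+Cons%20v1.2)' | fix_links
            ```

            This should print [Ref](notes.md#pros-cons-v1-2).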

            Still it is not perfect:

            • If the link spans multiple lines, the regex won’t match
            • If the link contains escaped characters like \\\\\[LINK](#LINK) or [LINK\]\\\\](#LINK), it won’t be handled correctly
            • If the link is inside a code block ``` it will get changed (which may or may not be intended)
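
            If changing links inside code blocks is not wanted, one possible workaround is to put a range address in front of the script so that sed skips everything between two fence lines. This is only a sketch and not part of the command above: it assumes GNU sed, it only recognises fences made of three backticks at the very start of a line, and `fix_links_outside_fences` is a made-up name.

            ```shell
            fix_links_outside_fences() {
              # The range address spans from an opening fence line to the next
              # closing one; "b" without a label jumps to the end of the script,
              # so those lines are printed unchanged. The rest is the one-liner.
              sed -E '/^`{3}/,/^`{3}/b;:l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
            }

            # Build a fence line from octal escapes (140 is the backtick).
            fence=$(printf '\140\140\140')
            # [A] is outside the code block and gets converted, [B] is inside and stays.
            printf '%s\n' '[A](#Some%20Head)' "$fence" '[B](#Keep%20Me)' "$fence" | fix_links_outside_fences
            ```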

            But for a sed-only solution this is about as good as it will get I’m afraid.

            Overall I’m very happy with it. Someday I would like to make a video that goes into depth about sed, since it is tricky to learn just from the docs.

    • tuna@discuss.tchncs.de · 4 upvotes · 5 days ago

      Annotated, it works like this:

      # use a loop to iteratively replace the %20 with -, since doing s/%20/-/g would replace too much. we loop until it cant substitute any more
      
      # label for looping
      :loop;
      # skip the following substitute command if the line contains an http link in markdown format
      /\[[^]]*\](http/!
      # capture each part of the link, and join it together with -
      s/\(\[[^]]*\]\)\(([^)]*\)%20\([^)]*)\)/\1\2-\3/g;
      # if the substitution made a change, loop again, otherwise break
      t loop;
      
      # convert all insides to the link lowercase if the line doesnt contain an http link
      /\[[^]]*\](http/!
      # this is outside the loop rather than in the s command above because if the link doesnt contain %20 at all then it won't convert to lowercase
      s/\(\[[^]]*\]\)\(([^)]*)\)/\1\L\2/g
      
      • bizdelnick@lemmy.ml · 2 upvotes · 5 days ago (edited)

        skip the following substitute command if the line contains an http link in markdown format

        Why do you assume there’s only one link in the line?

        Also, you perform substitutions in the whole URL instead of only in the fragment component.

        • tuna@discuss.tchncs.de · 3 upvotes · 5 days ago (edited)

          Why do you assume there’s only one link in the line?

          They did not want external (http) links to be modified as that would break it:

          • [Example](https://example.com/#Some%20Link)
          • [Example](https://example.com/#some-link)

          I compromised by thinking that it might be unlikely enough to have an external http link AND internal link within the same line. You could probably still do it, my first thought was [^h][^t][^t][^p] but that would cause issues for #ttp and #A so i just gave up. Instead I think you’d want a different approach, like breaking each link onto its own line, doing the same external/internal check before the substitution, and joining the lines afterward.
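
          That split/check/join idea could be sketched as three sed passes. This is only an illustration under a few assumptions: GNU sed, a marker @@ that never occurs in the files, and a made-up function name `fix_links_split`.

          ```shell
          fix_links_split() {
            sed -E '
              # Stage 1: put every markdown link on a line of its own. The
              # marker @@ flags each line break that was added here.
              s/\[[^]]*\]\([^)]*\)/@@\n&@@\n/g
            ' | sed -E '
              # Stage 2: skip every line that is not an extracted link with a fragment
              /^\[[^]]*\]\([^)#]*#[^)]*\)@@$/!b
              # External links stay untouched
              /^\[[^]]*\]\(https?:/b
              # Turn %20 and dots after the hash sign into hyphens, one per pass
              :h
              s/^(\[[^]]*\]\([^)#]*#[^)]*)(%20|\.)/\1-/
              th
              # Lowercase the fragment only
              s/^(\[[^]]*\]\([^)#]*)(#[^)]*\))/\1\L\2/
            ' | sed -E '
              # Stage 3: glue the pieces back together by removing each added break
              :a
              /@@$/{
                N
                s/@@\n//
                ba
              }
            '
          }

          # Internal and external link on the same line, plus a line without links.
          printf '%s\n' \
            'See [a](readme.md#1.3%20Other%20Page) and [b](https://example.com/#C%20D) ok' \
            'plain%20text' | fix_links_split
          ```

          Because the external/internal check now runs once per link instead of once per line, the first link becomes [a](readme.md#1-3-other-page) while the https link next to it and the plain text line stay as they are.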

          Also, you perform substitutions in the whole URL instead of the fragment component

          That requirement i missed. I just assumed the filename would be replaced the same way too Lol. Not too hard to fix tho :)