Edit
My question was very badly written, but the new title reflects the actual question. Thanks to 3 very friendly and dedicated users (@harsh3466 @tuna @learnbyexample) I was able to find a solution for my files, so thank you guys !!!
For those who randomly come across this post, here are 3 possible ways to achieve the desired result.
Solution 1 (https://lemmy.ml/post/25346014/16383487)
#! /bin/bash
files="/home/USER/projects/test.md"
mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"
while IFS= read -r line; do
  #Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9])
  dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
  sed -i "s/$line/${dashlink}/" "$files"
  #Puts everything to lowercase after a hashtag
  lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
  sed -i "s/$dashlink/${lowercaselink}/" "$files"
  #Removes spaces (%20) from markdown links after a hashtag
  spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
  sed -i "s/$lowercaselink/${spacelink}/" "$files"
done <<<"$mdlinks2"
Solution 2 (https://lemmy.ml/post/25346014/16453351)
sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
Solution 3 (https://lemmy.ml/post/25346014/16453161)
perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'
Relevant links
https://mike.bailey.net.au/notes/software/apps/obsidian/issues/markdown-heading-anchors/#background
Hi everyone !
I’m in need of some assistance with string manipulation using sed and regex. I spent a whole day on trial & error and looking around the web for a solution, however it’s way beyond my capabilities. Maybe there are some sed/regex gurus here who are willing to give me a helping hand !
From everything I gathered around the web, it seems to be a rather complicated regex and sed substitution. Here we go !
What am I trying to achieve?
I have a lot of markdown guides I want to host on a self-hosted, Forgejo-based git instance. However, the classic markdown links are not the same as the ones on GitHub/Forgejo…
Convert the following string:
[Some text](#Header%20Linking%20MARKDOWN.md)
Into
[Some text](#header-linking-markdown.md)
As you can see, these are the requirements:
- Pattern: [ ]( )
- Only edit what’s between the parentheses
- Replace space (%20) with -
- Everything as lowercase
- Links are sometimes in nested parentheses
  - e.g. (look here [ ]( ) )
- Do not change a line that begins with https (external links)
While everything is probably a bit complex as a whole the trickiest part is probably the nested parentheses :/
What I tried
The furthest I got was the following:
sed -Ei 's|\(([^\)]+)\)|\L&|g' test3.md #make everything between parentheses lowercase
sed -i '/https/ ! s/%20/-/g' test3.md #change every %20 occurrence to -
These sed/regex substitutions are what I put together while roaming the web, but they have a lot of flaws and don’t work with nested parentheses. Also, the second command changes every %20 occurrence on any line that does not contain https, not just the ones inside links.
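A quick way to see these flaws is to run both substitutions on one sample line, without -i so that no file is modified. This is only a sketch: the function name first_attempt and the sample line are made up for illustration, and \L needs GNU sed.

```shell
#!/bin/bash
# first_attempt is a hypothetical name; it chains the two sed commands
# from the post, reading stdin and writing stdout instead of editing a file.
first_attempt() {
  sed -E 's|\(([^\)]+)\)|\L&|g' | sed '/https/ ! s/%20/-/g'
}

# One line with a plain parenthesised remark and a real markdown link.
printf '%s\n' 'See (Plain%20TEXT) and [Some text](#Header%20Linking.md)' | first_attempt
# Prints:
# See (plain-text) and [Some text](#header-linking.md)
```

The link is converted as intended, but so is the unrelated (Plain%20TEXT) remark, because neither command checks that the parentheses belong to a [ ]( ) link.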
The closest solution I found on stackoverflow looks similar, but I wasn’t able to fit it to my needs. Actually, my lack of regex/sed understanding makes it impossible for me to adapt it to my requirements.
I would appreciate any help, even if a change of tool is needed. However, I’m more interested in the learning process, so a script or CLI alternative is very much appreciated :) actually any help is appreciated :D !
Thanks in advance.
This is very close
example file
[Some text](#Header%20Linking%20MARKDOWN.md)
(#Should%20stay%20as%20is.md)
Text surrounding [a link](readme.md#Other%20Page). Cool
Multiple [links](#Links.md) in (%20) [a](#An%20A.md) SINGLE [line](#Lines.md)
Do [NOT](https://example.com/URL%20Should%20Be%20Untouched.html) CHANGE%20 [hyperlinks](http://example.com/No%20Touchy.html)
but it doesn’t work if you have an http link and a markdown link in the same line, and it doesn’t work with
[escaped \] square brackets](#and-escaped-\)-parenthesis)
in the link. But it was fun!!
Hello :) Sorry for the very late response !
Indeed, your regex is very close for a one-liner, I’m pretty impressed ! :0 However, I forgot to mention something in my post (I only thought about it after working on it with another user in the comments…). There are 2 things missing from your beautiful and complex regex:
FROM
[Link with numbers](readme.md#1.3%20this%20is%20another%20test)
TO
[Link with numbers](readme.md#1-3-this-is-another-test)
FROM
[Link with numbers](Another%20file%20to%20readme.md#1.3%20this%20is%20another%20test.md)
TO
[Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test.md)
Sorry for the trouble, I wasn’t aware of all the GitHub-Flavored Markdown syntax :/. I got a very cool script that works perfectly, written with another user, but if you want to modify your regex and try to solve the issue in pure regex, feel free :) I’m very curious what it could look like (god, regex is so obscure and at the same time it has some beauty in it !)
#! /bin/bash
files="/home/USER/projects/test.md"
mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"
while IFS= read -r line; do
  #Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9])
  dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
  sed -i "s/$line/${dashlink}/" "$files"
  #Puts everything to lowercase after a hashtag
  lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
  sed -i "s/$dashlink/${lowercaselink}/" "$files"
  #Removes spaces (%20) from markdown links after a hashtag
  spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
  sed -i "s/$lowercaselink/${spacelink}/" "$files"
done <<<"$mdlinks2"
I did it!! It also handles the case where an external link and internal link are on the same line :D
Here is my annotated file
Wow ! Thank you ! I did a quick test on a test-file.md
[Just a test](#just-a-test)
[Just a link](https://mylink/%20with%20space.com)
[External link](readme.md#just-a-test)
[Link with numbers](readme.md#1-3-this-is-another-test)
[Link with numbers](Another%20file%20to%20readme.md#1-3-this-is-another-test)
Great job ! Thank you very much !!! I’m really impressed by what someone with proper knowledge can do ! However, I really do not want to mess around with your regex… That would only call for disaster xD ! I will carefully keep your regex and annotated file in my knowledge base. I’m sure some time in the future I will come back to it and try to break it down as a learning process.
Thank you very much !!! 👍
No problem. I think this is a great “final boss” question for learning sed, because it turns out it is deceptively hard!! You have to understand not only a lot about regex, but about sed to get it right. I learned a lot about sed just by tackling this problem!
It is very delicate for sure, but one part you can for sure change is at the # Add hyphens part. In the regex you can see (%20|\.). These are a list of “characters” which get converted to hyphens. For example, you could modify it to (%20|\.|\+) and it will convert +s to -s as well!
Still it is not perfect:
\\\\\[LINK](#LINK)
or
[LINK\]\\\\](#LINK)
But for a sed-only solution this is about as good as it will get I’m afraid.
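As an illustration of that “list of characters” idea, here is a stripped-down version of the hyphen loop. It is not the full one-liner from Solution 2: it assumes each input line holds just a bare link target (no brackets or parentheses around it), the function name hyphenate is hypothetical, and it relies on GNU sed (-E, and labels followed by ;).

```shell
#!/bin/bash
# hyphenate is a hypothetical helper: each loop pass replaces one token from
# the alternation (%20, a dot, or a plus) found after the first # with a hyphen,
# and the t command jumps back to :h until no token is left.
hyphenate() {
  sed -E ':h; s/^([^#]*#[^)]*)(%20|\.|\+)/\1-/; th'
}

printf '%s\n' 'readme.md#1.3%20a+b' | hyphenate
# Prints:
# readme.md#1-3-a-b
```

Adding or removing an entry inside (%20|\.|\+) changes which characters become hyphens. The dot in readme.md survives because the pattern only looks after the first #, but any dot after the # is replaced, including one in a trailing .md.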
Overall I’m very happy with it. Someday I would like to make a video that goes into depth about sed, since it is tricky to learn just from the docs.
I’ll give another go at it :)
annotated it is working like this:
# use a loop to iteratively replace the %20 with -, since doing s/%20/-/g would replace too much. we loop until it cant substitute any more
# label for looping
:loop;
# skip the following substitute command if the line contains an http link in markdown format
/\[[^]]*\](http/!
# capture each part of the link, and join it together with -
s/\(\[[^]]*\]\)\(([^)]*\)%20\([^)]*)\)/\1\2-\3/g;
# if the substitution made a change, loop again, otherwise break
t loop;
# convert all insides to the link lowercase if the line doesnt contain an http link
/\[[^]]*\](http/!
# this is outside the loop rather than in the s command above because if the link doesnt contain %20 at all then it won't convert to lowercase
s/\(\[[^]]*\]\)\(([^)]*)\)/\1\L\2/g
Why do you assume there’s only one link in the line?
Also, you perform substitutions on the whole URL instead of only the fragment component.
They did not want external (http) links to be modified as that would break it:
[Example](https://example.com/#Some%20Link)
[Example](https://example.com/#some-link)
I compromised by thinking that it might be unlikely enough to have an external http link AND an internal link within the same line. You could probably still do it, my first thought was [^h][^t][^t][^p] but that would cause issues for #ttp and #A so I just gave up. Instead I think you’d want a different approach, like breaking each link onto its own line, doing the same external/internal check before the substitution, and joining the lines afterward.
That requirement I missed. I just assumed the filename would be replaced the same way too Lol. Not too hard to fix tho :)
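The “handle each link separately, check external/internal, then join” idea can be sketched roughly as follows. Note the swap: this uses awk rather than sed, peeling one link at a time off the line instead of physically splitting lines. The function name fix_line_links is hypothetical, and the sketch only covers %20 and lowercasing inside the fragment; it does not handle escaped brackets or the digit-dot rule.

```shell
#!/bin/bash
# fix_line_links is a hypothetical helper, not one of the posted solutions.
# It walks through each markdown link on a line, so an external link and an
# internal link on the same line are checked independently.
fix_line_links() {
  awk '{
    out = ""
    rest = $0
    # Peel off one markdown link at a time so each gets its own check.
    while (match(rest, /\[[^]]*\]\([^)]*\)/)) {
      link = substr(rest, RSTART, RLENGTH)
      out = out substr(rest, 1, RSTART - 1)      # text before the link
      rest = substr(rest, RSTART + RLENGTH)      # text after the link
      split_at = index(link, "](")
      text = substr(link, 1, split_at + 1)       # "[text]("
      target = substr(link, split_at + 2)        # "target)"
      hash = index(target, "#")
      # Only rewrite the fragment of links that are not http(s) URLs.
      if (target !~ /^https?:/ && hash > 0) {
        frag = tolower(substr(target, hash))
        gsub(/%20/, "-", frag)
        target = substr(target, 1, hash - 1) frag
      }
      out = out text target
    }
    # Join the processed links with whatever trailing text is left.
    print out rest
  }'
}

printf '%s\n' '[a](#An%20A.md) and [NOT](https://example.com/No%20Touchy.html#Frag%20X)' | fix_line_links
# Prints:
# [a](#an-a.md) and [NOT](https://example.com/No%20Touchy.html#Frag%20X)
```

Because the check runs per link rather than per line, the internal link is converted while the external one on the same line is left alone.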