• JackbyDev
    4 hours ago

    That’s really cool! What I’d really like is a tool that converts PDFs to semantic HTML files. I took a peek there and it seems easier for them because they have the original LaTeX source.

    I think for arbitrary PDF files the information just isn’t there. I’ve looked into it a bit and the results are sort of all over the place. A tool called pdf2htmlEX is pretty good, but it makes the HTML look exactly like the PDF rather than producing semantic markup.

    • keepthepace@slrpnk.net
      3 hours ago

      Yes, PDFs are much more permissive and may not have any semantic information at all. Hell, some old publications are just scanned images!

      PDF -> semantic HTML seems to be a hard problem that basically requires OCR, like these people are doing.
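
To illustrate why the thread calls this hard: a PDF without tagging mostly gives you positioned text runs with a font and a size, so any heading or paragraph structure has to be guessed. The sketch below is a toy heuristic under stated assumptions, not how pdf2htmlEX or any real converter works. The `Span` shape, the `guess_html` function, and the `heading_ratio` threshold are all hypothetical names made up for this example.

```python
import html
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """A run of text as a PDF extractor might report it (hypothetical shape)."""
    text: str
    font_size: float


def guess_html(spans: list[Span], heading_ratio: float = 1.3) -> str:
    """Guess <h2>/<p> tags from font size alone; a heuristic, not real semantics."""
    if not spans:
        return ""

    # Assume the font size that covers the most characters is body text.
    chars_per_size = Counter()
    for span in spans:
        chars_per_size[span.font_size] += len(span.text)
    body_size = chars_per_size.most_common(1)[0][0]

    lines = []
    for span in spans:
        # Anything noticeably larger than body text is guessed to be a heading.
        tag = "h2" if span.font_size >= body_size * heading_ratio else "p"
        lines.append(f"<{tag}>{html.escape(span.text)}</{tag}>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        Span("Introduction", 18.0),
        Span("PDFs store positioned glyphs, not document structure.", 10.0),
    ]
    print(guess_html(sample))
```

A guess like this falls apart on multi-column layouts, captions, and footnotes, and it cannot start at all on scanned pages, where there are no text runs until OCR has been run.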