• e0qdk@reddthat.com
    5 days ago

    From the discussion in the report, it looks like the website is serving HTML not a PDF. (The author seems to be getting an Anubis-like bot block page instead of the file they expected?) This seems like it has nothing to do with running JS embedded in a PDF if I understand what’s going on correctly.

    • rageAgainstCages@crazypeople.onlineOP
      2 days ago

      It’s not a block page. It’s a file with a .pdf extension but HTML+JS contents. This is a shitty trend that’s becoming a plague. If you do all your browsing in a GUI browser you’ll never notice it, because both Firefox and Chromium happily execute whatever JavaScript they encounter.

      The bug report is a year old, so it’s possible cafevanbommel.nl changed. For me, the sample URL in that report fetches nothing at all; the request fails with ERROR 415: Unsupported Media Type.

      Try this:

      $ wget 'https://www.lachambre.be/kvvcr/pdf_sections/pri/fiche/fr_12_00.pdf'
      $ file fr_12_00.pdf
      fr_12_00.pdf: HTML document, ASCII text, with very long lines (31237), with CRLF, LF line terminators
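
      One way to catch this automatically in a download script (a sketch — the fake.pdf here is a fabricated stand-in rather than a live download): real PDFs always start with the magic bytes %PDF-, so checking the first five bytes of the file is enough to flag HTML masquerading as a PDF, without trusting the extension or the server’s Content-Type.

```shell
# Fabricated stand-in for what the server actually sent.
printf '<!DOCTYPE html><html><body>not a pdf</body></html>' > fake.pdf

# Real PDFs begin with the magic bytes "%PDF-"; anything else is suspect.
if head -c 5 fake.pdf | grep -q '^%PDF-'; then
    echo "real PDF"
else
    echo "fake PDF: HTML or other content behind a .pdf extension"
fi
```

      The same check could be dropped into a getpdf-style script so a bogus fetch fails loudly instead of polluting your archive.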
      

      This fucks people up if they use a script to grab PDFs. I run a getpdf script that essentially does this:

      $ torsocks wget --xattr "$url"
      $ exiftool -config ~/tools/conf/ExifTool_config -xmp-xmp:srcurl="$url" "$filename"
      

      That is a very useful way to fetch PDFs because it adds the URL of the PDF to the file’s metadata (redundantly, because --xattr info can be lost in some operations). So later if I want to recall where a PDF came from, such as to share the link with someone, I can just do: getfattr -d "$pdf".

      So these motherfuckers who push fake PDFs force me to use a GUI to run their shitty surreptitious download-management application (which loads the PDF via pdf.js) just to get the file. Then I have to point, click, and save, go to whatever directory the file landed in on the command line, run the metadata tools by hand, and copy-paste the URL.

      The assholes have no idea what hassle this causes. But so few users are smart enough to save metadata in their PDFs that we can easily be marginalised. It’s yet another case where sophisticated users get burnt by enshittifying web admins who know they can nanny the unwitting masses.

      • e0qdk@reddthat.com
        2 days ago

        I’m not quite sure what’s going on there exactly, but I block JS in my browser (via NoScript). When I download the link you provided with wget in the terminal, it returns what looks like a bot block page to me. (It includes the text “This question is for testing whether you are a human visitor and to prevent automated spam submission.” with an embedded CAPTCHA image.) If I load the link in Firefox, though, it serves a PDF even with JS disabled. Usually that means the site is doing something like User-Agent sniffing or a cookie check to block automated scrapers, but if I download the link with wget again after loading it once in my browser, it serves the PDF directly. So presumably the site has some middleware that allows requests by IP after you’ve passed an initial not-a-bot check? (Maybe time-limited? Haven’t experimented to find out.)

        You might be able to get around this by setting User-Agent and other headers in an initial request to impersonate the browser. (Use “Copy as cURL” on the request in FF’s network dev tools to see how to emulate it exactly as your browser sends it.)
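
        Something along these lines might work if header sniffing is the gate (a sketch — the header values below are typical Firefox examples, not necessarily what this site checks for; substitute the real ones your browser sends):

```shell
# Sketch: refetch with browser-like headers (example values only).
curl -L --cookie-jar cookies.txt \
     -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0' \
     -H 'Accept: application/pdf,text/html;q=0.9,*/*;q=0.8' \
     -o fr_12_00.pdf \
     'https://www.lachambre.be/kvvcr/pdf_sections/pri/fiche/fr_12_00.pdf'

# Verify what actually came back before trusting it:
file fr_12_00.pdf
```

        If the site’s middleware keys on cookies rather than headers, the --cookie-jar from a first request may be what matters on the retry.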