• rageAgainstCages@crazypeople.onlineOP
    5 days ago (edited)

    It’s not a block page. It’s a file with a PDF extension but with HTML+JS contents. This is a shitty trend that’s becoming a plague. If you do all your browsing with a GUI browser, you will never notice it, because both Firefox and Chromium are happy to execute whatever JavaScript they encounter.

    The bug report is a year old, so it’s possible cafevanbommel.nl changed. For me, the sample URL in that report fetches nothing at all; it returns HTTP error 415 (Unsupported Media Type).

    Try this:

    $ wget 'https://www.lachambre.be/kvvcr/pdf_sections/pri/fiche/fr_12_00.pdf'
    $ file fr_12_00.pdf
    fr_12_00.pdf: HTML document, ASCII text, with very long lines (31237), with CRLF, LF line terminators
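
    A quick way to catch these before they pollute your collection: a real PDF always starts with the magic bytes %PDF-, and an HTML page doesn’t. A minimal sketch (the filenames and sample contents here are just for illustration):

```shell
# Create a fake "PDF" (HTML contents) and a minimal real one to compare:
printf '<!DOCTYPE html><html><script>/* loader */</script></html>' > fake.pdf
printf '%%PDF-1.4\n%%%%EOF\n' > real.pdf

# A genuine PDF begins with the 5-byte magic number "%PDF-".
is_pdf() {
  [ "$(head -c 5 "$1")" = "%PDF-" ]
}

is_pdf real.pdf && echo "real.pdf: genuine PDF"
is_pdf fake.pdf || echo "fake.pdf: HTML masquerading as a PDF"
```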
    

    This fucks people up if they use a script to grab PDFs. I run a getpdf script that essentially does this:

    $ torsocks wget --xattr "$url"
    $ exiftool -config ~/tools/conf/ExifTool_config -xmp-xmp:srcurl="$url" "$filename"
    

    That is a very useful way to fetch PDFs because it adds the URL of the PDF to the file’s metadata (redundantly, because --xattr info can be lost in some operations). So later if I want to recall where a PDF came from, such as to share the link with someone, I can just do: getfattr -d "$pdf".
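
    Assembled from the two commands above, the script looks roughly like this. The DRY_RUN switch is my own addition for illustration; drop it to actually fetch over Tor (assumes torsocks, wget with xattr support, and exiftool with the custom srcurl tag config are installed):

```shell
# A sketch of a getpdf script built from the two commands above.
# DRY_RUN=1 prints the commands instead of running them.
getpdf() {
  url=$1
  filename=${url##*/}   # local name wget will use for a simple URL
  run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$@"; else "$@"; fi; }

  # --xattr stores the origin URL in the user.xdg.origin.url attribute...
  run torsocks wget --xattr "$url"
  # ...and embed it redundantly in the PDF's XMP metadata, since
  # xattrs can be lost when the file is copied or archived.
  run exiftool -config ~/tools/conf/ExifTool_config \
      -xmp-xmp:srcurl="$url" "$filename"
}

DRY_RUN=1 getpdf 'https://example.com/paper.pdf'
```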

    So these motherfuckers who push fake PDFs force me to use a GUI and run their shitty surreptitious download-management application just to get a PDF, which then gets loaded by pdf.js. Then I have to point, click, and save the file. Then, on the command line, I have to go to whatever directory the file was saved in, manually run the metadata tools, and copy-paste the URL.

    The assholes have no idea what hassle this causes. But there are so few users smart enough to save metadata in the PDF that they can easily be marginalised. It’s yet another case where sophisticated users get burnt by enshittifying web admins who know they can nanny the unwitting masses.

    • e0qdk@reddthat.com
        5 days ago

      I’m not quite sure what’s going on there exactly, but I block JS in my browser (via NoScript). When I downloaded the link you provided with wget in the terminal, it returned what looks like a bot block page to me. (It includes the text “This question is for testing whether you are a human visitor and to prevent automated spam submission.” with an embedded CAPTCHA image.) If I load the link in Firefox though, it provides a PDF even with JS disabled in my browser.

      Usually that means a site is doing something like User-Agent sniffing or running a cookie check to block automated scrapers. But if I download the link with wget again after loading it once in my browser, it provides the PDF directly – so presumably the site has some middleware that allows requests by IP after you’ve passed an initial not-a-bot check? (Maybe time limited? I haven’t experimented to find out.)

      You might be able to get around this by setting the User-Agent and other headers in an initial request to impersonate the browser. (Use “Copy as cURL” on the URL in FF’s network dev tools to see how to emulate the request exactly as your browser would make it.)
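
      Something along these lines, sketched with wget. The User-Agent and Accept values below are placeholders, not the real ones — paste in whatever “Copy as cURL” shows:

```shell
# Make the initial request look like the browser, keeping any cookie the
# middleware sets so a follow-up request gets the real PDF.
url='https://www.lachambre.be/kvvcr/pdf_sections/pri/fiche/fr_12_00.pdf'
ua='Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0'
set -- wget --user-agent="$ua" \
       --header='Accept: text/html,application/pdf;q=0.9' \
       --keep-session-cookies --save-cookies cookies.txt \
       "$url"
echo "$@"   # inspect the command first; drop the echo to actually run it
# A later fetch can then reuse the approval:
#   wget --load-cookies cookies.txt "$url"
```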

      • rageAgainstCages@crazypeople.onlineOP
          2 days ago

        I just did a slightly more reliable test: disabled JS in the FF settings (about:config → javascript.enabled = false). The PDF was still fetched and rendered, with some delay.

        It is clear from wget that the server is at least willing to push HTML/JS garbage that masquerades as a “PDF”. But I believe you are correct that, with this sample URL, Firefox is getting a true PDF based on some opaque judgment by the server.

        So that URL turns out to be a bad example. I see this all the time, though. I will have to start collecting more PDF URLs that pull these shenanigans until one reproduces the anti-user behavior.