From the discussion in the report, it looks like the website is serving HTML not a PDF. (The author seems to be getting an Anubis-like bot block page instead of the file they expected?) This seems like it has nothing to do with running JS embedded in a PDF if I understand what’s going on correctly.
It’s not a block page. It’s a file with a PDF extension but with HTML+js contents. This is a shitty trend that’s becoming a plague. If you do all your browsing with a GUI browser you will never notice it because both Firefox and Chromium are happy to execute whatever JavaScript they encounter.
The bug report is a year old so it’s possible cafevanbommel.nl changed. For me, the sample URL in that report fetches nothing at all; it fails with ERROR 415: Unsupported Media Type.
Try this:
$ wget 'https://www.lachambre.be/kvvcr/pdf_sections/pri/fiche/fr_12_00.pdf'
$ file fr_12_00.pdf
fr_12_00.pdf: HTML document, ASCII text, with very long lines (31237), with CRLF, LF line terminators
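If you script your downloads, one cheap defence (my own suggestion, not something from the thread) is to check the magic bytes before trusting the extension, since a real PDF always starts with %PDF-:

```shell
#!/bin/sh
# Check whether a downloaded file is a real PDF by its magic bytes.
# Real PDFs begin with "%PDF-"; the fakes described above begin with HTML.
is_pdf() {
    head -c 5 "$1" | grep -q '^%PDF-'
}

# Demo with two local files standing in for real downloads:
printf '%%PDF-1.7\n' > real.pdf
printf '<html><script>window.location=...</script></html>\n' > fake.pdf

for f in real.pdf fake.pdf; do
    if is_pdf "$f"; then
        echo "$f: genuine PDF"
    else
        echo "$f: NOT a PDF (probably HTML+JS)"
    fi
done
```

A wrapper script could refuse to keep (or could rename) anything that fails this check, so the fakes at least don’t masquerade as .pdf on disk.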
This fucks people up if they use a script to grab PDFs. I run a getpdf script that essentially does this:
$ torsocks wget --xattr "$url"
$ exiftool -config ~/tools/conf/ExifTool_config -xmp-xmp:srcurl='$url' "$filename"
That is a very useful way to fetch PDFs because it adds the URL of the PDF to the file’s metadata (redundantly, because --xattr info can be lost in some operations). So later if I want to recall where a PDF came from, such as to share the link with someone, I can just do: getfattr -d "$pdf".
So these motherfuckers who push fake PDFs force me to use a GUI to run their shitty surreptitious download management application just to get a PDF, which is then loaded by pdf.js. Then I need to point, click, save the file. Then on the command line I have to go to whatever directory the file was saved in, manually run the metadata tools, and copy-paste the URL.
The assholes have no idea what hassle this causes. But there are so few users smart enough to save metadata in the PDF that they can easily be marginalised. It’s yet another case where sophisticated users get burnt by enshittifying web admins who know they can nanny the unwitting masses.
I’m not quite sure what’s going on there exactly, but I block JS in my browser (via NoScript). When I downloaded the link you provided with wget in the terminal, it returned what looks like a bot block page to me. (It includes the text “This question is for testing whether you are a human visitor and to prevent automated spam submission.” with an embedded CAPTCHA image.) If I load the link in Firefox, though, it provides a PDF even with JS disabled. Usually that means a site is doing something like User-Agent sniffing or running a cookie check to block automated scrapers, but if I download the link with wget again after loading it once in my browser, it provides the PDF directly. So presumably the site has some middleware that allowlists your IP after you’ve passed an initial not-a-bot check? (Maybe time limited? Haven’t experimented to find out.)
You might be able to get around this by setting User-Agent and other headers in an initial request to impersonate the browser. (Use “Copy as cURL” on the URL in FF’s network dev tools to see how to emulate the request exactly as your browser would make it.)
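For what it’s worth, a minimal sketch of that approach (the header values below are examples of what “Copy as cURL” typically emits, not the exact ones your browser will send; use your own):

```shell
#!/bin/sh
# Sketch: fetch a URL while impersonating a browser, by sending the kind of
# headers Firefox sends. Replace the values with the ones from your own
# "Copy as cURL" output to match your session exactly.
fetch_as_browser() {
    # $1 = URL, $2 = output file
    curl -sSL \
      -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0' \
      -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' \
      -H 'Accept-Language: en-US,en;q=0.5' \
      -o "$2" "$1"
}

# Example (network access required, so commented out):
# fetch_as_browser 'https://www.lachambre.be/kvvcr/pdf_sections/pri/fiche/fr_12_00.pdf' out.pdf
# head -c 5 out.pdf    # a real PDF starts with "%PDF-"
```

If the site gates on a cookie rather than headers, this alone won’t help; in that case exporting the cookie jar from a browser session and passing it with curl’s -b option would be the next thing to try.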