More fun publisher surveillance:
Elsevier embeds a hash in the PDF metadata that is *unique for each time a PDF is downloaded*. This is a diff between the metadata from two downloads of the same paper. Combined with access timestamps, they can uniquely identify the source of any shared PDFs.
Links:
exiftool: https://www.exiftool.org/
qpdf: https://qpdf.sourceforge.io/
dangerzone (GUI, render PDF as images, then re-OCR everything): https://dangerzone.rocks/
mat2 (render PDF as images, don't OCR): https://0xacab.org/jvoisin/mat2
here's a shell script that recursively removes metadata from pdfs in a provided (or current) directory as described above. It works on mac/*nix-like computers, and you need to have qpdf and exiftool installed:
https://gist.github.com/sneakers-the-rat/172e8679b824a3871decd262ed3f59c6
The metadata appears to be preserved on papers from sci-hub. since sci-hub works by using harvested academic credentials to download papers, this would allow publishers to identify which accounts need to be closed/secured
https://twitter.com/json_dirs/status/1486135162505072641?t=Wg5XAzujycz79Cop_ap8vQ&s=19
for any security researchers out there, here are a few more "hashes" that a few have noted do not appear to be random and might be decodable. exiftool apparently squashed the whitespace so there is a bit more structure to them than in the OP:
https://gist.github.com/sneakers-the-rat/6d158eb4c8836880cf03191cb5419c8f
this is the way to get the correct tags:
(on mac i needed to install gnu grep with homebrew `brew install grep` and then use `ggrep` )
will follow up with dataset tomorrow.
https://twitter.com/horsemankukka/status/1486268962119761924?s=20
updated the above gist with correctly extracted tags, and included python code to extract your own, feel free to add them in the comments. since we don't know what they contain yet, I'm not adding other metadata. definitely patterned, not a hash, but idk yet.
https://twitter.com/json_dirs/status/1486289288115359747?t=QwmBvbOgh2fCkjSOZSh3Fw&s=19
@jonny they look kind of meaningful. Not base64. Any ideas what could be in there?
@derwinmcgeary
yeah, I thought so too but don't know where to start reverse engineering it :/
@derwinmcgeary
it decodes with base85, but it's not Unicode. not sure if that's meaningful
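A minimal sketch of that check, for anyone who wants to repeat it on their own tags. Which base85 variant applies is an assumption, so this tries both of the ones in Python's standard library (Ascii85 and the git-style b85) and then tests whether the decoded bytes are valid UTF-8. The function names are made up for illustration.

```python
import base64


def try_base85(tag):
    """Try both stdlib base85 variants; return the raw bytes for each that decodes."""
    results = {}
    decoders = (("ascii85", base64.a85decode), ("b85", base64.b85decode))
    for name, decoder in decoders:
        try:
            results[name] = decoder(tag)
        except ValueError:
            # This variant rejects the string (bad character or overflow).
            continue
    return results


def is_utf8(raw):
    """True if the bytes form valid UTF-8 text, False otherwise."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A string that decodes but fails `is_utf8` is only known to be binary; it could be packed fields, compressed data, or ciphertext, and this check cannot tell those apart.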
@jonny I do not have any IT skills, but if I did I’d love to write a script to remove metadata from PDFs. Adobe has them wrapped up pretty well.
@beckett @jonny "PDFparanoia" was a project for exactly this - to strip identifying watermarks and metadata from shared academic PDFs. But it fell victim to the Python 2 to 3 transition and the mess of the PDF libraries in particular, and then fell to bitrot. Would be nice to see it brought back to health.
@seachaint
@beckett
yes, lives on in mat2 and I think pdfparanoia specifically redirects to dangerzone
@matthew
pretty straightforward to get at least the top level metadata
https://social.coop/@jonny/107686442819944047
The cognitive load of constantly dealing with greedy corporate rentiers is exhausting.
@jonny for the normativity of science see the discourse of STS (science and technology studies), great field!
@shusha
yes definitely, love it and spend basically all my time reading it nowadays ❤️
@jonny I wonder whether uploading every paper to sci-hub twice would be feasible (i.e. would we still have enough people do that). (If we did so, then it would allow sci-hub to verify with reasonable certainty that whatever watermark-removal method they would use still works.)
@robryk
I think it may be easier to scrub it server side, like to have admins clean the PDFs they have. I don't know of any crowdsourced sci-hub-like projects. scrubbing metadata does seem to render the PDFs identical
@jonny And then obviously the watermarking techniques will adapt. Asking for two copies is a way to ensure that whatever we are doing still manages to scrub the watermark (they should be identical after scrubbing).
@robryk
yes, definitely. all of the above. fix what you have now, adapt to changes, making double grabs part of the protocol makes sense :)
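The double-grab check could be sketched like this: hash two independently downloaded copies after scrubbing and see whether they match. This assumes a byte-for-byte match is the right test, and the function names are hypothetical. A mismatch means something unique survived the scrub; a match only shows that the two copies no longer differ from each other, not that nothing identifying is left.

```python
import hashlib


def sha256_of(path):
    """Hash a file in chunks so large PDFs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scrubbed_copies_match(copy_a, copy_b):
    """True if two scrubbed downloads of the same paper are byte-identical."""
    return sha256_of(copy_a) == sha256_of(copy_b)
```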
@jonny word of caution is that while removing exif is good, knowing publishers there's a bunch of other ways they'd directly include such trackers into the file, in a less human/machine readable spot than EXIF. so be careful
@f0x @jonny because of this I am tempted to write a tool that screencaps books, adds a little bit of invisible noise to the page, sharpens the image, and then stitches the screencaps into the pdf with OCR. Just don't rely on publisher data to be trustworthy at all when half their gig these days is basically legal malware development.
I have an old program that I can probably retrofit to do this if I ever have the time.
ya ya see dangerzone linked here
https://social.coop/@jonny/107685931861268392
@jonny they're almost getting to the level of ISO standards for metadata f'wittery.
For a while, many ISO standards that you bought (for $$$$) looked like a bad photocopy. If you zoomed in really close to the marks on the page, they were made up of a pattern of punctuation characters. Totally screwed up any screen reading, though
@jonny Adding unique identifiers on stuff you distribute to be able to trace where it gets copied is hardly a new thing, and I don't think it is good terminology to call it "surveillance". As the hash is a passive part of the document, it is not used (and possibly can't be used) to spy on you.
I don't think it is productive to call this practice "surveillance", as it just makes it more difficult for readers to differentiate between levels of threats to their privacy.
@jonny@social.coop seems like some countermeasures against scihub, libgen and other shadow libraries that provide those PDFs for free 🤨
@jonny this is the same technique that was being used in the OS designed in North Korea called Red Star OS. It was in the Chaos Congress talk about it.
@kawaiipunk
interesting ... will take a look.
You can see for yourself using exiftool.
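If you have two downloads of the same paper, a rough way to reproduce the diff is to dump each file's metadata with `exiftool -j` (JSON output) and compare the two. This is a sketch, not the code from the gist: the helper names and the list of ignored file-system tags are assumptions, and you may need to ignore more tags depending on your exiftool version.

```python
import json
import subprocess

# File-system tags that differ between any two files, so they are skipped (assumed list).
IGNORED_TAGS = {"SourceFile", "FileName", "Directory", "FileModifyDate",
                "FileAccessDate", "FileInodeChangeDate"}


def read_metadata(path):
    """Return exiftool's tags for one file as a dict (requires exiftool on PATH)."""
    result = subprocess.run(["exiftool", "-j", str(path)],
                            check=True, capture_output=True, text=True)
    return json.loads(result.stdout)[0]  # -j prints a list with one object per file


def diff_metadata(a, b, ignore=IGNORED_TAGS):
    """Map each tag whose value differs to its (value in a, value in b) pair."""
    keys = sorted((set(a) | set(b)) - set(ignore))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Usage would be something like `diff_metadata(read_metadata("copy1.pdf"), read_metadata("copy2.pdf"))`, where both file names are placeholders.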
To remove all of the top-level metadata, you can use exiftool and qpdf:
exiftool -all:all= <path.pdf> -o <output1.pdf>
qpdf --linearize <output1.pdf> <output2.pdf>
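The qpdf step matters: exiftool's documentation notes that its PDF edits are reversible, because the old metadata stays in the file until the file is rewritten, and it suggests `qpdf --linearize` for that. Below is a sketch of wrapping the same two commands in Python and pointing them at every PDF under a directory. It is not the shell script from the gist; the function names and the `scrubbed_` output prefix are assumptions, and both tools must be installed.

```python
import subprocess
import tempfile
from pathlib import Path


def scrub_commands(src, intermediate, dst):
    """Build the two commands from the post as argument lists."""
    return [
        ["exiftool", "-all:all=", str(src), "-o", str(intermediate)],
        ["qpdf", "--linearize", str(intermediate), str(dst)],
    ]


def scrub_pdf(src, dst):
    """Strip top-level metadata from src and write the rewritten PDF to dst."""
    with tempfile.TemporaryDirectory() as tmp:
        intermediate = Path(tmp) / "stripped.pdf"
        for cmd in scrub_commands(src, intermediate, dst):
            subprocess.run(cmd, check=True)  # raises if either tool fails


def find_pdfs(root="."):
    """Recursively list PDF files under root (case-insensitive extension)."""
    return sorted(p for p in Path(root).rglob("*")
                  if p.is_file() and p.suffix.lower() == ".pdf")


def scrub_tree(root="."):
    """Write a scrubbed copy next to each PDF, prefixed with 'scrubbed_'."""
    for pdf in find_pdfs(root):
        if pdf.name.startswith("scrubbed_"):
            continue  # skip outputs from an earlier run
        scrub_pdf(pdf, pdf.with_name("scrubbed_" + pdf.name))
```

Writing a copy rather than overwriting in place is a deliberate choice here, so the originals survive if one of the tools fails partway through.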
To remove *all* metadata, you can use dangerzone or mat2