More fun publisher surveillance:
Elsevier embeds a hash in the PDF metadata that is *unique for each time a PDF is downloaded*; this is a diff between the metadata of two copies of the same paper. Combined with access timestamps, they can uniquely identify the source of any shared PDF.
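to reproduce that kind of diff yourself, something like this rough bash sketch could work (the grep pattern is a guess that only catches simple uncompressed `/Key (value)` strings in the raw bytes; real metadata may be compressed or live in XMP streams):

```shell
#!/usr/bin/env bash
# Compare raw metadata strings from two downloads of the same paper.
# Lines that differ between the two are candidate per-download identifiers.
meta_diff() {
  diff <(grep -ao '/[A-Za-z]\+[[:space:]]*([^)]*)' "$1") \
       <(grep -ao '/[A-Za-z]\+[[:space:]]*([^)]*)' "$2")
}

# usage: meta_diff copy1.pdf copy2.pdf
```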
here's a shell script that recursively removes metadata from PDFs in a provided (or current) directory, as described above. it's for mac/*nix-like systems, and you need qpdf and exiftool installed:
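a minimal sketch of what such a script could look like (my guess at the approach, not the actual script from the thread; assumes `exiftool` and `qpdf` are on PATH):

```shell
#!/usr/bin/env bash
# Recursively strip metadata from every PDF under a directory
# (defaults to the current one). Requires exiftool and qpdf.
scrub_pdfs() {
  local dir="${1:-.}"
  find "$dir" -type f -name '*.pdf' -print0 |
  while IFS= read -r -d '' f; do
    exiftool -all= -overwrite_original "$f"  # drop every writable metadata tag
    qpdf --linearize --replace-input "$f"    # rewrite file, discarding orphaned objects
  done
}

# usage: scrub_pdfs ~/papers
```

the qpdf pass matters because exiftool can leave the old metadata objects orphaned inside the file; rewriting the PDF drops them for good.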
The metadata appears to be preserved on papers from sci-hub. Since sci-hub works by using harvested academic credentials to download papers, this would allow publishers to identify which accounts need to be closed or secured.
for any security researchers out there, here are a few more "hashes" that several people have noted do not appear to be random and might be decodable. exiftool apparently squashed the whitespace, so there is a bit more structure to them than in the OP:
this is the way to get the correct tags:
(on mac I needed to install GNU grep with homebrew, `brew install grep`, and then use `ggrep`)
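something along these lines, maybe (a sketch: the regex only catches uncompressed `/Key (value)` strings, and `cat -A` is just one way to make tabs and line ends visible; on mac, swap in `ggrep` for `grep`):

```shell
#!/usr/bin/env bash
# Pull metadata strings straight out of the raw PDF bytes, so any
# whitespace structure in the values survives intact (exiftool would
# normalize it away).
raw_tags() {
  # -a: treat binary as text; -o: print only the matching strings
  # cat -A marks tabs as ^I and line ends as $, exposing hidden structure
  grep -ao '/[A-Za-z]\+[[:space:]]*([^)]*)' "$1" | cat -A
}

# usage: raw_tags paper.pdf
```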
will follow up with dataset tomorrow.
updated the above gist with correctly extracted tags, and included python code to extract your own; feel free to add them in the comments. since we don't know what they contain yet, I'm not adding other metadata. the values are definitely patterned, not a hash, but I don't know what they encode yet.
@jonny I do not have any IT skills, but if I did I’d love to write a script to remove metadata from PDFs. Adobe has them wrapped up pretty well.
@beckett @jonny "PDFparanoia" was a project for exactly this - to strip identifying watermarks and metadata from shared academic PDFs. But it fell victim to the Python 2 to 3 transition and the mess of the PDF libraries in particular, and then fell to bitrot. Would be nice to see it brought back to health.
@jonny for the normativity of science, see the discourse of STS (science and technology studies), a great field!
@jonny I wonder whether uploading every paper to sci-hub twice would be feasible (i.e. would we still have enough people do that). (If we did so, then it would allow sci-hub to verify with reasonable certainty that whatever watermark-removal method they would use still works.)
I think it may be easier to scrub it server-side, i.e. have admins clean the PDFs they host. I don't know of any crowdsourced sci-hub-like projects. scrubbing metadata does seem to render the PDFs identical.
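one way to check that claim, assuming you have two scrubbed copies on hand (filenames are placeholders):

```shell
#!/usr/bin/env bash
# Exit 0 when two files are byte-identical, which is what we'd expect
# from two scrubbed copies of the same paper if the watermark lived
# only in the metadata that was removed.
identical() {
  # hash stdin rather than the path, so only file content is compared
  [ "$(sha256sum < "$1")" = "$(sha256sum < "$2")" ]
}

# usage: identical scrubbed1.pdf scrubbed2.pdf && echo "watermark gone"
```

if the hashes still differ after scrubbing, some identifier survived outside the metadata.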
@jonny And then obviously the watermarking techniques will adapt. Asking for two copies is a way to ensure that whatever we are doing still manages to scrub the watermark (they should be identical after scrubbing).
yes, definitely. all of the above. fix what you have now, adapt to changes, making double grabs part of the protocol makes sense :)
@jonny word of caution: while removing EXIF is good, knowing publishers, there are a bunch of other ways they could directly embed such trackers in the file, in a less human/machine-readable spot than EXIF. so be careful
@f0x @jonny because of this I am tempted to write a tool that screencaps books, adds a little invisible noise to each page, sharpens the image, and then stitches the screencaps into a PDF with OCR. just don't rely on publisher data to be trustworthy at all when half their gig these days is basically legal malware development.
I have an old program that I can probably retrofit to do this if I ever have the time.
@jonny they're almost getting to the level of ISO standards for metadata f'wittery.
For a while, many ISO standards that you bought (for $$$$) looked like a bad photocopy. If you zoomed in really close to the marks on the page, they were made up of a pattern of punctuation characters. Totally screwed up screen readers, though.
@jonny Adding unique identifiers to stuff you distribute, to be able to trace where it gets copied, is hardly a new thing, and I don't think it is good terminology to call it "surveillance". As the hash is a passive part of the document, it is not used (and possibly can't be used) to spy on you.
I don't think it is productive to call this practice "surveillance", as it just makes it more difficult for readers to differentiate between levels of threats to their privacy.
@email@example.com seems like some countermeasures against scihub, libgen and other shadow libraries that provide those PDFs for free 🤨
@jonny this is the same technique that was used in Red Star OS, the operating system developed in North Korea. It was covered in the Chaos Communication Congress talk about it.