More fun publisher surveillance:
Elsevier embeds a hash in the PDF metadata that is *unique for each time a PDF is downloaded*. This is a diff between the metadata of two downloads of the same paper. Combined with access timestamps, they can uniquely identify the source of any shared PDF.

You can see for yourself using exiftool.
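
A rough sketch of that comparison in python, assuming exiftool is on your PATH. `exiftool -j` prints the metadata as JSON; the function names here are made up for illustration, and this is not the code behind the screenshot.

```python
import json
import subprocess


def read_metadata(path):
    """Return exiftool's view of one file as a dict (needs exiftool on PATH)."""
    result = subprocess.run(
        ["exiftool", "-j", str(path)],
        check=True,
        capture_output=True,
        text=True,
    )
    # exiftool -j prints a JSON list with one object per input file.
    return json.loads(result.stdout)[0]


def metadata_diff(first, second):
    """Map each tag that differs (or is missing on one side) to both values."""
    return {
        tag: (first.get(tag), second.get(tag))
        for tag in sorted(set(first) | set(second))
        if first.get(tag) != second.get(tag)
    }
```

Calling `metadata_diff(read_metadata("copy1.pdf"), read_metadata("copy2.pdf"))` on two downloads of the same paper should list the per-download tag. File-level fields such as the file name and modification date will also show up, since those differ between any two files.
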
To remove all of the top-level metadata, you can use exiftool and qpdf:

exiftool -all:all= <path.pdf> -o <output1.pdf>
qpdf --linearize <output1.pdf> <output2.pdf>

To remove *all* metadata, you can use dangerzone or mat2

Also present in the metadata are NISO tags for document status indicating the "final published version" (VoR), and limits on what domains it should be present on. Elsevier scans for PDFs with this metadata, so good idea to strip it any time you're sharing a copy.

dangerzone (GUI, render PDF as images, then re-OCR everything):
mat2 (render PDF as images, don't OCR):

here's a shell script that recursively removes metadata from pdfs in a provided (or current) directory as described above. For mac/*nix-like computers, and you need to have qpdf and exiftool installed:
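
A minimal sketch of the same steps in python, not the original shell script. It assumes exiftool and qpdf are installed and simply wraps the two commands from earlier in the thread; the function names and the `_clean.pdf` suffix are choices made for this example.

```python
import subprocess
import tempfile
from pathlib import Path


def find_pdfs(root):
    """Return every .pdf file under root, recursively, in a stable order."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def strip_commands(src, tmp, dst):
    """Build the two commands from the post: exiftool first, then qpdf."""
    return [
        ["exiftool", "-all:all=", str(src), "-o", str(tmp)],
        ["qpdf", "--linearize", str(tmp), str(dst)],
    ]


def strip_pdf(src, dst):
    """Write a metadata-stripped copy of src to dst; src is left untouched."""
    with tempfile.TemporaryDirectory() as workdir:
        # exiftool -o refuses to overwrite, so stage into a fresh directory.
        tmp = Path(workdir) / "stage1.pdf"
        for cmd in strip_commands(src, tmp, dst):
            subprocess.run(cmd, check=True)


def strip_tree(root="."):
    """Strip every PDF under root, writing name_clean.pdf next to each one."""
    for pdf in find_pdfs(root):
        if pdf.stem.endswith("_clean"):
            continue  # skip outputs from an earlier run
        strip_pdf(pdf, pdf.with_name(pdf.stem + "_clean.pdf"))
```

`strip_tree("papers/")` would process a whole folder. The originals are kept on purpose so you can compare before and after.
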

The metadata appears to be preserved on papers from sci-hub. since it works by using harvested academic credentials to download papers, this would allow publishers to identify which accounts need to be closed/secured

for any security researchers out there, here are a few more "hashes" that a few have noted do not appear to be random and might be decodable. exiftool apparently squashed the whitespace so there is a bit more structure to them than in the OP:

this is the way to get the correct tags:
(on mac i needed to install gnu grep with homebrew `brew install grep` and then use `ggrep` )
will follow up with dataset tomorrow.

of course there's smarter watermarking, the metadata is notable because you could scan billions of pdfs fast. this comment on HN got me thinking about this PDF /OpenAction I couldn't make sense of earlier, on open, access metadata, so something with sizes and layout...

updated the above gist with correctly extracted tags, and included python code to extract your own, feel free to add them in the comments. since we don't know what they contain yet not adding other metadata. definitely patterned, not a hash, but idk yet.

you go to school to study "the brain" and then the next thing you know you're learning how to debug surveillance in PDF rendering to understand how publishers have so contorted the practice of science for profit. how can there be "normal science" when this is normal?

follow-up: there does not appear to be any further watermarking. taking two files with different identifying tags, stripping metadata, and relinearizing with qpdf's --deterministic-id flag yields PDFs that are identical under diff, i.e. no differentiating watermark (but plz check my work)

which is surprising to me, so I'm a little hesitant to make that as a general claim
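
If you want to repeat that check, one way (a sketch, not necessarily how it was done here) is to hash both stripped files and compare the digests. This assumes both copies were already stripped and relinearized with something like `qpdf --linearize --deterministic-id in.pdf out.pdf`; the helper names are made up.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(first, second):
    """True when the two files have the same SHA-256 digest."""
    return sha256_of(first) == sha256_of(second)
```

If `same_bytes("copy1_clean.pdf", "copy2_clean.pdf")` is false, something that differs between downloads survived the stripping.
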

@jonny they look kind of meaningful. Not base64. Any ideas what could be in there?

yeah, I thought so too but don't know where to start reverse engineering it :/

it decodes with base85, but it's not Unicode. not sure if that's meaningful
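
For anyone who wants to poke at their own tag, here is a small sketch using python's standard `base64` module. Which base85 variant was tried above isn't stated, so this tries both the Adobe Ascii85 and the git-style alphabets; the function names are hypothetical.

```python
import base64


def try_base85(text):
    """Return the raw bytes for each base85 variant that accepts the string."""
    decoders = {
        "ascii85": base64.a85decode,  # Adobe-style alphabet, used inside PDFs
        "base85": base64.b85decode,   # git-style alphabet
    }
    results = {}
    for name, decode in decoders.items():
        try:
            results[name] = decode(text)
        except ValueError:
            pass  # string contains characters this variant does not allow
    return results


def looks_like_utf8(raw):
    """True when the bytes decode cleanly as UTF-8 text."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A string decoding without error is weak evidence on its own: many strings happen to fit one of the base85 alphabets, so bytes that are not text could be binary data or just noise.
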

@jonny I do not have any IT skills, but if I did I’d love to write a script to remove metadata from PDFs. Adobe has them wrapped up pretty well.

@beckett @jonny "PDFparanoia" was a project for exactly this - to strip identifying watermarks and metadata from shared academic PDFs. But it fell victim to the Python 2 to 3 transition and the mess of the PDF libraries in particular, and then fell to bitrot. Would be nice to see it brought back to health.

yes, lives on in mat2 and I think pdfparanoia specifically redirects to dangerzone


The cognitive load of constantly dealing with greedy corporate rentiers is exhausting.

@jonny for the normativity of science see the discourse of STS (science and technology studies), great field!

yes definitely, love it and spend basically all my time reading it nowadays ❤️

@jonny I wonder whether uploading every paper to sci-hub twice would be feasible (i.e. would we still have enough people do that). (If we did so, then it would allow sci-hub to verify with reasonable certainty that whatever watermark-removal method they would use still works.)

I think it may be easier to scrub it server side, like to have admins clean the PDFs they have. I don't know of any crowdsourced sci-hub-like projects. scrubbing metadata does seem to render the PDFs identical

@jonny And then obviously the watermarking techniques will adapt. Asking for two copies is a way to ensure that whatever we are doing still manages to scrub the watermark (they should be identical after scrubbing).

yes, definitely. all of the above. fix what you have now, adapt to changes, making double grabs part of the protocol makes sense :)

@jonny word of caution is that while removing exif is good, knowing publishers there's a bunch of other ways they'd directly include such trackers into the file, in a less human/machine readable spot than EXIF. so be careful

@f0x @jonny because of this I am tempted to write a tool that screencaps books, adds a little bit of invisible noise to the page, sharpens the image, and then stitches the screencaps into the pdf with OCR. Just don't rely on publisher data to be trustworthy at all when half their gig these days is basically legal malware development.

I have an old program that I can probably retrofit to do this if I ever have the time.

@jonny @f0x oh lol sorry yeah that's what that dangerzone does already pretty much :P

I suspect it would be a good idea to compare two PDFs from two different sources. If the hashes match, it's all good. If they don't, strip the EXIF. If they still don't match... find the difference somehow.

@jonny I often worry momentarily about that sort of thing happening whenever I download an image or a pdf from the net, even if I'm not planning to share it. I thought I was being too paranoid.

@jonny they're almost getting to the level of ISO standards for metadata f'wittery.

For a while, many ISO standards that you bought (for $$$$) looked like a bad photocopy. If you zoomed in really close to the marks on the page, they were made up of a pattern of punctuation characters. Totally screwed up any screen reading, though

Of course if you grab two copies and do a binary diff on them you can see exactly where those bytes are and modify them.
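
A sketch of that binary diff in python. It compares position by position, so it only gives clean answers when the identifying fields have the same length in both copies; if one copy has extra bytes, everything after that point will show up as different. The function name is made up.

```python
def diff_ranges(first, second):
    """Return half-open (start, end) byte ranges where the two inputs differ.

    Bytes past the end of the shorter input count as differing.
    """
    ranges = []
    start = None
    longest = max(len(first), len(second))
    for i in range(longest):
        same = i < len(first) and i < len(second) and first[i] == second[i]
        if not same and start is None:
            start = i  # a differing run begins here
        elif same and start is not None:
            ranges.append((start, i))  # the run ended just before i
            start = None
    if start is not None:
        ranges.append((start, longest))
    return ranges
```

Read both files with `open(path, "rb").read()` and pass the bytes in; the returned offsets tell you which parts of the file to inspect.
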

@jonny Adding unique identifiers on stuff you distribute to be able to trace where it gets copied is hardly a new thing, and I don't think it is good terminology to call it "surveillance". As the hash is a passive part of the document, it is not used (and possibly can't be used) to spy on you.
I don't think it is productive to call this practice "surveillance", as it just makes it more difficult for readers to differentiate between levels of threats to their privacy.

seems like some countermeasures against scihub, libgen and other shadow libraries that provide those PDFs for free 🤨

@jonny this is the same technique that was being used in the OS designed in North Korea called Red Star OS. It was in the Chaos Congress talk about it.

@jonny ignore me im retarded and only saw one post.