
TL;DR
Under a realistic malicious host threat model, we study ten recent black-box LLM fingerprinting schemes. For nine out of ten, we construct simple, efficient attacks that achieve near-perfect attack success (ASR ≈ 100%) while preserving >90% of the model’s original benchmark utility. Even the remaining scheme—domain-specific watermarks—can be significantly weakened (≈ 65% ASR at ≈ 92% utility).
Across schemes, we find four structural vulnerabilities, each of which a malicious model host can exploit with a corresponding attack to bypass fingerprint verification:
- Verbatim verification: memorized fingerprints are token-by-token fragile
  → Output suppression: lightly perturb all model outputs
- Overconfidence: fingerprints rely on probability spikes
  → Output detection: selectively perturb overconfident outputs
- Unnatural queries: intrinsic fingerprints look unnatural
  → Input detection: filter out weird, high-perplexity queries
- Statistical signatures: statistical fingerprints leak patterns
  → Statistical analysis: learn global signature statistics and correct for them
Anshul Nasery†, Edoardo Contente◆, Alkin Kaz‡, Pramod Viswanath‡◆, Sewoong Oh†◆
† University of Washington
‡ Princeton University
◆ Sentient
Links
- Paper (PDF): https://arxiv.org/pdf/2509.26598
- arXiv: https://arxiv.org/abs/2509.26598
- alphaXiv: https://www.alphaxiv.org/abs/2509.26598
- Code (under maintenance): https://github.com/sentient-agi/mlprints
Overview
Our main goal is a framework for critically evaluating existing model fingerprinting schemes under a more realistic scenario: a malicious model host who actively attacks the verification process while preserving the model's utility.
Thus far, the lack of such systematic evaluation has led to:
(i) fingerprinting schemes being introduced without proper robustness guardrails,
(ii) most of those methods failing under easy-to-apply attack scenarios, and
(iii) inconsistent comparisons to existing baselines.
To rectify this, our goals are to:
- Provide user-friendly benchmarks to measure the utility–ASR curve for any fingerprinting method under several attacks (see our GitHub repo).
- Provide a family of attacks that are powerful against existing fingerprint families.
- Demonstrate the necessity of systematic stress-testing by numerically showing how existing methods fail under our attacks.
In our repository (https://github.com/sentient-xyz/modelprints), we provide implementations of Chain&Hash, FPEdit, EditMF, RoFL, MergePrint, ProFLingo, and DSWatermark, and test their robustness against the four vulnerability classes and corresponding attacks.
We adopt techniques from the backdoor, jailbreaking, and watermarking literature, and further design stronger attacks (e.g., output detection and statistical analysis) tailored to fingerprint-specific weaknesses such as exact memorization, while maintaining high utility. We also explore stronger defenses (e.g., approximate verification) to check that these vulnerabilities are fundamental rather than easily patchable.
Overall, relatively simple attacks suffice, motivating future fingerprint designs to adopt systematic robustness benchmarks and offering insights into factors affecting fingerprint robustness.
Fingerprinting Landscape
Following recent fingerprinting surveys, we group existing methods into:
- Invasive (trained) fingerprints
- Memorization-based: train specific (query, response) pairs into the model
- Statistical: embed subtle n-gram biases detectable via hypothesis tests
- Intrinsic (natural) fingerprints: discover existing queries that elicit unique responses, without any additional training.
Threat Model
We adopt a practical malicious-host threat model:
- The model host possesses the fingerprinted weights and serves an API.
- The host aims to preserve utility while evading verification.
- The fingerprinting protocol is public, but the specific instantiation (the “private key”) is secret (Kerckhoffs’s principle¹).
- The host may run offline white-box inference and lightweight post-processing but cannot replace the model entirely.
Metrics
- Attack Success Rate (ASR): fraction of fingerprint queries that fail verification (1 − effectiveness).
- Normalized Utility: accuracy on IFEval, GSM8K, GPQA-Diamond, and TriviaQA, normalized by the un-attacked model's accuracy.
A perfect attack has ASR ≈ 1 and Normalized Utility ≈ 1.
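A minimal sketch of how these two metrics combine, assuming hypothetical verification outcomes and benchmark scores (not outputs of the actual evaluation harness):

```python
def attack_success_rate(verified: list[bool]) -> float:
    # ASR = fraction of fingerprint queries whose verification FAILS (1 - effectiveness).
    return sum(not v for v in verified) / len(verified)

def normalized_utility(attacked_scores: dict, clean_scores: dict) -> float:
    # Per-benchmark accuracy of the attacked model, normalized by the
    # un-attacked model's accuracy, averaged over benchmarks.
    ratios = [attacked_scores[b] / clean_scores[b] for b in clean_scores]
    return sum(ratios) / len(ratios)

asr = attack_success_rate([False, False, True, False])            # 0.75
util = normalized_utility({"GSM8K": 0.78, "IFEval": 0.70},
                          {"GSM8K": 0.82, "IFEval": 0.74})        # ≈ 0.95
```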
Key insight: Under this realistic adversary, many popular schemes share structural weaknesses that are easy to exploit.
The Four Common Attacks (and Their Vulnerabilities)
1) Verbatim Verification → Output Suppression
Memorization-based schemes verify responses by matching fixed patterns (exact prefixes, substrings, or keywords). An attacker can push generation away from these patterns while preserving utility.
Strategies for the first n decoding steps:
- SuppressTop-k (ST): drop the top-k most probable tokens and sample from the rest.
- SuppressNeighbor (SN): suppress likely fingerprint tokens and lexical variants.
- SuppressLookahead (SL): simulate future steps, identify tokens leading to fingerprints, and suppress them early.
Even permissive keyword-based verifiers can be broken: stronger suppression (SL) raises ASR from 0% to 87.5% while maintaining >90% normalized utility.
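A minimal sketch of the SuppressTop-k (ST) idea as a Hugging Face `LogitsProcessor`; the class name and the values of `k` and `n_steps` are illustrative choices, not the paper's exact implementation:

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class SuppressTopK(LogitsProcessor):
    def __init__(self, prompt_len: int, k: int = 3, n_steps: int = 8):
        self.prompt_len, self.k, self.n_steps = prompt_len, k, n_steps

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        step = input_ids.shape[-1] - self.prompt_len   # current decoding step
        if step < self.n_steps:
            # Mask the k most probable tokens so a memorized fingerprint
            # response cannot be reproduced verbatim at these early positions.
            topk = scores.topk(self.k, dim=-1).indices
            scores = scores.scatter(-1, topk, float("-inf"))
        return scores

# Usage (model/tokenizer loading omitted):
# processors = LogitsProcessorList([SuppressTopK(prompt_len=inputs["input_ids"].shape[-1])])
# model.generate(**inputs, logits_processor=processors, do_sample=True)
```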
2) Overconfidence → Output Detection
Memorized fingerprints often induce overconfident outputs, where specific tokens receive extremely high probability.
We exploit this via a confidence gate:
- If the maximum token probability exceeds a threshold (e.g., >0.9), treat the step as suspicious and trigger suppression only then.
This selective suppression preserves most normal outputs, recovering utility while maintaining high ASR.
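A sketch of the confidence gate layered on top of suppression; the threshold and `k` are illustrative, and suppression fires only at steps where the model is suspiciously certain:

```python
import torch
from transformers import LogitsProcessor

class ConfidenceGatedSuppression(LogitsProcessor):
    def __init__(self, threshold: float = 0.9, k: int = 3):
        self.threshold, self.k = threshold, k

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor):
        probs = torch.softmax(scores, dim=-1)
        # Only steps where one token carries >90% of the mass are treated as
        # suspicious (likely a memorized fingerprint token); normal decoding
        # steps are left untouched, which preserves utility.
        suspicious = probs.max(dim=-1).values > self.threshold
        if suspicious.any():
            topk = scores.topk(self.k, dim=-1).indices
            masked = scores.scatter(-1, topk, float("-inf"))
            scores = torch.where(suspicious.unsqueeze(-1), masked, scores)
        return scores
```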
3) Unnatural Queries → Input Detection
Intrinsic fingerprints often rely on unnatural prompts, such as cryptographic strings or GCG-optimized suffixes.
A malicious host can:
- Use a small helper LLM to compute input perplexity.
- Apply a simple threshold to flag high-perplexity queries.
- Route flagged queries to a backup model or produce a generic refusal.
For schemes like RoFL, a single perplexity threshold nearly perfectly separates fingerprint queries from real user prompts, yielding ~100% ASR with negligible utility loss.
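A minimal sketch of such an input filter, assuming GPT-2 as the small helper model and an illustrative perplexity threshold (neither is the paper's exact configuration):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

helper = AutoModelForCausalLM.from_pretrained("gpt2")   # small helper LLM
tok = AutoTokenizer.from_pretrained("gpt2")

def is_suspicious(prompt: str, ppl_threshold: float = 1000.0) -> bool:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels = inputs yields the average next-token NLL over the prompt.
        loss = helper(ids, labels=ids).loss
    perplexity = math.exp(loss.item())
    # If True, route the query to a backup model or return a generic refusal.
    return perplexity > ppl_threshold
```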
4) Statistical Signatures → Statistical Analysis
Statistical fingerprints (e.g., DSWatermark) bias token distributions using secret green lists and verify via hypothesis testing.
In practice, these biases leak globally. An attacker can:
- Compare outputs from a watermarked model and a calibration model.
- Identify continuations consistently favored by the watermark.
- Subtract a small learned bias during generation.
This attack achieves approximately 65% ASR at ~92% normalized utility, substantially weakening verification.
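A sketch of the bias-subtraction idea: estimate which tokens the watermarked model systematically over-weights relative to a calibration model that shares its tokenizer, then subtract a scaled correction at serving time. The model pair, prompt set, and `alpha` are illustrative assumptions:

```python
import torch

@torch.no_grad()
def estimate_global_bias(wm_model, cal_model, tokenizer, prompts, device="cpu"):
    # Assumes wm_model and cal_model share the same tokenizer / vocabulary.
    bias_sum = torch.zeros(wm_model.config.vocab_size, device=device)
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").input_ids.to(device)
        wm_logp = torch.log_softmax(wm_model(ids).logits[0, -1], dim=-1)
        cal_logp = torch.log_softmax(cal_model(ids).logits[0, -1], dim=-1)
        # Tokens consistently favored by the watermark accumulate a positive
        # average log-probability gap across calibration prompts.
        bias_sum += wm_logp - cal_logp
    return bias_sum / len(prompts)

# At generation time, subtract a scaled bias from the served model's logits:
# corrected_logits = logits - alpha * global_bias   (e.g., alpha ≈ 1.0)
```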
What Actually Breaks?
We evaluate ten recent fingerprinting schemes:
- Memorization-based schemes (Instructional FP, Chain&Hash, Perinucleus FP, Implicit FP, FPEdit, EditMF)
  → Broken by output suppression + detection
  → ASR ≈ 94–100%, typically with <10% utility loss
- Intrinsic schemes (RoFL, MergePrint, ProFLingo)
  → Easily filtered by perplexity
  → ~100% ASR with negligible utility drop
- Statistical fingerprints (DSWatermark)
  → More robust, but still weakened
  → ~65% ASR at ~92% utility
Why These Attacks Are Practical
- Cheap at inference: simple probability thresholds, small helper models, or minor logit adjustments
- Utility-preserving: selectively target rare triggers or subtle statistics
- Aligned with the threat model: no key leakage, no model swapping—only exploiting public structure
Takeaways and Design Recommendations
- Memorization-based fingerprints are token-level fragile and easily suppressed.
- GCG-optimized intrinsic fingerprints are statistically unnatural and easily filtered.
- Statistical fingerprints leak global signatures that can be partially reverse-engineered.
Design recommendations for future fingerprints:
- Make keys statistically natural and indistinguishable from real prompts.
- Avoid large confidence spikes in fingerprint responses.
- Avoid exact string matching in verification.
- Avoid shared global statistical signatures that can be learned and scrubbed.
Current schemes should be viewed as stepping stones toward more robust, adversarially resilient provenance mechanisms.
Citation
arXiv:2509.26598
BibTeX
@article{nasery2025robustfingerprints,
  title={Are Robust LLM Fingerprints Adversarially Robust?},
  author={Nasery, Anshul and Contente, Edoardo and Kaz, Alkin and Viswanath, Pramod and Oh, Sewoong},
  journal={arXiv preprint arXiv:2509.26598},
  year={2025},
  url={https://arxiv.org/abs/2509.26598}
}
¹ Kerckhoffs’s principle: A cryptographic system should remain secure even if everything about the system except the key is public (Auguste Kerckhoffs, 1883).







