
CVE-2025-62164 PoC: vLLM’s Completions Data-Plane Bug That Turns Embeddings Into an Attack Surface

CVE-2025-62164 is a high-severity vulnerability in vLLM, one of the most widely deployed open-source LLM inference engines. The issue lives inside the Completions API and is triggered when the server processes user-supplied prompt embeddings. In affected versions (0.10.2 up to but not including 0.11.1), vLLM deserializes tensors using torch.load() without strong validation. A crafted sparse tensor can slip through and cause an out-of-bounds write during densification, which reliably crashes the worker and may be escalated to remote code execution under the right conditions. The project shipped a fix in vLLM 0.11.1. (NVD)

Two things make this CVE feel different from the usual AI-stack bugs. First, it’s a data-plane flaw: not an admin UI or misconfig, but an exploit path reachable through the same inference endpoint your users hit for completions. Second, it sits at the exact intersection of unsafe deserialization and upstream behavior drift, a combo that keeps showing up as LLM infrastructure matures.

Where vLLM Sits in the Stack — and Why That Placement Amplifies Risk

vLLM is effectively a throughput-optimized inference layer. Teams deploy it as a public SaaS API, behind an enterprise gateway, or as the serving backend for multi-tenant agent systems. In all those layouts, vLLM is close to the internet and close to GPU resources. That sounds like performance engineering; it also means low-privilege API callers may reach privileged code paths. (wiz.io)

So the blast radius is not subtle. A single crashable endpoint can create GPU starvation, queue buildup, autoscaler churn, and noisy-neighbor incidents. If exploitation ever stabilizes into RCE, the inference fleet becomes a legitimate foothold for supply-chain intrusion.

The Vulnerability in One Paragraph

The Completions endpoint in vulnerable vLLM versions allows clients to pass prompt embeddings instead of raw text. vLLM reconstructs those tensors via torch.load() without sufficient integrity, type, or structural checks. Since PyTorch 2.8.0 disables sparse tensor integrity checks by default, a malicious sparse tensor can bypass internal bounds protections and trigger an out-of-bounds memory write when to_dense() is called. The immediate, repeatable outcome is remote DoS (worker crash). With favorable memory layout and control, the same primitive could plausibly be turned into RCE on the host. (NVD)
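
To make the data path concrete, here is a minimal client-side sketch of legitimate embedding pass-through using a benign dense tensor. The prompt_embeds field name and the base64-over-torch.save encoding reflect vLLM's documented embedding input format, but treat the exact request shape as an assumption to verify against your deployed version; nothing here constructs a malicious payload.

import base64, io
import requests
import torch

# A small, well-formed dense embedding tensor: (sequence_length, hidden_size).
# The hidden size must match the served model; 4096 is only a placeholder.
embeds = torch.randn(8, 4096, dtype=torch.float16)

# Serialize with torch.save and base64-encode the bytes
# (assumption: confirm the exact encoding for your vLLM version).
buf = io.BytesIO()
torch.save(embeds, buf)

payload = {
    "model": "your-model-name",
    "prompt_embeds": base64.b64encode(buf.getvalue()).decode("ascii"),
    "max_tokens": 4,
}

r = requests.post(
    "https://<your-vllm-host>/v1/completions",
    headers={"Authorization": "Bearer <token>"},
    json=payload,
    timeout=10,
)
print(r.status_code)

The important point is not the client code itself but what it implies server-side: every one of these requests forces the worker to deserialize caller-controlled tensor bytes.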

Root Cause Anatomy: How “Convenient Embedding Pass-Through” Became Memory Corruption

A Deserialization Sink on a Public Endpoint

torch.load() is powerful by design. It’s meant to restore tensors and object graphs from trusted sources (checkpoints, internal pipelines). In vLLM’s case, it’s used on a field that can be populated by an API caller. That shifts the trust boundary from “internal model artifact” to “untrusted internet input,” which is historically where unsafe deserialization blows up. (NVD)

Even though this issue manifests as memory corruption rather than a classic pickle-RCE chain, the underlying mistake is the same: treating a complex binary structure as if it were just another request parameter.
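
As an illustration of the sink class (a hypothetical handler, not vLLM's actual implementation), the anti-pattern reduces to feeding request bytes straight into the deserializer and then densifying whatever comes out:

import base64, io
import torch

def load_prompt_embeds_unsafely(field_value: str) -> torch.Tensor:
    # Hypothetical handler illustrating the anti-pattern: the request field is
    # decoded and deserialized with no checks on layout, dtype, shape, or size.
    raw = base64.b64decode(field_value)
    tensor = torch.load(io.BytesIO(raw))  # untrusted bytes hit the deserializer unvalidated
    # Densification is the step where malformed sparse metadata turns into an
    # out-of-bounds memory write, per the advisory.
    return tensor.to_dense()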

The PyTorch 2.8.0 Behavior Change Was the Spark

The vLLM advisory and NVD both pin escalation on a PyTorch change: sparse tensor integrity checks are now off by default. Previously, malformed sparse tensors were more likely to be rejected before the code path reached densification. With checks disabled, vLLM’s lack of pre-validation became exploitable in a consistent way. (NVD)

This is a useful mental model for AI infra security: upstream defaults can silently turn “unsafe but dormant” into “unsafe and weaponizable.”
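
For defense in depth on the PyTorch side, the invariant checks can be re-enabled explicitly rather than relying on the default. The torch.sparse.check_sparse_tensor_invariants facility below is standard PyTorch, not the vLLM patch itself; the sketch only shows turning the checks on around code that densifies externally supplied tensors, so malformed sparse metadata is rejected with an error instead of reaching densification.

import torch

# Opt back in to sparse tensor invariant validation process-wide.
torch.sparse.check_sparse_tensor_invariants.enable()
print(torch.sparse.check_sparse_tensor_invariants.is_enabled())  # True

# The same control is available as a scoped context manager.
with torch.sparse.check_sparse_tensor_invariants():
    i = torch.tensor([[0, 1], [1, 0]])
    v = torch.tensor([1.0, 2.0])
    s = torch.sparse_coo_tensor(i, v, (2, 2))  # well-formed, passes the checks
    print(s.to_dense())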

Impact Reality Check: DoS Is Guaranteed, RCE Is a Ceiling

All public write-ups agree that remote DoS is reliable. A single malformed request can kill a worker; repeated requests can keep a fleet unstable. (ZeroPath)

RCE is described as potential for good reason. Memory corruption provides a pathway, but weaponization depends on allocator behavior, hardening flags, container boundaries, and how much control the attacker has over the corrupted region. There is no CISA KEV listing and no widely confirmed in-the-wild exploit chain as of November 25, 2025, but treating data-plane memory corruption as “DoS-only” would be a mistake. (wiz.io)

Affected Versions and Fix Status

Component: vLLM Completions API (prompt embeddings handling)
Affected versions: 0.10.2 ≤ vLLM < 0.11.1
Patched version: 0.11.1
Trigger: crafted prompt embeddings (sparse tensor)
Impact: reliable DoS; potential RCE
CVSS: 8.8 High (AV:N/AC:L/PR:L/UI:N/S:U/C:H/I:H/A:H)

(NVD)

Who Should Panic First: Threat Models That Matter

If you want a practical prioritization lens, think about where embeddings can enter your system.

Public vLLM endpoints are the obvious high-risk case. Even if callers need an API key, the bar is low: a normal user with basic access may be enough to crash your workers. (wiz.io)

Multi-tenant “LLM as a Service” platforms come next. The danger is that embeddings might flow in indirectly — through toolchains, plugins, agent frameworks, or upstream services that pass embeddings through as an optimization. The more places you accept non-text payloads, the more complicated your trust boundary becomes.

Finally, don’t discount community demos and education deployments. They are frequently unauthenticated, under-monitored, and exposed long after the owner forgets they exist.

Safe Ways to Confirm Exposure (Without Risky Probing)

The fastest triage is version-based.

python -c "import vllm; print(vllm.__version__)"
# affected if 0.10.2 <= version < 0.11.1

(NVD)
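
If you have many serving hosts, a slightly more robust sketch compares the installed version against the affected range programmatically; it assumes the packaging library is available in the same environment as vLLM.

from importlib.metadata import version
from packaging.version import Version

installed = Version(version("vllm"))
affected = Version("0.10.2") <= installed < Version("0.11.1")
print(f"vllm {installed}: "
      f"{'AFFECTED - upgrade to 0.11.1+' if affected else 'not in affected range'}")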

Operationally, look for a pattern of worker segfaults or abrupt restarts tied to unusually large or structurally odd completion requests. In practice, crash spikes appear first; sophisticated exploitation (if it ever arrives) comes later. (ZeroPath)
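
For a quick operational signal, a log sweep for crash signatures around the serving processes can surface the DoS pattern early. The log path and signature strings below are assumptions; adjust them to wherever your workers actually log (journald, container stdout, a sidecar collector).

import re
from collections import Counter
from pathlib import Path

LOG_PATH = Path("/var/log/vllm/worker.log")  # assumption: adjust to your setup
SIGNATURES = (r"Segmentation fault", r"SIGSEGV", r"core dumped", r"Engine core died")

counts = Counter()
for line in LOG_PATH.read_text(errors="ignore").splitlines():
    for sig in SIGNATURES:
        if re.search(sig, line):
            counts[sig] += 1

# A sudden spike in these counters, correlated with unusually large or
# structurally odd completion requests, is the operational fingerprint above.
print(counts.most_common())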

A harmless canary check — standard completions, no embedding pass-through — is useful to baseline stability around patching:

import requests, time

HOST = "https://<your-vllm-host>/v1/completions"
headers = {"Authorization": "Bearer <token>"}

payload = {
    "model": "your-model-name",
    "prompt": "health check",
    "max_tokens": 4,
}

# Five spaced-out requests; a healthy worker should return HTTP 200 each time.
for i in range(5):
    # json= sets the Content-Type header correctly; verify=False is only
    # acceptable against lab endpoints using self-signed certificates.
    r = requests.post(HOST, headers=headers, json=payload, timeout=10, verify=False)
    print(i, r.status_code, r.text[:160])
    time.sleep(1)

Patch Fast, Then Harden the Data Plane

The real fix is simply upgrading to vLLM 0.11.1 or later. Everything else is a stopgap. (NVD)

After that, treat “binary inference inputs” as high-risk sinks. If your product genuinely needs embedding pass-through, gate it with strict schema validation: enforce expected tensor dtypes, shapes, max sizes, and ban sparse formats unless you explicitly support them. Even a dumb allowlist blocks the specific class of malformed structures this CVE relies on. (wiz.io)
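
Here is a minimal sketch of what that allowlist could look like, assuming embeddings arrive as base64-encoded torch.save bytes (verify against your actual request schema). The function name and the limits are illustrative, not taken from the vLLM patch.

import base64, io
import torch

MAX_BYTES = 32 * 1024 * 1024           # cap the serialized payload size
ALLOWED_DTYPES = {torch.float16, torch.bfloat16, torch.float32}
MAX_SEQ_LEN, HIDDEN_SIZE = 4096, 4096  # match your served model

def validate_prompt_embeds(field_value: str) -> torch.Tensor:
    raw = base64.b64decode(field_value)
    if len(raw) > MAX_BYTES:
        raise ValueError("embedding payload too large")
    tensor = torch.load(io.BytesIO(raw), weights_only=True, map_location="cpu")
    if not isinstance(tensor, torch.Tensor):
        raise ValueError("payload is not a tensor")
    if tensor.layout != torch.strided:  # reject sparse and other non-dense layouts
        raise ValueError("only dense tensors are accepted")
    if tensor.dtype not in ALLOWED_DTYPES:
        raise ValueError(f"unexpected dtype {tensor.dtype}")
    if tensor.dim() != 2 or tensor.shape[0] > MAX_SEQ_LEN or tensor.shape[1] != HIDDEN_SIZE:
        raise ValueError(f"unexpected shape {tuple(tensor.shape)}")
    return tensor

Because the layout check runs before anything is densified, the sparse-tensor path this CVE relies on never reaches the dangerous code.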

On the infrastructure side, lock down the blast radius. vLLM workers should run with least privilege, read-only filesystems where possible, no sensitive host mounts, and container seccomp/AppArmor profiles. If someone ever chains memory corruption into code execution, you want it trapped in a box that can’t reach secrets or lateral paths.

Why CVE-2025-62164 Matters for AI Security as a Discipline

This incident is a clean example of how AI security is drifting away from classic web app playbooks.

The new frontier is model-service data planes: tensors, embeddings, multimodal blobs, and serialized artifacts that move through APIs because they’re fast and convenient. They’re also structurally rich and fragile — perfect for corruption bugs if you deserialize without paranoia.

It’s also a reminder that the risk surface of an LLM stack is compositional. vLLM didn’t “invent” sparse tensor insecurity; a PyTorch default changed, and a missing validation layer downstream turned that change into a CVE. Inference engineering now needs the same level of dependency scrutiny that kernel teams take for granted.


Controlled Validation When PoCs Are Messy or Late

AI infra CVEs often arrive before stable public PoCs, or with PoCs that are too risky to point at production serving clusters. The defensible approach is to industrialize a safer loop: authoritative intel → hypothesis → lab-only validation → auditable evidence.

In Penligent-style agentic workflows, you can have agents ingest the vLLM advisory and NVD record, derive the exact exposure conditions (versions, embeddings path, PyTorch assumptions), and generate a minimal-risk validation plan that you run only in an isolated replica. That gets you real proof — version fingerprints, crash signatures, pre/post-patch deltas — without gambling with your prod GPUs. (NVD)

Just as important, evidence-first reporting makes it easier to explain urgency to ops leadership. “We patched because a blog said so” doesn’t survive incident review. “We patched because our lab replica is crashable via the vulnerable embeddings path, and here is the timeline and diff after upgrading to 0.11.1” does.
