Fix EBADF crash in vsock connection handling#552

Draft
DePasqualeOrg wants to merge 3 commits into apple:main from DePasqualeOrg:fix-vsock-ebadf-crash

DePasqualeOrg (Contributor) commented Feb 23, 2026

Note: I encountered a crash while using this package. The root cause analysis from Claude Code pointed to this potential issue, but since the specifics are beyond my understanding, I'm marking this as a draft PR. I put it through several rounds of review with Claude Code and Codex, which suggested that this fixes a legitimate issue. Also, PR #403 mentions that a fix remains to be found for EBADF panics. If this isn't helpful, please close this PR.

Problem

The gRPC client crashes with a precondition failure in NIO:

NIOPosix/System.swift:262: Precondition failed: unacceptable errno 9 Bad file descriptor
  in fcntl(descriptor:command:value:))

The crash occurs in BaseSocket.ignoreSIGPIPE() when NIO calls fcntl(fd, F_SETNOSIGPIPE, 1) on a file descriptor that has been invalidated.
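
The failing call can be reproduced in isolation, without NIO or the Virtualization framework. The following is a minimal sketch, assuming an ordinary pipe as a stand-in for the vsock descriptor; `setNoSigPipe` and `errnoAfterClose` are hypothetical helper names, not part of NIO or this package. It only shows that `fcntl` on a descriptor that is no longer valid reports `EBADF`, which NIO then treats as an unacceptable errno.

```swift
import Foundation

/// Issues the kind of fcntl call described above and returns errno on
/// failure, or 0 on success.
func setNoSigPipe(_ fd: Int32) -> Int32 {
    #if canImport(Darwin)
    let result = fcntl(fd, F_SETNOSIGPIPE, 1)
    #else
    // Assumption for non-Darwin platforms: F_SETNOSIGPIPE does not exist
    // there, so F_GETFD is used to show the same EBADF failure mode.
    let result = fcntl(fd, F_GETFD)
    #endif
    return result == -1 ? errno : 0
}

/// Creates a pipe, closes its read end, and returns the errno produced by
/// calling fcntl on that closed descriptor.
func errnoAfterClose() -> Int32 {
    var fds: [Int32] = [-1, -1]
    precondition(pipe(&fds) == 0, "pipe() failed")
    close(fds[0])                    // the descriptor is now invalid
    let err = setNoSigPipe(fds[0])   // fails because the fd is closed
    close(fds[1])
    return err
}
```

In this sketch the descriptor is invalidated by an explicit `close()`; in the crash described here, the claim is that closing the VZVirtioSocketConnection has the same effect on the dup'd descriptor.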

Root cause

VZVirtioSocketConnection is not a raw POSIX socket — it bridges the process to the Virtualization daemon via XPC. When close() is called, the framework signals the hypervisor to tear down the host-to-guest vsock mapping, which invalidates all file descriptors pointing to the underlying kernel object — including dup'd ones.

The existing dupHandle() method calls self.close() immediately after dup(). This is safe when the fd is used synchronously, but dialAgent() passes the fd to gRPC's ClientConnection(.connectedSocket(fd)), which defers NIO channel creation until the first RPC call. By that time, the fd is invalid.

The same pattern exists in waitForAgent(), where the agent's first gRPC call (setTime via TimeSyncer) can be deferred by up to 30 seconds when Rosetta is not enabled.

Fix

Keep the VZVirtioSocketConnection alive until the gRPC client is done with the fd.

  • VsockTransport (new): A thread-safe wrapper that retains the VZVirtioSocketConnection and provides explicit close semantics. Includes a deinit safety net.
  • Vminitd: Gains an optional VsockTransport field. close() uses defer to ensure the transport is closed even if gRPC shutdown throws.
  • dupFileDescriptor() (new): Like dupHandle(), but does not close the connection. The caller is responsible for keeping the connection alive.
  • dialAgent(): Uses dupFileDescriptor() + VsockTransport instead of dupHandle().
  • waitForAgent(): Same migration — returns (FileHandle, VsockTransport) so start() can pass the transport to Vminitd.
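
The shape of the wrapper and the non-closing dup described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `VsockConnection` is a hypothetical protocol standing in for VZVirtioSocketConnection (only its `fileDescriptor` property and `close()` method are modeled), and `TransportSketch` and `TransportError` are illustrative names.

```swift
import Foundation

/// Hypothetical stand-in for VZVirtioSocketConnection, so the sketch does
/// not depend on the Virtualization framework.
protocol VsockConnection: AnyObject {
    var fileDescriptor: Int32 { get }
    func close()
}

enum TransportError: Error {
    case dupFailed(errno: Int32)
}

extension VsockConnection {
    /// Duplicates the descriptor WITHOUT closing the connection. The caller
    /// must keep the connection alive for as long as the duplicate is used.
    func dupFileDescriptor() throws -> Int32 {
        let fd = dup(fileDescriptor)
        guard fd >= 0 else { throw TransportError.dupFailed(errno: errno) }
        return fd
    }
}

/// Retains the connection and closes it at most once, either explicitly
/// or from deinit as a safety net.
final class TransportSketch: @unchecked Sendable {
    private let lock = NSLock()
    private var connection: VsockConnection?

    init(_ connection: VsockConnection) {
        self.connection = connection
    }

    /// Explicit close. Safe to call from any thread and more than once.
    func close() {
        lock.lock()
        let conn = connection
        connection = nil
        lock.unlock()
        conn?.close()
    }

    deinit {
        // Safety net: closes the connection if close() was never called.
        connection?.close()
    }
}
```

Clearing the stored reference under the lock before calling `close()` on the connection is one way to make repeated or concurrent `close()` calls harmless; the actual VsockTransport may achieve this differently.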

After the fix, both the original fd (in VZVirtioSocketConnection) and the dup'd fd (used by NIO/gRPC) are open simultaneously. NIO owns and closes the dup'd fd during channel teardown. Vminitd.close() shuts down gRPC first, then closes the transport. The dup() is still necessary to avoid a double-close.
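
The teardown ordering described above (gRPC first, then the transport, with the transport closed even if gRPC shutdown throws) can be illustrated with a small sketch. `AgentClientSketch` and its two closures are hypothetical placeholders for Vminitd, its gRPC shutdown, and its transport; the real `close()` may well be asynchronous.

```swift
/// Hypothetical model of a client that owns a gRPC channel and a transport.
struct AgentClientSketch {
    /// Placeholder for shutting down the gRPC client (may throw).
    let shutdownGRPC: () throws -> Void
    /// Placeholder for closing the retained vsock transport.
    let closeTransport: () -> Void

    func close() throws {
        // defer runs when this scope exits, on both the normal and the
        // throwing path, so the transport is always closed after the
        // gRPC shutdown attempt.
        defer { closeTransport() }
        try shutdownGRPC()
    }
}
```

Placing the transport close in `defer` rather than after the `try` is what keeps the connection from leaking when shutdown fails.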

Not changed

dial() also uses dupHandle() but returns a raw FileHandle via the VirtualMachineInstance protocol. The fd is used immediately for relay I/O, so the risk is lower. Migrating it would require changing the protocol, which is a larger change better suited for a follow-up. A doc comment has been added to dupHandle() noting when it is and isn't safe.

Tests

  • Unit tests (VsockTransportTests.swift): Verify fd lifecycle invariants using Unix socket pairs — the exact fcntl(F_SETNOSIGPIPE) call that triggers the NIO crash, read/write through dup'd fds, correct teardown ordering, and peer EOF behavior.
  • Integration test (testExecDeferredConnectionStability): Runs 10 sequential exec calls with 100ms delays between creating the gRPC connection and making the first RPC, exercising the exact code path that was crashing.
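
As a rough idea of what such lifecycle checks look like, here is a minimal sketch using a Unix socket pair. It is not taken from VsockTransportTests.swift; `makeSocketPair`, `roundTrip`, and `checkDupLifecycle` are hypothetical names. Note that an ordinary socket pair does not reproduce the Virtualization teardown behavior, so a check like this can only verify the intended invariants (the dup'd fd is usable while the original is open, and the peer sees EOF once both are closed).

```swift
import Foundation

/// Creates a connected pair of Unix stream sockets.
func makeSocketPair() -> (Int32, Int32) {
    var fds: [Int32] = [-1, -1]
    #if canImport(Darwin)
    let type = SOCK_STREAM
    #else
    let type = Int32(SOCK_STREAM.rawValue)  // Glibc exposes this as an enum
    #endif
    precondition(socketpair(AF_UNIX, type, 0, &fds) == 0, "socketpair() failed")
    return (fds[0], fds[1])
}

/// Writes one byte to `writer` and reads one byte back from `reader`.
func roundTrip(byte: UInt8, writer: Int32, reader: Int32) -> UInt8? {
    var out = byte
    guard write(writer, &out, 1) == 1 else { return nil }
    var input: UInt8 = 0
    guard read(reader, &input, 1) == 1 else { return nil }
    return input
}

/// Returns true if all lifecycle checks pass.
func checkDupLifecycle() -> Bool {
    let (local, peer) = makeSocketPair()
    defer { close(peer) }

    let duplicate = dup(local)
    guard duplicate >= 0 else { close(local); return false }

    var ok = true
    #if canImport(Darwin)
    // The fcntl call from the crash report succeeds while the fd is valid.
    ok = ok && fcntl(duplicate, F_SETNOSIGPIPE, 1) == 0
    #endif
    // Data written through the dup'd fd reaches the peer.
    ok = ok && roundTrip(byte: 0x2A, writer: duplicate, reader: peer) == 0x2A

    // Teardown order: the dup'd fd first (as its owner would), then the original.
    close(duplicate)
    close(local)

    // With every descriptor for the local end closed, the peer reads EOF.
    var buffer: UInt8 = 0
    ok = ok && read(peer, &buffer, 1) == 0
    return ok
}
```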

Exercises the dialAgent() → gRPC RPC path with deliberate delays
between creating the connection and making the first RPC call. This
reproduces a crash where NIO hits a precondition failure (EBADF) in
fcntl(F_SETNOSIGPIPE) because the VZVirtioSocketConnection was closed
before the gRPC client created the NIO channel.

When dialAgent() creates a gRPC connection via vsock, it dup's the
file descriptor and immediately closes the VZVirtioSocketConnection.
The Virtualization framework tears down the vsock endpoint when the
connection is closed, which can invalidate the dup'd descriptor. Since
gRPC defers NIO channel creation until the first RPC, the fd may be
invalid by then, causing a precondition failure in NIO's
fcntl(F_SETNOSIGPIPE).

The fix introduces VsockTransport, a Sendable wrapper that retains
the VZVirtioSocketConnection until Vminitd.close() explicitly shuts
it down after the gRPC channel. A new dupFileDescriptor() method dups
without closing, and dialAgent() passes the connection as transport.

dcantah (Member) commented Feb 24, 2026

Do you know what was occurring when you got the crash? Were you going to stop the VM/container? Was it just running and randomly crashed? It'd be useful to narrow that down. I personally haven't seen any EBADFs in quite a while now. Our last set was solely due to interactions with our (no longer exposed) pause/resume functionality.

DePasqualeOrg (Contributor, Author) commented

I think I was starting or resuming a container, or trying to run a command in a container. Sorry that I can't be more specific than that.
