Fix EBADF crash in vsock connection handling#552
Draft
DePasqualeOrg wants to merge 3 commits intoapple:mainfrom
Draft
Fix EBADF crash in vsock connection handling#552DePasqualeOrg wants to merge 3 commits intoapple:mainfrom
DePasqualeOrg wants to merge 3 commits intoapple:mainfrom
Conversation
Exercises the dialAgent() → gRPC RPC path with deliberate delays between creating the connection and making the first RPC call. This reproduces a crash where NIO hits a precondition failure (EBADF) in fcntl(F_SETNOSIGPIPE) because the VZVirtioSocketConnection was closed before the gRPC client created the NIO channel.
When dialAgent() creates a gRPC connection via vsock, it dup's the file descriptor and immediately closes the VZVirtioSocketConnection. The Virtualization framework tears down the vsock endpoint when the connection is closed, which can invalidate the dup'd descriptor. Since gRPC defers NIO channel creation until the first RPC, the fd may be invalid by then, causing a precondition failure in NIO's fcntl(F_SETNOSIGPIPE). The fix introduces VsockTransport, a Sendable wrapper that retains the VZVirtioSocketConnection until Vminitd.close() explicitly shuts it down after the gRPC channel. A new dupFileDescriptor() method dups without closing, and dialAgent() passes the connection as transport.
Member
|
Do you know what was occurring when you got the crash? Were you going to stop the vm/container? Was it just running and randomly crashed? It'd be useful to narrow that down. I personally haven't seen any EBADFs in quite awhile now. Our last set was solely due to interactions with our (no longer exposed) pause/resume functionality. |
Contributor
Author
|
I think I was starting or resuming a container, or trying to run a command in a container. Sorry that I can't be more specific than that. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The gRPC client crashes with a precondition failure in NIO:
The crash occurs in
BaseSocket.ignoreSIGPIPE()when NIO callsfcntl(fd, F_SETNOSIGPIPE, 1)on a file descriptor that has been invalidated.Root cause
VZVirtioSocketConnectionis not a raw POSIX socket — it bridges the process to the Virtualization daemon via XPC. Whenclose()is called, the framework signals the hypervisor to tear down the host-to-guest vsock mapping, which invalidates all file descriptors pointing to the underlying kernel object — including dup'd ones.The existing
dupHandle()method callsself.close()immediately afterdup(). This is safe when the fd is used synchronously, butdialAgent()passes the fd to gRPC'sClientConnection(.connectedSocket(fd)), which defers NIO channel creation until the first RPC call. By that time, the fd is invalid.The same pattern exists in
waitForAgent(), where the agent's first gRPC call (setTimeviaTimeSyncer) can be deferred by up to 30 seconds when Rosetta is not enabled.Fix
Keep the
VZVirtioSocketConnectionalive until the gRPC client is done with the fd.VsockTransport(new): A thread-safe wrapper that retains theVZVirtioSocketConnectionand provides explicit close semantics. Includes adeinitsafety net.Vminitd: Gains an optionalVsockTransportfield.close()usesdeferto ensure the transport is closed even if gRPC shutdown throws.dupFileDescriptor()(new): LikedupHandle(), but does not close the connection. The caller is responsible for keeping the connection alive.dialAgent(): UsesdupFileDescriptor()+VsockTransportinstead ofdupHandle().waitForAgent(): Same migration — returns(FileHandle, VsockTransport)sostart()can pass the transport toVminitd.After the fix, both the original fd (in
VZVirtioSocketConnection) and the dup'd fd (used by NIO/gRPC) are open simultaneously. NIO owns and closes the dup'd fd during channel teardown.Vminitd.close()shuts down gRPC first, then closes the transport. Thedup()is still necessary to avoid a double-close.Not changed
dial()also usesdupHandle()but returns a rawFileHandlevia theVirtualMachineInstanceprotocol. The fd is used immediately for relay I/O, so the risk is lower. Migrating it would require changing the protocol, which is a larger change better suited for a follow-up. A doc comment has been added todupHandle()noting when it is and isn't safe.Tests
VsockTransportTests.swift): Verify fd lifecycle invariants using Unix socket pairs — the exactfcntl(F_SETNOSIGPIPE)call that triggers the NIO crash, read/write through dup'd fds, correct teardown ordering, and peer EOF behavior.testExecDeferredConnectionStability): Runs 10 sequential exec calls with 100ms delays between creating the gRPC connection and making the first RPC, exercising the exact code path that was crashing.