Skip to content

Fix CI integration test coldstart issues.#1230

Open
jglogan wants to merge 3 commits intoapple:mainfrom
jglogan:slow-launchd
Open

Fix CI integration test coldstart issues.#1230
jglogan wants to merge 3 commits intoapple:mainfrom
jglogan:slow-launchd

Conversation

@jglogan
Copy link
Contributor

@jglogan jglogan commented Feb 19, 2026

  • Closes [chore]: Fix CLINetwork tests. #1206.
  • Closes [Request]: Consistent, useful behavior for --debug. #507.
  • Addresses existing log messages for [Request]: UX: Use stdout and stderr, and logging consistently in CLI commands. #642.
  • Nondeterministic CI errors are resulting from very slow launch times for the first runtime helper, which causes ContainersService to be locked for longer than our 20 sec timeout. Bumping the timeout to 60 seconds addresses this case for now.
  • Since many log messages needed to be changed to troubleshoot the issue, updated all log messages to use structured logging, and implemented consistent entry/exit logging for all service operations.
  • Added logging for ContainerService lock acquisition to help with finding root cause for the slow service startup.
  • Plumbed the --debug flag on both container system start and container system logs so that the flag is actually useful.
  • Updated the install-init.sh script so that can install in a custom app root directory.

- Closes apple#1206.
- Closes apple#507.
- Addresses existing log messages for apple#642.
- Nondeterministic CI errors are resulting from
  very slow launch times for the first runtime
  helper, which causes ContainersService to be
  locked for longer than our 20 sec timeout.
  Bumping the timeout to 60 seconds addresses
  this case for now.
- Since many log messages needed to be changed
  to troubleshoot the issue, updated all log
  messages to use structured logging, and
  implemented consistent entry/exit logging for
  all service operations.
- Added logging for ContainerService lock
  acquisition to help with finding root cause
  for the slow service startup.
- Plumbed the `--debug` flag on both
  `container system start` and `container system logs`
  so that the flag is actually useful.
- Updated the `install-init.sh` script so that
  can install in a custom app root directory.
@github-actions github-actions bot added the cli label Feb 19, 2026
START_ARGS+=(--app-root "$2")
shift 2
;;
-l|--log-root)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing uses this yet, but it will be used when the CI log collection PR posts.


SWIFT="/usr/bin/swift"
IMAGE_NAME="vminit:latest"
DEST_DIR="${1:-$(git rev-parse --show-toplevel)/bin}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is dead code, removed it

bin/container system stop && sleep 3 && bin/container system start && sleep 3
bin/container system stop
sleep 3
bin/container --debug system start "${START_ARGS[@]}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the essential change to the script

var args = ["log"]
args.append(self.follow ? "stream" : "show")
args.append(contentsOf: ["--info", "--debug"])
args.append(contentsOf: ["--info", logOptions.debug ? "--debug" : nil].compactMap { $0 })
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we only pass debug into log stream for container --debug system logs.

return .seconds(timeoutSeconds)
}
)
var timeout: Duration = XPCClient.xpcRegistrationTimeout
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Key point here is that we use the common timeout value now.

env[ApplicationRoot.environmentName] = appRoot.path(percentEncoded: false)
env[InstallRoot.environmentName] = installRoot.path(percentEncoded: false)

let processedArgs = (args ?? ["start"]) + (debug ? ["--debug"] : [])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main change, adds --debug to the plugin startup as necessary.

The implication is that all plugins need to implement a start subcommand that accepts a --debug flag.

/// requests only have milliseconds of overhead, but in some instances,
/// macOS can take 5 seconds (or considerably longer) to launch a
/// service after it has been registered.
public static let xpcRegistrationTimeout: Duration = .seconds(60)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our common timeout value, chosen to work with our slow service startup in CI.

let commandName = Self.configuration.commandName ?? "container-apiserver"
let log = APIServer.setupLogger(debug: debug)
log.info("starting \(commandName)")
log.info("starting helper", metadata: ["name": "\(commandName)"])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern is used for startup/shutdown for all services.

let pluginLoader = try initializePluginLoader(log: log)
try await initializePlugins(pluginLoader: pluginLoader, log: log, routes: &routes)
let containersService = try initializeContainerService(
let containersService = try initializeContainersService(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some name nitpickery for "containers service" and "networks service".

containersService: containersService,
log: log
log: log,
debugHelpers: debug
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Propagate debug settings from API server to any network helpers we create.

pluginLoader: pluginLoader,
log: log
log: log,
debugHelpers: debug
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Propagate debug settings from API server to any runtime helpers we create.

let containerRoot = appRoot.appendingPathComponent("containers")
try FileManager.default.createDirectory(at: containerRoot, withIntermediateDirectories: true)
self.exitMonitor = ExitMonitor(log: log)
self.lock = AsyncLock(log: log)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set up debug logging for lock acquisition/release

/// List containers matching the given filters.
public func list(filters: ContainerListFilters = .all) async throws -> [ContainerSnapshot] {
self.log.debug("\(#function)")
log.debug(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the pattern we use everywhere for entry/exit on our service operations.

  • All information on entry metadata.
  • Only identifying information on exit metadata.
  • debug level, except trace for potentially busy calls like resize.

try await lock.withLock { context in
public func withContainerList<T: Sendable>(
logMetadata: Logger.Metadata? = nil,
_ operation: @Sendable @escaping ([ContainerSnapshot]) async throws -> T
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allows us to trace locks from callers in other services using withContainerList(logMetadata:) { ... }

/// - Returns: Tuple of (total count, active count, total size, reclaimable size)
public func calculateDiskUsage() async -> (Int, Int, UInt64, UInt64) {
await lock.withLock { _ in
await lock.withLock(logMetadata: ["acquirer": "\(#function)"]) { _ in
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our pattern for tracking ContainersService lock usage.

"--uuid", configuration.id,
"--debug",
]
debug ? "--debug" : nil,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Propagate debug flag from API service to runtime helper.

"--mode",
configuration.mode.rawValue,
]
if debugHelpers {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Propagate debug flag from API server to network helpers.

import Foundation
import Testing

@Suite(.serialized)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turn these tests back on and run them in parallel again.

CODESIGN_OPTS ?= --force --sign - --timestamp=none

# Conditionally use a temporary data directory for integration tests
ifeq ($(strip $(APP_ROOT)),)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't do this anymore since we'll want to add other options (--log-root) soon. Make it so we can accumulate startup options.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[chore]: Fix CLINetwork tests. [Request]: Consistent, useful behavior for --debug.

1 participant

Comments