5 changes: 4 additions & 1 deletion aks-node-controller/pkg/nodeconfigutils/utils.go
@@ -20,7 +20,10 @@ set -euo pipefail

 logger -t aks-boothook "boothook start $(date -Ins)"

-mkdir -p /opt/azure/containers
+mkdir -p /opt/azure/containers /var/lib/waagent
+
+touch /var/lib/waagent/experimental_skip_ready_report
+chmod 0644 /var/lib/waagent/experimental_skip_ready_report

 cat <<'EOF' | base64 -d >/opt/azure/containers/aks-node-controller-config.json
 %s
5 changes: 3 additions & 2 deletions e2e/scenario_test.go
@@ -506,8 +506,9 @@ func Test_Ubuntu2204_Scriptless(t *testing.T) {
 	RunScenario(t, &Scenario{
 		Description: "tests that a new ubuntu 2204 node using self contained installer can be properly bootstrapped",
 		Config: Config{
-			Cluster: ClusterKubenet,
-			VHD:     config.VHDUbuntu2204Gen2Containerd,
+			Cluster:                       ClusterKubenet,
+			VHD:                           config.VHDUbuntu2204Gen2Containerd,
+			UseCustomDataOnlyProvisioning: true,
 			Validator: func(ctx context.Context, s *Scenario) {
 				ValidateFileHasContent(ctx, s, "/var/log/azure/aks-node-controller.log", "aks-node-controller finished successfully")
 			},
6 changes: 4 additions & 2 deletions e2e/test_helpers.go
@@ -301,8 +301,10 @@ func prepareAKSNode(ctx context.Context, s *Scenario) (*ScenarioVM, error) {
 		require.NoError(s.T, err, "create vmss %q, check %s for vm logs", s.Runtime.VMSSName, testDir(s.T))
 	}

-	err = getCustomScriptExtensionStatus(s, scenarioVM.VM)
-	require.NoError(s.T, err)
+	if !s.Config.UseCustomDataOnlyProvisioning {
+		err = getCustomScriptExtensionStatus(s, scenarioVM.VM)
+		require.NoError(s.T, err)
+	}

 	if !s.Config.SkipDefaultValidation {
 		vmssCreatedAt := time.Now() // Record the start time
5 changes: 5 additions & 0 deletions e2e/types.go
@@ -163,6 +163,11 @@ type Config struct {
 	// AKSNodeConfigMutator if defined then aks-node-controller will be used to provision nodes
 	AKSNodeConfigMutator func(*aksnodeconfigv1.Configuration)

+	// UseCustomDataOnlyProvisioning switches an AKSNodeConfig scenario to a CustomData-only E2E flow.
+	// It omits the VMSS Custom Script Extension and uses CustomData to run aks-node-controller provision
+	// directly during cloud-init instead.
+	UseCustomDataOnlyProvisioning bool
+
 	// VMConfigMutator is a function which mutates the base VMSS model according to the scenario's requirements
 	VMConfigMutator func(*armcompute.VirtualMachineScaleSet)

5 changes: 4 additions & 1 deletion e2e/vmss.go
@@ -95,7 +95,10 @@ func CustomDataWithHack(s *Scenario, binaryURL string) (string, error) {
 #!/bin/bash
 set -euo pipefail

-mkdir -p /opt/azure/containers /opt/azure/bin
+mkdir -p /opt/azure/containers /opt/azure/bin /var/lib/waagent
+
+touch /var/lib/waagent/experimental_skip_ready_report
+chmod 0644 /var/lib/waagent/experimental_skip_ready_report

 cat <<'EOF' | base64 -d > /opt/azure/containers/aks-node-controller-config-hack.json
 %s
9 changes: 9 additions & 0 deletions parts/linux/cloud-init/artifacts/cse_install.sh
@@ -872,4 +872,13 @@ EOF
     systemctl restart nvidia-persistenced.service || exit 1
 }

+skipCloudInitReadyReport() {
+    local config_filepath="/etc/cloud/cloud.cfg.d/81_azure_skip_ready_report.cfg"
+    mkdir -p "$(dirname "${config_filepath}")"
+    cat <<EOF >"${config_filepath}"
+datasource:
+  Azure:
+    experimental_skip_ready_report: true
+EOF

Copilot AI Mar 24, 2026

skipCloudInitReadyReport writes the cloud-init config to skip the built-in ready report, but it doesn’t create/update the marker file that report_ready.py uses to decide whether it should run. If this function is used to bake the setting into the VHD, nodes can end up with cloud-init skipping the report and the standalone script also skipping it (because the marker is missing). Consider creating /var/lib/waagent/experimental_skip_ready_report here (ensuring the directory exists) or changing the standalone script to detect the cloud-init config directly.

Suggested change
EOF
EOF
# Ensure the marker file used by report_ready.py is present so that
# disabling the cloud-init ready report still allows the standalone
# reporting script to run as expected.
mkdir -p /var/lib/waagent
touch /var/lib/waagent/experimental_skip_ready_report
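
For comparison, the second option mentioned above (having the standalone path detect the cloud-init config directly) could look roughly like the sketch below. It is only an illustration: `standalone_report_needed` is a hypothetical helper that is not part of this PR, the two paths are the ones shown in this diff, and report_ready.py itself may make this decision differently.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Paths taken from this diff; overridable so the check can be tried in a temp dir.
MARKER_FILE="${MARKER_FILE:-/var/lib/waagent/experimental_skip_ready_report}"
CLOUD_CFG="${CLOUD_CFG:-/etc/cloud/cloud.cfg.d/81_azure_skip_ready_report.cfg}"

# Hypothetical helper (assumption, not in the PR): succeeds when either signal
# says cloud-init's built-in ready report was disabled, so the standalone
# report should run instead.
standalone_report_needed() {
    # Signal 1: the marker file written from CustomData.
    if [ -e "${MARKER_FILE}" ]; then
        return 0
    fi
    # Signal 2: the cloud-init drop-in sets experimental_skip_ready_report: true.
    if [ -f "${CLOUD_CFG}" ] &&
        grep -Eq '^[[:space:]]*experimental_skip_ready_report:[[:space:]]*true[[:space:]]*$' "${CLOUD_CFG}"; then
        return 0
    fi
    return 1
}
```

A caller could then guard the report with `if standalone_report_needed; then ...; fi`. Checking both signals means a VHD-baked config and a CustomData-written marker lead to the same behavior.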

+}
 #EOF
9 changes: 9 additions & 0 deletions parts/linux/cloud-init/artifacts/cse_start.sh
@@ -103,6 +103,15 @@ EVENT_JSON=$( jq -n \
 )
 echo ${EVENT_JSON} > ${EVENTS_LOGGING_DIR}${EVENTS_FILE_NAME}.json

+
+if [ -x /opt/azure/containers/report_ready.py ]; then
+    if [ "$EXIT_CODE" -eq 0 ]; then
+        python3 /opt/azure/containers/report_ready.py -v || echo "WARNING: Failed to report ready to Azure fabric"

Copilot AI Mar 11, 2026

Running report_ready.py synchronously here can add noticeable tail latency to successful provisioning (each attempt can block up to the HTTP timeout(s) plus retry delay). Consider running the success-path report in the background (similar to log upload) and/or tightening the per-request timeouts so a transient wireserver issue doesn't extend CSE completion by ~minutes.

Suggested change
python3 /opt/azure/containers/report_ready.py -v || echo "WARNING: Failed to report ready to Azure fabric"
python3 /opt/azure/containers/report_ready.py -v || echo "WARNING: Failed to report ready to Azure fabric" &
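
In bash, a trailing `&` applies to the whole `cmd || fallback` list, so the suggested line backgrounds the report and its warning together. The sketch below makes that grouping explicit with a subshell. `run_best_effort_bg` is a hypothetical wrapper that is not part of this PR, and whether a detached report is acceptable at this point in cse_start.sh is left for the author to decide.

```shell
#!/usr/bin/env bash
set -uo pipefail

# Hypothetical wrapper (assumption, not in the PR): run a best-effort command in
# the background and print a warning if it fails, without blocking the caller.
run_best_effort_bg() {
    local warning="$1"
    shift
    # The subshell keeps the command and its fallback together; "&" detaches the pair.
    ( "$@" || echo "WARNING: ${warning}" ) &
}
```

Usage would be along the lines of `run_best_effort_bg "Failed to report ready to Azure fabric" python3 /opt/azure/containers/report_ready.py -v`. The trade-off is that the caller no longer sees the exit status, and the background job may still be running when the script exits.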

Collaborator

Was this script extracted from the cloud-init repo?

Contributor Author

It is AI-generated: I asked it to copy-paste from the cloud-init repo and also implement the KVP logging.

+    else
+        python3 /opt/azure/containers/report_ready.py -v --failure --description "ExitCode: ${EXIT_CODE}, ${message_string}" || echo "WARNING: Failed to report failure to Azure fabric"
+    fi
Comment on lines +107 to +112
Copilot AI Mar 10, 2026

This report_ready.py invocation runs synchronously before log upload/exit, and can block provisioning for up to ~100s on wireserver timeouts/retries (GET/POST timeouts are 30s with multiple retries). If this is intended to be best-effort (as suggested by || echo "WARNING"), consider running it in the background on success and/or moving it after upload_logs (especially on failure) or passing tighter retry/timeout settings to avoid delaying provisioning and log upload.
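
One way to cap the worst case without touching the script's own retry logic is to wrap the call in coreutils `timeout`. The sketch below is a hedged illustration: `run_best_effort_bounded` is a hypothetical wrapper that is not part of this PR, any limit passed to it is an arbitrary choice, and it deliberately avoids assuming that report_ready.py accepts timeout or retry flags, since none are shown in this diff.

```shell
#!/usr/bin/env bash
set -uo pipefail

# Hypothetical wrapper (assumption, not in the PR): bound the total wall-clock
# time of a best-effort command and print a warning on failure or timeout.
run_best_effort_bounded() {
    local limit_seconds="$1" warning="$2"
    shift 2
    # `timeout` exits with status 124 when the limit is hit, which also
    # triggers the warning branch.
    timeout "${limit_seconds}" "$@" || echo "WARNING: ${warning}"
}
```

For example, `run_best_effort_bounded 15 "Failed to report ready to Azure fabric" python3 /opt/azure/containers/report_ready.py -v` would delay provisioning by at most about 15 seconds, at the cost of possibly cutting off a retry that would have succeeded.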

Comment on lines +107 to +112
Copilot AI Mar 10, 2026

This change updates cloud-init/CSE scripts under parts/, which are covered by snapshot-style golden tests in pkg/agent/testdata/* (e.g., baker_test.go reads ./testdata/<folder>/CustomData). Please run make generate (or regenerate the testdata via the repo’s standard workflow) and include the updated golden files in this PR; otherwise CI is likely to fail due to mismatched expected CustomData/CSE outputs.

+fi
+
 # force a log upload to the host after the provisioning script finishes
 # if we failed, wait for the upload to complete so that we don't remove
 # the VM before it finishes. if we succeeded, upload in the background