feat: implement provisioning script hotfix system#8086

Draft
Devinwong wants to merge 8 commits into main from devinwon/scripts_hotfix_MAR

@Devinwong
Collaborator

What this PR does / why we need it:

  • Added a mechanism for publishing hotfixed provisioning scripts as OCI artifacts.
  • Nodes can now autonomously detect and pull hotfixes during provisioning.
  • Stamped VHDs with provisioning scripts version for hotfix detection.
  • Introduced build-hotfix-oci.sh for building and pushing hotfix artifacts.
  • Created manifest.json to map SKUs to their script inventories.
  • Added README documentation for the hotfix system and usage instructions.

Which issue(s) this PR fixes:

Fixes #

Copilot AI review requested due to automatic review settings March 12, 2026 18:24
Contributor

Copilot AI left a comment

Pull request overview

Implements an OCI-based “provisioning script hotfix” mechanism so nodes can detect and apply updated provisioning scripts at CSE/provisioning time without requiring a new VHD release.

Changes:

  • Adds a SKU→script inventory manifest.json plus a build-hotfix-oci.sh tool to package and publish hotfix artifacts to an OCI registry.
  • Adds hotfix detection/application logic to cse_start.sh, and stamps VHDs with a provisioning-scripts version for tag matching.
  • Adds an e2e scenario test and updates generated CustomData test snapshots accordingly.

Reviewed changes

Copilot reviewed 70 out of 76 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| vhdbuilder/provisioning-manifest/manifest.json | Defines source→destination mappings and permissions for hotfix-packaged scripts (currently Ubuntu 22.04). |
| vhdbuilder/provisioning-manifest/build-hotfix-oci.sh | Builds a tarball + metadata and pushes them as an OCI artifact via oras. |
| vhdbuilder/provisioning-manifest/README.md | Documents the hotfix workflow and operator usage. |
| vhdbuilder/packer/install-dependencies.sh | Stamps the VHD with a provisioning-scripts version used for hotfix tag detection. |
| parts/linux/cloud-init/artifacts/cse_start.sh | Adds node-side detection/pull/extract logic for hotfix artifacts during provisioning. |
| e2e/scenario_test.go | Adds an e2e scenario validating hotfix detection runs and handles the no-hotfix case. |
| Makefile | Adds a convenience target to invoke the hotfix build script. |
| .pipelines/scripts/verify_shell.sh | Marks the new script as bash-only for linting/verification. |
| pkg/agent/testdata/** | Regenerates CustomData snapshot blobs to reflect updated provisioning scripts. |

Comment on lines +192 to +205
# Generate metadata
METADATA_PATH="${OUTPUT_DIR}/hotfix-metadata.json"
cat > "$METADATA_PATH" <<EOF
{
"hotfixId": "${HOTFIX_TAG}",
"affectedVersion": "${AFFECTED_VERSION}",
"sku": "${SKU}",
"description": "${DESCRIPTION}",
"sourceCommit": "${SOURCE_COMMIT}",
"createdAt": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
"tarballSha256": "${TARBALL_SHA256}",
"files": ${METADATA_FILES_JSON}
}
EOF
Copilot AI Mar 12, 2026

hotfix-metadata.json is built via a heredoc with ${DESCRIPTION} interpolated directly into JSON. If the description contains quotes, backslashes, or newlines, the resulting file will be invalid JSON (and can break consumers). Generate the metadata using jq -n --arg ... (or otherwise JSON-escape fields) instead of string concatenation; same applies to METADATA_FILES_JSON entries.
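
A minimal sketch of the `jq -n --arg` approach suggested above, not the PR's actual implementation. The variable names mirror the quoted heredoc; the sample values and the shape of the `files` entry (a single `source` key) are assumptions made only so the snippet runs on its own.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sample values (assumed). The description deliberately contains quotes,
# a backslash, and a newline, which would break the heredoc version.
HOTFIX_TAG="202602.10.0-hotfix"
AFFECTED_VERSION="202602.10.0"
SKU="ubuntu-2204"
DESCRIPTION='Fix "quoted" path C:\tmp
second line'
SOURCE_COMMIT="0123abc"
TARBALL_SHA256="deadbeef"
# Hypothetical shape for the files list; the real entries may differ.
METADATA_FILES_JSON='[{"source":"parts/linux/cloud-init/artifacts/cse_helpers.sh"}]'

# --arg passes each value as a JSON string (escaped by jq);
# --argjson passes an already-encoded JSON value and fails if it is invalid.
metadata="$(jq -n \
    --arg hotfixId "$HOTFIX_TAG" \
    --arg affectedVersion "$AFFECTED_VERSION" \
    --arg sku "$SKU" \
    --arg description "$DESCRIPTION" \
    --arg sourceCommit "$SOURCE_COMMIT" \
    --arg createdAt "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --arg tarballSha256 "$TARBALL_SHA256" \
    --argjson files "$METADATA_FILES_JSON" \
    '{hotfixId: $hotfixId, affectedVersion: $affectedVersion, sku: $sku,
      description: $description, sourceCommit: $sourceCommit,
      createdAt: $createdAt, tarballSha256: $tarballSha256, files: $files}')"

printf '%s\n' "$metadata"

# Round-trip check: the description read back from the JSON matches the input.
round_trip="$(printf '%s' "$metadata" | jq -r '.description')"
```

Because `--argjson` rejects malformed input, a badly assembled `METADATA_FILES_JSON` would fail the build step rather than produce an unparseable metadata file.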

Comment on lines +24 to +110
# Defaults
REGISTRY="hotfixscriptpoc.azurecr.io"
DRY_RUN=false
SKU=""
AFFECTED_VERSION=""
DESCRIPTION=""
FILES=""

usage() {
cat <<EOF
Usage: $(basename "$0") [OPTIONS]

Build and push a provisioning script hotfix as an OCI artifact.

Required:
--sku <sku> Target OS SKU (e.g., ubuntu-2204)
--affected-version <ver> Baked VHD image version to hotfix (e.g., 202602.10.0)
--description <desc> Human-readable description of the hotfix
--files <paths> Comma-separated source paths of changed files
(relative to repo root, e.g., parts/linux/cloud-init/artifacts/cse_helpers.sh)

Optional:
--registry <registry> Target container registry (default: hotfixscriptpoc.azurecr.io)
--dry-run Build artifact locally but do not push to registry

Examples:
# Build and push a hotfix for a single file
$(basename "$0") \\
--sku ubuntu-2204 \\
--affected-version 202602.10.0 \\
--description "Fix CVE-2026-XXXX in provision_source.sh" \\
--files "parts/linux/cloud-init/artifacts/cse_helpers.sh"

# Dry-run (no push) with multiple files
$(basename "$0") \\
--sku ubuntu-2204 \\
--affected-version 202602.10.0 \\
--description "Fix provisioning regression" \\
--files "parts/linux/cloud-init/artifacts/cse_helpers.sh,parts/linux/cloud-init/artifacts/cse_install.sh" \\
--dry-run
EOF
exit 1
}

# Parse arguments
while [[ $# -gt 0 ]]; do
case "$1" in
--sku) SKU="$2"; shift 2 ;;
--affected-version) AFFECTED_VERSION="$2"; shift 2 ;;
--description) DESCRIPTION="$2"; shift 2 ;;
--files) FILES="$2"; shift 2 ;;
--registry) REGISTRY="$2"; shift 2 ;;
--dry-run) DRY_RUN=true; shift ;;
-h|--help) usage ;;
*) echo "ERROR: Unknown option: $1"; usage ;;
esac
done

# Validate required arguments
if [[ -z "$SKU" || -z "$AFFECTED_VERSION" || -z "$DESCRIPTION" || -z "$FILES" ]]; then
echo "ERROR: Missing required arguments."
usage
fi

# Validate manifest exists
if [[ ! -f "$MANIFEST_FILE" ]]; then
echo "ERROR: Manifest not found at ${MANIFEST_FILE}"
exit 1
fi

# Validate SKU exists in manifest
if ! jq -e ".skus[\"${SKU}\"]" "$MANIFEST_FILE" > /dev/null 2>&1; then
echo "ERROR: SKU '${SKU}' not found in manifest."
echo "Available SKUs: $(jq -r '.skus | keys[]' "$MANIFEST_FILE" | tr '\n' ', ')"
exit 1
fi

# Validate oras is available
if ! command -v oras &>/dev/null; then
echo "ERROR: 'oras' CLI not found. Install from https://oras.land/"
exit 1
fi

HOTFIX_TAG="${AFFECTED_VERSION}-hotfix"
REPOSITORY=$(jq -r ".skus[\"${SKU}\"].repository" "$MANIFEST_FILE")
ARTIFACT_REF="${REGISTRY}/${REPOSITORY}:${HOTFIX_TAG}"
ARTIFACT_TYPE="application/vnd.aks.provisioning-scripts.hotfix.v1"
Copilot AI Mar 12, 2026

manifest.json includes a per-SKU registry field, but build-hotfix-oci.sh ignores it and always defaults to the hardcoded hotfixscriptpoc.azurecr.io unless --registry is provided. This makes the manifest misleading and increases the chance of publishing to the wrong registry. Consider defaulting REGISTRY from .skus[sku].registry (and letting --registry override).
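
One way the suggested precedence could look, as a sketch only. The helper name `resolve_registry`, the manifest fragment, and the `example.azurecr.io` values are hypothetical; only the `.skus[<sku>].registry` and `.repository` fields come from the review text.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"
MANIFEST_FILE="${workdir}/manifest.json"

# Hypothetical manifest fragment for illustration.
cat > "$MANIFEST_FILE" <<'EOF'
{
  "skus": {
    "ubuntu-2204": {
      "registry": "example.azurecr.io",
      "repository": "example/provisioning-scripts"
    }
  }
}
EOF

# resolve_registry <sku> [cli-override]
# Precedence: explicit --registry value, then the manifest's per-SKU registry.
# There is no hardcoded fallback, so a missing value is an error.
resolve_registry() {
    local sku="$1" override="${2:-}" from_manifest
    if [ -n "$override" ]; then
        printf '%s\n' "$override"
        return 0
    fi
    # --arg keeps the SKU out of the jq program text; "// empty" yields no
    # output when the field is absent or null.
    from_manifest="$(jq -r --arg sku "$sku" '.skus[$sku].registry // empty' "$MANIFEST_FILE")"
    if [ -z "$from_manifest" ]; then
        echo "ERROR: no registry configured for SKU '${sku}'" >&2
        return 1
    fi
    printf '%s\n' "$from_manifest"
}

default_registry="$(resolve_registry "ubuntu-2204" "")"
overridden_registry="$(resolve_registry "ubuntu-2204" "other.azurecr.io")"
echo "default:    ${default_registry}"
echo "overridden: ${overridden_registry}"
```

Passing the SKU through `--arg` also avoids splicing user input into the jq filter, which the quoted `jq -e ".skus[\"${SKU}\"]"` check does.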

Comment on lines +11 to +15
1. **VHD Build**: Each VHD is stamped with the AgentBaker commit SHA in `/opt/azure/containers/.provisioning-scripts-version`
2. **Hotfix Publish**: An operator builds and pushes corrected scripts as an OCI artifact tagged `<baked-version>-hotfix`
3. **Node Detection**: At provisioning time, `check_for_script_hotfix()` in `cse_start.sh` checks the registry for a matching hotfix tag
4. **Overlay**: If found, the tarball is extracted over the baked scripts before `provision.sh` runs
5. **Fallback**: Any failure is non-fatal — nodes always proceed with baked scripts
Copilot AI Mar 12, 2026

The README says the VHD is stamped with the AgentBaker commit SHA in /opt/azure/containers/.provisioning-scripts-version, but the VHD build change stamps IMAGE_VERSION when available (VHD SIG version) and only falls back to the commit SHA. Please update the README wording to match the actual stamp semantics so operators publish hotfixes against the right value.
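
To make the stamp semantics concrete, here is a small sketch of the fallback described in this comment and of how the stamp maps to the hotfix tag. The helper names `resolve_stamp` and `hotfix_tag_for` and the sample SHA are hypothetical; the `<stamp>-hotfix` tag format follows the README and `build-hotfix-oci.sh`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# resolve_stamp <image-version> <commit-sha>
# Prefers the VHD SIG image version and falls back to the commit SHA
# only when no image version is available.
resolve_stamp() {
    local image_version="${1:-}" commit_sha="${2:-}"
    if [ -n "$image_version" ]; then
        printf '%s\n' "$image_version"
    else
        printf '%s\n' "$commit_sha"
    fi
}

# hotfix_tag_for <stamp>
# The tag a node would look for, and the tag an operator must publish.
hotfix_tag_for() {
    printf '%s-hotfix\n' "$1"
}

stamp_with_version="$(resolve_stamp "202602.10.0" "0123abc")"
stamp_without_version="$(resolve_stamp "" "0123abc")"
tag_with_version="$(hotfix_tag_for "$stamp_with_version")"
tag_without_version="$(hotfix_tag_for "$stamp_without_version")"

echo "IMAGE_VERSION set:   ${tag_with_version}"
echo "IMAGE_VERSION unset: ${tag_without_version}"
```

An operator who publishes against the commit SHA while the VHD was stamped with the image version would produce a tag that no node ever looks up, which is why the README wording matters.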

Comment on lines +66 to +67
if ! timeout 30 oras manifest fetch "${repo}:${hotfix_tag}" > /dev/null 2>&1; then
echo "$(date): Hotfix check: no hotfix tag '${hotfix_tag}' found (normal case)" >> "$hotfix_log"
Copilot AI Mar 12, 2026

The oras manifest fetch probe logs "no hotfix tag found (normal case)" for any failure (including DNS/network timeouts and 401/403 auth failures), because stderr is discarded. This will make real outages look like the normal no-hotfix path. Consider capturing the error output/exit code and logging a different message for transient/auth errors vs an actual missing tag.

Suggested change
if ! timeout 30 oras manifest fetch "${repo}:${hotfix_tag}" > /dev/null 2>&1; then
echo "$(date): Hotfix check: no hotfix tag '${hotfix_tag}' found (normal case)" >> "$hotfix_log"
local manifest_err
manifest_err=$(timeout 30 oras manifest fetch "${repo}:${hotfix_tag}" 2>&1 >/dev/null)
local manifest_rc=$?
if [ "$manifest_rc" -ne 0 ]; then
# Distinguish "tag truly missing" (normal case) from other failures
if echo "$manifest_err" | grep -qiE "404|MANIFEST_UNKNOWN|NAME_UNKNOWN|not[[:space:]]+found"; then
echo "$(date): Hotfix check: no hotfix tag '${hotfix_tag}' found (normal case)" >> "$hotfix_log"
else
echo "$(date): Hotfix check: oras manifest fetch failed for '${hotfix_tag}' (rc=${manifest_rc}): ${manifest_err}" >> "$hotfix_log"
fi
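
The same classification can be factored into a pure function, which makes it testable without a registry. This is a sketch under assumptions: `classify_probe_failure` is a hypothetical helper, the matched error strings are guesses at what the registry client prints, and exit code 124 is what GNU `timeout` returns when the command times out.

```shell
#!/usr/bin/env bash
set -euo pipefail

# classify_probe_failure <exit-code> <stderr-text>
# Prints one of: found, timeout, missing, auth, error.
classify_probe_failure() {
    local rc="$1" err="${2:-}"
    if [ "$rc" -eq 0 ]; then
        echo "found"
    elif [ "$rc" -eq 124 ]; then
        # GNU timeout exits with 124 when the wrapped command timed out.
        echo "timeout"
    elif printf '%s' "$err" | grep -qiE '404|MANIFEST_UNKNOWN|NAME_UNKNOWN|not[[:space:]]+found'; then
        # Tag is absent: the normal no-hotfix case.
        echo "missing"
    elif printf '%s' "$err" | grep -qiE '401|403|unauthorized|forbidden|denied'; then
        echo "auth"
    else
        echo "error"
    fi
}

classify_probe_failure 1 "Error: manifest not found"
classify_probe_failure 1 "response status code 401: unauthorized"
classify_probe_failure 124 ""
```

Only the `missing` result would be logged as the normal case; `timeout`, `auth`, and `error` would each get their own log line so an outage is distinguishable from an absent tag.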

…onanonwestus3

- Changed registry references in provisioning manifest and related scripts to use abe2eprivatenonanonwestus3.azurecr.io instead of hotfixscriptpoc.azurecr.io.
- Updated README.md to reflect the new registry for testing and artifact verification.
- Modified build-hotfix-oci.sh to set the default registry to abe2eprivatenonanonwestus3.azurecr.io.
- Adjusted manifest.json to point to the new registry for provisioning scripts.
- Renamed `oras_login_with_kubelet_identity` to `oras_login_with_managed_identity` in tests to reflect the new implementation.
- Enhanced README to clarify the hotfix process, including the use of managed identity for ORAS login and registry selection details.
- Updated `build-hotfix-oci.sh` to create tarballs without including the staging directory root entry, preventing permission issues during extraction.
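
The tarball change in the last item can be illustrated with a short sketch. The staging layout below is hypothetical and this is not necessarily how `build-hotfix-oci.sh` does it; it only shows the difference between archiving `.` and archiving the staging directory's children.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical staging layout, for illustration only.
staging="$(mktemp -d)"
out_dir="$(mktemp -d)"
mkdir -p "${staging}/opt/azure/containers"
echo 'echo hello' > "${staging}/opt/azure/containers/example.sh"

# Variant A: archiving "." records the staging root itself as "./",
# so extraction can apply the staging directory's mode and ownership
# to the directory the archive is extracted into.
tar -C "$staging" -czf "${out_dir}/with-root.tar.gz" .

# Variant B: archive only the top-level children (including dotfiles),
# so the archive contains no "./" entry.
shopt -s nullglob dotglob
entries=("$staging"/*)
tar -C "$staging" -czf "${out_dir}/no-root.tar.gz" "${entries[@]##*/}"

with_root_listing="$(tar -tzf "${out_dir}/with-root.tar.gz")"
no_root_listing="$(tar -tzf "${out_dir}/no-root.tar.gz")"
printf 'with root entry:\n%s\n\nwithout root entry:\n%s\n' \
    "$with_root_listing" "$no_root_listing"
```

Comparing the two listings shows the extra `./` member in variant A, which is the entry that can alter permissions on the extraction target.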
Copilot AI review requested due to automatic review settings March 13, 2026 05:12
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 84 changed files in this pull request and generated 3 comments.


Comment on lines +24 to +26
# Defaults
REGISTRY="abe2eprivatenonanonwestus3.azurecr.io"
DRY_RUN=false
### Step-by-Step

1. **Identify affected versions**: Determine which baked VHD versions contain the bug. Check the version stamp format (currently git commit SHA).
… for consistency

- Changed the conditional check for IMAGE_VERSION in install-dependencies.sh from double brackets to single brackets for consistency with shell scripting best practices.
Copilot AI review requested due to automatic review settings March 13, 2026 20:32
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 84 changed files in this pull request and generated 6 comments.


Comment on lines +2544 to +2548
`sudo sed -n '/^check_for_script_hotfix()/,/^}/p' /opt/azure/containers/provision_start.sh > /tmp/hotfix_func.sh`,
`sudo bash -c 'echo "999999.99.0" > /opt/azure/containers/.provisioning-scripts-version'`,
`sudo rm -f /opt/azure/containers/.hotfix-applied`,
`sudo bash -c '> /var/log/azure/hotfix-check.log'`,
fmt.Sprintf(`sudo bash -c 'export PATH=/opt/bin:$PATH && source /tmp/hotfix_func.sh && export HOTFIX_REGISTRY=%s && check_for_script_hotfix'`, hotfixRegistry),
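
The commands above extract a single function from the provisioning script with `sed` and source only that function. A self-contained sketch of the technique follows; the script body, the `VERSION_FILE` variable, and the temporary paths are hypothetical stand-ins so it can run anywhere.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"

# Hypothetical stand-in for the provisioning script.
cat > "${workdir}/provision_start.sh" <<'EOF'
#!/bin/bash
check_for_script_hotfix() {
    local version_file="${VERSION_FILE:-/nonexistent}"
    echo "checking hotfix for $(cat "$version_file")"
}

echo "main body that must not run when only the function is sourced"
check_for_script_hotfix
EOF

# Print only the range from the function header to the first line that
# starts with "}". This relies on the closing brace being at column 0 and
# on no other line inside the function starting with "}".
sed -n '/^check_for_script_hotfix()/,/^}/p' \
    "${workdir}/provision_start.sh" > "${workdir}/hotfix_func.sh"

# Fake version stamp, as the e2e commands do with 999999.99.0.
echo "999999.99.0" > "${workdir}/version"

# Source the extracted function in a child shell and call it directly.
output="$(VERSION_FILE="${workdir}/version" \
    bash -c "source '${workdir}/hotfix_func.sh' && check_for_script_hotfix")"
echo "$output"
```

Sourcing the extracted range rather than the whole script keeps the script's top-level statements from executing during the test.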
Comment on lines +2593 to +2597
`sudo sed -n '/^check_for_script_hotfix()/,/^}/p' /opt/azure/containers/provision_start.sh > /tmp/hotfix_func.sh`,
`sudo bash -c 'echo "999999.99.0" > /opt/azure/containers/.provisioning-scripts-version'`,
`sudo rm -f /opt/azure/containers/.hotfix-applied`,
`sudo bash -c '> /var/log/azure/hotfix-check.log'`,
fmt.Sprintf(`sudo bash -c 'export PATH=/opt/bin:$PATH && source /tmp/hotfix_func.sh && export HOTFIX_REGISTRY=%s && check_for_script_hotfix'`, hotfixRegistry),
nbc.ContainerService.Properties.SecurityProfile = &datamodel.SecurityProfile{
PrivateEgress: &datamodel.PrivateEgress{
Enabled: true,
ContainerRegistryServer: fmt.Sprintf("%s.azurecr.io/aks-managed-repository", config.PrivateACRNameNotAnon(config.Config.DefaultLocation)),
Copilot AI review requested due to automatic review settings March 14, 2026 02:45
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 29 out of 84 changed files in this pull request and generated 1 comment.


2 participants