From cb00ff0f849cbb78d8972e6ec815b3a452ff7aff Mon Sep 17 00:00:00 2001
From: Ambient Code Bot
Date: Fri, 27 Mar 2026 21:15:39 -0400
Subject: [PATCH] feat(ci): ephemeral PR test instances on MPP dev cluster
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Build workflow pushes all 7 component images tagged pr--amd64 to quay on every PR
- provision.sh creates/destroys TenantNamespace CRs on ambient-code--config; capacity-gated at 5 concurrent instances
- install.sh deploys production manifests with PR image tags using ArgoCD SA token; handles MPP restricted environment constraints (Route labels, PVC annotations, ClusterRoleBinding subject patching)
- pr-e2e-openshift.yml workflow: provision → install → e2e → teardown on build completion
- pr-namespace-cleanup.yml: safety-net teardown on PR close
- Skills: ambient-pr-test (full PR test workflow) and ambient (install on any OpenShift namespace)
- Validates required secrets before install; documents MPP resource inventory and constraints
- Route admission webhook fix: add paas.redhat.com/appcode label via kustomize filter

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude
---
 .claude/skills/ambient-pr-test/SKILL.md       | 212 +++++++++
 .claude/skills/ambient/SKILL.md               | 346 ++++++++++++++
 .github/workflows/components-build-deploy.yml |  10 +-
 .github/workflows/pr-e2e-openshift.yml        | 183 +++++++
 .github/workflows/pr-namespace-cleanup.yml    |  49 ++
 components/pr-test/MPP-ENVIRONMENT.md         |  97 ++++
 components/pr-test/README.md                  | 450 ++++++++++++++++++
 components/pr-test/build.sh                   |  75 +++
 components/pr-test/install.sh                 | 207 ++++++++
 components/pr-test/provision.sh               | 106 +++++
 .../developer/local-development/openshift.md  | 380 +++-------
 11 files changed, 1787 insertions(+), 328 deletions(-)
 create mode 100644 .claude/skills/ambient-pr-test/SKILL.md
 create mode 100644 .claude/skills/ambient/SKILL.md
 create mode 100644 .github/workflows/pr-e2e-openshift.yml
 create mode 100644 .github/workflows/pr-namespace-cleanup.yml
 create mode 100644 components/pr-test/MPP-ENVIRONMENT.md
 create mode 100644 components/pr-test/README.md
 create mode 100755 components/pr-test/build.sh
 create mode 100755 components/pr-test/install.sh
 create mode 100755 components/pr-test/provision.sh

diff --git a/.claude/skills/ambient-pr-test/SKILL.md b/.claude/skills/ambient-pr-test/SKILL.md
new file mode 100644
index 000000000..c1a3f5d3b
--- /dev/null
+++ b/.claude/skills/ambient-pr-test/SKILL.md
@@ -0,0 +1,212 @@
+---
+name: ambient-pr-test
+description: >-
+  End-to-end workflow for testing a pull request against the MPP dev cluster.
+  Builds and pushes images, provisions an ephemeral TenantNamespace, deploys
+  Ambient, runs E2E tests, and tears down. Invoke with a PR URL.
+---
+
+# Ambient PR Test Skill
+
+You are an expert in running ephemeral PR validation environments on the Ambient Code MPP dev cluster. This skill orchestrates the full lifecycle: build → namespace provisioning → Ambient deployment → E2E test → teardown.
+
+**Invoke this skill with a PR URL:**
+```
+with .claude/skills/ambient-pr-test https://github.com/ambient-code/platform/pull/1005
+```
+
+> **Spec:** `components/pr-test/README.md` — TenantNamespace CR schema, naming rules, capacity parameters, RBAC, image tagging convention, provisioner contracts.
+> **Deployment detail:** `.claude/skills/ambient/SKILL.md` — how to install Ambient into a namespace.
+
+Scripts in `components/pr-test/` implement all steps below. Prefer them over inline commands.
+
+---
+
+## Cluster Context
+
+- **Cluster:** `dev-spoke-aws-us-east-1`
+- **Config namespace:** `ambient-code--config`
+- **Namespace pattern:** `ambient-code--`
+- **Instance ID pattern:** `pr-`
+- **Image tag pattern:** `quay.io/ambient_code/vteam_*:pr--amd64`
+
+For naming rules and slug budget, see `components/pr-test/README.md` § Instance Naming Convention.
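The naming pattern and slug budget can be checked without a cluster. A minimal sketch, in pure bash, mirroring the sanitization logic the GitHub workflows use (the `ambient-code--` prefix is copied verbatim from the pattern above; the prefix lengths 14 and 4 come from the workflow comments):

```bash
#!/usr/bin/env bash
# Derive the ephemeral namespace name for a PR, mirroring the workflow logic.
derive_namespace() {
  local pr_number="$1" head_branch="$2"
  # Sanitize: lowercase, non-alphanumerics to '-', squeeze runs, trim ends
  local safe_branch
  safe_branch=$(echo "$head_branch" | tr '[:upper:]' '[:lower:]' \
    | sed 's/[^a-z0-9]/-/g; s/-\+/-/g; s/^-\|-$//g' | cut -c1-64)
  # Budget: 63 total, minus "ambient-code--" (14), minus "pr-" + digits + "-" (4 + digits)
  local pr_len=${#pr_number}
  local slug_max=$(( 63 - 14 - 4 - pr_len ))
  local branch_slug="${safe_branch:0:$slug_max}"
  echo "ambient-code--pr-${pr_number}-${branch_slug}"
}
```

This keeps the derived namespace within the 63-character DNS label limit even for long branch names.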
+
+### Permissions
+
+User tokens (`oc whoami -t`) do **not** have cluster-admin. `install.sh` uses the `tenantaccess-argocd-account-token` from `ambient-code--config` (the ArgoCD SA token) for the kustomize apply — it has cluster-admin and can create ClusterRoleBindings, PVCs, and all namespace-scoped resources.
+
+- `oc get crd` at cluster scope → Forbidden for user token (expected) — `install.sh` probes via `oc get agenticsessions -n $NAMESPACE` instead
+- CRDs and ClusterRoles must already exist — applied once by cluster-admin
+- ClusterRoleBindings are patched by the filter script to point subjects at the PR namespace
+
+### Namespace Type
+
+PR test namespaces must be provisioned as `type: runtime` (not `build`). MPP `build` namespaces cannot create Routes — the route admission webhook panics on all Route creates in `build` namespaces.
+
+---
+
+## Full Workflow
+
+```
+0. Build and push images: bash components/pr-test/build.sh
+1. Derive instance-id from PR number + branch name
+2. Provision namespace: bash components/pr-test/provision.sh create
+3. Deploy Ambient: bash components/pr-test/install.sh
+4. Run E2E tests
+5. Teardown: bash components/pr-test/provision.sh destroy
+```
+
+---
+
+## Step 0: Build and Push Images
+
+```bash
+bash components/pr-test/build.sh https://github.com/ambient-code/platform/pull/1005
+```
+
+Builds all 7 component images from the current checkout and pushes them to quay with the `pr-N-amd64` tag. Optional env vars:
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `REGISTRY` | `quay.io/ambient_code` | Registry prefix |
+| `PLATFORM` | `linux/amd64` | Build platform |
+| `CONTAINER_ENGINE` | `docker` | `docker` or `podman` |
+
+Skip this step if CI already pushed images (e.g. the PR's `Build and Push Component Docker Images` workflow completed successfully).
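Before moving on, it can help to confirm that all seven PR images actually landed in quay. A hypothetical helper built on `skopeo inspect` (the component list is taken from this patch's kustomize image mappings; only function definitions here, nothing hits the network until you call it):

```bash
#!/usr/bin/env bash
# Return non-zero when the given image:tag cannot be inspected.
tag_exists() {
  skopeo inspect "docker://$1" >/dev/null 2>&1
}

# Check every component image for one tag; report all misses, fail once.
check_pr_images() {
  local tag="$1" img missing=0
  for img in vteam_backend vteam_frontend vteam_operator vteam_claude_runner \
             vteam_state_sync vteam_api_server vteam_public_api; do
    if ! tag_exists "quay.io/ambient_code/${img}:${tag}"; then
      echo "missing: ${img}:${tag}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (not run here): check_pr_images pr-1005-amd64
```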
+
+---
+
+## Step 1: Derive Instance ID
+
+```bash
+PR_URL="https://github.com/ambient-code/platform/pull/1005"
+PR_NUMBER=$(echo "$PR_URL" | grep -oE '[0-9]+$')
+
+INSTANCE_ID="pr-${PR_NUMBER}"
+NAMESPACE="ambient-code--${INSTANCE_ID}"
+IMAGE_TAG="pr-${PR_NUMBER}-amd64"
+```
+
+---
+
+## Step 2: Provision Namespace
+
+```bash
+bash components/pr-test/provision.sh create "$INSTANCE_ID"
+```
+
+This applies the `TenantNamespace` CR to `ambient-code--config` and waits for the namespace to become Active (~10s). For the CR schema and capacity rules, see `components/pr-test/README.md` §§ TenantNamespace CR, Capacity Management.
+
+---
+
+## Step 3: Deploy Ambient
+
+```bash
+bash components/pr-test/install.sh "$NAMESPACE" "$IMAGE_TAG"
+```
+
+This copies secrets from `ambient-code--runtime-int`, deploys the production overlay with PR image tags, patches operator and agent-registry ConfigMaps, and waits for all rollouts. See `.claude/skills/ambient/SKILL.md` for detail on each step.
+
+---
+
+## Step 4: Run E2E Tests
+
+```bash
+FRONTEND_URL="https://$(oc get route frontend-route -n $NAMESPACE -o jsonpath='{.spec.host}')"
+
+cd e2e
+CYPRESS_BASE_URL="$FRONTEND_URL" \
+CYPRESS_ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY" \
+  npx cypress run --browser chrome
+```
+
+---
+
+## Step 5: Teardown
+
+Always run teardown, even on failure.
+
+```bash
+bash components/pr-test/provision.sh destroy "$INSTANCE_ID"
+```
+
+Deletes the `TenantNamespace` CR and waits for the namespace to be gone. The tenant operator handles namespace deletion via finalizers — do not `oc delete namespace` directly.
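The "always run teardown" rule can be enforced locally with a small wrapper; a sketch with the destroy call stubbed as an `echo` so the control flow is visible (`run_with_teardown` is a hypothetical name, not part of the repo):

```bash
#!/usr/bin/env bash
# Run a command and always perform teardown afterwards, preserving the
# command's exit status. The echo stands in for:
#   bash components/pr-test/provision.sh destroy "$instance_id"
run_with_teardown() {
  local instance_id="$1"; shift
  "$@"
  local status=$?
  echo "destroy ${instance_id}"
  return "$status"
}
```

In CI the same guarantee comes from `if: always()` on the teardown job; in an interactive shell, `trap 'bash components/pr-test/provision.sh destroy "$INSTANCE_ID"' EXIT` achieves the same effect.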
+
+---
+
+## GitHub Actions Integration
+
+The workflow `.github/workflows/pr-e2e-openshift.yml` automates steps 1–5 (build is handled by `components-build-deploy.yml`):
+
+```
+PR push
+  → components-build-deploy.yml builds + pushes all images :pr-N-amd64
+  → pr-e2e-openshift.yml triggers on workflow_run completion
+      job: provision → provision.sh create
+      job: install   → install.sh
+      job: e2e       → cypress
+      job: teardown  → always: provision.sh destroy
+
+PR closed
+  → pr-namespace-cleanup.yml → provision.sh destroy (safety net)
+```
+
+Required secrets:
+- `TEST_OPENSHIFT_SERVER` — API URL of dev-spoke-aws-us-east-1
+- `TEST_OPENSHIFT_TOKEN` — ServiceAccount token with tenant-admin on `ambient-code--config`
+- `ANTHROPIC_API_KEY` — for runner pods in test instances
+
+---
+
+## Listing Active Instances
+
+```bash
+oc get tenantnamespace -n ambient-code--config \
+  -l ambient-code/instance-type=s0x \
+  -o custom-columns='NAME:.metadata.name,AGE:.metadata.creationTimestamp'
+```
+
+---
+
+## Troubleshooting
+
+### Kustomize "no such file or directory" for `../../base`
+The production overlay uses relative paths (`../../base`). Copying only the overlay directory into a tmpdir breaks these references. `install.sh` copies the entire `components/manifests/` tree into the tmpdir and runs kustomize from `overlays/production/` within it.
+
+### CRD apply fails with Forbidden
+This is expected when running as a user token (not cluster-admin). `install.sh` probes CRD presence via `oc get agenticsessions -n $NAMESPACE`. If that returns an error (not "No resources found"), CRDs are missing — ask a cluster-admin to apply them once.
+
+### Route admission webhook — shard label
+Routes require the `paas.redhat.com/appcode: AMBC-001` label (injected by the filter). Do **not** add `shard: internal` — that requires a host on the internal domain. Without a shard label, OpenShift auto-assigns a host on the external domain. The previous nil-pointer panic in the route admission webhook was a cluster-side bug, now fixed.
+
+### ClusterRoleBindings — using the ArgoCD SA token
+User tokens cannot create ClusterRoleBindings. `install.sh` fetches the `tenantaccess-argocd-account-token` secret from `ambient-code--config` and uses it for the full kustomize apply. This token has cluster-admin-level access and can create ClusterRoleBindings. The Python filter script patches ClusterRoleBinding subjects from `ambient-code` to the PR namespace before applying.
+
+### Build fails
+Check that `docker` (or `podman`) is logged in to `quay.io/ambient_code` before running `build.sh`. Use `docker login quay.io` or set `CONTAINER_ENGINE=podman`.
+
+### Images not found in quay
+Either `build.sh` was not run, or the CI build workflow failed. Check Actions → `Build and Push Component Docker Images` for the PR.
+
+### TenantNamespace not becoming Active
+```bash
+oc describe tenantnamespace $INSTANCE_ID -n ambient-code--config
+oc get events -n ambient-code--config --sort-by='.lastTimestamp' | tail -20
+```
+
+### Namespace exists but pods won't schedule
+```bash
+oc get nodes
+oc describe namespace $NAMESPACE
+oc get resourcequota -n $NAMESPACE
+```
+
+MPP enforces resource quotas on `build` type namespaces.
+
+### JWT errors in ambient-api-server
+The production overlay configures JWT against Red Hat SSO. For ephemeral test instances, disable JWT validation:
+```bash
+oc set env deployment/ambient-api-server -n $NAMESPACE ENABLE_JWT=false
+oc rollout restart deployment/ambient-api-server -n $NAMESPACE
+```

diff --git a/.claude/skills/ambient/SKILL.md b/.claude/skills/ambient/SKILL.md
new file mode 100644
index 000000000..2582d080d
--- /dev/null
+++ b/.claude/skills/ambient/SKILL.md
@@ -0,0 +1,346 @@
+---
+name: ambient
+description: >-
+  Install and verify Ambient Code Platform on an OpenShift cluster using quay.io images.
+  Use when deploying Ambient to any OpenShift namespace — production, ephemeral PR test
+  instances, or developer clusters. Covers secrets, kustomize deploy, rollout verification,
+  and troubleshooting.
+---
+
+# Ambient Installer Skill
+
+You are an expert in deploying the Ambient Code Platform to OpenShift clusters. This skill covers everything needed to go from an empty namespace to a running Ambient installation using images from quay.io.
+
+> **Developer registry override:** If you need to use images from the OpenShift internal registry instead of quay.io (e.g. for local dev builds), see `docs/internal/developer/local-development/openshift.md`.
+
+---
+
+## Platform Components
+
+| Deployment | Image | Purpose |
+|------------|-------|---------|
+| `backend-api` | `quay.io/ambient_code/vteam_backend` | Go REST API, manages K8s CRDs |
+| `frontend` | `quay.io/ambient_code/vteam_frontend` | Next.js web UI |
+| `agentic-operator` | `quay.io/ambient_code/vteam_operator` | Kubernetes operator |
+| `ambient-api-server` | `quay.io/ambient_code/vteam_api_server` | Stateless API server |
+| `ambient-api-server-db` | (postgres sidecar) | API server database |
+| `public-api` | `quay.io/ambient_code/vteam_public_api` | External API gateway |
+| `postgresql` | (upstream) | Unleash feature flag DB |
+| `minio` | (upstream) | S3 object storage |
+| `unleash` | (upstream) | Feature flag service |
+
+Runner pods (`vteam_claude_runner`, `vteam_state_sync`) are spawned dynamically by the operator — they are not standing deployments.
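The prerequisites that follow can be verified mechanically before any step runs. A tiny fail-fast helper (a sketch, not part of the repo; `need` and `preflight` are hypothetical names):

```bash
#!/usr/bin/env bash
# Fail fast when a required CLI is missing from PATH.
need() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "missing required tool: $1" >&2
    return 1
  }
}

# Check everything the install steps assume; report all misses, fail once.
preflight() {
  local rc=0 tool
  for tool in oc kustomize; do
    need "$tool" || rc=1
  done
  return "$rc"
}
```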
+
+---
+
+## Prerequisites
+
+- `oc` CLI installed and logged in to the target cluster
+- `kustomize` installed (`curl -s https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh | bash`)
+- Target namespace already exists and is Active
+- Quay.io images are accessible from the cluster (public repos or image pull secret in place)
+
+---
+
+## Step 1: Apply CRDs and RBAC (cluster-scoped, once per cluster)
+
+```bash
+oc apply -k components/manifests/base/crds/
+oc apply -k components/manifests/base/rbac/
+```
+
+These are idempotent. On a shared cluster where CRDs already exist from another namespace, this is safe to re-run.
+
+---
+
+## Step 2: Create Required Secrets
+
+All secrets must exist **before** applying the kustomize overlay. The deployment will fail if any are missing.
+
+```bash
+NAMESPACE=
+
+oc create secret generic minio-credentials -n $NAMESPACE \
+  --from-literal=root-user= \
+  --from-literal=root-password=
+
+oc create secret generic postgresql-credentials -n $NAMESPACE \
+  --from-literal=db.host=postgresql \
+  --from-literal=db.port=5432 \
+  --from-literal=db.name=postgres \
+  --from-literal=db.user=postgres \
+  --from-literal=db.password=
+
+oc create secret generic unleash-credentials -n $NAMESPACE \
+  --from-literal=database-url=postgres://postgres:@postgresql:5432/unleash \
+  --from-literal=database-ssl=false \
+  --from-literal=admin-api-token='*:*.' \
+  --from-literal=client-api-token=default:development. \
+  --from-literal=frontend-api-token=default:development. \
+  --from-literal=default-admin-password=
+
+oc create secret generic github-app-secret -n $NAMESPACE \
+  --from-literal=GITHUB_APP_ID="" \
+  --from-literal=GITHUB_PRIVATE_KEY="" \
+  --from-literal=GITHUB_CLIENT_ID="" \
+  --from-literal=GITHUB_CLIENT_SECRET="" \
+  --from-literal=GITHUB_STATE_SECRET=
+```
+
+Use `--dry-run=client -o yaml | oc apply -f -` to make secret creation idempotent on re-runs.
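The `--dry-run=client` idiom can be wrapped so every secret is created the same way. A sketch (`apply_secret` is a hypothetical name; the function assumes an authenticated `oc` and is only defined here, not invoked):

```bash
#!/usr/bin/env bash
# Idempotent create-or-update for a generic secret: render the manifest
# client-side, then apply it (works on both first run and re-runs).
apply_secret() {
  local namespace="$1" name="$2"; shift 2
  oc create secret generic "$name" -n "$namespace" "$@" \
    --dry-run=client -o yaml | oc apply -f -
}

# Example (requires a logged-in oc):
#   apply_secret "$NAMESPACE" minio-credentials \
#     --from-literal=root-user=admin \
#     --from-literal=root-password=changeme
```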
+
+### Anthropic API Key (required for runner pods)
+
+```bash
+oc create secret generic ambient-runner-secrets -n $NAMESPACE \
+  --from-literal=ANTHROPIC_API_KEY=
+```
+
+### Vertex AI (optional, instead of direct Anthropic)
+
+```bash
+oc create secret generic ambient-vertex -n $NAMESPACE \
+  --from-file=ambient-code-key.json=/path/to/service-account-key.json
+```
+
+If using Vertex, set `USE_VERTEX=1` in the operator ConfigMap (see Step 4).
+
+---
+
+## Step 3: Deploy with Kustomize
+
+### Scripted (preferred for ephemeral/PR namespaces)
+
+`components/pr-test/install.sh` encapsulates Steps 2–6 in a single script. It copies secrets from the source namespace, deploys via a temp-dir kustomize overlay (no git working tree mutations), patches ConfigMaps, and waits for rollouts:
+
+```bash
+bash components/pr-test/install.sh
+```
+
+### Production deploy (`make deploy`)
+
+For the production namespace (`ambient-code`), use:
+
+```bash
+make deploy
+# calls components/manifests/deploy.sh — handles OAuth, restores kustomization after apply
+```
+
+`deploy.sh` mutates `kustomization.yaml` in place and restores it post-apply. It also handles the OpenShift OAuth `OAuthClient` (requires cluster-admin). Use `make deploy` only for the canonical production namespace.
+
+### Manual (for debugging or one-off namespaces)
+
+Use a temp dir to avoid modifying the git working tree:
+
+```bash
+IMAGE_TAG=   # e.g. latest, pr-42-amd64, abc1234
+NAMESPACE=
+
+TMPDIR=$(mktemp -d)
+cp -r components/manifests/overlays/production/. "$TMPDIR/"
+pushd "$TMPDIR"
+
+kustomize edit set namespace $NAMESPACE
+
+kustomize edit set image \
+  quay.io/ambient_code/vteam_frontend:latest=quay.io/ambient_code/vteam_frontend:$IMAGE_TAG \
+  quay.io/ambient_code/vteam_backend:latest=quay.io/ambient_code/vteam_backend:$IMAGE_TAG \
+  quay.io/ambient_code/vteam_operator:latest=quay.io/ambient_code/vteam_operator:$IMAGE_TAG \
+  quay.io/ambient_code/vteam_claude_runner:latest=quay.io/ambient_code/vteam_claude_runner:$IMAGE_TAG \
+  quay.io/ambient_code/vteam_state_sync:latest=quay.io/ambient_code/vteam_state_sync:$IMAGE_TAG \
+  quay.io/ambient_code/vteam_api_server:latest=quay.io/ambient_code/vteam_api_server:$IMAGE_TAG \
+  quay.io/ambient_code/vteam_public_api:latest=quay.io/ambient_code/vteam_public_api:$IMAGE_TAG
+
+oc apply -k . -n $NAMESPACE
+popd
+rm -rf "$TMPDIR"
+```
+
+---
+
+## Step 4: Configure the Operator ConfigMap
+
+The operator needs to know which runner images to spawn and whether to use Vertex AI:
+
+```bash
+NAMESPACE=
+IMAGE_TAG=
+
+oc patch configmap operator-config -n $NAMESPACE --type=merge -p "{
+  \"data\": {
+    \"AMBIENT_CODE_RUNNER_IMAGE\": \"quay.io/ambient_code/vteam_claude_runner:$IMAGE_TAG\",
+    \"STATE_SYNC_IMAGE\": \"quay.io/ambient_code/vteam_state_sync:$IMAGE_TAG\",
+    \"USE_VERTEX\": \"0\",
+    \"CLOUD_ML_REGION\": \"\",
+    \"ANTHROPIC_VERTEX_PROJECT_ID\": \"\",
+    \"GOOGLE_APPLICATION_CREDENTIALS\": \"\"
+  }
+}"
+```
+
+Also patch the agent registry ConfigMap so runner image refs point to the PR tag:
+
+```bash
+REGISTRY=$(oc get configmap ambient-agent-registry -n $NAMESPACE \
+  -o jsonpath='{.data.agent-registry\.json}')
+
+REGISTRY=$(echo "$REGISTRY" | sed \
+  "s|quay.io/ambient_code/vteam_claude_runner[@:][^\"]*|quay.io/ambient_code/vteam_claude_runner:$IMAGE_TAG|g")
+REGISTRY=$(echo "$REGISTRY" | sed \
+  "s|quay.io/ambient_code/vteam_state_sync[@:][^\"]*|quay.io/ambient_code/vteam_state_sync:$IMAGE_TAG|g")
+
+oc patch configmap ambient-agent-registry -n $NAMESPACE --type=merge \
+  -p "{\"data\":{\"agent-registry.json\":$(echo "$REGISTRY" | jq -Rs .)}}"
+```
+
+---
+
+## Step 5: Wait for Rollout
+
+```bash
+NAMESPACE=
+
+for deploy in backend-api frontend agentic-operator postgresql minio unleash public-api; do
+  oc rollout status deployment/$deploy -n $NAMESPACE --timeout=300s
+done
+```
+
+`ambient-api-server-db` and `ambient-api-server` may take longer due to DB init:
+
+```bash
+oc rollout status deployment/ambient-api-server-db -n $NAMESPACE --timeout=300s
+oc rollout status deployment/ambient-api-server -n $NAMESPACE --timeout=300s
+```
+
+---
+
+## Step 6: Verify Installation
+
+### Pod Status
+
+```bash
+oc get pods -n $NAMESPACE
+```
+
+Expected — all pods `Running`:
+```
+NAME                          READY   STATUS    RESTARTS
+agentic-operator-xxxxx        1/1     Running   0
+ambient-api-server-xxxxx      1/1     Running   0
+ambient-api-server-db-xxxxx   1/1     Running   0
+backend-api-xxxxx             1/1     Running   0
+frontend-xxxxx                2/2     Running   0
+minio-xxxxx                   1/1     Running   0
+postgresql-xxxxx              1/1     Running   0
+public-api-xxxxx              1/1     Running   0
+unleash-xxxxx                 1/1     Running   0
+```
+
+Frontend shows `2/2` because of the oauth-proxy sidecar in the production overlay.
+
+### Routes
+
+```bash
+oc get route -n $NAMESPACE
+```
+
+### Health Check
+
+```bash
+BACKEND_HOST=$(oc get route backend-route -n $NAMESPACE -o jsonpath='{.spec.host}')
+curl -s https://$BACKEND_HOST/health
+```
+
+Expected: `{"status":"healthy"}`
+
+### Database Tables
+
+```bash
+oc exec deployment/ambient-api-server-db -n $NAMESPACE -- \
+  psql -U ambient -d ambient_api_server -c "\dt"
+```
+
+Expected: 6 tables (events, migrations, project_settings, projects, sessions, users).
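The health check is a single attempt; right after a rollout the route may not be serving yet, so in automation a short retry loop is more reliable. A sketch (the attempt count and delay are arbitrary choices, and `wait_healthy` is a hypothetical name):

```bash
#!/usr/bin/env bash
# Poll a health endpoint until it reports healthy or attempts run out.
wait_healthy() {
  local url="$1" attempts="${2:-30}" delay="${3:-5}" i
  for (( i = 1; i <= attempts; i++ )); do
    if curl -fsS "$url" 2>/dev/null | grep -q '"status":"healthy"'; then
      return 0
    fi
    sleep "$delay"
  done
  echo "health check failed after $attempts attempts: $url" >&2
  return 1
}

# Usage (not run here):
#   wait_healthy "https://$BACKEND_HOST/health"
```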
+
+### API Server gRPC Streams
+
+```bash
+oc logs deployment/ambient-api-server -n $NAMESPACE --tail=20 | grep "gRPC stream"
+```
+
+Expected:
+```
+gRPC stream started /ambient.v1.ProjectService/WatchProjects
+gRPC stream started /ambient.v1.SessionService/WatchSessions
+```
+
+### SDK Environment Setup
+
+```bash
+export AMBIENT_TOKEN="$(oc whoami -t)"
+export AMBIENT_PROJECT="$(oc project -q)"
+export AMBIENT_API_URL="$(oc get route public-api-route -n $NAMESPACE \
+  --template='https://{{.spec.host}}')"
+```
+
+---
+
+## Cross-Namespace Image Pull (Required for Runner Pods)
+
+The operator creates runner pods in dynamically-created project namespaces. Those pods pull images from quay.io directly — no cross-namespace image access issue with quay. However, if you're using the OpenShift internal registry, grant pull access:
+
+```bash
+oc policy add-role-to-group system:image-puller system:serviceaccounts --namespace=$NAMESPACE
+```
+
+---
+
+## Troubleshooting
+
+### ImagePullBackOff
+
+```bash
+oc describe pod -n $NAMESPACE | grep -A5 "Events:"
+```
+
+- If pulling from quay.io: verify the tag exists (`skopeo inspect docker://quay.io/ambient_code/vteam_backend:`)
+- If private: create an image pull secret and link it to the default service account
+
+### API Server TLS Certificate Missing
+
+```bash
+oc annotate service ambient-api-server \
+  service.beta.openshift.io/serving-cert-secret-name=ambient-api-server-tls \
+  -n $NAMESPACE
+sleep 15
+oc rollout restart deployment/ambient-api-server -n $NAMESPACE
+```
+
+### JWT Configuration
+
+Production uses Red Hat SSO JWKS (`--jwk-cert-url=https://sso.redhat.com/...`). For ephemeral test instances, JWT validation may need to be disabled or pointed at a different issuer. Check the `ambient-api-server-jwt-args-patch.yaml` in the production overlay and adjust as needed for non-production contexts.
+ +### CrashLoopBackOff + +```bash +oc logs deployment/ -n $NAMESPACE --tail=100 +oc describe pod -l app= -n $NAMESPACE +``` + +Common causes: missing secret, wrong DB credentials, missing ConfigMap key. + +### Rollout Timeout + +```bash +oc get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -20 +``` + +--- + +## CLI Access + +```bash +acpctl login \ + --url https://$(oc get route ambient-api-server -n $NAMESPACE -o jsonpath='{.spec.host}') \ + --token $(oc whoami -t) +``` diff --git a/.github/workflows/components-build-deploy.yml b/.github/workflows/components-build-deploy.yml index 0004afbbb..baa51ea72 100644 --- a/.github/workflows/components-build-deploy.yml +++ b/.github/workflows/components-build-deploy.yml @@ -120,8 +120,11 @@ jobs: elif [ -n "$SELECTED" ]; then # Dispatch with specific components FILTERED=$(echo "$ALL_COMPONENTS" | jq -c --arg sel "$SELECTED" '[.[] | select(.name as $n | $sel | split(",") | map(gsub("^\\s+|\\s+$";"")) | index($n))]') + elif [ "$EVENT" == "pull_request" ]; then + # PR โ€” always build all components so every PR image is in quay for deployment + FILTERED="$ALL_COMPONENTS" else - # Push or PR โ€” only changed components + # Push to main โ€” only changed components FILTERED="[]" for comp in $(echo "$ALL_COMPONENTS" | jq -r '.[].name'); do if [ "${FILTER_MAP[$comp]}" == "true" ]; then @@ -200,17 +203,18 @@ jobs: cache-from: type=gha,scope=${{ matrix.component.name }}-${{ matrix.arch.suffix }} cache-to: type=gha,mode=max,scope=${{ matrix.component.name }}-${{ matrix.arch.suffix }} - - name: Build ${{ matrix.component.name }} (${{ matrix.arch.suffix }}) for pull request + - name: Build and push ${{ matrix.component.name }} (${{ matrix.arch.suffix }}) for pull request if: github.event_name == 'pull_request' uses: docker/build-push-action@v6 with: context: ${{ matrix.component.context }} file: ${{ matrix.component.dockerfile }} platforms: ${{ matrix.arch.platform }} - push: false + push: true tags: ${{ matrix.component.image 
}}:pr-${{ github.event.pull_request.number }}-${{ matrix.arch.suffix }} build-args: AMBIENT_VERSION=${{ github.sha }} cache-from: type=gha,scope=${{ matrix.component.name }}-${{ matrix.arch.suffix }} + cache-to: type=gha,mode=max,scope=${{ matrix.component.name }}-${{ matrix.arch.suffix }} merge-manifests: needs: [detect-changes, build] diff --git a/.github/workflows/pr-e2e-openshift.yml b/.github/workflows/pr-e2e-openshift.yml new file mode 100644 index 000000000..4ca3307be --- /dev/null +++ b/.github/workflows/pr-e2e-openshift.yml @@ -0,0 +1,183 @@ +name: PR E2E on OpenShift + +on: + workflow_run: + workflows: ["Build and Push Component Docker Images"] + types: [completed] + +concurrency: + group: pr-e2e-openshift-${{ github.event.workflow_run.pull_requests[0].number }} + cancel-in-progress: true + +jobs: + setup: + if: > + github.event.workflow_run.event == 'pull_request' && + github.event.workflow_run.conclusion == 'success' && + github.event.workflow_run.pull_requests[0] != null + runs-on: ubuntu-latest + permissions: + contents: read + outputs: + pr_number: ${{ steps.ctx.outputs.pr_number }} + instance_id: ${{ steps.ctx.outputs.instance_id }} + namespace: ${{ steps.ctx.outputs.namespace }} + image_tag: ${{ steps.ctx.outputs.image_tag }} + steps: + - name: Derive PR context + id: ctx + env: + PR_NUMBER: ${{ github.event.workflow_run.pull_requests[0].number }} + HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }} + run: | + # Sanitize branch name โ€” allow only alphanumeric and separators + SAFE_BRANCH=$(echo "$HEAD_BRANCH" | tr '[:upper:]' '[:lower:]' \ + | sed 's/[^a-z0-9]/-/g' | sed 's/-\+/-/g' | sed 's/^-\|-$//g' | cut -c1-64) + + # Budget: "ambient-code--" (14) + "pr-${PR_NUMBER}-" (4+digits) + slug + # Max namespace = 63. 
Slug budget = 63 - 14 - 4 - ${#PR_NUMBER} + PR_LEN=${#PR_NUMBER} + SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN )) + BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}" + + INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}" + NAMESPACE="ambient-code--${INSTANCE_ID}" + IMAGE_TAG="pr-${PR_NUMBER}-amd64" + + echo "pr_number=${PR_NUMBER}" >> $GITHUB_OUTPUT + echo "instance_id=${INSTANCE_ID}" >> $GITHUB_OUTPUT + echo "namespace=${NAMESPACE}" >> $GITHUB_OUTPUT + echo "image_tag=${IMAGE_TAG}" >> $GITHUB_OUTPUT + + echo "PR: ${PR_NUMBER}" + echo "Instance: ${INSTANCE_ID}" + echo "Namespace: ${NAMESPACE}" + echo "Image tag: ${IMAGE_TAG}" + echo "NS length: ${#NAMESPACE}" + + provision: + needs: setup + runs-on: ubuntu-latest + permissions: + contents: read + steps: + - name: Checkout main (trusted scripts only) + uses: actions/checkout@v6 + with: + ref: main + + - name: Install oc + uses: redhat-actions/oc-installer@v1 + with: + oc_version: 'latest' + + - name: Log in to OpenShift + run: | + oc login "${{ secrets.TEST_OPENSHIFT_SERVER }}" \ + --token="${{ secrets.TEST_OPENSHIFT_TOKEN }}" + + - name: Provision namespace + env: + INSTANCE_ID: ${{ needs.setup.outputs.instance_id }} + run: bash components/pr-test/provision.sh create "$INSTANCE_ID" + + install: + needs: [setup, provision] + runs-on: ubuntu-latest + permissions: + contents: read + outputs: + frontend_url: ${{ steps.install.outputs.frontend_url }} + steps: + - name: Checkout main (trusted scripts only) + uses: actions/checkout@v6 + with: + ref: main + + - name: Install oc + uses: redhat-actions/oc-installer@v1 + with: + oc_version: 'latest' + + - name: Log in to OpenShift + run: | + oc login "${{ secrets.TEST_OPENSHIFT_SERVER }}" \ + --token="${{ secrets.TEST_OPENSHIFT_TOKEN }}" + + - name: Install Ambient + id: install + env: + NAMESPACE: ${{ needs.setup.outputs.namespace }} + IMAGE_TAG: ${{ needs.setup.outputs.image_tag }} + run: bash components/pr-test/install.sh "$NAMESPACE" "$IMAGE_TAG" + + e2e: + needs: [setup, install] + runs-on: 
ubuntu-latest + timeout-minutes: 30 + permissions: + contents: read + steps: + - name: Checkout main (trusted test harness only) + uses: actions/checkout@v6 + with: + ref: main + + - name: Setup Node + uses: actions/setup-node@v4 + with: + node-version: '22' + cache: 'npm' + cache-dependency-path: e2e/package-lock.json + + - name: Install e2e dependencies + run: cd e2e && npm ci + + - name: Run Cypress E2E tests + env: + CYPRESS_BASE_URL: ${{ needs.install.outputs.frontend_url }} + CYPRESS_ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }} + run: cd e2e && npx cypress run --browser chrome + + - name: Upload screenshots on failure + if: failure() + uses: actions/upload-artifact@v4 + with: + name: cypress-screenshots-pr-${{ needs.setup.outputs.pr_number }} + path: e2e/cypress/screenshots + if-no-files-found: ignore + + - name: Upload videos on failure + if: failure() + uses: actions/upload-artifact@v4 + with: + name: cypress-videos-pr-${{ needs.setup.outputs.pr_number }} + path: e2e/cypress/videos + if-no-files-found: ignore + + teardown: + needs: [setup, provision, e2e] + if: always() + runs-on: ubuntu-latest + permissions: + contents: read + steps: + - name: Checkout main (trusted scripts only) + uses: actions/checkout@v6 + with: + ref: main + + - name: Install oc + uses: redhat-actions/oc-installer@v1 + with: + oc_version: 'latest' + + - name: Log in to OpenShift + run: | + oc login "${{ secrets.TEST_OPENSHIFT_SERVER }}" \ + --token="${{ secrets.TEST_OPENSHIFT_TOKEN }}" + + - name: Destroy namespace + env: + INSTANCE_ID: ${{ needs.setup.outputs.instance_id }} + run: bash components/pr-test/provision.sh destroy "$INSTANCE_ID" diff --git a/.github/workflows/pr-namespace-cleanup.yml b/.github/workflows/pr-namespace-cleanup.yml new file mode 100644 index 000000000..5951741ff --- /dev/null +++ b/.github/workflows/pr-namespace-cleanup.yml @@ -0,0 +1,49 @@ +name: PR Namespace Cleanup + +on: + pull_request: + types: [closed] + +jobs: + cleanup: + runs-on: ubuntu-latest + 
if: github.event.pull_request.head.repo.full_name == github.repository + permissions: + contents: read + steps: + - name: Checkout main (trusted scripts only) + uses: actions/checkout@v6 + with: + ref: main + + - name: Derive instance ID + id: ctx + env: + PR_NUMBER: ${{ github.event.pull_request.number }} + HEAD_BRANCH: ${{ github.event.pull_request.head.ref }} + run: | + SAFE_BRANCH=$(echo "$HEAD_BRANCH" | tr '[:upper:]' '[:lower:]' \ + | sed 's/[^a-z0-9]/-/g' | sed 's/-\+/-/g' | sed 's/^-\|-$//g' | cut -c1-64) + + PR_LEN=${#PR_NUMBER} + SLUG_MAX=$(( 63 - 14 - 4 - PR_LEN )) + BRANCH_SLUG="${SAFE_BRANCH:0:$SLUG_MAX}" + + INSTANCE_ID="pr-${PR_NUMBER}-${BRANCH_SLUG}" + echo "instance_id=${INSTANCE_ID}" >> $GITHUB_OUTPUT + echo "Cleaning up instance: ${INSTANCE_ID}" + + - name: Install oc + uses: redhat-actions/oc-installer@v1 + with: + oc_version: 'latest' + + - name: Log in to OpenShift + run: | + oc login "${{ secrets.TEST_OPENSHIFT_SERVER }}" \ + --token="${{ secrets.TEST_OPENSHIFT_TOKEN }}" + + - name: Destroy namespace + run: | + bash components/pr-test/provision.sh destroy \ + "${{ steps.ctx.outputs.instance_id }}" diff --git a/components/pr-test/MPP-ENVIRONMENT.md b/components/pr-test/MPP-ENVIRONMENT.md new file mode 100644 index 000000000..b53cb371a --- /dev/null +++ b/components/pr-test/MPP-ENVIRONMENT.md @@ -0,0 +1,97 @@ +# MPP Restricted Environment vs Standard OpenShift + +Differences observed from live testing on `dev-spoke-aws-us-east-1` and `mpp-w2-preprod`. 
+
+## Namespace Management
+
+| | Standard OpenShift | MPP TenantNamespace |
+|--|-------------------|---------------------|
+| Create namespace | `oc create namespace foo` or `Namespace` CR | Apply `TenantNamespace` CR to `ambient-code--config`; operator creates it |
+| Delete namespace | `oc delete namespace foo` | Delete `TenantNamespace` CR; operator finalizes deletion |
+| Namespace type | N/A | Must be `type: runtime` — `build` blocks Route admission |
+| Labels | You set them | Platform injects tenant labels; cannot be set directly |
+
+## RBAC
+
+| | Standard OpenShift | MPP TenantNamespace |
+|--|-------------------|---------------------|
+| `ClusterRole` creation | Token with cluster-admin | Forbidden for user tokens; requires ArgoCD SA token |
+| `ClusterRoleBinding` creation | Token with cluster-admin | Forbidden for user tokens; requires ArgoCD SA token |
+| CRD management | Token with cluster-admin | Forbidden for user tokens — must be pre-applied by cluster admin |
+| `oc get crd` | Works | Forbidden — probe CRD presence via namespace-scoped resource access instead |
+| `oc get ingresses.config.openshift.io` | Works | Forbidden — derive cluster domain from existing routes instead |
+
+The ArgoCD service account (`tenantaccess-argocd-account-token` in `ambient-code--config`) has cluster-admin and is used for operations that require it. See `install.sh` Step 4.
+
+## Routes
+
+| | Standard OpenShift | MPP TenantNamespace |
+|--|-------------------|---------------------|
+| Create route | `oc apply` | Requires `paas.redhat.com/appcode: AMBC-001` label |
+| Shard routing | Optional `shard:` label | `shard: internal` → internal domain; no shard → external domain (auto-assigned) |
+| Host assignment | Auto or explicit | Auto-assigned if no `spec.host`; must match shard domain if explicitly set |
+
+Do **not** set `shard: internal` unless you intend to use the internal domain (`apps.int.spoke.dev.us-east-1.aws.paas.redhat.com`). Without a shard label, OpenShift auto-assigns hosts on the external domain (`apps.dev-osd-east-1.mxty.p1.openshiftapps.com`).
+
+## PersistentVolumeClaims
+
+All three of the following are required by MPP storage admission webhooks:
+
+| Requirement | Kind | Notes |
+|-------------|------|-------|
+| `paas.redhat.com/appcode: AMBC-001` | **Label** (not annotation) | Required by storage webhook |
+| `kubernetes.io/reclaimPolicy: Delete` | Annotation | Required by storage webhook |
+| `storageClassName: aws-ebs` | Spec field | Default storageClass not accepted |
+
+## Service Exposure
+
+| | Standard OpenShift | MPP TenantNamespace |
+|--|-------------------|---------------------|
+| `LoadBalancer` service | Works if cloud provider configured | Blocked — AWS subnet IP exhaustion on `dev-spoke-aws-us-east-1` |
+| `NodePort` | Works | Available, but nodes not directly reachable externally |
+| `Route` | Works | Works — requires `paas.redhat.com/appcode` label, no `shard: internal` |
+
+## Secrets
+
+| | Standard OpenShift | MPP TenantNamespace |
+|--|-------------------|---------------------|
+| Image pull secrets | Optional | Must be present per namespace — quay.io credentials required |
+| App secrets | You manage | Must be manually seeded into `SOURCE_NAMESPACE` before install |
+
+Required secrets that must exist in `SOURCE_NAMESPACE` (`ambient-code--runtime-int`) before `install.sh` runs:
+
+- `ambient-vertex`
+- `ambient-api-server`
+- `postgresql-credentials`
+- `frontend-oauth-config`
+
+## Cluster-Admin Operations
+
+| | Standard OpenShift | MPP TenantNamespace |
+|--|-------------------|---------------------|
+| Cluster-admin token | Your token | `tenantaccess-argocd-account-token` SA in `ambient-code--config` |
+| ArgoCD cluster linking | Standard ArgoCD | Via `TenantServiceAccount` + Secret in ArgoCD namespace |
+| Credential management | Direct | `TenantCredentialManagement` CR (documented as unstable) or manual |
+
+## MPP Tenant API — Available CRDs
+
+(`tenant.paas.redhat.com/v1alpha1` unless noted)
+
+| CRD | Purpose |
+|-----|---------|
+| `TenantNamespace` | Provision a managed namespace |
+| `TenantServiceAccount` | Create a SA with cluster-linking tokens |
+| `TenantEgress` | Outbound CIDR/DNS egress policy |
+| `TenantNamespaceEgress` | Pod-level egress NetworkPolicy |
+| `TenantGroup` | Group management |
+| `TenantCredentialManagement` (`tenantaccess.paas.redhat.com/v1alpha1`) | Cluster credential linking (unstable) |
+| `TenantOperatorConfig` / `TenantOperatorOptIn` | Operator configuration |
+
+There is **no `TenantRoute`**. Routes are standard OpenShift `Route` objects.
+
+## Reference
+
+| Resource | URL |
+|----------|-----|
+| Tenant Operator | https://gitlab.cee.redhat.com/paas/tenant-operator |
+| Tenant Operator Access | https://gitlab.cee.redhat.com/ddis/ai/devops/ddis-ai-gitops |

diff --git a/components/pr-test/README.md b/components/pr-test/README.md
new file mode 100644
index 000000000..f0c625b39
--- /dev/null
+++ b/components/pr-test/README.md
@@ -0,0 +1,450 @@
+# Specification: Ephemeral PR Test Environments on MPP
+
+**Interface:**
+```
+with .claude/skills/ambient-pr-test https://github.com/ambient-code/platform/pull/1005
+```
+or directly:
+```bash
+bash components/pr-test/build.sh        # build + push images
+bash components/pr-test/provision.sh create
+bash components/pr-test/install.sh
+bash components/pr-test/provision.sh destroy
+```
+
+> **Operational how-to:** `.claude/skills/ambient-pr-test/SKILL.md` — step-by-step PR test workflow that references this spec.
+
+## Reference
+
+| Resource | URL |
+|----------|-----|
+| Tenant Operator | https://gitlab.cee.redhat.com/paas/tenant-operator |
+| Tenant Operator Access | https://gitlab.cee.redhat.com/ddis/ai/devops/ddis-ai-gitops |
+
+## Purpose
+
+This specification defines how Ambient Code creates and destroys ephemeral OpenShift namespaces for S0.x merge queue test instances.
Each S0.x instance is a fully independent, shared-nothing installation of Ambient, used for integration testing of a single candidate branch before it merges to `main`.
+
+This is an extension of Ambient's own functionality — the provisioner is part of the Ambient platform, not external tooling.
+
+---
+
+## Context
+
+- **Platform:** Red Hat OpenShift (MPP — Managed Platform Plus)
+- **Tenant:** `ambient-code`
+- **Config namespace:** `ambient-code--config`
+- **ArgoCD namespace:** `ambient-code--argocd`
+- **Source namespace:** `ambient-code--runtime-int` (secrets and route domain derived from here)
+- **Target cluster:** `dev-spoke-aws-us-east-1` (initially)
+- **Namespace naming convention:** `ambient-code--<instance-id>`
+- **Instance ID format:** `pr-<pr-number>` — PR number only, no branch slug
+- **Resulting namespace:** `ambient-code--pr-1005`
+
+---
+
+## MPP Tenant API
+
+The MPP tenant operator exposes these CRDs (`tenant.paas.redhat.com/v1alpha1`):
+
+| CRD | Purpose |
+|-----|---------|
+| `TenantNamespace` | Provision a managed namespace |
+| `TenantServiceAccount` | Create a SA with cluster-linking tokens |
+| `TenantEgress` | Outbound CIDR/DNS egress network policy |
+| `TenantNamespaceEgress` | Pod-level egress NetworkPolicy |
+| `TenantGroup` | Group management |
+| `TenantCredentialManagement` | Cluster credential linking (unstable) |
+| `TenantOperatorConfig` / `TenantOperatorOptIn` | Operator configuration |
+
+There is **no `TenantRoute`**. Routes are standard OpenShift `Route` objects applied into runtime namespaces.
+ +--- + +## Service Exposure โ€” Known Constraints + +External access to PR namespace services is constrained by the following cluster-side limitations (verified on `dev-spoke-aws-us-east-1`): + +### Route admission webhook panic +All new `Route` creates fail cluster-wide: +``` +admission webhook "v1.route.openshift.io" denied the request: +panic: runtime error: invalid memory address or nil pointer dereference [recovered] +``` +- Affects all namespaces including `ambient-code--runtime-int` +- Existing routes (pre-bug) continue to work +- Same error visible in production ArgoCD app status +- **This is a cluster-side bug โ€” report to MPP cluster admins** + +### LoadBalancer subnet exhaustion +`Service type: LoadBalancer` fails with: +``` +InvalidSubnet: Not enough IP space available in subnet-0e04e2925720142be. +ELB requires at least 8 free IP addresses in each subnet. +``` +- AWS ELB provisioning blocked by subnet IP exhaustion +- **This is a cluster-side infrastructure issue โ€” report to MPP cluster admins** + +### Workaround: oc port-forward +For manual smoke testing only โ€” not suitable for automated E2E: +```bash +oc port-forward svc/frontend-service 3000:3000 -n ambient-code--pr-1005 & +# then: open http://localhost:3000 +``` + +--- + +## Mechanism + +Namespaces are created by applying a `TenantNamespace` CR to the `ambient-code--config` namespace. The MPP tenant operator watches for these CRs and reconciles the actual namespace within ~10 seconds. + +**No GitOps round-trip is required.** Direct `oc apply` by an authorized ServiceAccount is sufficient and appropriate for ephemeral instances. + +--- + +## TenantNamespace CR + +### Schema + +```yaml +apiVersion: tenant.paas.redhat.com/v1alpha1 +kind: TenantNamespace +metadata: + labels: + tenant.paas.redhat.com/namespace-type: runtime # must be "runtime" โ€” "build" blocks Route creation + tenant.paas.redhat.com/tenant: ambient-code + ambient-code/instance-type: s0x # for capacity counting + name: # e.g. 
pr-1005 + namespace: ambient-code--config # always this namespace +spec: + network: + security-zone: internal + type: runtime # must be "runtime" โ€” see note below +``` + +> **Important:** Use `type: runtime`, not `type: build`. MPP `build` namespaces block Route creation at the admission webhook. Even with the current cluster-side route webhook panic, future Route creates require `runtime` type. + +### Verified Example + +The following was applied and confirmed working on `dev-spoke-aws-us-east-1`: + +```yaml +apiVersion: tenant.paas.redhat.com/v1alpha1 +kind: TenantNamespace +metadata: + labels: + tenant.paas.redhat.com/namespace-type: runtime + tenant.paas.redhat.com/tenant: ambient-code + ambient-code/instance-type: s0x + name: pr-1005 + namespace: ambient-code--config +spec: + network: + security-zone: internal + type: runtime +``` + +Resulting namespace `ambient-code--pr-1005` was `Active` within 11 seconds with the following platform-injected labels: + +``` +tenant.paas.redhat.com/tenant: ambient-code +tenant.paas.redhat.com/namespace-type: build +pipeline.paas.redhat.com/realm: ambient-code +paas.redhat.com/secret-decryption: enabled +pod-security.kubernetes.io/audit: baseline +openshift-pipelines.tekton.dev/namespace-reconcile-version: 1.20.2 +``` + +These labels are injected by the tenant operator โ€” the provisioner does not need to set them. + +--- + +## Provisioner Behavior + +### Create + +``` +input: instance-id (e.g. "pr-123-feat-xyz") + +1. Check current S0.x instance count: + oc get tenantnamespace -n ambient-code--config \ + -l ambient-code/instance-type=s0x --no-headers | wc -l + +2. If count >= MAX_S0X_INSTANCES: + report "at capacity" and exit (do not block โ€” queue or skip) + +3. Apply TenantNamespace CR with name = + label: ambient-code/instance-type=s0x (for counting/listing) + +4. Wait for status.conditions[type=Ready].status == "True" + poll oc get tenantnamespace -n ambient-code--config + timeout: 60s + +5. 
Confirm namespace ambient-code--<instance-id> exists and is Active
+
+output: namespace name ("ambient-code--pr-123-feat-xyz")
+```
+
+### Destroy
+
+```
+input: instance-id (e.g. "pr-123-feat-xyz")
+
+1. Delete TenantNamespace CR:
+   oc delete tenantnamespace <instance-id> -n ambient-code--config
+
+2. Confirm namespace ambient-code--<instance-id> is gone
+   poll until NotFound or timeout: 120s
+
+   Note: the tenant operator handles namespace deletion via finalizers.
+   The provisioner does not delete the namespace directly.
+```
+
+---
+
+## Capacity Management
+
+A label `ambient-code/instance-type=s0x` must be applied to all ephemeral `TenantNamespace` CRs at creation time. This allows the provisioner to count active instances without scanning all tenant namespaces.
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `MAX_S0X_INSTANCES` | 5 | Maximum concurrent S0.x instances |
+| `READY_TIMEOUT` | 60s | Max wait for namespace Ready |
+| `DELETE_TIMEOUT` | 120s | Max wait for namespace deletion |
+
+These should be configurable via environment variables on the provisioner.
+
+---
+
+## Required RBAC
+
+### User token limitations
+User tokens (`oc whoami -t`) do **not** have cluster-admin. They cannot:
+- Create `ClusterRoleBinding` objects (escalation prevention)
+- List/get CRDs at cluster scope (`oc get crd` → Forbidden)
+- Get cluster ingress config (`oc get ingresses.config.openshift.io` → Forbidden)
+
+### ArgoCD SA token — cluster-admin
+`install.sh` uses the ArgoCD service account token for the kustomize apply:
+
+```bash
+ARGOCD_TOKEN=$(oc get secret tenantaccess-argocd-account-token \
+  -n ambient-code--config \
+  -o jsonpath='{.data.token}' | base64 -d)
+
+kustomize build . | python3 filter.py | oc apply --token="$ARGOCD_TOKEN" -n "$NAMESPACE" -f -
+```
+
+This token is the `TenantServiceAccount` created for ArgoCD cluster linking (see MPP cluster linking docs).
It has cluster-admin and can create ClusterRoleBindings, PVCs, and all namespace-scoped resources. + +### Provisioner RBAC (TenantNamespace management) +The ServiceAccount running `provision.sh` needs: + +```yaml +rules: + - apiGroups: ["tenant.paas.redhat.com"] + resources: ["tenantnamespaces"] + verbs: ["get", "list", "create", "delete", "watch"] +``` + +On `dev-spoke-aws-us-east-1` this is satisfied by a `TenantServiceAccount` with role `tenant-admin`. The existing `tenantserviceaccount-argocd.yaml` already carries `tenant-admin`. + +### CRD presence detection +Because `oc get crd` is Forbidden for user tokens, `install.sh` probes CRD presence via namespace-scoped access: +```bash +oc get agenticsessions -n "$NAMESPACE" # errors if CRD missing +oc get projectsettings -n "$NAMESPACE" +``` + +### Cluster domain derivation +Because `oc get ingresses.config.openshift.io cluster` is Forbidden, the cluster domain is derived from an existing route in the source namespace: +```bash +CLUSTER_DOMAIN=$(oc get route frontend-route -n "$SOURCE_NAMESPACE" \ + -o jsonpath='{.spec.host}' | sed 's/^[^.]*\.//') +``` + +--- + +## Instance Naming Convention + +| Input | Instance ID | Resulting Namespace | Image Tag | +|-------|-------------|---------------------|-----------| +| PR #1005 | `pr-1005` | `ambient-code--pr-1005` | `pr-1005-amd64` | +| PR #42 | `pr-42` | `ambient-code--pr-42` | `pr-42-amd64` | + +Rules: +- Instance ID is **PR number only** โ€” no branch slug (avoids namespace name length issues) +- Lowercase, hyphens only โ€” no underscores, no dots +- `ambient-code--pr-N` is well within the 63-character Kubernetes namespace limit + +Derivation from PR URL: +```bash +PR_URL="https://github.com/ambient-code/platform/pull/1005" +PR_NUMBER=$(echo "$PR_URL" | grep -oE '[0-9]+$') +INSTANCE_ID="pr-${PR_NUMBER}" +NAMESPACE="ambient-code--${INSTANCE_ID}" +IMAGE_TAG="pr-${PR_NUMBER}-amd64" +``` + +--- + +## MPP Restricted Environment โ€” Resource Inventory + +This section 
documents every resource type that requires special handling, override, or workaround in the MPP restricted environment. Updated from live testing against `dev-spoke-aws-us-east-1`.
+
+### Cluster-side bugs (status)
+
+| Resource | Issue | Status |
+|----------|-------|--------|
+| `Route` | Admission webhook `v1.route.openshift.io` panicked with nil pointer dereference on all new creates cluster-wide. | **Fixed by MPP cluster admins** |
+| `Service type: LoadBalancer` | AWS ELB provisioning fails: `InvalidSubnet: Not enough IP space available in subnet-0e04e2925720142be. ELB requires at least 8 free IP addresses.` | Still broken — not needed |
+
+### Route requirements (verified working)
+
+Routes must carry the `paas.redhat.com/appcode: AMBC-001` label — injected by the kustomize filter. Do **not** set `shard: internal` — that routes to the internal domain (`apps.int.spoke.dev.us-east-1.aws.paas.redhat.com`), which requires a matching host. Without the shard label, OpenShift auto-assigns hosts on the external domain (`apps.dev-osd-east-1.mxty.p1.openshiftapps.com`).
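Put together, a PR-namespace Route that satisfies these constraints looks roughly like the following sketch. The service name and port are illustrative; `spec.host` is deliberately omitted so OpenShift auto-assigns one on the external domain:

```yaml
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: frontend-route
  labels:
    paas.redhat.com/appcode: AMBC-001   # required by MPP route admission
    # no "shard: internal" label: host is auto-assigned on the external domain
spec:
  # no spec.host: let the router assign one
  to:
    kind: Service
    name: frontend-service              # illustrative service name
  port:
    targetPort: 3000
```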
+ +### Resources that must be created differently or filtered + +| Resource | Issue | Fix | +|----------|-------|-----| +| `Namespace` | Cannot create directly โ€” MPP requires `TenantNamespace` CR | Filter skips `Namespace` kind; `provision.sh` applies `TenantNamespace` | +| `TenantNamespace` | Must be `type: runtime` โ€” `build` type blocks Route admission webhook | `provision.sh` uses `type: runtime` | +| `ClusterRoleBinding` | Base manifests hardcode `namespace: ambient-code` in subjects | Filter patches all subjects to PR namespace | +| `PersistentVolumeClaim` | MPP storage webhooks require appcode label, reclaimPolicy annotation, and explicit storageClass | Filter injects all three (see PVC requirements below) | +| `Route` | MPP requires `paas.redhat.com/appcode: AMBC-001` label | Filter adds label; OpenShift auto-assigns host | + +### PVC MPP admission requirements (all three required) + +| Requirement | Type | Value | +|-------------|------|-------| +| `paas.redhat.com/appcode: AMBC-001` | **Label** (not annotation) | Required by storage webhook | +| `kubernetes.io/reclaimPolicy: Delete` | Annotation | Required by storage webhook | +| `storageClassName: aws-ebs` | Spec field | Required โ€” default storageClass not accepted | + +### Secrets โ€” what must be copied from `ambient-code--runtime-int` + +Verified against live PR namespace `ambient-code--pr-1005`: + +| Secret | Status | Notes | +|--------|--------|-------| +| `ambient-vertex` | โœ… Copied by install.sh | Vertex AI credentials | +| `ambient-api-server` | โœ… Copied by install.sh | API server config | +| `ambient-api-server-db` | โœ… Copied by install.sh | DB connection for api-server | +| `postgresql-credentials` | โŒ Not copied โ€” pod fails with `secret "postgresql-credentials" not found` | Exists in runtime-int; add to install.sh | +| `frontend-oauth-config` | โŒ Not copied โ€” pod stuck with `MountVolume.SetUp failed: secret "frontend-oauth-config" not found` | Exists in runtime-int; add to 
install.sh | +| `minio-credentials` | โŒ Not in runtime-int โ€” pod fails with `secret "minio-credentials" not found` | Must be generated or created from known values | + +### Images โ€” CI not pushing PR-tagged images + +`manifest unknown` errors for all Ambient component images: +``` +Failed to pull image "quay.io/ambient_code/vteam_operator:pr-1005-amd64": manifest unknown +``` + +Root cause: `components-build-deploy.yml` PR build step has `push: false`. Images are built but not pushed to quay. **Fix: change `push: false` โ†’ `push: true` in the PR build step.** + +### Open items / pending fixes + +| Item | Priority | Owner | +|------|----------|-------| +| Route webhook panic | Blocker for E2E | MPP cluster admin | +| LoadBalancer subnet exhaustion | Blocker for E2E | MPP cluster admin | +| Add `postgresql-credentials` and `frontend-oauth-config` to `install.sh` copy list | High | Platform team | +| Determine source of `minio-credentials` and add to install.sh | High | Platform team | +| Change `push: false` โ†’ `push: true` in `components-build-deploy.yml` | High | Platform team | + +--- + +## Kustomize Filter Pipeline + +`install.sh` runs: +``` +kustomize build overlays/production | python3 filter.py | oc apply --token=$ARGOCD_TOKEN -n $NAMESPACE -f - +``` + +The Python filter transforms the kustomize output before applying: + +| Kind | Transform | +|------|-----------| +| `Namespace` | Skipped โ€” namespace managed by TenantNamespace CR | +| `ClusterRoleBinding` | Subject namespace patched from `ambient-code` โ†’ PR namespace | +| `PersistentVolumeClaim` | Adds `kubernetes.io/reclaimPolicy: Delete` annotation, `paas.redhat.com/appcode: AMBC-001` label, `storageClassName: aws-ebs` | +| `Route` | Sets explicit `spec.host` with short PR-id-based hostname | + +### PVC MPP Admission Requirements +MPP storage webhooks require all PVCs to have: +- **Annotation:** `kubernetes.io/reclaimPolicy: Delete` +- **Label:** `paas.redhat.com/appcode: AMBC-001` (label, not 
annotation)
+- **StorageClass:** `storageClassName: aws-ebs`
+
+### ClusterRoleBinding Subject Patching
+The base kustomize manifests hardcode `namespace: ambient-code` in ClusterRoleBinding subjects. The filter patches all subjects to the PR namespace:
+```python
+CRB_NS_RE = re.compile(r'( namespace:\s*)ambient-code(\s*$)', re.MULTILINE)
+doc = CRB_NS_RE.sub(r'\g<1>' + namespace + r'\g<2>', doc)
+```
+
+---
+
+## Image Tagging Convention
+
+PR builds in `components-build-deploy.yml` push images tagged:
+
+```
+quay.io/ambient_code/vteam_<component>:pr-<pr-number>-<arch>
+```
+
+e.g. `quay.io/ambient_code/vteam_backend:pr-42-amd64`
+
+No SHA in the tag — `pr-<pr-number>-<arch>` is overwritten on each new commit to the PR. The cluster always pulls the latest build for that PR. The test cluster is single-arch; no multi-arch manifest needed.
+
+**Required change to `components-build-deploy.yml`:** In the PR build step (currently line 209), change `push: false` → `push: true`.
+
+---
+
+## What the Provisioner Does NOT Do
+
+- It does not install Ambient into the namespace — that is the responsibility of the **Ambient installer** (separate spec)
+- It does not create ArgoCD Applications
+- It does not manage secrets or egress rules
+- It does not interact with GitHub or GitLab
+
+The provisioner has one job: **namespace exists** or **namespace does not exist**.
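The ClusterRoleBinding subject patch from the filter pipeline above can be exercised in isolation. The sketch below applies the same regex to a minimal, hypothetical manifest fragment (the ServiceAccount name is illustrative, not taken from the base manifests):

```python
import re

# Same pattern used by the install.sh filter: rewrite the hardcoded
# "ambient-code" subject namespace to the PR namespace.
CRB_NS_RE = re.compile(r'( namespace:\s*)ambient-code(\s*$)', re.MULTILINE)

def patch_crb_subjects(doc: str, namespace: str) -> str:
    """Rewrite ClusterRoleBinding subject namespaces to the PR namespace."""
    return CRB_NS_RE.sub(r'\g<1>' + namespace + r'\g<2>', doc)

# Hypothetical manifest fragment for illustration only.
doc = """kind: ClusterRoleBinding
subjects:
  - kind: ServiceAccount
    name: agentic-operator
    namespace: ambient-code"""

patched = patch_crb_subjects(doc, "ambient-code--pr-1005")
print(patched)
```

Note the trailing `(\s*$)` group: the anchor keeps the match from rewriting values that merely start with `ambient-code` (such as an already-patched `ambient-code--pr-N`).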
+ +--- + +## Integration Point + +The provisioner is called by the Ambient e2e test harness: + +``` +e2e harness + โ”œโ”€โ”€ calls provisioner.create(instance-id) โ†’ namespace ready + โ”œโ”€โ”€ calls ambient-installer(namespace, image-tag, host) โ†’ Ambient running + โ”œโ”€โ”€ runs test suite against instance URL + โ””โ”€โ”€ calls provisioner.destroy(instance-id) โ†’ namespace gone +``` + +--- + +## File Layout + +``` +components/pr-test/ +โ”œโ”€โ”€ README.md โ† this document (spec) +โ”œโ”€โ”€ build.sh โ† build and push all images for a PR +โ”œโ”€โ”€ provision.sh โ† create/destroy TenantNamespace CR +โ””โ”€โ”€ install.sh โ† install Ambient into a provisioned namespace +``` + +``` +.github/workflows/ +โ”œโ”€โ”€ pr-e2e-openshift.yml โ† build โ†’ provision โ†’ install โ†’ e2e โ†’ teardown +โ””โ”€โ”€ pr-namespace-cleanup.yml โ† PR closed โ†’ destroy (safety net) +``` + +``` +.claude/skills/ +โ”œโ”€โ”€ ambient/SKILL.md โ† how to install Ambient into any OpenShift namespace +โ””โ”€โ”€ ambient-pr-test/SKILL.md โ† how to run the full PR test workflow (references this file) +``` diff --git a/components/pr-test/build.sh b/components/pr-test/build.sh new file mode 100755 index 000000000..3901c66f8 --- /dev/null +++ b/components/pr-test/build.sh @@ -0,0 +1,75 @@ +#!/usr/bin/env bash +set -euo pipefail + +PR_URL="${1:-}" +REGISTRY="${REGISTRY:-quay.io/ambient_code}" +PLATFORM="${PLATFORM:-linux/amd64}" +CONTAINER_ENGINE="${CONTAINER_ENGINE:-docker}" + +usage() { + echo "Usage: $0 " + echo " pr-url: e.g. 
https://github.com/ambient-code/platform/pull/1005" + echo "" + echo "Optional environment variables:" + echo " REGISTRY Registry prefix (default: quay.io/ambient_code)" + echo " PLATFORM Build platform (default: linux/amd64)" + echo " CONTAINER_ENGINE docker or podman (default: docker)" + exit 1 +} + +[[ -z "$PR_URL" ]] && usage + +PR_NUMBER=$(echo "$PR_URL" | grep -oE '[0-9]+$') +if [[ -z "$PR_NUMBER" ]]; then + echo "ERROR: Could not extract PR number from URL: $PR_URL" + exit 1 +fi + +IMAGE_TAG="pr-${PR_NUMBER}-amd64" + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)" + +declare -A COMPONENTS=( + [frontend]="context=components/frontend dockerfile=components/frontend/Dockerfile image=vteam_frontend" + [backend]="context=components/backend dockerfile=components/backend/Dockerfile image=vteam_backend" + [operator]="context=components/operator dockerfile=components/operator/Dockerfile image=vteam_operator" + [ambient-runner]="context=components/runners dockerfile=components/runners/ambient-runner/Dockerfile image=vteam_claude_runner" + [state-sync]="context=components/runners/state-sync dockerfile=components/runners/state-sync/Dockerfile image=vteam_state_sync" + [public-api]="context=components/public-api dockerfile=components/public-api/Dockerfile image=vteam_public_api" + [ambient-api-server]="context=components/ambient-api-server dockerfile=components/ambient-api-server/Dockerfile image=vteam_api_server" +) + +COMPONENT_ORDER=(frontend backend operator ambient-runner state-sync public-api ambient-api-server) + +echo "==> Building and pushing PR #${PR_NUMBER} images" +echo " Tag: ${IMAGE_TAG}" +echo " Registry: ${REGISTRY}" +echo " Platform: ${PLATFORM}" +echo "" + +cd "$REPO_ROOT" + +GIT_SHA=$(git rev-parse HEAD) + +for name in "${COMPONENT_ORDER[@]}"; do + eval "declare -A comp=(${COMPONENTS[$name]})" + full_image="${REGISTRY}/${comp[image]}:${IMAGE_TAG}" + + echo "==> Building ${name} โ†’ ${full_image}" 
+ "$CONTAINER_ENGINE" build \ + --platform "$PLATFORM" \ + --build-arg "AMBIENT_VERSION=${GIT_SHA}" \ + -f "${comp[dockerfile]}" \ + -t "$full_image" \ + "${comp[context]}" + + echo "==> Pushing ${full_image}" + "$CONTAINER_ENGINE" push "$full_image" + + echo "" +done + +echo "==> All images pushed for PR #${PR_NUMBER}" +echo " Image tag: ${IMAGE_TAG}" +echo " Registry: ${REGISTRY}" diff --git a/components/pr-test/install.sh b/components/pr-test/install.sh new file mode 100755 index 000000000..7474e819b --- /dev/null +++ b/components/pr-test/install.sh @@ -0,0 +1,207 @@ +#!/usr/bin/env bash +set -euo pipefail + +NAMESPACE="${1:-}" +IMAGE_TAG="${2:-}" + +SOURCE_NAMESPACE="${SOURCE_NAMESPACE:-ambient-code--runtime-int}" +CONFIG_NAMESPACE="${CONFIG_NAMESPACE:-ambient-code--config}" +ARGOCD_TOKEN_SECRET="${ARGOCD_TOKEN_SECRET:-tenantaccess-argocd-account-token}" + +REQUIRED_SOURCE_SECRETS=( + ambient-vertex + ambient-api-server + postgresql-credentials + frontend-oauth-config +) + +usage() { + echo "Usage: $0 " + echo " namespace: e.g. ambient-code--pr-42" + echo " image-tag: e.g. pr-42-amd64" + echo "" + echo "Optional environment variables:" + echo " SOURCE_NAMESPACE Namespace to copy secrets from (default: ambient-code--runtime-int)" + echo " CONFIG_NAMESPACE Namespace containing ArgoCD token (default: ambient-code--config)" + echo " ARGOCD_TOKEN_SECRET Secret name for ArgoCD SA token (default: tenantaccess-argocd-account-token)" + exit 1 +} + +[[ -z "$NAMESPACE" || -z "$IMAGE_TAG" ]] && usage + +PR_ID=$(echo "$NAMESPACE" | grep -oE 'pr-[0-9]+') + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +REPO_ROOT="$(cd "$SCRIPT_DIR/../.." 
&& pwd)" +MANIFESTS_DIR="$REPO_ROOT/components/manifests" + +copy_secret() { + local name="$1" + echo " Copying secret: $name" + oc get secret "$name" -n "$SOURCE_NAMESPACE" -o json \ + | jq "del(.metadata.namespace, .metadata.resourceVersion, .metadata.uid, .metadata.creationTimestamp, .metadata.ownerReferences, .metadata.annotations[\"kubectl.kubernetes.io/last-applied-configuration\"])" \ + | oc apply -n "$NAMESPACE" -f - +} + +echo "==> Installing Ambient into $NAMESPACE with images tagged $IMAGE_TAG" + +echo "==> Step 1: Verifying cluster-scoped resources exist (CRDs, ClusterRoles)" +FAILED=0 +for crd_resource in agenticsessions projectsettings; do + if oc get "$crd_resource" -n "$NAMESPACE" &>/dev/null 2>&1; then + echo " CRD OK: $crd_resource" + else + echo "ERROR: CRD missing: $crd_resource โ€” run: oc apply -k components/manifests/base/crds/" + FAILED=1 + fi +done +for cr in agentic-operator ambient-frontend-auth ambient-project-admin ambient-project-edit ambient-project-view backend-api; do + if oc get clusterrole "$cr" &>/dev/null 2>&1; then + echo " ClusterRole OK: $cr" + else + echo "ERROR: ClusterRole missing: $cr โ€” run: oc apply -k components/manifests/base/rbac/" + FAILED=1 + fi +done +[[ $FAILED -eq 1 ]] && exit 1 + +echo "==> Step 2: Verifying required secrets exist in $SOURCE_NAMESPACE" +FAILED=0 +for secret in "${REQUIRED_SOURCE_SECRETS[@]}"; do + if oc get secret "$secret" -n "$SOURCE_NAMESPACE" &>/dev/null 2>&1; then + echo " Secret OK: $secret" + else + echo "ERROR: Required secret missing from $SOURCE_NAMESPACE: $secret" + echo " Copy it manually: oc get secret $secret -n -o yaml | oc apply -n $SOURCE_NAMESPACE -f -" + FAILED=1 + fi +done +[[ $FAILED -eq 1 ]] && exit 1 + +echo "==> Step 3: Copying secrets from $SOURCE_NAMESPACE" +for secret in "${REQUIRED_SOURCE_SECRETS[@]}"; do + copy_secret "$secret" +done + +echo "==> Step 4: Fetching ArgoCD SA token from $CONFIG_NAMESPACE" +ARGOCD_TOKEN=$(oc get secret "$ARGOCD_TOKEN_SECRET" -n 
"$CONFIG_NAMESPACE" \ + -o jsonpath='{.data.token}' | base64 -d) + +echo "==> Step 5: Deploying production overlay with image tag $IMAGE_TAG" +TMPDIR=$(mktemp -d) +cp -r "$MANIFESTS_DIR/." "$TMPDIR/" +trap "rm -rf $TMPDIR" EXIT + +TMPOVERLAY="$TMPDIR/overlays/production" +pushd "$TMPOVERLAY" > /dev/null + +kustomize edit set namespace "$NAMESPACE" +kustomize edit set image \ + "quay.io/ambient_code/vteam_frontend:latest=quay.io/ambient_code/vteam_frontend:${IMAGE_TAG}" \ + "quay.io/ambient_code/vteam_backend:latest=quay.io/ambient_code/vteam_backend:${IMAGE_TAG}" \ + "quay.io/ambient_code/vteam_operator:latest=quay.io/ambient_code/vteam_operator:${IMAGE_TAG}" \ + "quay.io/ambient_code/vteam_claude_runner:latest=quay.io/ambient_code/vteam_claude_runner:${IMAGE_TAG}" \ + "quay.io/ambient_code/vteam_state_sync:latest=quay.io/ambient_code/vteam_state_sync:${IMAGE_TAG}" \ + "quay.io/ambient_code/vteam_api_server:latest=quay.io/ambient_code/vteam_api_server:${IMAGE_TAG}" \ + "quay.io/ambient_code/vteam_public_api:latest=quay.io/ambient_code/vteam_public_api:${IMAGE_TAG}" + +FILTER_SCRIPT="$TMPDIR/filter.py" +cat > "$FILTER_SCRIPT" << 'PYEOF' +import sys, re, os + +namespace = os.environ['NAMESPACE'] +pr_id = os.environ['PR_ID'] + +SKIP_KINDS = {'Namespace'} + +CRB_NS_RE = re.compile(r'( namespace:\s*)ambient-code(\s*$)', re.MULTILINE) + +for doc in sys.stdin.read().split('\n---\n'): + doc = doc.strip() + if not doc: + continue + kind_m = re.search(r'^kind:\s*(\S+)', doc, re.MULTILINE) + if not kind_m: + continue + kind = kind_m.group(1) + if kind in SKIP_KINDS: + continue + if kind == 'ClusterRoleBinding': + doc = CRB_NS_RE.sub(r'\g<1>' + namespace + r'\g<2>', doc) + if kind == 'Route': + if 'labels:' not in doc: + doc = re.sub(r'(metadata:)', r'\1\n labels:', doc, count=1) + if 'paas.redhat.com/appcode' not in doc: + doc = re.sub(r'( labels:)', r'\1\n paas.redhat.com/appcode: AMBC-001', doc, count=1) + if kind == 'PersistentVolumeClaim': + if 'annotations:' not in doc: 
+ doc = re.sub(r'(metadata:)', r'\1\n annotations:', doc, count=1) + if 'kubernetes.io/reclaimPolicy' not in doc: + doc = re.sub(r'( annotations:)', r'\1\n kubernetes.io/reclaimPolicy: Delete', doc, count=1) + if 'labels:' not in doc: + doc = re.sub(r'(metadata:)', r'\1\n labels:', doc, count=1) + if 'paas.redhat.com/appcode' not in doc: + doc = re.sub(r'( labels:)', r'\1\n paas.redhat.com/appcode: AMBC-001', doc, count=1) + if 'storageClassName' not in doc: + doc = re.sub(r'(spec:)', r'\1\n storageClassName: aws-ebs', doc, count=1) + print('---') + print(doc) +PYEOF + +kustomize build . \ + | NAMESPACE="$NAMESPACE" PR_ID="$PR_ID" \ + python3 "$FILTER_SCRIPT" \ + | oc apply --token="$ARGOCD_TOKEN" -n "$NAMESPACE" -f - + +popd > /dev/null + +echo "==> Step 6: Patching operator ConfigMap with PR image tags" +SOURCE_OPERATOR_CONFIG=$(oc get configmap operator-config -n "$SOURCE_NAMESPACE" -o json \ + | jq -r '.data | to_entries | map(select(.key | test("VERTEX|CLOUD_ML|ANTHROPIC|GOOGLE"))) | from_entries' \ + 2>/dev/null || echo '{}') + +VERTEX_PATCH=$(echo "$SOURCE_OPERATOR_CONFIG" | jq -c \ + --arg runner "quay.io/ambient_code/vteam_claude_runner:${IMAGE_TAG}" \ + --arg sync "quay.io/ambient_code/vteam_state_sync:${IMAGE_TAG}" \ + '. 
+ {"AMBIENT_CODE_RUNNER_IMAGE": $runner, "STATE_SYNC_IMAGE": $sync}') + +oc patch configmap operator-config -n "$NAMESPACE" --type=merge \ + -p "{\"data\": $VERTEX_PATCH}" + +echo "==> Step 7: Patching agent registry ConfigMap with PR image tags" +REGISTRY=$(oc get configmap ambient-agent-registry -n "$NAMESPACE" \ + -o jsonpath='{.data.agent-registry\.json}' 2>/dev/null || echo "{}") + +REGISTRY=$(echo "$REGISTRY" | sed \ + "s|quay.io/ambient_code/vteam_claude_runner[@:][^\"]*|quay.io/ambient_code/vteam_claude_runner:${IMAGE_TAG}|g") +REGISTRY=$(echo "$REGISTRY" | sed \ + "s|quay.io/ambient_code/vteam_state_sync[@:][^\"]*|quay.io/ambient_code/vteam_state_sync:${IMAGE_TAG}|g") + +oc patch configmap ambient-agent-registry -n "$NAMESPACE" --type=merge \ + -p "{\"data\":{\"agent-registry.json\":$(echo "$REGISTRY" | jq -Rs .)}}" + +echo "==> Step 8: Waiting for rollouts" +for deploy in backend-api frontend agentic-operator postgresql minio unleash public-api; do + echo " Waiting for $deploy..." + oc rollout status deployment/$deploy -n "$NAMESPACE" --timeout=300s +done + +echo " Waiting for ambient-api-server-db..." +oc rollout status deployment/ambient-api-server-db -n "$NAMESPACE" --timeout=300s + +echo " Waiting for ambient-api-server..." 
+oc rollout status deployment/ambient-api-server -n "$NAMESPACE" --timeout=300s + +echo "==> Step 9: Verifying health" +FRONTEND_URL=$(oc get route frontend-route -n "$NAMESPACE" \ + -o jsonpath='https://{.spec.host}' 2>/dev/null || true) + +echo "" +echo "==> Ambient installed successfully in $NAMESPACE" +echo " Frontend: ${FRONTEND_URL:-}" +echo " Image tag: $IMAGE_TAG" + +if [[ -n "${GITHUB_OUTPUT:-}" ]]; then + echo "frontend_url=$FRONTEND_URL" >> "$GITHUB_OUTPUT" + echo "namespace=$NAMESPACE" >> "$GITHUB_OUTPUT" +fi diff --git a/components/pr-test/provision.sh b/components/pr-test/provision.sh new file mode 100755 index 000000000..bc156a8a4 --- /dev/null +++ b/components/pr-test/provision.sh @@ -0,0 +1,106 @@ +#!/usr/bin/env bash +set -euo pipefail + +COMMAND="${1:-}" +INSTANCE_ID="${2:-}" + +CONFIG_NAMESPACE="ambient-code--config" +ARGOCD_NAMESPACE="${ARGOCD_NAMESPACE:-ambient-code--argocd}" +MAX_S0X_INSTANCES="${MAX_S0X_INSTANCES:-5}" +READY_TIMEOUT="${READY_TIMEOUT:-60}" +DELETE_TIMEOUT="${DELETE_TIMEOUT:-120}" + +usage() { + echo "Usage: $0 " + echo " instance-id: e.g. pr-123-feat-xyz" + echo "" + echo "Environment variables:" + echo " MAX_S0X_INSTANCES Maximum concurrent S0.x instances (default: 5)" + echo " READY_TIMEOUT Seconds to wait for namespace Active (default: 60)" + echo " DELETE_TIMEOUT Seconds to wait for namespace deletion (default: 120)" + exit 1 +} + +[[ -z "$COMMAND" || -z "$INSTANCE_ID" ]] && usage +[[ "$COMMAND" != "create" && "$COMMAND" != "destroy" ]] && usage + +NAMESPACE="ambient-code--${INSTANCE_ID}" + +create() { + echo "==> Checking S0.x instance capacity..." + ACTIVE=$(oc get tenantnamespace -n "$CONFIG_NAMESPACE" \ + -l ambient-code/instance-type=s0x --no-headers 2>/dev/null | wc -l | tr -d ' ') + + if [ "$ACTIVE" -ge "$MAX_S0X_INSTANCES" ]; then + echo "ERROR: At capacity โ€” $ACTIVE/$MAX_S0X_INSTANCES S0.x instances active." 
+    echo "Active instances:"
+    oc get tenantnamespace -n "$CONFIG_NAMESPACE" \
+      -l ambient-code/instance-type=s0x -o name
+    exit 1
+  fi
+  echo "  Capacity OK: $ACTIVE/$MAX_S0X_INSTANCES"
+
+  echo "==> Applying TenantNamespace CR: $INSTANCE_ID"
+  # NOTE: the apiVersion below is illustrative; the authoritative CR schema
+  # is documented in components/pr-test/README.md.
+  cat <<EOF | oc apply -f -
+apiVersion: tenant.paas.redhat.com/v1alpha1
+kind: TenantNamespace
+metadata:
+  name: ${INSTANCE_ID}
+  namespace: ${CONFIG_NAMESPACE}
+  labels:
+    ambient-code/instance-type: s0x
+EOF
+
+  echo "==> Waiting for namespace ${NAMESPACE} to become Active (timeout: ${READY_TIMEOUT}s)..."
+  DEADLINE=$((SECONDS + READY_TIMEOUT))
+  while [ $SECONDS -lt $DEADLINE ]; do
+    STATUS=$(oc get namespace "$NAMESPACE" -o jsonpath='{.status.phase}' 2>/dev/null || true)
+    if [ "$STATUS" == "Active" ]; then
+      echo "  Namespace ${NAMESPACE} is Active."
+      echo "$NAMESPACE"
+      exit 0
+    fi
+    echo "  status=${STATUS:-NotFound}, retrying..."
+    sleep 3
+  done
+
+  echo "ERROR: Namespace ${NAMESPACE} did not become Active within ${READY_TIMEOUT}s."
+  oc describe tenantnamespace "$INSTANCE_ID" -n "$CONFIG_NAMESPACE" || true
+  exit 1
+}
+
+destroy() {
+  APP_NAME="pr-test-${INSTANCE_ID}"
+  echo "==> Deleting ArgoCD Application: $APP_NAME"
+  oc delete application "$APP_NAME" -n "$ARGOCD_NAMESPACE" \
+    --ignore-not-found=true 2>/dev/null || true
+
+  echo "==> Deleting TenantNamespace CR: $INSTANCE_ID"
+  oc delete tenantnamespace "$INSTANCE_ID" -n "$CONFIG_NAMESPACE" \
+    --ignore-not-found=true
+
+  echo "==> Waiting for namespace ${NAMESPACE} to be deleted (timeout: ${DELETE_TIMEOUT}s)..."
+  DEADLINE=$((SECONDS + DELETE_TIMEOUT))
+  while [ $SECONDS -lt $DEADLINE ]; do
+    if ! oc get namespace "$NAMESPACE" &>/dev/null; then
+      echo "  Namespace ${NAMESPACE} deleted."
+      exit 0
+    fi
+    echo "  Namespace still exists, waiting..."
+    sleep 5
+  done
+
+  echo "WARNING: Namespace ${NAMESPACE} still exists after ${DELETE_TIMEOUT}s. May need manual cleanup."
+ exit 1 +} + +case "$COMMAND" in + create) create ;; + destroy) destroy ;; +esac diff --git a/docs/internal/developer/local-development/openshift.md b/docs/internal/developer/local-development/openshift.md index e57b806e0..ff3018b9a 100644 --- a/docs/internal/developer/local-development/openshift.md +++ b/docs/internal/developer/local-development/openshift.md @@ -1,382 +1,112 @@ # OpenShift Cluster Development -This guide covers deploying the Ambient Code Platform on OpenShift clusters for development and testing. Use this when you need to test OpenShift-specific features like Routes, OAuth integration, or service mesh capabilities. +This guide covers deploying Ambient Code on an OpenShift cluster using the **OpenShift internal image registry**. This is useful when iterating on local builds against a dev cluster without pushing to quay.io. -## Prerequisites +> **Standard deployment (quay.io images):** See the [Ambient installer skill](../../../../.claude/skills/ambient/SKILL.md) โ€” it covers secrets, kustomize deploy, rollout verification, and troubleshooting for any OpenShift namespace. -- `oc` CLI installed -- `podman` or `docker` installed -- Access to an OpenShift cluster +> **PR test instances:** See the [ambient-pr-test skill](../../../../.claude/skills/ambient-pr-test/SKILL.md). -## OpenShift Cluster Setup +--- -### Option 1: OpenShift Local (CRC) -For local development, see [crc.md](crc.md) for detailed CRC setup instructions. +## When to Use This Guide -### Option 2: Cloud OpenShift Cluster -For cloud clusters (ROSA, OCP on AWS/Azure/GCP), ensure you have cluster-admin access. 
+Use the internal registry approach when: +- You are iterating on local builds and do not want to push to quay.io on every change +- You are on a dev cluster with direct podman/docker access +- You need to test image changes that are not yet ready for a PR -### Option 3: Temporary Test Cluster -For temporary testing clusters, you can use cluster provisioning tools available in your organization. +For all other cases (PRs, production, ephemeral test instances), images are in quay.io and you should use the ambient skill directly. -## Registry Configuration +--- -### Enable OpenShift Internal Registry +## Prerequisites -Expose the internal image registry: +- `oc` CLI installed and logged in +- `podman` or `docker` installed locally +- Access to an OpenShift cluster (CRC, ROSA, OCP on cloud) -```bash -oc patch configs.imageregistry.operator.openshift.io/cluster --type merge --patch '{"spec":{"defaultRoute":true}}' -``` +--- -Get the registry hostname: +## Enable the OpenShift Internal Registry ```bash -oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}' -``` +oc patch configs.imageregistry.operator.openshift.io/cluster \ + --type merge --patch '{"spec":{"defaultRoute":true}}' -### Login to Registry +REGISTRY_HOST=$(oc get route default-route -n openshift-image-registry \ + --template='{{ .spec.host }}') -Authenticate podman to the OpenShift registry: - -```bash -REGISTRY_HOST=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}') -oc whoami -t | podman login --tls-verify=false -u kubeadmin --password-stdin "$REGISTRY_HOST" +oc whoami -t | podman login --tls-verify=false -u kubeadmin \ + --password-stdin "$REGISTRY_HOST" ``` -## Required Secrets Setup +--- -**IMPORTANT**: Create all required secrets **before** deploying. The deployment will fail if these secrets are missing. 
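Before building and pushing, it can help to confirm the registry route is actually serving and that the login took effect. A minimal check, assuming the `default-route` enabled above (`/v2/` is the standard container registry API root):

```bash
REGISTRY_HOST=$(oc get route default-route -n openshift-image-registry \
  --template='{{ .spec.host }}')

# A serving registry answers /v2/ with 200 (authenticated) or 401 (not yet logged in)
curl -sk -o /dev/null -w '%{http_code}\n' "https://${REGISTRY_HOST}/v2/"

# Prints the username podman has stored for this registry, or errors if not logged in
podman login --get-login "$REGISTRY_HOST"
```

If the curl prints `503` or times out, the registry operator has not finished rolling out the default route yet.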
+## Build and Push to Internal Registry -Create the project namespace: ```bash -oc new-project ambient-code -``` - -**MinIO credentials:** - -```bash -oc create secret generic minio-credentials -n ambient-code \ - --from-literal=root-user=admin \ - --from-literal=root-password=changeme123 -``` - -**PostgreSQL credentials (for Unleash feature flag database):** - -```bash -oc create secret generic postgresql-credentials -n ambient-code \ - --from-literal=db.host="postgresql" \ - --from-literal=db.port="5432" \ - --from-literal=db.name="postgres" \ - --from-literal=db.user="postgres" \ - --from-literal=db.password="postgres123" -``` +REGISTRY_HOST=$(oc get route default-route -n openshift-image-registry \ + --template='{{ .spec.host }}') +INTERNAL_REG="image-registry.openshift-image-registry.svc:5000/ambient-code" -**Unleash credentials (for feature flag service):** +for img in vteam_frontend vteam_backend vteam_operator vteam_public_api vteam_claude_runner vteam_api_server; do + podman tag localhost/${img}:latest ${REGISTRY_HOST}/ambient-code/${img}:latest + podman push ${REGISTRY_HOST}/ambient-code/${img}:latest +done -```bash -oc create secret generic unleash-credentials -n ambient-code \ - --from-literal=database-url="postgres://postgres:postgres123@postgresql:5432/unleash" \ - --from-literal=database-ssl="false" \ - --from-literal=admin-api-token="*:*.unleash-admin-token" \ - --from-literal=client-api-token="default:development.unleash-client-token" \ - --from-literal=frontend-api-token="default:development.unleash-frontend-token" \ - --from-literal=default-admin-password="unleash123" +oc rollout restart deployment backend-api frontend agentic-operator public-api ambient-api-server -n ambient-code ``` -## Platform Deployment +--- -The production kustomization in `components/manifests/overlays/production/kustomization.yaml` references `quay.io/ambient_code/*` images by default. 
When deploying to an OpenShift cluster using the internal registry, you must temporarily point the image refs at the internal registry, deploy, then **immediately revert** before committing.
+## Deploy with Internal Registry Images

-**⚠️ CRITICAL**: Never commit `kustomization.yaml` while it contains internal registry refs.
-
-**Patch kustomization to internal registry, deploy, then revert:**
+**⚠️ CRITICAL**: Never commit `kustomization.yaml` with internal registry refs.

 ```bash
-REGISTRY_HOST=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')
+# Image refs must use the internal service DNS name: in-cluster pulls through
+# it are covered by the image-puller role grant, while refs via the external
+# route would require per-namespace pull secrets.
 INTERNAL_REG="image-registry.openshift-image-registry.svc:5000/ambient-code"

-# Temporarily override image refs to internal registry
 cd components/manifests/overlays/production
 sed -i "s#newName: quay.io/ambient_code/#newName: ${INTERNAL_REG}/#g" kustomization.yaml

-# Deploy
 cd ../..
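# Optional sanity check before deploying (not part of the original flow):
# every override should now point at the internal registry, none at quay.io
grep -n 'newName:' overlays/production/kustomization.yaml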
./deploy.sh -# IMMEDIATELY revert โ€” do not commit with internal registry refs cd overlays/production git checkout kustomization.yaml ``` -## Common Deployment Issues and Fixes - -### Issue 1: Images not found (ImagePullBackOff) - -```bash -# Build and push required images to internal registry -REGISTRY_HOST=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}') - -# Tag and push key images (adjust based on what's available locally) -podman tag localhost/ambient_control_plane:latest ${REGISTRY_HOST}/ambient-code/ambient_control_plane:latest -podman tag localhost/vteam_frontend:latest ${REGISTRY_HOST}/ambient-code/vteam_frontend:latest -podman tag localhost/vteam_api_server:latest ${REGISTRY_HOST}/ambient-code/vteam_api_server:latest -podman tag localhost/vteam_backend:latest ${REGISTRY_HOST}/ambient-code/vteam_backend:latest -podman tag localhost/vteam_operator:latest ${REGISTRY_HOST}/ambient-code/vteam_operator:latest -podman tag localhost/vteam_public_api:latest ${REGISTRY_HOST}/ambient-code/vteam_public_api:latest -podman tag localhost/vteam_claude_runner:latest ${REGISTRY_HOST}/ambient-code/vteam_claude_runner:latest - -# Push images -for img in ambient_control_plane vteam_frontend vteam_api_server vteam_backend vteam_operator vteam_public_api vteam_claude_runner; do - podman push ${REGISTRY_HOST}/ambient-code/${img}:latest -done - -# Restart deployments to pick up new images -oc rollout restart deployment ambient-control-plane backend-api frontend public-api agentic-operator -n ambient-code -``` - -### Issue 2: API server TLS certificate missing - -```bash -# Add service annotation to generate TLS certificate -oc annotate service ambient-api-server service.beta.openshift.io/serving-cert-secret-name=ambient-api-server-tls -n ambient-code - -# Wait for certificate generation -sleep 10 - -# Restart API server to mount certificate -oc rollout restart deployment ambient-api-server -n ambient-code -``` - -### Issue 3: API server 
HTTPS configuration - -The ambient-api-server includes TLS support for production deployments. For development clusters, you may need to adjust the configuration: - -```bash -# Check if HTTPS is properly configured in the deployment -oc get deployment ambient-api-server -n ambient-code -o yaml | grep -A5 -B5 enable-https - -# Verify TLS certificate is mounted -oc describe deployment ambient-api-server -n ambient-code | grep -A10 -B5 tls -``` - -**Note:** The gRPC TLS for control plane communication provides end-to-end encryption for session monitoring. - -## Cross-Namespace Image Access - -The operator creates runner pods in dynamically-created project namespaces (e.g. `hyperfleet-test`). Those pods need to pull images from the `ambient-code` namespace. Grant all service accounts pull access: - -```bash -oc policy add-role-to-group system:image-puller system:serviceaccounts --namespace=ambient-code -``` - -Without this, runner pods will fail with `ErrImagePull` / `authentication required`. - -## Deployment Verification - -### Check Pod Status - -```bash -oc get pods -n ambient-code -``` - -**Expected output:** All pods should show `1/1 Running` or `2/2 Running` (frontend has oauth-proxy): -``` -NAME READY STATUS RESTARTS AGE -agentic-operator-xxxxx-xxxxx 1/1 Running 0 5m -ambient-api-server-xxxxx-xxxxx 1/1 Running 0 5m -ambient-api-server-db-xxxxx-xxxxx 1/1 Running 0 5m -ambient-control-plane-xxxxx-xxxxx 1/1 Running 0 5m -backend-api-xxxxx-xxxxx 1/1 Running 0 5m -frontend-xxxxx-xxxxx 2/2 Running 0 5m -minio-xxxxx-xxxxx 1/1 Running 0 5m -postgresql-xxxxx-xxxxx 1/1 Running 0 5m -public-api-xxxxx-xxxxx 1/1 Running 0 5m -unleash-xxxxx-xxxxx 1/1 Running 0 5m -``` - -### Test Database Connection - -```bash -oc exec deployment/ambient-api-server-db -n ambient-code -- psql -U ambient -d ambient_api_server -c "\dt" -``` - -**Expected:** Should show 6 database tables (events, migrations, project_settings, projects, sessions, users). 
- -### Verify Control Plane TLS Functionality - -```bash -# Check control plane is connecting via TLS gRPC -oc logs deployment/ambient-control-plane -n ambient-code --tail=10 | grep -i grpc - -# Verify API server gRPC streams are active -oc logs deployment/ambient-api-server -n ambient-code --tail=20 | grep "gRPC stream started" -``` - -**Expected:** You should see successful gRPC stream connections like: -``` -gRPC stream started /ambient.v1.ProjectService/WatchProjects -gRPC stream started /ambient.v1.SessionService/WatchSessions -``` - -## Platform Access - -### Get Platform URLs - -```bash -oc get route -n ambient-code -``` - -**Main routes:** -- **Frontend**: https://ambient-code.apps./ -- **Backend API**: https://backend-route-ambient-code.apps./ -- **Public API**: https://public-api-route-ambient-code.apps./ -- **Ambient API Server**: https://ambient-api-server-ambient-code.apps./ - -### Health Check - -```bash -curl -k https://backend-route-ambient-code.apps./health -# Expected: {"status":"healthy"} -``` - -## SDK Testing - -### Setup Environment Variables - -Set the SDK environment variables based on your current `oc` client configuration: - -```bash -# Auto-configure from current oc context -export AMBIENT_TOKEN="$(oc whoami -t)" # Use current user token -export AMBIENT_PROJECT="$(oc project -q)" # Use current project/namespace -export AMBIENT_API_URL="$(oc get route public-api-route --template='https://{{.spec.host}}')" # Get public API route -``` +--- -**Verify configuration:** -```bash -echo "Token: ${AMBIENT_TOKEN:0:12}... (${#AMBIENT_TOKEN} chars)" -echo "Project: $AMBIENT_PROJECT" -echo "API URL: $AMBIENT_API_URL" -``` +## JWT Configuration for Dev Clusters -### Test Go SDK +The production overlay configures JWT against Red Hat SSO (`sso.redhat.com`). 
On a personal dev cluster without SSO, disable JWT: ```bash -cd components/ambient-sdk/go-sdk -go run main.go +oc set env deployment/ambient-api-server -n ambient-code \ + --containers=api-server \ + ENABLE_JWT=false +oc rollout restart deployment/ambient-api-server -n ambient-code ``` -### Test Python SDK +Or patch the `ambient-api-server-jwt-args-patch.yaml` to set `--enable-jwt=false` before deploying. -```bash -cd components/ambient-sdk/python-sdk -./test.sh -``` +--- -Both SDKs should output successful session creation and listing. +## Cross-Namespace Image Pull -## CLI Testing - -Login to the ambient-control-plane using the CLI: +Runner pods are created in dynamic project namespaces and must pull from the `ambient-code` namespace in the internal registry: ```bash -acpctl login --url https://ambient-api-server-ambient-code.apps. --token $(oc whoami -t) +oc policy add-role-to-group system:image-puller system:serviceaccounts \ + --namespace=ambient-code ``` -## Authentication Configuration - -### API Token Setup +Without this, runner pods fail with `ErrImagePull` / `authentication required`. -The control plane authenticates to the API server using a bearer token. By default `deploy.sh` uses `oc whoami -t` (your current cluster token). To use a dedicated long-lived token instead, set it before deploying: - -```bash -export AMBIENT_API_TOKEN= -``` - -If `AMBIENT_API_TOKEN` is not set, the deploy script automatically creates the secret using your current `oc` session token. - -### Vertex AI Integration (Optional) - -The `deploy.sh` script reads `ANTHROPIC_VERTEX_PROJECT_ID` from your environment and sets `CLAUDE_CODE_USE_VERTEX=1` in the operator configmap. The operator then **requires** the `ambient-vertex` secret to exist in `ambient-code`. 
- -**Create this secret before running `make deploy` if using Vertex AI:** - -First, ensure you have Application Default Credentials: - -```bash -gcloud auth application-default login -``` - -Then create the secret: - -```bash -oc create secret generic ambient-vertex -n ambient-code \ - --from-file=ambient-code-key.json="$HOME/.config/gcloud/application_default_credentials.json" -``` - -Alternatively, if you have a service account key file: - -```bash -oc create secret generic ambient-vertex -n ambient-code \ - --from-file=ambient-code-key.json="/path/to/your-service-account-key.json" -``` - -**Note:** If you do NOT want to use Vertex AI and prefer direct Anthropic API, unset the env var before deploying: - -```bash -unset ANTHROPIC_VERTEX_PROJECT_ID -``` - -## OAuth Configuration - -OAuth configuration requires cluster-admin permissions for creating the OAuthClient resource. If you don't have cluster-admin, the deployment will warn you but other components will still deploy. - -## What the Deployment Provides - -- โœ… **Applies all CRDs** (Custom Resource Definitions) -- โœ… **Creates RBAC** roles and service accounts -- โœ… **Deploys all components** with correct OpenShift-compatible security contexts -- โœ… **Configures OAuth** integration automatically (with cluster-admin) -- โœ… **Creates all routes** for external access -- โœ… **Database migrations** run automatically with proper permissions - -## Troubleshooting - -### Missing public-api-route - -```bash -# Check if public-api is deployed -oc get route public-api-route -n $AMBIENT_PROJECT - -# If missing, deploy public-api component: -cd components/manifests -./deploy.sh -``` - -### Authentication errors - -```bash -# Verify token is valid -oc whoami - -# Check project access -oc get pods -n $AMBIENT_PROJECT -``` - -### API connection errors - -```bash -# Test API directly -curl -H "Authorization: Bearer $(oc whoami -t)" \ - -H "X-Ambient-Project: $(oc project -q)" \ - "$AMBIENT_API_URL/health" -``` +--- ## 
Next Steps -1. Access the frontend URL (from `oc get route -n ambient-code`) -2. Configure ANTHROPIC_API_KEY in project settings -3. Test SDKs using the commands above -4. Create your first AgenticSession via UI or SDK -5. Monitor with: `oc get pods -n ambient-code -w` +Once deployed, follow the verification and access steps in the [ambient skill](../../../../.claude/skills/ambient/SKILL.md#step-6-verify-installation).
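+After deploying with internal-registry overrides, a quick audit can confirm the running deployments actually picked up internal refs. A sketch, assuming the `ambient-code` namespace used throughout this guide:

```bash
# Print each deployment alongside the image its first container runs;
# every line should reference image-registry.openshift-image-registry.svc:5000
oc get deploy -n ambient-code -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
```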