generative-computing · lastras · May 21, 2026
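
Note on the timestamp change in `write_telemetry`: moving from `utcnow()` to a timezone-aware `now(...)` call also changes the serialized string. The old code appended a literal `Z`, while `isoformat()` on an aware UTC datetime emits a `+00:00` offset. The sketch below illustrates the difference with a fixed instant so the output is reproducible. It uses `datetime.timezone.utc`, which is the same object as `datetime.UTC` on Python 3.11+. The helper names are hypothetical and not part of this PR; `aware_timestamp_z` shows one possible way to keep the old suffix, assuming some telemetry consumer depends on it.

```python
import datetime


def legacy_timestamp(now: datetime.datetime) -> str:
    """Old behaviour: naive UTC datetime with a literal 'Z' appended."""
    return now.replace(tzinfo=None).isoformat() + "Z"


def aware_timestamp(now: datetime.datetime) -> str:
    """New behaviour: aware datetime, so isoformat() appends the UTC offset."""
    return now.isoformat()


def aware_timestamp_z(now: datetime.datetime) -> str:
    """Hypothetical helper (not in the PR): aware datetime, 'Z' suffix kept."""
    return now.isoformat().replace("+00:00", "Z")


# Fixed instant instead of the wall clock, so the three formats can be compared.
fixed = datetime.datetime(2026, 5, 21, 12, 0, 0, tzinfo=datetime.timezone.utc)

print(legacy_timestamp(fixed))   # 2026-05-21T12:00:00Z
print(aware_timestamp(fixed))    # 2026-05-21T12:00:00+00:00
print(aware_timestamp_z(fixed))  # 2026-05-21T12:00:00Z
```

Both forms are valid ISO 8601, so this only matters if something downstream compares or pattern-matches the raw string.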
@@ -4,35 +4,7 @@
    "cell_type": "markdown",
    "id": "a0000001",
    "metadata": {},
-   "source": [
-    "# ALORA vs LoRA Race\n",
-    "\n",
-    "**Duration:** ~20-40 min (composes a LoRA checkpoint, embeds the corpus, then runs two vLLM legs back to back; first run also downloads ~6 GB of weights)\n",
-    "\n",
-    "**Runtime note:** Each server run takes roughly the same wall time as a real race leg\n",
-    "(3–8 min depending on GPU). The two runs are sequential since Colab typically provides\n",
-    "one GPU; `race_live.html` replays them as if they raced.\n",
-    "\n",
-    "This notebook benchmarks two Granite Switch checkpoints — one using **ALORA** (which defers adapter activation to save prefill time) and one using standard **LoRA** — on the same multi-step RAG pipeline, and produces an animated HTML replay of the race. The two servers run sequentially (Colab usually provides one GPU); the replay stitches their telemetry together as if they had raced simultaneously.\n",
-    "\n",
-    "*Why vLLM:* much faster inference in production environments; HF support for Granite Switch in mellea coming. The ALORA prefill optimization is implemented in vLLM's Punica kernels.\n",
-    "\n",
-    "**What you'll learn:**\n",
-    "- How to run the same pipeline (guardian → query rewrite → retrieval → answerability → clarification → generation) against two Granite Switch checkpoints and produce `race_live.html` + `race_report.html` from the result\n",
-    "- How `--technology-filter lora` lets you compose a like-for-like LoRA-only counterpart to a published ALORA checkpoint\n",
-    "- How to launch, health-check, and tear down vLLM servers from a notebook without leaking GPU memory\n",
-    "- Where ALORA's prefill savings show up in the per-step latency breakdown\n",
-    "\n",
-    "**Adapters used:** the embedded ALORA checkpoint [`ibm-granite/granite-switch-4.1-3b-preview`](https://huggingface.co/ibm-granite/granite-switch-4.1-3b-preview) and a LoRA-only build of the same three IBM granitelib libraries — [Core](https://huggingface.co/ibm-granite/granitelib-core-r1.0), [RAG](https://huggingface.co/ibm-granite/granitelib-rag-r1.0), and [Guardian](https://huggingface.co/ibm-granite/granitelib-guardian-r1.0) — composed in section 3.\n",
-    "\n",
-    "## Prerequisites\n",
-    "\n",
-    "1. **GPU runtime.** A100 or better. In Colab: *Runtime → Change runtime type → A100 GPU*.\n",
-    "2. **HuggingFace login** (cell 4) so the `ibm-granite/*` checkpoints can download.\n",
-    "3. **Run cells in order.** Section 0 clones the repo and `cd`s into the race-script directory; later sections assume that working directory.\n",
-    "\n",
-    "New to this series? [`compose_granite_switch.ipynb`](./compose_granite_switch.ipynb) walks through the composer that section 3 calls. Full setup details (GPU sizes, multi-GPU, troubleshooting) are in [`PREREQUISITES.md`](../PREREQUISITES.md)."
-   ]
+   "source": "# ALORA vs LoRA Race\n\n**Duration:** ~20-40 min (composes both checkpoints, embeds the corpus, then runs two vLLM legs back to back; first run also downloads ~6 GB of weights)\n\n**Runtime note:** Each server run takes roughly the same wall time as a real race leg\n(3–8 min depending on GPU). The two runs are sequential since Colab typically provides\none GPU; `race_live.html` replays them as if they raced.\n\nThis notebook benchmarks two Granite Switch checkpoints — one using **ALORA** (which defers adapter activation to save prefill time) and one using standard **LoRA** — on the same multi-step RAG pipeline, and produces an animated HTML replay of the race. The two servers run sequentially (Colab usually provides one GPU); the replay stitches their telemetry together as if they had raced simultaneously.\n\n*Why vLLM:* much faster inference in production environments; HF support for Granite Switch in mellea is coming. The ALORA prefill optimization is implemented in vLLM's Punica kernels.\n\n**What you'll learn:**\n- How to run the same pipeline (guardian → query rewrite → retrieval → answerability → clarification → generation) against two Granite Switch checkpoints and produce `race_live.html` + `race_report.html` from the result\n- How composing without `--technology-filter` prefers ALORA adapters but falls back to LoRA, while `--technology-filter lora` forces a LoRA-only build\n- How to launch, health-check, and tear down vLLM servers from a notebook without leaking GPU memory\n- Where ALORA's prefill savings show up in the per-step latency breakdown\n\n**Adapters used:** both checkpoints are composed from the same three IBM granitelib libraries — [Core](https://huggingface.co/ibm-granite/granitelib-core-r1.0), [RAG](https://huggingface.co/ibm-granite/granitelib-rag-r1.0), and [Guardian](https://huggingface.co/ibm-granite/granitelib-guardian-r1.0) — on top of [granite-4.1-3b](https://huggingface.co/ibm-granite/granite-4.1-3b). The ALORA build (section 3) uses the default technology preference; the LoRA-only build (section 4) adds `--technology-filter lora`.\n\n## Prerequisites\n\n1. **GPU runtime.** A100 or better. In Colab: *Runtime → Change runtime type → A100 GPU*.\n2. **HuggingFace login** (cell 4) so the `ibm-granite/*` checkpoints can download.\n3. **Run cells in order.** Section 0 clones the repo and `cd`s into the race-script directory; later sections assume that working directory.\n\nNew to this series? [`compose_granite_switch.ipynb`](./compose_granite_switch.ipynb) walks through the composer that sections 3 and 4 call. Full setup details (GPU sizes, multi-GPU, troubleshooting) are in [`PREREQUISITES.md`](../PREREQUISITES.md)."
   },
   {
    "cell_type": "markdown",
@@ -191,11 +163,15 @@
    "cell_type": "markdown",
    "id": "a0000011",
    "metadata": {},
-   "source": [
-    "## 3 · ALORA server\n",
-    "\n",
-    "Start `granite-switch-4.1-3b-preview` on port 8111 and run the benchmark against it."
-   ]
+   "source": "## 3 · Compose the ALORA model and run the server\n\nCompose a checkpoint from the three IBM granitelib libraries without a technology filter —\nthe composer prefers ALORA adapters and falls back to LoRA where ALORA is unavailable.\nThen start the composed model on port 8111 and run the benchmark against it."
+  },
+  {
+   "cell_type": "code",
+   "id": "0960d016",
+   "source": "ALORA_MODEL_DIR = \"/content/granite-switch-alora-prefer\"\n\nimport os\nif os.path.exists(os.path.join(ALORA_MODEL_DIR, \"adapter_index.json\")):\n    print(f\"ALORA model already composed at {ALORA_MODEL_DIR} — skipping\")\nelse:\n    !python -m granite_switch.composer.compose_granite_switch \\\n      --base-model ibm-granite/granite-4.1-3b \\\n      --adapters ibm-granite/granitelib-rag-r1.0 \\\n                 ibm-granite/granitelib-core-r1.0 \\\n                 ibm-granite/granitelib-guardian-r1.0 \\\n      --output {ALORA_MODEL_DIR}",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
   },
   {
    "cell_type": "code",
@@ -234,28 +210,15 @@
    "id": "a0000012",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "alora_proc = launch_vllm(\n",
-    "    model    = \"ibm-granite/granite-switch-4.1-3b-preview\",\n",
-    "    port     = 8111,\n",
-    "    log_file = \"/content/vllm_alora.log\",\n",
-    ")\n",
-    "if not wait_for_server(8111):\n",
-    "    tail_log(\"/content/vllm_alora.log\")"
-   ]
+   "source": "alora_proc = launch_vllm(\n    model    = ALORA_MODEL_DIR,\n    port     = 8111,\n    log_file = \"/content/vllm_alora.log\",\n)\nif not wait_for_server(8111):\n    tail_log(\"/content/vllm_alora.log\")"
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "id": "a0000013",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# Benchmark the ALORA server.\n",
-    "# --no-live disables Rich Live (which floods notebook output with redrawn frames).\n",
-    "# The animated replay comes from race_live.html at the end.\n",
-    "!python bench_pipeline_race.py --mode sequential --server \"ALORA (8111)\" --no-live -n 16 -c 8 -k 10"
-   ]
+   "source": "# Benchmark the ALORA server.\n# --no-live disables Rich Live (which floods notebook output with redrawn frames).\n# The animated replay comes from race_live.html at the end.\n!python bench_pipeline_race.py --mode sequential --server \"ALORA (8111)\" --alora-model {ALORA_MODEL_DIR} --no-live -n 16 -c 8 -k 10"
   },
   {
    "cell_type": "code",
@@ -273,16 +236,7 @@
    "cell_type": "markdown",
    "id": "bm4l4b6xr5c",
    "metadata": {},
-   "source": [
-    "## 4 · Compose the LoRA-only model\n",
-    "\n",
-    "The ALORA model above is a pre-built checkpoint from the Hub. For a fair comparison we now\n",
-    "compose a **LoRA-only** version from the same adapter libraries, using `--technology-filter lora`\n",
-    "to force every adapter to its standard LoRA variant.\n",
-    "\n",
-    "This downloads the base model and adapter libraries (~6 GB on first run, cached after that)\n",
-    "and writes the composed checkpoint to `/content/granite-switch-lora-only`."
-   ]
+   "source": "## 4 · Compose the LoRA-only model\n\nFor a fair comparison we now compose a **LoRA-only** version from the same adapter\nlibraries, using `--technology-filter lora` to force every adapter to its standard\nLoRA variant.\n\nThis reuses the base model and adapter libraries fetched in section 3 (~6 GB on first\nrun, cached after that) and writes the composed checkpoint to\n`/content/granite-switch-lora-only`."
   },
   {
    "cell_type": "code",
@@ -454,4 +408,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 5
-}
+}
@@ -60,12 +60,6 @@
 logging.getLogger("fancy_logger").setLevel(logging.ERROR)
 
 
-from IPython.utils import coloransi
-if not hasattr(coloransi.TermColors, 'Green'):
-    coloransi.TermColors.Green = '\033[0;32m'
-    coloransi.TermColors.Blue = '\033[0;34m'
-    coloransi.TermColors.Normal = '\033[0m'
-
 def _detect_notebook():
     """Return True when running inside Jupyter / Colab."""
     try:
@@ -262,7 +256,7 @@ def run_timed_pipeline(query, ctx, backend, conv_json_idx=None, turn_idx=None, s
     ctx_with_query = ctx.add(MelleaMessage("user", query))
     t0  = time.perf_counter()
     harm_score = _call_or_dump("guardian_harm", ctx_with_query, conv_json_idx, turn_idx,
-                               guardian_check, ctx_with_query, backend, GUARDIAN_HARM_CRITERIA, target_role="user")
+                               guardian_check, ctx_with_query, backend, GUARDIAN_HARM_CRITERIA, scoring_schema="user_prompt")
     timings["guardian_harm"] = time.perf_counter() - t0
     if harm_score >= 0.5:
         work["exit"] = "harm_blocked"
@@ -887,7 +881,7 @@ def write_telemetry(server_results, adapter_tech, all_conv_results, labels, race
             "mode":        mode,
             "runs":        RUNS,
             "concurrency": CONCURRENCY_PER_SERVER,
-            "timestamp":   datetime.datetime.utcnow().isoformat() + "Z",
+            "timestamp":   datetime.datetime.now(datetime.timezone.utc).isoformat(),
             "race_wall":   race_wall,
         },
         "servers": servers_block,