From fefd2b314de9ba40c70f17f8d3f3b108375d71f6 Mon Sep 17 00:00:00 2001
From: "lastrasl@us.ibm.com;4A8621897;Luis Lastras"
 <lastrasl@p2-r12-n1.bluevela.rmf.ibm.com>
Date: Thu, 21 May 2026 17:38:26 +0000
Subject: [PATCH 1/5] tutorials: compose ALORA model from source in race
 notebook
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replace the pre-built Hub checkpoint (granite-switch-4.1-3b-preview)
with a freshly composed model, so both race legs are built from the
same three granitelib adapters using the same compose pipeline. The
ALORA leg passes no --technology-filter, so the composer prefers ALORA
adapters and falls back to LoRA where no ALORA variant is available.
This mirrors the existing LoRA leg, which passes --technology-filter
lora.
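
For reviewers, a minimal sketch of how the two compose invocations
relate. The module path, flags, model IDs, and output directories are
taken from this series; the compose_command helper is hypothetical, and
it assumes the LoRA leg uses the same base model and adapter order as
the new ALORA cell (that cell is not part of this diff).

```python
import shlex

BASE_MODEL = "ibm-granite/granite-4.1-3b"
ADAPTERS = [
    "ibm-granite/granitelib-rag-r1.0",
    "ibm-granite/granitelib-core-r1.0",
    "ibm-granite/granitelib-guardian-r1.0",
]


def compose_command(output_dir, technology_filter=None):
    """Build the composer argv for one race leg (hypothetical helper)."""
    cmd = [
        "python", "-m", "granite_switch.composer.compose_granite_switch",
        "--base-model", BASE_MODEL,
        "--adapters", *ADAPTERS,
        "--output", output_dir,
    ]
    # Only the LoRA leg passes a filter; the ALORA leg relies on the
    # composer's default technology preference.
    if technology_filter is not None:
        cmd += ["--technology-filter", technology_filter]
    return cmd


alora_cmd = compose_command("/content/granite-switch-alora-prefer")
lora_cmd = compose_command("/content/granite-switch-lora-only",
                           technology_filter="lora")

# Print the commands instead of running them, so the sketch has no
# side effects (no downloads, no GPU).
print(shlex.join(alora_cmd))
print(shlex.join(lora_cmd))
```

Apart from the output directory, the two commands differ only in the
trailing --technology-filter lora.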
---
 tutorials/notebooks/alora_vs_lora_race.ipynb | 56 +++++---------------
 1 file changed, 12 insertions(+), 44 deletions(-)

diff --git a/tutorials/notebooks/alora_vs_lora_race.ipynb b/tutorials/notebooks/alora_vs_lora_race.ipynb
index 331a00e..56b6dab 100644
--- a/tutorials/notebooks/alora_vs_lora_race.ipynb
+++ b/tutorials/notebooks/alora_vs_lora_race.ipynb
@@ -4,35 +4,7 @@
    "cell_type": "markdown",
    "id": "a0000001",
    "metadata": {},
-   "source": [
-    "# ALORA vs LoRA Race\n",
-    "\n",
-    "**Duration:** ~20-40 min (composes a LoRA checkpoint, embeds the corpus, then runs two vLLM legs back to back; first run also downloads ~6 GB of weights)\n",
-    "\n",
-    "**Runtime note:** Each server run takes roughly the same wall time as a real race leg\n",
-    "(3–8 min depending on GPU). The two runs are sequential since Colab typically provides\n",
-    "one GPU; `race_live.html` replays them as if they raced.\n",
-    "\n",
-    "This notebook benchmarks two Granite Switch checkpoints — one using **ALORA** (which defers adapter activation to save prefill time) and one using standard **LoRA** — on the same multi-step RAG pipeline, and produces an animated HTML replay of the race. The two servers run sequentially (Colab usually provides one GPU); the replay stitches their telemetry together as if they had raced simultaneously.\n",
-    "\n",
-    "*Why vLLM:* much faster inference in production environments; HF support for Granite Switch in mellea coming. The ALORA prefill optimization is implemented in vLLM's Punica kernels.\n",
-    "\n",
-    "**What you'll learn:**\n",
-    "- How to run the same pipeline (guardian → query rewrite → retrieval → answerability → clarification → generation) against two Granite Switch checkpoints and produce `race_live.html` + `race_report.html` from the result\n",
-    "- How `--technology-filter lora` lets you compose a like-for-like LoRA-only counterpart to a published ALORA checkpoint\n",
-    "- How to launch, health-check, and tear down vLLM servers from a notebook without leaking GPU memory\n",
-    "- Where ALORA's prefill savings show up in the per-step latency breakdown\n",
-    "\n",
-    "**Adapters used:** the embedded ALORA checkpoint [`ibm-granite/granite-switch-4.1-3b-preview`](https://huggingface.co/ibm-granite/granite-switch-4.1-3b-preview) and a LoRA-only build of the same three IBM granitelib libraries — [Core](https://huggingface.co/ibm-granite/granitelib-core-r1.0), [RAG](https://huggingface.co/ibm-granite/granitelib-rag-r1.0), and [Guardian](https://huggingface.co/ibm-granite/granitelib-guardian-r1.0) — composed in section 3.\n",
-    "\n",
-    "## Prerequisites\n",
-    "\n",
-    "1. **GPU runtime.** A100 or better. In Colab: *Runtime → Change runtime type → A100 GPU*.\n",
-    "2. **HuggingFace login** (cell 4) so the `ibm-granite/*` checkpoints can download.\n",
-    "3. **Run cells in order.** Section 0 clones the repo and `cd`s into the race-script directory; later sections assume that working directory.\n",
-    "\n",
-    "New to this series? [`compose_granite_switch.ipynb`](./compose_granite_switch.ipynb) walks through the composer that section 3 calls. Full setup details (GPU sizes, multi-GPU, troubleshooting) are in [`PREREQUISITES.md`](../PREREQUISITES.md)."
-   ]
+   "source": "# ALORA vs LoRA Race\n\n**Duration:** ~20-40 min (composes both checkpoints, embeds the corpus, then runs two vLLM legs back to back; first run also downloads ~6 GB of weights)\n\n**Runtime note:** Each server run takes roughly the same wall time as a real race leg\n(3–8 min depending on GPU). The two runs are sequential since Colab typically provides\none GPU; `race_live.html` replays them as if they raced.\n\nThis notebook benchmarks two Granite Switch checkpoints — one using **ALORA** (which defers adapter activation to save prefill time) and one using standard **LoRA** — on the same multi-step RAG pipeline, and produces an animated HTML replay of the race. The two servers run sequentially (Colab usually provides one GPU); the replay stitches their telemetry together as if they had raced simultaneously.\n\n*Why vLLM:* much faster inference in production environments; HF support for Granite Switch in mellea coming. The ALORA prefill optimization is implemented in vLLM's Punica kernels.\n\n**What you'll learn:**\n- How to run the same pipeline (guardian → query rewrite → retrieval → answerability → clarification → generation) against two Granite Switch checkpoints and produce `race_live.html` + `race_report.html` from the result\n- How composing without `--technology-filter` prefers ALORA adapters but falls back to LoRA, while `--technology-filter lora` forces a LoRA-only build\n- How to launch, health-check, and tear down vLLM servers from a notebook without leaking GPU memory\n- Where ALORA's prefill savings show up in the per-step latency breakdown\n\n**Adapters used:** both checkpoints are composed from the same three IBM granitelib libraries — [Core](https://huggingface.co/ibm-granite/granitelib-core-r1.0), [RAG](https://huggingface.co/ibm-granite/granitelib-rag-r1.0), and [Guardian](https://huggingface.co/ibm-granite/granitelib-guardian-r1.0) — on top of [granite-4.1-3b](https://huggingface.co/ibm-granite/granite-4.1-3b). The ALORA build (section 3) uses the default technology preference; the LoRA-only build (section 4) adds `--technology-filter lora`.\n\n## Prerequisites\n\n1. **GPU runtime.** A100 or better. In Colab: *Runtime → Change runtime type → A100 GPU*.\n2. **HuggingFace login** (cell 4) so the `ibm-granite/*` checkpoints can download.\n3. **Run cells in order.** Section 0 clones the repo and `cd`s into the race-script directory; later sections assume that working directory.\n\nNew to this series? [`compose_granite_switch.ipynb`](./compose_granite_switch.ipynb) walks through the composer that sections 3 and 4 call. Full setup details (GPU sizes, multi-GPU, troubleshooting) are in [`PREREQUISITES.md`](../PREREQUISITES.md)."
   },
   {
    "cell_type": "markdown",
@@ -191,11 +163,15 @@
    "cell_type": "markdown",
    "id": "a0000011",
    "metadata": {},
-   "source": [
-    "## 3 · ALORA server\n",
-    "\n",
-    "Start `granite-switch-4.1-3b-preview` on port 8111 and run the benchmark against it."
-   ]
+   "source": "## 3 · Compose the ALORA model and run the server\n\nCompose a checkpoint from the three IBM granitelib libraries without a technology filter —\nthe composer prefers ALORA adapters and falls back to LoRA where ALORA is unavailable.\nThen start the composed model on port 8111 and run the benchmark against it."
+  },
+  {
+   "cell_type": "code",
+   "id": "0960d016",
+   "source": "ALORA_MODEL_DIR = \"/content/granite-switch-alora-prefer\"\n\nimport os\nif os.path.exists(os.path.join(ALORA_MODEL_DIR, \"adapter_index.json\")):\n    print(f\"ALORA model already composed at {ALORA_MODEL_DIR} — skipping\")\nelse:\n    !python -m granite_switch.composer.compose_granite_switch \\\n      --base-model ibm-granite/granite-4.1-3b \\\n      --adapters ibm-granite/granitelib-rag-r1.0 \\\n                 ibm-granite/granitelib-core-r1.0 \\\n                 ibm-granite/granitelib-guardian-r1.0 \\\n      --output {ALORA_MODEL_DIR}",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": []
   },
   {
    "cell_type": "code",
@@ -234,15 +210,7 @@
    "id": "a0000012",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "alora_proc = launch_vllm(\n",
-    "    model    = \"ibm-granite/granite-switch-4.1-3b-preview\",\n",
-    "    port     = 8111,\n",
-    "    log_file = \"/content/vllm_alora.log\",\n",
-    ")\n",
-    "if not wait_for_server(8111):\n",
-    "    tail_log(\"/content/vllm_alora.log\")"
-   ]
+   "source": "alora_proc = launch_vllm(\n    model    = ALORA_MODEL_DIR,\n    port     = 8111,\n    log_file = \"/content/vllm_alora.log\",\n)\nif not wait_for_server(8111):\n    tail_log(\"/content/vllm_alora.log\")"
   },
   {
    "cell_type": "code",
@@ -454,4 +422,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 5
-}
+}
\ No newline at end of file

From 809620513a17652904f65f932b045e0281165d62 Mon Sep 17 00:00:00 2001
From: "lastrasl@us.ibm.com;4A8621897;Luis Lastras"
 <lastrasl@p2-r12-n1.bluevela.rmf.ibm.com>
Date: Thu, 21 May 2026 18:07:56 +0000
Subject: [PATCH 2/5] tutorials: pass --alora-model to bench_pipeline_race in
 ALORA leg

Also fix stale section 4 markdown that still referenced the Hub
checkpoint after the previous commit switched to composing from source.
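
A rough sketch of the resulting ALORA-leg invocation, for readers who
drive the benchmark from a script rather than a notebook cell. All
flags and values are copied from the cell in this patch; the
bench_command helper and its parameter names are hypothetical, and the
meaning of -n, -c and -k is deliberately left uninterpreted.

```python
import shlex

ALORA_MODEL_DIR = "/content/granite-switch-alora-prefer"


def bench_command(server, alora_model, n=16, c=8, k=10):
    """Build the benchmark argv for one sequential leg (hypothetical helper)."""
    return [
        "python", "bench_pipeline_race.py",
        "--mode", "sequential",
        "--server", server,
        "--alora-model", alora_model,
        "--no-live",
        "-n", str(n), "-c", str(c), "-k", str(k),
    ]


cmd = bench_command("ALORA (8111)", ALORA_MODEL_DIR)

# shlex.join quotes the server label, which contains a space and
# parentheses, so the printed line is safe to paste into a shell.
print(shlex.join(cmd))
```

Passing the command as a list (for example to subprocess.run) sidesteps
shell quoting of the "ALORA (8111)" label altogether.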
---
 tutorials/notebooks/alora_vs_lora_race.ipynb | 18 ++----------------
 1 file changed, 2 insertions(+), 16 deletions(-)

diff --git a/tutorials/notebooks/alora_vs_lora_race.ipynb b/tutorials/notebooks/alora_vs_lora_race.ipynb
index 56b6dab..ac8b215 100644
--- a/tutorials/notebooks/alora_vs_lora_race.ipynb
+++ b/tutorials/notebooks/alora_vs_lora_race.ipynb
@@ -218,12 +218,7 @@
    "id": "a0000013",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "# Benchmark the ALORA server.\n",
-    "# --no-live disables Rich Live (which floods notebook output with redrawn frames).\n",
-    "# The animated replay comes from race_live.html at the end.\n",
-    "!python bench_pipeline_race.py --mode sequential --server \"ALORA (8111)\" --no-live -n 16 -c 8 -k 10"
-   ]
+   "source": "# Benchmark the ALORA server.\n# --no-live disables Rich Live (which floods notebook output with redrawn frames).\n# The animated replay comes from race_live.html at the end.\n!python bench_pipeline_race.py --mode sequential --server \"ALORA (8111)\" --alora-model {ALORA_MODEL_DIR} --no-live -n 16 -c 8 -k 10"
   },
   {
    "cell_type": "code",
@@ -241,16 +236,7 @@
    "cell_type": "markdown",
    "id": "bm4l4b6xr5c",
    "metadata": {},
-   "source": [
-    "## 4 · Compose the LoRA-only model\n",
-    "\n",
-    "The ALORA model above is a pre-built checkpoint from the Hub. For a fair comparison we now\n",
-    "compose a **LoRA-only** version from the same adapter libraries, using `--technology-filter lora`\n",
-    "to force every adapter to its standard LoRA variant.\n",
-    "\n",
-    "This downloads the base model and adapter libraries (~6 GB on first run, cached after that)\n",
-    "and writes the composed checkpoint to `/content/granite-switch-lora-only`."
-   ]
+   "source": "## 4 · Compose the LoRA-only model\n\nFor a fair comparison we now compose a **LoRA-only** version from the same adapter\nlibraries, using `--technology-filter lora` to force every adapter to its standard\nLoRA variant.\n\nThis downloads the adapter libraries (~6 GB on first run, cached after that) and writes\nthe composed checkpoint to `/content/granite-switch-lora-only`."
   },
   {
    "cell_type": "code",

From a0abf1b6071803df02fc6d01661f0ec62db0b5e4 Mon Sep 17 00:00:00 2001
From: "lastrasl@us.ibm.com;4A8621897;Luis Lastras"
 <lastrasl@p2-r12-n1.bluevela.rmf.ibm.com>
Date: Thu, 21 May 2026 18:13:41 +0000
Subject: [PATCH 3/5] tutorials: replace deprecated target_role with
 scoring_schema in bench

guardian_check no longer accepts the target_role= kwarg; pass
scoring_schema='user_prompt' instead.
---
 .../comparison/alora_vs_lora_race/bench_pipeline_race.py        | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py b/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py
index f0e7b72..6892d4f 100644
--- a/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py
+++ b/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py
@@ -262,7 +262,7 @@ def run_timed_pipeline(query, ctx, backend, conv_json_idx=None, turn_idx=None, s
     ctx_with_query = ctx.add(MelleaMessage("user", query))
     t0  = time.perf_counter()
     harm_score = _call_or_dump("guardian_harm", ctx_with_query, conv_json_idx, turn_idx,
-                               guardian_check, ctx_with_query, backend, GUARDIAN_HARM_CRITERIA, target_role="user")
+                               guardian_check, ctx_with_query, backend, GUARDIAN_HARM_CRITERIA, scoring_schema="user_prompt")
     timings["guardian_harm"] = time.perf_counter() - t0
     if harm_score >= 0.5:
         work["exit"] = "harm_blocked"

From e9b723d21582a17714f631d34fed17cd38d10c01 Mon Sep 17 00:00:00 2001
From: "lastrasl@us.ibm.com;4A8621897;Luis Lastras"
 <lastrasl@p2-r12-n1.bluevela.rmf.ibm.com>
Date: Thu, 21 May 2026 18:25:40 +0000
Subject: [PATCH 4/5] tutorials: replace deprecated utcnow() with
 datetime.now(datetime.UTC)

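Two side effects worth knowing when reviewing this change:
datetime.UTC only exists on Python 3.11 and later, and an aware
datetime serializes with a "+00:00" offset rather than the "Z" suffix
the old code appended by hand. The sketch below illustrates the
difference with a fixed, made-up instant; it uses
datetime.timezone.utc (which datetime.UTC aliases) so that it also runs
on older interpreters. The final replace() line is one possible way to
keep the old suffix if a telemetry consumer depends on it, not
something this patch does.

```python
import datetime

# Fixed instant so the output is reproducible (illustrative value only).
utc = datetime.timezone.utc  # datetime.UTC is an alias for this on 3.11+
aware = datetime.datetime(2026, 5, 21, 18, 25, 40, tzinfo=utc)
naive = aware.replace(tzinfo=None)  # same wall-clock fields utcnow() returns

old_style = naive.isoformat() + "Z"   # what the removed line produced
new_style = aware.isoformat()         # what the new line produces

print(old_style)  # 2026-05-21T18:25:40Z
print(new_style)  # 2026-05-21T18:25:40+00:00

# Optional compatibility shim: restore the "Z" suffix on an aware value.
compat = aware.isoformat().replace("+00:00", "Z")
print(compat)     # 2026-05-21T18:25:40Z
```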
---
 .../comparison/alora_vs_lora_race/bench_pipeline_race.py        | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py b/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py
index 6892d4f..51e5724 100644
--- a/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py
+++ b/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py
@@ -887,7 +887,7 @@ def write_telemetry(server_results, adapter_tech, all_conv_results, labels, race
             "mode":        mode,
             "runs":        RUNS,
             "concurrency": CONCURRENCY_PER_SERVER,
-            "timestamp":   datetime.datetime.utcnow().isoformat() + "Z",
+            "timestamp":   datetime.datetime.now(datetime.UTC).isoformat(),
             "race_wall":   race_wall,
         },
         "servers": servers_block,

From 58a4bc57ee780ad0b7ebd09c7c99c920e53c5bc9 Mon Sep 17 00:00:00 2001
From: "lastrasl@us.ibm.com;4A8621897;Luis Lastras"
 <lastrasl@p2-r12-n1.bluevela.rmf.ibm.com>
Date: Thu, 21 May 2026 18:27:32 +0000
Subject: [PATCH 5/5] tutorials: drop IPython dependency from
 bench_pipeline_race
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The coloransi block patched TermColors attributes that were never read
anywhere in the file, so it was dead code. Removing it lets the script
run outside Jupyter without IPython installed.
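
For context, a minimal sketch of the general pattern that keeps IPython
optional: import it lazily inside the function that needs it and treat
ImportError as "not in a notebook". This is not the script's
_detect_notebook implementation (its body is outside this diff); the
function name and the kernel-attribute check are assumptions made for
illustration.

```python
def detect_notebook_sketch():
    """Return True when running under a Jupyter-style kernel (illustrative)."""
    try:
        # Lazy import: a missing IPython must not break plain-Python runs.
        from IPython import get_ipython
    except ImportError:
        return False
    shell = get_ipython()  # None when no IPython shell is active
    # Assumption: kernel-backed shells expose a `kernel` attribute,
    # while a terminal IPython session does not.
    return shell is not None and hasattr(shell, "kernel")


print(detect_notebook_sketch())
```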
---
 .../comparison/alora_vs_lora_race/bench_pipeline_race.py    | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py b/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py
index 51e5724..ffce581 100644
--- a/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py
+++ b/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py
@@ -60,12 +60,6 @@
 logging.getLogger("fancy_logger").setLevel(logging.ERROR)
 
 
-from IPython.utils import coloransi
-if not hasattr(coloransi.TermColors, 'Green'):
-    coloransi.TermColors.Green = '\033[0;32m'
-    coloransi.TermColors.Blue = '\033[0;34m'
-    coloransi.TermColors.Normal = '\033[0m'
-
 def _detect_notebook():
     """Return True when running inside Jupyter / Colab."""
     try: