From fefd2b314de9ba40c70f17f8d3f3b108375d71f6 Mon Sep 17 00:00:00 2001 From: "lastrasl@us.ibm.com;4A8621897;Luis Lastras" Date: Thu, 21 May 2026 17:38:26 +0000 Subject: [PATCH 1/5] tutorials: compose ALORA model from source in race notebook MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace the pre-built Hub checkpoint (granite-switch-4.1-3b-preview) with a freshly composed model, so both race legs are built from the same three granitelib adapters using the same compose pipeline. The ALORA leg uses no --technology-filter so the composer prefers ALORA adapters and falls back to LoRA where unavailable — symmetric with the existing LoRA leg which uses --technology-filter lora. --- tutorials/notebooks/alora_vs_lora_race.ipynb | 56 +++++--------------- 1 file changed, 12 insertions(+), 44 deletions(-) diff --git a/tutorials/notebooks/alora_vs_lora_race.ipynb b/tutorials/notebooks/alora_vs_lora_race.ipynb index 331a00e..56b6dab 100644 --- a/tutorials/notebooks/alora_vs_lora_race.ipynb +++ b/tutorials/notebooks/alora_vs_lora_race.ipynb @@ -4,35 +4,7 @@ "cell_type": "markdown", "id": "a0000001", "metadata": {}, - "source": [ - "# ALORA vs LoRA Race\n", - "\n", - "**Duration:** ~20-40 min (composes a LoRA checkpoint, embeds the corpus, then runs two vLLM legs back to back; first run also downloads ~6 GB of weights)\n", - "\n", - "**Runtime note:** Each server run takes roughly the same wall time as a real race leg\n", - "(3–8 min depending on GPU). The two runs are sequential since Colab typically provides\n", - "one GPU; `race_live.html` replays them as if they raced.\n", - "\n", - "This notebook benchmarks two Granite Switch checkpoints — one using **ALORA** (which defers adapter activation to save prefill time) and one using standard **LoRA** — on the same multi-step RAG pipeline, and produces an animated HTML replay of the race. 
The two servers run sequentially (Colab usually provides one GPU); the replay stitches their telemetry together as if they had raced simultaneously.\n", - "\n", - "*Why vLLM:* much faster inference in production environments; HF support for Granite Switch in mellea coming. The ALORA prefill optimization is implemented in vLLM's Punica kernels.\n", - "\n", - "**What you'll learn:**\n", - "- How to run the same pipeline (guardian → query rewrite → retrieval → answerability → clarification → generation) against two Granite Switch checkpoints and produce `race_live.html` + `race_report.html` from the result\n", - "- How `--technology-filter lora` lets you compose a like-for-like LoRA-only counterpart to a published ALORA checkpoint\n", - "- How to launch, health-check, and tear down vLLM servers from a notebook without leaking GPU memory\n", - "- Where ALORA's prefill savings show up in the per-step latency breakdown\n", - "\n", - "**Adapters used:** the embedded ALORA checkpoint [`ibm-granite/granite-switch-4.1-3b-preview`](https://huggingface.co/ibm-granite/granite-switch-4.1-3b-preview) and a LoRA-only build of the same three IBM granitelib libraries — [Core](https://huggingface.co/ibm-granite/granitelib-core-r1.0), [RAG](https://huggingface.co/ibm-granite/granitelib-rag-r1.0), and [Guardian](https://huggingface.co/ibm-granite/granitelib-guardian-r1.0) — composed in section 3.\n", - "\n", - "## Prerequisites\n", - "\n", - "1. **GPU runtime.** A100 or better. In Colab: *Runtime → Change runtime type → A100 GPU*.\n", - "2. **HuggingFace login** (cell 4) so the `ibm-granite/*` checkpoints can download.\n", - "3. **Run cells in order.** Section 0 clones the repo and `cd`s into the race-script directory; later sections assume that working directory.\n", - "\n", - "New to this series? [`compose_granite_switch.ipynb`](./compose_granite_switch.ipynb) walks through the composer that section 3 calls. 
Full setup details (GPU sizes, multi-GPU, troubleshooting) are in [`PREREQUISITES.md`](../PREREQUISITES.md)." - ] + "source": "# ALORA vs LoRA Race\n\n**Duration:** ~20-40 min (composes both checkpoints, embeds the corpus, then runs two vLLM legs back to back; first run also downloads ~6 GB of weights)\n\n**Runtime note:** Each server run takes roughly the same wall time as a real race leg\n(3–8 min depending on GPU). The two runs are sequential since Colab typically provides\none GPU; `race_live.html` replays them as if they raced.\n\nThis notebook benchmarks two Granite Switch checkpoints — one using **ALORA** (which defers adapter activation to save prefill time) and one using standard **LoRA** — on the same multi-step RAG pipeline, and produces an animated HTML replay of the race. The two servers run sequentially (Colab usually provides one GPU); the replay stitches their telemetry together as if they had raced simultaneously.\n\n*Why vLLM:* much faster inference in production environments; HF support for Granite Switch in mellea coming. 
The ALORA prefill optimization is implemented in vLLM's Punica kernels.\n\n**What you'll learn:**\n- How to run the same pipeline (guardian → query rewrite → retrieval → answerability → clarification → generation) against two Granite Switch checkpoints and produce `race_live.html` + `race_report.html` from the result\n- How composing without `--technology-filter` prefers ALORA adapters but falls back to LoRA, while `--technology-filter lora` forces a LoRA-only build\n- How to launch, health-check, and tear down vLLM servers from a notebook without leaking GPU memory\n- Where ALORA's prefill savings show up in the per-step latency breakdown\n\n**Adapters used:** both checkpoints are composed from the same three IBM granitelib libraries — [Core](https://huggingface.co/ibm-granite/granitelib-core-r1.0), [RAG](https://huggingface.co/ibm-granite/granitelib-rag-r1.0), and [Guardian](https://huggingface.co/ibm-granite/granitelib-guardian-r1.0) — on top of [granite-4.1-3b](https://huggingface.co/ibm-granite/granite-4.1-3b). The ALORA build (section 3) uses the default technology preference; the LoRA-only build (section 4) adds `--technology-filter lora`.\n\n## Prerequisites\n\n1. **GPU runtime.** A100 or better. In Colab: *Runtime → Change runtime type → A100 GPU*.\n2. **HuggingFace login** (cell 4) so the `ibm-granite/*` checkpoints can download.\n3. **Run cells in order.** Section 0 clones the repo and `cd`s into the race-script directory; later sections assume that working directory.\n\nNew to this series? [`compose_granite_switch.ipynb`](./compose_granite_switch.ipynb) walks through the composer that sections 3 and 4 call. Full setup details (GPU sizes, multi-GPU, troubleshooting) are in [`PREREQUISITES.md`](../PREREQUISITES.md)." 
}, { "cell_type": "markdown", @@ -191,11 +163,15 @@ "cell_type": "markdown", "id": "a0000011", "metadata": {}, - "source": [ - "## 3 · ALORA server\n", - "\n", - "Start `granite-switch-4.1-3b-preview` on port 8111 and run the benchmark against it." - ] + "source": "## 3 · Compose the ALORA model and run the server\n\nCompose a checkpoint from the three IBM granitelib libraries without a technology filter —\nthe composer prefers ALORA adapters and falls back to LoRA where ALORA is unavailable.\nThen start the composed model on port 8111 and run the benchmark against it." + }, + { + "cell_type": "code", + "id": "0960d016", + "source": "ALORA_MODEL_DIR = \"/content/granite-switch-alora-prefer\"\n\nimport os\nif os.path.exists(os.path.join(ALORA_MODEL_DIR, \"adapter_index.json\")):\n print(f\"ALORA model already composed at {ALORA_MODEL_DIR} — skipping\")\nelse:\n !python -m granite_switch.composer.compose_granite_switch \\\n --base-model ibm-granite/granite-4.1-3b \\\n --adapters ibm-granite/granitelib-rag-r1.0 \\\n ibm-granite/granitelib-core-r1.0 \\\n ibm-granite/granitelib-guardian-r1.0 \\\n --output {ALORA_MODEL_DIR}", + "metadata": {}, + "execution_count": null, + "outputs": [] }, { "cell_type": "code", @@ -234,15 +210,7 @@ "id": "a0000012", "metadata": {}, "outputs": [], - "source": [ - "alora_proc = launch_vllm(\n", - " model = \"ibm-granite/granite-switch-4.1-3b-preview\",\n", - " port = 8111,\n", - " log_file = \"/content/vllm_alora.log\",\n", - ")\n", - "if not wait_for_server(8111):\n", - " tail_log(\"/content/vllm_alora.log\")" - ] + "source": "alora_proc = launch_vllm(\n model = ALORA_MODEL_DIR,\n port = 8111,\n log_file = \"/content/vllm_alora.log\",\n)\nif not wait_for_server(8111):\n tail_log(\"/content/vllm_alora.log\")" }, { "cell_type": "code", @@ -454,4 +422,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file From 809620513a17652904f65f932b045e0281165d62 Mon Sep 17 00:00:00 2001 From: "lastrasl@us.ibm.com;4A8621897;Luis 
Lastras" Date: Thu, 21 May 2026 18:07:56 +0000 Subject: [PATCH 2/5] tutorials: pass --alora-model to bench_pipeline_race in ALORA leg Also fix stale section 4 markdown that still referenced the Hub checkpoint after the previous commit switched to composing from source. --- tutorials/notebooks/alora_vs_lora_race.ipynb | 18 ++---------------- 1 file changed, 2 insertions(+), 16 deletions(-) diff --git a/tutorials/notebooks/alora_vs_lora_race.ipynb b/tutorials/notebooks/alora_vs_lora_race.ipynb index 56b6dab..ac8b215 100644 --- a/tutorials/notebooks/alora_vs_lora_race.ipynb +++ b/tutorials/notebooks/alora_vs_lora_race.ipynb @@ -218,12 +218,7 @@ "id": "a0000013", "metadata": {}, "outputs": [], - "source": [ - "# Benchmark the ALORA server.\n", - "# --no-live disables Rich Live (which floods notebook output with redrawn frames).\n", - "# The animated replay comes from race_live.html at the end.\n", - "!python bench_pipeline_race.py --mode sequential --server \"ALORA (8111)\" --no-live -n 16 -c 8 -k 10" - ] + "source": "# Benchmark the ALORA server.\n# --no-live disables Rich Live (which floods notebook output with redrawn frames).\n# The animated replay comes from race_live.html at the end.\n!python bench_pipeline_race.py --mode sequential --server \"ALORA (8111)\" --alora-model {ALORA_MODEL_DIR} --no-live -n 16 -c 8 -k 10" }, { "cell_type": "code", @@ -241,16 +236,7 @@ "cell_type": "markdown", "id": "bm4l4b6xr5c", "metadata": {}, - "source": [ - "## 4 · Compose the LoRA-only model\n", - "\n", - "The ALORA model above is a pre-built checkpoint from the Hub. For a fair comparison we now\n", - "compose a **LoRA-only** version from the same adapter libraries, using `--technology-filter lora`\n", - "to force every adapter to its standard LoRA variant.\n", - "\n", - "This downloads the base model and adapter libraries (~6 GB on first run, cached after that)\n", - "and writes the composed checkpoint to `/content/granite-switch-lora-only`." 
- ] + "source": "## 4 · Compose the LoRA-only model\n\nFor a fair comparison we now compose a **LoRA-only** version from the same adapter\nlibraries, using `--technology-filter lora` to force every adapter to its standard\nLoRA variant.\n\nThis downloads the adapter libraries (~6 GB on first run, cached after that) and writes\nthe composed checkpoint to `/content/granite-switch-lora-only`." }, { "cell_type": "code", From a0abf1b6071803df02fc6d01661f0ec62db0b5e4 Mon Sep 17 00:00:00 2001 From: "lastrasl@us.ibm.com;4A8621897;Luis Lastras" Date: Thu, 21 May 2026 18:13:41 +0000 Subject: [PATCH 3/5] tutorials: replace deprecated target_role with scoring_schema in bench guardian_check's target_role= kwarg was removed; use scoring_schema='user_prompt'. --- .../comparison/alora_vs_lora_race/bench_pipeline_race.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py b/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py index f0e7b72..6892d4f 100644 --- a/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py +++ b/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py @@ -262,7 +262,7 @@ def run_timed_pipeline(query, ctx, backend, conv_json_idx=None, turn_idx=None, s ctx_with_query = ctx.add(MelleaMessage("user", query)) t0 = time.perf_counter() harm_score = _call_or_dump("guardian_harm", ctx_with_query, conv_json_idx, turn_idx, - guardian_check, ctx_with_query, backend, GUARDIAN_HARM_CRITERIA, target_role="user") + guardian_check, ctx_with_query, backend, GUARDIAN_HARM_CRITERIA, scoring_schema="user_prompt") timings["guardian_harm"] = time.perf_counter() - t0 if harm_score >= 0.5: work["exit"] = "harm_blocked" From e9b723d21582a17714f631d34fed17cd38d10c01 Mon Sep 17 00:00:00 2001 From: "lastrasl@us.ibm.com;4A8621897;Luis Lastras" Date: Thu, 21 May 2026 18:25:40 +0000 Subject: [PATCH 4/5] tutorials: replace deprecated utcnow() with 
datetime.now(datetime.UTC) The serialized timestamp now ends in +00:00 instead of Z; both denote UTC. --- .../comparison/alora_vs_lora_race/bench_pipeline_race.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py b/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py index 6892d4f..51e5724 100644 --- a/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py +++ b/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py @@ -887,7 +887,7 @@ def write_telemetry(server_results, adapter_tech, all_conv_results, labels, race "mode": mode, "runs": RUNS, "concurrency": CONCURRENCY_PER_SERVER, - "timestamp": datetime.datetime.utcnow().isoformat() + "Z", + "timestamp": datetime.datetime.now(datetime.UTC).isoformat(), "race_wall": race_wall, }, "servers": servers_block, From 58a4bc57ee780ad0b7ebd09c7c99c920e53c5bc9 Mon Sep 17 00:00:00 2001 From: "lastrasl@us.ibm.com;4A8621897;Luis Lastras" Date: Thu, 21 May 2026 18:27:32 +0000 Subject: [PATCH 5/5] tutorials: drop IPython dependency from bench_pipeline_race MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The coloransi import patched TermColors attributes that were never read anywhere in the file, so it was dead code. Removing it lets the script run outside Jupyter without IPython installed. 
--- .../comparison/alora_vs_lora_race/bench_pipeline_race.py | 6 ------ 1 file changed, 6 deletions(-) diff --git a/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py b/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py index 51e5724..ffce581 100644 --- a/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py +++ b/tutorials/scripts/comparison/alora_vs_lora_race/bench_pipeline_race.py @@ -60,12 +60,6 @@ logging.getLogger("fancy_logger").setLevel(logging.ERROR) -from IPython.utils import coloransi -if not hasattr(coloransi.TermColors, 'Green'): - coloransi.TermColors.Green = '\033[0;32m' - coloransi.TermColors.Blue = '\033[0;34m' - coloransi.TermColors.Normal = '\033[0m' - def _detect_notebook(): """Return True when running inside Jupyter / Colab.""" try: