74 changes: 14 additions & 60 deletions tutorials/notebooks/alora_vs_lora_race.ipynb
Original file line number Diff line number Diff line change
@@ -4,35 +4,7 @@
"cell_type": "markdown",
"id": "a0000001",
"metadata": {},
"source": [
"# ALORA vs LoRA Race\n",
"\n",
"**Duration:** ~20-40 min (composes a LoRA checkpoint, embeds the corpus, then runs two vLLM legs back to back; first run also downloads ~6 GB of weights)\n",
"\n",
"**Runtime note:** Each server run takes roughly the same wall time as a real race leg\n",
"(3–8 min depending on GPU). The two runs are sequential since Colab typically provides\n",
"one GPU; `race_live.html` replays them as if they raced.\n",
"\n",
"This notebook benchmarks two Granite Switch checkpoints — one using **ALORA** (which defers adapter activation to save prefill time) and one using standard **LoRA** — on the same multi-step RAG pipeline, and produces an animated HTML replay of the race. The two servers run sequentially (Colab usually provides one GPU); the replay stitches their telemetry together as if they had raced simultaneously.\n",
"\n",
"*Why vLLM:* much faster inference in production environments; HF support for Granite Switch in mellea coming. The ALORA prefill optimization is implemented in vLLM's Punica kernels.\n",
"\n",
"**What you'll learn:**\n",
"- How to run the same pipeline (guardian → query rewrite → retrieval → answerability → clarification → generation) against two Granite Switch checkpoints and produce `race_live.html` + `race_report.html` from the result\n",
"- How `--technology-filter lora` lets you compose a like-for-like LoRA-only counterpart to a published ALORA checkpoint\n",
"- How to launch, health-check, and tear down vLLM servers from a notebook without leaking GPU memory\n",
"- Where ALORA's prefill savings show up in the per-step latency breakdown\n",
"\n",
"**Adapters used:** the embedded ALORA checkpoint [`ibm-granite/granite-switch-4.1-3b-preview`](https://huggingface.co/ibm-granite/granite-switch-4.1-3b-preview) and a LoRA-only build of the same three IBM granitelib libraries — [Core](https://huggingface.co/ibm-granite/granitelib-core-r1.0), [RAG](https://huggingface.co/ibm-granite/granitelib-rag-r1.0), and [Guardian](https://huggingface.co/ibm-granite/granitelib-guardian-r1.0) — composed in section 3.\n",
"\n",
"## Prerequisites\n",
"\n",
"1. **GPU runtime.** A100 or better. In Colab: *Runtime → Change runtime type → A100 GPU*.\n",
"2. **HuggingFace login** (cell 4) so the `ibm-granite/*` checkpoints can download.\n",
"3. **Run cells in order.** Section 0 clones the repo and `cd`s into the race-script directory; later sections assume that working directory.\n",
"\n",
"New to this series? [`compose_granite_switch.ipynb`](./compose_granite_switch.ipynb) walks through the composer that section 3 calls. Full setup details (GPU sizes, multi-GPU, troubleshooting) are in [`PREREQUISITES.md`](../PREREQUISITES.md)."
]
"source": "# ALORA vs LoRA Race\n\n**Duration:** ~20-40 min (composes both checkpoints, embeds the corpus, then runs two vLLM legs back to back; first run also downloads ~6 GB of weights)\n\n**Runtime note:** Each server run takes roughly the same wall time as a real race leg\n(3–8 min depending on GPU). The two runs are sequential since Colab typically provides\none GPU; `race_live.html` replays them as if they raced.\n\nThis notebook benchmarks two Granite Switch checkpoints — one using **ALORA** (which defers adapter activation to save prefill time) and one using standard **LoRA** — on the same multi-step RAG pipeline, and produces an animated HTML replay of the race. The two servers run sequentially (Colab usually provides one GPU); the replay stitches their telemetry together as if they had raced simultaneously.\n\n*Why vLLM:* much faster inference in production environments; HF support for Granite Switch in mellea coming. The ALORA prefill optimization is implemented in vLLM's Punica kernels.\n\n**What you'll learn:**\n- How to run the same pipeline (guardian → query rewrite → retrieval → answerability → clarification → generation) against two Granite Switch checkpoints and produce `race_live.html` + `race_report.html` from the result\n- How composing without `--technology-filter` prefers ALORA adapters but falls back to LoRA, while `--technology-filter lora` forces a LoRA-only build\n- How to launch, health-check, and tear down vLLM servers from a notebook without leaking GPU memory\n- Where ALORA's prefill savings show up in the per-step latency breakdown\n\n**Adapters used:** both checkpoints are composed from the same three IBM granitelib libraries — [Core](https://huggingface.co/ibm-granite/granitelib-core-r1.0), [RAG](https://huggingface.co/ibm-granite/granitelib-rag-r1.0), and [Guardian](https://huggingface.co/ibm-granite/granitelib-guardian-r1.0) — on top of [granite-4.1-3b](https://huggingface.co/ibm-granite/granite-4.1-3b). The ALORA build (section 3) uses the default technology preference; the LoRA-only build (section 4) adds `--technology-filter lora`.\n\n## Prerequisites\n\n1. **GPU runtime.** A100 or better. In Colab: *Runtime → Change runtime type → A100 GPU*.\n2. **HuggingFace login** (cell 4) so the `ibm-granite/*` checkpoints can download.\n3. **Run cells in order.** Section 0 clones the repo and `cd`s into the race-script directory; later sections assume that working directory.\n\nNew to this series? [`compose_granite_switch.ipynb`](./compose_granite_switch.ipynb) walks through the composer that sections 3 and 4 call. Full setup details (GPU sizes, multi-GPU, troubleshooting) are in [`PREREQUISITES.md`](../PREREQUISITES.md)."
},
{
"cell_type": "markdown",
@@ -191,11 +163,15 @@
"cell_type": "markdown",
"id": "a0000011",
"metadata": {},
"source": [
"## 3 · ALORA server\n",
"\n",
"Start `granite-switch-4.1-3b-preview` on port 8111 and run the benchmark against it."
]
"source": "## 3 · Compose the ALORA model and run the server\n\nCompose a checkpoint from the three IBM granitelib libraries without a technology filter —\nthe composer prefers ALORA adapters and falls back to LoRA where ALORA is unavailable.\nThen start the composed model on port 8111 and run the benchmark against it."
},
{
"cell_type": "code",
"id": "0960d016",
"source": "ALORA_MODEL_DIR = \"/content/granite-switch-alora-prefer\"\n\nimport os\nif os.path.exists(os.path.join(ALORA_MODEL_DIR, \"adapter_index.json\")):\n print(f\"ALORA model already composed at {ALORA_MODEL_DIR} — skipping\")\nelse:\n !python -m granite_switch.composer.compose_granite_switch \\\n --base-model ibm-granite/granite-4.1-3b \\\n --adapters ibm-granite/granitelib-rag-r1.0 \\\n ibm-granite/granitelib-core-r1.0 \\\n ibm-granite/granitelib-guardian-r1.0 \\\n --output {ALORA_MODEL_DIR}",
"metadata": {},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
@@ -234,28 +210,15 @@
"id": "a0000012",
"metadata": {},
"outputs": [],
"source": [
"alora_proc = launch_vllm(\n",
" model = \"ibm-granite/granite-switch-4.1-3b-preview\",\n",
" port = 8111,\n",
" log_file = \"/content/vllm_alora.log\",\n",
")\n",
"if not wait_for_server(8111):\n",
" tail_log(\"/content/vllm_alora.log\")"
]
"source": "alora_proc = launch_vllm(\n model = ALORA_MODEL_DIR,\n port = 8111,\n log_file = \"/content/vllm_alora.log\",\n)\nif not wait_for_server(8111):\n tail_log(\"/content/vllm_alora.log\")"
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0000013",
"metadata": {},
"outputs": [],
"source": [
"# Benchmark the ALORA server.\n",
"# --no-live disables Rich Live (which floods notebook output with redrawn frames).\n",
"# The animated replay comes from race_live.html at the end.\n",
"!python bench_pipeline_race.py --mode sequential --server \"ALORA (8111)\" --no-live -n 16 -c 8 -k 10"
]
"source": "# Benchmark the ALORA server.\n# --no-live disables Rich Live (which floods notebook output with redrawn frames).\n# The animated replay comes from race_live.html at the end.\n!python bench_pipeline_race.py --mode sequential --server \"ALORA (8111)\" --alora-model {ALORA_MODEL_DIR} --no-live -n 16 -c 8 -k 10"
},
{
"cell_type": "code",
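Reviewer sketch for the `wait_for_server(8111)` call in the launch cell: the notebook's helper is not shown in this diff, so the following is only a minimal illustration of how such a readiness poll could work. It assumes the vLLM OpenAI-compatible server answers `GET /health` with status 200 once the model is loaded; `http_probe` and `wait_until_ready` are hypothetical names, not the notebook's functions.

```python
import time
import urllib.error
import urllib.request


def http_probe(port: int) -> bool:
    """Return True if GET /health on localhost answers 200 (assumed endpoint)."""
    url = f"http://127.0.0.1:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx reply: not ready yet.
        return False


def wait_until_ready(port, timeout_s=600.0, interval_s=2.0,
                     probe=http_probe, sleep=time.sleep):
    """Poll `probe(port)` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe(port):
            return True
        sleep(interval_s)
    return False
```

Passing `probe` and `sleep` as parameters keeps the loop testable without a running server; a caller would tail the server log and terminate the process when this returns `False`, as the launch cell does with `tail_log`.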
@@ -273,16 +236,7 @@
"cell_type": "markdown",
"id": "bm4l4b6xr5c",
"metadata": {},
"source": [
"## 4 · Compose the LoRA-only model\n",
"\n",
"The ALORA model above is a pre-built checkpoint from the Hub. For a fair comparison we now\n",
"compose a **LoRA-only** version from the same adapter libraries, using `--technology-filter lora`\n",
"to force every adapter to its standard LoRA variant.\n",
"\n",
"This downloads the base model and adapter libraries (~6 GB on first run, cached after that)\n",
"and writes the composed checkpoint to `/content/granite-switch-lora-only`."
]
"source": "## 4 · Compose the LoRA-only model\n\nFor a fair comparison we now compose a **LoRA-only** version from the same adapter\nlibraries, using `--technology-filter lora` to force every adapter to its standard\nLoRA variant.\n\nThis downloads the adapter libraries (~6 GB on first run, cached after that) and writes\nthe composed checkpoint to `/content/granite-switch-lora-only`."
},
{
"cell_type": "code",
@@ -454,4 +408,4 @@
},
"nbformat": 4,
"nbformat_minor": 5
}
}
Original file line number Diff line number Diff line change
@@ -60,12 +60,6 @@
logging.getLogger("fancy_logger").setLevel(logging.ERROR)


from IPython.utils import coloransi
if not hasattr(coloransi.TermColors, 'Green'):
    coloransi.TermColors.Green = '\033[0;32m'
    coloransi.TermColors.Blue = '\033[0;34m'
    coloransi.TermColors.Normal = '\033[0m'

def _detect_notebook():
    """Return True when running inside Jupyter / Colab."""
    try:
@@ -262,7 +256,7 @@ def run_timed_pipeline(query, ctx, backend, conv_json_idx=None, turn_idx=None, s
ctx_with_query = ctx.add(MelleaMessage("user", query))
t0 = time.perf_counter()
harm_score = _call_or_dump("guardian_harm", ctx_with_query, conv_json_idx, turn_idx,
guardian_check, ctx_with_query, backend, GUARDIAN_HARM_CRITERIA, target_role="user")
guardian_check, ctx_with_query, backend, GUARDIAN_HARM_CRITERIA, scoring_schema="user_prompt")
timings["guardian_harm"] = time.perf_counter() - t0
if harm_score >= 0.5:
work["exit"] = "harm_blocked"
@@ -887,7 +881,7 @@ def write_telemetry(server_results, adapter_tech, all_conv_results, labels, race
"mode": mode,
"runs": RUNS,
"concurrency": CONCURRENCY_PER_SERVER,
"timestamp": datetime.datetime.utcnow().isoformat() + "Z",
"timestamp": datetime.datetime.now(datetime.UTC).isoformat(),
"race_wall": race_wall,
},
"servers": servers_block,
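Reviewer note on the `timestamp` change in this hunk: replacing `utcnow()` (deprecated since Python 3.12) with `now(datetime.UTC)` (the `UTC` alias requires Python 3.11+) also changes the serialized string, because an aware datetime's `isoformat()` ends in `+00:00` and the explicit `"Z"` suffix is gone. The sketch below shows the difference on a fixed, illustrative instant. Whether any telemetry consumer depends on the trailing `Z` is not visible in this diff, so the last line is only an option if one does.

```python
import datetime

# Fixed instant so the comparison is reproducible (illustrative value only).
naive = datetime.datetime(2024, 1, 2, 3, 4, 5)
aware = naive.replace(tzinfo=datetime.timezone.utc)

old_style = naive.isoformat() + "Z"          # shape produced by the removed line
new_style = aware.isoformat()                # shape produced by the added line
z_style = new_style.replace("+00:00", "Z")   # keeps the aware datetime, restores "Z"

print(old_style)  # 2024-01-02T03:04:05Z
print(new_style)  # 2024-01-02T03:04:05+00:00
print(z_style)    # 2024-01-02T03:04:05Z
```

`datetime.timezone.utc` is used here instead of `datetime.UTC` so the sketch also runs on interpreters older than 3.11; the two refer to the same object.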