Instructor Notes — Local LLM Frameworks Lab
Facilitator companion to Notebook 04. Talking points, teaching moments, and the resource decisions behind how this lab is set up.
Set the framing up front
🧠 Why we use a small model (Qwen3‑1.7B)
Be explicit with students: we are deliberately using a small model because we sized it to fit the allocation we have — not because it's the "best" model. NAIRR Starter allocations come with limited compute and disk, so a 1.7‑billion‑parameter model runs comfortably on a modest CPU instance and fits the default disk.
This is a genuine real‑world lesson, not a limitation to apologize for: matching the model to the resources you actually have is exactly the judgment researchers and educators make every day.
💽 Disk choice = how many students you can support
This is the most important capacity lesson in the lab. Each instance comes with a built‑in disk (m3.quad = 20 GB). Because we chose a small model, everything fits on that built‑in disk — so we do NOT add a separate volume.
That decision matters at scale: volumes are capped (about 10 per allocation), so if every student added a 40 GB volume, the class would be limited to ~10 people. By staying on the built‑in disk (no volume), the only limit is the compute quota — letting you run far more instances (roughly 18 on m3.quad with our quota).
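If students ask how those two limits interact, the arithmetic can be shown in a few lines. This is a minimal sketch using the figures quoted in these notes (72 vCPU quota, 4 vCPU per m3.quad, a cap of about 10 volumes); the function name is illustrative and not part of the notebook.

```python
def max_instances(vcpu_quota, vcpus_per_instance, volume_cap=None):
    """Return how many instances fit under the compute quota.

    If every instance also needs its own volume, the volume cap
    becomes a second limit and the smaller of the two wins.
    """
    by_compute = vcpu_quota // vcpus_per_instance
    if volume_cap is None:
        return by_compute
    return min(by_compute, volume_cap)


# Figures taken from these notes; adjust to your own allocation.
no_volume = max_instances(72, 4)
with_volume = max_instances(72, 4, volume_cap=10)
print(f"Built-in disk only: {no_volume} instances")
print(f"One volume per instance: {with_volume} instances")
```

The point to draw out is that the volume cap, not compute, becomes the bottleneck as soon as every student needs a volume.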
Teaching moments, in order
Hardware shapes what's possible
The notebook prints CPU mode and skips vLLM. Point out that this lab runs vLLM only on a GPU, so on a CPU instance we compare Ollama vs llama.cpp. Good moment to mention that NAIRR offers GPU instances too (g3/g4/g5) when a project needs them.
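A hardware check of this kind can be sketched as follows. This is an assumption about the general approach, not the notebook's actual code: it treats the presence of the `nvidia-smi` tool on the PATH as a rough signal that an NVIDIA GPU is available, and the function names are hypothetical.

```python
import shutil


def detect_gpu():
    """Heuristic: NVIDIA driver tools on PATH suggest a usable GPU."""
    return shutil.which("nvidia-smi") is not None


def pick_frameworks(has_gpu):
    """Return the frameworks to compare; vLLM is included only with a GPU."""
    frameworks = ["ollama", "llama.cpp"]
    if has_gpu:
        frameworks.append("vllm")
    return frameworks


has_gpu = detect_gpu()
print("GPU mode" if has_gpu else "CPU mode")
print("Comparing:", ", ".join(pick_frameworks(has_gpu)))
```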
Each app manages its own copy of the model
The model downloads twice — once for Ollama, once for llama.cpp — because each program keeps its own model store. This is why disk planning matters, and a concrete tie‑in to the small‑model decision above.
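To make the disk-planning point concrete, students can run a quick budget check. The sketch below is illustrative only: `MODEL_GB` is a placeholder to replace with the size actually observed on disk, and the 2 GB headroom is an arbitrary assumption, not a measured requirement.

```python
import shutil


def disk_needed_gb(model_gb, copies=2):
    """Total space for the model when each framework stores its own copy."""
    return model_gb * copies


def fits(free_gb, model_gb, copies=2, headroom_gb=2.0):
    """True if all copies fit while leaving some headroom free."""
    return free_gb - disk_needed_gb(model_gb, copies) >= headroom_gb


MODEL_GB = 2.0  # placeholder value; substitute the size you actually see
free_gb = shutil.disk_usage("/").free / 1e9
print(f"Free disk: {free_gb:.1f} GB")
print(f"Needed for 2 copies: {disk_needed_gb(MODEL_GB):.1f} GB")
print("Fits with headroom" if fits(free_gb, MODEL_GB) else "Too tight")
```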
What "generating tokens" actually means
Students watch the text appear word‑by‑word. Use this to explain that the model produces one token at a time, and that tokens/second is the speed metric we care about. Have System Monitor open — all CPU cores spike while it generates.
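The tokens/second figure is simply tokens produced divided by elapsed time. The sketch below shows the calculation against a stand-in token stream; `fake_stream` is a hypothetical placeholder for a real framework's streaming output, and the helper names are not from the notebook.

```python
import time


def tokens_per_second(token_count, elapsed_seconds):
    """Speed metric: tokens generated per second of wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return token_count / elapsed_seconds


def timed_stream(token_iter):
    """Print tokens as they arrive; return (count, elapsed seconds)."""
    start = time.perf_counter()
    count = 0
    for token in token_iter:
        print(token, end="", flush=True)
        count += 1
    return count, time.perf_counter() - start


def fake_stream():
    """Stand-in for a model: yields one token at a time with a small delay."""
    for token in ["The", " sky", " is", " blue", "."]:
        time.sleep(0.01)
        yield token


count, elapsed = timed_stream(fake_stream())
print(f"\n{count} tokens in {elapsed:.2f}s = "
      f"{tokens_per_second(count, elapsed):.1f} tok/s")
```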
Models live in memory — and get freed
The [RAM] lines rise when a model loads and fall when it's released. Pair with System Monitor. Note for students: the OS keeps freed memory as cache, so watch the "available" number, not "used."
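For students who want to see the "available" number outside System Monitor, Linux exposes it in `/proc/meminfo` as `MemAvailable`. The following is a minimal sketch for Linux instances only; it is not how the notebook produces its [RAM] lines, and the helper names are illustrative.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def available_gib(meminfo):
    """Memory the kernel estimates is available, in GiB (kB / 1024^2)."""
    return meminfo["MemAvailable"] / (1024 ** 2)


meminfo_path = Path("/proc/meminfo")
if meminfo_path.exists():
    info = parse_meminfo(meminfo_path.read_text())
    if "MemAvailable" in info:
        print(f"Available: {available_gib(info):.1f} GiB")
else:
    print("/proc/meminfo not found (not a Linux system)")
```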
The one big takeaway
This is the moment to land the core lesson: the program (framework) you choose mostly changes the speed; the model's size and quantization mostly change the quality. Same weights in every app (two on CPU, three with a GPU) → similar answers, different speed.
Why production serving is different
(GPU only.) When many requests arrive at once, vLLM's design pulls far ahead, which is why production AI services typically use a serving engine like vLLM rather than Ollama. On CPU this section is skipped; mention it conceptually.
Running the session
| Setting | Use when |
|---|---|
| TEST_MODE = "single" (default) | Short slot: one prompt, fastest. Great for a live demo. |
| TEST_MODE = "simple" | A normal class: 3 prompts. |
| TEST_MODE = "complete" | Assign for students to run later at home: all 7 prompts. |
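If a student asks what the setting actually does, one plausible way such a switch works is shown below. This is a hypothetical sketch, not the notebook's code: only the prompt counts (1, 3, 7) come from the table above, while `PROMPT_COUNTS`, `select_prompts`, and the placeholder prompts are invented for illustration.

```python
# Prompt counts per mode, taken from the table above.
PROMPT_COUNTS = {"single": 1, "simple": 3, "complete": 7}


def select_prompts(all_prompts, test_mode):
    """Return the first N prompts for the chosen mode."""
    if test_mode not in PROMPT_COUNTS:
        raise ValueError(f"Unknown TEST_MODE: {test_mode!r}")
    return all_prompts[:PROMPT_COUNTS[test_mode]]


# Placeholder prompts; the notebook's real prompts are not reproduced here.
ALL_PROMPTS = [f"prompt {i}" for i in range(1, 8)]

TEST_MODE = "single"
for prompt in select_prompts(ALL_PROMPTS, TEST_MODE):
    print(prompt)
```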
⚙️ Capacity cheat‑sheet (current allocation)
Compute quota ≈ 72 vCPU / ~270 GB RAM. Built‑in disk, no volumes:
• m3.quad (4 vCPU, 20 GB disk): ~18 instances; recommended for the class.
• m3.medium (8 vCPU, 60 GB disk): ~9 instances; faster each, fewer total.
For a full 20‑person class, either pair students on instances or request a quota increase from Jetstream2 ahead of time.
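The two options for a 20-person class can be worked out with a short calculation. This sketch assumes the 72 vCPU quota and per-instance vCPU counts quoted above; `plan_class` is an illustrative helper, not part of the lab materials.

```python
import math


def plan_class(students, vcpu_quota, vcpus_per_instance):
    """Work out the sharing needed, or the extra quota for one instance each."""
    available = vcpu_quota // vcpus_per_instance
    per_instance = math.ceil(students / available)
    needed = math.ceil(students / per_instance)
    extra_vcpus = max(0, students * vcpus_per_instance - vcpu_quota)
    return {
        "instances_available": available,
        "students_per_instance": per_instance,
        "instances_needed": needed,
        "extra_vcpus_for_one_each": extra_vcpus,
    }


for name, vcpus in [("m3.quad", 4), ("m3.medium", 8)]:
    print(name, plan_class(students=20, vcpu_quota=72, vcpus_per_instance=vcpus))
```

With these figures, pairing students on m3.quad needs 10 instances, while giving each of 20 students their own m3.quad would take 8 vCPUs more than the quota allows.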
🌱 Plant the next step: a full proposal
During the session, nudge participants to start thinking about a full NAIRR proposal — the Start-Up is just the on-ramp; a full proposal is how they sustain and scale the work. Have them jot down their goals and audience while it's fresh.
Then point them to VS Code + Claude (the build-your-own setup): the same AI assistant that creates notebooks can help them draft the full proposal. That reframes "I don't have time to write a grant" into "I can draft one in an afternoon."
🧹 Don't forget
Have students Delete their instance when finished (not just shelve) so it stops drawing allocation credits. If anyone added a volume, delete that too — volumes persist and count against the 10‑volume limit.