For Instructors

Instructor Notes — Local LLM Frameworks Lab

Facilitator companion to Notebook 04. Talking points, teaching moments, and the resource decisions behind how this lab is set up.

Set the framing up front

🧠 Why we use a small model (Qwen3‑1.7B)

Be explicit with students: we deliberately chose a small model because it fits the allocation we have, not because it's the "best" model. NAIRR Starter allocations come with limited compute and disk, so we picked a 1.7‑billion‑parameter model that runs comfortably on a modest CPU instance and fits on the default disk.

This is a genuine real‑world lesson, not a limitation to apologize for: matching the model to the resources you actually have is exactly the judgment researchers and educators make every day.

Say something like: "We're running a small model on purpose — it's what fits our shared allocation. On a bigger allocation or a GPU, you'd scale up to a larger model. Choosing the right size for your resources is the real skill."
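If students ask why 1.7B parameters "fits", a back-of-envelope estimate can help: weight storage is roughly parameter count times bytes per weight. The sketch below assumes 4-bit and 16-bit weights purely for illustration (the quantization the lab actually downloads is not stated here), and it ignores file-format overhead and runtime memory.

```python
def approx_weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 1.7e9  # Qwen3-1.7B, the model used in this lab

# Assumed precisions, for illustration only.
for bits in (4, 16):
    print(f"{bits}-bit weights: ~{approx_weight_size_gb(PARAMS, bits):.2f} GB")
```

Either estimate is small next to a 20 GB built-in disk, which is the point to make when scaling the same arithmetic up to larger models.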

💽 Disk choice = how many students you can support

This is the most important capacity lesson in the lab. Each instance comes with a built‑in disk (m3.quad = 20 GB). Because we chose a small model, everything fits on that built‑in disk — so we do NOT add a separate volume.

That decision matters at scale: volumes are capped (about 10 per allocation), so if every student added a 40 GB volume, the class would be limited to ~10 people. By staying on the built‑in disk (no volume), the only limit is the compute quota — letting you run far more instances (roughly 18 on m3.quad with our quota).

Say something like: "Notice we kept the default disk and didn't add a volume. That's intentional — volumes are limited, so using the disk that comes with the instance lets us give more students their own machine."
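The capacity argument above is a "smallest limit wins" calculation. A minimal sketch, using the figures quoted in these notes (72 vCPU quota, 4 vCPU per m3.quad, about 10 volumes); the function name is hypothetical:

```python
from typing import Optional


def max_students(vcpu_quota: int, vcpu_per_instance: int,
                 volume_cap: Optional[int] = None) -> int:
    """Instances (one per student) the allocation can support.

    If every student needs a volume, the volume cap can become the binding
    limit; without volumes only the compute quota applies.
    """
    by_compute = vcpu_quota // vcpu_per_instance
    if volume_cap is None:
        return by_compute
    return min(by_compute, volume_cap)


print(max_students(72, 4))                 # built-in disk only
print(max_students(72, 4, volume_cap=10))  # one volume per student
```

With these numbers the first call gives 18 and the second gives 10, matching the figures above.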

Teaching moments, in order

Section 2 — Environment check

Hardware shapes what's possible

The environment check prints CPU mode and skips vLLM. Point out that vLLM needs a GPU in this lab, so on a CPU instance we compare Ollama vs llama.cpp. Good moment to mention that NAIRR offers GPU instances too (g3/g4/g5) when a project needs them.
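For instructors who want to show what such a check might look like, here is a minimal sketch. It is not the notebook's actual code: it assumes an NVIDIA GPU would be visible through the `nvidia-smi` command, and the function names are hypothetical.

```python
import shutil
import subprocess
from typing import List


def detect_mode() -> str:
    """Return "gpu" if nvidia-smi lists at least one GPU, otherwise "cpu"."""
    if shutil.which("nvidia-smi") is None:
        return "cpu"
    try:
        # "nvidia-smi -L" prints one line per detected GPU.
        result = subprocess.run(["nvidia-smi", "-L"], capture_output=True,
                                text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return "cpu"
    has_gpu = result.returncode == 0 and "GPU" in result.stdout
    return "gpu" if has_gpu else "cpu"


def frameworks_for(mode: str) -> List[str]:
    """Frameworks compared in each mode, as described in these notes."""
    frameworks = ["Ollama", "llama.cpp"]
    if mode == "gpu":
        frameworks.append("vLLM")
    return frameworks


mode = detect_mode()
print(f"{mode.upper()} mode: comparing {', '.join(frameworks_for(mode))}")
```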

Section 3 — Install / model download

Each app manages its own copy of the model

The model downloads twice — once for Ollama, once for llama.cpp — because each program keeps its own model store. This is why disk planning matters, and a concrete tie‑in to the small‑model decision above.
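A quick way to make the disk-planning point concrete is to check free space before downloading. This is a sketch only: the model size and headroom values are hypothetical placeholders, and the check simply multiplies the size by the number of copies.

```python
import shutil


def free_disk_gb(path: str = "/") -> float:
    """Free space on the filesystem containing `path`, in GB (1e9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def fits_on_disk(model_size_gb: float, copies: int = 2,
                 path: str = "/", headroom_gb: float = 2.0) -> bool:
    """True if `copies` copies of the model plus some headroom fit on disk."""
    needed_gb = model_size_gb * copies + headroom_gb
    return free_disk_gb(path) >= needed_gb


EXAMPLE_MODEL_GB = 1.5  # hypothetical size, not a measured value
print(f"Free: {free_disk_gb():.1f} GB, "
      f"two copies fit: {fits_on_disk(EXAMPLE_MODEL_GB)}")
```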

Sections 6–7 — Watching the answers stream

What "generating tokens" actually means

Students watch the text appear word‑by‑word. Use this to explain that the model produces one token at a time, and that tokens/second is the speed metric we care about. Have System Monitor open — all CPU cores spike while it generates.
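To show how a tokens/second figure is computed, the sketch below times any iterable of tokens. The `fake_stream` generator is a stand-in for a real model stream, and the injectable `clock` parameter is an assumption added here to make the function easy to test.

```python
import time
from typing import Callable, Iterable, Tuple


def tokens_per_second(stream: Iterable[str],
                      clock: Callable[[], float] = time.perf_counter
                      ) -> Tuple[int, float]:
    """Consume a token stream and return (token_count, tokens_per_second)."""
    start = clock()
    count = 0
    for _ in stream:
        count += 1
    elapsed = clock() - start
    rate = count / elapsed if elapsed > 0 else 0.0
    return count, rate


def fake_stream():
    """Stand-in for a model: yields a few tokens with a small delay each."""
    for piece in ["The", " sky", " is", " blue", "."]:
        time.sleep(0.01)
        yield piece


count, rate = tokens_per_second(fake_stream())
print(f"{count} tokens at {rate:.1f} tokens/s")
```

Note that this simple version includes the wait for the first token in the elapsed time, so it slightly understates steady-state generation speed.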

Load / unload cells (RAM readouts)

Models live in memory — and get freed

The [RAM] lines rise when a model loads and fall when it's released. Pair with System Monitor. Note for students: the OS uses otherwise idle memory as file cache, so "used" can stay high even after a model is released. Watch the "available" number, not "used."
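The difference between "free" and "available" can be shown with a small parser for the Linux `/proc/meminfo` format. This is a sketch, not the notebook's [RAM] readout; the sample text below is made up for illustration.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def ram_summary_gib(info: dict) -> dict:
    """Convert the fields of interest from kB to GiB."""
    kb_per_gib = 1024 ** 2
    return {name: info[field] / kb_per_gib
            for name, field in (("total", "MemTotal"),
                                ("free", "MemFree"),
                                ("available", "MemAvailable"))}


# Made-up sample: most "non-free" memory here is reclaimable cache.
SAMPLE = """MemTotal:       16000000 kB
MemFree:         1000000 kB
MemAvailable:   12000000 kB
Cached:         10000000 kB
"""

summary = ram_summary_gib(parse_meminfo(SAMPLE))
print({name: round(gib, 1) for name, gib in summary.items()})
```

On a Linux instance, passing `open("/proc/meminfo").read()` instead of `SAMPLE` gives the live numbers.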

Section 9 — The comparison table (the payoff)

The one big takeaway

This is the moment to land the core lesson: the program (framework) you choose mostly changes the speed; the model's size and quantization mostly change the quality. Same weights in every app (two on CPU, three on GPU) → similar answers, different speed.

Section 11 — Batch test

Why production serving is different

(GPU only.) When many requests arrive at once, vLLM's design pulls far ahead, which is why production AI services typically run a serving engine like vLLM rather than Ollama. On CPU this section is skipped; mention it conceptually.

Running the session

Setting	Use when
`TEST_MODE = "single"` (default)	Short slot — one prompt, fastest. Great for a live demo.
`TEST_MODE = "simple"`	A normal class — 3 prompts.
`TEST_MODE = "complete"`	Assign for students to run later at home — all 7 prompts.
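One way to picture how the setting drives the run is a lookup from mode to prompt count. This is a sketch under assumptions: the placeholder prompts are hypothetical, and it assumes each mode takes the first N prompts, which may differ from how the notebook selects them.

```python
# Hypothetical stand-ins for the lab's 7 prompts.
PROMPTS = [f"placeholder prompt {i}" for i in range(1, 8)]

# Prompt counts per mode, as listed in the table above.
PROMPT_COUNTS = {"single": 1, "simple": 3, "complete": 7}

TEST_MODE = "single"


def select_prompts(mode: str, prompts=PROMPTS) -> list:
    """Return the prompts to run for a mode (assumes 'first N' selection)."""
    if mode not in PROMPT_COUNTS:
        raise ValueError(f"Unknown TEST_MODE {mode!r}; "
                         f"expected one of {sorted(PROMPT_COUNTS)}")
    return list(prompts[:PROMPT_COUNTS[mode]])


print(f"{TEST_MODE}: {len(select_prompts(TEST_MODE))} prompt(s)")
```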

⚙️ Capacity cheat‑sheet (current allocation)

Compute quota ≈ 72 vCPU / ~270 GB RAM. Built‑in disk, no volumes:

• m3.quad (4 vCPU, 20 GB disk): ~18 instances. Recommended for the class.
• m3.medium (8 vCPU, 60 GB disk): ~9 instances. Faster each, fewer total.

For a full 20‑person class, either pair students on instances or request a quota increase from Jetstream2 ahead of time.
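The two options for a 20-person class can be worked out from the quota figures above. A minimal sketch with hypothetical helper names:

```python
import math


def students_per_instance(class_size: int, max_instances: int) -> int:
    """Largest group size needed if students share the available instances."""
    return math.ceil(class_size / max_instances)


def extra_vcpus_needed(class_size: int, vcpu_per_instance: int,
                       vcpu_quota: int) -> int:
    """Additional vCPUs to request so every student gets their own instance."""
    return max(0, class_size * vcpu_per_instance - vcpu_quota)


CLASS_SIZE = 20
print("m3.quad, sharing:", students_per_instance(CLASS_SIZE, 18), "per instance at most")
print("m3.quad, quota increase:", extra_vcpus_needed(CLASS_SIZE, 4, 72), "extra vCPUs")
```

On m3.quad this works out to pairs at most (only two instances actually need to be shared), or a request for 8 more vCPUs.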

🌱 Plant the next step: a full proposal

During the session, nudge participants to start thinking about a full NAIRR proposal — the Start-Up is just the on-ramp; a full proposal is how they sustain and scale the work. Have them jot down their goals and audience while it's fresh.

Then point them to VS Code + Claude (the build-your-own setup): the same AI assistant that creates notebooks can help them draft the full proposal. That reframes "I don't have time to write a grant" into "I can draft one in an afternoon."

🧹 Don't forget

Have students Delete their instance when finished (not just shelve) so it stops drawing allocation credits. If anyone added a volume, delete that too — volumes persist and count against the 10‑volume limit.