
LLM Laboratory Portfolio

IEEE CVPR Hands-On Seminar

A comprehensive walkthrough of the 4 interactive web apps.

Overview

The 4 Interactive Modules

What we built

We built 4 standalone, production-ready web applications deployed on Hugging Face Spaces. Each app isolates a specific component of the modern Large Language Model stack:

  • Module 1: The Tokenizer Visualizer
  • Module 2: The Temperature Playground
  • Module 3: The Structured Output Extractor
  • Module 4: The LoRA Injection Simulator
Lab 1

The Tokenizer Visualizer

See how the AI reads

A web application built with Flask and the `transformers` library, utilizing the native `unsloth/Llama-3.2-3B-Instruct` tokenizer.

Type any text to see real-time BPE chunking and integer mapping.

Launch Tokenizer Space
Concept 1A

The Vocabulary Bottleneck

Language models do not read English.

A neural network operates only on numbers. To process text, we must first map all human language onto a finite dictionary of integers. If the dictionary is too small (individual characters), sequences become very long and each token carries little meaning on its own. If it is too large (whole words), the embedding matrix becomes impractically massive.

The solution is Sub-Word Tokenization: breaking text into frequently occurring sub-word chunks.

Concept 1B

Byte-Pair Encoding (BPE)

How it works in practice

BPE starts from individual characters (or bytes) and repeatedly merges the most frequent adjacent pair into a single new token. Common words end up as one token, while rare words are split into several smaller sub-word pieces, which need not line up with syllables.

IEEE     _Seminar   _Rocks
45812    1204       8831
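The merge procedure can be sketched in a few lines of Python. This is a toy illustration rather than the Llama tokenizer itself: the function names (`learn_bpe_merges`, `merge_pair`, `tokenize`) and the tiny corpus are hypothetical, and production tokenizers add byte-level handling and pre-tokenization that are left out here.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the most frequent pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical corpus: "the" is common, "zebra" is rare.
merges = learn_bpe_merges(["the"] * 10 + ["then"] * 3 + ["zebra"], num_merges=2)
print(tokenize("the", merges))    # ['the']
print(tokenize("zebra", merges))  # ['z', 'e', 'b', 'r', 'a']
```

With only two merges, the frequent word collapses into a single token while the rare word stays split into small pieces, which is the behavior described above.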
Concept 1C

The Embedding Space

From Integers to Vectors

Once the text is converted to a sequence of integers (e.g., `[45812, 1204]`), the model looks up each integer in a massive Embedding Matrix.

This converts the discrete integers into high-dimensional continuous vectors, allowing the model to perform mathematical operations on the meaning of words.
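The lookup itself is just row indexing. The sketch below assumes a hypothetical vocabulary size and embedding width, and fills the matrix with random values; in a trained model these rows are learned and the width runs into the thousands.

```python
import random

VOCAB_SIZE = 50_000  # hypothetical vocabulary size
EMBED_DIM = 8        # hypothetical width, kept tiny for illustration

rng = random.Random(0)
# One row of EMBED_DIM floats per token ID (random stand-ins for learned values).
embedding_matrix = [
    [rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """Map each integer token ID to its row in the embedding matrix."""
    return [embedding_matrix[token_id] for token_id in token_ids]

vectors = embed([45812, 1204])
print(len(vectors), len(vectors[0]))  # 2 8
```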

Lab 2

The Temperature Playground

The ₹30 Boss Challenge

An interactive dashboard that visually renders probability distributions using a custom KenLM API endpoint.

Features the ₹30 Challenge: Can you set the temperature slider precisely enough, balancing logic and creativity, to hit a probability of exactly 15%?

Launch Temperature Space
Concept 2A

The Logit Bottleneck

Predicting the next token

At the very end of the network, the model outputs a raw score (logit) for every single token in its vocabulary. Because logits are unbounded real numbers (they can be negative and do not sum to 1), they cannot be read as probabilities directly. We need a way to reliably convert them into a probability distribution.

This is done via the Softmax function.

Concept 2B

The Softmax Formula

Exponentiate and Normalize

The standard Softmax formula exponentiates each logit and divides by the sum of all the exponentials. Every output then lies between 0 and 1 and the outputs sum to 1, which makes them a valid probability distribution.

$$ \sigma(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}} $$
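A minimal Python sketch of the formula follows. The example logits are made up, and subtracting the maximum logit is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Shifting by the max leaves the ratios unchanged but avoids overflow in exp().
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```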
Concept 2C

Temperature & Creativity

Modulating the Softmax

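By introducing a Temperature parameter ($\theta > 0$), we divide each raw logit by $\theta$ before it is exponentiated. This lets us mathematically control how strongly the distribution concentrates on the top-scoring tokens.

$$ \sigma(z_i; \theta) = \frac{e^{z_i / \theta}}{\sum_{j} e^{z_j / \theta}} $$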

Low Temp ($\theta < 1$)

Sharpened distribution. The highest logit increasingly dominates. As $\theta \to 0$, sampling becomes deterministic and matches "greedy" decoding.

High Temp ($\theta > 1$)

Flattened distribution. Lower scores become viable. Highly creative and diverse.
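The sharpening and flattening can be checked numerically. In this sketch the logits are made up and the function name `softmax_with_temperature` is illustrative. It divides each logit by $\theta$ and then applies the usual Softmax.

```python
import math

def softmax_with_temperature(logits, theta):
    """Softmax over logits divided by a positive temperature theta."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    scaled = [z / theta for z in logits]
    largest = max(scaled)  # shift for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
top_probability = {}
for theta in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, theta)
    top_probability[theta] = probs[0]
    print(theta, [round(p, 3) for p in probs])
```

The probability of the top token shrinks as $\theta$ grows, while the lower-scoring tokens gain probability mass.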

Lab 3

Structured Output Extractor

Forcing JSON Responses

A web app showcasing 3 levels of data extraction complexity, powered by the `Groq` API.

Try pasting unstructured text and watch the LLM populate a complex nested JSON schema with well-formed, schema-conforming output.

Launch Structured Output Space
Concept 3A

Bridging AI and Software

Why text is not enough

Traditional software engineering relies on strict data structures (APIs, databases). If an LLM returns a conversational response like "Sure, the error code is 500!", downstream code that expects structured data will fail to parse it.

We must constrain the LLM's generation so that it only outputs valid, machine-readable syntax.
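The failure mode is easy to reproduce with Python's standard `json` module. The helper name `parse_model_reply` and both reply strings are illustrative.

```python
import json

def parse_model_reply(reply):
    """Return the parsed object, or None if the reply is not valid JSON."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return None

chatty = "Sure, the error code is 500!"
strict = '{"error_code": 500, "affected_services": ["Database", "Auth"]}'

print(parse_model_reply(chatty))                # None
print(parse_model_reply(strict)["error_code"])  # 500
```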

Concept 3B

Grammar-Constrained Decoding

Restricting the Token Space

During decoding, we can set the probability of every invalid token to 0% at each step, typically by masking its logit to $-\infty$ before the Softmax. If the schema requires a boolean, the model cannot output anything other than `true` or `false` at that position.

{
    "error_code": 500,
    "affected_services": [
        "Database", 
        "Auth"
    ]
}
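A single decoding step with masking might look like the sketch below. The four-token vocabulary, the logit values, and the function name `constrained_distribution` are all hypothetical. A real implementation derives the allowed set from a grammar or JSON schema at every step.

```python
import math

def constrained_distribution(logits, allowed):
    """Softmax over token logits, with every token outside `allowed` masked out."""
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    largest = max(masked.values())
    if largest == -math.inf:
        raise ValueError("no allowed token is present in the vocabulary")
    # exp(-inf) is exactly 0.0, so masked tokens receive zero probability.
    exps = {token: math.exp(score - largest) for token, score in masked.items()}
    total = sum(exps.values())
    return {token: e / total for token, e in exps.items()}

# The unconstrained model would prefer the chatty token "Sure".
logits = {"Sure": 3.2, "true": 1.1, "false": 0.4, "{": 0.9}
probs = constrained_distribution(logits, allowed={"true", "false"})
print(probs["Sure"])             # 0.0
print(max(probs, key=probs.get))  # true
```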
Lab 4

LoRA Interactive Simulator

Brain vs. Sticky Note

An interactive VRAM calculator paired with a live "Sticky Note" injection simulator.

Write a custom, made-up fact on the Sticky Note, and watch the frozen Llama-3 adapt to it in real time!

Launch LoRA Space
Concept 4A

The VRAM Wall

Why Full Fine-Tuning is Expensive

To teach a model new facts with full fine-tuning, you must update its weights, the "brain" ($W$). A 3 billion parameter model requires roughly 6 GB of VRAM just to load in 16-bit precision (2 bytes per parameter).

However, running backpropagation requires storing optimizer states (Adam), gradients, and activations. A full fine-tune of a 3B model requires upwards of 30 GB of VRAM, putting it out of reach for consumer hardware.
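The arithmetic behind these figures can be sketched as follows. The byte counts are assumptions, not measurements: 2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 8 for Adam's two 32-bit moment estimates. Activations are excluded and would add to the total.

```python
PARAMS = 3e9  # 3 billion parameters
GB = 1e9      # bytes per gigabyte (decimal)

def gigabytes(num_params, bytes_per_param):
    """Memory in GB for `num_params` values at `bytes_per_param` bytes each."""
    return num_params * bytes_per_param / GB

# Assumption: weights stored in 16-bit precision.
load_only = gigabytes(PARAMS, 2)

# Assumption: weights (2) + gradients (2) + two Adam moments in 32-bit (8).
full_finetune = gigabytes(PARAMS, 2 + 2 + 8)

print(load_only)      # 6.0
print(full_finetune)  # 36.0
```

Under these assumptions the total lands at 36 GB before activations, in line with the "upwards of 30 GB" estimate.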

Concept 4B

Low-Rank Adaptation (LoRA)

The Sticky Note Analogy

Instead of modifying the massive 3B parameter brain, LoRA freezes it entirely. We then add two small low-rank matrices ($A$ and $B$) that act as "sticky notes". Their product is added to the frozen weights during the forward pass, so only $A$ and $B$ need to be trained.

$$ W_{new} = W_{frozen} + (A \times B) $$

$W_{frozen}$

3 Billion Parameters.
Fixed in memory. No gradients required.

$A \times B$

1.5 Million Parameters.
Trainable in under 2 GB of VRAM!
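A toy version of the update can be written in pure Python. The matrix sizes, values, and helper names below are hypothetical, and the shapes follow the formula above, with $A$ of size $d \times r$ and $B$ of size $r \times k$. The sketch checks that merging the sticky note into the weights gives the same output as keeping it separate, and counts the trainable parameters.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def lora_param_count(d, k, r):
    """Trainable parameters in A (d x r) plus B (r x k)."""
    return r * (d + k)

# Toy frozen weight matrix (4 x 4 identity) and a rank-1 sticky note.
W_frozen = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[0.1], [0.2], [0.3], [0.4]]  # 4 x 1
B = [[1.0, 0.0, -1.0, 0.5]]       # 1 x 4

W_new = add(W_frozen, matmul(A, B))

x = [[1.0, 2.0, 3.0, 4.0]]  # one input row vector
merged = matmul(x, W_new)
unmerged = add(matmul(x, W_frozen), matmul(matmul(x, A), B))

# Hypothetical layer size: a 4096 x 4096 matrix adapted at rank 8.
print(lora_param_count(4096, 4096, 8), "trainable vs", 4096 * 4096, "frozen")
```

Because the unmerged form never materializes the full $A \times B$ product, the adapter can be kept as a small separate file and applied on top of the frozen model.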

Start Hacking

All 4 spaces are live on Hugging Face.

Thank you for attending the IEEE CVPR LLM Laboratory!