LLM Laboratory Portfolio

IEEE CVPR Hands-On Seminar

A comprehensive walkthrough of the 4 interactive web apps.

What we built

We built 4 standalone, production-ready web applications deployed on Hugging Face Spaces. Each app isolates a specific component of the modern Large Language Model stack:

Module 1: The Tokenizer Visualizer
Module 2: The Temperature Playground
Module 3: The Structured Output Extractor
Module 4: The LoRA Injection Simulator

See how the AI reads

A web application built with Flask and the `transformers` library, utilizing the native `unsloth/Llama-3.2-3B-Instruct` tokenizer.

Type any text to see real-time BPE chunking and integer mapping.

Launch Tokenizer Space

Language models do not read English.

A neural network operates only on numbers. To process text, we must map all human language onto a finite dictionary of integers. If the dictionary is too small (single characters), sequences become very long and each token carries little meaning. If it is too large (whole words), the embedding matrix becomes enormous and rare or unseen words cannot be represented.

The solution is sub-word tokenization: breaking text into frequently occurring chunks that sit between single characters and whole words.

How it works in practice

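Byte-Pair Encoding (BPE) starts from individual characters (or bytes) and repeatedly merges the most frequent adjacent pair into a single new token. Common words end up as one token, while rare words are split into several sub-word pieces, which need not align with syllables.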

IEEE _Seminar _Rocks

↓

45812 1204 8831
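
The merge loop described above can be sketched in a few lines of Python. This is a toy illustration, not the Llama tokenizer: the corpus, the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`), and the number of merges are all assumptions made for the example, and real tokenizers work on bytes and learn tens of thousands of merges.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split toy corpus."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


# Hypothetical toy corpus: "low" is frequent, "lower" and "lowest" are rare.
merges, vocab_words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(vocab_words)
```

After two merges the frequent word `low` has become a single symbol, while the rarer `lowest` is still split into `low`, `e`, `s`, `t`.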

From Integers to Vectors

Once the text is converted to a sequence of integers (e.g., `[45812, 1204, 8831]`), the model looks up each integer in a massive Embedding Matrix. Each integer selects one row of that matrix.

This converts the discrete integers into high-dimensional continuous vectors, allowing the model to perform mathematical operations on the meaning of words.
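
As a rough sketch, the lookup is nothing more than row indexing. The sizes and random values below are assumptions chosen to keep the example tiny; in a trained model the matrix is learned and far larger.

```python
import random

VOCAB_SIZE = 10  # hypothetical; real vocabularies are far larger
EMBED_DIM = 4    # hypothetical; real models use thousands of dimensions

random.seed(0)
# One row per token ID. Random here; learned during training in a real model.
embedding_matrix = [
    [random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Return one vector per token ID by indexing rows of the matrix."""
    return [embedding_matrix[t] for t in token_ids]


token_ids = [7, 2, 7]  # made-up IDs; the same ID always maps to the same row
vectors = embed(token_ids)
print(len(vectors), len(vectors[0]))
```

Note that the first and third vectors are identical, because both come from row 7.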

The ₹30 Boss Challenge

An interactive dashboard that visually renders probability distributions using a custom KenLM API endpoint.

Features the ₹30 Challenge: can you slide the temperature to balance logic and creativity and hit exactly 15% probability?

Launch Temperature Space

Predicting the next token

At the very end of the network, the model outputs raw scores (logits) for every single token in its vocabulary. Because these are unbounded real numbers (they can be negative and do not sum to 1), we need a reliable way to convert them into a probability distribution.

This is done via the Softmax function.

Exponentiate, then Normalize

The standard Softmax formula exponentiates each logit and divides by the sum of all exponentiated logits. Every output lies between 0 and 1 and the outputs sum to 1, which makes a valid probability distribution.

$$\sigma(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
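
A minimal sketch of the formula in plain Python. The three logits are made-up values for an imaginary 3-token vocabulary; subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # avoids overflow in exp(); the ratios are unchanged
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]  # hypothetical raw scores
probs = softmax(logits)
print([round(p, 3) for p in probs])
print(sum(probs))
```

The largest logit receives the largest probability, and even the negative logit ends up with a small positive share.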

Modulating the Softmax

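By introducing a Temperature parameter ($\theta$), we divide every raw logit by $\theta$ before exponentiating: $\sigma(z_i; \theta) = \frac{e^{z_i/\theta}}{\sum_j e^{z_j/\theta}}$. This gives us direct control over how peaked the distribution is, that is, how confident the model appears.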

Low Temp ($\theta < 1$)

Sharpened distribution. The highest logit takes most of the probability mass. As $\theta \to 0$, sampling approaches deterministic, "greedy" decoding.

High Temp ($\theta > 1$)

Flattened distribution. Lower-scoring tokens become viable. Output is more diverse and creative, at the cost of coherence.
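
To see the sharpening and flattening numerically, here is a small sketch that divides the logits by $\theta$ before the Softmax. The logits and the three temperature values are assumptions picked for illustration.

```python
import math


def softmax_with_temperature(logits, theta):
    """Divide each logit by theta, then apply Softmax."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    scaled = [z / theta for z in logits]
    shift = max(scaled)  # numerical stability only
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]  # hypothetical raw scores
for theta in (0.2, 1.0, 5.0):
    probs = softmax_with_temperature(logits, theta)
    print(theta, [round(p, 3) for p in probs])

cold = softmax_with_temperature(logits, 0.2)  # low temperature
hot = softmax_with_temperature(logits, 5.0)   # high temperature
```

At $\theta = 0.2$ the top token holds over 99% of the probability; at $\theta = 5$ it holds less than half, so the other tokens are sampled far more often.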

Forcing JSON Responses

A web app showcasing 3 levels of data extraction complexity, powered by the `Groq` API.

Try pasting unstructured text and watch the LLM populate a complex nested JSON schema. The schema constrains the structure of the output; it does not by itself guarantee that every extracted value is correct.

Launch Structured Output Space

Why text is not enough

Traditional software relies on strict data structures (APIs, databases). If an LLM returns a conversational response like "Sure, the error code is 500!", the downstream parser fails and the pipeline breaks.

We must constrain the LLM's generation so that it only outputs valid, machine-readable syntax.
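
The failure mode is easy to reproduce with Python's standard `json` module. The helper name `parse_reply` and the two sample replies are illustrative assumptions, not part of the app.

```python
import json


def parse_reply(reply):
    """Return the parsed object, or None if the reply is not valid JSON."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return None


chatty = "Sure, the error code is 500!"
strict = '{"error_code": 500, "affected_services": ["Database", "Auth"]}'

print(parse_reply(chatty))                # conversational text cannot be parsed
print(parse_reply(strict)["error_code"])  # structured text yields a usable value
```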

Restricting the Output Space

During decoding, we can mask invalid tokens by setting their logits to $-\infty$, which gives them a probability of exactly 0 after Softmax. If the schema requires a boolean at the current position, the model cannot output anything other than `true` or `false`.

{
    "error_code": 500,
    "affected_services": [
        "Database", 
        "Auth"
    ]
}
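
A minimal sketch of the masking step, assuming a made-up 5-token vocabulary and invented logits. A real grammar-constrained decoder recomputes the allowed set at every position from the schema; here the allowed set is hard-coded for a single boolean field.

```python
import math


def mask_logits(logits, allowed_ids):
    """Set the logit of every token outside `allowed_ids` to -inf."""
    return [z if i in allowed_ids else -math.inf for i, z in enumerate(logits)]


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary and raw scores; "Sure" is the model's favorite.
vocab = ["Sure", "true", "false", "500", "!"]
logits = [3.0, 1.0, 0.5, 2.0, 0.1]

# The schema expects a boolean here, so only `true` and `false` are allowed.
allowed = {vocab.index("true"), vocab.index("false")}
probs = softmax(mask_logits(logits, allowed))
best = vocab[probs.index(max(probs))]
print(best)
```

Even though `Sure` had the highest raw score, its probability is now zero and the most likely token is `true`.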

Brain vs. Sticky Note

An interactive VRAM calculator paired with a live "Sticky Note" injection simulator.

Write a custom, made-up fact on the Sticky Note, and watch the frozen Llama-3 adapt to it in real time!

Launch LoRA Space

Why Full Fine-Tuning is Expensive

To teach a model new facts with full fine-tuning, you must update its brain, the weight matrix $W$. A standard 3-billion-parameter model requires roughly 6 GB of VRAM just to load in 16-bit precision (2 bytes per parameter).

However, training also requires memory for gradients, optimizer states (Adam keeps two extra values per parameter), and activations. A full fine-tune of a 3B model therefore requires upwards of 30 GB of VRAM, putting it out of reach for most consumer GPUs.
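
The arithmetic can be sketched as a back-of-the-envelope estimate. The byte counts are assumptions: 16-bit weights and gradients (2 bytes each) and two 32-bit Adam states (8 bytes per parameter). Activations are left out because they depend on batch size and sequence length, so the real figure is higher still.

```python
GB = 1e9  # decimal gigabytes


def load_memory_gb(num_params, weight_bytes=2):
    """Memory needed just to hold the weights."""
    return num_params * weight_bytes / GB


def full_finetune_memory_gb(num_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Weights + gradients + Adam states; activations are NOT included."""
    per_param = weight_bytes + grad_bytes + optimizer_bytes
    return num_params * per_param / GB


params = 3_000_000_000  # a 3B-parameter model
print(load_memory_gb(params))           # weights only
print(full_finetune_memory_gb(params))  # before activations
```

Under these assumptions the weights take 6 GB and a full fine-tune needs about 36 GB before activations, consistent with the "upwards of 30 GB" figure.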

The Sticky Note Analogy

Instead of modifying the massive 3B-parameter brain, LoRA freezes it entirely. We then add two small low-rank matrices ($A$ and $B$) that act as "sticky notes". Only $A$ and $B$ are trained, and their product is added to the frozen weights during the forward pass.

$$W_{new} = W_{frozen} + (A \times B)$$

$W_{frozen}$

3 Billion Parameters.
Fixed in memory. No gradients required.

$A \times B$

1.5 Million Parameters.
Trainable in under 2 GB of additional VRAM, since gradients and optimizer states are kept only for these parameters!
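
A toy sketch of the update $W_{new} = W_{frozen} + (A \times B)$ using plain Python lists. The shapes ($4 \times 4$ weights, rank 1) and all values are assumptions chosen so the result can be checked by hand; in practice one of the two matrices is usually initialized to zero so that training starts from the unmodified model.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


d, k, r = 4, 4, 1  # hypothetical sizes; the rank r is much smaller than d and k

# Frozen weights (identity matrix here); never updated during training.
W_frozen = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]
A = [[0.5], [0.0], [0.0], [0.0]]  # d x r, trainable
B = [[0.0, 2.0, 0.0, 0.0]]        # r x k, trainable

W_new = add(W_frozen, matmul(A, B))

full_params = d * k          # what full fine-tuning would update
lora_params = d * r + r * k  # what LoRA actually trains
print(W_new[0])
print(full_params, lora_params)
```

The saving grows with matrix size: for a square $d \times d$ layer the trainable count drops from $d^2$ to $2dr$.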

Start Hacking

All 4 spaces are live on Hugging Face.

Thank you for attending the IEEE CVPR LLM Laboratory!
