Tokenization for Freshers: How Text Becomes Model-Ready

If machine learning is a kitchen, tokenization is the chopping board. Before models can “cook” with text, we need to cut it into the right-sized pieces—tokens.

What is Tokenization?

Tokenization is the process of breaking raw text into units (tokens) that a model can handle.
Tokens can be:

Characters: H, e, l, l, o
Words: Hello, world
Subwords: token, izer → tokenizer
IDs: the integers each token maps to, like [1432, 57, 901], which index into a vocabulary

Why? Because models operate on numbers, not raw text. Tokenization is the first bridge from text → numbers.
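To make the granularities above concrete, here is a minimal sketch. The sample sentence and the helper name buildVocab are made up for illustration; real tokenizers build their vocabulary from a large corpus, not from a single sentence.

```javascript
// Hypothetical sample sentence; any string works.
const text = "Hello world";

// Character-level: every Unicode code point becomes a token.
const charTokens = Array.from(text);

// Word-level: split on runs of whitespace.
const wordTokens = text.split(/\s+/).filter(Boolean);

// Toy vocabulary: each distinct token gets the next free integer ID.
function buildVocab(tokens) {
  const vocab = new Map();
  for (const tok of tokens) {
    if (!vocab.has(tok)) vocab.set(tok, vocab.size);
  }
  return vocab;
}

const vocab = buildVocab(wordTokens);
const ids = wordTokens.map(tok => vocab.get(tok));

console.log(charTokens.length); // 11 (the space counts as a token too)
console.log(wordTokens);        // [ 'Hello', 'world' ]
console.log(ids);               // [ 0, 1 ]
```

The same text is 11 tokens at character level and 2 at word level, which is the trade-off between vocabulary size and sequence length.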

Why It Matters (A Lot)

Vocabulary coverage: Can we handle unseen words like “microservicey”?
Sequence length: Fewer tokens → faster & cheaper inference/training.
Performance: Better tokenization can improve accuracy and generalization.
Internationalization: Does it work across languages and scripts, including ones that don't put spaces between words?
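The coverage and sequence-length points can be sketched with a toy subword splitter. This is only an illustration: the SUBWORDS vocabulary and the segment function are invented here, and the greedy longest-match rule is a simplification in the spirit of WordPiece-style matching, not how any particular production tokenizer is implemented.

```javascript
// Hypothetical subword vocabulary, invented for illustration.
const SUBWORDS = new Set(["micro", "service", "token", "izer", "y"]);

// Greedy longest-match-first segmentation: at each position take the longest
// vocabulary entry that matches; fall back to a single character otherwise.
function segment(word, vocab = SUBWORDS) {
  const pieces = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    // Shrink the candidate until it is in the vocabulary or one character long.
    while (end > start + 1 && !vocab.has(word.slice(start, end))) end--;
    pieces.push(word.slice(start, end));
    start = end;
  }
  return pieces;
}

console.log(segment("microservicey")); // [ 'micro', 'service', 'y' ]
console.log(segment("tokenizer"));     // [ 'token', 'izer' ]
```

An unseen word like "microservicey" still gets encoded (3 pieces instead of 13 characters or one <unk>), which is why subword schemes are a common middle ground.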

A Tiny Tokenizer You Can Explain in an Interview (JS)

Below is a compact, practical tokenizer you can paste into a JS/React playground.
It’s word-level with punctuation splitting, and during encode it can add unseen tokens to its vocabulary instead of emitting <unk> (great for demos, though a trained model's vocabulary is fixed).

// Copied from ChatGPT
// Minimal word+punct tokenizer with auto-grow
// Matches a word (with an optional apostrophe part), a number, or any single other symbol.
const TOKEN_RE = /\p{L}+(?:'\p{L}+)?|\d+(?:\.\d+)?|[^\s\p{L}\d]/gu;
const SPECIALS = ["<pad>", "<unk>", "<bos>", "<eos>"];

class TinyTokenizer {
  constructor() {
    // The special tokens take IDs 0..3; everything else is added later.
    this.id2tok = [...SPECIALS];
    this.tok2id = new Map(this.id2tok.map((t, i) => [t, i]));
  }

  // Lowercase the text and split it into word, number, and punctuation tokens.
  tokenize(text) {
    return Array.from(String(text || "").toLowerCase().matchAll(TOKEN_RE), m => m[0]);
  }

  // Add a token to the vocabulary (if new) and return its ID.
  _add(tok) {
    if (this.tok2id.has(tok)) return this.tok2id.get(tok);
    const id = this.id2tok.length;
    this.id2tok.push(tok);
    this.tok2id.set(tok, id);
    return id;
  }

  // text → [<bos>, ...token IDs, <eos>]; unseen tokens are added or mapped to <unk>.
  encode(text, { autoAdd = true } = {}) {
    const toks = this.tokenize(text);
    const ids = [this.tok2id.get("<bos>")];
    for (const t of toks) {
      const id = this.tok2id.get(t);
      ids.push(id != null ? id : (autoAdd ? this._add(t) : this.tok2id.get("<unk>")));
    }
    ids.push(this.tok2id.get("<eos>"));
    return ids;
  }

  // IDs → text: skip specials and put a space before every token except punctuation.
  // Lossy by design: the original casing and exact spacing are not recovered.
  decode(ids) {
    const specials = new Set(SPECIALS);
    const toks = ids.map(i => this.id2tok[i]).filter(Boolean);
    let out = "";
    for (const t of toks) {
      if (specials.has(t)) continue;
      const isPunct = t.length === 1 && /[^\p{L}\d\s]/u.test(t);
      if (out && !isPunct) out += " ";
      out += t;
    }
    return out.trim();
  }
}
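To see what the splitting rule produces, here is a small standalone sketch. It repeats the TOKEN_RE pattern under the hypothetical name DEMO_RE so the snippet runs on its own; demoTokenize and the sample sentence are likewise made up for illustration.

```javascript
// Same pattern as TOKEN_RE above, repeated so this snippet is self-contained.
const DEMO_RE = /\p{L}+(?:'\p{L}+)?|\d+(?:\.\d+)?|[^\s\p{L}\d]/gu;

// Mirrors the tokenize() method: lowercase, then collect every regex match.
function demoTokenize(text) {
  return Array.from(String(text || "").toLowerCase().matchAll(DEMO_RE), m => m[0]);
}

console.log(demoTokenize("Don't panic: it costs $3.50!"));
// [ "don't", 'panic', ':', 'it', 'costs', '$', '3.50', '!' ]
```

Contractions and decimals stay in one piece, while every other symbol becomes its own token. Note that the pattern only recognizes the straight apostrophe ('), so a curly one would split the word.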

How Tokenization Fits in an LLM Pipeline

Preprocess: normalize (lowercasing, Unicode normalization), clean text
Tokenize: text → tokens → token IDs
Batch: pad/truncate to max_len, add masks
Model: IDs → embeddings → transformer layers → logits
Postprocess: decode IDs → tokens → text, apply spacing rules
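The Batch step is the one people tend to skip in explanations, so here is a minimal sketch of it. It assumes ID 0 is the padding token (matching "<pad>" at index 0 of SPECIALS above); padBatch and the sample ID lists are hypothetical, and real pipelines often truncate more carefully, for example by keeping the <eos> token.

```javascript
// Assumption: ID 0 is the padding token, as in the SPECIALS list above.
const PAD_ID = 0;

// Pad or truncate every sequence to maxLen and build a matching attention mask
// (1 = real token, 0 = padding the model should ignore).
function padBatch(sequences, maxLen, padId = PAD_ID) {
  const inputIds = [];
  const attentionMask = [];
  for (const seq of sequences) {
    const kept = seq.slice(0, maxLen);        // truncate if too long
    const padCount = maxLen - kept.length;    // 0 when nothing needs padding
    inputIds.push([...kept, ...Array(padCount).fill(padId)]);
    attentionMask.push([...Array(kept.length).fill(1), ...Array(padCount).fill(0)]);
  }
  return { inputIds, attentionMask };
}

const batch = padBatch([[2, 4, 5, 3], [2, 6, 7, 8, 9, 3]], 5);
console.log(batch.inputIds);      // [ [ 2, 4, 5, 3, 0 ], [ 2, 6, 7, 8, 9 ] ]
console.log(batch.attentionMask); // [ [ 1, 1, 1, 1, 0 ], [ 1, 1, 1, 1, 1 ] ]
```

Every row ends up the same length, which is what lets a batch be stacked into a single tensor before the model step.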

"Tokenization is step zero for working with text in ML/LLMs."
