If machine learning is a kitchen, tokenization is the chopping board. Before models can “cook” with text, we need to cut it into the right-sized pieces—tokens.
Tokenization is the process of breaking raw text into units (tokens) that a model can handle.
Tokens can be:
Characters: H, e, l, l, o
Words: Hello, world
Subwords: token, izer → tokenizer
IDs: numbers like [1432, 57, 901] that index into a vocabulary
Why? Because models operate on numbers, not raw text. Tokenization is the first bridge from text → numbers.
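To make the token types above concrete, here is a minimal sketch that cuts one sample string at character and word granularity and then maps the word tokens to IDs. The sample text, the splitting regex, and the on-the-fly vocabulary are illustrative assumptions, not a standard scheme.

```javascript
// Hypothetical sample text, for illustration only.
const text = "Hello, world";

// Character-level: one token per Unicode code point.
const charTokens = Array.from(text);

// Word-level: runs of letters or digits, with each punctuation mark
// kept as its own token.
const wordTokens = text.match(/[\p{L}\d]+|[^\s\p{L}\d]/gu) || [];

// IDs: look each token up in a vocabulary (built here from the tokens
// themselves, in order of first appearance).
const vocab = new Map();
for (const tok of wordTokens) {
  if (!vocab.has(tok)) vocab.set(tok, vocab.size);
}
const ids = wordTokens.map(tok => vocab.get(tok));

console.log(charTokens.length); // 12
console.log(wordTokens);        // ["Hello", ",", "world"]
console.log(ids);               // [0, 1, 2]
```

The same 12-character string becomes 12 tokens at character level but only 3 at word level, which is the trade-off the rest of this post keeps coming back to.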
Vocabulary coverage: Can we handle unseen words like “microservicey”?
Sequence length: Fewer tokens → faster & cheaper inference/training.
Performance: Better tokenization can improve accuracy and generalization.
Internationalization: Does it work well across languages with different scripts?
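The vocabulary-coverage point can be sketched with a toy comparison: a word-level lookup gives up on an unseen word, while a greedy longest-match subword splitter still breaks it into known pieces. Both vocabularies and the function names below are hypothetical, and greedy longest-match is only one simple strategy; real subword tokenizers learn their vocabularies from data.

```javascript
// Hypothetical toy vocabularies, for illustration only.
const SUBWORDS = new Set(["micro", "service", "token", "izer", "y"]);
const WORDS = new Set(["microservice", "tokenizer"]);

// Word-level lookup: any word outside the vocabulary collapses to <unk>.
function wordLevel(word) {
  return WORDS.has(word) ? [word] : ["<unk>"];
}

// Greedy longest-match segmentation: at each position take the longest
// known piece, falling back to a single character when nothing matches.
function greedySubwords(word) {
  const pieces = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    while (end > start + 1 && !SUBWORDS.has(word.slice(start, end))) end--;
    pieces.push(word.slice(start, end));
    start = end;
  }
  return pieces;
}

console.log(wordLevel("microservicey"));      // ["<unk>"]
console.log(greedySubwords("microservicey")); // ["micro", "service", "y"]
```

The subword split keeps the information in the word and costs 3 tokens, compared with 13 tokens if we fell all the way back to characters.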
Below is a compact, practical tokenizer you can paste into a JS/React playground.
It’s word-level with punctuation splitting, and it can auto-grow its vocabulary during encode to avoid <unk> (great for demos, though a trained model needs a fixed vocabulary).
// Copied from ChatGPT
// Minimal word+punct tokenizer with auto-grow
// Matches a word (optionally with an apostrophe part, e.g. don't), a number
// (optionally with a decimal part), or any single other non-space character.
const TOKEN_RE = /\p{L}+(?:'\p{L}+)?|\d+(?:\.\d+)?|[^\s\p{L}\d]/gu;
const SPECIALS = ["<pad>", "<unk>", "<bos>", "<eos>"];

class TinyTokenizer {
  constructor() {
    // The vocabulary starts with only the special tokens (IDs 0 to 3).
    this.id2tok = [...SPECIALS];
    this.tok2id = new Map(this.id2tok.map((t, i) => [t, i]));
  }

  // Lowercase the text and split it into word, number, and punctuation tokens.
  tokenize(text) {
    return Array.from(String(text ?? "").toLowerCase().matchAll(TOKEN_RE), m => m[0]);
  }

  // Return the ID for tok, adding it to the vocabulary if it is new.
  _add(tok) {
    if (this.tok2id.has(tok)) return this.tok2id.get(tok);
    const id = this.id2tok.length;
    this.id2tok.push(tok);
    this.tok2id.set(tok, id);
    return id;
  }

  // text → [<bos>, ...token IDs, <eos>].
  // With autoAdd off, unknown tokens map to <unk> instead of growing the vocabulary.
  encode(text, { autoAdd = true } = {}) {
    const toks = this.tokenize(text);
    const ids = [this.tok2id.get("<bos>")];
    for (const t of toks) {
      const id = this.tok2id.get(t);
      ids.push(id != null ? id : (autoAdd ? this._add(t) : this.tok2id.get("<unk>")));
    }
    ids.push(this.tok2id.get("<eos>"));
    return ids;
  }

  // IDs → text. Lossy: case and the original spacing are not recovered.
  decode(ids) {
    const specials = new Set(SPECIALS);
    const toks = ids.map(i => this.id2tok[i]).filter(Boolean);
    let out = "";
    for (const t of toks) {
      if (specials.has(t)) continue;
      const isPunct = t.length === 1 && /[^\p{L}\d\s]/u.test(t);
      // Put a space before every token except punctuation.
      if (out && !isPunct) out += " ";
      out += t;
    }
    return out.trim();
  }
}
Preprocess: normalize (optional lowercasing, Unicode normalization), clean text
Tokenize: text → tokens → token IDs
Batch: pad/truncate to max_len, add masks
Model: IDs → embeddings → transformer layers → logits
Postprocess: decode IDs → tokens → text, apply spacing rules
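The batching step in the pipeline above can be sketched as follows. This is a minimal illustration, assuming the pad ID is 0 (the position of <pad> in the SPECIALS list) and using hypothetical names (`makeBatch`, `inputIds`, `attentionMask`) and made-up ID sequences.

```javascript
// Assumption: 0 is the <pad> ID, matching the position of "<pad>" in SPECIALS.
const PAD_ID = 0;

// Pad or truncate each ID sequence to maxLen and build a matching mask
// (1 = real token, 0 = padding).
function makeBatch(sequences, maxLen) {
  const inputIds = [];
  const attentionMask = [];
  for (const seq of sequences) {
    const kept = seq.slice(0, maxLen);
    const padCount = maxLen - kept.length;
    inputIds.push([...kept, ...Array(padCount).fill(PAD_ID)]);
    attentionMask.push([...Array(kept.length).fill(1), ...Array(padCount).fill(0)]);
  }
  return { inputIds, attentionMask };
}

// Two hypothetical encoded sentences (2 = <bos>, 3 = <eos>).
const batch = makeBatch([[2, 4, 5, 3], [2, 6, 7, 8, 9, 3]], 5);
console.log(batch.inputIds);      // [[2, 4, 5, 3, 0], [2, 6, 7, 8, 9]]
console.log(batch.attentionMask); // [[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]
```

Note that plain truncation cuts off the trailing <eos> of the second sequence; whether that matters depends on the model, so treat this as a starting point rather than a recipe.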
"Tokenization is step zero for working with text in ML/LLMs."