Chapter 2

The Lexer

Breaking code into tokens

The lexer (also called a tokenizer or scanner) is the first stage of understanding your code. It reads your source code character by character and groups those characters into meaningful chunks called tokens.

What is a Token?

A token is the smallest meaningful unit in a programming language. Think of it like words in a sentence — just as "The cat sat" has three words, code has tokens.

Let's see what happens when the lexer processes this simple line:

Input

circle(100, 50, 30)

The lexer produces this stream of tokens:

identifier circle

punctuation (

number 100

punctuation ,

number 50

punctuation ,

number 30

punctuation )
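One way to picture this stream in code is as an array of small objects. The sketch below is illustrative only: the `SimpleToken` shape and the lowercase kind names mirror the labels above and are assumptions, not Bloom's actual token type.

```typescript
// Hypothetical token shape for illustration only.
type SimpleKind = "identifier" | "punctuation" | "number";

interface SimpleToken {
  kind: SimpleKind;
  text: string;
}

// The stream shown above for: circle(100, 50, 30)
const circleTokens: SimpleToken[] = [
  { kind: "identifier", text: "circle" },
  { kind: "punctuation", text: "(" },
  { kind: "number", text: "100" },
  { kind: "punctuation", text: "," },
  { kind: "number", text: "50" },
  { kind: "punctuation", text: "," },
  { kind: "number", text: "30" },
  { kind: "punctuation", text: ")" },
];

// Joining the token texts gives back the input without its spaces.
const rebuilt = circleTokens.map((t) => t.text).join("");
```

Notice that the spaces after the commas are gone: they never became tokens.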

Types of Tokens in Bloom

Bloom recognizes several categories of tokens:

Keywords

Reserved words with special meaning:

keyword fn

keyword let

keyword if

keyword else

keyword for

keyword while

keyword return

keyword true

keyword false

Identifiers

Names you give to things (variables, functions):

identifier myVariable

identifier circle

identifier playerX

Literals

Values written directly in code:

number 42

number 3.14

string "hello"

Operators

Symbols for operations:

operator +

operator -

operator *

operator /

operator ==

operator !=

operator ..
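To make the categories concrete, here is a rough sketch that sorts an already separated chunk of text into one of them. The keyword and operator lists are copied from this chapter; the `classify` function, the category strings, and the identifier rule (letters, digits, and underscores, not starting with a digit) are assumptions made for the example. A real lexer decides the category while it scans rather than afterwards.

```typescript
// Lists copied from the chapter text.
const KNOWN_KEYWORDS: string[] = [
  "fn", "let", "if", "else", "for", "while", "return", "true", "false",
];
const KNOWN_OPERATORS: string[] = ["+", "-", "*", "/", "==", "!=", ".."];

type Category =
  | "keyword" | "identifier" | "number" | "string" | "operator" | "unknown";

// Hypothetical helper: decides which category a single lexeme belongs to.
function classify(lexeme: string): Category {
  // Keywords are checked before identifiers because they look alike.
  if (KNOWN_KEYWORDS.indexOf(lexeme) !== -1) return "keyword";
  if (KNOWN_OPERATORS.indexOf(lexeme) !== -1) return "operator";
  if (/^[0-9]+(\.[0-9]+)?$/.test(lexeme)) return "number";
  if (/^"[^"]*"$/.test(lexeme)) return "string";
  if (/^[A-Za-z_][A-Za-z0-9_]*$/.test(lexeme)) return "identifier";
  return "unknown";
}
```

The order of the checks matters: `let` matches the identifier pattern too, so the keyword test has to come first.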

How the Lexer Works

The lexer reads your code one character at a time. Here's the mental model:

Source code

let x = 5

↑ cursor

Tokens produced

keyword LET

When the lexer sees l, it doesn't immediately create a token. It keeps reading (e, t) until it reaches a character that can't be part of a name, which here is a space. Then it checks: is "let" a keyword? Yes! So it creates a LET token.
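The "keep reading, then check the keyword table" idea can be sketched in a few lines. This is not Bloom's actual code: `scanWord`, `isWordChar`, the uppercase token names, and the rule that a word is made of letters, digits, and underscores are all assumptions for illustration.

```typescript
// Keyword list copied from the chapter.
const WORD_KEYWORDS: string[] = [
  "fn", "let", "if", "else", "for", "while", "return", "true", "false",
];

// Assumed rule: words are built from letters, digits, and underscores.
function isWordChar(ch: string): boolean {
  return /^[A-Za-z0-9_]$/.test(ch);
}

interface WordResult {
  type: string;   // keyword name in uppercase, or "IDENTIFIER"
  lexeme: string; // the exact text that was read
  end: number;    // index of the first character after the word
}

// Reads one word starting at `start`, then decides keyword vs identifier.
function scanWord(source: string, start: number): WordResult {
  let cursor = start;
  // Keep consuming until the next character cannot be part of a word.
  while (cursor < source.length && isWordChar(source.charAt(cursor))) {
    cursor++;
  }
  const lexeme = source.slice(start, cursor);
  const isKeyword = WORD_KEYWORDS.indexOf(lexeme) !== -1;
  return {
    type: isKeyword ? lexeme.toUpperCase() : "IDENTIFIER",
    lexeme,
    end: cursor,
  };
}
```

Waiting until the whole word has been read is what keeps a name like `letter` from being split into the keyword `let` plus `ter`.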

Step-by-Step Example

Let's trace through tokenizing this code:

Input

let x = 10 + 5

let

Read "l", "e", "t". Hit space. Check keyword table → Found "let" → Emit LET token

x

Read "x". Hit space. Not a keyword → Emit IDENTIFIER token with value "x"

=

Read "=". The next character is a space, so this is a single-character operator → Emit EQUALS token

10

Read "1", "0". Hit space. All digits → Emit NUMBER token with value 10

+

Read "+". The next character is a space, so this is a single-character operator → Emit PLUS token

5

Read "5". End of input. → Emit NUMBER token with value 5

Final result:

keyword let

identifier x

operator =

number 10

operator +

number 5
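The whole trace fits in a deliberately tiny lexer. The sketch below only knows words, whole numbers, `=` and `+`, which is just enough for this one line; the function and type names are invented for the example, and the real implementation handles far more.

```typescript
interface TraceToken {
  type: string;
  value: string | number;
}

const TRACE_KEYWORDS: string[] = [
  "fn", "let", "if", "else", "for", "while", "return", "true", "false",
];

// Hypothetical mini-lexer covering only what the walkthrough needs.
function tokenizeLine(source: string): TraceToken[] {
  const tokens: TraceToken[] = [];
  let i = 0;
  while (i < source.length) {
    const ch = source.charAt(i);
    if (ch === " " || ch === "\t") {
      i++; // whitespace only separates tokens
    } else if (/[A-Za-z_]/.test(ch)) {
      // Word: read to the end, then check the keyword list.
      const start = i;
      while (i < source.length && /[A-Za-z0-9_]/.test(source.charAt(i))) i++;
      const word = source.slice(start, i);
      const isKeyword = TRACE_KEYWORDS.indexOf(word) !== -1;
      tokens.push({
        type: isKeyword ? word.toUpperCase() : "IDENTIFIER",
        value: word,
      });
    } else if (/[0-9]/.test(ch)) {
      // Number: read all consecutive digits and convert them.
      const start = i;
      while (i < source.length && /[0-9]/.test(source.charAt(i))) i++;
      tokens.push({ type: "NUMBER", value: Number(source.slice(start, i)) });
    } else if (ch === "=") {
      tokens.push({ type: "EQUALS", value: ch });
      i++;
    } else if (ch === "+") {
      tokens.push({ type: "PLUS", value: ch });
      i++;
    } else {
      throw new Error("Unexpected character '" + ch + "'");
    }
  }
  return tokens;
}

const traced = tokenizeLine("let x = 10 + 5");
// Token types in order: LET, IDENTIFIER, EQUALS, NUMBER, PLUS, NUMBER
```

Each branch of the `if` chain corresponds to one kind of step in the trace above.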

Handling Multi-Character Operators

Some operators are more than one character. When the lexer sees =, it peeks ahead:

Is the next character =? → Emit == (equals comparison)
Otherwise → Emit = (assignment)

This is called lookahead. Bloom uses 1-character lookahead for these operators:

2-char ==

2-char !=

2-char <=

2-char >=

2-char +=

2-char -=

2-char ..

2-char ++

2-char --
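Here is a minimal sketch of that peek-ahead decision. The operator list is the one above; `scanOperator` is a hypothetical helper, and it assumes the caller has already established that the character at `pos` starts an operator.

```typescript
// Two-character operators listed in this chapter.
const TWO_CHAR_OPS: string[] = [
  "==", "!=", "<=", ">=", "+=", "-=", "..", "++", "--",
];

// Returns the operator text starting at `pos`, preferring the 2-char form.
function scanOperator(source: string, pos: number): string {
  const current = source.charAt(pos);
  const next = source.charAt(pos + 1); // peek only; "" at end of input
  const pair = current + next;
  if (next !== "" && TWO_CHAR_OPS.indexOf(pair) !== -1) {
    return pair; // for example "==" (comparison)
  }
  return current; // for example "=" (assignment)
}
```

For instance, `scanOperator("a == b", 2)` yields `"=="`, while `scanOperator("x = 5", 2)` yields `"="`. The peek never consumes the next character unless the pair is a known operator.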

Error Tracking

Each token remembers where it came from in your source code:

{
  type: "NUMBER",
  lexeme: "42",
  literal: 42,
  line: 3,
  column: 10
}

This is crucial for error messages. When something goes wrong later (during parsing or execution), Bloom can point you to the exact location in your code.
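To show where the line and column values could come from, the sketch below counts newlines up to a given offset. It assumes both numbers start at 1, which this chapter does not state, and `positionAt` is a name invented for the example. A real lexer would more likely update two counters as it advances instead of rescanning from the start.

```typescript
interface Position {
  line: number;
  column: number;
}

// Hypothetical helper: finds the 1-based line and column of `offset`.
function positionAt(source: string, offset: number): Position {
  let line = 1;
  let column = 1;
  for (let i = 0; i < offset && i < source.length; i++) {
    if (source.charAt(i) === "\n") {
      line++;     // a newline starts the next line...
      column = 1; // ...and resets the column
    } else {
      column++;
    }
  }
  return { line, column };
}

const program = "let a = 1\nlet b = 2\nlet xy = 42";
const where = positionAt(program, program.indexOf("42"));
// where is { line: 3, column: 10 }
```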

Fun fact: the lexer ignores whitespace and comments. They're not tokens; they exist only for human readers, so the parser never sees them.

What the Lexer Ignores

Whitespace — spaces, tabs, newlines (except to separate tokens)
Comments — everything after // until the end of line

Input

// This is a comment
let x = 5   // Another comment

Output: Just LET, IDENTIFIER(x), EQUALS, NUMBER(5)
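Skipping can be sketched as a small loop that moves the cursor past anything that is not part of a token. `skipTrivia` is a made-up name for this example, not a method from Bloom's lexer.

```typescript
// Returns the index of the next character that could start a token.
function skipTrivia(source: string, start: number): number {
  let i = start;
  while (i < source.length) {
    const ch = source.charAt(i);
    if (ch === " " || ch === "\t" || ch === "\r" || ch === "\n") {
      i++; // plain whitespace
    } else if (ch === "/" && source.charAt(i + 1) === "/") {
      // Line comment: skip up to the newline, which the
      // whitespace branch then consumes on the next pass.
      while (i < source.length && source.charAt(i) !== "\n") i++;
    } else {
      break; // something meaningful starts here
    }
  }
  return i;
}
```

A single `/` is left alone, so the division operator still reaches the normal operator handling.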

Common Lexer Errors

The lexer catches these problems early:

"hello

String never closed — Unterminated string

@#$

Unknown character — Unexpected character '@'
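Both checks are simple to sketch. The error messages match the ones shown above, but the helper names, the result shape, and the set of characters treated as known are assumptions for this example. Escape sequences inside strings are ignored to keep it short.

```typescript
interface StringResult {
  value: string | null; // the text between the quotes, if closed
  error: string | null; // an error message, if something went wrong
}

// `start` must point at the opening quote.
function scanString(source: string, start: number): StringResult {
  let i = start + 1;
  while (i < source.length && source.charAt(i) !== '"') i++;
  if (i >= source.length) {
    // Ran out of input before finding the closing quote.
    return { value: null, error: "Unterminated string" };
  }
  return { value: source.slice(start + 1, i), error: null };
}

// Assumed character set; the real lexer may accept more or fewer.
const KNOWN_CHARS = /[A-Za-z0-9_\s"()+\-*\/=!<>.,]/;

// Returns an error message for characters the lexer cannot start a token with.
function checkCharacter(ch: string): string | null {
  return KNOWN_CHARS.test(ch) ? null : "Unexpected character '" + ch + "'";
}
```

Reporting these problems in the lexer means the parser only ever receives well-formed tokens.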

In the Source Code

The lexer implementation lives in src/lang/lexer.ts. Key methods:

scanToken() — Main loop, decides what kind of token to make
string() — Handles string literals
number() — Handles number literals
identifier() — Handles keywords and identifiers
