The Lexer

Breaking code into tokens

The lexer (also called a tokenizer or scanner) is the first stage of understanding your code. It reads your source code character by character and groups the characters into meaningful chunks called tokens.

What is a Token?

A token is the smallest meaningful unit in a programming language. Think of tokens as the words in a sentence: just as "The cat sat" is made of three words, a line of code is made of tokens.

Let's see what happens when the lexer processes this simple line:

Input
circle(100, 50, 30)

The lexer produces this stream of tokens:

identifier circle
punctuation (
number 100
punctuation ,
number 50
punctuation ,
number 30
punctuation )
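
To make the stream above concrete, here is a minimal sketch of how it could be held in memory. The names Category, SimpleToken, and circleTokens are hypothetical and chosen for illustration; they are not taken from Bloom's source.

```typescript
// Hypothetical token shape for illustration; Bloom's real type may differ.
type Category = "identifier" | "punctuation" | "number";

interface SimpleToken {
  category: Category;
  text: string;
}

// The stream produced for: circle(100, 50, 30)
const circleTokens: SimpleToken[] = [
  { category: "identifier", text: "circle" },
  { category: "punctuation", text: "(" },
  { category: "number", text: "100" },
  { category: "punctuation", text: "," },
  { category: "number", text: "50" },
  { category: "punctuation", text: "," },
  { category: "number", text: "30" },
  { category: "punctuation", text: ")" },
];

// Joining the token texts rebuilds the input without its spaces,
// because spaces never become tokens.
const rebuilt: string = circleTokens.map((t) => t.text).join("");
```

Note that the spaces after the commas are gone from the rebuilt string: the tokens carry the meaning, not the layout.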

Types of Tokens in Bloom

Bloom recognizes several categories of tokens:

Keywords

Reserved words with special meaning:

keyword fn
keyword let
keyword if
keyword else
keyword for
keyword while
keyword return
keyword true
keyword false

Identifiers

Names you give to things (variables, functions):

identifier myVariable
identifier circle
identifier playerX

Literals

Values written directly in code:

number 42
number 3.14
string "hello"

Operators

Symbols for operations:

operator +
operator -
operator *
operator /
operator ==
operator !=
operator ..

How the Lexer Works

The lexer reads your code one character at a time. Here's the mental model:

Source code
let x = 5
↑ cursor
Tokens produced
keyword LET

When the lexer sees l, it doesn't immediately create a token. It keeps reading (e, t) until it hits a space. Then it checks: is "let" a keyword? Yes! So it creates a LET token.
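
The read-then-check idea can be sketched roughly as follows. This is an illustration, not Bloom's actual code: scanWord, isWordChar, and WordToken are hypothetical names, and the rule that a word consists of letters, digits, and underscores is an assumption. The keyword list is the one given earlier in this guide.

```typescript
// Keywords listed earlier in this guide.
const KEYWORDS = new Set([
  "fn", "let", "if", "else", "for", "while", "return", "true", "false",
]);

interface WordToken {
  type: "KEYWORD" | "IDENTIFIER";
  lexeme: string;
}

// Assumption: words are made of letters, digits, and underscores.
function isWordChar(ch: string): boolean {
  return /^[A-Za-z0-9_]$/.test(ch);
}

// Reads one word starting at `start` and reports where the cursor ends up.
function scanWord(source: string, start: number): { token: WordToken; next: number } {
  let end = start;
  // Keep reading until a character that cannot be part of a word.
  while (end < source.length && isWordChar(source.charAt(end))) {
    end++;
  }
  const lexeme = source.slice(start, end);
  // Only now, with the whole word in hand, decide what it is.
  const type: WordToken["type"] = KEYWORDS.has(lexeme) ? "KEYWORD" : "IDENTIFIER";
  return { token: { type, lexeme }, next: end };
}
```

Deciding only after the whole word is read is what keeps a name like letter from being mistaken for the keyword let followed by something else.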

Step-by-Step Example

Let's trace through tokenizing this code:

Input
let x = 10 + 5
let
Read "l", "e", "t". Hit space. Check keyword table → Found "let" → Emit LET token
x
Read "x". Hit space. Not a keyword → Emit IDENTIFIER token with value "x"
=
Read "=". The next character is a space, so this is a single-character operator → Emit EQUALS token
10
Read "1", "0". Hit space. All digits → Emit NUMBER token with value 10
+
Read "+". The next character is a space, so this is a single-character operator → Emit PLUS token
5
Read "5". End of input. → Emit NUMBER token with value 5

Final result:

keyword let
identifier x
operator =
number 10
operator +
number 5
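
The trace above can be reproduced with a very small tokenizer. This is a sketch under stated assumptions, not the implementation in Bloom: tokenize and the Token shape are hypothetical, it only knows the characters that appear in this one example, and it handles whole numbers only.

```typescript
type TokenType = "LET" | "IDENTIFIER" | "EQUALS" | "PLUS" | "NUMBER";

interface Token {
  type: TokenType;
  lexeme: string;
}

function tokenize(source: string): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  while (i < source.length) {
    const ch = source.charAt(i);
    if (ch === " ") {
      i++; // spaces separate tokens but are not tokens themselves
    } else if (/[A-Za-z_]/.test(ch)) {
      // Read a whole word, then check whether it is the keyword "let".
      let end = i;
      while (end < source.length && /[A-Za-z0-9_]/.test(source.charAt(end))) end++;
      const lexeme = source.slice(i, end);
      tokens.push({ type: lexeme === "let" ? "LET" : "IDENTIFIER", lexeme });
      i = end;
    } else if (/[0-9]/.test(ch)) {
      // Read consecutive digits (decimals are left out of this sketch).
      let end = i;
      while (end < source.length && /[0-9]/.test(source.charAt(end))) end++;
      tokens.push({ type: "NUMBER", lexeme: source.slice(i, end) });
      i = end;
    } else if (ch === "=") {
      tokens.push({ type: "EQUALS", lexeme: ch });
      i++;
    } else if (ch === "+") {
      tokens.push({ type: "PLUS", lexeme: ch });
      i++;
    } else {
      throw new Error(`Unexpected character '${ch}'`);
    }
  }
  return tokens;
}

const traced = tokenize("let x = 10 + 5").map((t) => t.type).join(" ");
```

Here traced holds the token types in order, one per step of the walkthrough.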

Handling Multi-Character Operators

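Some operators are more than one character. When the lexer sees =, it peeks at the next character without consuming it. If that character is also =, the lexer consumes both and emits a single == token. Otherwise it emits a plain = token.

This is called lookahead. Bloom uses 1-character lookahead for these operators: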

2-char ==
2-char !=
2-char <=
2-char >=
2-char +=
2-char -=
2-char ..
2-char ++
2-char --
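
One way to picture the peek is the sketch below. It is illustrative only: scanOperator and TWO_CHAR_OPERATORS are hypothetical names, and the set simply repeats the list above. Whether every first character is also valid on its own (for example a lone "!" or ".") is not stated in this guide, so the sketch does not check that.

```typescript
// The two-character operators listed above.
const TWO_CHAR_OPERATORS = new Set([
  "==", "!=", "<=", ">=", "+=", "-=", "..", "++", "--",
]);

// Returns the operator lexeme that starts at `pos`.
function scanOperator(source: string, pos: number): string {
  const first = source.charAt(pos);
  // Peek one character ahead; charAt returns "" past the end of input.
  const next = source.charAt(pos + 1);
  const pair = first + next;
  // Prefer the longer match when the pair is a known operator.
  return TWO_CHAR_OPERATORS.has(pair) ? pair : first;
}
```

For example, scanOperator("== 5", 0) yields the two-character lexeme, while scanOperator("= 5", 0) yields just the first character because "= " is not in the set.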

Error Tracking

Each token remembers where it came from in your source code:

{
  type: "NUMBER",
  lexeme: "42",
  literal: 42,
  line: 3,
  column: 10
}

This is crucial for error messages. When something goes wrong later (during parsing or execution), Bloom can point you to the exact location in your code.
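
As a rough illustration of where line and column come from, the sketch below walks the source and counts newlines. The names positionAt and describeError are hypothetical, and the choice of 1-based line and column numbers is an assumption; this guide does not say which convention Bloom uses.

```typescript
interface Position {
  line: number;
  column: number;
}

// Assumption: lines and columns both start at 1.
function positionAt(source: string, offset: number): Position {
  let line = 1;
  let column = 1;
  for (let i = 0; i < offset && i < source.length; i++) {
    if (source.charAt(i) === "\n") {
      line++;      // a newline moves to the next line...
      column = 1;  // ...and resets the column
    } else {
      column++;
    }
  }
  return { line, column };
}

// Builds a message that points at a location in the source.
function describeError(message: string, pos: Position): string {
  return `${message} at line ${pos.line}, column ${pos.column}`;
}

const program = "let a = 1\nlet b = 2\nlet ab = 42";
const where = positionAt(program, program.indexOf("42"));
```

A real lexer would more likely update line and column as the cursor advances rather than recounting from the start, but the bookkeeping is the same.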

Fun fact: The lexer ignores whitespace and comments. They're not tokens; they're just there for humans. The parser never sees them.

What the Lexer Ignores

Input
// This is a comment
let x = 5   // Another comment

Output: Just LET, IDENTIFIER(x), EQUALS, NUMBER(5)
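
Skipping can be sketched as a helper that moves the cursor past anything ignorable before each token is read. This is an illustration under assumptions: skipIgnored is a hypothetical name, and it treats spaces, tabs, and line breaks as whitespace and everything from // to the end of the line as a comment.

```typescript
// Returns the position of the next character that could start a token.
function skipIgnored(source: string, pos: number): number {
  let i = pos;
  while (i < source.length) {
    const ch = source.charAt(i);
    if (ch === " " || ch === "\t" || ch === "\r" || ch === "\n") {
      i++; // plain whitespace
    } else if (ch === "/" && source.charAt(i + 1) === "/") {
      // Line comment: skip up to (not past) the newline;
      // the whitespace branch consumes the newline on the next pass.
      while (i < source.length && source.charAt(i) !== "\n") i++;
    } else {
      break; // something meaningful starts here
    }
  }
  return i;
}

const commented = "// This is a comment\nlet x = 5   // Another comment";
const firstTokenAt = skipIgnored(commented, 0);
```

Here firstTokenAt lands on the l of let, so the first comment never produces a token.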

Common Lexer Errors

The lexer catches these problems early:

"hello
String never closed — Unterminated string
@#$
Unknown character — Unexpected character '@'
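
Both errors can be detected with simple checks, sketched below. The names scanString, checkCharacter, and KNOWN_CHARACTER are hypothetical, the set of accepted characters is an assumption based only on the symbols used in this guide, and escape sequences inside strings are left out.

```typescript
// Reads a string literal whose opening quote is at `start`.
function scanString(source: string, start: number): { value: string; next: number } {
  let i = start + 1;
  // Look for the closing quote.
  while (i < source.length && source.charAt(i) !== '"') i++;
  if (i >= source.length) {
    // Ran out of input before finding it.
    throw new Error("Unterminated string");
  }
  return { value: source.slice(start + 1, i), next: i + 1 };
}

// Assumption: characters a lexer like this accepts outside of strings.
// Bloom's real set is not documented in this guide.
const KNOWN_CHARACTER = /^[A-Za-z0-9_\s"()+\-*\/=!<>.,]$/;

function checkCharacter(ch: string): void {
  if (!KNOWN_CHARACTER.test(ch)) {
    throw new Error(`Unexpected character '${ch}'`);
  }
}
```

Reporting these at the lexing stage means the parser only ever receives well-formed tokens.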

In the Source Code

The lexer implementation lives in src/lang/lexer.ts. Key methods: