Architecture
sgrep combines several key technologies to deliver fast, offline semantic code search. This document explains the technical foundations.
Model2Vec
Static Embeddings vs Transformers
Traditional transformer models like BERT and CodeBERT run inference through deep neural networks at search time. This is computationally expensive and slow.
Model2Vec takes a different approach:
- Pre-computation: During model creation, the transformer runs on a massive vocabulary once, computing embeddings for every possible token.
- Lookup at runtime: At search time, we simply look up pre-computed embeddings from a static table.
This trades a small amount of quality for massive speed gains and smaller model size.
Transformer at search time:
query → [attention layers] → embedding (~100ms per query)
Model2Vec at search time:
query → [tokenize] → [lookup table] → embedding (~1ms per query)
Embedding Pipeline
Given a code snippet or query, sgrep produces a 256-dimensional embedding:
"function authenticate user"
Step 1: Tokenization (WordPiece)
["function", "authenticate", "user"]
Step 2: Embedding Lookup
Each token maps to a 256-dimensional vector:
"function" → [0.12, -0.34, 0.56, ...] (256 values)
"authenticate" → [-0.23, 0.45, -0.67, ...]
"user" → [0.78, -0.12, 0.34, ...]
Step 3: Mean Pooling
Average all token embeddings:
embedding = mean(token_embeddings)
= [0.22, -0.003, 0.08, ...]
Step 4: L2 Normalization
Normalize to unit length for consistent similarity scores:
embedding = embedding / ||embedding||
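
The following Rust sketch shows steps 2-4 under simplifying assumptions: the input is already tokenized, and the lookup table is a plain HashMap from token string to f32 vector (sgrep's real table is int8-quantized and keyed by token id). The function name embed is illustrative, not sgrep's actual API.

```rust
use std::collections::HashMap;

const DIM: usize = 256;

/// Illustrative embedding pipeline: lookup, mean pooling, L2 normalization.
/// Assumes pre-tokenized input and an f32 lookup table; not sgrep's real internals.
fn embed(tokens: &[&str], table: &HashMap<String, [f32; DIM]>) -> Vec<f32> {
    // Step 2: look up each token's pre-computed vector (unknown tokens are skipped).
    let vectors: Vec<&[f32; DIM]> = tokens.iter().filter_map(|t| table.get(*t)).collect();

    // Step 3: mean pooling across token vectors.
    let mut pooled = vec![0.0f32; DIM];
    for v in &vectors {
        for (p, x) in pooled.iter_mut().zip(v.iter()) {
            *p += *x;
        }
    }
    let count = vectors.len().max(1) as f32;
    pooled.iter_mut().for_each(|p| *p /= count);

    // Step 4: L2 normalization so cosine similarity becomes a plain dot product.
    let norm = pooled.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        pooled.iter_mut().for_each(|p| *p /= norm);
    }
    pooled
}
```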
Cosine Similarity
The Math
To find similar code, we compute cosine similarity between query and document embeddings:
similarity(A, B) = (A · B) / (||A|| × ||B||)
Since embeddings are L2-normalized, this simplifies to:
similarity(A, B) = A · B
A single dot product operation: extremely fast on modern CPUs.
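
In Rust terms, the whole scoring step reduces to a dot product over two normalized slices (a sketch, not sgrep's actual function):

```rust
/// Cosine similarity of two L2-normalized embeddings is just their dot product.
fn similarity(a: &[f32], b: &[f32]) -> f32 {
    debug_assert_eq!(a.len(), b.len());
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}
```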
Score Interpretation
- 1.0: Identical meaning (same direction in vector space)
- 0.8-0.9: Very similar (likely relevant)
- 0.6-0.8: Somewhat related (may be relevant)
- 0.5-0.6: Weakly related (edge cases)
- below 0.5: Likely unrelated
Note: These thresholds vary by domain. Experiment with --threshold to find what works for your codebase.
Int8 Quantization
Why Quantize?
Floating point embeddings (f32) use 4 bytes per dimension:
256 dimensions × 4 bytes = 1,024 bytes per vector
With 50k vocabulary tokens, that’s about 50 MB just for embeddings.
Int8 Quantization
We convert f32 vectors to int8 using a scale factor:
int8_value = round(float32_value / scale)
The scale factor is computed per embedding vector:
scale = max(abs(vector)) / 127
Size Reduction
256 dimensions × 1 byte = 256 bytes per vector
(4x smaller than f32)
Dequantization
When computing similarity, we convert back to f32:
float32_value = int8_value × scale
This maintains most of the quality while dramatically reducing memory usage and improving cache locality.
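
A minimal sketch of the per-vector quantization and dequantization described above (function names are illustrative; sgrep's actual implementation may differ):

```rust
/// Per-vector int8 quantization: scale = max(|x|) / 127, value = round(x / scale).
fn quantize(vector: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = vector.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let quantized = vector
        .iter()
        .map(|x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

/// Dequantization: multiply each int8 value back by the stored scale factor.
fn dequantize(quantized: &[i8], scale: f32) -> Vec<f32> {
    quantized.iter().map(|&q| q as f32 * scale).collect()
}
```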
Trade-offs
| Aspect | f32 | int8 |
|---|---|---|
| Size | 1KB per vector | 256B per vector |
| Precision | Full | Slight loss |
| Speed | Good | Better (cache) |
| Quality | Baseline | 2-5% MRR loss |
For code search, this trade-off is well worth it.
WASM Compilation
sgrep can be compiled to WebAssembly for browser-based search:
Rust source → wasm32-wasi target → .wasm file
Benefits
- No server: Search runs entirely in the browser
- Privacy: Code never leaves your machine
- Speed: Native performance via Wasm SIMD
- Offline: Works without internet connection
Wasm Size
sgrep.wasm: ~500KB compressed
embeddings.bin: ~12MB compressed (int8 quantized)
vocab.json: ~500KB
tokenizer.json: ~1MB
The total download is roughly 14 MB and is cached permanently after the first load.
SIMD Acceleration
When available, sgrep uses Wasm SIMD for faster dot products:
// Pseudo-code
f32x4.mul(a, b) -> one 128-bit instruction processes 4 values at once
This provides a 2-4x speedup on supported browsers.
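
A portable Rust sketch of a SIMD-friendly dot product: it accumulates four lanes per iteration, which the compiler can lower to 128-bit f32x4 instructions when built with the simd128 target feature enabled. This is an illustration, not sgrep's exact kernel.

```rust
/// Dot product written four lanes at a time. Compiled for wasm32 with
/// `-C target-feature=+simd128`, each iteration can map to f32x4 operations.
fn dot_simd_friendly(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; 4];
    let chunks = a.len() / 4;
    for i in 0..chunks {
        for lane in 0..4 {
            let idx = i * 4 + lane;
            acc[lane] += a[idx] * b[idx];
        }
    }
    // Tail elements (none for 256-dimensional vectors, which divide evenly by 4).
    let mut total: f32 = acc.iter().sum();
    for idx in chunks * 4..a.len() {
        total += a[idx] * b[idx];
    }
    total
}
```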
Token-Level Coloring
Score Gradient
sgrep highlights tokens based on their contribution to the match. The color gradient uses a quadratic curve:
score ∈ [0, 1]
t = score
t² = interpolation factor for the gradient
rgb(r, g, b) where:
r = 140 + (255 - 140) × t²
g = 140 + (50 - 140) × t²
b = 140 + (50 - 140) × t²
At t=0: gray (140, 140, 140)
At t=1: red (255, 50, 50)
The quadratic curve means:
- Low similarity tokens stay gray
- Medium tokens show slight red
- High similarity tokens become bright red
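
As a sketch, the gradient above maps directly onto a small helper (the function name token_color is illustrative):

```rust
/// Map a token score in [0, 1] to an (r, g, b) color using the quadratic curve above.
fn token_color(score: f32) -> (u8, u8, u8) {
    let t2 = score.clamp(0.0, 1.0).powi(2);
    let lerp = |from: f32, to: f32| (from + (to - from) * t2).round() as u8;
    (lerp(140.0, 255.0), lerp(140.0, 50.0), lerp(140.0, 50.0))
}
```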
Computing Token Scores
Each token’s contribution is computed as:
token_score = cosine_similarity(query_embedding, token_embedding)
Tokens are then normalized:
normalized = (score - min_score) / (max_score - min_score)
This ensures the gradient always spans the full range.
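
A minimal sketch of that min-max normalization (again illustrative, not the exact implementation):

```rust
/// Min-max normalize per-token scores so the color gradient spans the full [0, 1] range.
fn normalize_scores(scores: &[f32]) -> Vec<f32> {
    let min = scores.iter().copied().fold(f32::INFINITY, f32::min);
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let range = max - min;
    if range <= f32::EPSILON {
        // All tokens scored the same; paint everything gray rather than divide by zero.
        return vec![0.0; scores.len()];
    }
    scores.iter().map(|s| (s - min) / range).collect()
}
```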
Caching Architecture
SQLite-Vec Index
sgrep caches embeddings in a local SQLite database using the sqlite-vec extension:
CREATE TABLE embeddings (
  path TEXT,
  line INTEGER,
  hash TEXT UNIQUE,
  embedding BLOB,  -- int8 quantized
  mtime REAL
);
CREATE VIRTUAL TABLE vec_index USING vec0(
  embedding float[256]
);
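
As an illustration only, here is how the plain embeddings table could be created with the rusqlite crate. Whether sgrep actually uses rusqlite is an assumption, and the vec0 virtual table additionally requires the sqlite-vec extension to be loaded, which is omitted here.

```rust
use rusqlite::Connection;

/// Open (or create) the cache database and the plain embeddings table.
/// Creating the vec0 virtual table would require loading sqlite-vec first.
fn open_cache(path: &str) -> rusqlite::Result<Connection> {
    let conn = Connection::open(path)?;
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS embeddings (
            path TEXT,
            line INTEGER,
            hash TEXT UNIQUE,
            embedding BLOB,  -- int8 quantized
            mtime REAL
        );",
    )?;
    Ok(conn)
}
```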
Cache Invalidation
Files are re-indexed when:
- mtime changes: File modification time differs from cached value
- content hash differs: SHA-256 of file content changed
- manual rebuild: user runs sgrep --rebuild-index
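
A minimal sketch of the first two checks, assuming the sha2 crate for hashing; the function name and cache fields are illustrative, not sgrep's real cache layer:

```rust
use sha2::{Digest, Sha256};
use std::{fs, time::UNIX_EPOCH};

/// Re-index decision: cheap mtime comparison first, content hash only if it changed.
fn needs_reindex(path: &str, cached_mtime: f64, cached_hash: &str) -> std::io::Result<bool> {
    let meta = fs::metadata(path)?;
    let mtime = meta
        .modified()?
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .as_secs_f64();
    if (mtime - cached_mtime).abs() < f64::EPSILON {
        return Ok(false); // mtime unchanged: trust the cached embedding
    }
    let digest = Sha256::digest(fs::read(path)?);
    let hash: String = digest.iter().map(|b| format!("{:02x}", b)).collect();
    Ok(hash != cached_hash)
}
```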
Cache Location
~/.cache/sgrep/
├── embeddings.db # SQLite index
├── model/
│ ├── vocab.json
│ ├── embeddings.bin
│ └── tokenizer.json
└── registry/
└── sgrep-code-v1/
Performance Characteristics
Typical Latencies
Tokenization: ~0.1ms per line
Embedding lookup: ~0.05ms per token
Similarity computation: ~0.001ms per comparison
Total per query: ~5-50ms depending on codebase size
Scalability
| Lines of Code | Index Time | Search Time |
|---|---|---|
| 1K | ~1s | ~5ms |
| 10K | ~5s | ~10ms |
| 100K | ~30s | ~20ms |
| 1M | ~5min | ~50ms |
Indexing is a one-time cost; subsequent searches use cached embeddings.
Memory Usage
Base binary: ~2MB
Model (int8): ~7.5MB
Per-file overhead: ~line_count × 256 bytes
Typical usage: 50-100MB for large projects
Memory usage scales linearly with codebase size.
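
As a rough worked example using the figures above (the constants are approximate):

```rust
/// Rough memory estimate: base binary + int8 model + 256 bytes per indexed line.
fn estimated_memory_bytes(indexed_lines: usize) -> usize {
    const BINARY: usize = 2 * 1024 * 1024; // ~2 MB base binary
    const MODEL: usize = 7_500_000;        // ~7.5 MB int8 model
    BINARY + MODEL + indexed_lines * 256   // ~256 bytes per indexed line
}

// Example: ~300k indexed lines -> 2 MB + 7.5 MB + ~77 MB ≈ 86 MB,
// which lands in the 50-100 MB range quoted above.
```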