# Architecture

sgrep combines several key technologies to deliver fast, offline semantic code search. This document explains the technical foundations.

## Model2Vec

### Static Embeddings vs Transformers

Traditional transformer models like BERT and CodeBERT run a full forward pass through many attention layers for every query at search time, which is computationally expensive and slow.

**Model2Vec** takes a different approach:

1. **Pre-computation**: During model creation, the transformer runs once over the full vocabulary, computing an embedding for every token.
2. **Lookup at runtime**: At search time, we simply look up pre-computed embeddings from a static table.

This trades a small amount of quality for massive speed gains and smaller model size.

```
Transformer at search time:
query → [attention layers] → embedding (~100ms per query)

Model2Vec at search time:
query → [tokenize] → [lookup table] → embedding (~1ms per query)
```

### Embedding Pipeline

Given a code snippet or query, sgrep produces a 256-dimensional embedding:

```
"function authenticate user"
```

**Step 1: Tokenization (WordPiece)**

```
["function", "authenticate", "user"]
```

**Step 2: Embedding Lookup**

Each token maps to a 256-dimensional vector:

```
"function"    → [0.12, -0.34, 0.56, ...]  (256 values)
"authenticate" → [-0.23, 0.45, -0.67, ...]
"user"        → [0.78, -0.12, 0.34, ...]
```

**Step 3: Mean Pooling**

Average all token embeddings:

```
embedding = mean(token_embeddings)
          = [0.22, -0.003, 0.08, ...]
```

**Step 4: L2 Normalization**

Normalize to unit length for consistent similarity scores:

```
embedding = embedding / ||embedding||
```
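
Putting the four steps together, a minimal sketch in Rust might look like the following. The lookup table and tokenizer are assumed to be loaded already; the names here are illustrative, not sgrep's actual API.

```rust
use std::collections::HashMap;

/// Embed a pre-tokenized query: lookup → mean pooling → L2 normalization.
/// (Illustrative sketch; `table` maps vocabulary tokens to 256-dim vectors.)
fn embed(tokens: &[&str], table: &HashMap<String, Vec<f32>>, dims: usize) -> Vec<f32> {
    // Step 2 + 3: look up each token's vector and average them
    let mut pooled = vec![0.0f32; dims];
    let mut count = 0usize;
    for tok in tokens {
        if let Some(vec) = table.get(*tok) {
            for (p, v) in pooled.iter_mut().zip(vec) {
                *p += *v;
            }
            count += 1;
        }
    }
    if count > 0 {
        for p in pooled.iter_mut() {
            *p /= count as f32;
        }
    }
    // Step 4: L2-normalize so cosine similarity later reduces to a dot product
    let norm = pooled.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for p in pooled.iter_mut() {
            *p /= norm;
        }
    }
    pooled
}
```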

## Cosine Similarity

### The Math

To find similar code, we compute cosine similarity between query and document embeddings:

```
similarity(A, B) = (A · B) / (||A|| × ||B||)
```

Since embeddings are L2-normalized, this simplifies to:

```
similarity(A, B) = A · B
```

A single dot product operation: extremely fast on modern CPUs.
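
A minimal sketch of that comparison (illustrative, not sgrep's internal function):

```rust
/// Cosine similarity of two L2-normalized vectors: just a dot product.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```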

### Score Interpretation

- **1.0**: Identical meaning (same direction in vector space)
- **0.8-0.9**: Very similar (likely relevant)
- **0.6-0.8**: Somewhat related (may be relevant)
- **0.5-0.6**: Weakly related (edge cases)
- **below 0.5**: Likely unrelated

> **Note**: These thresholds vary by domain. Experiment with `--threshold` to find what works for your codebase.

## Int8 Quantization

### Why Quantize?

Floating point embeddings (f32) use 4 bytes per dimension:

```
256 dimensions × 4 bytes = 1,024 bytes per vector
```

With 50k vocabulary tokens, that's about 50 MB just for embeddings.

### Int8 Quantization

We convert f32 vectors to int8 using a scale factor:

```
int8_value = round(float32_value / scale)
```

The scale factor is computed per embedding vector:

```
scale = max(abs(vector)) / 127
```
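
A minimal sketch of this per-vector quantization rule (the function name is illustrative):

```rust
/// Quantize a f32 vector to int8, returning the values and the per-vector scale.
fn quantize(vector: &[f32]) -> (Vec<i8>, f32) {
    // scale = max(abs(vector)) / 127, guarding against the all-zero vector
    let max_abs = vector.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let quantized = vector.iter().map(|x| (x / scale).round() as i8).collect();
    (quantized, scale)
}
```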

### Size Reduction

```
256 dimensions × 1 byte = 256 bytes per vector
(4x smaller than f32)
```

### Dequantization

When computing similarity, we convert back to f32:

```
float32_value = int8_value × scale
```
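
One way to fuse dequantization with the similarity computation is to keep the inner loop in integer arithmetic and fold the two per-vector scale factors in at the end. This is a hedged sketch of that idea, not necessarily how sgrep implements it:

```rust
/// Dot product of two int8-quantized vectors with their per-vector scales.
fn dot_int8(a: &[i8], a_scale: f32, b: &[i8], b_scale: f32) -> f32 {
    // Accumulate in i32 to avoid overflow, then apply both scales once
    let sum: i32 = a.iter().zip(b).map(|(x, y)| *x as i32 * *y as i32).sum();
    sum as f32 * a_scale * b_scale
}
```

Keeping the inner loop in pure integer arithmetic is also what makes the int8 path cache- and SIMD-friendly.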

This maintains most of the quality while dramatically reducing memory usage and improving cache locality.

### Trade-offs

| Aspect | f32 | int8 |
|--------|-----|------|
| Size | 1KB per vector | 256B per vector |
| Precision | Full | Slight loss |
| Speed | Good | Better (cache) |
| Quality | Baseline | 2-5% MRR loss |

For code search, this trade-off is well worth it.

## WASM Compilation

sgrep can be compiled to WebAssembly for browser-based search:

```
Rust source → wasm32-wasi target → .wasm file
```

### Benefits

- **No server**: Search runs entirely in the browser
- **Privacy**: Code never leaves your machine
- **Speed**: Native performance via Wasm SIMD
- **Offline**: Works without internet connection

### Wasm Size

```
sgrep.wasm: ~500KB compressed
embeddings.bin: ~12MB compressed (int8 quantized)
vocab.json: ~500KB
tokenizer.json: ~1MB
```

The listed files total roughly 14MB compressed and are cached permanently after the first load.

### SIMD Acceleration

When available, sgrep uses Wasm SIMD for faster dot products:

```wasm
;; Pseudo-code: Wasm SIMD registers are 128-bit, so each instruction
;; operates on 4 f32 lanes at once.
f32x4.mul   ;; 4 element-wise products per instruction
f32x4.add   ;; 4 accumulations per instruction
```

This provides 2-4x speedup on supported browsers.
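
For reference, a hedged sketch of such a kernel using Rust's `core::arch::wasm32` intrinsics (requires the `simd128` target feature; assumes both slices have the same length, a multiple of 4, which holds for 256-dimensional vectors). This is illustrative, not sgrep's actual kernel:

```rust
#[cfg(target_arch = "wasm32")]
#[target_feature(enable = "simd128")]
unsafe fn dot_simd(a: &[f32], b: &[f32]) -> f32 {
    use core::arch::wasm32::*;

    let mut acc = f32x4_splat(0.0);
    for (ca, cb) in a.chunks_exact(4).zip(b.chunks_exact(4)) {
        // Unaligned 128-bit loads: 4 f32 lanes each
        let va = v128_load(ca.as_ptr() as *const v128);
        let vb = v128_load(cb.as_ptr() as *const v128);
        // 4 multiply-accumulates per iteration
        acc = f32x4_add(acc, f32x4_mul(va, vb));
    }
    // Horizontal sum of the 4 accumulator lanes
    f32x4_extract_lane::<0>(acc)
        + f32x4_extract_lane::<1>(acc)
        + f32x4_extract_lane::<2>(acc)
        + f32x4_extract_lane::<3>(acc)
}
```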

## Token-Level Coloring

### Score Gradient

sgrep highlights tokens based on their contribution to the match. The color gradient uses a quadratic curve:

```
score ∈ [0, 1]
t = score
t² = gradient weight (quadratic easing)
```

```
rgb(r, g, b) where:
r = 140 + (255 - 140) × t²
g = 140 + (50 - 140) × t²
b = 140 + (50 - 140) × t²
```

- At t = 0: gray (140, 140, 140)
- At t = 1: red (255, 50, 50)

The quadratic curve means (see the sketch after this list):
- Low similarity tokens stay gray
- Medium tokens show slight red
- High similarity tokens become bright red
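
A minimal sketch of the score-to-color mapping described above (illustrative names):

```rust
/// Map a token score in [0, 1] to an RGB color on the gray→red gradient.
fn score_to_rgb(score: f32) -> (u8, u8, u8) {
    let t = score.clamp(0.0, 1.0);
    let w = t * t; // quadratic easing: low scores stay near gray
    let lerp = |from: f32, to: f32| (from + (to - from) * w).round() as u8;
    (lerp(140.0, 255.0), lerp(140.0, 50.0), lerp(140.0, 50.0))
}
```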

### Computing Token Scores

Each token's contribution is computed as:

```
token_score = cosine_similarity(query_embedding, token_embedding)
```

The scores are then min-max normalized:

```
normalized = (score - min_score) / (max_score - min_score)
```

This ensures the gradient always spans the full range.
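
A minimal sketch of this min-max normalization, including the degenerate case where every token scores the same (illustrative, not sgrep's code):

```rust
/// Rescale raw token scores so they span [0, 1].
fn normalize_scores(scores: &[f32]) -> Vec<f32> {
    let min = scores.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    if (max - min).abs() < f32::EPSILON {
        return vec![0.0; scores.len()]; // all tokens equally relevant → gray
    }
    scores.iter().map(|s| (s - min) / (max - min)).collect()
}
```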

## Caching Architecture

### SQLite-Vec Index

sgrep caches embeddings in a local SQLite database using the `sqlite-vec` extension:

```sql
CREATE TABLE embeddings (
  path TEXT,
  line INTEGER,
  hash TEXT UNIQUE,
  embedding BLOB,  -- int8 quantized
  mtime REAL
);

CREATE VIRTUAL TABLE vec_index USING vec0(
  embedding float[256]
);
```

### Cache Invalidation

Files are re-indexed when any of the following holds (a staleness check is sketched after this list):

1. **mtime changes**: File modification time differs from cached value
2. **content hash differs**: SHA-256 of file content changed
3. **manual rebuild**: User runs `sgrep --rebuild-index`
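
One plausible way to combine the first two checks is to use the mtime as a cheap first pass and fall back to the content hash only when it differs. A minimal sketch under that assumption, with illustrative names rather than sgrep's actual schema bindings:

```rust
use std::{fs, path::Path, time::SystemTime};

/// What the cache remembers about an indexed file (illustrative).
struct CachedEntry {
    mtime: SystemTime,
    sha256: String,
}

/// Decide whether a file must be re-embedded.
fn needs_reindex(path: &Path, cached: &CachedEntry, hash_file: impl Fn(&Path) -> String) -> bool {
    let Ok(meta) = fs::metadata(path) else { return true; };
    let Ok(mtime) = meta.modified() else { return true; };
    if mtime == cached.mtime {
        return false; // unchanged mtime: trust the cached embeddings
    }
    // mtime changed; confirm with a content hash to avoid needless re-embedding
    hash_file(path) != cached.sha256
}
```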

### Cache Location

```
~/.cache/sgrep/
├── embeddings.db      # SQLite index
├── model/
│   ├── vocab.json
│   ├── embeddings.bin
│   └── tokenizer.json
└── registry/
    └── sgrep-code-v1/
```

## Performance Characteristics

### Typical Latencies

```
Tokenization: ~0.1ms per line
Embedding lookup: ~0.05ms per token
Similarity computation: ~0.001ms per comparison
Total per query: ~5-50ms depending on codebase size
```

### Scalability

| Lines of Code | Index Time | Search Time |
|---------------|------------|-------------|
| 1K | ~1s | ~5ms |
| 10K | ~5s | ~10ms |
| 100K | ~30s | ~20ms |
| 1M | ~5min | ~50ms |

Index time is one-time. Subsequent searches use cached embeddings.

## Memory Usage

```
Base binary: ~2MB
Model (int8): ~7.5MB
Per-file overhead: ~line_count × 256 bytes

Typical usage: 50-100MB for large projects
```

Memory usage scales linearly with codebase size.
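
As a rough worked example using the figures above (actual numbers vary by project):

```
200,000 lines × 256 bytes ≈ 51 MB (cached embeddings)
+ ~7.5 MB (model) + ~2 MB (binary) ≈ 60 MB total
```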