# How It Works

sgrep transforms your search query into a semantic journey through your codebase. This document explains the complete pipeline from query to colored results.

## Query Pipeline

When you run `sgrep "authenticate user" src/`, several steps occur:

### Step 1: Query Tokenization

Your natural language query is tokenized using WordPiece:

```
"authenticate user"
  ↓
["authenticate", "user"]
```

WordPiece breaks words into subword units, handling unknown words by splitting them:

```
"authentication" → ["auth", "##ent", "##ication"]
```
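
To make the splitting rule concrete, here is a minimal greedy longest-match-first WordPiece sketch. The vocabulary set, the `##` continuation prefix, and the `[UNK]` fallback follow the standard WordPiece scheme; the helper itself is illustrative, not sgrep's actual tokenizer.

```rust
use std::collections::HashSet;

/// Greedy longest-match-first WordPiece over ASCII input: repeatedly take the
/// longest prefix present in the vocabulary, marking continuations with "##".
fn wordpiece(word: &str, vocab: &HashSet<String>) -> Vec<String> {
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < word.len() {
        let mut end = word.len();
        let piece = loop {
            if end == start {
                return vec!["[UNK]".to_string()]; // no vocabulary entry covers this span
            }
            let candidate = if start == 0 {
                word[start..end].to_string()
            } else {
                format!("##{}", &word[start..end])
            };
            if vocab.contains(&candidate) {
                break candidate;
            }
            end -= 1;
        };
        pieces.push(piece);
        start = end;
    }
    pieces
}
```

With a vocabulary containing `auth`, `##ent`, and `##ication`, this reproduces the split shown above.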

### Step 2: Query Embedding

Each token is looked up in the pre-computed embedding table:

```
"authenticate" → [0.12, -0.34, 0.56, ...]  (256 dimensions)
"user"        → [0.78, -0.12, 0.34, ...]
```

These embeddings are averaged (mean pooling) and normalized:

```
query_embedding = mean([token_embeddings]) / ||mean||
```

Result: A single 256-dimensional vector representing your query's meaning.
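
A sketch of this pooling step, assuming the token embeddings arrive as plain `Vec<f32>` vectors (the 256-dimension width matches the text above; the function name is illustrative):

```rust
/// Mean-pool a non-empty set of token embeddings and L2-normalize the result,
/// so that later cosine similarities reduce to plain dot products.
fn pool_and_normalize(token_embeddings: &[Vec<f32>]) -> Vec<f32> {
    let dim = token_embeddings[0].len(); // 256 for sgrep's model
    let mut mean = vec![0.0f32; dim];
    for emb in token_embeddings {
        for (m, &v) in mean.iter_mut().zip(emb) {
            *m += v;
        }
    }
    let n = token_embeddings.len() as f32;
    mean.iter_mut().for_each(|m| *m /= n);

    // Normalize to unit length; guard against an all-zero vector.
    let norm = mean.iter().map(|v| v * v).sum::<f32>().sqrt();
    mean.iter_mut().for_each(|m| *m /= norm.max(f32::EPSILON));
    mean
}
```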

### Step 3: File Discovery

sgrep walks the directory tree to find files:

```
src/
├── main.rs
├── auth/
│   ├── login.rs
│   └── oauth.rs
└── db/
    └── connection.rs
```

Files are filtered by (a walker sketch follows the list):
- Extension (configurable language support)
- `.gitignore` and `.ignore` files
- Binary file detection
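
A common way to get this behavior in Rust is the `ignore` crate (the gitignore-aware walker behind ripgrep); whether sgrep uses it internally is an assumption. Binary detection would be a separate content check and is not shown here.

```rust
use ignore::WalkBuilder;
use std::path::PathBuf;

/// Walk a directory, honoring .gitignore/.ignore files and skipping hidden
/// entries, and collect the remaining regular files.
fn discover_files(root: &str) -> Vec<PathBuf> {
    WalkBuilder::new(root)
        .hidden(true)      // skip hidden files and directories
        .git_ignore(true)  // honor .gitignore
        .ignore(true)      // honor .ignore
        .build()
        .filter_map(Result::ok)
        .filter(|entry| entry.file_type().map_or(false, |t| t.is_file()))
        .map(|entry| entry.into_path())
        .collect()
}
```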

### Step 4: Line Embedding

For each line in discovered files, sgrep:

1. **Tokenizes** the line: `fn login(user: User) {` → `["fn", "login", "(", "user", ":", "User", ")", "{"]`
2. **Looks up** each token's embedding
3. **Mean pools** the embeddings
4. **Normalizes** the result

Cached embeddings are used when available (see Caching section).

### Step 5: Similarity Computation

Cosine similarity is computed between query and each line:

```
score = query_embedding · line_embedding
```

Because both vectors were normalized in the earlier steps, cosine similarity reduces to a single dot product, which is extremely fast.
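
In code, the per-line score is then just:

```rust
/// Cosine similarity for unit-length vectors: a single dot product.
fn similarity(query: &[f32], line: &[f32]) -> f32 {
    query.iter().zip(line).map(|(q, l)| q * l).sum()
}
```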

### Step 6: Ranking and Filtering

Results are sorted by score and filtered:

```
threshold = 0.5  # default
top_k = 10       # default

results = all_matches
  .filter(m => m.score >= threshold)
  .sort_by(m => m.score.descending())
  .take(top_k)
```
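
In Rust, the same ranking step might look like this (`Match` is a hypothetical result struct for illustration, not sgrep's actual type):

```rust
struct Match {
    path: String,
    line_number: usize,
    score: f32,
}

/// Filter by threshold, sort by score descending, keep the top-k matches.
fn rank(mut matches: Vec<Match>, threshold: f32, top_k: usize) -> Vec<Match> {
    matches.retain(|m| m.score >= threshold);
    // Scores are finite, so the partial order is total here.
    matches.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap());
    matches.truncate(top_k);
    matches
}
```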

### Step 7: Display with Token Coloring

Each matching line is displayed with colored tokens. The color gradient represents each token's contribution to the match (see Token-Level Coloring below).

## Chunking Strategies

sgrep supports different granularity for search:

### Line-Level (Default)

Each line is embedded separately:

```rust
fn authenticate_user(username: &str, password: &str) -> Result<User, Error> {
    let user = db.find_user(username)?;
    if verify_password(password, &user.hash) {
        Ok(user)
    } else {
        Err(Error::AuthenticationFailed)
    }
}
```

Becomes eight separate embeddings, one per line.

**Pros:** Precise location, easier to scan
**Cons:** May miss multi-line concepts

### Paragraph-Level

Multiple lines grouped together (planned feature):

```rust
fn authenticate_user(...) {
    let user = db.find_user(username)?;
    if verify_password(password, &user.hash) {
```

One embedding for the whole function body.

**Pros:** Captures multi-line semantics
**Cons:** Harder to locate exact match
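
One possible shape for this planned mode is to group consecutive non-blank lines into chunks and embed each chunk as a unit. The sketch below is an illustration of that idea, not sgrep's actual design:

```rust
/// Group consecutive non-blank lines into chunks, returning each chunk with
/// the 1-based line number where it starts.
fn chunk_paragraphs(source: &str) -> Vec<(usize, String)> {
    let mut chunks = Vec::new();
    let mut start = 0;
    let mut buf = String::new();
    for (i, line) in source.lines().enumerate() {
        if line.trim().is_empty() {
            if !buf.is_empty() {
                chunks.push((start + 1, std::mem::take(&mut buf)));
            }
        } else {
            if buf.is_empty() {
                start = i; // first line of a new chunk
            }
            buf.push_str(line);
            buf.push('\n');
        }
    }
    if !buf.is_empty() {
        chunks.push((start + 1, buf));
    }
    chunks
}
```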

## Token-Level Coloring

The colored output shows which tokens contributed most to the match.

### Score Calculation

For each token in the matching line:

```
token_score = cosine_similarity(query_embedding, token_embedding)
```

### Normalization

Scores are normalized across all tokens in the match:

```
normalized = (score - min_score) / (max_score - min_score)
```

### Gradient Formula

A quadratic curve maps scores to colors:

```
t = normalized_score
t² = t * t

rgb(r, g, b):
  r = 140 + (255 - 140) × t²
  g = 140 + (50 - 140) × t²
  b = 140 + (50 - 140) × t²
```

**Result:**
- t=0: gray (140, 140, 140) — irrelevant
- t=0.5: slight red tint — somewhat relevant
- t=1.0: bright red (255, 50, 50) — highly relevant
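
As a sketch of the mapping (the constants are the ones given above; the function name is illustrative):

```rust
/// Map a normalized token score in [0, 1] to the gray-to-red gradient.
/// Squaring the score keeps mid-range tokens near gray, so only the strongest
/// contributors light up.
fn gradient(normalized: f32) -> (u8, u8, u8) {
    let t2 = normalized * normalized;
    let channel = |from: f32, to: f32| (from + (to - from) * t2).round() as u8;
    (
        channel(140.0, 255.0), // red rises toward 255
        channel(140.0, 50.0),  // green falls toward 50
        channel(140.0, 50.0),  // blue falls toward 50
    )
}
```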

### Example

Query: `"database connection"`

Match: `let pool = PgPool::connect(&db_url).await?;`

Token colors:
- `let` — gray (irrelevant)
- `pool` — gray (variable name)
- `PgPool` — gray (type name)
- `connect` — bright red (highly relevant!)
- `&db_url` — light red (somewhat relevant)
- `.await?` — gray (syntax)

## Caching Architecture

### SQLite Storage

Embeddings are cached in `~/.cache/sgrep/embeddings.db`:

```sql
CREATE TABLE embeddings (
  path TEXT NOT NULL,
  line_number INTEGER NOT NULL,
  content_hash TEXT NOT NULL,
  embedding BLOB NOT NULL,  -- 256 × int8
  modified_time REAL NOT NULL,
  PRIMARY KEY (path, line_number)
);

CREATE INDEX idx_path ON embeddings(path);
CREATE INDEX idx_hash ON embeddings(content_hash);
```

### Cache Lookup

When searching a file, sgrep follows these steps (sketched in code after the list):

1. **Check cache** for existing embeddings
2. **Verify** file modification time matches
3. **Compute SHA-256** of file content for integrity check
4. **Use cached** embeddings if valid
5. **Re-compute** only if file changed
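
A sketch of that lookup against the schema above, using the `rusqlite` crate. Whether sgrep actually uses rusqlite, and the helper's exact shape, are assumptions; the mtime check is omitted for brevity.

```rust
use rusqlite::{Connection, OptionalExtension};

/// Return the cached embedding blob for (path, line) if the stored content
/// hash still matches the file's current hash; otherwise None, signalling
/// that the line must be re-embedded.
fn cached_embedding(
    conn: &Connection,
    path: &str,
    line_number: i64,
    current_hash: &str,
) -> rusqlite::Result<Option<Vec<u8>>> {
    conn.query_row(
        "SELECT embedding FROM embeddings
         WHERE path = ?1 AND line_number = ?2 AND content_hash = ?3",
        (path, line_number, current_hash),
        |row| row.get(0),
    )
    .optional()
}
```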

### Cache Invalidation

Files are re-indexed when:

```rust
if cached_mtime != file_mtime || cached_hash != compute_hash(file) {
    recompute_embeddings(file);
    update_cache(file);
}
```

### Hybrid Mode Cache

In `--hybrid` mode, BM25 keyword results are also cached:

```sql
CREATE TABLE bm25_cache (
  path TEXT,
  pattern TEXT,
  results TEXT,  -- JSON array
  timestamp REAL
);
```

## Model Loading

### First Run

On first execution, sgrep downloads the model:

```bash
$ sgrep "test" src/
Downloading model: sgrep-code-v1 (~7.5MB)
Extracting to: ~/.cache/sgrep/model/sgrep-code-v1/
Model ready. Searching...
```

### Model Files

```
~/.cache/sgrep/model/sgrep-code-v1/
├── vocab.json        # Tokenizer vocabulary (500KB)
├── embeddings.bin    # Pre-computed token embeddings (7.5MB)
└── tokenizer.json    # WordPiece configuration (1MB)
```

### Lazy Loading

Embeddings are memory-mapped, not fully loaded:

```rust
let embeddings = MmappedEmbeddings::new("embeddings.bin")?;
// Only accessed pages are loaded into RAM
```

This keeps memory usage low even with large models.
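
A sketch of that pattern using the `memmap2` crate; the crate choice and the on-disk layout of `embeddings.bin` are assumptions here.

```rust
use memmap2::Mmap;
use std::fs::File;

/// Memory-map the embedding table. The OS pages in only the regions that are
/// actually read, so resident memory stays small even for large tables.
fn open_embeddings(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Safety: the mapped file must not be truncated or modified while in use.
    unsafe { Mmap::map(&file) }
}
```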

## Performance Optimizations

### Parallel Processing

Files are processed in parallel using Rayon:

```rust
use rayon::prelude::*;

// Each file is searched on a worker thread; matches are merged at the end.
files.par_iter()
    .flat_map(|file| search_file(file, query))
    .collect()
```

### SIMD Dot Product

When available, SIMD instructions accelerate similarity computation:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

unsafe fn cosine_simd(a: &[f32], b: &[f32]) -> f32 {
    // AVX2 processes 8 floats per instruction
}
```
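
A fuller sketch of such a kernel, assuming AVX2 and FMA are available (callers would guard with `is_x86_feature_detected!("avx2")`); the function name and remainder handling are illustrative:

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;

    let mut acc = _mm256_setzero_ps();
    for (ca, cb) in a.chunks_exact(8).zip(b.chunks_exact(8)) {
        // Fused multiply-add over 8 lanes at a time.
        let va = _mm256_loadu_ps(ca.as_ptr());
        let vb = _mm256_loadu_ps(cb.as_ptr());
        acc = _mm256_fmadd_ps(va, vb, acc);
    }

    // Horizontal sum of the 8 accumulator lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut total: f32 = lanes.iter().sum();

    // Scalar tail for any elements left over after the 8-wide chunks.
    let tail_a = a.chunks_exact(8).remainder();
    let tail_b = b.chunks_exact(8).remainder();
    for (x, y) in tail_a.iter().zip(tail_b) {
        total += x * y;
    }
    total
}
```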

### Early Termination

Low-score matches are discarded early:

```rust
if partial_score < threshold_lower_bound {
    continue; // Skip remaining tokens
}
```

### Result Batching

Results are streamed, not stored all at once:

```rust
search_results
    .take(top_k)
    .for_each(|result| print_result(result));
```

## End-to-End Example

```bash
$ sgrep "database query error" src/

[1. Computing query embedding]
  "database query error"
  → tokenize → ["database", "query", "error"]
  → embed → mean([emb_db, emb_query, emb_error])
  → normalize → [0.23, -0.45, 0.67, ...]

[2. Scanning files]
  src/main.rs: skip (not matching pattern)
  src/db/postgres.rs: process...

[3. Processing src/db/postgres.rs]
  Line 1: "use postgres::Client;"
    → embed → [0.12, 0.34, ...]
    → similarity = 0.32 → below threshold
  Line 42: "conn.query(\"SELECT * FROM users\").await?"
    → embed → [0.34, 0.56, ...]
    → similarity = 0.89 → match!

[4. Displaying results]
  src/db/postgres.rs:42
  conn.query("SELECT * FROM users").await?
       ^^^^ bright red (query!)
```

The entire process typically completes in 10-50ms for medium codebases.

## Summary

1. Query → embedding via tokenization + lookup + mean pool
2. Files → discovered via filesystem walk
3. Lines → embedded and cached
4. Similarities → computed via dot product
5. Results → ranked and filtered
6. Output → colored by token contribution

All offline, all local, all fast.