# I Built a Rust-Backed Python ANN Library Because Spark's LSH Was Too Slow
A pure Rust approximate nearest neighbor search library with Python bindings. No JVM, no Spark overhead, no serialization penalties. Just deltalake + polars + fast_lsh_ann, all Rust under the hood.
## The Problem
I was working on a vector search problem: querying 1 to 5 million user embeddings against a space of 400 million users to find nearest neighbors. The existing vector search API didn't support batch queries, which meant scaling it was both slow and expensive.
Spark's built-in LSH implementation was the obvious choice. It has BucketedRandomProjectionLSH, it reads Delta tables natively, and it's already in the ecosystem. But on a single node, Spark's overhead is brutal. You're paying for JVM startup, serialization between Python and Java, and a distributed coordination layer that does nothing useful when all your data fits in memory on one machine. For interactive queries and medium-scale datasets (hundreds of thousands to low millions of vectors), Spark LSH is overkill in the worst way: slow and resource-hungry for no benefit.
So I built my own.
## What fast_lsh_ann Is
fast_lsh_ann is a Bucketed Random Projection LSH library written in Rust with Python bindings via PyO3. It's API-compatible with Spark's LSH (same concepts, same operations) but it runs as a native library inside your Python process. No JVM. No Spark context. No cluster.
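To make the technique concrete, here is the core of bucketed random projection in plain Python: each hash table projects a vector onto a random direction and floors the result into a bucket of width `bucket_length`. This is an illustrative sketch, not the library's Rust code; the function names and the Gaussian projections are my own.

```python
import math
import random

def make_hash_tables(dim, num_hash_tables, seed):
    """One random Gaussian projection vector per hash table (illustrative)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_hash_tables)]

def hash_vector(vec, tables, bucket_length):
    """Bucketed random projection: h(v) = floor((v . r) / bucket_length)."""
    return tuple(
        math.floor(sum(v * r for v, r in zip(vec, proj)) / bucket_length)
        for proj in tables
    )

tables = make_hash_tables(dim=4, num_hash_tables=3, seed=42)
a = [1.0, 2.0, 3.0, 4.0]
b = [1.1, 2.1, 2.9, 4.0]  # a small perturbation of a
print(hash_vector(a, tables, bucket_length=2.0))
print(hash_vector(b, tables, bucket_length=2.0))
```

Nearby vectors tend to land in the same buckets, and the wider `bucket_length` is, the more often they do; multiple hash tables trade memory for recall.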
The stack is intentionally all-Rust:
- `deltalake` (Rust) for reading Delta tables directly
- `polars` (Rust) for DataFrame operations
- `fast_lsh_ann` (Rust) for the actual LSH computation
This means you can go from a Delta table on disk to nearest neighbor results without ever touching the JVM or serializing data between language runtimes. The data stays in native memory the entire time.
## Core Operations
The library covers the same surface area as Spark's LSH:
- `transform`: hash vectors into buckets
- `approxNearestNeighbors`: find the k closest vectors to a query
- `approxSimilarityJoin`: find all pairs of vectors within a distance threshold across two sets
- batch queries: run multiple nearest neighbor queries in parallel
Plus Delta Lake integration and a streaming mode for datasets larger than memory.
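Conceptually, nearest neighbor search over an LSH index is a two-phase operation: hash the query, collect every indexed vector that shares a bucket with it in any hash table, then rank only those candidates by exact distance. A pure-Python sketch of that flow (my own illustrative code, not the library's implementation):

```python
import math
import random
from collections import defaultdict

def fit_index(vectors, projections, bucket_length):
    """Build a map of (table_id, bucket) -> list of vector indices."""
    index = defaultdict(list)
    for i, v in enumerate(vectors):
        for t, proj in enumerate(projections):
            bucket = math.floor(sum(x * r for x, r in zip(v, proj)) / bucket_length)
            index[(t, bucket)].append(i)
    return index

def approx_nearest_neighbors(vectors, index, projections, bucket_length, query, k):
    # Phase 1: candidates are vectors sharing a bucket with the query in any table
    candidates = set()
    for t, proj in enumerate(projections):
        bucket = math.floor(sum(x * r for x, r in zip(query, proj)) / bucket_length)
        candidates.update(index.get((t, bucket), []))
    # Phase 2: exact Euclidean distance, computed only on the candidates
    scored = [(i, math.dist(vectors[i], query)) for i in candidates]
    scored.sort(key=lambda p: p[1])
    return scored[:k]

rng = random.Random(42)
dim, n = 8, 1000
projections = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
index = fit_index(vectors, projections, bucket_length=2.0)

query = vectors[123]  # query with a known member of the dataset
results = approx_nearest_neighbors(vectors, index, projections, 2.0, query, k=10)
print(results[0])  # the vector itself, at distance 0.0
```

The speed comes from phase 2 touching a small candidate set instead of all n vectors; batch queries parallelize phase 1 and phase 2 across queries.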
## Performance
Numbers on an M1 Mac, single node, 128-dimensional vectors:
| Dataset Size | Operation | Time | Throughput |
|---|---|---|---|
| 100K vectors | Read Delta table | 825ms | 121K vectors/sec |
| 100K vectors | Transform (hash) | 451ms | 222K vectors/sec |
| 100K vectors | Batch NN (100 queries) | 3.8ms | 26K queries/sec |
The batch nearest neighbor number is the one that matters most. 3.8ms for 100 queries against 100K vectors, which scales to ~30ms for 100 queries against 1 million 128-dimensional vectors. At that speed, you can do interactive vector search in a Python notebook without waiting for a Spark cluster to spin up.
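For the skeptical, the throughput column is just items divided by wall time:

```python
# Recompute the table's throughput column: throughput = items / seconds
read_tput = 100_000 / 0.825   # Delta read: ~121K vectors/sec
hash_tput = 100_000 / 0.451   # transform: ~222K vectors/sec
batch_tput = 100 / 0.0038     # batch NN: ~26K queries/sec

print(f"{read_tput:,.0f} {hash_tput:,.0f} {batch_tput:,.0f}")
```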
Models serialize to about 84 bytes and are picklable, so you can distribute them across workers if you do need to scale out. But the whole point is that for most workloads, you don't need to.
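A model that small is possible when the serialized form carries only hyperparameters: with a fixed seed, the random projections can be regenerated deterministically on the other side, so the projection matrix never needs to travel. A sketch of why that works, using a hypothetical parameter dict rather than the library's actual pickle format:

```python
import pickle
import random

# Hypothetical model state: just hyperparameters, no projection matrix.
params = {"bucket_length": 2.0, "num_hash_tables": 5, "seed": 42, "dim": 128}
blob = pickle.dumps(params)
print(len(blob), "bytes")  # a few dozen bytes, not megabytes

def projections_from(params):
    """Re-derive the projection vectors deterministically from the seed."""
    rng = random.Random(params["seed"])
    return [
        [rng.gauss(0.0, 1.0) for _ in range(params["dim"])]
        for _ in range(params["num_hash_tables"])
    ]

# Two workers unpickling the same blob rebuild identical projections,
# so they hash vectors into identical buckets.
p1 = projections_from(pickle.loads(blob))
p2 = projections_from(pickle.loads(blob))
assert p1 == p2
```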
## Delta Lake Without Spark
One of the more useful parts of this library is the Delta Lake integration. A lot of teams have embeddings sitting in Delta tables because that's what their Spark pipelines produce. Reading those tables back for vector search used to mean firing up Spark again, even when the query itself was trivial.
With fast_lsh_ann, you can read a Delta table, hash the embeddings, and run similarity queries, all in a regular Python script:
```python
from fast_lsh_ann import DeltaLSHProcessor

processor = DeltaLSHProcessor(
    bucket_length=2.0,
    num_hash_tables=5,
    seed=42,
)

# Transform Delta table, no Spark needed
result_df = processor.transform_delta(
    "/path/to/delta",
    embedding_column="embedding",
)

# Similarity join between two Delta tables
pairs = processor.similarity_join_delta(
    "/path/to/delta_a",
    "/path/to/delta_b",
    threshold=5.0,
)
```
For datasets that don't fit in memory, there's a streaming processor that reads and processes chunks:
```python
from fast_lsh_ann import StreamingLSHProcessor

processor = StreamingLSHProcessor(
    bucket_length=2.0,
    num_hash_tables=5,
    seed=42,
    chunk_size=50000,
)

processor.transform_to_delta(
    input_path="/path/to/large_delta",
    output_path="/path/to/output",
    embedding_column="embedding",
)
```
## The Count-Min Sketch Detour
During development, I spent time exploring Count-Min Sketch as a memory-efficient alternative for similarity joins. The idea was sound: instead of materializing all candidate pairs in memory, use a probabilistic data structure to track collision frequencies and only emit pairs that collide enough times.
In practice, it was slower. The overhead of maintaining and querying the sketch ate into the time savings from reduced memory allocation. The straightforward approach, just tracking pairs directly, turned out to be faster for the dataset sizes I was targeting. The CMS-based join is still in the library as `approx_similarity_join_cms` for cases where memory is the hard constraint (it uses a fixed ~10MB regardless of pair count), but for most workloads the default method wins.
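For reference, the core of a Count-Min Sketch is only a few lines, which is also why its constant per-update cost shows up so clearly in a hot loop. A minimal pure-Python version of the collision-counting idea (illustrative only; the library's Rust version and its sizing differ):

```python
import random

class CountMinSketch:
    """A width x depth grid of counters; estimates are upper bounds on true counts."""

    def __init__(self, width=2048, depth=4, seed=42):
        rng = random.Random(seed)
        self.width = width
        # One salt per row gives `depth` roughly independent hash functions
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _cols(self, item):
        return [hash((salt, item)) % self.width for salt in self.salts]

    def add(self, item):
        # Every update touches `depth` counters: constant work, but never free
        for row, col in zip(self.rows, self._cols(item)):
            row[col] += 1

    def estimate(self, item):
        # Minimum across rows: the least-collided counter, never an undercount
        return min(row[col] for row, col in zip(self.rows, self._cols(item)))

cms = CountMinSketch()
for _ in range(7):
    cms.add(("user_a", "user_b"))  # a candidate pair colliding 7 times
cms.add(("user_a", "user_c"))

print(cms.estimate(("user_a", "user_b")))  # at least 7; exact unless rows collide
```

Memory is fixed at `width * depth` counters no matter how many pairs you feed it, which is exactly the property that makes it attractive for similarity joins and exactly the per-update hashing cost that made it lose the benchmark.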
This is the kind of thing you only learn by building it and measuring. Reading papers about Count-Min Sketch makes it seem like an obvious win. Real benchmarks tell a different story.
## On Building It Without LLMs
I deliberately wrote the core Rust implementation without using LLMs. The tests, benchmarks, documentation, and PyO3 boilerplate? Sure, those are fair game for code generation. But the actual LSH logic, the hashing, the candidate filtering, the distance calculations: I wrote those by hand.
The reasoning is simple: the struggle of figuring things out, debugging, staring at wrong numbers until you find the off-by-one, that's where the learning happens. I wanted to understand LSH deeply enough to know exactly why my implementation makes the tradeoffs it does, not just have something that passes tests. When a user reports an edge case six months from now, I need to be able to reason about the behavior from first principles, not go ask an LLM what my own code does.
This doesn't mean LLMs are bad for coding. They're great at boilerplate, tests, and documentation, exactly the parts where correctness is easy to verify and the work is tedious. But for the core algorithm of a library you're publishing, I think you need to own every line.
## When to Use This vs. Spark LSH
| Scenario | Recommendation |
|---|---|
| Data fits on a single node | Use fast_lsh_ann |
| Interactive queries in a notebook | Use fast_lsh_ann |
| No existing Spark infrastructure | Use fast_lsh_ann |
| Already deep in a Spark ETL pipeline | Stick with Spark LSH |
| Need distributed processing across a cluster | Use Spark LSH |
The honest answer is: if your data fits on one machine, you probably don't need Spark for LSH. And a surprising amount of data fits on one machine if you're not wasting memory on JVM overhead.
## Installation and Getting Started
```bash
# Basic install
uv add fast_lsh_ann

# With Delta Lake support
uv add fast_lsh_ann deltalake polars
```

```python
from fast_lsh_ann import BucketedRandomProjectionLSH
import numpy as np

lsh = BucketedRandomProjectionLSH(
    bucket_length=2.0,
    num_hash_tables=5,
    seed=42,
)

vectors = np.random.randn(100_000, 128).astype(np.float32)
model = lsh.fit(vectors)

query = np.random.randn(128).astype(np.float32)
results = model.approx_nearest_neighbors(vectors, query, k=10)
# Returns: [(idx, distance), ...]
```
The library is on PyPI. Source is on GitHub. MIT licensed.
We help teams build and deploy search, retrieval, and ML infrastructure, from prototype to production. If you're dealing with vector search at scale and want to skip the Spark overhead, let's talk.