The numbers first
78% of visitors on e-commerce sites go to search first. 68% of them never return if results are bad.
Quick math: 1,000 visitors/day → 780 use search → if roughly 30% get bad results, that's about 235 visitors → at a 20% conversion rate you lose 47 potential buyers → at €40 average order = €1,880 lost per day.
From bad search results. That's it.
Four approaches, one PostgreSQL database
You don't need Elasticsearch. You don't need a third-party search service on day one. PostgreSQL handles all of this natively.
1. LIKE / ILIKE
The one everyone knows from university. Works only for exact substring matches: "banana" finds banana, "bananna" (typo) finds nothing. Zero results = lost conversion.
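A minimal sketch of the exact-match approach, assuming a hypothetical products table:

```sql
-- Case-insensitive substring match: fast to write, unforgiving of typos.
-- 'bananna' returns zero rows here.
SELECT id, name
FROM products
WHERE name ILIKE '%banana%';
```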
2. Similarity Search (trigrams)
PostgreSQL extension pg_trgm. Splits text into 3-character chunks, compares them, returns a similarity score. Handles typos, different word order, fuzzy matching. Fast with proper indexes.
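A sketch of trigram matching with pg_trgm, again assuming a hypothetical products table. The % operator uses the pg_trgm.similarity_threshold setting (0.3 by default):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A GIN trigram index speeds up both % matches and ILIKE '%…%' scans
CREATE INDEX idx_products_name_trgm
    ON products USING gin (name gin_trgm_ops);

-- The typo "bananna" still finds "banana", ranked by similarity score
SELECT name, similarity(name, 'bananna') AS score
FROM products
WHERE name % 'bananna'
ORDER BY score DESC;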
Downside: it can match parts of unrelated words. On a Norwegian project, the word "jul" (Christmas) appeared in the middle of unrelated product names and ranked them higher than actual Christmas products.
3. TS Vectors
Language-aware full-text search. Understands that "cakes" and "cake" are the same word. Supports weighted fields — name matches rank higher than description matches. Works great when your data is in one language.
Downside: sensitive to language mismatches. Norwegian index + English product name = broken ranking.
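A sketch of weighted full-text search, assuming hypothetical name and description columns. In production you would store the tsvector in a generated column with a GIN index rather than computing it per query:

```sql
-- Weighted ranking: a match in name (weight A) beats one in description (B).
-- 'norwegian' is a built-in text search configuration; stemming makes
-- "kake" match "kaker" and similar inflections.
SELECT id, name,
       ts_rank(
           setweight(to_tsvector('norwegian', name), 'A') ||
           setweight(to_tsvector('norwegian', coalesce(description, '')), 'B'),
           plainto_tsquery('norwegian', 'kake')
       ) AS rank
FROM products
WHERE to_tsvector('norwegian', name || ' ' || coalesce(description, ''))
      @@ plainto_tsquery('norwegian', 'kake')
ORDER BY rank DESC;
```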
4. Semantic / Hybrid Search
Not searching for what's written — searching for what the user means. Uses vector embeddings (mathematical coordinates of meaning). Combine it with trigram or TS Vector search using Reciprocal Rank Fusion (RRF) = Hybrid Search.
This is what we ended up using in production.
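A sketch of RRF fusion in one query. Assumptions: pgvector is installed, products has an embedding vector column and a precomputed search_vector tsvector column, and :query / :query_embedding are bound by the application (all names hypothetical):

```sql
-- Hybrid search: fuse semantic and full-text rankings with
-- Reciprocal Rank Fusion (RRF).
WITH semantic AS (
    SELECT id,
           ROW_NUMBER() OVER (ORDER BY embedding <=> :query_embedding) AS rnk
    FROM products
    ORDER BY rnk
    LIMIT 20
),
keyword AS (
    SELECT id,
           ROW_NUMBER() OVER (
               ORDER BY ts_rank(search_vector,
                                plainto_tsquery('norwegian', :query)) DESC
           ) AS rnk
    FROM products
    WHERE search_vector @@ plainto_tsquery('norwegian', :query)
    ORDER BY rnk
    LIMIT 20
)
SELECT id,
       -- RRF score: 1 / (k + rank) per ranking; k = 60 is the usual constant
       COALESCE(1.0 / (60 + s.rnk), 0)
     + COALESCE(1.0 / (60 + k.rnk), 0) AS score
FROM semantic s
FULL OUTER JOIN keyword k USING (id)
ORDER BY score DESC
LIMIT 10;
```

A product found by both rankings gets two reciprocal-rank contributions and floats to the top, which is the whole point of the fusion.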
Real numbers from a real project
Platform: Norwegian/Swedish q-commerce marketplace, 11,000+ products, multiple languages.
- Implemented hybrid search in 3 days (everyone expected weeks)
- After 1 year of production use + development: $700 total in OpenAI embedding costs for 20,000 products
- Key insight: cache your search vectors. If someone searches "banana", store that vector. Don't call OpenAI again for the same query.
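The caching insight can be as simple as one table, sketched here with hypothetical names (1536 dimensions matches e.g. OpenAI's smaller embedding models):

```sql
-- One row per distinct (normalized) query string.
-- Look up here first; call the embedding API only on a miss.
CREATE TABLE IF NOT EXISTS query_embedding_cache (
    query      text PRIMARY KEY,          -- store lower(trim(query))
    embedding  vector(1536) NOT NULL,     -- pgvector column type
    created_at timestamptz NOT NULL DEFAULT now()
);

SELECT embedding
FROM query_embedding_cache
WHERE query = lower(trim('Banana'));
```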
What most people get wrong with embeddings
The input matters as much as the approach.
Don't just send product name + description. Send context: what kind of store this is, what category the product is in, what language the store operates in. Better input → better coordinates → better results.
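One way to build that richer input is to assemble it in SQL before sending it to the embedding API. The store context string and column names below are hypothetical:

```sql
-- Embed more than the bare name: prepend store, category, and language
-- context so similar products land near each other in vector space.
SELECT id,
       concat_ws(' | ',
           'Nordic grocery marketplace',   -- store context (assumed)
           'Language: Norwegian',          -- store language (assumed)
           category,                       -- e.g. 'Fruit & Vegetables'
           name,
           description
       ) AS embedding_input
FROM products;
```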
Also: don't return vectors in your ORM queries. 1,500-dimensional float arrays on 100 products per page = performance disaster. Keep vector calculations on the database side.
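Keeping the vectors server-side looks like this, assuming a pgvector embedding column (<=> is pgvector's cosine-distance operator):

```sql
-- Rank by distance inside the database; never SELECT the embedding itself.
-- Shipping 1,500-dimensional float arrays per row to the app is wasted I/O.
SELECT id, name,
       embedding <=> :query_embedding AS distance
FROM products
ORDER BY embedding <=> :query_embedding
LIMIT 20;
```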
When NOT to use semantic search
Exact product codes, medical item numbers, serial numbers, anything where precision beats relevance — stick to ILIKE or trigrams.
The right approach depends on your data. There's no universal answer.
Third-party options worth knowing
- Timescale (pgai extension handles vectorization automatically; its companion vector index, pgvectorscale, is Rust-based)
- Supabase (PostgreSQL + background workers for embeddings)
- ParadeDB (custom index, good for complex use cases)
Trade-off: they handle complexity, but you're sending data outside your database and paying for it.
Three things to remember
- Index everything. Wrong or missing index = slow queries = bad UX.
- Set ef_search on HNSW indexes — default returns only 10 results.
- Cache embedding vectors for repeated search queries.
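With pgvector, the HNSW candidate-list size is a session setting:

```sql
-- Raise the HNSW candidate-list size for the current session.
-- Larger values = better recall, slower queries; tune against your data.
SET hnsw.ef_search = 100;
```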
Extended version of the talk from CODECON 2024, Žilina. Watch the full talk →