
Neon offers BM25 full-text search directly within Postgres, via ParadeDB’s pg_search extension. You can read more about that in our announcement post, and in our comparison with Elasticsearch and with tsvector/tsquery in Postgres.

The TL;DR is that supporting flexible full-text search previously meant running a separate Elasticsearch instance: an instance that had to be stood up, maintained, and continuously updated with data from Postgres. This took substantial effort. But pg_search changes that: now, all you need is Postgres. 

pg_search provides a new index type, plus the functions and operators to make use of it. It’s open-source, so it’s free to use — aside from the CPU time and disk space involved in creating, storing and searching your full-text indexes — and there’s no lock-in.

This all sounds pretty good. But before we start using a tool like pg_search in production, we need to know not just what it does qualitatively, but how it behaves quantitatively. How much data can pg_search deal with? And how much data can it deal with while still maintaining a reasonable search speed?

Does it scale?

Thinking about the sorts of data sets that might tax a full-text search system, I decided to aim high. I set out to do a full-text search of the public Web on a single Postgres instance.

My first question was: would pg_search cope? Could it deal with a data set this large and not fall over? If it could, my second question was: how fast would it be? Might I stand up a (naïve) Google competitor on a single machine? Or would this perhaps be better suited to one patient researcher issuing one query at a time?

Here’s my setup:

  • The database I used is the biggest on Neon’s mid-size Scale plan, auto-scaling up to 8 vCPUs and 32GB RAM. These are not toy specs, but they’re not monstrous.
  • For raw data, I loaded up one flavour of the Colossal Clean Crawled Corpus (C4). So ‘the public Web’ here means: clean, English-language web pages saved by Common Crawl in 2019. This is 156 billion words spread across 365 million documents. Postgres needs 596GB to store it, and pg_search takes another 284GB to index it.
  • As a fallback, I also loaded up a roughly 10% sample of the same C4 data set (on a database with the same specs) in a separate Neon project. This one takes 56GB of storage for the table, plus 27GB of pg_search index data.

In answer to my first question, there’s good news: pg_search absolutely can deal with the ~600GB full Web data set, indexing it in around a day.

In answer to my second question, things aren’t quite as rosy: full-text searches using this index take anywhere between several seconds and several minutes. That makes full-text search of the Web on one ordinary-sized Postgres instance (1) eminently usable for research purposes, but (2) clearly not ready to power a production service.

note

You can try this full-text Web search experiment at: https://c4-pg-search.jawj.workers.dev/. Bear in mind that if you found this post via a widely-shared link, you’re almost guaranteed to hit gateway timeouts: a single instance has no hope of keeping up with that much traffic!

Searches take as long as they do on this full Web data set because our index is much larger than the memory available to cache it. If we EXPLAIN ANALYZE a query, we can see that nearly all the time is spent streaming in parts of the index from storage right at the start.
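
For instance, wrapping the kind of query shown in the Querying section below looks like this (the exact plan and timings will of course vary):

EXPLAIN (ANALYZE, BUFFERS)
SELECT id, paradedb.score(id)
FROM pages
WHERE body @@@ '"bm25 search"~2'
ORDER BY paradedb.score(id) DESC
LIMIT 20;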

So, if memory is the key issue, how do things go with the 10% subset of the data, where the index is smaller than available RAM? Over this data set, it turns out that the great majority of queries complete in under a second. In terms of services we could sensibly launch to users, this is much more like it.

And although the data set is smaller, it’s still large enough that this performance by pg_search is no mean feat. We’re doing a full-text search of about 15 billion words across 35 million documents, in the blink of an eye on one pretty ordinary machine.

How do I use it?

In case it’s useful in your own exploration and use of pg_search, let me walk you through the steps to making a full-text web search work.

note

By following along here, you’ll store almost 1TB of data. If you’re on a local machine, check your disk space. If you’re on Neon, make sure this fits within your budget.
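
If you’d like to keep an eye on usage as the load proceeds, here’s a quick way to check your database’s total size from psql or any SQL client:

SELECT pg_size_pretty(pg_database_size(current_database()));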

Loading

The first step is loading the data. As mentioned above, the Colossal Clean Crawled Corpus (C4) is a little under 600GB loaded into the database, plus about another 50% of that for the full-text pg_search index.

Here are some tips to make ingesting this amount of data into Postgres relatively painless:

  1. When loading data from a public location into Neon, consider using Google Colab. This is a cloud-based Jupyter notebook. Why? Firstly, because Colab notebooks are easy to update, annotate, and share. This is great for teamwork, transparency, and reproducibility. And secondly, because Colab has data-centre-grade Internet connectivity. That means your home/office/mobile Internet speed won’t be a bottleneck.
  2. Use a simple Postgres queue to track your data-loading progress. The C4 data set, for example, is split across 1024 files. You can create a table with one row per file, then use a `SELECT … FOR UPDATE SKIP LOCKED` query to identify and process each of them in turn (there’s a SQL sketch of this pattern below). The great thing about this is that it’s effortlessly robust both to interruptions — it always picks back up exactly where it left off — and to parallel processing, such as running multiple Colab copies at once.
  3. Batch up your `INSERT` statements: it’s faster to insert multiple rows at once. But bear in mind that Postgres supports a maximum of 65,535 parameter placeholders per query. If you’re inserting values into two columns per row (here: page URL + page text), you can therefore insert about 32,000 rows per query.
  4. Note that Postgres text columns can’t hold null characters. For page text in C4, we can substitute the Unicode replacement character: in Python, that’s `data['text'].replace("\x00", "\ufffd")`.

You can see all four tips in action in this example Colab, which loads the full C4 data into a Postgres database.
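
To make tip 2 more concrete, here’s a minimal sketch of the queue pattern. The table and column names here are illustrative rather than the ones the Colab actually uses:

-- one row per C4 source file
CREATE TABLE load_queue (
  file_path text PRIMARY KEY,
  done      boolean NOT NULL DEFAULT false
);

-- each worker repeatedly claims one unfinished, unclaimed file ...
BEGIN;
SELECT file_path
FROM load_queue
WHERE NOT done
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- ... inserts that file's pages, then marks the file done and commits
UPDATE load_queue SET done = true WHERE file_path = '...';  -- the path claimed above
COMMIT;

Because a claimed row stays locked only until COMMIT, an interrupted worker simply releases its claim and the file gets picked up again later, while any number of parallel workers can run the same loop without stepping on each other.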

Indexing

Next up, we need to index the page text. Install pg_search:

CREATE EXTENSION pg_search;

With a data set of this size, we need to tweak some parameters to improve performance and not exhaust available memory:

SET maintenance_work_mem = '16GB';
-- half our available 32GB

SET paradedb.create_index_memory_budget = 4096;
-- default is 1024: higher means fewer segments and better search performance

SET paradedb.create_index_parallelism = 1;
-- parallelism on a data set this big may lead to OOM errors, especially on older versions of pg_search

\timing
-- assuming you're in psql, let's find out how long the next step takes

And now, create the index:

CREATE INDEX search_idx ON pages USING bm25 (id, body) WITH (key_field = 'id');

You can check progress with:

\x
Expanded display is on.

SELECT * FROM pg_stat_progress_create_index;
-[ RECORD 1 ]------+---------------
pid                | 5023
datid              | 16389
datname            | c4_pg_search
relid              | 101373
index_relid        | 0
command            | CREATE INDEX
phase              | building index
lockers_total      | 0
lockers_done       | 0
current_locker_pid | 0
blocks_total       | 38106473
blocks_done        | 28154821
tuples_total       | 0
tuples_done        | 0
partitions_total   | 0
partitions_done    | 0

Indexing took a little over 24 hours for me. Note that it can take an hour or so to finish even after blocks_done reaches blocks_total in the above display. Check total sizes with:

SELECT
  pg_size_pretty(pg_table_size('pages')) AS data_size,
  pg_size_pretty(pg_relation_size('search_idx')) AS index_size;
 data_size | index_size 
-----------+------------
 596 GB    | 284 GB

You can also get more details about your index using:

SELECT * FROM paradedb.index_info('search_idx');

Querying

Lastly, we need queries to use this new index. You can see how my simple web app does this on GitHub. I use a phrase search with a slop value of 2, which loosens the phrase match so that its words can appear a couple of positions apart or out of order. For example:

SELECT id, url, body, paradedb.score(id), paradedb.snippet(body)
FROM pages
WHERE body @@@ '"bm25 search"~2'
ORDER BY paradedb.score(id) DESC, id ASC
OFFSET 0 LIMIT 20;
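
The same `@@@` predicate works inside an ordinary aggregate, so if you also want a total hit count to go with the paginated results, something like this does the job (it has to visit every match, so expect it to be slower than the LIMITed query above):

SELECT count(*) FROM pages
WHERE body @@@ '"bm25 search"~2';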

The end

And that about wraps it up. The pg_search extension is actively maintained by ParadeDB, and available to all Neon users. On the largest Neon Scale plan instance it can search tens of GB of full-text data near-instantaneously (and hundreds of GB of full-text data eventually).

Try pg_search on the Neon Free Plan — no credit card required.