Search and organize Within a lot of Text !
Let’s look at full-text search indexing in PostgreSQL. Text search is a common task for a lot of application developers and startups in the initial stages. But one may not have the infra/capacity to deal with this. This is when PostgreSQL as a database can handle a lot of functionality to get started.
Let’s take an example and learn how PostgreSQL handles full-text search and indexing.
It’s a simple schema. We’re creating an application where users can save notes. And the goal is to provide users with a search capability on these notes. Since it’s not a primary functionality of the application, and the usage is expected to be low in the start, we can choose to have postgresql handle our text indexing and search for a while.
create table notes (
id serial primary key ,
title text ,
content text
)
Let’s insert some data into this table.
insert into notes (title, content) VALUES ('I love Running',
' I wrote a post about running. ' ||
'And it was my first run on the streets. ');
insert into notes (title, content) VALUES (
'Indexing in Postgresql. fulltext searching',
'Ill write an article on full txt indexing in '
|| 'postgres and avoid setting up a new ||
infrastructure') ;
So, every time a user hits a search query, our database takes the responsibility to provide search results based on the title and content columns. First let’s look at some components which handle full text search in Postgres, and then build upon this to implement a proper indexing to search quickly.
To search a text column, we need to use two special functions within PostgreSQL.
This method takes in a text field as an input and generates terms that can be used for searching. Execute the query below to see the output of how our note title’s will be parsed.
select title, to_tsvector('english' , title) from notes ;
to_tsvector output
This method accepts input text, parses it into tokens and lexemes that
can be used for search. For instance, running
, run —
all these words can be thought to have derived from run
. It’s easier to search these and users will feel the search to be more natural and accurate. Note that this is a language dependent feature. That’s the reason we pass English
as our language every time with these functions.
The numbers 2 and 3
next to love and run
represent the position of that word within the input string.
When we create an index for text data, internally, we’ll tell PostgreSQL to index these tokens and store row information along with these. So, when a user performs a search, PostgreSQL can get us the target rows from individual keywords.
A word of caution — Do not use this function on a large database without indexing. It may involve a lot of IO and compute to lookup, parse and generate these tokens.
This is the next part to parsing user input. to_tsquery
is a special function within PostgreSQL which parses user input and creates a query which can be used to search the tokens we built earlier. A text query is more than a simple set of terms with some conditions. An example is shown below.
select to_tsquery('english', 'run | ( first & time ) | time');
Look how this constructed a set of tokens separated by & and |.
This is the part where we tell the database that search for either run
or time
or first time
. These operators are necessary to instruct PostgreSQL on how to search our text . We’ll use this when we build our indexes for search.
PostgreSQL will use this to deliver results when we query our database.
Let’s put these two together and search our database for a phrase.
SELECT title
FROM notes
WHERE to_tsvector('english', title || content ) @@ to_tsquery('English', 'run | first | time');
And there we have our record. You may want to use Ids when fetching rows in your application.
Now that we understand what an index looks like, we can create our index so that terms are indexed while we are writing rows. This way we don’t need to build a term vector from title
and content
every single time a user performs a query.
There are two steps to create a full text index in Postgres.
to_tsvector
to generate the tokens.We alter the table to add a new Generated Column. Note that since we want to query both title and content, we concatenate both columns for the terms.
ALTER TABLE notes ADD COLUMN indexed_tokens tsvector
GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || content)) STORED
CREATE INDEX idx_fulltext_notes ON notes USING GIN (indexed_tokens);
Now, to search any text , we just use the generated column and perform a text match. Use the query below :
select id, title from notes where
notes.indexed_tokens @@ to_tsquery('English', 'run | love')And you should see an output like the one shown below
Output of Full text search
It is not very straightforward to work with full-text search in Postgres. But it can be very powerful when we want to enable search in your application. Something worth considering if you don’t have a huge volume of search users/data. A lot of complexity is handled by the database until you can figure out the real architecture for scaling it up. Or leave it fully functional without any additional infrastructure/services. This is just the tip of the iceberg. PostgreSQL provides more ways to search and a way to rank documents as well.