Set session variable in postgresql

Before version 9.2, you needed to add your custom class to the custom_variable_classes parameter in postgresql.conf, like:

custom_variable_classes = 'myapp'

In 9.2, this requirement has been removed:

Remove the custom_variable_classes parameter (Tom Lane).
The checking provided by this setting was dubious. Now any setting can be prefixed by any class name.

So, since 9.2 you can just set your custom class variable directly; there is no need to touch postgresql.conf.

-- set_config(setting_name, new_value, is_local)
-- if is_local is false, the variable is set for the entire connection/session
-- Also:  set session "myapp.user" = 'admin';
SELECT set_config('myapp.user', 'admin', FALSE);
-- get var:
SELECT current_setting('myapp.user');

With SET LOCAL (Only in transaction):
Specifies that the command takes effect for only the current transaction. After COMMIT or ROLLBACK, the session-level setting takes effect again. Issuing this outside of a transaction block emits a warning and otherwise has no effect.

-- Also:  set LOCAL "myapp.user" = '22';
SELECT set_config('myapp.user', 'admin', TRUE);
-- get var:
SELECT current_setting('myapp.user');
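A minimal sketch of the transaction-local behaviour (same variable name as above):

BEGIN;
SELECT set_config('myapp.user', 'admin', TRUE);  -- only for this transaction
SELECT current_setting('myapp.user');            -- 'admin'
COMMIT;
-- after COMMIT the value reverts to the session-level setting
-- (empty if it was never set for the session)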

Source:
https://www.postgresql.org/docs/9.5/static/functions-admin.html#FUNCTIONS-ADMIN-SET-TABLE
http://dba.stackexchange.com/questions/97095/set-session-custom-variable-to-store-user-id
https://www.postgresql.org/docs/current/static/sql-set.html

Replacing EAV with JSONB in PostgreSQL

TL;DR: JSONB has potential for greatly simplifying schema design without sacrificing query performance.

Introduction

It must be one of the oldest use cases in the world of relational databases: you have an entity, and you need to store certain properties of this entity. But, not all entities have the same set of properties, and properties will be added in the future.

The most naive way to solve this problem would be to create a column in your table for each property, and just fill in the ones that are relevant. Great! Problem solved. Until your table contains millions of records and you need to add a new non-null property.

Enter Entity-Attribute-Value. I’ve seen this pattern in almost every database I’ve worked with. One table contains the entities, another table contains the names of the properties (attributes) and a third table links the entities with their attributes and holds the value. This gives you the flexibility for having different sets of properties (attributes) for different entities, and also for adding properties on the fly without locking your table for 3 days.

Nonetheless, I wouldn’t be writing this post if there were no downsides to this approach. Selecting one or more entities based on 1 attribute value requires 2 joins: one with the attribute table and one with the value table. Need entities based on 2 attributes? That’s 4 joins! Also, the properties usually are all stored as strings, which results in type casting, both for the result and for the WHERE clause. If you write a lot of ad-hoc queries, this is very tedious.

Despite these obvious shortcomings, EAV has been used for a long time to solve this kind of problem. It was a necessary evil, and there just was no better alternative. But then PostgreSQL came along with a new feature…

Starting from PostgreSQL 9.4, a JSONB datatype was added for storing binary JSON data. While storing JSON in this format usually takes slightly more space and time than plain-text JSON, executing operations on it is much faster. Also, JSONB supports indexing, making querying it even faster.

This new data type enables us to replace the tedious EAV pattern by adding a single JSONB column to our entity table, greatly simplifying the database design. But many argue that this must come with a performance cost. That’s why I created this benchmark.

Test database setup

For this comparison, I created a database on a fresh PostgreSQL 9.5 installation on an $80 DigitalOcean Ubuntu 14.04 box. After tuning some settings in postgresql.conf, I ran this script using psql.

The following tables were created for representing the data as EAV.

CREATE TABLE entity ( 
  id           SERIAL PRIMARY KEY, 
  name         TEXT, 
  description  TEXT
);
CREATE TABLE entity_attribute (
  id          SERIAL PRIMARY KEY, 
  name        TEXT
);
CREATE TABLE entity_attribute_value (
  id                  SERIAL PRIMARY KEY, 
  entity_id           INT    REFERENCES entity(id), 
  entity_attribute_id INT    REFERENCES entity_attribute(id), 
  value               TEXT
);

The table below represents the same data, but with the attributes in a JSONB column which I called properties.

CREATE TABLE entity_jsonb (
  id          SERIAL PRIMARY KEY, 
  name        TEXT, 
  description TEXT,
  properties  JSONB
);

A lot simpler, isn’t it?

Then, I loaded the exact same data for both patterns for a total of 10 million entities in the form of the one below. This way, we have some different data types among the attribute set.

{
  "id":          1,
  "name":        "Entity1",
  "description": "Test entity no. 1",
  "properties":  {
    "color":        "red",
    "length":       120,
    "width":        3.1882420,
    "hassomething": true,
    "country":      "Belgium"
  }
}

So, we now have the same data stored in both formats. Let’s start comparing!

Query aesthetics

Earlier it was already clear that the design of the database was greatly simplified by using a JSONB column for the properties instead of a three-table EAV model. But does this also reflect in the queries?

Updating a single entity property looks like this:

-- EAV
UPDATE entity_attribute_value 
SET value = 'blue' 
WHERE entity_attribute_id = 1 
  AND entity_id = 120;

-- JSONB
UPDATE entity_jsonb 
SET properties = jsonb_set(properties, '{"color"}', '"blue"') 
WHERE id = 120;

Admittedly, the latter doesn’t look simpler. To update the property in the JSONB object, we have to use the jsonb_set() function, and we have to pass our new value as a JSONB object. However, we don’t need to know any ids upfront. When you look at the EAV example, you have to know both the entity_id and the entity_attribute_id to perform the update. If you want to update a property in the JSONB column based on the entity name, go ahead, it’s all in the same row.
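For example, updating by name instead of by id looks like this ('Entity1' is simply the sample name from the test data above):

-- JSONB: update a property without knowing any surrogate ids
UPDATE entity_jsonb
SET properties = jsonb_set(properties, '{color}', '"blue"')
WHERE name = 'Entity1';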

Now, let’s select that entity we just updated, based on its new color:

-- EAV
SELECT e.name 
FROM entity e 
  INNER JOIN entity_attribute_value eav ON e.id = eav.entity_id
  INNER JOIN entity_attribute ea ON eav.entity_attribute_id = ea.id
WHERE ea.name = 'color' AND eav.value = 'blue';

-- JSONB
SELECT name 
FROM entity_jsonb 
WHERE properties ->> 'color' = 'blue';

I think we can agree that the second is both shorter (no joins!) and more pleasing to the eye. A clear win for JSONB! Here, we use the JSON ->> operator to get the color as a text value from the JSONB object. There is also a second way to achieve the same result in the JSONB model, using the @> containment operator:

-- JSONB 
SELECT name 
FROM entity_jsonb 
WHERE properties @> '{"color": "blue"}';

This is a bit more complex: we check if the JSON object in the properties column contains the object on the right of the operator. Less readable, more performant (see later).

The simplification of using JSONB is even stronger when you need to select multiple properties at once. This is where the JSONB approach really shines: we just select the properties as extra columns in our result set, without the need for joins.

-- JSONB 
SELECT name
  , properties ->> 'color'
  , properties ->> 'country'
FROM entity_jsonb 
WHERE id = 120;

With EAV, you would need 2 joins per property you want to query.
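For comparison, a sketch of the EAV equivalent for the same two properties (the join aliases are mine):

-- EAV: two properties already require four joins
SELECT e.name
  , eav1.value AS color
  , eav2.value AS country
FROM entity e
  INNER JOIN entity_attribute_value eav1 ON e.id = eav1.entity_id
  INNER JOIN entity_attribute ea1 ON eav1.entity_attribute_id = ea1.id AND ea1.name = 'color'
  INNER JOIN entity_attribute_value eav2 ON e.id = eav2.entity_id
  INNER JOIN entity_attribute ea2 ON eav2.entity_attribute_id = ea2.id AND ea2.name = 'country'
WHERE e.id = 120;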

In my opinion, the queries above show a great simplification in database design. If you want more examples on how to query JSONB data, check out this post.

Now, let’s talk performance.

Performance

To compare performance, I used EXPLAIN ANALYSE on the queries above to see how long they take. I ran each query at least three times, because the first time the query planner needs some more time. At first, I executed the queries without any indexes. This is obviously to the advantage of JSONB, as the joins required for EAV can’t make use of index scans (the foreign key fields aren’t indexed). After that, I created an index on the 2 foreign key columns in the EAV value table, and also a GIN index on the JSONB column (sketched below).
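The index definitions would look roughly like this (the names match the size listing further down):

-- EAV: index the two foreign key columns in the value table
CREATE INDEX entity_attribute_value_entity_id_idx
  ON entity_attribute_value (entity_id);
CREATE INDEX entity_attribute_value_entity_attribute_id_idx
  ON entity_attribute_value (entity_attribute_id);

-- JSONB: GIN index on the whole properties column
CREATE INDEX entity_jsonb_properties_idx
  ON entity_jsonb USING gin (properties);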

For updating the data, this resulted in these execution times (in ms). Note that the scale is logarithmic:

[Chart: update query execution times in ms, EAV vs. JSONB, with and without indexes (log scale)]

Here, we see that JSONB is much (>50000x) faster than EAV when not using any indexes, for the reason mentioned above. When we index the foreign key columns, the difference is almost eliminated, but JSONB is still 1.3x faster than EAV. Notice that the index on the JSONB column does not have any effect here, as we don’t use the properties column in the criteria.

For selecting data based on a property value, we get the following results (normal scale):

[Chart: select query execution times in ms, EAV vs. JSONB, with and without indexes (linear scale)]

Here we can see that JSONB was again faster when EAV has no indexes, but when the index is used, EAV is the fastest. But then I noticed the times for the JSONB queries were the same, pointing me to the fact that the GIN index is not used. Apparently, when you use a GIN index on the full properties column, it only has effect when using the containment (@>) operator. I added this to the benchmark and it had a huge effect on the timing: only 0.153 ms! That’s 15000x faster than EAV, and 25000x faster than the ->> operator.

For me, that’s fast enough.

Table size

Let’s compare the sizes of both approaches. In psql we can show the size of all tables and indexes using the \dti+ command:

test-# \dti+
                                                     List of relations
                      Name                      | Type  |         Table          |  Size   |
------------------------------------------------+-------+------------------------+---------+
 entity                                         | table |                        | 730 MB  |
 entity_attribute                               | table |                        | 48 kB   |
 entity_attribute_name_idx                      | index | entity_attribute       | 16 kB   |
 entity_attribute_name_key                      | index | entity_attribute       | 16 kB   |
 entity_attribute_pkey                          | index | entity_attribute       | 16 kB   |
 entity_attribute_value                         | table |                        | 2338 MB |
 entity_attribute_value_entity_attribute_id_idx | index | entity_attribute_value | 1071 MB |
 entity_attribute_value_entity_id_idx           | index | entity_attribute_value | 1071 MB |
 entity_attribute_value_pkey                    | index | entity_attribute_value | 1071 MB |
 entity_jsonb                                   | table |                        | 1817 MB |
 entity_jsonb_pkey                              | index | entity_jsonb           | 214 MB  |
 entity_jsonb_properties_idx                    | index | entity_jsonb           | 104 MB  |
 entity_pkey                                    | index | entity                 | 214 MB  |
(13 rows)

For the EAV model, the tables add up to 3068 MB and the indexes add up to 3427 MB, resulting in a total of 6.43 GB. On the other hand, the JSONB model uses 1817 MB for the table and 318 MB for the indexes, totaling 2.08 GB. That’s 3x less. This surprised me a bit, because we store the property names in each JSONB object. But when you think about it, in EAV we store 2 integer foreign keys per attribute value, resulting in 8 bytes of extra data. Also, in EAV all property values are stored as text, while JSONB will use numeric and boolean values internally where possible, resulting in less space.

Conclusion

In general, I think storing entity properties in JSONB format can greatly simplify your database design and maintenance. If, like me, you do a lot of ad-hoc querying, having everything stored in the same table as the entity is really useful. The fact that it simplifies interacting with your data is already a plus, but the resulting database is also 3x smaller and, from my tests, the performance penalties are very limited. In some cases JSONB even performs faster than EAV, which makes it even better.

However, this benchmark does of course not cover all aspects (like entities with a very large number of properties, a large increase in the number of properties of existing data, …), so if you have any suggestions on how to improve it, please feel free to leave a comment!

Repost from:
http://coussej.github.io/2016/01/14/Replacing-EAV-with-JSONB-in-PostgreSQL/

 

 

Postgresql Point in Time Recovery, WAL and continuous archiving

In postgresql.conf, set the following archiving settings:

# Write Ahead Log section:

# WAL LEVEL
# wal_level = archive, hot_standby (9.1+)
# wal_level =  archive, hot_standby, or logical (9.4+)
# wal_level = replica or logical (9.6+)

wal_level = archive

archive_mode = on               # enables archiving; off, on, or always
                                # (change requires restart)

# archive_command to copy WAL segments to the /home/wal/ directory
# don't forget (Linux): chown postgres:postgres /home/wal

archive_command = 'test ! -f /home/wal/%f && cp %p /home/wal/%f'     # command to use to archive a logfile segment
                                # placeholders: %p = path of file to archive
                                #               %f = file name only
# Windows version:  archive_command = 'copy "%p" "C:\\server\\wal\\%f"'  # Windows

# In replication section:
max_wal_senders = 2

In pg_hba.conf, enable a user for replication.
I will use the postgres user over the Unix domain socket:

local   replication     postgres                                peer

Then, restart the PostgreSQL service.

Now, make the base backup (the starting point for recovery) using the pg_basebackup tool:

cd /home
mkdir basebackup
chown postgres:postgres basebackup/
su postgres

pg_basebackup -F t -z -x -D /home/basebackup

# pointing to the Unix domain socket directory
# using setting: unix_socket_directories = '/var/run/postgresql, /tmp'  # comma-separated list of directories
pg_basebackup -h /tmp -F t -z -x -D /home/basebackup

# Another example with host, user, requires password:
# /usr/pgsql-9.1/bin/pg_basebackup -h 127.0.0.1 -F t -z -U sisapp -P -x -D /home/basedump
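For the restore side of point-in-time recovery, the archived segments are replayed with a restore_command. A minimal sketch for a pre-10 recovery.conf, assuming the /home/wal archive directory configured above:

# recovery.conf in the restored data directory (PostgreSQL 9.x)
restore_command = 'cp /home/wal/%f %p'
# optionally stop replay at a given point in time:
# recovery_target_time = '2016-01-14 12:00:00'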

Documentation:
https://www.postgresql.org/docs/9.1/static/continuous-archiving.html
https://www.postgresql.org/docs/9.5/static/continuous-archiving.html
https://www.postgresql.org/docs/9.6/static/continuous-archiving.html

https://www.postgresql.org/docs/current/static/app-pgbasebackup.html

SQL COPY table data from CSV file in Postgresql

SQL COPY command examples:

COPY censo.clases_materiales FROM '/home/csv/clases_materiales.csv' DELIMITER ',' CSV;

To ignore the column headers:

COPY catastro_censal.aval_carac_valor FROM '/home/csv/valores-caracteristicas.csv' 
    WITH DELIMITER ';' CSV HEADER;

Define the columns to insert (skipping the PK field so Postgres uses its sequence default), also ignoring the header:

COPY catastro_censal.aval_depreciacion (espec, anios, bueno, regular, malo) 
	FROM '/home/csv/depreciacion.csv' WITH DELIMITER ';' CSV HEADER;

Using TAB character:

COPY censo.json_hclaves FROM '/home/csv/json_hclaves.csv' 
    WITH DELIMITER E'\t' CSV HEADER;
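COPY reads files on the database server itself. If the CSV file is on the client machine, psql's \copy meta-command is the usual alternative; it takes the same options but reads the file client-side. A sketch based on the first example:

\copy censo.clases_materiales FROM '/home/csv/clases_materiales.csv' DELIMITER ',' CSV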

Install Postgresql 9.5 on CentOS 6

# Get correct RPM from http://yum.postgresql.org/repopackages.php

cd /home/
mkdir installers
cd installers

wget https://download.postgresql.org/pub/repos/yum/9.5/redhat/rhel-6-x86_64/pgdg-centos95-9.5-2.noarch.rpm

rpm -i pgdg-centos95-9.5-2.noarch.rpm

# Install EPEL repos as:
# http://www.tecmint.com/how-to-enable-epel-repository-for-rhel-centos-6-5/

wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
rpm -ivh epel-release-6-8.noarch.rpm

yum install postgresql95 postgresql95-contrib postgresql95-server postgresql95-devel postgis2_95 postgis2_95-devel

service postgresql-9.5 initdb 'en_US.UTF8'
chkconfig postgresql-9.5 on

# Set a password for the postgres user (Linux and database):

passwd postgres
service postgresql-9.5 restart
su postgres
psql -c "ALTER USER postgres WITH PASSWORD 'xxxxxx'" -d template1
exit

nano /var/lib/pgsql/9.5/data/postgresql.conf

# On the #listen_addresses line, remove the # character (to enable it) and replace localhost with *
# Set the maximum number of connections (max_connections) to 200.
# Save and close the file with Ctrl+O, Enter, and Ctrl+X
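# The edited lines in postgresql.conf should end up roughly like this (example values):
#   listen_addresses = '*'      # listen on all network interfaces
#   max_connections = 200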

nano /var/lib/pgsql/9.5/data/pg_hba.conf

# In the #IPv4 local connections section, define:
host    all             all             127.0.0.1/32            password
host    all             all             0.0.0.0/0               password

service postgresql-9.5 restart

# firewall rule for 5432 port
nano /etc/sysconfig/iptables
# Then add this entry before -A INPUT -j REJECT line
-A INPUT -p tcp -m state --state NEW -m tcp --dport 5432 -j ACCEPT
# restart firewall
service iptables restart
# verify open ports:
netstat -tulpn
# Make sure iptables is allowing port 5432 connections:
iptables -L -n

# Disable SELinux enforcement
nano /etc/selinux/config
# change the SELINUX line to 
SELINUX=disabled

Deleting OID type BLOB for PostgreSQL

Specifically, deleting a row that contains a Large Object reference does not delete the Large Object. Deleting the Large Object is a separate operation that needs to be performed.

In JDBC this can be done in two steps:

1. Delete the large object (call the PostgreSQL function lo_unlink):

long oid = jdbcTemplate.queryForObject(
        "select image from InventoryItem where id = ?",
        new Object[] { id },
        new RowMapper<Long>() {
            @Override
            public Long mapRow(ResultSet rs, int row) throws SQLException {
                return rs.getLong(1);
            }
        });
jdbcTemplate.execute("select lo_unlink(" + oid + ")");

2. Delete the row in the referring table:

jdbcTemplate.update("delete from InventoryItem where id = ?", id);
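The same two steps can also be done in plain SQL, which is handy for ad-hoc cleanup (table and column names reused from the example above; the literal 1 stands for the id in question):

-- unlink the large object first, then delete the referring row
SELECT lo_unlink(image) FROM InventoryItem WHERE id = 1;
DELETE FROM InventoryItem WHERE id = 1;

The contrib extension "lo" also provides a lo_manage trigger function that unlinks large objects automatically when the referencing row is deleted or updated.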

Super Fuzzy Searching on PostgreSQL / like search indexing trigrams

Super Fuzzy Searching on PostgreSQL


I have been working on doing some fuzzy searching on PostgreSQL for a while, and thought I’d share some of my experiences here.  Many people are not yet aware of the incredible fuzzy search capabilities of PostgreSQL, especially those added in the 9.1 release.

Most people use an external engine such as Lucene to do fuzzy searches.  While, as you will see, there are still reasons to do so, they are much fewer now than they used to be.

Fuzzy Search Using Trigrams

While PostgreSQL has lots of different fuzzy search options (soundex, levenshtein, etc.), the one which has the most support is trigram searching.  What is a trigram?  Trigrams break a word up into 3-letter sequences, padding the start and end with blanks, which can then be used to do similarity matches.  Think of the word “hello”.  It has the following trigrams: “  h”, “ he”, “hel”, “ell”, “llo”, and “lo ”.  Now look at the word “hallo”.  It has the following trigrams: “  h”, “ ha”, “hal”, “all”, “llo”, and “lo ”.  You can see that there are several trigrams that match, and several that don’t match.  Similarity is computed as the number of shared trigrams divided by the number of distinct trigrams in the two words combined, so the more trigrams you have in common the closer the match.  In the present case, ‘hello’ and ‘hallo’ have a similarity of 0.333333.  You can see this in PostgreSQL by saying:

select similarity('hello', 'hallo')

PostgreSQL also has two other relevant operators that use similarity – the “<->” operator and the “%” operator.  “<->” is the “distance” operator, which is simply one minus the similarity (i.e. if the similarity is 0.2 the distance will be 0.8).  For text searching, “%” is the “similar” operator, which returns true if two strings are similar, or false if they are not.  Two strings are defined as “similar” if their similarity is 0.3 or greater.  This can be set with set_limit(), but it is not usually useful to do so (we will deal with custom similarities later).  The “%” operator is important, because of its heavy use in indexing.
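A quick sketch of these functions and operators in psql:

SELECT show_trgm('hello');            -- the padded trigrams: "  h", " he", "hel", "ell", "llo", "lo "
SELECT similarity('hello', 'hallo');  -- 0.333333
SELECT 'hello' <-> 'hallo';           -- 0.666667 (distance = 1 - similarity)
SELECT 'hello' % 'hallo';             -- true (similarity is above the default 0.3 limit)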

Installing Trigrams

The trigram module (pg_trgm) is not installed by default in PostgreSQL, but is one of their standard (and supported!) extensions.  To install, you need to make sure you build the module in the source code.  To do this, go into the contrib/pg_trgm directory of the source code and do a “make; make install”.  This will install the plugin, but will not activate it.  Then, in any database you want to use this plugin, put in the command:

CREATE EXTENSION pg_trgm;

This will load the necessary SQL to install the functions and operators needed to make trigrams work.  If you type that command in the template1 database, then all future databases created on that system will have the extension installed.

Using Trigrams Without an Index

The simplest use of trigrams would be to have a table with a name column, and do a search like this:

SELECT * FROM people WHERE lastname % 'Schmoe'

This will return a list of people where the trigram similarity is greater than the default threshold.  If you want to specify the similarity you are looking for, you can change that to:

SELECT * FROM people WHERE similarity(lastname, 'Schmoe') > 0.5

This will give a less-tolerant match than the default.

The problem with doing this is that it takes a lot of processing power.  The reason people usually use databases is because they have a huge number of records and need to be able to search them quickly.  If we have 10,000 records, manually doing trigram matches will take quite a bit of time.

So, one of the best features of PostgreSQL 9.1 is the ability for PostgreSQL to incorporate trigrams into an index.  So, to make an index to speed up our search, do:

CREATE INDEX people_lastname_trigram_idx ON people USING gist(lastname gist_trgm_ops);

This will create a “gist” index, which is a special type of index which is often used in spatial and other vector-oriented matching.  The “gist_trgm_ops” is a little bit of magic that tells gist how to use a text field as a vector of trigrams.

Now, your query will use the index, provided you have enough rows to make it worthwhile.  HOWEVER, it is not yet using the index efficiently.  While PostgreSQL can (and will) use the index for this query, it doesn’t actually check the condition until a later step (I believe, but am not sure, that it is using the index to get everything in the right order, and then at a later stage doing filtering).  For large tables, this can still be several orders of magnitude worse than you want.  To fix this, you MUST use the “%” operator.  This will cause the index to have an “index condition”, which it will use to limit the results that it puts out.  So, if you have:

SELECT * FROM people WHERE lastname % 'Schmoe';

This will use the index AND the index condition, and will be super-fast (except in extremely large datasets – see next section).  However, this uses the built-in cutoff point of 0.3 similarity.  What if I want my results to be tighter than that?  The best way is to combine % with similarity(), like this:

SELECT * FROM people WHERE lastname % 'Schmoe' AND similarity(lastname, 'Schmoe') > 0.5;

In this case, the database will use the “%” condition for the index, which will be incredibly fast, and the “similarity()” call as a post-index filter, which, since there is a much smaller dataset, will still be pretty fast.

Working With Larger Datasets

I had trouble working with trigrams because my main dataset is about 14 million titles.  At this size, trigram searching is really slow.  With the whole database loaded in-memory (either in PG’s cache or in the filesystem cache – I have many, many gigs of memory), a trigram search of the database USING THE INDEX took between 3 and 12 seconds, depending on the string.  Poking around, I came up with the following basic results for a database on an Amazon EC2 4XL instance without any other load:

  • 100,000 records: 40 milliseconds
  • 1,000,000 records: 0.4 seconds
  • 14,000,000 records: 7 seconds

So, as you can see, the useful limit to these queries is about 1 million records.  Therefore, if you have more than a million or so rows and you want to do trigrams, you need to find a way to boost performance.  I’ve found there are two basic methods you can use to speed up trigrams on large datasets:

  1. Create specialized indexes with subsets of your data
  2. Create a separate wordlist table with unique words

#1 is much preferred if you can do it, because it is an entirely database-oriented method.  Basically, what I did, is I figured out that on my queries I only needed to trigram search on about 1 million of my 14 million records.  Therefore, I specialized the index by adding a “WHERE” clause to the index itself.  On my database, I have a “type” field, and I’m only searching certain types of records.  Therefore, I changed my index to the following:

CREATE INDEX people_lastname_restricted_trigram_idx ON people USING gist(lastname gist_trgm_ops) WHERE type IN ('my_type1', 'my_type2');

This meant that only about a million records were being stored in the index, which brought down my search time considerably.  Another search that I was doing was based on the person’s “popularity”, in which case, I added another index for popularity being greater than 20.  This restricted the set down to about 100,000 records, which made it blazingly fast.
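The popularity-restricted index would look something like this (the column name and threshold are as described above; the index name is mine):

CREATE INDEX people_lastname_popular_trigram_idx ON people USING gist(lastname gist_trgm_ops) WHERE popularity > 20;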

The other alternative (#2) I mentioned is to create a separate table of words which is searched.  The idea is that while you may have 14 million records, there probably aren’t 14 million words.  Therefore, you create a table using unique words from the database, and then add a trigram index on that.  The problem, though, is that this takes either (a) a lot of manual work, or (b) a lot of triggers, none of which is the way I like to administer databases.  Nonetheless, it can work if you are merely wanting to use trigrams to fuzzy-match individual words.  It can be especially helpful in conjunction with PostgreSQL’s full-text search (previously known as tsearch2), which still operates blazingly-fast on 14 million records.  Fuzzy string matching on words can be combined with full-text search on records to present a powerful and fast method of searching.   The wordlist can be fuzzy-matched to provide better hints at what words are in the fulltext.

To create a wordlist, do something like this:

CREATE TABLE search_words (word text);
CREATE INDEX search_words_word_trigram_idx ON search_words USING gist(word gist_trgm_ops);

Then, every time you need to regenerate your wordlist, do the following:

DELETE FROM search_words;
INSERT INTO search_words (word)
SELECT DISTINCT lower(word)
FROM (SELECT unnest(regexp_split_to_array(lastname, E'\\W+')) AS word FROM people) words
WHERE word IS NOT NULL AND length(word) > 3;

What that does is explode the lastname field out into words (in case a lastname is multi-word), where each word is its own record.  It also filters out excessively short words (optional).  It then inserts these into the search_words table, which has a trigram index.  So, now, if you want to fuzzy-match individual words, you can use the search_words table!
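Querying the wordlist is then just another trigram search, for example ordered by distance (the search string is only an example):

SELECT word, similarity(word, 'schmoe') AS sml
FROM search_words
WHERE word % 'schmoe'
ORDER BY word <-> 'schmoe'
LIMIT 10;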

Anyway, additional fun can be had by using unaccent to remove accents before all of this processing, or other tricks to massage your data however you feel is best.  The important thing is that, for all but the most strenuous data sets, we can now keep all of the searching located within the database itself, managed by database tools, without having to rely on external search engines such as Lucene, which take time, management, machine power, and bring their own sets of headaches.  Even Lucene often has to be hand-tuned for large datasets.  I am a big believer in pushing as much data work to the database as possible, and PostgreSQL keeps pushing the bar higher!

 

source: http://bartlettpublishing.com/site/bartpub/blog/3/entry/350

 

Remove accents in postgresql / quitar acentos

somedb=# CREATE EXTENSION unaccent;
CREATE EXTENSION
somedb=# SELECT unaccent('Hôtel');
 unaccent
----------
 Hotel

somedb=# SELECT * FROM table WHERE lower(unaccent(table.id)) = lower('Jose');

And speed things up a bit by generating an accentless, lower-cased index:

somedb=# CREATE INDEX CONCURRENTLY ON table (lower(unaccent(id)));


I use the unaccent module for that.

unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes.

Install once per database with:

CREATE EXTENSION unaccent;

It also provides the function unaccent(), which you can use in a query like this (LIKE is not needed here):

SELECT * FROM users
WHERE  unaccent(name) = unaccent('João');

Index

To use an index for that kind of query, create an index on the expression. However, Postgres only accepts IMMUTABLE functions for functional indexes. If a function can return a different result under different circumstances, the index could silently break.

unaccent() is only STABLE, not IMMUTABLE

Unfortunately, unaccent() is only STABLE, not IMMUTABLE. According to this thread on pgsql-bugs, this is due to three reasons:

  1. It depends on the behavior of a dictionary.
  2. There is no hard-wired connection to this dictionary.
  3. It therefore also depends on the current search_path, which can change easily.

Some tutorials on the web instruct to just alter the function and declare it IMMUTABLE. This is a brute-force method that might break under rare circumstances.

Others suggest a simple IMMUTABLE wrapper function, like I did myself in the past.

There is an ongoing debate about whether to make the two-parameter variant, which declares the used dictionary explicitly, IMMUTABLE.

Another alternative would be this module with an IMMUTABLE unaccent() function by Musicbrainz, provided on Github. Haven’t tested it myself. I think I have come up with a better idea:

Best for now

Putting all of this together, I propose this mostly safe approach, which is just as efficient as other solutions floating around, but safer: create a simple wrapper function with the two-parameter variant that hard-codes the correct dictionary as well as the right search_path:

CREATE OR REPLACE FUNCTION f_unaccent(text)
  RETURNS text AS
$func$
SELECT unaccent('unaccent', $1)
$func$ LANGUAGE sql IMMUTABLE SET search_path = public, pg_temp;

Set the search_path to the schema where you install your extensions (default is public).
Why the dangling pg_temp? To rule out temporary objects coming first. More in the manual.
You can build a functional index using that, since it is declared IMMUTABLE.

CREATE INDEX users_unaccent_name_idx ON users (f_unaccent(name));

Adapt your query to use the index:

SELECT * FROM users
WHERE  f_unaccent(name) = f_unaccent('João');

Pattern matching

If you want to use it with LIKE (and a pattern that is not left-anchored), you can combine this with the module pg_trgm in PostgreSQL 9.1 or later. Create a functional GIN or GIST index. Example for GIN:

CREATE INDEX users_unaccent_name_trgm_idx ON users
USING gin (f_unaccent(name) gin_trgm_ops);

Be aware that GIN and GIST indexes are somewhat more expensive to maintain. The index would be used in a query like:

SELECT * FROM users
WHERE  f_unaccent(name) LIKE ('%' || f_unaccent('João') || '%');

I have written a more detailed answer about pattern matching and performance in a recent answer on dba.SE.

pg_trgm also provides very useful operators for “similarity” (%) and “distance” (<->).

Postgresql 9.1 trigram for like and ilike patterns performance using indexes

POSTGRESQL 9.1 TRIGRAMS TEACHING LIKE AND ILIKE NEW TRICKS
