Remove accents in postgresql / quitar acentos

somedb=#CREATE EXTENSION unaccent;CREATE EXTENSION
somedb=#SELECT unaccent('Hôtel');
 unaccent
----------
 Hotel

somedb=#SELECT * FROM table WHERE lower(unaccent(table.id))= lower('Jose');

And speed things up a bit by generating an accentless, lower-cased index:

somedb=#CREATE INDEX CONCURRENTLY ON table(lower(unaccent(id)));


I use the unaccent module for that – which is something different entirely than what you are linking to.

unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes.

Install once per database with:

CREATE EXTENSION unaccent;

It also provides the function unaccent() you could use with your example (where LIKE seems not needed).

SELECT*FROM   users
WHERE  unaccent(name)= unaccent('João');

Index

To use an index for that kind of query, create an index on the expression. However, Postgres only accepts IMMUTABLE functions for functional indexes. If a function can return a different result under different circumstances, the index could silently break.

unaccent() only STABLE not IMMUTABLE

Unfortunately, unaccent() is only STABLE, not IMMUTABLE. According to this thread on pgsql-bugs, this is due to three reasons:

  1. It depends on the behavior of a dictionary.
  2. There is no hard-wired connection to this dictionary.
  3. It therefore also depends on the current search_path, which can change easily.

Some tutorials on the web instruct to just alter the function and declare it IMMUTABLE. This is a brute-force method that might break under rare circumstances.

Others suggest a simple IMMUTABLE wrapper function, like I did myself in the past.

There is an ongoing debate whether to make the variant with two parameters IMMUTABLE which declares the used dictionary explicitly. Read here or here.

Another alternative would be this module with an IMMUTABLE unaccent() function by Musicbrainz, provided on Github. Haven’t tested it myself. I think I have come up with a better idea:

Best for now

All of this put together I propose this mostly safe approach that is just as efficient as other solutions floating around, but safer: Create a simple wrapper function with the two-parameter variant that hard-codes the correct dictionary as well as the right search_path:

CREATEOR REPLACE FUNCTION f_unaccent(text)
  RETURNS text AS$func$SELECT unaccent('unaccent',$1)$func$  LANGUAGE sql IMMUTABLE SET search_path =public, pg_temp;

Set the search_path to the schema where you install your extensions (default is public).
Why the dangling pg_temp? To rule out temporary objects coming first. More in the manual here.
You can build a functional index using that, since it is declared IMMUTABLE.

CREATEINDEX users_unaccent_name_idx ON users(f_unaccent(name));

Adapt your query to use the index:

SELECT*FROM   users
WHERE  f_unaccent(name)= f_unaccent('João');

Pattern matching

If you want to use it with LIKE (and a pattern that is not left-anchored), you can combine this with the module pg_tgrm in PostgreSQL 9.1 or later. Create a functional GIN or GIST index. Example for GIN:

CREATEINDEX users_unaccent_name_trgm_idx ON users
USING gin (f_unaccent(name) gin_trgm_ops);

Be aware that GIN and GIST indexes are somewhat more expensive to maintain. Would be used in a query like:

SELECT*FROM   users
WHERE  f_unaccent(name)LIKE('%'|| f_unaccent('João')||'%');

I have written a more detailed answer about pattern matching and performance in a recent answer on dba.SE.

pg_tgrm also provides very useful operators for “similarity” % and “distance” <->.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s