PostgreSQL full text search tokenizer
Just ran into an issue. I'm trying to set up full text search on localized content (Russian in particular). The problem is that the default configuration (as well as a custom one) does not handle letter case. Example:
select * from to_tsvector('test_russian', 'На рынке появились новые рублевые облигации');
> 'На':1 'новые':4 'облигации':6 'появились':3 'рублевые':5 'рынке':2
'На' is a stopword and should have been removed, but it is not even lowercased in the resulting vector. If I pass an already lowercased string, it works properly:
select * from to_tsvector('test_russian', 'на рынке появились новые рублевые облигации');
> 'новые':4 'облигации':6 'появились':3 'рублевые':5 'рынке':2
Sure, I could pass pre-lowercased strings, but the manual says:
The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words.
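That behavior can be checked for a single token with ts_lexize, which runs one dictionary against one token. A minimal sketch, assuming a dictionary named russian_simple built from the simple template with a Russian stopword file (on a correctly initialized cluster the token should be lowercased and then rejected as a stopword, yielding an empty array):

```sql
-- Run a single dictionary against a single token.
-- An empty array '{}' means the token was recognized as a stopword;
-- NULL would mean the dictionary does not recognize the token at all.
select ts_lexize('russian_simple', 'На');
```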
My config test_russian looks like this:
create text search configuration test_russian (copy = 'russian');

create text search dictionary russian_simple (
    template = pg_catalog.simple,
    stopwords = russian
);

create text search dictionary russian_snowball (
    template = snowball,
    language = russian,
    stopwords = russian
);

alter text search configuration test_russian
    alter mapping for word with russian_simple, russian_snowball;
But I get the same results as with the built-in russian config.
I tried ts_debug, and the tokens are treated as word, as expected.
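For reference, a ts_debug call like the following shows, for each token, its type and which dictionary (if any) produced its lexemes; the exact output depends on the cluster setup:

```sql
-- Inspect how the test_russian configuration processes each token.
select alias, token, dictionaries, dictionary, lexemes
from ts_debug('test_russian', 'На рынке появились новые рублевые облигации');
```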
Any ideas?
Problem solved. The reason was that the database cluster had been initialized with the default ("C") ctype and collate. I used
initdb --locale=utf-8 --lc-collate=utf-8 --encoding=utf-8 -U pgsql *pgsql data dir*
to recreate the instance, and
create database "scratch"
    owner "postgres"
    encoding 'utf8'
    lc_collate = 'ru_ru.utf-8'
    lc_ctype = 'ru_ru.utf-8';
to recreate the database. Now the simple dictionary works.
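To verify which collate and ctype a database actually ended up with, the settings can be read back from the pg_database catalog (or, for the current database, with SHOW):

```sql
-- Per-database locale settings for the whole cluster.
select datname, datcollate, datctype from pg_database;

-- Ctype of the current database only.
show lc_ctype;
```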