PostgreSQL full text search tokenizer
I've just run into an issue. I'm trying to set up full text search on localized content (Russian in particular). The problem is that the default configuration (as well as a custom one) does not handle letter case. Example:
```sql
SELECT * FROM to_tsvector('test_russian', 'На рынке появились новые рублевые облигации');
```

> 'На':1 'новые':4 'облигации':6 'появились':3 'рублевые':5 'рынке':2

'На' is a stopword and should be removed, but instead it is not even lowercased in the resulting vector. If I pass a lowercased string, it works properly:
```sql
SELECT * FROM to_tsvector('test_russian', 'на рынке появились новые рублевые облигации');
```

> 'новые':4 'облигации':6 'появились':3 'рублевые':5 'рынке':2

Sure, I can pass pre-lowercased strings, but the manual says:
> The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words.
My config test_russian looks like this:
```sql
CREATE TEXT SEARCH CONFIGURATION test_russian (COPY = 'russian');

CREATE TEXT SEARCH DICTIONARY russian_simple (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = russian
);

CREATE TEXT SEARCH DICTIONARY russian_snowball (
    TEMPLATE = snowball,
    LANGUAGE = russian,
    STOPWORDS = russian
);

ALTER TEXT SEARCH CONFIGURATION test_russian
    ALTER MAPPING FOR word WITH russian_simple, russian_snowball;
```

But I get the same results as with the built-in russian config.
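For what it's worth, the dictionaries can also be probed in isolation with `ts_lexize`, which bypasses the parser entirely. A diagnostic sketch (using the `russian_simple` dictionary defined above):

```sql
-- If lowercasing and stopword removal work, an uppercase stopword
-- should come back as an empty array, and a regular word should come
-- back lowercased.
SELECT ts_lexize('russian_simple', 'На');     -- should be {} once fixed
SELECT ts_lexize('russian_simple', 'Рынке');  -- should be {рынке} once fixed
```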
I tried ts_debug, and the tokens are classified as word, as expected.
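For reference, the classification can be inspected like this (a diagnostic sketch against the `test_russian` configuration above):

```sql
-- Shows, per token, the parser's type alias and which dictionary
-- (if any) produced lexemes for it.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('test_russian', 'На рынке появились новые рублевые облигации');
```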
Any ideas?
Problem solved. The reason was that the database had been initialized with the default ("C") ctype and collate. I used
```shell
initdb --locale=utf-8 --lc-collate=utf-8 --encoding=utf-8 -U pgsql *pgsql data dir*
```

to recreate the instance, and
```sql
CREATE DATABASE "scratch" OWNER "postgres" ENCODING 'utf8'
    LC_COLLATE = 'ru_RU.UTF-8' LC_CTYPE = 'ru_RU.UTF-8';
```

to recreate the database, and now the simple dictionary works.
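The locale settings can be verified afterwards by querying the system catalog (database name as in the example above):

```sql
-- Confirms the database was created with the intended collation and
-- ctype, rather than the cluster's "C" defaults.
SELECT datname, datcollate, datctype
FROM pg_database
WHERE datname = 'scratch';
```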