PostgreSQL full text search tokenizer
I've just run into an issue. I'm trying to set up full text search on localized content (Russian in particular). The problem is that the default configuration (as well as a custom one) does not handle letter case. Example:
```sql
SELECT * FROM to_tsvector('test_russian', 'На рынке появились новые рублевые облигации');
```

> 'На':1 'новые':4 'облигации':6 'появились':3 'рублевые':5 'рынке':2

'На' is a stopword and should be removed, but instead it is not even lowercased in the resulting vector. If I pass a lowercased string, it works properly:
```sql
SELECT * FROM to_tsvector('test_russian', 'на рынке появились новые рублевые облигации');
```

> 'новые':4 'облигации':6 'появились':3 'рублевые':5 'рынке':2

Sure, I can pass pre-lowercased strings, but the manual says:
> The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words.
My config test_russian looks like this:
```sql
CREATE TEXT SEARCH CONFIGURATION test_russian (COPY = 'russian');

CREATE TEXT SEARCH DICTIONARY russian_simple (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = russian
);

CREATE TEXT SEARCH DICTIONARY russian_snowball (
    TEMPLATE = snowball,
    LANGUAGE = russian,
    STOPWORDS = russian
);

ALTER TEXT SEARCH CONFIGURATION test_russian
    ALTER MAPPING FOR word WITH russian_simple, russian_snowball;
```

But I get the same results as with the built-in russian config.
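For what it's worth, the dictionaries can also be probed in isolation with `ts_lexize`, which bypasses the parser entirely. A diagnostic sketch (using the `russian_simple` dictionary defined above):

```sql
-- If lowercasing and stopword removal work, an uppercase stopword
-- should come back as an empty array, and a regular word should come
-- back lowercased.
SELECT ts_lexize('russian_simple', 'На');     -- should be {} once fixed
SELECT ts_lexize('russian_simple', 'Рынке');  -- should be {рынке} once fixed
```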
I tried ts_debug, and the tokens are classified as word, as expected.
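For reference, the classification can be inspected like this (a diagnostic sketch against the `test_russian` configuration above):

```sql
-- Shows, per token, the parser's type alias and which dictionary
-- (if any) produced lexemes for it.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('test_russian', 'На рынке появились новые рублевые облигации');
```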
Any ideas?
Problem solved. The reason was that the database had been initialized with the default ("C") ctype and collate. I used
```shell
initdb --locale=utf-8 --lc-collate=utf-8 --encoding=utf-8 -U pgsql *pgsql data dir*
```

to recreate the instance, and
```sql
CREATE DATABASE "scratch" OWNER "postgres" ENCODING 'utf8'
    LC_COLLATE = 'ru_RU.UTF-8' LC_CTYPE = 'ru_RU.UTF-8';
```

to recreate the database, and now the simple dictionary works.
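The locale settings can be verified afterwards by querying the system catalog (database name as in the example above):

```sql
-- Confirms the database was created with the intended collation and
-- ctype, rather than the cluster's "C" defaults.
SELECT datname, datcollate, datctype
FROM pg_database
WHERE datname = 'scratch';
```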