numpy - Term document matrix and cosine similarity in Python


I have the following situation that I want to address using Python (preferably using numpy and scipy):

  1. I have a collection of documents that I want to convert to a sparse term-document matrix.
  2. I want to extract the sparse vector representation of each document (i.e. a row in the matrix) and find the top 10 most similar documents using cosine similarity within a subset of documents (the documents are labelled with categories, and I want to find similar documents within the same category).

How can I achieve this in Python? I know I can use scipy.sparse.coo_matrix to represent documents as sparse vectors and take the dot product to find the cosine similarity, but how do I convert the entire corpus to a large sparse term-document matrix (so that I can extract its rows as scipy.sparse.coo_matrix row vectors)?

Thanks.

May I recommend you take a look at scikit-learn? It is a well-regarded library in the Python community with a simple and consistent API. They have also implemented a cosine similarity metric. Here is an example, taken from here, of how it could be done in 3 lines of code:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vect = TfidfVectorizer(min_df=1)
>>> tfidf = vect.fit_transform(["I'd like an apple",
...                             "An apple a day keeps the doctor away",
...                             "Never compare an apple to an orange",
...                             "I prefer scikit-learn to Orange"])
>>> (tfidf * tfidf.T).A
array([[ 1.        ,  0.25082859,  0.39482963,  0.        ],
       [ 0.25082859,  1.        ,  0.22057609,  0.        ],
       [ 0.39482963,  0.22057609,  1.        ,  0.26264139],
       [ 0.        ,  0.        ,  0.26264139,  1.        ]])
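
The example above computes every pairwise similarity, whereas the question asks for the top 10 similar documents within the same category. A minimal sketch of one way to do that follows. The function name `top_similar_in_category`, the `labels` list and its category names are made up for illustration, and it relies on `TfidfVectorizer` L2-normalising each row by default, so that a dot product between two rows is their cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_similar_in_category(docs, labels, query_index, k=10):
    """Return (doc_index, cosine_similarity) pairs for the k documents most
    similar to docs[query_index] among the documents sharing its label.

    Hypothetical helper, not part of scikit-learn.
    """
    # Sparse CSR matrix with one row per document. Rows are L2-normalised
    # by default (norm="l2"), so row dot products are cosine similarities.
    tfidf = TfidfVectorizer().fit_transform(docs)

    labels = np.asarray(labels)
    # Indices of the *other* documents in the query's category.
    candidates = np.flatnonzero(labels == labels[query_index])
    candidates = candidates[candidates != query_index]

    # (n_candidates x n_terms) @ (n_terms x 1) -> one similarity per candidate.
    # Only this small result is converted to a dense array.
    sims = (tfidf[candidates] @ tfidf[query_index].T).toarray().ravel()

    # Sort by decreasing similarity and keep at most k entries.
    order = np.argsort(-sims)[:k]
    return [(int(candidates[i]), float(sims[i])) for i in order]


docs = ["I'd like an apple",
        "An apple a day keeps the doctor away",
        "Never compare an apple to an orange",
        "I prefer scikit-learn to Orange"]
labels = ["fruit", "fruit", "fruit", "software"]  # made-up categories

result = top_similar_in_category(docs, labels, query_index=0, k=10)
print(result)
```

Restricting the product to the candidate rows avoids building the full document-by-document similarity matrix, which matters once the corpus is large. If a plain term-count matrix is wanted instead of TF-IDF weights, scikit-learn's `CountVectorizer` also returns a scipy.sparse matrix, but its rows are not normalised, so they would need to be divided by their norms before the dot product gives a cosine similarity.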
