R - Function `dist` not behaving as expected on vectors with missing values


Edit: I think, from the discussion below with @joran, that @joran helped me figure out how dist is altering the distance value (it appears to be scaling the sum of squares of the coordinates by the value [total dimensions]/[non-missing dimensions], but that is a guess). What I'd like to know, if anyone does know, is: is that really what is going on? If so, why is that considered a reasonable thing to do? Can there, or should there, be options for dist to compute it the way I proposed (that question might be too vague or too opinionated to answer, though)?

I am wondering how the dist function works on vectors that have missing values. Below is a recreated example. I use the dist function and a more fundamental implementation of what I believe should be the definition of Euclidean distance, built from sqrt, sum, and powers. I also expected that if a component of either vector is NA, that dimension would be thrown out of the sum, which is how I implemented it. But you can see that this definition doesn't agree with dist.

I will be using my basic implementation to handle NA values, but I am wondering how dist is actually arriving at a value when the vectors have NA, and why it doesn't agree with how I calculate it below. I would think that my basic implementation should be the default/common one, and I can't figure out what alternate method dist is using to get the value it returns.

Thanks, Matt

    v1 <- c(1, 1, 1)
    v2 <- c(1, 2, 3)
    v3 <- c(1, NA, 3)

    # They agree on vectors with non-missing components
    # --------------------------------------------
    dist(rbind(v1, v2))
    #          v1
    # v2 2.236068

    sqrt(sum((v1 - v2)^2, na.rm = TRUE))
    # [1] 2.236068

    # But they don't agree when there is a missing component
    # Under what logic does sqrt(6) make sense as the answer from dist?
    # --------------------------------------------
    dist(rbind(v1, v3))
    #         v1
    # v3 2.44949

    sqrt(sum((v1 - v3)^2, na.rm = TRUE))
    # [1] 2

Yes, the scaling happens just as you described. Maybe this is a better example:

    set.seed(123)
    v1 <- sample(c(1:3, NA), 100, TRUE)
    v2 <- sample(c(1:3, NA), 100, TRUE)

    dist(rbind(v1, v2))
    #          v1
    # v2 12.24745

    # Dimensions where either vector is missing
    na.idx <- is.na(v1) | is.na(v2)

    v1a <- v1[!na.idx]
    v2a <- v2[!na.idx]

    # Sum of squares over the complete dimensions, scaled up by total / non-missing
    sqrt(sum((v1a - v2a)^2) * length(v1) / length(v1a))
    # [1] 12.24745

The scaling makes sense to me. All other things being equal, the distance increases as the number of dimensions increases. If you have an NA somewhere in dimension i, a reasonable guess for the contribution of dimension i to the squared sum is the mean contribution of all the other dimensions. Hence the linear up-scaling.
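Written out as a formula, the scaling looks like the following. The notation is my own, not taken from the R sources: n is the total number of dimensions, S is the set of dimensions where both vectors are non-missing, and m = |S|. The documentation for dist only states that the sum is scaled up proportionally to the number of columns used, so treat this as a sketch of that statement.

```latex
d(x, y) = \sqrt{\frac{n}{m} \sum_{i \in S} (x_i - y_i)^2}

% Worked for v1 = (1, 1, 1) and v3 = (1, NA, 3): n = 3, S = \{1, 3\}, m = 2
d(v_1, v_3) = \sqrt{\frac{3}{2}\left[(1 - 1)^2 + (1 - 3)^2\right]}
            = \sqrt{\frac{3}{2} \cdot 4}
            = \sqrt{6} \approx 2.44949
```

Dropping the factor n/m gives sqrt(4) = 2, which is the value from the question's own calculation.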

You, on the other hand, are suggesting that when you find an NA in dimension i, that dimension should not contribute to the squared sum at all. That is like assuming that v1[i] == v2[i], which is a totally different assumption.

To summarize, dist is doing some type of maximum-likelihood estimation, while your suggestion is more like a worst (or best) case scenario.
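As far as I know, dist has no argument that switches the rescaling off, so if you want the unscaled version for a whole matrix you would have to write it yourself. Below is a minimal sketch, assuming the rows of the matrix are the vectors to compare; the name dist_drop_na is hypothetical and not part of base R.

```r
# Hypothetical helper, not part of base R: pairwise Euclidean distance that
# drops dimensions with an NA in either row and does NOT rescale the sum.
# Caveat: if two rows share no complete dimension, this returns 0, not NA.
dist_drop_na <- function(m) {
  n <- nrow(m)
  out <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      out[i, j] <- sqrt(sum((m[i, ] - m[j, ])^2, na.rm = TRUE))
    }
  }
  as.dist(out)  # same lower-triangle layout that dist() returns
}

v1 <- c(1, 1, 1)
v3 <- c(1, NA, 3)
m  <- rbind(v1, v3)

unscaled <- dist_drop_na(m)  # NA dimension dropped, no rescaling
scaled   <- dist(m)          # base R: sum scaled by total / non-missing
```

On this example the two results differ by exactly the factor sqrt(3/2), which is the scaling discussed above. The double loop is fine for small inputs but will be slow for matrices with many rows.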

