r - Function `dist` not behaving as expected on vectors with missing values
Edit: I think, after the discussion below with @joran, that @joran helped me figure out how `dist` is altering the distance value (it appears to be scaling the sum of squared coordinate differences by the value [total dimensions]/[non-missing dimensions], but that is a guess). What I'd like to know, if anyone knows, is: is that really what is going on? If so, why is that considered a reasonable thing to do? Can there be, or should there be, options to `dist` to compute it the way I proposed (that question might be too vague or of too opinionated a nature to answer, though)?
I was wondering how the `dist` function works on vectors that have missing values. Below is a recreated example. I use the `dist` function and a more fundamental implementation of what I believe should be the definition of Euclidean distance, built from `sqrt`, `sum`, and powers. I expected that if a component of either vector is `NA`, that dimension would be thrown out of the sum, which is how I implemented it. But you can see that this definition doesn't agree with `dist`.
I will be using my basic implementation to handle `NA` values, but I was wondering how `dist` is actually arriving at a value when the vectors have `NA`, and why it doesn't agree with how I calculate it below. I would think my basic implementation should be the default/common one, and I can't figure out what alternate method `dist` is using to get the value I am getting.
Thanks, Matt
    v1 <- c(1, 1, 1)
    v2 <- c(1, 2, 3)
    v3 <- c(1, NA, 3)

    # Agree on vectors with non-missing components
    # --------------------------------------------
    dist(rbind(v1, v2))
    #          v1
    # v2 2.236068
    sqrt(sum((v1 - v2)^2, na.rm = TRUE))
    # [1] 2.236068

    # Don't agree when there is a missing component
    # Under what logic does sqrt(6) make sense as the answer from dist?
    # --------------------------------------------
    dist(rbind(v1, v3))
    #         v1
    # v3 2.44949
    sqrt(sum((v1 - v3)^2, na.rm = TRUE))
    # [1] 2
Yes, the scaling happens as you described. Maybe this is a better example:
    set.seed(123)
    v1 <- sample(c(1:3, NA), 100, TRUE)
    v2 <- sample(c(1:3, NA), 100, TRUE)

    dist(rbind(v1, v2))
    #          v1
    # v2 12.24745

    # Keep only the dimensions observed in both vectors
    na.idx <- is.na(v1) | is.na(v2)
    v1a <- v1[!na.idx]
    v2a <- v2[!na.idx]

    # Scale the squared sum by [total dimensions]/[non-missing dimensions]
    sqrt(sum((v1a - v2a)^2) * length(v1) / length(v1a))
    # [1] 12.24745
The scaling makes sense to me. All other things being equal, the distance increases as the number of dimensions increases. If somewhere you have an `NA` in dimension `i`, a reasonable guess for the contribution of dimension `i` to the squared sum is the mean contribution of the other dimensions. Hence the linear up-scaling.
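To make the scaling concrete, here is a minimal sketch of it as a standalone function. The name `scaled_euclid` is hypothetical (it is not part of base R), and the function only mirrors the behaviour observed above for two plain numeric vectors of equal length; it is not a claim about how `dist` is implemented internally.

```r
# Hypothetical helper (not part of base R): Euclidean distance that skips
# dimensions with an NA in either vector and rescales the squared sum by
# [total dimensions]/[non-missing dimensions].
scaled_euclid <- function(x, y) {
  ok <- !(is.na(x) | is.na(y))      # dimensions observed in both vectors
  if (!any(ok)) return(NA_real_)    # nothing left to compare
  ss <- sum((x[ok] - y[ok])^2)      # squared sum over observed dimensions
  sqrt(ss * length(x) / sum(ok))    # linear up-scaling, then square root
}

v1 <- c(1, 1, 1)
v3 <- c(1, NA, 3)

scaled_euclid(v1, v3)               # sqrt(4 * 3 / 2) = sqrt(6)
as.numeric(dist(rbind(v1, v3)))     # dist() value for the same pair
```

Multiplying by `length(x) / sum(ok)` is the same as adding the mean observed contribution once for every missing dimension, which is the "reasonable guess" described above.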
You, on the other hand, are suggesting that when we find an `NA` in dimension `i`, that dimension should not contribute to the squared sum. That is like assuming `v1[i] == v2[i]`, which is totally different.
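A small sketch of that equivalence, using the vectors from the question. The variable names `drop_na`, `v3_filled`, and `fill_equal` are made up for illustration.

```r
v1 <- c(1, 1, 1)
v3 <- c(1, NA, 3)

# Approach from the question: drop NA dimensions from the sum.
drop_na <- sqrt(sum((v1 - v3)^2, na.rm = TRUE))

# Equivalent view: fill each missing coordinate with the other vector's
# value, so that dimension contributes exactly zero to the squared sum.
v3_filled  <- ifelse(is.na(v3), v1, v3)
fill_equal <- sqrt(sum((v1 - v3_filled)^2))

drop_na     # 2
fill_equal  # 2
```

Both give the same number because a dropped term and a zero term add the same amount to the sum.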
To summarize, `dist` is doing some type of maximum-likelihood estimation, while your suggestion is more like a worst (or best) case scenario.
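If the unscaled version is what you want for a whole matrix, one possible sketch is to undo the scaling after calling `dist`. This assumes the scaling factor is exactly [total dimensions]/[non-missing dimensions] as observed above; the names `m`, `obs`, `n_used`, and `unscaled` are illustrative only.

```r
# Rows are the vectors to compare.
m <- rbind(v1 = c(1, 1, 1),
           v2 = c(1, 2, 3),
           v3 = c(1, NA, 3))

obs    <- ifelse(is.na(m), 0, 1)   # 1 where a value is observed, 0 where NA
n_used <- tcrossprod(obs)          # jointly observed dimensions per row pair

d <- dist(m)                       # scaled Euclidean distances

# Multiply each distance by sqrt(non-missing / total) to remove the scaling.
# as.dist() takes the lower triangle in the same order that dist() uses.
unscaled <- d * sqrt(as.numeric(as.dist(n_used)) / ncol(m))
unscaled
```

For the pair `v1`, `v3` this turns `sqrt(6)` back into `2`, matching the `na.rm = TRUE` calculation from the question.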