R: grouping by condition in data.table


In R, I have a large data.table. For every row, I want to count the rows with a similar value of x1 (within +/- some tolerance, tol). I can get this to work using adply, but it's too slow. It seems like the sort of thing data.table would be good for; in fact, I'm already using a data.table for part of the computation.

Is there a way to do this entirely with data.table? Here is an example:

    library(data.table)
    library(plyr)

    my.df = data.table(x1 = 1:1000,
                       x2 = 4:1003)
    tol = 3

    # for each row, count the rows whose x1 lies strictly within +/- tol
    adply(my.df, 1, function(df) my.df[x1 > (df$x1 - tol) & x1 < (df$x1 + tol), .N])

Results:

            x1   x2 V1
       1:    1    4  3
       2:    2    5  4
       3:    3    6  5
       4:    4    7  5
       5:    5    8  5
      ---
     996:  996  999  5
     997:  997 1000  5
     998:  998 1001  5
     999:  999 1002  4
    1000: 1000 1003  3

Update:

Here's a sample dataset that is a little closer to my real data:

    set.seed(10)
    x = seq(1, 100000000, 100000)
    x = x + sample(1:50000, length(x), replace = TRUE)
    x2 = x + sample(1:50000, length(x), replace = TRUE)
    my.df = data.table(x1 = x,
                       x2 = x2)
    setkey(my.df, x1)
    tol = 100000

    og = function(my.df) {
      adply(my.df, 1, function(df) my.df[x1 > (df$x1 - tol) & x1 < (df$x1 + tol), .N])
    }

    # ed() is defined in the answer below; ar() is @Arun's solution (not shown here)
    microbenchmark(r_ed <- ed(copy(my.df)),
                   r_ar <- ar(copy(my.df)),
                   r_og <- og(copy(my.df)),
                   times = 1)

    Unit: milliseconds
                        expr         min          lq      median          uq         max neval
     r_ed <- ed(copy(my.df))    8.553137    8.553137    8.553137    8.553137    8.553137     1
     r_ar <- ar(copy(my.df))   10.229438   10.229438   10.229438   10.229438   10.229438     1
     r_og <- og(copy(my.df)) 1424.472844 1424.472844 1424.472844 1424.472844 1424.472844     1

Obviously, the solutions from both @eddi and @Arun are much faster than mine. Now I have to try to understand rolls.

Here's a faster data.table solution. The idea is to use the rolling merge functionality of data.table, but before we do that we need to modify the data slightly and make column x1 numeric instead of integer. This is because the OP is using strict inequality, and to use rolling joins with that we have to decrease the tolerance by a tiny amount, which makes it a floating point number.
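As a toy illustration of what the two rolls return (the table `toy` and its values are made up for this sketch, not taken from the OP's data), one can ask a keyed table for the row index matching a value that falls in a gap. This sketch uses `which = TRUE` to get row numbers directly, which is a different spelling from the `.I` approach used in the solution itself:

```r
library(data.table)

# hypothetical toy table, keyed (and therefore sorted) on x1
toy <- data.table(x1 = c(10, 20, 30, 40))
setkey(toy, x1)

# 25 is not in the table, so a plain join would return NA.
# roll = Inf carries the previous observation forward: 25 matches the row holding 20.
prev_idx <- toy[J(25), which = TRUE, roll = Inf]

# roll = -Inf carries the next observation backward: 25 matches the row holding 30.
next_idx <- toy[J(25), which = TRUE, roll = -Inf]

prev_idx  # 2
next_idx  # 3
```

Rolling the upper bound forward and the lower bound backward therefore gives the last and first row inside the window, and the difference of the two indices plus one is the count.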

    my.df[, x1 := as.numeric(x1)]

    # set the key to x1 for the merges and to sort
    # (note: if the data is already sorted, this step can be made instantaneous using setattr)
    setkey(my.df, x1)

    # and now we're going to do two rolling merges, one with the upper bound
    # and one with the lower, then get the index of the match and subtract the ends
    # (+1, to get the count)
    my.df[, res := my.df[J(x1 + tol - 1e-6), list(ind = .I), roll = Inf]$ind -
                   my.df[J(x1 - tol + 1e-6), list(ind = .I), roll = -Inf]$ind + 1]

    # and here's the bench vs @Arun's solution
    ed = function(my.df) {
      my.df[, x1 := as.numeric(x1)]
      setkey(my.df, x1)
      my.df[, res := my.df[J(x1 + tol - 1e-6), list(ind = .I), roll = Inf]$ind -
                     my.df[J(x1 - tol + 1e-6), list(ind = .I), roll = -Inf]$ind + 1]
    }

    microbenchmark(ed(copy(my.df)), ar(copy(my.df)))
    #Unit: milliseconds
    #            expr       min       lq   median       uq      max neval
    # ed(copy(my.df))  7.297928 10.09947 10.87561 11.80083 23.05907   100
    # ar(copy(my.df)) 10.825521 15.38151 16.36115 18.15350 21.98761   100

Note: as both Arun and Matthew pointed out, if x1 is integer, one doesn't have to convert it to numeric and subtract a small amount from tol; one can use tol - 1L instead of tol - 1e-6 above.
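A minimal sketch of that integer variant, run on the small example from the question. It assumes `x1` and `tol` are both integer, and it fetches the matched row numbers with `which = TRUE` rather than `.I`; the names `dt`, `hi` and `lo`, and the `mult` arguments (added so that duplicated `x1` values would still resolve to a single row per lookup) are choices made for this sketch, not part of the original answer:

```r
library(data.table)

dt <- data.table(x1 = 1:1000, x2 = 4:1003)
tol <- 3L                      # integer tolerance, so x1 +/- tol stays integer
setkey(dt, x1)

# last row with x1 <= x1 + tol - 1, i.e. strictly below x1 + tol
hi <- dt[J(dt$x1 + tol - 1L), which = TRUE, roll = Inf, mult = "last"]

# first row with x1 >= x1 - tol + 1, i.e. strictly above x1 - tol
lo <- dt[J(dt$x1 - tol + 1L), which = TRUE, roll = -Inf, mult = "first"]

# number of rows between the two ends, inclusive
dt[, res := hi - lo + 1L]

head(dt$res, 5)  # 3 4 5 5 5
tail(dt$res, 5)  # 5 5 5 4 3
```

The counts agree with the V1 column that the adply version produces for the same data.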

