R grouping by condition in data.table
In R, I have a large data.table. For every row, I want to count the rows with a similar value of x1 (+/- some tolerance, tol). I can get this to work using adply, but it's slow. It seems like the sort of thing data.table would be good for; in fact, I'm already using data.table for part of the computation.
Is there a way to do this entirely with data.table? Here is an example:
    library(data.table)
    library(plyr)

    my.df = data.table(x1 = 1:1000, x2 = 4:1003)
    tol = 3

    adply(my.df, 1, function(df) my.df[x1 > (df$x1 - tol) & x1 < (df$x1 + tol), .N])

Results:

            x1   x2 V1
       1:    1    4  3
       2:    2    5  4
       3:    3    6  5
       4:    4    7  5
       5:    5    8  5
      ---
     996:  996  999  5
     997:  997 1000  5
     998:  998 1001  5
     999:  999 1002  4
    1000: 1000 1003  3

Update:
Here's a sample dataset that is a little closer to my real data:
    library(microbenchmark)

    set.seed(10)
    x = seq(1, 100000000, 100000)
    x = x + sample(1:50000, length(x), replace = TRUE)
    x2 = x + sample(1:50000, length(x), replace = TRUE)
    my.df = data.table(x1 = x, x2 = x2)
    setkey(my.df, x1)
    tol = 100000

    # my original adply approach
    og = function(my.df) {
      adply(my.df, 1, function(df) my.df[x1 > (df$x1 - tol) & x1 < (df$x1 + tol), .N])
    }

    # ed() and ar() are the functions from the answers by @eddi and @Arun
    microbenchmark(r_ed <- ed(copy(my.df)),
                   r_ar <- ar(copy(my.df)),
                   r_og <- og(copy(my.df)),
                   times = 1)

    Unit: milliseconds
                        expr         min          lq      median          uq         max neval
     r_ed <- ed(copy(my.df))    8.553137    8.553137    8.553137    8.553137    8.553137     1
     r_ar <- ar(copy(my.df))   10.229438   10.229438   10.229438   10.229438   10.229438     1
     r_og <- og(copy(my.df)) 1424.472844 1424.472844 1424.472844 1424.472844 1424.472844     1

Obviously, the solutions from both @eddi and @Arun are faster than mine. Now I have to try to understand rolls.
Here's a faster data.table solution. The idea is to use the rolling merge functionality of data.table, but before we do that we need to modify the data slightly and make column x1 numeric instead of integer. This is because the OP is using a strict inequality, and to use rolling joins with that we're going to have to decrease the tolerance by a tiny amount, making it a floating point number.
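To see what the roll argument does on its own, here is a minimal sketch on a toy table. The table dt, the probe values, and the names fwd and bwd are made up for illustration, and which = TRUE is used only to show the matched row numbers directly.

```r
library(data.table)

# Hypothetical toy table keyed on x
dt <- data.table(x = c(10, 20, 30), key = "x")
probe <- c(5, 10, 25, 35)

# roll = Inf: match the last key at or before each probe value.
# A value before the first key has nothing to roll forward from.
fwd <- dt[J(probe), roll = Inf, which = TRUE]

# roll = -Inf: match the first key at or after each probe value.
# A value after the last key has nothing to roll backward from.
bwd <- dt[J(probe), roll = -Inf, which = TRUE]

print(fwd)
print(bwd)
```

Subtracting the backward-rolled index at the lower bound from the forward-rolled index at the upper bound, plus one, gives the number of keys between the two bounds.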
    my.df[, x1 := as.numeric(x1)]

    # set the key to x1 for the merges and to sort
    # (note: if the data is already sorted, you can make this step instantaneous using setattr)
    setkey(my.df, x1)

    # and now we're going to do two rolling merges, one with the upper bound
    # and one with the lower, then get the index of each match and subtract the ends
    # (+1, to get the count)
    my.df[, res := my.df[J(x1 + tol - 1e-6), list(ind = .I), roll = Inf]$ind -
                   my.df[J(x1 - tol + 1e-6), list(ind = .I), roll = -Inf]$ind + 1]

    # and here's the bench vs @Arun's solution
    ed = function(my.df) {
      my.df[, x1 := as.numeric(x1)]
      setkey(my.df, x1)
      my.df[, res := my.df[J(x1 + tol - 1e-6), list(ind = .I), roll = Inf]$ind -
                     my.df[J(x1 - tol + 1e-6), list(ind = .I), roll = -Inf]$ind + 1]
    }

    microbenchmark(ed(copy(my.df)), ar(copy(my.df)))
    #Unit: milliseconds
    #            expr       min       lq   median       uq      max neval
    # ed(copy(my.df))  7.297928 10.09947 10.87561 11.80083 23.05907   100
    # ar(copy(my.df)) 10.825521 15.38151 16.36115 18.15350 21.98761   100

Note: as both Arun and Matthew pointed out, if x1 is integer, one doesn't have to convert it to numeric and subtract a small amount from tol; one can use tol - 1L instead of tol - 1e-6 above.
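The integer shortcut from the note can be sketched roughly as follows. This is not the answerers' exact code: it assumes x1 is an integer key, reuses the small example from the question, and swaps list(ind = .I) for which = TRUE to get the matched row numbers. The names upper, lower, and brute are illustrative only.

```r
library(data.table)

my.df <- data.table(x1 = 1:1000, x2 = 4:1003)
tol <- 3L
setkey(my.df, x1)

# For integers, x1 < v + tol is the same as x1 <= v + tol - 1L,
# and x1 > v - tol is the same as x1 >= v - tol + 1L.
upper <- my.df[J(my.df$x1 + tol - 1L), roll = Inf, which = TRUE]
lower <- my.df[J(my.df$x1 - tol + 1L), roll = -Inf, which = TRUE]
my.df[, res := upper - lower + 1L]

# Brute-force cross-check using the strict inequalities from the question
x <- my.df$x1
brute <- sapply(x, function(v) sum(x > v - tol & x < v + tol))
stopifnot(all(my.df$res == brute))
```

The brute-force check is quadratic, so it is only practical on small inputs like this one.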