How can I bootstrap the innermost array of a numpy array?
I have a numpy array with these dimensions:
data.shape (categories, models, types, events): (10, 11, 50, 100)
Now I want to sample with replacement along the innermost (events) axis of size 100 only. For a single array such as this:
data[0][0][0]
array([ 40.448624 , 39.459843 , 33.76762 , 38.944622 , 21.407362 ,
35.55499 , 68.5111 , 16.512974 , 21.118315 , 18.447166 ,
16.026619 , 21.596252 , 41.798622 , 63.01645 , 46.886642 ,
68.874756 , 17.472408 , 53.015724 , 85.41213 , 59.388977 ,
17.352108 , 61.161705 , 23.430847 , 20.203123 , 22.73194 ,
77.40547 , 43.02974 , 29.745787 , 21.50163 , 13.820962 ,
46.91466 , 41.43656 , 18.008326 , 13.122162 , 59.79936 ,
94.555305 , 24.798452 , 30.362497 , 13.629236 , 10.792178 ,
35.298515 , 20.904285 , 15.409604 , 20.567234 , 46.376335 ,
13.82727 , 17.970661 , 18.408686 , 21.987917 , 21.30094 ,
24.26776 , 27.399046 , 49.16879 , 21.831453 , 66.577 ,
15.524615 , 18.091696 , 24.346598 , 24.709772 , 19.068447 ,
24.221592 , 25.244864 , 52.865868 , 22.860783 , 23.586731 ,
18.928782 , 21.960285 , 74.77856 , 15.176119 , 20.795431 ,
14.3638935, 35.937237 , 29.993324 , 30.848495 , 48.145336 ,
38.02541 , 101.15249 , 49.801117 , 38.123184 , 12.041505 ,
18.788296 , 20.53382 , 31.20367 , 19.76104 , 92.56279 ,
41.62944 , 23.53344 , 18.967432 , 14.781404 , 20.02018 ,
27.736559 , 16.108913 , 44.935062 , 12.629299 , 34.65672 ,
20.60169 , 21.779675 , 31.585844 , 23.768578 , 92.463196 ],
dtype=float32)
I can sample with replacement using np.random.choice(data[0][0][0], 100), which I will be doing thousands of times.
array([ 13.629236, 92.56279 , 21.960285, 20.567234, 21.50163 ,
16.026619, 20.203123, 23.430847, 16.512974, 15.524615,
18.967432, 22.860783, 85.41213 , 21.779675, 23.586731,
24.26776 , 66.577 , 20.904285, 19.068447, 21.960285,
68.874756, 31.585844, 23.586731, 61.161705, 101.15249 ,
59.79936 , 16.512974, 43.02974 , 16.108913, 24.26776 ,
23.430847, 14.781404, 40.448624, 13.629236, 24.26776 ,
19.068447, 16.026619, 16.512974, 16.108913, 77.40547 ,
12.629299, 31.585844, 24.798452, 18.967432, 14.781404,
23.430847, 49.16879 , 18.408686, 22.73194 , 10.792178,
16.108913, 18.967432, 12.041505, 85.41213 , 41.62944 ,
31.20367 , 17.970661, 29.745787, 39.459843, 10.792178,
43.02974 , 21.831453, 21.50163 , 24.798452, 30.362497,
21.50163 , 18.788296, 20.904285, 17.352108, 41.798622,
18.447166, 16.108913, 19.068447, 61.161705, 52.865868,
20.795431, 85.41213 , 49.801117, 13.82727 , 18.928782,
41.43656 , 46.886642, 92.56279 , 41.62944 , 18.091696,
20.60169 , 48.145336, 20.53382 , 40.448624, 20.60169 ,
23.586731, 22.73194 , 92.56279 , 94.555305, 22.73194 ,
17.352108, 46.886642, 27.399046, 18.008326, 15.176119],
dtype=float32)
But since np.random.choice has no axis argument, how can I do this for all the innermost arrays at once (i.e. across (categories, models, types))? Or is looping through them the only option?
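For reference, a minimal loop-based sketch (with smaller made-up dimensions) of the fallback I'd like to avoid:

```python
import numpy as np

# made-up smaller dimensions for illustration
data = np.random.random((2, 3, 4, 10))  # (categories, models, types, events)

# resample each innermost event row independently, with replacement
boot = np.empty_like(data)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        for k in range(data.shape[2]):
            boot[i, j, k] = np.random.choice(data[i, j, k], data.shape[-1])
```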
python arrays numpy numpy-ndarray statistics-bootstrap
Is shuffling okay?
– coldspeed
Nov 10 at 4:44
@coldspeed It needs to be sampling with replacement. But if you have a shuffling example/answer, please do provide it, as I will test both at a later stage. Thanks!
– maximusdooku
Nov 10 at 4:46
In numpy, the first dimension is the outermost and the last is the innermost. Also, where possible we index with one set of brackets, e.g. arr[0,0,0,:]
should produce a (100,) shape array. Transposing to put that size-100 axis first (outermost) might be a good idea.
– hpaulj
Nov 10 at 4:48
I'd try applying choice to a 1d indexing array, e.g. idx = np.random.choice(100, size); arr[:,:,:,idx].
– hpaulj
Nov 10 at 4:50
@hpaulj Thank you so much for the correction. I have updated the question.
– maximusdooku
Nov 10 at 4:52
edited Nov 11 at 2:39
merv
asked Nov 10 at 4:40
maximusdooku
3 Answers
accepted
The fastest/simplest answer turns out to be based on indexing a flattened version of your array:
def resampFlat(arr, reps):
    n = arr.shape[-1]
    # create an array to shift random indexes as needed
    shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)
    # get a flat view of the array
    arrflat = arr.ravel()
    # sample the array by generating random ints and shifting them appropriately
    return np.array([arrflat[np.random.randint(0, n, arr.shape) + shift]
                     for i in range(reps)])
Timings confirm that this is the fastest answer.
Timings
I tested out the above resampFlat
function alongside a simpler for
loop based solution:
def resampFor(arr, reps):
    # store the shape for the return value
    shape = arr.shape
    # flatten all dimensions of arr except the last
    arr = arr.reshape(-1, arr.shape[-1])
    # preallocate the return value
    ret = np.empty((reps, *arr.shape), dtype=arr.dtype)
    # generate the indices of the resampled values
    idxs = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))
    for rep, idx in zip(ret, idxs):
        # iterate over the resampled replicates
        for row, rowrep, i in zip(arr, rep, idx):
            # iterate over the event arrays within a replicate
            rowrep[...] = row[i]
    # give the return value the appropriate shape
    return ret.reshape((reps, *shape))
and a solution based on Paul Panzer's fancy indexing approach:
def resampFancyIdx(arr, reps):
    idx = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))
    _, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *arr.shape[:-1], 0)))]
    return arr[I, J, K, idx]
I tested with the following data:
shape = (10, 11, 50, 100)
data = np.arange(np.prod(shape)).reshape(shape)
Here's the results from the array flattening approach:
%%timeit
resampFlat(data, 100)
1.25 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
the results from the for
loop approach:
%%timeit
resampFor(data, 100)
1.66 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and from Paul's fancy indexing:
%%timeit
resampFancyIdx(data, 100)
1.42 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Contrary to my expectations, resampFancyIdx
beat resampFor
, and I actually had to work fairly hard to come up with something better. At this point I would really like a better explanation of how fancy indexing works at the C-level, and why it's so performant.
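For example, here is a quick sanity check (a sketch with made-up small dimensions) that resampFlat keeps each resampled row drawn only from its own event row:

```python
import numpy as np

def resampFlat(arr, reps):
    n = arr.shape[-1]
    # flat-index base of each innermost row, broadcast to arr's shape
    shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)
    arrflat = arr.ravel()
    return np.array([arrflat[np.random.randint(0, n, arr.shape) + shift]
                     for i in range(reps)])

data = np.random.random((2, 3, 4, 5))
boot = resampFlat(data, 7)
# one replicate per rep, each with the original shape
assert boot.shape == (7, 2, 3, 4, 5)
# every resampled row is drawn only from the matching event row
assert np.isin(boot[0, 1, 2, 3], data[1, 2, 3]).all()
```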
Thank you. Several new concepts in your code for me. Prob will take a while to understand everything. Though the shape looks good.
– maximusdooku
Nov 10 at 6:33
@maximusdooku my first answer was a little hacky, so I came up with something better. Hopefully it'll be a little easier to understand, too
– tel
Nov 10 at 13:09
You can draw the indices of your samples and then apply fancy indexing:
>>> import numpy as np
>>>
>>> (categories, models, types, events) = (10, 11, 50, 100)
>>> data = np.random.random((categories, models, types, events))
>>> N_samples = 1000
>>>
>>> idx = np.random.randint(0, events, (categories, models, types, N_samples))
>>> I, J, K, _ = np.ogrid[:categories, :models, :types, :0]
>>>
>>> resampled = data[I, J, K, idx]
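As a quick check (a sketch with smaller made-up dimensions), each (category, model, type) cell of the result draws only from that cell's own events:

```python
import numpy as np

categories, models, types, events = 2, 3, 4, 5
data = np.random.random((categories, models, types, events))
N_samples = 6

# independent index draws per (category, model, type) cell
idx = np.random.randint(0, events, (categories, models, types, N_samples))
I, J, K, _ = np.ogrid[:categories, :models, :types, :0]
resampled = data[I, J, K, idx]

assert resampled.shape == (categories, models, types, N_samples)
# each cell's samples are a subset of that cell's original events
assert all(np.isin(resampled[i, j, k], data[i, j, k]).all()
           for i in range(categories)
           for j in range(models)
           for k in range(types))
```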
A small explicit example for concreteness. The fields are labeled with "category" (A or B), "model" (a or b) and "type" (1 or 2) to make it easy to verify that sampling does preserve these.
>>> I, J, K, L = np.ix_(*(np.array(list(x), 'O') for x in ('AB', 'ab', '12', 'xyzw')))
>>> data = I+J+K+L
>>> data
array([[[['Aa1x', 'Aa1y', 'Aa1z', 'Aa1w'],
['Aa2x', 'Aa2y', 'Aa2z', 'Aa2w']],
[['Ab1x', 'Ab1y', 'Ab1z', 'Ab1w'],
['Ab2x', 'Ab2y', 'Ab2z', 'Ab2w']]],
[[['Ba1x', 'Ba1y', 'Ba1z', 'Ba1w'],
['Ba2x', 'Ba2y', 'Ba2z', 'Ba2w']],
[['Bb1x', 'Bb1y', 'Bb1z', 'Bb1w'],
['Bb2x', 'Bb2y', 'Bb2z', 'Bb2w']]]], dtype=object)
>>> N_samples = 3
>>>
>>> idx = np.random.randint(0, data.shape[-1], (N_samples, *data.shape))
>>> _, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *data.shape[:-1], 0)))]
>>>
>>> resampled = data[I, J, K, idx]
>>> resampled
array([[[[['Aa1z', 'Aa1y', 'Aa1y', 'Aa1x'],
['Aa2y', 'Aa2z', 'Aa2z', 'Aa2z']],
[['Ab1w', 'Ab1z', 'Ab1y', 'Ab1x'],
['Ab2y', 'Ab2w', 'Ab2y', 'Ab2w']]],
[[['Ba1z', 'Ba1y', 'Ba1y', 'Ba1x'],
['Ba2x', 'Ba2x', 'Ba2z', 'Ba2x']],
[['Bb1x', 'Bb1x', 'Bb1y', 'Bb1z'],
['Bb2y', 'Bb2w', 'Bb2y', 'Bb2z']]]],
[[[['Aa1x', 'Aa1w', 'Aa1x', 'Aa1z'],
['Aa2y', 'Aa2y', 'Aa2x', 'Aa2z']],
[['Ab1y', 'Ab1x', 'Ab1w', 'Ab1z'],
['Ab2w', 'Ab2x', 'Ab2w', 'Ab2w']]],
[[['Ba1x', 'Ba1z', 'Ba1x', 'Ba1z'],
['Ba2x', 'Ba2y', 'Ba2y', 'Ba2w']],
[['Bb1z', 'Bb1w', 'Bb1y', 'Bb1w'],
['Bb2w', 'Bb2x', 'Bb2w', 'Bb2z']]]],
[[[['Aa1w', 'Aa1w', 'Aa1w', 'Aa1y'],
['Aa2z', 'Aa2x', 'Aa2y', 'Aa2x']],
[['Ab1z', 'Ab1z', 'Ab1x', 'Ab1y'],
['Ab2w', 'Ab2x', 'Ab2x', 'Ab2y']]],
[[['Ba1w', 'Ba1x', 'Ba1y', 'Ba1y'],
['Ba2z', 'Ba2x', 'Ba2x', 'Ba2x']],
[['Bb1z', 'Bb1w', 'Bb1x', 'Bb1x'],
['Bb2z', 'Bb2x', 'Bb2w', 'Bb2z']]]]], dtype=object)
Thank you! Could you please explain if this is different/better than what I posted and why? Specially with regards to cross-correlations. Looks neat!
– maximusdooku
Nov 10 at 5:46
I am not sure I understand this code unfortunately. Is it doing a "global" sample with replacement across all events? Because the sampling needs to happen within each 100-event array...
– maximusdooku
Nov 10 at 5:49
@maximusdooku I've added a small example, so you can check whether it does what you need. The advantage of this method is that it avoids a python level loop. Let me know whether it works for you.
– Paul Panzer
Nov 10 at 6:00
I am going through every step. But the shape of data changes from (2, 2, 2, 4) to (2, 2, 2, 6). But I need to generate 6 samples of (2,2,2,4) for example... Perhaps I am not understanding the code because of fancy indexing...
– maximusdooku
Nov 10 at 6:29
@maximusdooku think about it this way: given N_samples=12, Paul's answer would produce the same result as if you took 3 samples of shape (2, 2, 2, 4) and then just stuck them together along the last axis. Either approach (one array of shape (2,2,2,N*4) vs N arrays of shape (2,2,2,4)) can produce a correct bootstrap. In any case, I think this answer is a good one. It's similar to hpaulj's, but without the flaw in the math
– tel
Nov 10 at 6:38
databoot = []
for i in range(5):
    idx = np.random.choice(100, 100)
    databoot.append(data[:,:,:,idx])
- shape of databoot -> (5, 10, 11, 50, 100)
- shape of data -> (10, 11, 50, 100)
@hpaulj came up with a nicely efficient/parsimonious answer, but I don't think it's 100% mathematically correct. Basically, if there are any correlations between the events in different (categories, models, types), the above answer may bias your resampled array, as well as the results from your bootstrap
– tel
Nov 10 at 5:19
@tel I would love to see a more correct answer...If you know how to, please do post.
– maximusdooku
Nov 10 at 5:20
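A small sketch (made-up dimensions) of the shared-index behavior tel describes: data[:,:,:,idx] applies the same resampling pattern idx to every (category, model, type) cell, so positional correlations between cells are preserved rather than being resampled independently:

```python
import numpy as np

data = np.random.random((2, 3, 4, 5))
idx = np.random.choice(5, 5)  # one index vector shared by ALL cells
boot = data[:, :, :, idx]

# every cell is resampled with the identical pattern idx
assert np.array_equal(boot[0, 0, 0], data[0, 0, 0][idx])
assert np.array_equal(boot[1, 2, 3], data[1, 2, 3][idx])
```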
The fastest/simplest answer turns out to be based on indexing a flattened version of your array:
def resampFlat(arr, reps):
n = arr.shape[-1]
# create an array to shift random indexes as needed
shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)
# get a flat view of the array
arrflat = arr.ravel()
# sample the array by generating random ints and shifting them appropriately
return np.array([arrflat[np.random.randint(0, n, arr.shape) + shift]
for i in range(reps)])
Timings confirm that this is the fastest answer.
Timings
I tested out the above resampFlat
function alongside a simpler for
loop based solution:
def resampFor(arr, reps):
# store the shape for the return value
shape = arr.shape
# flatten all dimensions of arr except the last
arr = arr.reshape(-1, arr.shape[-1])
# preallocate the return value
ret = np.empty((reps, *arr.shape), dtype=arr.dtype)
# generate the indices of the resampled values
idxs = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))
for rep,idx in zip(ret, idxs):
# iterate over the resampled replicates
for row,rowrep,i in zip(arr, rep, idx):
# iterate over the event arrays within a replicate
rowrep[...] = row[i]
# give the return value the appropriate shape
return ret.reshape((reps, *shape))
and a solution based on Paul Panzer's fancy indexing approach:
def resampFancyIdx(arr, reps):
idx = np.random.randint(0, arr.shape[-1], (reps, *data.shape))
_, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *arr.shape[:-1], 0)))]
return arr[I, J, K, idx]
I tested with the following data:
shape = ((10, 11, 50, 100))
data = np.arange(np.prod(shape)).reshape(shape)
Here's the results from the array flattening approach:
%%timeit
resampFlat(data, 100)
1.25 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
the results from the for
loop approach:
%%timeit
resampFor(data, 100)
1.66 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and from Paul's fancy indexing:
%%timeit
resampFancyIdx(data, 100)
1.42 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Contrary to my expectations, resampFancyIdx
beat resampFor
, and I actually had to work fairly hard to come up with something better. At this point I would really like a better explanation of how fancy indexing works at the C-level, and why it's so performant.
Thank you. Several new concepts in your code for me. Prob will take a while to understand everything. Though the shape looks good.
– maximusdooku
Nov 10 at 6:33
@maximusdooku my first answer was a little hacky, so I came up with something better. Hopefully it'll be a little easier to understand, too
– tel
Nov 10 at 13:09
add a comment |
up vote
1
down vote
accepted
The fastest/simplest answer turns out to be based on indexing a flattened version of your array:
def resampFlat(arr, reps):
n = arr.shape[-1]
# create an array to shift random indexes as needed
shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)
# get a flat view of the array
arrflat = arr.ravel()
# sample the array by generating random ints and shifting them appropriately
return np.array([arrflat[np.random.randint(0, n, arr.shape) + shift]
for i in range(reps)])
Timings confirm that this is the fastest answer.
Timings
I tested out the above resampFlat
function alongside a simpler for
loop based solution:
def resampFor(arr, reps):
# store the shape for the return value
shape = arr.shape
# flatten all dimensions of arr except the last
arr = arr.reshape(-1, arr.shape[-1])
# preallocate the return value
ret = np.empty((reps, *arr.shape), dtype=arr.dtype)
# generate the indices of the resampled values
idxs = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))
for rep,idx in zip(ret, idxs):
# iterate over the resampled replicates
for row,rowrep,i in zip(arr, rep, idx):
# iterate over the event arrays within a replicate
rowrep[...] = row[i]
# give the return value the appropriate shape
return ret.reshape((reps, *shape))
and a solution based on Paul Panzer's fancy indexing approach:
def resampFancyIdx(arr, reps):
idx = np.random.randint(0, arr.shape[-1], (reps, *data.shape))
_, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *arr.shape[:-1], 0)))]
return arr[I, J, K, idx]
I tested with the following data:
shape = ((10, 11, 50, 100))
data = np.arange(np.prod(shape)).reshape(shape)
Here's the results from the array flattening approach:
%%timeit
resampFlat(data, 100)
1.25 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
the results from the for
loop approach:
%%timeit
resampFor(data, 100)
1.66 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and from Paul's fancy indexing:
%%timeit
resampFancyIdx(data, 100)
1.42 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Contrary to my expectations, resampFancyIdx
beat resampFor
, and I actually had to work fairly hard to come up with something better. At this point I would really like a better explanation of how fancy indexing works at the C-level, and why it's so performant.
Thank you. Several new concepts in your code for me. Prob will take a while to understand everything. Though the shape looks good.
– maximusdooku
Nov 10 at 6:33
@maximusdooku my first answer was a little hacky, so I came up with something better. Hopefully it'll be a little easier to understand, too
– tel
Nov 10 at 13:09
add a comment |
up vote
1
down vote
accepted
up vote
1
down vote
accepted
The fastest/simplest answer turns out to be based on indexing a flattened version of your array:
def resampFlat(arr, reps):
n = arr.shape[-1]
# create an array to shift random indexes as needed
shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)
# get a flat view of the array
arrflat = arr.ravel()
# sample the array by generating random ints and shifting them appropriately
return np.array([arrflat[np.random.randint(0, n, arr.shape) + shift]
for i in range(reps)])
Timings confirm that this is the fastest answer.
Timings
I tested out the above resampFlat
function alongside a simpler for
loop based solution:
def resampFor(arr, reps):
# store the shape for the return value
shape = arr.shape
# flatten all dimensions of arr except the last
arr = arr.reshape(-1, arr.shape[-1])
# preallocate the return value
ret = np.empty((reps, *arr.shape), dtype=arr.dtype)
# generate the indices of the resampled values
idxs = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))
for rep,idx in zip(ret, idxs):
# iterate over the resampled replicates
for row,rowrep,i in zip(arr, rep, idx):
# iterate over the event arrays within a replicate
rowrep[...] = row[i]
# give the return value the appropriate shape
return ret.reshape((reps, *shape))
and a solution based on Paul Panzer's fancy indexing approach:
def resampFancyIdx(arr, reps):
idx = np.random.randint(0, arr.shape[-1], (reps, *data.shape))
_, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *arr.shape[:-1], 0)))]
return arr[I, J, K, idx]
I tested with the following data:
shape = ((10, 11, 50, 100))
data = np.arange(np.prod(shape)).reshape(shape)
Here's the results from the array flattening approach:
%%timeit
resampFlat(data, 100)
1.25 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
the results from the for
loop approach:
%%timeit
resampFor(data, 100)
1.66 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and from Paul's fancy indexing:
%%timeit
resampFancyIdx(data, 100)
1.42 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Contrary to my expectations, resampFancyIdx
beat resampFor
, and I actually had to work fairly hard to come up with something better. At this point I would really like a better explanation of how fancy indexing works at the C-level, and why it's so performant.
The fastest/simplest answer turns out to be based on indexing a flattened version of your array:
def resampFlat(arr, reps):
n = arr.shape[-1]
# create an array to shift random indexes as needed
shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)
# get a flat view of the array
arrflat = arr.ravel()
# sample the array by generating random ints and shifting them appropriately
return np.array([arrflat[np.random.randint(0, n, arr.shape) + shift]
for i in range(reps)])
Timings confirm that this is the fastest answer.
Timings
I tested out the above resampFlat
function alongside a simpler for
loop based solution:
def resampFor(arr, reps):
# store the shape for the return value
shape = arr.shape
# flatten all dimensions of arr except the last
arr = arr.reshape(-1, arr.shape[-1])
# preallocate the return value
ret = np.empty((reps, *arr.shape), dtype=arr.dtype)
# generate the indices of the resampled values
idxs = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))
for rep,idx in zip(ret, idxs):
# iterate over the resampled replicates
for row,rowrep,i in zip(arr, rep, idx):
# iterate over the event arrays within a replicate
rowrep[...] = row[i]
# give the return value the appropriate shape
return ret.reshape((reps, *shape))
and a solution based on Paul Panzer's fancy indexing approach:
def resampFancyIdx(arr, reps):
idx = np.random.randint(0, arr.shape[-1], (reps, *data.shape))
_, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *arr.shape[:-1], 0)))]
return arr[I, J, K, idx]
I tested with the following data:
shape = ((10, 11, 50, 100))
data = np.arange(np.prod(shape)).reshape(shape)
Here's the results from the array flattening approach:
%%timeit
resampFlat(data, 100)
1.25 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
the results from the for-loop approach:
%%timeit
resampFor(data, 100)
1.66 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and from Paul's fancy indexing:
%%timeit
resampFancyIdx(data, 100)
1.42 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Contrary to my expectations, resampFancyIdx beat resampFor, and I actually had to work fairly hard to come up with something better. At this point I would really like a better explanation of how fancy indexing works at the C level, and why it's so performant.
edited Nov 10 at 13:02
answered Nov 10 at 5:32
tel
Thank you. Several new concepts in your code for me. Prob will take a while to understand everything. Though the shape looks good.
– maximusdooku
Nov 10 at 6:33
@maximusdooku my first answer was a little hacky, so I came up with something better. Hopefully it'll be a little easier to understand, too
– tel
Nov 10 at 13:09
You can draw the indices of your samples and then apply fancy indexing:
>>> import numpy as np
>>>
>>> (categories, models, types, events) = (10, 11, 50, 100)
>>> data = np.random.random((categories, models, types, events))
>>> N_samples = 1000
>>>
>>> idx = np.random.randint(0, events, (categories, models, types, N_samples))
>>> I, J, K, _ = np.ogrid[:categories, :models, :types, :0]
>>>
>>> resampled = data[I, J, K, idx]
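To make the broadcasting behind this indexing concrete, here is a scaled-down sketch (the small shapes are made up for illustration): the open grids from np.ogrid have shapes like (categories, 1, 1, 1), which broadcast against idx so that each sample is drawn from its own (category, model, type) row. The trailing :0 slice only pads ogrid out to four axes; its output is discarded.

```python
import numpy as np

categories, models, types, events = 2, 3, 4, 5
data = np.random.random((categories, models, types, events))
N_samples = 7

idx = np.random.randint(0, events, (categories, models, types, N_samples))
# I: (2,1,1,1), J: (1,3,1,1), K: (1,1,4,1) -- these broadcast against
# idx's shape (2,3,4,7) inside the fancy-indexing expression
I, J, K, _ = np.ogrid[:categories, :models, :types, :0]
resampled = data[I, J, K, idx]
assert resampled.shape == (categories, models, types, N_samples)
```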
A small explicit example for concreteness. The fields are labeled with "category" (A or B), "model" (a or b) and "type" (1 or 2) to make it easy to verify that sampling does preserve these.
>>> I, J, K, L = np.ix_(*(np.array(list(x), 'O') for x in ('AB', 'ab', '12', 'xyzw')))
>>> data = I+J+K+L
>>> data
array([[[['Aa1x', 'Aa1y', 'Aa1z', 'Aa1w'],
['Aa2x', 'Aa2y', 'Aa2z', 'Aa2w']],
[['Ab1x', 'Ab1y', 'Ab1z', 'Ab1w'],
['Ab2x', 'Ab2y', 'Ab2z', 'Ab2w']]],
[[['Ba1x', 'Ba1y', 'Ba1z', 'Ba1w'],
['Ba2x', 'Ba2y', 'Ba2z', 'Ba2w']],
[['Bb1x', 'Bb1y', 'Bb1z', 'Bb1w'],
['Bb2x', 'Bb2y', 'Bb2z', 'Bb2w']]]], dtype=object)
>>> N_samples = 3
>>>
>>> idx = np.random.randint(0, data.shape[-1], (N_samples, *data.shape))
>>> _, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *data.shape[:-1], 0)))]
>>>
>>> resampled = data[I, J, K, idx]
>>> resampled
array([[[[['Aa1z', 'Aa1y', 'Aa1y', 'Aa1x'],
['Aa2y', 'Aa2z', 'Aa2z', 'Aa2z']],
[['Ab1w', 'Ab1z', 'Ab1y', 'Ab1x'],
['Ab2y', 'Ab2w', 'Ab2y', 'Ab2w']]],
[[['Ba1z', 'Ba1y', 'Ba1y', 'Ba1x'],
['Ba2x', 'Ba2x', 'Ba2z', 'Ba2x']],
[['Bb1x', 'Bb1x', 'Bb1y', 'Bb1z'],
['Bb2y', 'Bb2w', 'Bb2y', 'Bb2z']]]],
[[[['Aa1x', 'Aa1w', 'Aa1x', 'Aa1z'],
['Aa2y', 'Aa2y', 'Aa2x', 'Aa2z']],
[['Ab1y', 'Ab1x', 'Ab1w', 'Ab1z'],
['Ab2w', 'Ab2x', 'Ab2w', 'Ab2w']]],
[[['Ba1x', 'Ba1z', 'Ba1x', 'Ba1z'],
['Ba2x', 'Ba2y', 'Ba2y', 'Ba2w']],
[['Bb1z', 'Bb1w', 'Bb1y', 'Bb1w'],
['Bb2w', 'Bb2x', 'Bb2w', 'Bb2z']]]],
[[[['Aa1w', 'Aa1w', 'Aa1w', 'Aa1y'],
['Aa2z', 'Aa2x', 'Aa2y', 'Aa2x']],
[['Ab1z', 'Ab1z', 'Ab1x', 'Ab1y'],
['Ab2w', 'Ab2x', 'Ab2x', 'Ab2y']]],
[[['Ba1w', 'Ba1x', 'Ba1y', 'Ba1y'],
['Ba2z', 'Ba2x', 'Ba2x', 'Ba2x']],
[['Bb1z', 'Bb1w', 'Bb1x', 'Bb1x'],
['Bb2z', 'Bb2x', 'Bb2w', 'Bb2z']]]]], dtype=object)
Thank you! Could you please explain if this is different/better than what I posted and why? Specially with regards to cross-correlations. Looks neat!
– maximusdooku
Nov 10 at 5:46
I am not sure I understand this code unfortunately. Is it doing "global" sample with replacement in all events? Because the sampling needs to happen in each 100 event array...
– maximusdooku
Nov 10 at 5:49
@maximusdooku I've added a small example, so you can check whether it does what you need. The advantage of this method is that it avoids a python level loop. Let me know whether it works for you.
– Paul Panzer
Nov 10 at 6:00
I am going through every step. But the shape of data changes from (2, 2, 2, 4) to (2, 2, 2, 6). But I need to generate 6 samples of (2,2,2,4) for example... Perhaps I am not understanding the code because of fancy indexing...
– maximusdooku
Nov 10 at 6:29
@maximusdooku think about it this way: given N_samples=12, Paul's answer would produce the same result as if you took 3 samples of shape (2, 2, 2, 4) and then just stuck them together along the last axis. Either approach (one array of shape (2,2,2,N*4) vs N arrays of shape (2,2,2,4)) can produce a correct bootstrap. In any case, I think this answer is a good one. It's similar to hpaulj's, but without the flaw in the math
– tel
Nov 10 at 6:38
edited Nov 10 at 7:13
answered Nov 10 at 5:41
Paul Panzer
databoot = []
for i in range(5):
    idx = np.random.choice(100, 100)
    databoot.append(data[:,:,:,idx])
- shape of databoot -> (5, 10, 11, 50, 100)
- shape of data -> (10, 11, 50, 100)
@hpaulj came up with a nicely efficient/parsimonious answer, but I don't think it's 100% mathematically correct. Basically, if there's any correlations between the events in different (categories, models, types), the above answer may bias your resampled array, as well as the results from your bootstrap
– tel
Nov 10 at 5:19
@tel I would love to see a more correct answer...If you know how to, please do post.
– maximusdooku
Nov 10 at 5:20
answered Nov 10 at 5:12
maximusdooku
Is shuffling okay?
– coldspeed
Nov 10 at 4:44
@coldspeed It needs to be sampling with replacement. But if you have a shuffling example/answer, please do provide, as I will test both at a later stage. Thanks!
– maximusdooku
Nov 10 at 4:46
In numpy, the first dimension is outermost, the last innermost. Also, where possible we index with one set of brackets, e.g. arr[0,0,0,:] should produce a (100,) shape array. Transposing to put that size 100 axis first (outermost) might be a good idea.
– hpaulj
Nov 10 at 4:48
I'd try applying choice to a 1d indexing array, e.g. idx = np.random.choice(100, size); arr[:,:,:,idx].
– hpaulj
Nov 10 at 4:50
@hpaulj Thank you so much for the correction. I have updated the question.
– maximusdooku
Nov 10 at 4:52