How can I bootstrap the innermost array of a numpy array?









up vote
1
down vote

favorite












I have a numpy array of these dimensions



data.shape (categories, models, types, events): (10, 11, 50, 100)



Now I want to do sample with replacement in the innermost array (100) only. For a single array such as this:



data[0][0][0]




array([ 40.448624 , 39.459843 , 33.76762 , 38.944622 , 21.407362 ,
35.55499 , 68.5111 , 16.512974 , 21.118315 , 18.447166 ,
16.026619 , 21.596252 , 41.798622 , 63.01645 , 46.886642 ,
68.874756 , 17.472408 , 53.015724 , 85.41213 , 59.388977 ,
17.352108 , 61.161705 , 23.430847 , 20.203123 , 22.73194 ,
77.40547 , 43.02974 , 29.745787 , 21.50163 , 13.820962 ,
46.91466 , 41.43656 , 18.008326 , 13.122162 , 59.79936 ,
94.555305 , 24.798452 , 30.362497 , 13.629236 , 10.792178 ,
35.298515 , 20.904285 , 15.409604 , 20.567234 , 46.376335 ,
13.82727 , 17.970661 , 18.408686 , 21.987917 , 21.30094 ,
24.26776 , 27.399046 , 49.16879 , 21.831453 , 66.577 ,
15.524615 , 18.091696 , 24.346598 , 24.709772 , 19.068447 ,
24.221592 , 25.244864 , 52.865868 , 22.860783 , 23.586731 ,
18.928782 , 21.960285 , 74.77856 , 15.176119 , 20.795431 ,
14.3638935, 35.937237 , 29.993324 , 30.848495 , 48.145336 ,
38.02541 , 101.15249 , 49.801117 , 38.123184 , 12.041505 ,
18.788296 , 20.53382 , 31.20367 , 19.76104 , 92.56279 ,
41.62944 , 23.53344 , 18.967432 , 14.781404 , 20.02018 ,
27.736559 , 16.108913 , 44.935062 , 12.629299 , 34.65672 ,
20.60169 , 21.779675 , 31.585844 , 23.768578 , 92.463196 ],
dtype=float32)



I can do sample with replacement using this: np.random.choice(data[0][0][0], 100), which I will be doing thousands of times.




array([ 13.629236, 92.56279 , 21.960285, 20.567234, 21.50163 ,
16.026619, 20.203123, 23.430847, 16.512974, 15.524615,
18.967432, 22.860783, 85.41213 , 21.779675, 23.586731,
24.26776 , 66.577 , 20.904285, 19.068447, 21.960285,
68.874756, 31.585844, 23.586731, 61.161705, 101.15249 ,
59.79936 , 16.512974, 43.02974 , 16.108913, 24.26776 ,
23.430847, 14.781404, 40.448624, 13.629236, 24.26776 ,
19.068447, 16.026619, 16.512974, 16.108913, 77.40547 ,
12.629299, 31.585844, 24.798452, 18.967432, 14.781404,
23.430847, 49.16879 , 18.408686, 22.73194 , 10.792178,
16.108913, 18.967432, 12.041505, 85.41213 , 41.62944 ,
31.20367 , 17.970661, 29.745787, 39.459843, 10.792178,
43.02974 , 21.831453, 21.50163 , 24.798452, 30.362497,
21.50163 , 18.788296, 20.904285, 17.352108, 41.798622,
18.447166, 16.108913, 19.068447, 61.161705, 52.865868,
20.795431, 85.41213 , 49.801117, 13.82727 , 18.928782,
41.43656 , 46.886642, 92.56279 , 41.62944 , 18.091696,
20.60169 , 48.145336, 20.53382 , 40.448624, 20.60169 ,
23.586731, 22.73194 , 92.56279 , 94.555305, 22.73194 ,
17.352108, 46.886642, 27.399046, 18.008326, 15.176119],
dtype=float32)



But since there is no axis in np.random.choice, how can I do it for all arrays (i.e. (categories, models, types))? Or is looping through it the only option?










share|improve this question























  • Is shuffling okay?
    – coldspeed
    Nov 10 at 4:44










  • @coldspeed It needs to be sampling with replacement. But if you have a shuffling example/answer, please do provide, as I will test both at a later stage. Thanks!
    – maximusdooku
    Nov 10 at 4:46











  • In numpy, the first dimension is outermost, last inner. Also, where possible we index with one set of , e.g. arr[0,0,0,:] should produce a (100,) shape array. Transposing to put that size 100 axis first (outermost) might be a good idea.
    – hpaulj
    Nov 10 at 4:48











  • I'd try applying choice to a 1d indexing array, e.g. idx = np.random.choice(100, size); arr[:,:,: idx].
    – hpaulj
    Nov 10 at 4:50










  • @hpaulj Thank you so much for the correction. I have updated the question.
    – maximusdooku
    Nov 10 at 4:52














up vote
1
down vote

favorite












I have a numpy array of these dimensions



data.shape (categories, models, types, events): (10, 11, 50, 100)



Now I want to do sample with replacement in the innermost array (100) only. For a single array such as this:



data[0][0][0]




array([ 40.448624 , 39.459843 , 33.76762 , 38.944622 , 21.407362 ,
35.55499 , 68.5111 , 16.512974 , 21.118315 , 18.447166 ,
16.026619 , 21.596252 , 41.798622 , 63.01645 , 46.886642 ,
68.874756 , 17.472408 , 53.015724 , 85.41213 , 59.388977 ,
17.352108 , 61.161705 , 23.430847 , 20.203123 , 22.73194 ,
77.40547 , 43.02974 , 29.745787 , 21.50163 , 13.820962 ,
46.91466 , 41.43656 , 18.008326 , 13.122162 , 59.79936 ,
94.555305 , 24.798452 , 30.362497 , 13.629236 , 10.792178 ,
35.298515 , 20.904285 , 15.409604 , 20.567234 , 46.376335 ,
13.82727 , 17.970661 , 18.408686 , 21.987917 , 21.30094 ,
24.26776 , 27.399046 , 49.16879 , 21.831453 , 66.577 ,
15.524615 , 18.091696 , 24.346598 , 24.709772 , 19.068447 ,
24.221592 , 25.244864 , 52.865868 , 22.860783 , 23.586731 ,
18.928782 , 21.960285 , 74.77856 , 15.176119 , 20.795431 ,
14.3638935, 35.937237 , 29.993324 , 30.848495 , 48.145336 ,
38.02541 , 101.15249 , 49.801117 , 38.123184 , 12.041505 ,
18.788296 , 20.53382 , 31.20367 , 19.76104 , 92.56279 ,
41.62944 , 23.53344 , 18.967432 , 14.781404 , 20.02018 ,
27.736559 , 16.108913 , 44.935062 , 12.629299 , 34.65672 ,
20.60169 , 21.779675 , 31.585844 , 23.768578 , 92.463196 ],
dtype=float32)



I can do sample with replacement using this: np.random.choice(data[0][0][0], 100), which I will be doing thousands of times.




array([ 13.629236, 92.56279 , 21.960285, 20.567234, 21.50163 ,
16.026619, 20.203123, 23.430847, 16.512974, 15.524615,
18.967432, 22.860783, 85.41213 , 21.779675, 23.586731,
24.26776 , 66.577 , 20.904285, 19.068447, 21.960285,
68.874756, 31.585844, 23.586731, 61.161705, 101.15249 ,
59.79936 , 16.512974, 43.02974 , 16.108913, 24.26776 ,
23.430847, 14.781404, 40.448624, 13.629236, 24.26776 ,
19.068447, 16.026619, 16.512974, 16.108913, 77.40547 ,
12.629299, 31.585844, 24.798452, 18.967432, 14.781404,
23.430847, 49.16879 , 18.408686, 22.73194 , 10.792178,
16.108913, 18.967432, 12.041505, 85.41213 , 41.62944 ,
31.20367 , 17.970661, 29.745787, 39.459843, 10.792178,
43.02974 , 21.831453, 21.50163 , 24.798452, 30.362497,
21.50163 , 18.788296, 20.904285, 17.352108, 41.798622,
18.447166, 16.108913, 19.068447, 61.161705, 52.865868,
20.795431, 85.41213 , 49.801117, 13.82727 , 18.928782,
41.43656 , 46.886642, 92.56279 , 41.62944 , 18.091696,
20.60169 , 48.145336, 20.53382 , 40.448624, 20.60169 ,
23.586731, 22.73194 , 92.56279 , 94.555305, 22.73194 ,
17.352108, 46.886642, 27.399046, 18.008326, 15.176119],
dtype=float32)



But since there is no axis in np.random.choice, how can I do it for all arrays (i.e. (categories, models, types))? Or is looping through it the only option?










share|improve this question























  • Is shuffling okay?
    – coldspeed
    Nov 10 at 4:44










  • @coldspeed It needs to be sampling with replacement. But if you have a shuffling example/answer, please do provide, as I will test both at a later stage. Thanks!
    – maximusdooku
    Nov 10 at 4:46











  • In numpy, the first dimension is outermost, last inner. Also, where possible we index with one set of , e.g. arr[0,0,0,:] should produce a (100,) shape array. Transposing to put that size 100 axis first (outermost) might be a good idea.
    – hpaulj
    Nov 10 at 4:48











  • I'd try applying choice to a 1d indexing array, e.g. idx = np.random.choice(100, size); arr[:,:,: idx].
    – hpaulj
    Nov 10 at 4:50










  • @hpaulj Thank you so much for the correction. I have updated the question.
    – maximusdooku
    Nov 10 at 4:52












up vote
1
down vote

favorite









up vote
1
down vote

favorite











I have a numpy array of these dimensions



data.shape (categories, models, types, events): (10, 11, 50, 100)



Now I want to do sample with replacement in the innermost array (100) only. For a single array such as this:



data[0][0][0]




array([ 40.448624 , 39.459843 , 33.76762 , 38.944622 , 21.407362 ,
35.55499 , 68.5111 , 16.512974 , 21.118315 , 18.447166 ,
16.026619 , 21.596252 , 41.798622 , 63.01645 , 46.886642 ,
68.874756 , 17.472408 , 53.015724 , 85.41213 , 59.388977 ,
17.352108 , 61.161705 , 23.430847 , 20.203123 , 22.73194 ,
77.40547 , 43.02974 , 29.745787 , 21.50163 , 13.820962 ,
46.91466 , 41.43656 , 18.008326 , 13.122162 , 59.79936 ,
94.555305 , 24.798452 , 30.362497 , 13.629236 , 10.792178 ,
35.298515 , 20.904285 , 15.409604 , 20.567234 , 46.376335 ,
13.82727 , 17.970661 , 18.408686 , 21.987917 , 21.30094 ,
24.26776 , 27.399046 , 49.16879 , 21.831453 , 66.577 ,
15.524615 , 18.091696 , 24.346598 , 24.709772 , 19.068447 ,
24.221592 , 25.244864 , 52.865868 , 22.860783 , 23.586731 ,
18.928782 , 21.960285 , 74.77856 , 15.176119 , 20.795431 ,
14.3638935, 35.937237 , 29.993324 , 30.848495 , 48.145336 ,
38.02541 , 101.15249 , 49.801117 , 38.123184 , 12.041505 ,
18.788296 , 20.53382 , 31.20367 , 19.76104 , 92.56279 ,
41.62944 , 23.53344 , 18.967432 , 14.781404 , 20.02018 ,
27.736559 , 16.108913 , 44.935062 , 12.629299 , 34.65672 ,
20.60169 , 21.779675 , 31.585844 , 23.768578 , 92.463196 ],
dtype=float32)



I can do sample with replacement using this: np.random.choice(data[0][0][0], 100), which I will be doing thousands of times.




array([ 13.629236, 92.56279 , 21.960285, 20.567234, 21.50163 ,
16.026619, 20.203123, 23.430847, 16.512974, 15.524615,
18.967432, 22.860783, 85.41213 , 21.779675, 23.586731,
24.26776 , 66.577 , 20.904285, 19.068447, 21.960285,
68.874756, 31.585844, 23.586731, 61.161705, 101.15249 ,
59.79936 , 16.512974, 43.02974 , 16.108913, 24.26776 ,
23.430847, 14.781404, 40.448624, 13.629236, 24.26776 ,
19.068447, 16.026619, 16.512974, 16.108913, 77.40547 ,
12.629299, 31.585844, 24.798452, 18.967432, 14.781404,
23.430847, 49.16879 , 18.408686, 22.73194 , 10.792178,
16.108913, 18.967432, 12.041505, 85.41213 , 41.62944 ,
31.20367 , 17.970661, 29.745787, 39.459843, 10.792178,
43.02974 , 21.831453, 21.50163 , 24.798452, 30.362497,
21.50163 , 18.788296, 20.904285, 17.352108, 41.798622,
18.447166, 16.108913, 19.068447, 61.161705, 52.865868,
20.795431, 85.41213 , 49.801117, 13.82727 , 18.928782,
41.43656 , 46.886642, 92.56279 , 41.62944 , 18.091696,
20.60169 , 48.145336, 20.53382 , 40.448624, 20.60169 ,
23.586731, 22.73194 , 92.56279 , 94.555305, 22.73194 ,
17.352108, 46.886642, 27.399046, 18.008326, 15.176119],
dtype=float32)



But since there is no axis in np.random.choice, how can I do it for all arrays (i.e. (categories, models, types))? Or is looping through it the only option?










share|improve this question















I have a numpy array of these dimensions



data.shape (categories, models, types, events): (10, 11, 50, 100)



Now I want to do sample with replacement in the innermost array (100) only. For a single array such as this:



data[0][0][0]




array([ 40.448624 , 39.459843 , 33.76762 , 38.944622 , 21.407362 ,
35.55499 , 68.5111 , 16.512974 , 21.118315 , 18.447166 ,
16.026619 , 21.596252 , 41.798622 , 63.01645 , 46.886642 ,
68.874756 , 17.472408 , 53.015724 , 85.41213 , 59.388977 ,
17.352108 , 61.161705 , 23.430847 , 20.203123 , 22.73194 ,
77.40547 , 43.02974 , 29.745787 , 21.50163 , 13.820962 ,
46.91466 , 41.43656 , 18.008326 , 13.122162 , 59.79936 ,
94.555305 , 24.798452 , 30.362497 , 13.629236 , 10.792178 ,
35.298515 , 20.904285 , 15.409604 , 20.567234 , 46.376335 ,
13.82727 , 17.970661 , 18.408686 , 21.987917 , 21.30094 ,
24.26776 , 27.399046 , 49.16879 , 21.831453 , 66.577 ,
15.524615 , 18.091696 , 24.346598 , 24.709772 , 19.068447 ,
24.221592 , 25.244864 , 52.865868 , 22.860783 , 23.586731 ,
18.928782 , 21.960285 , 74.77856 , 15.176119 , 20.795431 ,
14.3638935, 35.937237 , 29.993324 , 30.848495 , 48.145336 ,
38.02541 , 101.15249 , 49.801117 , 38.123184 , 12.041505 ,
18.788296 , 20.53382 , 31.20367 , 19.76104 , 92.56279 ,
41.62944 , 23.53344 , 18.967432 , 14.781404 , 20.02018 ,
27.736559 , 16.108913 , 44.935062 , 12.629299 , 34.65672 ,
20.60169 , 21.779675 , 31.585844 , 23.768578 , 92.463196 ],
dtype=float32)



I can do sample with replacement using this: np.random.choice(data[0][0][0], 100), which I will be doing thousands of times.




array([ 13.629236, 92.56279 , 21.960285, 20.567234, 21.50163 ,
16.026619, 20.203123, 23.430847, 16.512974, 15.524615,
18.967432, 22.860783, 85.41213 , 21.779675, 23.586731,
24.26776 , 66.577 , 20.904285, 19.068447, 21.960285,
68.874756, 31.585844, 23.586731, 61.161705, 101.15249 ,
59.79936 , 16.512974, 43.02974 , 16.108913, 24.26776 ,
23.430847, 14.781404, 40.448624, 13.629236, 24.26776 ,
19.068447, 16.026619, 16.512974, 16.108913, 77.40547 ,
12.629299, 31.585844, 24.798452, 18.967432, 14.781404,
23.430847, 49.16879 , 18.408686, 22.73194 , 10.792178,
16.108913, 18.967432, 12.041505, 85.41213 , 41.62944 ,
31.20367 , 17.970661, 29.745787, 39.459843, 10.792178,
43.02974 , 21.831453, 21.50163 , 24.798452, 30.362497,
21.50163 , 18.788296, 20.904285, 17.352108, 41.798622,
18.447166, 16.108913, 19.068447, 61.161705, 52.865868,
20.795431, 85.41213 , 49.801117, 13.82727 , 18.928782,
41.43656 , 46.886642, 92.56279 , 41.62944 , 18.091696,
20.60169 , 48.145336, 20.53382 , 40.448624, 20.60169 ,
23.586731, 22.73194 , 92.56279 , 94.555305, 22.73194 ,
17.352108, 46.886642, 27.399046, 18.008326, 15.176119],
dtype=float32)



But since there is no axis in np.random.choice, how can I do it for all arrays (i.e. (categories, models, types))? Or is looping through it the only option?







python arrays numpy numpy-ndarray statistics-bootstrap






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 11 at 2:39









merv

24.5k671108




24.5k671108










asked Nov 10 at 4:40









maximusdooku

1,38821343




1,38821343











  • Is shuffling okay?
    – coldspeed
    Nov 10 at 4:44










  • @coldspeed It needs to be sampling with replacement. But if you have a shuffling example/answer, please do provide, as I will test both at a later stage. Thanks!
    – maximusdooku
    Nov 10 at 4:46











  • In numpy, the first dimension is outermost, last inner. Also, where possible we index with one set of , e.g. arr[0,0,0,:] should produce a (100,) shape array. Transposing to put that size 100 axis first (outermost) might be a good idea.
    – hpaulj
    Nov 10 at 4:48











  • I'd try applying choice to a 1d indexing array, e.g. idx = np.random.choice(100, size); arr[:,:,: idx].
    – hpaulj
    Nov 10 at 4:50










  • @hpaulj Thank you so much for the correction. I have updated the question.
    – maximusdooku
    Nov 10 at 4:52
















  • Is shuffling okay?
    – coldspeed
    Nov 10 at 4:44










  • @coldspeed It needs to be sampling with replacement. But if you have a shuffling example/answer, please do provide, as I will test both at a later stage. Thanks!
    – maximusdooku
    Nov 10 at 4:46











  • In numpy, the first dimension is outermost, last inner. Also, where possible we index with one set of , e.g. arr[0,0,0,:] should produce a (100,) shape array. Transposing to put that size 100 axis first (outermost) might be a good idea.
    – hpaulj
    Nov 10 at 4:48











  • I'd try applying choice to a 1d indexing array, e.g. idx = np.random.choice(100, size); arr[:,:,: idx].
    – hpaulj
    Nov 10 at 4:50










  • @hpaulj Thank you so much for the correction. I have updated the question.
    – maximusdooku
    Nov 10 at 4:52















Is shuffling okay?
– coldspeed
Nov 10 at 4:44




Is shuffling okay?
– coldspeed
Nov 10 at 4:44












@coldspeed It needs to be sampling with replacement. But if you have a shuffling example/answer, please do provide, as I will test both at a later stage. Thanks!
– maximusdooku
Nov 10 at 4:46





@coldspeed It needs to be sampling with replacement. But if you have a shuffling example/answer, please do provide, as I will test both at a later stage. Thanks!
– maximusdooku
Nov 10 at 4:46













In numpy, the first dimension is outermost, last inner. Also, where possible we index with one set of , e.g. arr[0,0,0,:] should produce a (100,) shape array. Transposing to put that size 100 axis first (outermost) might be a good idea.
– hpaulj
Nov 10 at 4:48





In numpy, the first dimension is outermost, last inner. Also, where possible we index with one set of , e.g. arr[0,0,0,:] should produce a (100,) shape array. Transposing to put that size 100 axis first (outermost) might be a good idea.
– hpaulj
Nov 10 at 4:48













I'd try applying choice to a 1d indexing array, e.g. idx = np.random.choice(100, size); arr[:,:,: idx].
– hpaulj
Nov 10 at 4:50




I'd try applying choice to a 1d indexing array, e.g. idx = np.random.choice(100, size); arr[:,:,: idx].
– hpaulj
Nov 10 at 4:50












@hpaulj Thank you so much for the correction. I have updated the question.
– maximusdooku
Nov 10 at 4:52




@hpaulj Thank you so much for the correction. I have updated the question.
– maximusdooku
Nov 10 at 4:52












3 Answers
3






active

oldest

votes

















up vote
1
down vote



accepted










The fastest/simplest answer turns out to be based on indexing a flattened version of your array:



def resampFlat(arr, reps):
n = arr.shape[-1]

# create an array to shift random indexes as needed
shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)

# get a flat view of the array
arrflat = arr.ravel()
# sample the array by generating random ints and shifting them appropriately
return np.array([arrflat[np.random.randint(0, n, arr.shape) + shift]
for i in range(reps)])


Timings confirm that this is the fastest answer.



Timings



I tested out the above resampFlat function alongside a simpler for loop based solution:



def resampFor(arr, reps):
# store the shape for the return value
shape = arr.shape
# flatten all dimensions of arr except the last
arr = arr.reshape(-1, arr.shape[-1])
# preallocate the return value
ret = np.empty((reps, *arr.shape), dtype=arr.dtype)
# generate the indices of the resampled values
idxs = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))

for rep,idx in zip(ret, idxs):
# iterate over the resampled replicates
for row,rowrep,i in zip(arr, rep, idx):
# iterate over the event arrays within a replicate
rowrep[...] = row[i]

# give the return value the appropriate shape
return ret.reshape((reps, *shape))


and a solution based on Paul Panzer's fancy indexing approach:



def resampFancyIdx(arr, reps):
idx = np.random.randint(0, arr.shape[-1], (reps, *data.shape))
_, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *arr.shape[:-1], 0)))]

return arr[I, J, K, idx]


I tested with the following data:



shape = ((10, 11, 50, 100))
data = np.arange(np.prod(shape)).reshape(shape)


Here's the results from the array flattening approach:



%%timeit
resampFlat(data, 100)

1.25 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


the results from the for loop approach:



%%timeit
resampFor(data, 100)

1.66 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


and from Paul's fancy indexing:



%%timeit
resampFancyIdx(data, 100)

1.42 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Contrary to my expectations, resampFancyIdx beat resampFor, and I actually had to work fairly hard to come up with something better. At this point I would really like a better explanation of how fancy indexing works at the C-level, and why it's so performant.






share|improve this answer






















  • Thank you. Several new concepts in your code for me. Prob will take a while to understand everything. Though the shape looks good.
    – maximusdooku
    Nov 10 at 6:33










  • @maximusdooku my first answer was a little hacky, so I came up with something better. Hopefully it'll be a little easier to understand, too
    – tel
    Nov 10 at 13:09

















up vote
1
down vote













You can draw the indices of your samples and then apply fancy indexing:



>>> import numpy as np
>>>
>>> (categories, models, types, events) = (10, 11, 50, 100)
>>> data = np.random.random((categories, models, types, events))
>>> N_samples = 1000
>>>
>>> idx = np.random.randint(0, events, (categories, models, types, N_samples))
>>> I, J, K, _ = np.ogrid[:categories, :models, :types, :0]
>>>
>>> resampled = data[I, J, K, idx]


A small explicit example for concreteness. The fields are labeled with "category" (A or B), "model" (a or b) and "type" (1 or 2) to make it easy to verify that sampling does preserve these.



>>> I, J, K, L = np.ix_(*(np.array(list(x), 'O') for x in ('AB', 'ab', '12', 'xyzw')))
>>> data = I+J+K+L
>>> data
array([[[['Aa1x', 'Aa1y', 'Aa1z', 'Aa1w'],
['Aa2x', 'Aa2y', 'Aa2z', 'Aa2w']],

[['Ab1x', 'Ab1y', 'Ab1z', 'Ab1w'],
['Ab2x', 'Ab2y', 'Ab2z', 'Ab2w']]],


[[['Ba1x', 'Ba1y', 'Ba1z', 'Ba1w'],
['Ba2x', 'Ba2y', 'Ba2z', 'Ba2w']],

[['Bb1x', 'Bb1y', 'Bb1z', 'Bb1w'],
['Bb2x', 'Bb2y', 'Bb2z', 'Bb2w']]]], dtype=object)
>>> N_samples = 3
>>>
>>> idx = np.random.randint(0, data.shape[-1], (N_samples, *data.shape))
>>> _, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *data.shape[:-1], 0)))]
>>>
>>> resampled = data[I, J, K, idx]
>>> res
ResourceWarning resampled
>>> resampled
array([[[[['Aa1z', 'Aa1y', 'Aa1y', 'Aa1x'],
['Aa2y', 'Aa2z', 'Aa2z', 'Aa2z']],

[['Ab1w', 'Ab1z', 'Ab1y', 'Ab1x'],
['Ab2y', 'Ab2w', 'Ab2y', 'Ab2w']]],


[[['Ba1z', 'Ba1y', 'Ba1y', 'Ba1x'],
['Ba2x', 'Ba2x', 'Ba2z', 'Ba2x']],

[['Bb1x', 'Bb1x', 'Bb1y', 'Bb1z'],
['Bb2y', 'Bb2w', 'Bb2y', 'Bb2z']]]],



[[[['Aa1x', 'Aa1w', 'Aa1x', 'Aa1z'],
['Aa2y', 'Aa2y', 'Aa2x', 'Aa2z']],

[['Ab1y', 'Ab1x', 'Ab1w', 'Ab1z'],
['Ab2w', 'Ab2x', 'Ab2w', 'Ab2w']]],


[[['Ba1x', 'Ba1z', 'Ba1x', 'Ba1z'],
['Ba2x', 'Ba2y', 'Ba2y', 'Ba2w']],

[['Bb1z', 'Bb1w', 'Bb1y', 'Bb1w'],
['Bb2w', 'Bb2x', 'Bb2w', 'Bb2z']]]],



[[[['Aa1w', 'Aa1w', 'Aa1w', 'Aa1y'],
['Aa2z', 'Aa2x', 'Aa2y', 'Aa2x']],

[['Ab1z', 'Ab1z', 'Ab1x', 'Ab1y'],
['Ab2w', 'Ab2x', 'Ab2x', 'Ab2y']]],


[[['Ba1w', 'Ba1x', 'Ba1y', 'Ba1y'],
['Ba2z', 'Ba2x', 'Ba2x', 'Ba2x']],

[['Bb1z', 'Bb1w', 'Bb1x', 'Bb1x'],
['Bb2z', 'Bb2x', 'Bb2w', 'Bb2z']]]]], dtype=object)





share|improve this answer






















  • Thank you! Could you please explain if this is different/better than what I posted and why? Specially with regards to cross-correlations. Looks neat!
    – maximusdooku
    Nov 10 at 5:46










  • I am not sure I understand this code unfortunately. Is it doing "global" sample with replacement in all events? Because the sampling needs to happen in each 100 event array...
    – maximusdooku
    Nov 10 at 5:49











  • @maximusdooku I've added a small example, so you can check whether it does what you need. The advantage of this method is that it avoids a python level loop. Let me know whether it works for you.
    – Paul Panzer
    Nov 10 at 6:00










  • I am going through every step. But the shape of data changes from (2, 2, 2, 4) to (2, 2, 2, 6). But I need to generate 6 samples of (2,2,2,4) for example...Perhaps I am not understanding the code because of fancy indexing...
    – maximusdooku
    Nov 10 at 6:29










  • @maximusdooku think about it this way: given N_samples=12, Paul's answer would produce the same result as if you took 3 samples of shape (2, 2, 2, 4) and then just stuck them together along the last axis. Either approach (one array of shape (2,2,2,N*4) vs N arrays of shape (2,2,2,4)) can produce a correct bootstrap. In any case, I think this answer is a good one. It's similar to hpaulj's, but without the flaw in the math
    – tel
    Nov 10 at 6:38


















up vote
0
down vote













databoot = 
for i in range(5):
idx = np.random.choice(100, 100)
databoot.append(data[:,:,:,idx])


  • shape of databoot -> (5, 10, 11, 50, 100)

  • shape of data -> (10, 11, 50, 100)





share|improve this answer




















  • @hpaulj came up with a nicely efficient/parsimonious answer, but I don't think it's 100% mathematically correct. Basically, if there's any correlations between the events in different (categories, models, types), the above answer may bias your resampled array, as well as the results from your bootstrap
    – tel
    Nov 10 at 5:19










  • @tel I would love to see a more correct answer...If you know how to, please do post.
    – maximusdooku
    Nov 10 at 5:20











Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53236040%2fhow-can-i-bootstrap-the-innermost-array-of-a-numpy-array%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























3 Answers
3






active

oldest

votes








3 Answers
3






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
1
down vote



accepted










The fastest/simplest answer turns out to be based on indexing a flattened version of your array:



def resampFlat(arr, reps):
n = arr.shape[-1]

# create an array to shift random indexes as needed
shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)

# get a flat view of the array
arrflat = arr.ravel()
# sample the array by generating random ints and shifting them appropriately
return np.array([arrflat[np.random.randint(0, n, arr.shape) + shift]
for i in range(reps)])


Timings confirm that this is the fastest answer.



Timings



I tested out the above resampFlat function alongside a simpler for loop based solution:



def resampFor(arr, reps):
# store the shape for the return value
shape = arr.shape
# flatten all dimensions of arr except the last
arr = arr.reshape(-1, arr.shape[-1])
# preallocate the return value
ret = np.empty((reps, *arr.shape), dtype=arr.dtype)
# generate the indices of the resampled values
idxs = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))

for rep,idx in zip(ret, idxs):
# iterate over the resampled replicates
for row,rowrep,i in zip(arr, rep, idx):
# iterate over the event arrays within a replicate
rowrep[...] = row[i]

# give the return value the appropriate shape
return ret.reshape((reps, *shape))


and a solution based on Paul Panzer's fancy indexing approach:



def resampFancyIdx(arr, reps):
idx = np.random.randint(0, arr.shape[-1], (reps, *data.shape))
_, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *arr.shape[:-1], 0)))]

return arr[I, J, K, idx]


I tested with the following data:



shape = ((10, 11, 50, 100))
data = np.arange(np.prod(shape)).reshape(shape)


Here's the results from the array flattening approach:



%%timeit
resampFlat(data, 100)

1.25 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


the results from the for loop approach:



%%timeit
resampFor(data, 100)

1.66 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


and from Paul's fancy indexing:



%%timeit
resampFancyIdx(data, 100)

1.42 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Contrary to my expectations, resampFancyIdx beat resampFor, and I actually had to work fairly hard to come up with something better. At this point I would really like a better explanation of how fancy indexing works at the C-level, and why it's so performant.






share|improve this answer






















  • Thank you. Several new concepts in your code for me. Prob will take a while to understand everything. Though the shape looks good.
    – maximusdooku
    Nov 10 at 6:33










  • @maximusdooku my first answer was a little hacky, so I came up with something better. Hopefully it'll be a little easier to understand, too
    – tel
    Nov 10 at 13:09














up vote
1
down vote



accepted










The fastest/simplest answer turns out to be based on indexing a flattened version of your array:



def resampFlat(arr, reps):
n = arr.shape[-1]

# create an array to shift random indexes as needed
shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)

# get a flat view of the array
arrflat = arr.ravel()
# sample the array by generating random ints and shifting them appropriately
return np.array([arrflat[np.random.randint(0, n, arr.shape) + shift]
for i in range(reps)])


Timings confirm that this is the fastest answer.



Timings



I tested out the above resampFlat function alongside a simpler for loop based solution:



def resampFor(arr, reps):
# store the shape for the return value
shape = arr.shape
# flatten all dimensions of arr except the last
arr = arr.reshape(-1, arr.shape[-1])
# preallocate the return value
ret = np.empty((reps, *arr.shape), dtype=arr.dtype)
# generate the indices of the resampled values
idxs = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))

for rep,idx in zip(ret, idxs):
# iterate over the resampled replicates
for row,rowrep,i in zip(arr, rep, idx):
# iterate over the event arrays within a replicate
rowrep[...] = row[i]

# give the return value the appropriate shape
return ret.reshape((reps, *shape))


and a solution based on Paul Panzer's fancy indexing approach:



def resampFancyIdx(arr, reps):
idx = np.random.randint(0, arr.shape[-1], (reps, *data.shape))
_, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *arr.shape[:-1], 0)))]

return arr[I, J, K, idx]


I tested with the following data:



shape = ((10, 11, 50, 100))
data = np.arange(np.prod(shape)).reshape(shape)


Here's the results from the array flattening approach:



%%timeit
resampFlat(data, 100)

1.25 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


the results from the for loop approach:



%%timeit
resampFor(data, 100)

1.66 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


and from Paul's fancy indexing:



%%timeit
resampFancyIdx(data, 100)

1.42 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Contrary to my expectations, resampFancyIdx beat resampFor, and I actually had to work fairly hard to come up with something better. At this point I would really like a better explanation of how fancy indexing works at the C-level, and why it's so performant.






share|improve this answer






















  • Thank you. Several new concepts in your code for me. Prob will take a while to understand everything. Though the shape looks good.
    – maximusdooku
    Nov 10 at 6:33










  • @maximusdooku my first answer was a little hacky, so I came up with something better. Hopefully it'll be a little easier to understand, too
    – tel
    Nov 10 at 13:09












up vote
1
down vote



accepted







up vote
1
down vote



accepted






The fastest/simplest answer turns out to be based on indexing a flattened version of your array:



def resampFlat(arr, reps):
n = arr.shape[-1]

# create an array to shift random indexes as needed
shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)

# get a flat view of the array
arrflat = arr.ravel()
# sample the array by generating random ints and shifting them appropriately
return np.array([arrflat[np.random.randint(0, n, arr.shape) + shift]
for i in range(reps)])


Timings confirm that this is the fastest answer.



Timings



I tested out the above resampFlat function alongside a simpler for loop based solution:



def resampFor(arr, reps):
# store the shape for the return value
shape = arr.shape
# flatten all dimensions of arr except the last
arr = arr.reshape(-1, arr.shape[-1])
# preallocate the return value
ret = np.empty((reps, *arr.shape), dtype=arr.dtype)
# generate the indices of the resampled values
idxs = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))

for rep,idx in zip(ret, idxs):
# iterate over the resampled replicates
for row,rowrep,i in zip(arr, rep, idx):
# iterate over the event arrays within a replicate
rowrep[...] = row[i]

# give the return value the appropriate shape
return ret.reshape((reps, *shape))


and a solution based on Paul Panzer's fancy indexing approach:



def resampFancyIdx(arr, reps):
idx = np.random.randint(0, arr.shape[-1], (reps, *data.shape))
_, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *arr.shape[:-1], 0)))]

return arr[I, J, K, idx]


I tested with the following data:



shape = ((10, 11, 50, 100))
data = np.arange(np.prod(shape)).reshape(shape)


Here's the results from the array flattening approach:



%%timeit
resampFlat(data, 100)

1.25 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


the results from the for loop approach:



%%timeit
resampFor(data, 100)

1.66 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


and from Paul's fancy indexing:



%%timeit
resampFancyIdx(data, 100)

1.42 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Contrary to my expectations, resampFancyIdx beat resampFor, and I actually had to work fairly hard to come up with something better. At this point I would really like a better explanation of how fancy indexing works at the C-level, and why it's so performant.






share|improve this answer














The fastest/simplest answer turns out to be based on indexing a flattened version of your array:



def resampFlat(arr, reps):
n = arr.shape[-1]

# create an array to shift random indexes as needed
shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)

# get a flat view of the array
arrflat = arr.ravel()
# sample the array by generating random ints and shifting them appropriately
return np.array([arrflat[np.random.randint(0, n, arr.shape) + shift]
for i in range(reps)])


Timings confirm that this is the fastest answer.



Timings



I tested out the above resampFlat function alongside a simpler for loop based solution:



def resampFor(arr, reps):
# store the shape for the return value
shape = arr.shape
# flatten all dimensions of arr except the last
arr = arr.reshape(-1, arr.shape[-1])
# preallocate the return value
ret = np.empty((reps, *arr.shape), dtype=arr.dtype)
# generate the indices of the resampled values
idxs = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))

for rep,idx in zip(ret, idxs):
# iterate over the resampled replicates
for row,rowrep,i in zip(arr, rep, idx):
# iterate over the event arrays within a replicate
rowrep[...] = row[i]

# give the return value the appropriate shape
return ret.reshape((reps, *shape))


and a solution based on Paul Panzer's fancy indexing approach:



def resampFancyIdx(arr, reps):
idx = np.random.randint(0, arr.shape[-1], (reps, *data.shape))
_, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *arr.shape[:-1], 0)))]

return arr[I, J, K, idx]


I tested with the following data:



shape = ((10, 11, 50, 100))
data = np.arange(np.prod(shape)).reshape(shape)


Here's the results from the array flattening approach:



%%timeit
resampFlat(data, 100)

1.25 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


the results from the for loop approach:



%%timeit
resampFor(data, 100)

1.66 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


and from Paul's fancy indexing:



%%timeit
resampFancyIdx(data, 100)

1.42 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Contrary to my expectations, resampFancyIdx beat resampFor, and I actually had to work fairly hard to come up with something better. At this point I would really like a better explanation of how fancy indexing works at the C-level, and why it's so performant.







share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 10 at 13:02

























answered Nov 10 at 5:32









tel

3,5711428




3,5711428











  • Thank you. Several new concepts in your code for me. Prob will take a while to understand everything. Though the shape looks good.
    – maximusdooku
    Nov 10 at 6:33










  • @maximusdooku my first answer was a little hacky, so I came up with something better. Hopefully it'll be a little easier to understand, too
    – tel
    Nov 10 at 13:09
















  • Thank you. Several new concepts in your code for me. Prob will take a while to understand everything. Though the shape looks good.
    – maximusdooku
    Nov 10 at 6:33










  • @maximusdooku my first answer was a little hacky, so I came up with something better. Hopefully it'll be a little easier to understand, too
    – tel
    Nov 10 at 13:09















Thank you. Several new concepts in your code for me. Prob will take a while to understand everything. Though the shape looks good.
– maximusdooku
Nov 10 at 6:33




Thank you. Several new concepts in your code for me. Prob will take a while to understand everything. Though the shape looks good.
– maximusdooku
Nov 10 at 6:33












@maximusdooku my first answer was a little hacky, so I came up with something better. Hopefully it'll be a little easier to understand, too
– tel
Nov 10 at 13:09




@maximusdooku my first answer was a little hacky, so I came up with something better. Hopefully it'll be a little easier to understand, too
– tel
Nov 10 at 13:09












up vote
1
down vote













You can draw the indices of your samples and then apply fancy indexing:



>>> import numpy as np
>>>
>>> (categories, models, types, events) = (10, 11, 50, 100)
>>> data = np.random.random((categories, models, types, events))
>>> N_samples = 1000
>>>
>>> idx = np.random.randint(0, events, (categories, models, types, N_samples))
>>> I, J, K, _ = np.ogrid[:categories, :models, :types, :0]
>>>
>>> resampled = data[I, J, K, idx]


A small explicit example for concreteness. The fields are labeled with "category" (A or B), "model" (a or b) and "type" (1 or 2) to make it easy to verify that sampling does preserve these.



>>> I, J, K, L = np.ix_(*(np.array(list(x), 'O') for x in ('AB', 'ab', '12', 'xyzw')))
>>> data = I+J+K+L
>>> data
array([[[['Aa1x', 'Aa1y', 'Aa1z', 'Aa1w'],
['Aa2x', 'Aa2y', 'Aa2z', 'Aa2w']],

[['Ab1x', 'Ab1y', 'Ab1z', 'Ab1w'],
['Ab2x', 'Ab2y', 'Ab2z', 'Ab2w']]],


[[['Ba1x', 'Ba1y', 'Ba1z', 'Ba1w'],
['Ba2x', 'Ba2y', 'Ba2z', 'Ba2w']],

[['Bb1x', 'Bb1y', 'Bb1z', 'Bb1w'],
['Bb2x', 'Bb2y', 'Bb2z', 'Bb2w']]]], dtype=object)
>>> N_samples = 3
>>>
>>> idx = np.random.randint(0, data.shape[-1], (N_samples, *data.shape))
>>> _, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *data.shape[:-1], 0)))]
>>>
>>> resampled = data[I, J, K, idx]
>>> res
ResourceWarning resampled
>>> resampled
array([[[[['Aa1z', 'Aa1y', 'Aa1y', 'Aa1x'],
['Aa2y', 'Aa2z', 'Aa2z', 'Aa2z']],

[['Ab1w', 'Ab1z', 'Ab1y', 'Ab1x'],
['Ab2y', 'Ab2w', 'Ab2y', 'Ab2w']]],


[[['Ba1z', 'Ba1y', 'Ba1y', 'Ba1x'],
['Ba2x', 'Ba2x', 'Ba2z', 'Ba2x']],

[['Bb1x', 'Bb1x', 'Bb1y', 'Bb1z'],
['Bb2y', 'Bb2w', 'Bb2y', 'Bb2z']]]],



[[[['Aa1x', 'Aa1w', 'Aa1x', 'Aa1z'],
['Aa2y', 'Aa2y', 'Aa2x', 'Aa2z']],

[['Ab1y', 'Ab1x', 'Ab1w', 'Ab1z'],
['Ab2w', 'Ab2x', 'Ab2w', 'Ab2w']]],


[[['Ba1x', 'Ba1z', 'Ba1x', 'Ba1z'],
['Ba2x', 'Ba2y', 'Ba2y', 'Ba2w']],

[['Bb1z', 'Bb1w', 'Bb1y', 'Bb1w'],
['Bb2w', 'Bb2x', 'Bb2w', 'Bb2z']]]],



[[[['Aa1w', 'Aa1w', 'Aa1w', 'Aa1y'],
['Aa2z', 'Aa2x', 'Aa2y', 'Aa2x']],

[['Ab1z', 'Ab1z', 'Ab1x', 'Ab1y'],
['Ab2w', 'Ab2x', 'Ab2x', 'Ab2y']]],


[[['Ba1w', 'Ba1x', 'Ba1y', 'Ba1y'],
['Ba2z', 'Ba2x', 'Ba2x', 'Ba2x']],

[['Bb1z', 'Bb1w', 'Bb1x', 'Bb1x'],
['Bb2z', 'Bb2x', 'Bb2w', 'Bb2z']]]]], dtype=object)





share|improve this answer






















  • Thank you! Could you please explain if this is different/better than what I posted and why? Specially with regards to cross-correlations. Looks neat!
    – maximusdooku
    Nov 10 at 5:46










  • I am not sure I understand this code unfortunately. Is it doing "global" sample with replacement in all events? Because the sampling needs to happen in each 100 event array...
    – maximusdooku
    Nov 10 at 5:49











  • @maximusdooku I've added a small example, so you can check whether it does what you need. The advantage of this method is that it avoids a python level loop. Let me know whether it works for you.
    – Paul Panzer
    Nov 10 at 6:00










  • I am going through every step. But the shape of data changes from (2, 2, 2, 4) to (2, 2, 2, 6). But I need to generate 6 samples of (2,2,2,4) for example...Perhaps I am not understanding the code because of fancy indexing...
    – maximusdooku
    Nov 10 at 6:29










  • @maximusdooku think about it this way: given N_samples=12, Paul's answer would produce the same result as if you took 3 samples of shape (2, 2, 2, 4) and then just stuck them together along the last axis. Either approach (one array of shape (2,2,2,N*4) vs N arrays of shape (2,2,2,4)) can produce a correct bootstrap. In any case, I think this answer is a good one. It's similar to hpaulj's, but without the flaw in the math
    – tel
    Nov 10 at 6:38















up vote
1
down vote













You can draw the indices of your samples and then apply fancy indexing:



>>> import numpy as np
>>>
>>> (categories, models, types, events) = (10, 11, 50, 100)
>>> data = np.random.random((categories, models, types, events))
>>> N_samples = 1000
>>>
>>> idx = np.random.randint(0, events, (categories, models, types, N_samples))
>>> I, J, K, _ = np.ogrid[:categories, :models, :types, :0]
>>>
>>> resampled = data[I, J, K, idx]


A small explicit example for concreteness. The fields are labeled with "category" (A or B), "model" (a or b) and "type" (1 or 2) to make it easy to verify that sampling does preserve these.



>>> I, J, K, L = np.ix_(*(np.array(list(x), 'O') for x in ('AB', 'ab', '12', 'xyzw')))
>>> data = I+J+K+L
>>> data
array([[[['Aa1x', 'Aa1y', 'Aa1z', 'Aa1w'],
['Aa2x', 'Aa2y', 'Aa2z', 'Aa2w']],

[['Ab1x', 'Ab1y', 'Ab1z', 'Ab1w'],
['Ab2x', 'Ab2y', 'Ab2z', 'Ab2w']]],


[[['Ba1x', 'Ba1y', 'Ba1z', 'Ba1w'],
['Ba2x', 'Ba2y', 'Ba2z', 'Ba2w']],

[['Bb1x', 'Bb1y', 'Bb1z', 'Bb1w'],
['Bb2x', 'Bb2y', 'Bb2z', 'Bb2w']]]], dtype=object)
>>> N_samples = 3
>>>
>>> idx = np.random.randint(0, data.shape[-1], (N_samples, *data.shape))
>>> _, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *data.shape[:-1], 0)))]
>>>
>>> resampled = data[I, J, K, idx]
>>> res
ResourceWarning resampled
>>> resampled
array([[[[['Aa1z', 'Aa1y', 'Aa1y', 'Aa1x'],
['Aa2y', 'Aa2z', 'Aa2z', 'Aa2z']],

[['Ab1w', 'Ab1z', 'Ab1y', 'Ab1x'],
['Ab2y', 'Ab2w', 'Ab2y', 'Ab2w']]],


[[['Ba1z', 'Ba1y', 'Ba1y', 'Ba1x'],
['Ba2x', 'Ba2x', 'Ba2z', 'Ba2x']],

[['Bb1x', 'Bb1x', 'Bb1y', 'Bb1z'],
['Bb2y', 'Bb2w', 'Bb2y', 'Bb2z']]]],



[[[['Aa1x', 'Aa1w', 'Aa1x', 'Aa1z'],
['Aa2y', 'Aa2y', 'Aa2x', 'Aa2z']],

[['Ab1y', 'Ab1x', 'Ab1w', 'Ab1z'],
['Ab2w', 'Ab2x', 'Ab2w', 'Ab2w']]],


[[['Ba1x', 'Ba1z', 'Ba1x', 'Ba1z'],
['Ba2x', 'Ba2y', 'Ba2y', 'Ba2w']],

[['Bb1z', 'Bb1w', 'Bb1y', 'Bb1w'],
['Bb2w', 'Bb2x', 'Bb2w', 'Bb2z']]]],



[[[['Aa1w', 'Aa1w', 'Aa1w', 'Aa1y'],
['Aa2z', 'Aa2x', 'Aa2y', 'Aa2x']],

[['Ab1z', 'Ab1z', 'Ab1x', 'Ab1y'],
['Ab2w', 'Ab2x', 'Ab2x', 'Ab2y']]],


[[['Ba1w', 'Ba1x', 'Ba1y', 'Ba1y'],
['Ba2z', 'Ba2x', 'Ba2x', 'Ba2x']],

[['Bb1z', 'Bb1w', 'Bb1x', 'Bb1x'],
['Bb2z', 'Bb2x', 'Bb2w', 'Bb2z']]]]], dtype=object)





share|improve this answer






















  • Thank you! Could you please explain if this is different/better than what I posted and why? Specially with regards to cross-correlations. Looks neat!
    – maximusdooku
    Nov 10 at 5:46










  • I am not sure I understand this code unfortunately. Is it doing "global" sample with replacement in all events? Because the sampling needs to happen in each 100 event array...
    – maximusdooku
    Nov 10 at 5:49











  • @maximusdooku I've added a small example, so you can check whether it does what you need. The advantage of this method is that it avoids a python level loop. Let me know whether it works for you.
    – Paul Panzer
    Nov 10 at 6:00










  • I am going through every step. But the shape of data changes from (2, 2, 2, 4) to (2, 2, 2, 6). But I need to generate 6 samples of (2,2,2,4) for example...Perhaps I am not understanding the code because of fancy indexing...
    – maximusdooku
    Nov 10 at 6:29










  • @maximusdooku think about it this way: given N_samples=12, Paul's answer would produce the same result as if you took 3 samples of shape (2, 2, 2, 4) and then just stuck them together along the last axis. Either approach (one array of shape (2,2,2,N*4) vs N arrays of shape (2,2,2,4)) can produce a correct bootstrap. In any case, I think this answer is a good one. It's similar to hpaulj's, but without the flaw in the math
    – tel
    Nov 10 at 6:38













up vote
1
down vote










up vote
1
down vote









You can draw the indices of your samples and then apply fancy indexing:



>>> import numpy as np
>>>
>>> (categories, models, types, events) = (10, 11, 50, 100)
>>> data = np.random.random((categories, models, types, events))
>>> N_samples = 1000
>>>
>>> idx = np.random.randint(0, events, (categories, models, types, N_samples))
>>> I, J, K, _ = np.ogrid[:categories, :models, :types, :0]
>>>
>>> resampled = data[I, J, K, idx]


A small explicit example for concreteness. The fields are labeled with "category" (A or B), "model" (a or b) and "type" (1 or 2) to make it easy to verify that sampling does preserve these.



>>> I, J, K, L = np.ix_(*(np.array(list(x), 'O') for x in ('AB', 'ab', '12', 'xyzw')))
>>> data = I+J+K+L
>>> data
array([[[['Aa1x', 'Aa1y', 'Aa1z', 'Aa1w'],
['Aa2x', 'Aa2y', 'Aa2z', 'Aa2w']],

[['Ab1x', 'Ab1y', 'Ab1z', 'Ab1w'],
['Ab2x', 'Ab2y', 'Ab2z', 'Ab2w']]],


[[['Ba1x', 'Ba1y', 'Ba1z', 'Ba1w'],
['Ba2x', 'Ba2y', 'Ba2z', 'Ba2w']],

[['Bb1x', 'Bb1y', 'Bb1z', 'Bb1w'],
['Bb2x', 'Bb2y', 'Bb2z', 'Bb2w']]]], dtype=object)
>>> N_samples = 3
>>>
>>> idx = np.random.randint(0, data.shape[-1], (N_samples, *data.shape))
>>> _, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *data.shape[:-1], 0)))]
>>>
>>> resampled = data[I, J, K, idx]
>>> res
ResourceWarning resampled
>>> resampled
array([[[[['Aa1z', 'Aa1y', 'Aa1y', 'Aa1x'],
['Aa2y', 'Aa2z', 'Aa2z', 'Aa2z']],

[['Ab1w', 'Ab1z', 'Ab1y', 'Ab1x'],
['Ab2y', 'Ab2w', 'Ab2y', 'Ab2w']]],


[[['Ba1z', 'Ba1y', 'Ba1y', 'Ba1x'],
['Ba2x', 'Ba2x', 'Ba2z', 'Ba2x']],

[['Bb1x', 'Bb1x', 'Bb1y', 'Bb1z'],
['Bb2y', 'Bb2w', 'Bb2y', 'Bb2z']]]],



[[[['Aa1x', 'Aa1w', 'Aa1x', 'Aa1z'],
['Aa2y', 'Aa2y', 'Aa2x', 'Aa2z']],

[['Ab1y', 'Ab1x', 'Ab1w', 'Ab1z'],
['Ab2w', 'Ab2x', 'Ab2w', 'Ab2w']]],


[[['Ba1x', 'Ba1z', 'Ba1x', 'Ba1z'],
['Ba2x', 'Ba2y', 'Ba2y', 'Ba2w']],

[['Bb1z', 'Bb1w', 'Bb1y', 'Bb1w'],
['Bb2w', 'Bb2x', 'Bb2w', 'Bb2z']]]],



[[[['Aa1w', 'Aa1w', 'Aa1w', 'Aa1y'],
['Aa2z', 'Aa2x', 'Aa2y', 'Aa2x']],

[['Ab1z', 'Ab1z', 'Ab1x', 'Ab1y'],
['Ab2w', 'Ab2x', 'Ab2x', 'Ab2y']]],


[[['Ba1w', 'Ba1x', 'Ba1y', 'Ba1y'],
['Ba2z', 'Ba2x', 'Ba2x', 'Ba2x']],

[['Bb1z', 'Bb1w', 'Bb1x', 'Bb1x'],
['Bb2z', 'Bb2x', 'Bb2w', 'Bb2z']]]]], dtype=object)





share|improve this answer














You can draw the indices of your samples and then apply fancy indexing:



>>> import numpy as np
>>>
>>> (categories, models, types, events) = (10, 11, 50, 100)
>>> data = np.random.random((categories, models, types, events))
>>> N_samples = 1000
>>>
>>> idx = np.random.randint(0, events, (categories, models, types, N_samples))
>>> I, J, K, _ = np.ogrid[:categories, :models, :types, :0]
>>>
>>> resampled = data[I, J, K, idx]


A small explicit example for concreteness. The fields are labeled with "category" (A or B), "model" (a or b) and "type" (1 or 2) to make it easy to verify that sampling does preserve these.



>>> I, J, K, L = np.ix_(*(np.array(list(x), 'O') for x in ('AB', 'ab', '12', 'xyzw')))
>>> data = I+J+K+L
>>> data
array([[[['Aa1x', 'Aa1y', 'Aa1z', 'Aa1w'],
['Aa2x', 'Aa2y', 'Aa2z', 'Aa2w']],

[['Ab1x', 'Ab1y', 'Ab1z', 'Ab1w'],
['Ab2x', 'Ab2y', 'Ab2z', 'Ab2w']]],


[[['Ba1x', 'Ba1y', 'Ba1z', 'Ba1w'],
['Ba2x', 'Ba2y', 'Ba2z', 'Ba2w']],

[['Bb1x', 'Bb1y', 'Bb1z', 'Bb1w'],
['Bb2x', 'Bb2y', 'Bb2z', 'Bb2w']]]], dtype=object)
>>> N_samples = 3
>>>
>>> idx = np.random.randint(0, data.shape[-1], (N_samples, *data.shape))
>>> _, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *data.shape[:-1], 0)))]
>>>
>>> resampled = data[I, J, K, idx]
>>> res
ResourceWarning resampled
>>> resampled
array([[[[['Aa1z', 'Aa1y', 'Aa1y', 'Aa1x'],
['Aa2y', 'Aa2z', 'Aa2z', 'Aa2z']],

[['Ab1w', 'Ab1z', 'Ab1y', 'Ab1x'],
['Ab2y', 'Ab2w', 'Ab2y', 'Ab2w']]],


[[['Ba1z', 'Ba1y', 'Ba1y', 'Ba1x'],
['Ba2x', 'Ba2x', 'Ba2z', 'Ba2x']],

[['Bb1x', 'Bb1x', 'Bb1y', 'Bb1z'],
['Bb2y', 'Bb2w', 'Bb2y', 'Bb2z']]]],



[[[['Aa1x', 'Aa1w', 'Aa1x', 'Aa1z'],
['Aa2y', 'Aa2y', 'Aa2x', 'Aa2z']],

[['Ab1y', 'Ab1x', 'Ab1w', 'Ab1z'],
['Ab2w', 'Ab2x', 'Ab2w', 'Ab2w']]],


[[['Ba1x', 'Ba1z', 'Ba1x', 'Ba1z'],
['Ba2x', 'Ba2y', 'Ba2y', 'Ba2w']],

[['Bb1z', 'Bb1w', 'Bb1y', 'Bb1w'],
['Bb2w', 'Bb2x', 'Bb2w', 'Bb2z']]]],



[[[['Aa1w', 'Aa1w', 'Aa1w', 'Aa1y'],
['Aa2z', 'Aa2x', 'Aa2y', 'Aa2x']],

[['Ab1z', 'Ab1z', 'Ab1x', 'Ab1y'],
['Ab2w', 'Ab2x', 'Ab2x', 'Ab2y']]],


[[['Ba1w', 'Ba1x', 'Ba1y', 'Ba1y'],
['Ba2z', 'Ba2x', 'Ba2x', 'Ba2x']],

[['Bb1z', 'Bb1w', 'Bb1x', 'Bb1x'],
['Bb2z', 'Bb2x', 'Bb2w', 'Bb2z']]]]], dtype=object)






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 10 at 7:13

























answered Nov 10 at 5:41









Paul Panzer

28.9k21240




28.9k21240











  • Thank you! Could you please explain if this is different/better than what I posted and why? Specially with regards to cross-correlations. Looks neat!
    – maximusdooku
    Nov 10 at 5:46










  • I am not sure I understand this code unfortunately. Is it doing "global" sample with replacement in all events? Because the sampling needs to happen in each 100 event array...
    – maximusdooku
    Nov 10 at 5:49











  • @maximusdooku I've added a small example, so you can check whether it does what you need. The advantage of this method is that it avoids a python level loop. Let me know whether it works for you.
    – Paul Panzer
    Nov 10 at 6:00










  • I am going through every step. But the shape of data changes from (2, 2, 2, 4) to (2, 2, 2, 6). But I need to generate 6 samples of (2,2,2,4) for example...Perhaps I am not understanding the code because of fancy indexing...
    – maximusdooku
    Nov 10 at 6:29










  • @maximusdooku think about it this way: given N_samples=12, Paul's answer would produce the same result as if you took 3 samples of shape (2, 2, 2, 4) and then just stuck them together along the last axis. Either approach (one array of shape (2,2,2,N*4) vs N arrays of shape (2,2,2,4)) can produce a correct bootstrap. In any case, I think this answer is a good one. It's similar to hpaulj's, but without the flaw in the math
    – tel
    Nov 10 at 6:38

















  • Thank you! Could you please explain if this is different/better than what I posted and why? Specially with regards to cross-correlations. Looks neat!
    – maximusdooku
    Nov 10 at 5:46










  • I am not sure I understand this code unfortunately. Is it doing "global" sample with replacement in all events? Because the sampling needs to happen in each 100 event array...
    – maximusdooku
    Nov 10 at 5:49











  • @maximusdooku I've added a small example, so you can check whether it does what you need. The advantage of this method is that it avoids a python level loop. Let me know whether it works for you.
    – Paul Panzer
    Nov 10 at 6:00










  • I am going through every step. But the shape of data changes from (2, 2, 2, 4) to (2, 2, 2, 6). But I need to generate 6 samples of (2,2,2,4) for example...Perhaps I am not understanding the code because of fancy indexing...
    – maximusdooku
    Nov 10 at 6:29










  • @maximusdooku think about it this way: given N_samples=12, Paul's answer would produce the same result as if you took 3 samples of shape (2, 2, 2, 4) and then just stuck them together along the last axis. Either approach (one array of shape (2,2,2,N*4) vs N arrays of shape (2,2,2,4)) can produce a correct bootstrap. In any case, I think this answer is a good one. It's similar to hpaulj's, but without the flaw in the math
    – tel
    Nov 10 at 6:38
















Thank you! Could you please explain if this is different/better than what I posted and why? Specially with regards to cross-correlations. Looks neat!
– maximusdooku
Nov 10 at 5:46




Thank you! Could you please explain if this is different/better than what I posted and why? Specially with regards to cross-correlations. Looks neat!
– maximusdooku
Nov 10 at 5:46












I am not sure I understand this code unfortunately. Is it doing "global" sample with replacement in all events? Because the sampling needs to happen in each 100 event array...
– maximusdooku
Nov 10 at 5:49





I am not sure I understand this code unfortunately. Is it doing "global" sample with replacement in all events? Because the sampling needs to happen in each 100 event array...
– maximusdooku
Nov 10 at 5:49













@maximusdooku I've added a small example, so you can check whether it does what you need. The advantage of this method is that it avoids a python level loop. Let me know whether it works for you.
– Paul Panzer
Nov 10 at 6:00




@maximusdooku I've added a small example, so you can check whether it does what you need. The advantage of this method is that it avoids a python level loop. Let me know whether it works for you.
– Paul Panzer
Nov 10 at 6:00












I am going through every step. But the shape of data changes from (2, 2, 2, 4) to (2, 2, 2, 6). But I need to generate 6 samples of (2,2,2,4) for example...Perhaps I am not understanding the code because of fancy indexing...
– maximusdooku
Nov 10 at 6:29




I am going through every step. But the shape of data changes from (2, 2, 2, 4) to (2, 2, 2, 6). But I need to generate 6 samples of (2,2,2,4) for example...Perhaps I am not understanding the code because of fancy indexing...
– maximusdooku
Nov 10 at 6:29












@maximusdooku think about it this way: given N_samples=12, Paul's answer would produce the same result as if you took 3 samples of shape (2, 2, 2, 4) and then just stuck them together along the last axis. Either approach (one array of shape (2,2,2,N*4) vs N arrays of shape (2,2,2,4)) can produce a correct bootstrap. In any case, I think this answer is a good one. It's similar to hpaulj's, but without the flaw in the math
– tel
Nov 10 at 6:38





@maximusdooku think about it this way: given N_samples=12, Paul's answer would produce the same result as if you took 3 samples of shape (2, 2, 2, 4) and then just stuck them together along the last axis. Either approach (one array of shape (2,2,2,N*4) vs N arrays of shape (2,2,2,4)) can produce a correct bootstrap. In any case, I think this answer is a good one. It's similar to hpaulj's, but without the flaw in the math
– tel
Nov 10 at 6:38











up vote
0
down vote













databoot = 
for i in range(5):
idx = np.random.choice(100, 100)
databoot.append(data[:,:,:,idx])


  • shape of databoot -> (5, 10, 11, 50, 100)

  • shape of data -> (10, 11, 50, 100)





share|improve this answer




















  • @hpaulj came up with a nicely efficient/parsimonious answer, but I don't think it's 100% mathematically correct. Basically, if there's any correlations between the events in different (categories, models, types), the above answer may bias your resampled array, as well as the results from your bootstrap
    – tel
    Nov 10 at 5:19










  • @tel I would love to see a more correct answer...If you know how to, please do post.
    – maximusdooku
    Nov 10 at 5:20















up vote
0
down vote













databoot = 
for i in range(5):
idx = np.random.choice(100, 100)
databoot.append(data[:,:,:,idx])


  • shape of databoot -> (5, 10, 11, 50, 100)

  • shape of data -> (10, 11, 50, 100)





share|improve this answer




















  • @hpaulj came up with a nicely efficient/parsimonious answer, but I don't think it's 100% mathematically correct. Basically, if there's any correlations between the events in different (categories, models, types), the above answer may bias your resampled array, as well as the results from your bootstrap
    – tel
    Nov 10 at 5:19










  • @tel I would love to see a more correct answer...If you know how to, please do post.
    – maximusdooku
    Nov 10 at 5:20













up vote
0
down vote










up vote
0
down vote









databoot = 
for i in range(5):
idx = np.random.choice(100, 100)
databoot.append(data[:,:,:,idx])


  • shape of databoot -> (5, 10, 11, 50, 100)

  • shape of data -> (10, 11, 50, 100)





share|improve this answer












databoot = 
for i in range(5):
idx = np.random.choice(100, 100)
databoot.append(data[:,:,:,idx])


  • shape of databoot -> (5, 10, 11, 50, 100)

  • shape of data -> (10, 11, 50, 100)






share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 10 at 5:12









maximusdooku

1,38821343




1,38821343











  • @hpaulj came up with a nicely efficient/parsimonious answer, but I don't think it's 100% mathematically correct. Basically, if there's any correlations between the events in different (categories, models, types), the above answer may bias your resampled array, as well as the results from your bootstrap
    – tel
    Nov 10 at 5:19










  • @tel I would love to see a more correct answer...If you know how to, please do post.
    – maximusdooku
    Nov 10 at 5:20

















  • @hpaulj came up with a nicely efficient/parsimonious answer, but I don't think it's 100% mathematically correct. Basically, if there's any correlations between the events in different (categories, models, types), the above answer may bias your resampled array, as well as the results from your bootstrap
    – tel
    Nov 10 at 5:19










  • @tel I would love to see a more correct answer...If you know how to, please do post.
    – maximusdooku
    Nov 10 at 5:20
















@hpaulj came up with a nicely efficient/parsimonious answer, but I don't think it's 100% mathematically correct. Basically, if there's any correlations between the events in different (categories, models, types), the above answer may bias your resampled array, as well as the results from your bootstrap
– tel
Nov 10 at 5:19




@hpaulj came up with a nicely efficient/parsimonious answer, but I don't think it's 100% mathematically correct. Basically, if there's any correlations between the events in different (categories, models, types), the above answer may bias your resampled array, as well as the results from your bootstrap
– tel
Nov 10 at 5:19












@tel I would love to see a more correct answer...If you know how to, please do post.
– maximusdooku
Nov 10 at 5:20





@tel I would love to see a more correct answer...If you know how to, please do post.
– maximusdooku
Nov 10 at 5:20


















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53236040%2fhow-can-i-bootstrap-the-innermost-array-of-a-numpy-array%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

How to how show current date and time by default on contact form 7 in WordPress without taking input from user in datetimepicker

Syphilis

Darth Vader #20