Efficiently apply sample() in R









I need to sample an outcome variable given a matrix with row-wise outcome probabilities.



set.seed(1010) #reproducibility

# create a matrix of probabilities:
# three possible outcomes, 10,000 cases
probabilities <- matrix(runif(10000*3),nrow=10000,ncol=3)
probabilities <- probabilities / Matrix::rowSums(probabilities)


The fastest way I could come up with is a combination of apply() and sample().



#row-wise sampling using these probabilities
classification <- apply(probabilities, 1, function(x) sample(1:3, 1, prob = x))


However, this step is the computational bottleneck in what I'm doing. Is there a way to speed this code up, or to sample more efficiently?



Thanks!










  • A solution using Rcpp (cran.r-project.org/web/packages/Rcpp/index.html) is probably the best option you have.
    – RLave
    Nov 9 at 11:02










  • Other useful information might be here: gallery.rcpp.org/articles/…
    – RLave
    Nov 9 at 12:08














r apply probability sample






asked Nov 7 at 10:50 by Mr. Zen











3 Answers

Accepted answer (+50 bounty)










RLave's comment that Rcpp could be the way to go is spot on (you also need RcppArmadillo for sample()); I used the following C++ code to create such a function:



// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadilloExtensions/sample.h>

using namespace Rcpp;

// Draw one element of choice_set per row of x, using that row of x as the
// sampling probabilities.
// [[Rcpp::export]]
IntegerVector sample_matrix(NumericMatrix x, IntegerVector choice_set) {
    int n = x.nrow();
    IntegerVector result(n);
    for ( int i = 0; i < n; ++i ) {
        result[i] = RcppArmadillo::sample(choice_set, 1, false, x(i, _))[0];
    }
    return result;
}



I then made that function available in my R session via



Rcpp::sourceCpp("sample_matrix.cpp")


Now we can test it in R against your initial approach, as well as the other suggestions to use purrr::map() and lapply():



set.seed(1010) # reproducibility

# create a matrix of probabilities:
# three possible outcomes, 10,000 cases
probabilities <- matrix(runif(10000*3), nrow=10000, ncol=3)
probabilities <- probabilities / Matrix::rowSums(probabilities)
probabilities_list <- split(probabilities, seq(nrow(probabilities)))

library(purrr)
library(microbenchmark)

microbenchmark(
  apply = apply(probabilities, 1, function(x) sample(1:3, 1, prob = x)),
  map = map(probabilities_list, function(x) sample(1:3, 1, prob = x)),
  lapply = lapply(probabilities_list, function(x) sample(1:3, 1, prob = x)),
  rcpp = sample_matrix(probabilities, 1:3),
  times = 100
)

Unit: milliseconds
   expr       min        lq      mean    median        uq       max neval cld
  apply 307.44702 321.30051 339.85403 342.36421 350.86090 434.56007   100   c
    map 254.69721 265.10187 282.85592 286.21680 295.48886 363.95898   100  b
 lapply 249.68224 259.70178 280.63066 279.87273 287.10062 691.21359   100  b
   rcpp  12.16787  12.55429  13.47837  13.81601  14.25198  16.84859   100 a


The time savings are considerable: by median time, the Rcpp version is roughly 20 to 25 times faster than the R-level loops.
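If compiling C++ is not an option, the row-wise draw can also be vectorized in base R. The following is only a minimal sketch and was not part of the benchmark above: sample_rows_vectorized() is a hypothetical helper name, and it swaps in inverse-CDF sampling (one uniform draw per row, compared against the running row-wise cumulative probabilities) instead of calling sample(). Because it consumes random numbers differently, it will not reproduce the exact draws of sample() for a given seed, only the same distribution.

```r
# Hypothetical helper (not from the answer above): draw one category per row
# of a probability matrix via inverse-CDF sampling. There is no R-level loop
# over rows; the loop below runs over the (few) columns only.
sample_rows_vectorized <- function(prob) {
  n <- nrow(prob)
  k <- ncol(prob)
  # one uniform draw per row, scaled so rows need not sum exactly to 1
  u <- runif(n) * rowSums(prob)
  draw <- rep.int(1L, n)
  running <- prob[, 1]  # cumulative probability of categories 1..j
  for (j in seq_len(k - 1L)) {
    # rows whose draw exceeds the cumulative probability move on to j + 1
    draw <- draw + (u > running)
    running <- running + prob[, j + 1L]
  }
  draw
}

set.seed(1010)
probabilities <- matrix(runif(10000 * 3), nrow = 10000, ncol = 3)
# base rowSums() is enough for an ordinary matrix
probabilities <- probabilities / rowSums(probabilities)

classification_vec <- sample_rows_vectorized(probabilities)

# observed category shares should be close to the average row probabilities
rbind(observed = tabulate(classification_vec, nbins = 3) / nrow(probabilities),
      expected = colMeans(probabilities))
```

The number of categories is taken from ncol(prob), so the same function should work for any number of states without changes.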






  • This looks extremely promising, thanks a lot. Is it possible to rewrite the Rcpp function to be dynamic with respect to the number of states, i.e. without having to rewrite it as IntegerVector::create(1, 2, 3, 4) for 4 states and so on? I don't know any C++, so please bear with me.
    – Mr. Zen
    Nov 9 at 13:02






  • @Mr.Zen Sure! I'll edit shortly to include that feature.
    – duckmayr
    Nov 9 at 13:03










  • @Mr.Zen Updated; now the choice set is an argument to the function (like in R's sample()). You can see the performance boost is still there, but it now has the flexibility you want.
    – duckmayr
    Nov 9 at 13:12










  • Thank you very much! I will award the bounty when I'm eligible to do so in ~21 hours.
    – Mr. Zen
    Nov 9 at 13:16

















If you are willing to put the probabilities in a list, purrr::map() or lapply() seem a little faster:



probabilities <- matrix(runif(10000*3),nrow=10000,ncol=3)
probabilities <- probabilities / Matrix::rowSums(probabilities)
probabilities_list <- split(probabilities, seq(nrow(probabilities)))

library(purrr)
set.seed(1010)
classification_list <- map(probabilities_list, function(x) sample(1:3, 1, prob = x))

set.seed(1010)
classification_list <- lapply(probabilities_list, function(x) sample(1:3, 1, prob = x))
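Note that both calls return a list, whereas apply() in the question returns an integer vector. A small sketch, assuming the result is needed as a plain vector: flatten the list with unlist(). Since split() keeps the rows in their original order and sample() is called once per row in both versions, the flattened result should match the apply() result for the same seed. The object names classification_apply and classification_flat are illustrative.

```r
set.seed(1010)
probabilities <- matrix(runif(10000 * 3), nrow = 10000, ncol = 3)
# base rowSums() is enough for an ordinary matrix
probabilities <- probabilities / rowSums(probabilities)
probabilities_list <- split(probabilities, seq(nrow(probabilities)))

set.seed(1010)
classification_apply <- apply(probabilities, 1, function(x) sample(1:3, 1, prob = x))

set.seed(1010)
classification_list <- lapply(probabilities_list, function(x) sample(1:3, 1, prob = x))
# flatten the list of length-one results into an unnamed integer vector
classification_flat <- unlist(classification_list, use.names = FALSE)

# compare the flattened list result with the apply() result
identical(classification_apply, classification_flat)
```

With purrr, map_int() returns an integer vector directly, which avoids the extra unlist() step.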


Benchmarking:



microbenchmark::microbenchmark(
  apply = apply(probabilities, 1, function(x) sample(1:3, 1, prob = x)),
  map = map(probabilities_list, function(x) sample(1:3, 1, prob = x)),
  lapply = lapply(probabilities_list, function(x) sample(1:3, 1, prob = x)),
  times = 100
)
# Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval
#   apply 39.92883 42.59249 48.39247 45.03080 47.86648 94.39828   100
#     map 35.54077 37.13866 42.19719 39.95046 41.56323 66.05167   100
#  lapply 34.54861 36.48664 42.69512 39.20139 52.31494 59.29200   100


With 100,000 cases:

# Unit: milliseconds
#    expr      min       lq     mean   median       uq      max neval
#   apply 457.5310 520.4926 572.5974 552.1674 611.5640 957.3997   100
#     map 391.4751 457.7326 488.3286 482.1459 512.2054 899.1380   100
#  lapply 386.2698 443.6732 491.9957 475.4160 507.3677 868.6725   100





You can consider

• vapply() and
• parallelization: parallel::parApply()

With your probabilities matrix:



set.seed(1010) # reproducibility

# create a matrix of probabilities:
# three possible outcomes, 10,000 cases
probabilities <- matrix(runif(10000*3), nrow=10000, ncol=3)
probabilities <- probabilities / Matrix::rowSums(probabilities)
classification <- apply(probabilities, 1, function(x) sample(1:3, 1, prob = x))


vapply

vapply() requires the type and length of each result to be declared via FUN.VALUE, which might make it faster than apply():

classification2 <- vapply(split(probabilities, 1:nrow(probabilities)),
                          function(x) sample(1:3, 1, prob = x),
                          FUN.VALUE = integer(1), USE.NAMES = FALSE)
head(classification2)
#> [1] 1 3 3 1 2 3


parallel package

benchmarkme::get_cpu()
#> $vendor_id
#> [1] "GenuineIntel"
#>
#> $model_name
#> [1] "Intel(R) Core(TM) i5-4288U CPU @ 2.60GHz"
#>
#> $no_of_cores
#> [1] 4

In the above environment, create a cluster with four workers:

cl <- parallel::makeCluster(4)
# registers a foreach backend; parApply() itself only needs cl
doParallel::registerDoParallel(cl, cores = 4)


parApply() does what apply() does, but distributes the rows across the workers:

classification3 <- parallel::parApply(cl, probabilities, 1, function(x) sample(1:3, 1, prob = x))
head(classification3)
#> [1] 2 2 2 2 3 3
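The parallel draws need not match the serial ones, because each worker process has its own random number stream and set.seed() in the main session does not control it. A minimal sketch, assuming reproducible parallel draws are wanted: parallel::clusterSetRNGStream() seeds the workers, and repeating the call with the same seed on the same cluster should repeat the same draws. The cluster size of 2 and the names run1 and run2 are illustrative choices, not part of the answer.

```r
library(parallel)

set.seed(1010)
probabilities <- matrix(runif(10000 * 3), nrow = 10000, ncol = 3)
# base rowSums() is enough for an ordinary matrix
probabilities <- probabilities / rowSums(probabilities)

cl <- makeCluster(2)  # illustrative cluster size

# seed the workers' random number streams, then sample row-wise
clusterSetRNGStream(cl, iseed = 1010)
run1 <- parApply(cl, probabilities, 1, function(x) sample(1:3, 1, prob = x))

# re-seeding with the same value repeats the same draws on the same cluster
clusterSetRNGStream(cl, iseed = 1010)
run2 <- parApply(cl, probabilities, 1, function(x) sample(1:3, 1, prob = x))

stopCluster(cl)

# check that both runs produced the same classification
identical(run1, run2)
```

The draws depend on how the rows are split across workers, so a cluster with a different number of workers would generally give different results for the same seed.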


Comparing the three approaches, including the apply() solution from the question:

microbenchmark::microbenchmark(
  question = # yours
    apply(probabilities, 1, function(x) sample(1:3, 1, prob = x)),
  vapp =
    vapply(split(probabilities, 1:nrow(probabilities)), function(x) sample(1:3, 1, prob = x), FUN.VALUE = integer(1), USE.NAMES = FALSE),
  parr =
    parallel::parApply(cl, probabilities, 1, function(x) sample(1:3, 1, prob = x))
)
#> Unit: milliseconds
#>      expr      min       lq     mean   median       uq       max neval
#>  question 49.93853 58.39965 65.05360 62.98119 68.28044 182.03267   100
#>      vapp 44.19828 54.84294 59.47109 58.56739 62.05269 146.14792   100
#>      parr 43.33227 48.16840 53.26599 50.87995 54.17286  98.67692   100

parallel::stopCluster(cl)





The accepted answer was posted by duckmayr (answered Nov 9 at 12:52, edited Nov 9 at 13:22).











      • This looks extremely promising, thanks a lot. Is it possible to rewrite the Rcpp command to be dynamic with respect to the number of states? I.e. without having to rewrite as IntegerVector::create(1, 2, 3, 4) for 4 states and so on. I have no Idea of C++, so bear with me please.
        – Mr. Zen
        Nov 9 at 13:02






      • 2




        @Mr.Zen Sure! I'll edit shortly to include that feature.
        – duckmayr
        Nov 9 at 13:03










      • @Mr.Zen Updated; now the choice set is an argument to the function (like in R's sample()). You can see the performance boost is still there, but it now has the flexibility you want.
        – duckmayr
        Nov 9 at 13:12










      • Thank you very much! I will award the bounty when I'm eligible to do so in ~21 hours.
        – Mr. Zen
        Nov 9 at 13:16
















      • This looks extremely promising, thanks a lot. Is it possible to rewrite the Rcpp command to be dynamic with respect to the number of states? I.e. without having to rewrite as IntegerVector::create(1, 2, 3, 4) for 4 states and so on. I have no Idea of C++, so bear with me please.
        – Mr. Zen
        Nov 9 at 13:02






      • 2




        @Mr.Zen Sure! I'll edit shortly to include that feature.
        – duckmayr
        Nov 9 at 13:03










      • @Mr.Zen Updated; now the choice set is an argument to the function (like in R's sample()). You can see the performance boost is still there, but it now has the flexibility you want.
        – duckmayr
        Nov 9 at 13:12










      • Thank you very much! I will award the bounty when I'm eligible to do so in ~21 hours.
        – Mr. Zen
        Nov 9 at 13:16















      This looks extremely promising, thanks a lot. Is it possible to rewrite the Rcpp command to be dynamic with respect to the number of states? I.e. without having to rewrite as IntegerVector::create(1, 2, 3, 4) for 4 states and so on. I have no Idea of C++, so bear with me please.
      – Mr. Zen
      Nov 9 at 13:02




      This looks extremely promising, thanks a lot. Is it possible to rewrite the Rcpp command to be dynamic with respect to the number of states? I.e. without having to rewrite as IntegerVector::create(1, 2, 3, 4) for 4 states and so on. I have no Idea of C++, so bear with me please.
      – Mr. Zen
      Nov 9 at 13:02




      2




      2




      @Mr.Zen Sure! I'll edit shortly to include that feature.
      – duckmayr
      Nov 9 at 13:03




      @Mr.Zen Sure! I'll edit shortly to include that feature.
      – duckmayr
      Nov 9 at 13:03












      @Mr.Zen Updated; now the choice set is an argument to the function (like in R's sample()). You can see the performance boost is still there, but it now has the flexibility you want.
      – duckmayr
      Nov 9 at 13:12




      @Mr.Zen Updated; now the choice set is an argument to the function (like in R's sample()). You can see the performance boost is still there, but it now has the flexibility you want.
      – duckmayr
      Nov 9 at 13:12












      Thank you very much! I will award the bounty when I'm eligible to do so in ~21 hours.
      – Mr. Zen
      Nov 9 at 13:16




      Thank you very much! I will award the bounty when I'm eligible to do so in ~21 hours.
      – Mr. Zen
      Nov 9 at 13:16












      up vote
      2
      down vote













      If you are willing to put probabilities in list, purrr::map or lapply seem a little faster:



      probabilities <- matrix(runif(10000*3),nrow=10000,ncol=3)
      probabilities <- probabilities / Matrix::rowSums(probabilities)
      probabilities_list <- split(probabilities, seq(nrow(probabilities)))

      library(purrr)
      set.seed(1010)
      classification_list <- map(probabilities_list, function(x) sample(1:3, 1, prob = x))

      set.seed(1010)
      classification_list <- lapply(probabilities_list, function(x) sample(1:3, 1, prob = x))


      Benchmarking:



      microbenchmark::microbenchmark(
      apply = classification = apply(probabilities, 1, function(x) sample(1:3, 1, prob = x)),
      map = classification = map(probabilities_list, function(x) sample(1:3, 1, prob = x)),
      lapply = classification = lapply(probabilities_list, function(x) sample(1:3, 1, prob = x)),
      times = 100
      )
      # Unit: milliseconds
      # expr min lq mean median uq max neval
      # apply 39.92883 42.59249 48.39247 45.03080 47.86648 94.39828 100
      # map 35.54077 37.13866 42.19719 39.95046 41.56323 66.05167 100
      #lapply 34.54861 36.48664 42.69512 39.20139 52.31494 59.29200 100


      With 100.000 cases



      # Unit: milliseconds
      # expr min lq mean median uq max neval
      # apply 457.5310 520.4926 572.5974 552.1674 611.5640 957.3997 100
      # map 391.4751 457.7326 488.3286 482.1459 512.2054 899.1380 100
      #lapply 386.2698 443.6732 491.9957 475.4160 507.3677 868.6725 100





      share|improve this answer


























        up vote
        2
        down vote













        If you are willing to put probabilities in list, purrr::map or lapply seem a little faster:



        probabilities <- matrix(runif(10000*3),nrow=10000,ncol=3)
        probabilities <- probabilities / Matrix::rowSums(probabilities)
        probabilities_list <- split(probabilities, seq(nrow(probabilities)))

        library(purrr)
        set.seed(1010)
        classification_list <- map(probabilities_list, function(x) sample(1:3, 1, prob = x))

        set.seed(1010)
        classification_list <- lapply(probabilities_list, function(x) sample(1:3, 1, prob = x))


        Benchmarking:



        microbenchmark::microbenchmark(
        apply = classification = apply(probabilities, 1, function(x) sample(1:3, 1, prob = x)),
        map = classification = map(probabilities_list, function(x) sample(1:3, 1, prob = x)),
        lapply = classification = lapply(probabilities_list, function(x) sample(1:3, 1, prob = x)),
        times = 100
        )
        # Unit: milliseconds
        # expr min lq mean median uq max neval
        # apply 39.92883 42.59249 48.39247 45.03080 47.86648 94.39828 100
        # map 35.54077 37.13866 42.19719 39.95046 41.56323 66.05167 100
        #lapply 34.54861 36.48664 42.69512 39.20139 52.31494 59.29200 100


        With 100.000 cases



        # Unit: milliseconds
        # expr min lq mean median uq max neval
        # apply 457.5310 520.4926 572.5974 552.1674 611.5640 957.3997 100
        # map 391.4751 457.7326 488.3286 482.1459 512.2054 899.1380 100
        #lapply 386.2698 443.6732 491.9957 475.4160 507.3677 868.6725 100





        share|improve this answer
























          up vote
          2
          down vote










          up vote
          2
          down vote









          edited Nov 7 at 11:17

























          answered Nov 7 at 11:07









          RLave


































              You can consider two options:

              • vapply, and

              • parallelization with parallel::parApply.

              With your probabilities matrix:



              set.seed(1010) # reproducibility

              # create a matrix of probabilities:
              # three possible outcomes, 10,000 cases
              probabilities <- matrix(runif(10000 * 3), nrow = 10000, ncol = 3)
              # base rowSums() is enough for an ordinary matrix
              probabilities <- probabilities / rowSums(probabilities)
              classification <- apply(probabilities, 1, function(x) sample(1:3, 1, prob = x))


              vapply



              Declaring the type and length of each result through FUN.VALUE lets vapply skip the result-type guessing that sapply does, so it can be safer and sometimes a little faster.



              # split() recycles the row index over the matrix, giving one vector per row
              classification2 <- vapply(split(probabilities, seq_len(nrow(probabilities))),
                                        function(x) sample(1:3, 1, prob = x),
                                        FUN.VALUE = integer(1), USE.NAMES = FALSE)
              head(classification2)
              #> [1] 1 3 3 1 2 3
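
To see what FUN.VALUE buys you, here is a small sketch on a hypothetical two-row list (the names rows and draw are mine, not from the question). vapply returns a plain integer vector, and it stops with an error if the function returns something of a different type instead of silently coercing.

```r
# Hypothetical toy input: two rows of outcome probabilities
rows <- list(c(0.2, 0.3, 0.5), c(0.9, 0.05, 0.05))
draw <- function(p) sample(1:3, 1, prob = p)

# Each call must return exactly one integer
out <- vapply(rows, draw, FUN.VALUE = integer(1))
typeof(out)

# Declaring the wrong type makes vapply fail rather than coerce the result
mismatch <- tryCatch(
  vapply(rows, draw, FUN.VALUE = character(1)),
  error = function(e) "error"
)
mismatch
```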


              parallel package



              benchmarkme::get_cpu()
              #> $vendor_id
              #> [1] "GenuineIntel"
              #>
              #> $model_name
              #> [1] "Intel(R) Core(TM) i5-4288U CPU @ 2.60GHz"
              #>
              #> $no_of_cores
              #> [1] 4


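              In the above environment, start a cluster with four workers:



              cl <- parallel::makeCluster(4)
              # only needed for foreach's %dopar%; parApply() uses cl directly
              doParallel::registerDoParallel(cl, cores = 4)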


              parApply() is the parallel counterpart of apply():



              classification3 <- parallel::parApply(cl, probabilities, 1, function(x) sample(1:3, 1, prob = x))
              head(classification3)
              #> [1] 2 2 2 2 3 3
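
Keep in mind that set.seed() in the main session does not seed the worker processes, so the parallel draws are not reproducible by default. If that matters, parallel::clusterSetRNGStream() is one option. A minimal sketch, assuming a small cluster of its own (cl_demo, probs and draw are names I made up for the illustration):

```r
library(parallel)

# Hypothetical small example: 100 rows, three outcomes
probs <- matrix(runif(100 * 3), nrow = 100, ncol = 3)
probs <- probs / rowSums(probs)
draw <- function(p) sample(1:3, 1, prob = p)

cl_demo <- makeCluster(2)

# Give every worker its own reproducible random number stream
clusterSetRNGStream(cl_demo, iseed = 1010)
run1 <- parApply(cl_demo, probs, 1, draw)

# Resetting the streams with the same seed repeats the same draws
clusterSetRNGStream(cl_demo, iseed = 1010)
run2 <- parApply(cl_demo, probs, 1, draw)

stopCluster(cl_demo)

identical(run1, run2)
```

The result depends on the number of workers, because the rows are split across them, so keep the cluster size fixed if you want the same draws again.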


              Comparing the three approaches, including the original apply() solution:



              microbenchmark::microbenchmark(
                question = # yours
                  apply(probabilities, 1, function(x) sample(1:3, 1, prob = x)),
                vapp =
                  vapply(split(probabilities, 1:nrow(probabilities)), function(x) sample(1:3, 1, prob = x), FUN.VALUE = integer(1), USE.NAMES = FALSE),
                parr =
                  parallel::parApply(cl, probabilities, 1, function(x) sample(1:3, 1, prob = x))
              )
              #> Unit: milliseconds
              #>      expr      min       lq     mean   median       uq       max neval
              #>  question 49.93853 58.39965 65.05360 62.98119 68.28044 182.03267   100
              #>      vapp 44.19828 54.84294 59.47109 58.56739 62.05269 146.14792   100
              #>      parr 43.33227 48.16840 53.26599 50.87995 54.17286  98.67692   100

              parallel::stopCluster(cl)
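
All three variants above still call sample() once per row. A different technique, not used in the timings above and not benchmarked here, is to draw one uniform number per row and compare it with the running row-wise cumulative probabilities (inverse-CDF sampling, done column by column). A minimal sketch, assuming every row sums to 1; sample_rows is a hypothetical helper name:

```r
# Hypothetical helper: one draw per row of a row-stochastic matrix
sample_rows <- function(prob_matrix) {
  n <- nrow(prob_matrix)
  u <- runif(n)              # one uniform number per row
  outcome <- rep(1L, n)      # start every row at outcome 1
  running <- numeric(n)      # cumulative probability so far
  # Move a row to the next outcome each time u exceeds the cumulative sum
  for (j in seq_len(ncol(prob_matrix) - 1)) {
    running <- running + prob_matrix[, j]
    outcome <- outcome + (u > running)
  }
  outcome
}

set.seed(1010)
probabilities <- matrix(runif(10000 * 3), nrow = 10000, ncol = 3)
probabilities <- probabilities / rowSums(probabilities)

classification_vec <- sample_rows(probabilities)
table(classification_vec)
```

The draws follow the same row-wise probabilities, but they will not match the sample() results for the same seed, because the random numbers are consumed differently.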





                  answered Nov 9 at 13:16









                  Blended




























                       
