Bottleneck in the Eigen partial_lu_inplace when factorizing a large number of small matrices

I need to factorize ~1e05 small matrices of variable size, at most 20x20. Profiling the matrix factorization with HPCToolkit shows that the hotspot in the code is Eigen::internal::partial_lu_inplace.



I checked the Eigen documentation on in-place matrix decompositions, and I understand that using an in-place decomposition can be important for large matrices, to re-use the memory and get better cache efficiency.
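
For reference, the explicit in-place decomposition that documentation page describes is the Ref-based API; a minimal illustrative sketch (my example, not code from the question) looks like this:

#include <Eigen/Dense>
using namespace Eigen;

MatrixXd A = MatrixXd::Random(20, 20);
// Opt-in in-place decomposition: the L and U factors overwrite A's coefficients.
PartialPivLU<Ref<MatrixXd>> lu(A);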



I am currently computing the decomposition like this:



// Factorize the matrix.
matrixFactorization_ = A_.partialPivLu();
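
(For context: the question does not show how A_ and matrixFactorization_ are declared; a plausible, assumed sketch would be plain dynamic-size types:)

// Hypothetical declarations, not shown in the question.
MatrixXd A_;                                  // variable size, at most 20x20
PartialPivLU<MatrixXd> matrixFactorization_;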


Profiling with HPCToolkit shows that the in-place factorization is the hotspot:



[HPCToolkit profile screenshot: CPU time concentrated in Eigen::internal::partial_lu_inplace]



Is it possible to disable the in-place decomposition, to test whether the code will be faster for the small matrices I am dealing with?



Note: If you look at the CPU time column in the image, you'll notice the runtime is in seconds: I am not after microsecond optimizations here, the calculation takes ~4 seconds in total.



Edit: HPCToolkit statistically profiles the code compiled in fully optimized mode (-O3), but with the debug information required to map the measurements back to the source code (-g3).










eigen

asked Nov 14 '18 at 10:48 by tmaric, edited Nov 14 '18 at 16:17
          1 Answer
          If the profiler gives you such detailed information, then you probably forgot to enable the compiler's optimizations (e.g., -O3 -march=native -DNDEBUG, or "Release" mode + /arch:AVX with VS). With Eigen, this makes a huge difference.
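
          With GCC or Clang, for example, the full set of flags together with the -g3 from the question's edit could look like this (paths and file names are placeholders):

          g++ -O3 -march=native -DNDEBUG -g3 -I/path/to/eigen main.cpp -o main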



          Then you can avoid the dynamic memory allocations by using:



          // Dynamic-size matrix type with a compile-time upper bound of 20x20:
          // the storage is embedded in the object, so no heap allocation is needed.
          typedef Matrix<double, Dynamic, Dynamic, ColMajor, 20, 20> MatMax20;
          MatMax20 A_;
          PartialPivLU<MatMax20> matrixFactorization_;


          The matrix A_ and all internals of PartialPivLU will thus be statically allocated.



          To update an existing factorization, better write:



          matrixFactorization_.compute(A_);
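
          Putting the pieces together, a minimal sketch of the whole computation could look as follows. The driver function, container names, and the OpenMP pragma are my assumptions (the pragma follows the comment below about computing multiple factorizations in parallel yourself); they are not part of the original answer. With pre-C++17 compilers you may additionally need Eigen::aligned_allocator for the std::vectors, since the bounded-size matrices can be over-aligned.

          #include <vector>
          #include <Eigen/Dense>
          using namespace Eigen;

          // Dynamic-size matrices bounded by 20x20 at compile time: the storage lives
          // inside the object, so neither the matrices nor the LU objects touch the heap.
          typedef Matrix<double, Dynamic, Dynamic, ColMajor, 20, 20> MatMax20;

          // Hypothetical driver for the ~1e5 small systems from the question.
          void factorizeAll(const std::vector<MatMax20>& matrices,
                            std::vector<PartialPivLU<MatMax20>>& factorizations)
          {
              factorizations.resize(matrices.size());
              #pragma omp parallel for // optional: factorize independent matrices in parallel
              for (int i = 0; i < static_cast<int>(matrices.size()); ++i)
                  factorizations[i].compute(matrices[i]); // re-uses each object's preallocated storage
          }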





          answered Nov 14 '18 at 13:53 by ggael
          • Nope, I have compiled in fully optimized mode, see the HPCToolkit documentation. The code is optimized, but with debug information that allows mapping the statistical measurements to the source code.

            – tmaric
            Nov 14 '18 at 16:18











          • Also, I have variable-sized matrices, so I cannot fix the size at compile time; the max size of 20x20 was just to give a hint of the system size. Is there a simple way to disable the in-place operations in Eigen?

            – tmaric
            Nov 14 '18 at 16:18












          • The typedef suggested by ggael does give you dynamically-sized matrices which are limited to 20x20. This avoids dynamic allocations both when creating the matrices and when computing the LU decomposition.

            – chtz
            Nov 14 '18 at 16:22






          • @tmaric: "disable the inplace operations" -> this does not make sense; the partial_lu_inplace function you see is what computes the LU factorization internally. This has little to do with the documentation you referenced. So beyond the optimization I suggested, there is not much you can do, except computing multiple factorizations in parallel using, e.g., OpenMP on your side.

            – ggael
            Nov 14 '18 at 17:06












          • @chtz: the unit tests seem to run faster with pre-initialized max matrix sizes. Enabling vectorization with -march=native does not bring much for the smaller unit tests; I will run the larger tests with this option and see if it helps. @ggael: OK, understood, thanks for the help! If you remove the part of your answer regarding the debugging and optimization flags, I can accept it. Fixing the max matrix size helped make the tests run faster.

            – tmaric
            Nov 14 '18 at 19:25










