Bottleneck in the Eigen partial_lu_inplace when factorizing a large number of small matrices
I need to factorize ~1e5 small matrices of variable size, at most 20x20. Profiling the matrix factorization with HPCToolkit shows that the hotspot in the code is Eigen::internal::partial_lu_inplace.
I checked the Eigen documentation on in-place matrix decompositions, and I understand that for large matrices an in-place decomposition can be important because it reuses memory and improves cache efficiency.
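For reference, the in-place API that documentation describes looks roughly like this (an illustrative sketch, not the code from my application):

#include <Eigen/Dense>

int main() {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(20, 20);
    Eigen::VectorXd b = Eigen::VectorXd::Random(20);
    // The decomposition works directly on A's storage: A's coefficients are
    // overwritten by the L and U factors, no extra matrix is allocated.
    Eigen::PartialPivLU<Eigen::Ref<Eigen::MatrixXd>> lu(A);
    Eigen::VectorXd x = lu.solve(b);
    (void)x; // use x in a real application
}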
I am currently computing the decomposition like this:
// Factorize the matrix.
matrixFactorization_ = A_.partialPivLu();
Profiling with HPCToolkit shows that the in-place factorization is the hotspot (see the profiler screenshot).
Is it possible to disable the in-place decomposition and test whether the code will be faster for the small matrices I am dealing with?
Note: if you look at the CPU time column in the screenshot, you'll notice the runtime is in seconds: I am not after microsecond optimizations here; the whole calculation takes ~4 seconds in total.
Edit: HPCToolkit statistically profiles the code compiled in fully optimized mode (-O3), but with the debug information required to map the measurements to the source code (-g3).
Tagged: eigen
asked Nov 14 '18 at 10:48, edited Nov 14 '18 at 16:17 – tmaric
1 Answer
If the profiler gives you such detailed information, then you likely forgot to enable compiler optimizations (e.g., -O3 -march=native -DNDEBUG, or "Release" mode plus /arch:AVX with Visual Studio). With Eigen, this makes a huge difference.
Then you can avoid dynamic memory allocation by using:
typedef Matrix<double,Dynamic,Dynamic,ColMajor,20,20> MatMax20;
MatMax20 A_;
PartialPivLU<MatMax20> matrixFactorization_;
The matrix A_ and all internals of PartialPivLU will thus be statically allocated.
To update an existing factorization, better write:
matrixFactorization_.compute(A_);
answered Nov 14 '18 at 13:53 – ggael
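For reference, a minimal self-contained sketch of the pattern suggested in the answer; the loop, the random data, and the varying sizes are illustrative assumptions, not part of the original answer:

#include <Eigen/Dense>

// Dynamic-size matrix/vector types whose storage is capped at 20x20 (resp. 20),
// so no heap allocation is needed for any individual matrix.
using MatMax20 = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic,
                               Eigen::ColMajor, 20, 20>;
using VecMax20 = Eigen::Matrix<double, Eigen::Dynamic, 1, Eigen::ColMajor, 20, 1>;

int main() {
    // One factorization object, reused for every matrix: compute() refactorizes
    // without allocating, because the type has a compile-time maximum size.
    Eigen::PartialPivLU<MatMax20> lu(20);

    for (int i = 0; i < 100000; ++i) {
        const int n = 5 + (i % 16);          // sizes vary, at most 20 (made up here)
        MatMax20 A = MatMax20::Random(n, n);
        VecMax20 b = VecMax20::Random(n);
        lu.compute(A);
        VecMax20 x = lu.solve(b);
        (void)x;                             // use x in the real application
    }
}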
Nope, I have compiled in fully optimized mode; see the HPCToolkit documentation. The code is optimized, but with the debug information that allows mapping the statistical measurements to the source code.
– tmaric, Nov 14 '18 at 16:18
Also, I have variable-sized matrices, so I cannot fix the size at compile time; the max size of 20x20 was just to give a hint of the system size. Is there a simple way to disable the in-place operations in Eigen?
– tmaric, Nov 14 '18 at 16:18
The typedef suggested by ggael does give you dynamically-sized matrices that are limited to 20x20. This avoids dynamic allocations when creating the matrices and the LU decomposition.
– chtz, Nov 14 '18 at 16:22
@tmaric: "disable the inplace operations" does not make sense; the partial_lu_inplace function you see is what computes the LU factorization internally. It has little to do with the documentation you referenced. So beyond the optimization I suggested, there is not much you can do, except computing multiple factorizations in parallel on your side using, e.g., OpenMP.
– ggael, Nov 14 '18 at 17:06
@chtz: the unit tests seem to run faster with the pre-set maximum matrix size. Enabling vectorization with -march=native does not bring much for the smaller unit tests; I will run the larger tests with this option and see if it helps. @ggael: OK, understood, thanks for the help! If you remove the part of your answer about the debugging and optimization flags, I can accept it. Fixing the maximum matrix size helped make the tests run faster.
– tmaric, Nov 14 '18 at 19:25
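As a side note, the parallelization ggael mentions above could look roughly like the following sketch. It assumes the factorizations are independent and that the program is compiled with OpenMP enabled (e.g. -fopenmp); the container types and function name are made up for illustration:

#include <Eigen/Dense>
#include <cstddef>
#include <vector>

using MatMax20 = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic,
                               Eigen::ColMajor, 20, 20>;
using VecMax20 = Eigen::Matrix<double, Eigen::Dynamic, 1, Eigen::ColMajor, 20, 1>;

// aligned_allocator keeps the max-size Eigen objects properly aligned inside std::vector.
template <class T>
using AlignedVector = std::vector<T, Eigen::aligned_allocator<T>>;

// Solve many independent small systems in parallel: each thread keeps its own
// factorization object, so no state is shared between loop iterations.
void solveAll(const AlignedVector<MatMax20>& As,
              const AlignedVector<VecMax20>& bs,
              AlignedVector<VecMax20>& xs)
{
    xs.resize(As.size());
    #pragma omp parallel
    {
        Eigen::PartialPivLU<MatMax20> lu(20);  // thread-local, reused across iterations
        #pragma omp for
        for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(As.size()); ++i) {
            lu.compute(As[i]);
            xs[i] = lu.solve(bs[i]);
        }
    }
}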