Bottleneck in the Eigen partial_lu_inplace when factorizing a large number of small matrices

I need to factorize ~1e05 small matrices of variable size, at most 20x20. Profiling the matrix factorization with HPCToolkit shows that the hotspot in the code is Eigen::internal::partial_lu_inplace.



I checked the Eigen documentation on in-place matrix decompositions, and I understand that using an in-place decomposition can be important for large matrices, to re-use the memory and get better cache efficiency.
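
For reference, the explicit in-place decomposition that documentation page describes is the Ref-based API; a minimal illustrative sketch (my example, not code from the question) looks like this:

#include <Eigen/Dense>
using namespace Eigen;

MatrixXd A = MatrixXd::Random(20, 20);
// Opt-in in-place decomposition: the L and U factors overwrite A's coefficients.
PartialPivLU<Ref<MatrixXd>> lu(A);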



I am currently computing the decomposition like this:



// Factorize the matrix.
matrixFactorization_ = A_.partialPivLu();
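
(For context: the question does not show how A_ and matrixFactorization_ are declared; a plausible, assumed sketch would be plain dynamic-size types:)

// Hypothetical declarations, not shown in the question.
MatrixXd A_;                                  // variable size, at most 20x20
PartialPivLU<MatrixXd> matrixFactorization_;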


Profiling with HPCToolkit shows that the in-place factorization is the hotspot:



[HPCToolkit profile screenshot: CPU time concentrated in Eigen::internal::partial_lu_inplace]



Is it possible to disable the in-place decomposition, to test whether the code will be faster for the small matrices I am dealing with?



Note: If you look at the CPU time column in the image, you'll notice the runtime is in seconds: I am not after microsecond optimizations here, the calculation takes ~4 seconds in total.



Edit: HPCToolkit statistically profiles the code compiled in fully optimized mode (-O3), but with the debug information required to map the measurements back to the source code (-g3).










eigen

asked Nov 14 '18 at 10:48 by tmaric, edited Nov 14 '18 at 16:17
          1 Answer
          If the profiler gives you such detailed information, then you probably forgot to enable the compiler's optimizations (e.g., -O3 -march=native -DNDEBUG, or "Release" mode + /arch:AVX with VS). With Eigen, this makes a huge difference.
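
          With GCC or Clang, for example, the full set of flags together with the -g3 from the question's edit could look like this (paths and file names are placeholders):

          g++ -O3 -march=native -DNDEBUG -g3 -I/path/to/eigen main.cpp -o main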



          Then you can avoid the dynamic memory allocations by using:



          // Dynamic-size matrix type with a compile-time upper bound of 20x20:
          // the storage is embedded in the object, so no heap allocation is needed.
          typedef Matrix<double, Dynamic, Dynamic, ColMajor, 20, 20> MatMax20;
          MatMax20 A_;
          PartialPivLU<MatMax20> matrixFactorization_;


          The matrix A_ and all internals of PartialPivLU will thus be statically allocated.



          To update an existing factorization, better write:



          matrixFactorization_.compute(A_);
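
          Putting the pieces together, a minimal sketch of the whole computation could look as follows. The driver function, container names, and the OpenMP pragma are my assumptions (the pragma follows the comment below about computing multiple factorizations in parallel yourself); they are not part of the original answer. With pre-C++17 compilers you may additionally need Eigen::aligned_allocator for the std::vectors, since the bounded-size matrices can be over-aligned.

          #include <vector>
          #include <Eigen/Dense>
          using namespace Eigen;

          // Dynamic-size matrices bounded by 20x20 at compile time: the storage lives
          // inside the object, so neither the matrices nor the LU objects touch the heap.
          typedef Matrix<double, Dynamic, Dynamic, ColMajor, 20, 20> MatMax20;

          // Hypothetical driver for the ~1e5 small systems from the question.
          void factorizeAll(const std::vector<MatMax20>& matrices,
                            std::vector<PartialPivLU<MatMax20>>& factorizations)
          {
              factorizations.resize(matrices.size());
              #pragma omp parallel for // optional: factorize independent matrices in parallel
              for (int i = 0; i < static_cast<int>(matrices.size()); ++i)
                  factorizations[i].compute(matrices[i]); // re-uses each object's preallocated storage
          }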





          answered Nov 14 '18 at 13:53 by ggael
          • Nope, I have compiled in fully optimized mode, see the HPCToolkit documentation. The code is optimized, but with debug information that allows mapping the statistical measurements to the source code.

            – tmaric
            Nov 14 '18 at 16:18











          • Also, I have variable-sized matrices, so I cannot fix the size at compile time; the max size of 20x20 was just to give a hint of the system size. Is there a simple way to disable the in-place operations in Eigen?

            – tmaric
            Nov 14 '18 at 16:18












          • The typedef suggested by ggael does give you dynamically-sized matrices which are limited to 20x20. This avoids dynamic allocations both when creating the matrices and when computing the LU decomposition.

            – chtz
            Nov 14 '18 at 16:22






          • @tmaric: "disable the inplace operations" -> this does not make sense; the partial_lu_inplace function you see is what computes the LU factorization internally. This has little to do with the documentation you referenced. So beyond the optimization I suggested, there is not much you can do, except computing multiple factorizations in parallel using, e.g., OpenMP on your side.

            – ggael
            Nov 14 '18 at 17:06












          • @chtz: the unit tests seem to run faster with pre-initialized max matrix sizes. Enabling vectorization with -march=native does not bring much for the smaller unit tests; I will run the larger tests with this option and see if it helps. @ggael: OK, understood, thanks for the help! If you remove the part of your answer regarding the debugging and optimization flags, I can accept it. Fixing the max matrix size helped make the tests run faster.

            – tmaric
            Nov 14 '18 at 19:25










