Running MPI on LAN Cluster with different usernames
I have two machines with different usernames: assume user1@master and user2@slave. I would like to run an MPI job across the two machines, but so far I have been unsuccessful. I have set up passwordless ssh between the two machines. Both machines have the same version of OpenMPI installed, and both have PATH and LD_LIBRARY_PATH set accordingly.
The OpenMPI path on each machine is /home/$USER/.openmpi, and the program I want to run is inside ~/folder.
My /etc/hosts file on both machines:
master x.x.x.110
slave x.x.x.111
My ~/.ssh/config file on user1@master:
Host slave
User user2
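With this config, ssh slave from master should log in as user2 without a password. A quick sanity check (a sketch, using the host alias above) is to confirm which account, and which PATH, a non-interactive ssh session actually gets:
$ ssh slave whoami          # should print user2
$ ssh slave 'echo $PATH'    # the PATH a non-interactive shell sees; /home/user2/.openmpi/bin should be in it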
I then execute the following command on user1@master from inside ~/folder:
$ mpiexec -n 1 ./program : -np 1 -host slave -wdir /home/user2/folder ./program
I get the following error:
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
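The first bullet seems the most relevant here: a non-interactive ssh shell on the slave does not necessarily pick up the PATH and LD_LIBRARY_PATH exports from an interactive profile. Two common remedies (sketched with the install paths from above; adjust to your setup) are to export the variables near the top of ~/.bashrc on the slave, before any "not running interactively" early return, or to pass --prefix to mpirun so the remote node is told where OpenMPI is installed:
# on the slave, near the top of ~/.bashrc:
export PATH=/home/user2/.openmpi/bin:$PATH
export LD_LIBRARY_PATH=/home/user2/.openmpi/lib:$LD_LIBRARY_PATH

# or point mpirun at the remote install explicitly:
$ mpirun --prefix /home/user2/.openmpi -np 1 ./program : -np 1 -host slave -wdir /home/user2/folder ./program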
Edits
If I use a hostfile with contents:
localhost
user2@slave
along with the --mca plm_base_verbose option, I get the following error:
$ mpirun --mca plm_base_verbose 10 -n 5 --hostfile hosts.txt ./program
[user:29277] mca: base: components_register: registering framework plm components
[user:29277] mca: base: components_register: found loaded component slurm
[user:29277] mca: base: components_register: component slurm register function successful
[user:29277] mca: base: components_register: found loaded component isolated
[user:29277] mca: base: components_register: component isolated has no register or open function
[user:29277] mca: base: components_register: found loaded component rsh
[user:29277] mca: base: components_register: component rsh register function successful
[user:29277] mca: base: components_open: opening plm components
[user:29277] mca: base: components_open: found loaded component slurm
[user:29277] mca: base: components_open: component slurm open function successful
[user:29277] mca: base: components_open: found loaded component isolated
[user:29277] mca: base: components_open: component isolated open function successful
[user:29277] mca: base: components_open: found loaded component rsh
[user:29277] mca: base: components_open: component rsh open function successful
[user:29277] mca:base:select: Auto-selecting plm components
[user:29277] mca:base:select:( plm) Querying component [slurm]
[user:29277] mca:base:select:( plm) Querying component [isolated]
[user:29277] mca:base:select:( plm) Query of component [isolated] set priority to 0
[user:29277] mca:base:select:( plm) Querying component [rsh]
[user:29277] mca:base:select:( plm) Query of component [rsh] set priority to 10
[user:29277] mca:base:select:( plm) Selected component [rsh]
[user:29277] mca: base: close: component slurm closed
[user:29277] mca: base: close: unloading component slurm
[user:29277] mca: base: close: component isolated closed
[user:29277] mca: base: close: unloading component isolated
[user:29277] *** Process received signal ***
[user:29277] Signal: Segmentation fault (11)
[user:29277] Signal code: (128)
[user:29277] Failing at address: (nil)
[user:29277] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f4226242f20]
[user:29277] [ 1] /lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x197)[0x7f422629b207]
[user:29277] [ 2] /lib/x86_64-linux-gnu/libc.so.6(__nss_lookup_function+0x10a)[0x7f422634d06a]
[user:29277] [ 3] /lib/x86_64-linux-gnu/libc.so.6(__nss_lookup+0x3d)[0x7f422634d19d]
[user:29277] [ 4] /lib/x86_64-linux-gnu/libc.so.6(getpwuid_r+0x2f3)[0x7f42262e7ee3]
[user:29277] [ 5] /lib/x86_64-linux-gnu/libc.so.6(getpwuid+0x98)[0x7f42262e7498]
[user:29277] [ 6] /home/.openmpi/lib/openmpi/mca_plm_rsh.so(+0x477d)[0x7f422356977d]
[user:29277] [ 7] /home/.openmpi/lib/openmpi/mca_plm_rsh.so(+0x67a7)[0x7f422356b7a7]
[user:29277] [ 8] /home/.openmpi/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0xdc9)[0x7f4226675749]
[user:29277] [ 9] mpirun(+0x1262)[0x563fde915262]
[user:29277] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f4226225b97]
[user:29277] [11] mpirun(+0xe7a)[0x563fde914e7a]
[user:29277] *** End of error message ***
Segmentation fault (core dumped)
I do not get any ssh ... orted ... info as asked, but maybe that is because I am mistyping the --mca option?
Tags: mpi, host
asked Nov 10 at 5:26 by John.Ludlum, edited Nov 13 at 7:21
It seems you are mixing mpiexec syntax (e.g. -n 1) with mpirun syntax (e.g. -np 1). Try using the full path to mpirun (e.g. /opt/openmpi-3.1.3/bin/mpirun) and see whether it helps. Another option is to configure --enable-mpirun-prefix-by-default and rebuild/install Open MPI.
– Gilles Gouaillardet, Nov 10 at 16:46
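Applied to the setup in the question (a sketch; the install prefix here is the /home/$USER/.openmpi path given above, not /opt/openmpi-3.1.3), that suggestion would look something like:
$ /home/user1/.openmpi/bin/mpirun -np 1 ./program : -np 1 -host slave -wdir /home/user2/folder ./program
and the rebuild option:
$ ./configure --prefix=$HOME/.openmpi --enable-mpirun-prefix-by-default
$ make -j4 && make install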
@GillesGouaillardet I tried what you suggested, but it did not solve the problem. To try to sidestep the issue, I reinstalled OpenMPI on both machines in the same directory (/home/.openmpi) and set up an NFS-shared folder with the code. This did not change the error output; I get exactly the same thing as before.
– John.Ludlum, Nov 11 at 8:00
Did you configure --enable-mpirun-prefix-by-default? If you run mpirun --mca plm_base_verbose 10 ..., it should print the ssh ... orted ... command line that is run under the hood, and you can try to run it manually (it could be a permission issue on the Open MPI libs). Have you tried a hostfile with the user@host syntax instead of tweaking your .ssh/config?
– Gilles Gouaillardet, Nov 11 at 20:02
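One way to start that manual check (a sketch; slave and the paths are the ones from the question) is to see what a non-interactive shell on the remote node finds before mpirun ever gets involved:
$ ssh slave 'which orted'                        # no result means orted is not on the remote PATH
$ ssh slave 'ls -ld /home/user2/.openmpi/lib'    # permissions on the Open MPI libs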
@GillesGouaillardet Yes, I tried both --enable-mpirun-prefix-by-default and --enable-orterun-prefix-by-default (as suggested in the error message) with no change in the output. For the --mca command, see the edits. I tried using a hostfile too. Thanks for all your suggestions! The problem still exists, but at least I'm learning quite a bit!
– John.Ludlum, Nov 13 at 7:23
This is the log of an mpirun crash, and that should never happen! Can you run mpirun --mca plm_base_verbose 10 ... without the hostfile (relying on your ssh config instead)?
– Gilles Gouaillardet, Nov 13 at 21:21