Multi-CPU/GPU support with Horovod

Warning: Multi-GPU support is under development and hasn't been thoroughly tested yet. Proceed with caution!

QMC simulations can easily be parallelized by using multiple resources to sample the wave function. Each walker is independent of the others, so multiple compute nodes can be used in parallel to obtain more samples. Each node can also use GPUs if they are available. We demonstrate here how to use the Horovod library (https://github.com/horovod/horovod) to leverage large compute resources for QMC.

Let’s first create a simple system

[1]:
import torch
from torch import optim
from qmctorch.scf import Molecule
from qmctorch.wavefunction import SlaterJastrow
from qmctorch.sampler import Metropolis
from qmctorch.utils import (plot_energy, plot_data)
from qmctorch.utils import set_torch_double_precision
set_torch_double_precision()
mol = Molecule(atom='H 0. 0. 0; H 0. 0. 1.', unit='bohr', redo_scf=True)

Let’s see if GPUs are available

[2]:
use_gpu = torch.cuda.is_available()
[3]:
wf = SlaterJastrow(mol, cuda=use_gpu).gto2sto()
sampler = Metropolis(nwalkers=100, nstep=500, step_size=0.25,
                     nelec=wf.nelec, ndim=wf.ndim,
                     init=mol.domain('atomic'),
                     move={'type': 'all-elec', 'proba': 'normal'},
                     cuda=use_gpu)
[4]:
lr_dict = [{'params': wf.jastrow.parameters(), 'lr': 3E-3},
           {'params': wf.ao.parameters(), 'lr': 1E-6},
           {'params': wf.mo.parameters(), 'lr': 1E-3},
           {'params': wf.fc.parameters(), 'lr': 2E-3}]
opt = optim.Adam(lr_dict, lr=1E-3)

A dedicated QMCTorch solver has been developed to handle multiple GPUs. To use it, simply import it and use it as you would the normal solver; only a few modifications are required to use Horovod:

[5]:
import horovod.torch as hvd
from qmctorch.solver import SolverMPI

hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.rank())

solver = SolverMPI(wf=wf, sampler=sampler,
                   optimizer=opt,
                   rank=hvd.rank())
[6]:
solver.configure(track=['local_energy'], freeze=['ao', 'mo'],
                 loss='energy', grad='auto',
                 ortho_mo=False, clip_loss=False,
                 resampling={'mode': 'update',
                             'resample_every': 1,
                             'nstep_update': 50})

# optimize the wave function
obs = solver.run(5)


As you can see, some classes need the rank of the process when they are defined. This simply ensures that only the master process generates the HDF5 files containing the information about the calculation.
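The same rank-zero pattern applies to any extra output you add around the solver. Below is a minimal sketch, using only hvd.rank() and hvd.size(); the file name is purely illustrative and not part of QMCTorch:

import horovod.torch as hvd

hvd.init()

# only the master process writes files or prints progress;
# the other ranks only contribute samples and gradients
if hvd.rank() == 0:
    with open('qmc_run.log', 'w') as f:  # illustrative file name
        f.write('optimization running on %d processes\n' % hvd.size())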

Running parallel calculations

It is currently difficult to use Horovod on multiple nodes through a Jupyter notebook. To do so, put all the code in a Python file and execute it with the following command

horovodrun -np 2 python <example>.py

See the Horovod documentation for more details: https://github.com/horovod/horovod
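Such a file essentially gathers the notebook cells above into a single script. Below is a minimal sketch using the same values as above; the complete example shipped with QMCTorch is referenced at the end of this page:

import torch
import horovod.torch as hvd
from torch import optim
from qmctorch.scf import Molecule
from qmctorch.wavefunction import SlaterJastrow
from qmctorch.sampler import Metropolis
from qmctorch.solver import SolverMPI
from qmctorch.utils import set_torch_double_precision

set_torch_double_precision()

hvd.init()
use_gpu = torch.cuda.is_available()
if use_gpu:
    torch.cuda.set_device(hvd.rank())

# system, wave function and sampler, as in the cells above
mol = Molecule(atom='H 0. 0. 0; H 0. 0. 1.', unit='bohr', redo_scf=True)
wf = SlaterJastrow(mol, cuda=use_gpu).gto2sto()
sampler = Metropolis(nwalkers=100, nstep=500, step_size=0.25,
                     nelec=wf.nelec, ndim=wf.ndim,
                     init=mol.domain('atomic'),
                     move={'type': 'all-elec', 'proba': 'normal'},
                     cuda=use_gpu)

# per-group learning rates and optimizer
lr_dict = [{'params': wf.jastrow.parameters(), 'lr': 3E-3},
           {'params': wf.ao.parameters(), 'lr': 1E-6},
           {'params': wf.mo.parameters(), 'lr': 1E-3},
           {'params': wf.fc.parameters(), 'lr': 2E-3}]
opt = optim.Adam(lr_dict, lr=1E-3)

# distributed solver, one process per rank
solver = SolverMPI(wf=wf, sampler=sampler,
                   optimizer=opt,
                   rank=hvd.rank())
solver.configure(track=['local_energy'], freeze=['ao', 'mo'],
                 loss='energy', grad='auto',
                 ortho_mo=False, clip_loss=False,
                 resampling={'mode': 'update',
                             'resample_every': 1,
                             'nstep_update': 50})

# optimize the wave function
obs = solver.run(5)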

This solver distributes the Nw walkers over the Np processes. For example, specifying 2000 walkers and using 4 processes leads to each process using only 500 walkers. During the optimization of the wave function, each process computes the gradients of the variational parameters using its local 500 walkers. The gradients are then averaged over all the processes before the optimization step takes place. This data-parallel model has been very successful in machine learning applications (http://jmlr.org/papers/volume20/18-789/18-789.pdf). The sketch below illustrates the idea with plain Horovod primitives.
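A minimal sketch of the walker split and gradient averaging, using only hvd.size(), hvd.rank() and hvd.allreduce; the actual bookkeeping inside SolverMPI may differ:

import torch
import horovod.torch as hvd

hvd.init()

# each process keeps Nw / Np walkers: 2000 walkers on 4 processes -> 500 each
nwalkers_total = 2000
nwalkers_local = nwalkers_total // hvd.size()

# toy stand-in for the gradient of some variational parameters,
# computed from the local walkers only
local_grad = torch.randn(10)

# average the gradients over all processes before the optimizer step
# (hvd.allreduce averages by default)
mean_grad = hvd.allreduce(local_grad, name='toy_grad')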

A complete example can be found in qmctorch/docs/example/horovod/h2.py