David Yunis, March 2024
Apptainer is a container system, like Docker but without the nasty permission and networking issues. Using an Apptainer container is like working in a frozen, stable OS combined with a python virtual environment, while your user permissions on the filesystem outside the container stay the same. Below we will run through sample creation and usage of Apptainer containers.
Per Adam, Apptainer should be available on all compute nodes, so first log in to beehive:
ssh [user]@beehive.ttic.edu
then request a compute node (with a GPU as we will install GPU software):
srun -p dev-gpu -G 1 --pty bash
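Once the interactive shell starts, you can optionally confirm that Apptainer is indeed available on the node:
# confirm apptainer is installed (the exact version may differ)
apptainer --version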
First make a directory on scratch for container building:
mkdir -p "/scratch/$USER/apptainer"
cd "/scratch/$USER/apptainer"
Now pull an existing container into a new folder for building the container. We will use a stock Docker image with CUDA 12.3 support. We name this container pytorch_container, but that can be replaced as you like:
# this step pulls the docker image
apptainer build --sandbox pytorch_container docker://nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04
After getting the base image, we need to start up the container in writable mode to install the software we want:
# --writable allows for writing to the container, necessary for installing new software
# --no-home prevents issues with NFS home directories and writable containers
# --nv allows for gpu support
# --fakeroot is required for installing packages in the container as root, in particular using apt
apptainer shell --no-home --nv --writable --fakeroot pytorch_container
Now we should have a shell inside the container, where we will install software. In general the flow should be apt packages first, then conda packages, and finally pip packages.
# make sure package lists are current
apt update
apt upgrade
# install generally useful packages
apt install git
apt install wget
apt install vim
# create a location inside the container for a miniconda installation
mkdir /conda_tmp
cd /conda_tmp
# from beehive docs, install miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /conda_tmp/mc3
rm Miniconda3-latest-Linux-x86_64.sh
eval "$(/conda_tmp/mc3/bin/conda 'shell.bash' 'hook')"
# install pytorch with cuda 12.1 support
conda install -c pytorch -c nvidia pytorch pytorch-cuda=12.1 torchvision torchaudio
# install any other python packages needed
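# for example (optional -- these particular packages are only illustrative):
pip install matplotlib tqdm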
# exit the container
exit
After we install software in the container in writable mode, we need to build the container into an image so that we can use it to run software in read-only mode. To do this, run the following outside the container (this step will take a while):
apptainer build pytorch_container.sif pytorch_container
After the build has finished, we should have a single image file that is the whole container. Typically this file is quite large (~10G). Copy it somewhere on NFS so that we can reuse the container across different nodes. We would advise somewhere in /share, as there is more space than in the home directory, which only has 20G, but below we use the home directory as everyone has one.
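Before copying, you can optionally confirm the image was produced and check its size:
# optional sanity check on the built image
ls -lh pytorch_container.sif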
# clear the apptainer cache which takes up a lot of space
rm -r ~/.apptainer/cache
# copy the image
mkdir ~/apptainer
cp pytorch_container.sif ~/apptainer/pytorch_container.sif
# clean up installation directory
cd ~
rm -r "/scratch/$USER/apptainer"
After this step has finished, log out from the compute node:
exit # logout from compute node
There are two ways to use the container: interactive and non-interactive. We will go through examples of both.
First, allocate a compute node in interactive mode:
srun -p dev-gpu -G 1 --pty bash
Next, enter the container and activate the python environment we installed:
# enter the container:
# --nv is needed for gpu access
# the --mount flags mount all of the NFS directories inside the container
apptainer shell --mount type=bind,src=/scratch,dst=/scratch --mount type=bind,src=/share,dst=/share --nv ~/apptainer/pytorch_container.sif
# activate miniconda environment
eval "$(/conda_tmp/mc3/bin/conda 'shell.bash' 'hook')"
A simple test of our install is to run the python shell and see if we have GPU access:
python
import torch
a = torch.ones(1)
a.cuda()
If this works and a is on the GPU, all is well!
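Equivalently, a non-interactive one-liner run inside the container (with the conda environment activated) should print True if the GPU is visible:
# should print True if pytorch can see the GPU
python -c "import torch; print(torch.cuda.is_available())"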
To run the container in batch mode we will need two scripts: the first will be an sbatch script to launch the job on beehive, and the second will be a script that runs inside the container environment, containing all of the python code.
First, create a python file to test our python environment:
cd ~
echo 'import torch; a = torch.ones(1); print(a.cuda())' > test_pytorch.py
Then, create a shell script to run inside the container, with contents like below:
#!/bin/bash
eval "$(/conda_tmp/mc3/bin/conda 'shell.bash' 'hook')"
cd ~
python test_pytorch.py
which we can save to stump_script.sh.
Next, create an sbatch script to launch our job, which should look like:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gpus=1
#SBATCH --constraint=
cd ~
apptainer exec --mount type=bind,src=/scratch,dst=/scratch --mount type=bind,src=/share,dst=/share --nv ~/apptainer/pytorch_container.sif bash stump_script.sh
which we can save to sbatch_script.sh. Finally, launch the job:
sbatch sbatch_script.sh
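You can check on the job with standard Slurm commands; by default its output should land in a slurm-<jobid>.out file in the directory you submitted from, and if everything is working it should contain a tensor printed from the GPU (something like tensor([1.], device='cuda:0')).
# check whether the job is queued or running
squeue -u $USER
# once it finishes, inspect the output file (the job ID is printed by sbatch)
cat slurm-<jobid>.out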
A few final notes. You need --nv BOTH when installing the container software and when running the container, as it is what enables GPU access. Without --writable the container is read-only. This is intentional so that you cannot override software in the container by mistake, but there are certain instances, like rebuilding C++ extensions to python, for which you may need write access to the container. The cleanest solution to this is to specify temporary build paths on the scratch directory outside the container, and you may need to manually edit different packages to accomplish this; one sketch of such a workaround is shown below.
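As one example (specific to PyTorch's JIT-compiled C++ extensions, and assuming your PyTorch version respects the TORCH_EXTENSIONS_DIR environment variable), you can redirect the extension build directory to scratch before running python inside the read-only container:
# (example) redirect pytorch's JIT C++ extension builds to writable scratch space
mkdir -p "/scratch/$USER/torch_extensions"
export TORCH_EXTENSIONS_DIR="/scratch/$USER/torch_extensions"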
If apt upgrade fails due to locale issues, or --fakeroot cannot be started due to GLIBC issues, then the base container you're using may have underlying problems. The one we choose here is tested as of March 2024.