Apptainer installation instructions

David Yunis, March 2024

Apptainer is a container system, like Docker but without the nasty permission and networking issues. Using an Apptainer container is like working in a frozen, stable OS combined with a python virtual environment, while your user permissions stay the same on the filesystem outside the container. Below we will run through sample creation and usage of Apptainer containers.

Installing Apptainer

Per Adam, Apptainer should be available on all compute nodes, so first log in to beehive:

ssh [user]@beehive.ttic.edu

then request a compute node (with a GPU as we will install GPU software):

srun -p dev-gpu -G 1 --pty bash

Building a container

First make a directory on scratch for container building:

mkdir -p "/scratch/$USER/apptainer"
cd "/scratch/$USER/apptainer"

Now pull an existing image into a new sandbox folder for building the container. We will use a stock Docker image with CUDA 12.3 support. We name this container pytorch_container, but you can replace that name as you like:

# this step pulls the docker image
apptainer build --sandbox pytorch_container docker://nvidia/cuda:12.3.2-cudnn9-devel-ubuntu22.04

After getting the base image, we need to start up the container in writable mode to install the software we want:

# --writable allows for writing to the container, necessary for installing new software
# --no-home prevents issues with NFS home directories and writable containers
# --nv allows for gpu support
# --fakeroot is required for installing packages in the container as root, in particular using apt
apptainer shell --no-home --nv --writable --fakeroot pytorch_container

Now we should have a shell inside the container, where we will install software. In general the flow should be apt packages first, then conda packages, and finally pip packages.

# make sure package lists are current
apt update
apt upgrade

# install generally useful packages
apt install git
apt install wget
apt install vim

# create a location inside the container for a miniconda installation
mkdir /conda_tmp
cd /conda_tmp

# from beehive docs, install miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /conda_tmp/mc3
rm Miniconda3-latest-Linux-x86_64.sh
eval "$(/conda_tmp/mc3/bin/conda 'shell.bash' 'hook')"

# install pytorch with cuda 12.1 support
conda install -c pytorch -c nvidia pytorch pytorch-cuda=12.1 torchvision torchaudio

# install any other python packages needed

# exit the container
exit

After we install software in the container in writable mode, we need to build the container into an image so that we can use it for running software in read-only mode. To do this, run the following outside the container (this step will take a while):

apptainer build pytorch_container.sif pytorch_container

After the build has finished, we should have a single image file that is the whole container. Typically this file is quite large (~10G). Copy it to somewhere on NFS so that we can reuse the container across different nodes. We advise somewhere in /share, since it has more space than the home directory (which only has 20G), but below we use the home directory as everyone has one.

# clear the apptainer cache which takes up a lot of space
rm -r ~/.apptainer/cache

# copy the image
mkdir -p ~/apptainer
cp pytorch_container.sif ~/apptainer/pytorch_container.sif

# clean up installation directory
cd ~
rm -r "/scratch/$USER/apptainer"

After this step has finished, log out of the compute node:

exit  # log out of the compute node

Running the container

There are two ways to use the container, interactive and batch. We will go through examples of both.

Interactive mode

First, allocate a compute node in interactive mode:

srun -p dev-gpu -G 1 --pty bash

Next, enter the container and activate the python environment we installed:

# enter the container:
# --nv is needed for gpu access
# the --mount flags bind the NFS directories /scratch and /share inside the container
apptainer shell --mount type=bind,src=/scratch,dst=/scratch --mount type=bind,src=/share,dst=/share --nv ~/apptainer/pytorch_container.sif
# activate miniconda environment
eval "$(/conda_tmp/mc3/bin/conda 'shell.bash' 'hook')"

A simple test of our install is to run the python shell and see if we have GPU access:

python
import torch
a = torch.ones(1)
a.cuda()

If this works and the tensor a is moved to the GPU, all is well!
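The same check can also be run without opening an interactive shell, by passing a one-liner to apptainer exec (a sketch, using the image path and miniconda location set up above; here we call the miniconda python binary directly instead of activating the environment first):

```shell
# run a one-line pytorch GPU check inside the container, non-interactively
apptainer exec --mount type=bind,src=/scratch,dst=/scratch \
    --mount type=bind,src=/share,dst=/share --nv \
    ~/apptainer/pytorch_container.sif \
    /conda_tmp/mc3/bin/python -c "import torch; print(torch.cuda.is_available())"
```

If the install is healthy, this should print True on a GPU node.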

Batch mode

To run the container in batch mode we will need two scripts: the first will be an sbatch script to launch the job on beehive, and the second will be a script that runs inside the container environment, containing all of the python code.

First, create a python file to test our python environment:

cd ~
echo 'import torch; a = torch.ones(1); print(a.cuda())' > test_pytorch.py

Then, create a shell script to run inside the container, with contents like below:

#!/bin/bash
eval "$(/conda_tmp/mc3/bin/conda 'shell.bash' 'hook')"
cd ~
python test_pytorch.py

which we can save to stump_script.sh
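For reference, one way to create this file directly from the shell is with a heredoc (a sketch, assuming the script lives in the home directory as in the sbatch script below):

```shell
# write the container-side script to the home directory
cat > ~/stump_script.sh <<'EOF'
#!/bin/bash
eval "$(/conda_tmp/mc3/bin/conda 'shell.bash' 'hook')"
cd ~
python test_pytorch.py
EOF
```

The quoted 'EOF' delimiter prevents the shell from expanding the $(...) line while writing the file, so it is expanded only when the script runs inside the container.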

Next, create an sbatch script to launch our job, which should look like:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gpus=1
# optionally add a line like #SBATCH --constraint=<feature> to request specific node features

cd ~
apptainer exec --mount type=bind,src=/scratch,dst=/scratch --mount type=bind,src=/share,dst=/share --nv ~/apptainer/pytorch_container.sif bash stump_script.sh

which we can save to sbatch_script.sh

Finally, launch the job:

sbatch sbatch_script.sh
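Once submitted, the job can be monitored with standard Slurm commands; by default the job's output lands in a slurm-<jobid>.out file in the directory where sbatch was run:

```shell
squeue -u $USER   # check whether the job is pending or running
cat slurm-*.out   # inspect the job's output once it has finished
```

If the setup above worked, the output file should contain a one-element tensor on the GPU.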

Encountered issues