Beehive Documentation

Last updated: 06/07/2025, 03:23:00 PM

Job Submission and Scheduling

Access and Scheduling

Partitions

Beehive has several types of partitions with different access levels and time limits:

Default Partitions

  • dev-cpu: Development partition for CPU jobs (1 hour limit)
  • dev-gpu: Development partition for GPU jobs (1 hour limit)
  • cpu: Standard CPU partition (8 hour limit)
  • gpu: Standard GPU partition (8 hour limit)
  • low-gpu: Lower priority standard GPU partition (8 hour limit)
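
To list the partitions with their time limits, node counts, and GPUs as Slurm reports them, you can query the scheduler directly (standard sinfo usage; the output columns chosen here are just one reasonable selection):

sinfo -o "%P %l %D %G"   # partition, time limit, node count, GPUs (GRES)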

Fair-share Scheduling

Beehive uses Slurm's fair-share scheduling with multifactor priority, which means:

  • Job priority considers your recent usage, group allocations, and partition-specific weights
  • Fair-share half-life is 8 hours - usage impact on priority decays by 50% every 8 hours
  • The fair-share component has a very high weight (500000) compared to other factors
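
To see how your recent usage is affecting your priority, you can inspect Slurm's accounting and priority data (standard commands, assuming the fair-share accounting described above):

sshare -u $USER   # your raw usage and fair-share factor
sprio -u $USER    # per-factor priority breakdown of your pending jobs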

Research Group Partitions

Several research groups have partitions with increased priority:

Partition        Priority   Group
greg-gpu         2863       greg-cluster
willett-gpu      2009       willett-group
nati-gpu         1406       nati-group
speech-gpu       1212       speech-cluster
mmaire-gpu       708        vision-mmaire
mcallester-gpu   613        mcallester-group
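
Members of these groups submit to their partition the same way as to the standard partitions, for example (willett-gpu here is only an illustration; substitute your own group's partition):

sbatch --partition=willett-gpu my_job_script.sh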

Guaranteed Access Nodes

Some nodes are reserved for specific research groups; jobs on them have no time limit:

  • bhattad-gpu: Node g5 (RTX 6000 Ada)
  • mm-gpu: Node g18 (RTX A4000)
  • lingxw-gpu: Node g17 (RTX A4000)
  • zhiyuanli-gpu: Nodes g20 and priv-g14 (RTX 6000 Ada and L40S)
  • yanhongli-own-gpu: Node g19 (RTX A6000)
  • richard1xur-own-gpu: Node g16 (RTX 6000 Ada)
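
Because jobs on these partitions are never cut off by a time limit, it is still good practice to set one yourself so a runaway job does not hold the node indefinitely; for example (the partition name is only an illustration, and access is restricted to the owning group):

sbatch --partition=zhiyuanli-gpu --time=3-00:00:00 my_job_script.sh   # 3-day limit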

Private Nodes

Many researchers have private nodes that are only accessible to their group:

  • greg-own-gpu: RTX A5000 GPUs
  • jiahao-own-gpu: RTX A5500 GPUs
  • jjery2243542-own-gpu: RTX 4000 Ada GPUs
  • marcelo-own-gpu: Quadro RTX 8000 GPUs
  • pushkarshukla-own-gpu: RTX 2080 Ti GPUs
  • savarese-own-gpu: GTX 1080 Ti GPUs
  • shesterg-own-gpu: RTX 2080 Ti GPUs
  • txu-own-gpu: Quadro RTX 6000 GPUs
  • takuma-own-gpu: RTX 2080 Ti GPUs
  • whc-own-gpu: RTX A5000 GPUs
  • xdu-own-gpu: Quadro RTX 6000 GPUs
  • zsun-own-gpu: RTX 6000 Ada GPUs

Job Submission

Basic Job Submission

To submit a job to the cluster, use the sbatch command:

sbatch my_job_script.sh
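
sbatch prints the job ID on submission. You can then monitor or cancel the job with the usual Slurm commands:

squeue -u $USER    # list your pending and running jobs
squeue -j <jobid>  # check one specific job
scancel <jobid>    # cancel a job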

Requesting Specific GPUs

Beehive supports specifying GPU types, which is important for jobs requiring particular architectures or memory sizes:

#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --output=output_%j.log
#SBATCH --partition=gpu
#SBATCH --gres=gpu:nvidia_rtx_a6000:1  # Request a specific GPU type
#SBATCH --cpus-per-task=4

python train.py

Available GPU Types

You can request specific GPU models using the following identifiers:

nvidia_geforce_gtx_1080_ti      # 11GB VRAM (Pascal)
nvidia_geforce_rtx_2080_ti      # 11GB VRAM (Turing)
nvidia_titan_v                  # 12GB VRAM (Volta)
nvidia_rtx_a4000                # 16GB VRAM (Ampere)
nvidia_rtx_a5000                # 24GB VRAM (Ampere)
nvidia_rtx_a5500                # 24GB VRAM (Ampere)
nvidia_rtx_a6000                # 48GB VRAM (Ampere)
nvidia_rtx_4000_ada_generation  # 20GB VRAM (Ada)
nvidia_rtx_5000_ada_generation  # 32GB VRAM (Ada)
nvidia_rtx_6000_ada_generation  # 48GB VRAM (Ada)
nvidia_l40s                     # 48GB VRAM (Ada)
quadro_rtx_6000                 # 24GB VRAM (Turing)
quadro_rtx_8000                 # 48GB VRAM (Turing)
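
To check which GPU types are present on which nodes as Slurm sees them, you can query the scheduler; the exact labels come from the cluster configuration, and g18 below is just an example node from the lists above:

sinfo -p gpu -N -o "%N %G"                                   # GPU type (GRES) on each node of the gpu partition
scontrol show node g18 | grep -E "Gres|AvailableFeatures"    # GRES and feature tags for a single node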

GPU Architecture Features

You can also request GPUs based on architecture features:

#SBATCH --constraint=ada         # Ada architecture (newest)
#SBATCH --constraint=ampere      # Ampere architecture
#SBATCH --constraint=turing      # Turing architecture
#SBATCH --constraint=volta       # Volta architecture
#SBATCH --constraint=pascal      # Pascal architecture (oldest)
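
Constraints can also be combined with Slurm's OR syntax, so a job can accept more than one architecture (assuming the names above are defined as node features, which scontrol show node will confirm):

#SBATCH --constraint="ampere|ada"   # accept either an Ampere or an Ada GPU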

GPU Memory Constraints

Request GPUs with specific minimum memory sizes:

#SBATCH --constraint=48g        # GPUs with 48GB VRAM
#SBATCH --constraint=32g        # GPUs with 32GB VRAM
#SBATCH --constraint=24g        # GPUs with 24GB VRAM
#SBATCH --constraint=20g        # GPUs with 20GB VRAM
#SBATCH --constraint=16g        # GPUs with 16GB VRAM
#SBATCH --constraint=12g        # GPUs with 12GB VRAM
#SBATCH --constraint=11g        # GPUs with 11GB VRAM

Sample Job Scripts

Basic GPU Job

#!/bin/bash
#SBATCH --job-name=basic_gpu
#SBATCH --output=basic_%j.log
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4

nvidia-smi
python my_script.py

High-Memory GPU Job (A6000/6000 Ada/L40S)

#!/bin/bash
#SBATCH --job-name=high_mem_gpu
#SBATCH --output=high_mem_%j.log
#SBATCH --partition=gpu
#SBATCH --constraint=48g
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8

python train_large_model.py

Multi-GPU Training

#!/bin/bash
#SBATCH --job-name=multi_gpu
#SBATCH --output=multi_gpu_%j.log
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16

python distributed_training.py
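
How the four GPUs are used is up to your code; Slurm typically exposes the allocated devices to the job via CUDA_VISIBLE_DEVICES. If distributed_training.py uses PyTorch DDP, a single-node launch might look like the following sketch (assuming torchrun is available in your environment):

torchrun --standalone --nproc_per_node=4 distributed_training.py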

Development/Testing Job

#!/bin/bash
#SBATCH --job-name=dev_test
#SBATCH --output=dev_%j.log
#SBATCH --partition=dev-gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4

python debug_script.py
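
For interactive debugging you can also request a shell on the development partition with srun (standard Slurm usage, mirroring the resources in the script above):

srun --partition=dev-gpu --gres=gpu:1 --cpus-per-task=4 --pty bash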

Singleton Jobs (Preventing Duplicate Runs)

The singleton dependency is designed for managing a series of jobs that should run one after another. A job submitted with this dependency starts only after every previously submitted job with the same job name (and user) has finished, so at most one instance of that job name runs at a time:

#!/bin/bash
#SBATCH --job-name=my_job_series
#SBATCH --output=series_%j.log
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --dependency=singleton

# Process one chunk of data (the chunk index is passed as the first script argument)
python process_data_chunk.py --chunk=$1

To submit a series of jobs that will execute sequentially:

# Submit 10 jobs that will run one after another
for i in {1..10}; do
  sbatch --job-name=data_processing_series job_script.sh $i
done
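
While the series runs, one job should be RUNNING and the rest PENDING with reason (Dependency); you can check this with:

squeue -u $USER --name=data_processing_series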

Note: The singleton dependency is intended for serializing job series like this and should not be repurposed for other uses. All jobs in the series must be submitted with the same job name for the dependency to work correctly. This is a simple way to process a large dataset in manageable chunks while never holding more than one job's worth of resources at a time.

You can also combine singleton with array jobs:

#!/bin/bash
#SBATCH --job-name=sequential_array
#SBATCH --output=array_%A_%a.log
#SBATCH --array=1-100
#SBATCH --dependency=singleton
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1

# Process array tasks sequentially
python process.py --task=${SLURM_ARRAY_TASK_ID}
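
If you only need to limit how many array tasks run at once, Slurm's array throttle syntax does this without any dependency; for example, the following allows at most one task at a time:

#SBATCH --array=1-100%1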

Checkpointing for Long Jobs

The standard partitions have an 8-hour time limit. For jobs that need to run longer, implement checkpointing in your code so a resubmitted job can pick up where the previous one left off. The example below shows how to save and load checkpoints with PyTorch:

PyTorch Checkpointing Example

Here's a complete example showing how to implement checkpointing in PyTorch:

import os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import time

# Define your model
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(784, 512),
            nn.ReLU(),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.layers(x)

# Initialize model, optimizer, loss function
model = MyModel()
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Checkpoint directory
checkpoint_dir = './checkpoints'
os.makedirs(checkpoint_dir, exist_ok=True)
checkpoint_path = os.path.join(checkpoint_dir, 'model_checkpoint.pt')

# Training configuration
start_epoch = 0
total_epochs = 100
best_accuracy = 0.0

# Check if a checkpoint exists
if os.path.isfile(checkpoint_path):
    print(f"Loading checkpoint from {checkpoint_path}")
    checkpoint = torch.load(checkpoint_path)
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    start_epoch = checkpoint['epoch'] + 1
    best_accuracy = checkpoint['best_accuracy']
    print(f"Resuming from epoch {start_epoch} with best accuracy {best_accuracy:.4f}")
else:
    print("No checkpoint found, starting from scratch")

# Training loop with periodic checkpointing
for epoch in range(start_epoch, total_epochs):
    # Training code here...
    # ...

    # Calculate accuracy on validation set
    accuracy = 0.93  # This would be your actual validation accuracy

    # Save checkpoint periodically (every 5 epochs and when best accuracy improves)
    if (epoch + 1) % 5 == 0 or accuracy > best_accuracy:
        if accuracy > best_accuracy:
            best_accuracy = accuracy

        checkpoint = {
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'best_accuracy': best_accuracy,
            'training_time': time.time()  # Store current time for reference
        }

        # Save checkpoint
        torch.save(checkpoint, checkpoint_path)
        print(f"Checkpoint saved at epoch {epoch+1} with accuracy {accuracy:.4f}")

Job Script with Checkpointing

#!/bin/bash
#SBATCH --job-name=pytorch_training
#SBATCH --output=training_%j.log
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4

# Activate your environment if needed
source ~/venv/bin/activate

# Optional: queue a follow-up job now so training continues automatically if this
# job hits the 8-hour limit (a command placed after the training step would not run
# once Slurm terminates the job at its time limit)
sbatch --dependency=afterany:$SLURM_JOB_ID continue_training.sh

# Run the training script (it resumes from the latest checkpoint if one exists)
python train_with_checkpoints.py

Continue Training Script

To continue training automatically after the time limit, create a continue_training.sh script:

#!/bin/bash
#SBATCH --job-name=pytorch_continue
#SBATCH --output=continue_%j.log
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4

# Activate your environment if needed
source ~/venv/bin/activate

# Queue the next continuation first, for the same reason as above
sbatch --dependency=afterany:$SLURM_JOB_ID continue_training.sh

# Continue training from the saved checkpoint
python train_with_checkpoints.py
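
This creates an open-ended chain in which each continuation job submits another one. Once training is finished, cancel whatever is still queued so the chain stops, for example:

scancel --name=pytorch_continue   # cancels your queued jobs with this name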

Getting Help

For cluster-specific questions or issues, contact the TTIC system administrators.

For general Slurm usage questions, refer to the Slurm documentation or use the man pages:

man sbatch
man squeue
man sinfo