Using a GPU-Enabled Server for Running Scripts

This tutorial provides step-by-step instructions for using a server with GPU resources, managed by SLURM, to run scripts requiring GPU acceleration.

Prerequisites

Access to the Server: Ensure you have login credentials and access to the server.
Installed Software: Confirm that the software your script needs, such as Python, CUDA, and any other dependencies, is available on the server.
SLURM Basics: Familiarity with the SLURM workload manager, which is used for job scheduling.
SSH Client: Install an SSH client (e.g., PuTTY, OpenSSH) to access the server.

Step 1: Connect to the Server

Use SSH to log into the server:

ssh your_username@server_address

Replace your_username with your username and server_address with the server’s hostname or IP.

Step 2: Load Required Modules

Many servers use a module system to load software environments. Check available modules:

module avail

Load the modules your script needs, such as Python and CUDA. Module names and versions differ between servers, so use the names that module avail reports:

module load python/3.8
module load cuda/11.7
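
After loading modules, it can help to confirm that the expected tools are actually on your PATH. The following is a minimal sketch, not part of SLURM or the module system: check_tool is a hypothetical helper name, and the tool names in the usage comment (python, nvcc) are assumptions about what your modules provide.

```shell
#!/bin/bash
# check_tool is a hypothetical helper, not part of SLURM or the module system.
# It reports whether a command is on PATH and returns non-zero if it is not.
check_tool() {
    local tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found: $tool -> $(command -v "$tool")"
    else
        echo "missing: $tool (try 'module avail' to find a module that provides it)"
        return 1
    fi
}

# Example usage after 'module load' (tool names are assumptions):
#   check_tool python
#   check_tool nvcc
```

A missing tool usually means the module was not loaded in the current shell or provides a different command name than expected.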

Step 3: Prepare Your Script

Transfer your script and data files from your local machine to the server, using scp or an SFTP client:

scp your_script.py your_username@server_address:/path/to/destination

Navigate to the directory containing your script:

cd /path/to/destination

Step 4: Create a SLURM Batch Script

A SLURM batch script specifies the resources and commands required to run your job. Create a file, e.g., run_gpu_job.sh:

#!/bin/bash
#SBATCH --job-name=my_gpu_job         # Job name
#SBATCH --output=output_%j.log        # Standard output log (%j expands to the job ID)
#SBATCH --error=error_%j.log          # Standard error log
#SBATCH --partition=amd_a100n         # GPU partition name (site-specific)
#SBATCH --gres=gpu:1                  # Number of GPUs
#SBATCH --ntasks=1                    # Number of tasks
#SBATCH --cpus-per-task=4             # Number of CPU cores per task
#SBATCH --time=01:00:00               # Runtime (HH:MM:SS)
#SBATCH --mem=16G                     # Memory per node

# Load modules
module load python/3.8
module load cuda/11.7

# Activate virtual environment (if needed)
source /path/to/your/venv/bin/activate

# Run your script
python your_script.py

Save the script. Making it executable is optional, since sbatch does not require the execute bit, but it does no harm:

chmod +x run_gpu_job.sh
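
To make the logs easier to interpret, a batch script can record what was allocated before it starts the real work. The sketch below is one way to do that. report_job_env is a hypothetical helper name. SLURM_JOB_ID and SLURM_JOB_NODELIST are set by SLURM inside a job, and CUDA_VISIBLE_DEVICES is typically set when GPUs are allocated, although that depends on the site configuration.

```shell
#!/bin/bash
# report_job_env is a hypothetical helper name; paste it into the batch
# script and call it before the line that runs your Python script.
report_job_env() {
    echo "job id:       ${SLURM_JOB_ID:-not set}"
    echo "node list:    ${SLURM_JOB_NODELIST:-not set}"
    echo "visible GPUs: ${CUDA_VISIBLE_DEVICES:-not set}"
    # nvidia-smi -L lists the GPUs visible on this node, if the tool is installed.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L || echo "nvidia-smi could not query the GPUs"
    fi
}

# Example usage inside run_gpu_job.sh:
#   report_job_env
#   python your_script.py
```

If "visible GPUs" shows "not set" inside a job, the GPU request in the #SBATCH lines is the first thing to check.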

Step 5: Submit the Job

Submit the batch script to SLURM:

sbatch run_gpu_job.sh

SLURM will return a job ID. You can check the job’s status using:

squeue -u your_username
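
If you want to reuse the job ID in later commands, it can be captured at submission time instead of copied by hand. The sketch below assumes the sbatch --parsable option, which prints only the job ID (followed by a semicolon and the cluster name on multi-cluster setups). submit_job is a hypothetical helper name.

```shell
#!/bin/bash
# submit_job is a hypothetical helper name.
# sbatch --parsable prints only the job ID (plus ";cluster" on
# multi-cluster setups), which is easier to capture than the default message.
submit_job() {
    local script="$1"
    local out
    out=$(sbatch --parsable "$script") || return 1
    # Strip an optional ";cluster_name" suffix.
    echo "${out%%;*}"
}

# Example usage:
#   job_id=$(submit_job run_gpu_job.sh)
#   squeue -j "$job_id"
```

The function returns a non-zero status when sbatch rejects the submission, so a calling script can stop early.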

Step 6: Monitor the Job

While the job is running, you can monitor its progress:

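View Logs:
tail -f output_<jobID>.log
Check Job Status and GPU Usage:
squeue -u your_username
nvidia-smi
Note that nvidia-smi reports the GPUs of the machine it runs on, so run it on the compute node assigned to your job (shown in the NODELIST column of squeue) rather than on the login node, if your site permits access to that node.
Refresh the Queue Status Automatically: use watch to rerun squeue every second, which also keeps your terminal session active:
watch -n 1 squeue -u your_username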
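
Instead of watching the queue by hand, a small polling loop can block until the job leaves the queue. This is a minimal sketch under stated assumptions: wait_for_job is a hypothetical helper name, the 30-second default interval is arbitrary, and the loop relies on squeue -h (no header) and -j (select a job ID) printing nothing once the job has finished.

```shell
#!/bin/bash
# wait_for_job is a hypothetical helper name.
# It polls squeue until the job no longer appears in the queue.
# -h suppresses the header line and -j selects a single job ID.
wait_for_job() {
    local job_id="$1"
    local interval="${2:-30}"   # seconds between checks (assumed default)
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "job $job_id is no longer in the queue"
}

# Example usage:
#   wait_for_job 12345 60
```

Leaving the queue does not mean the job succeeded; check the output and error logs afterwards. Keep the polling interval generous, since frequent squeue calls add load to the scheduler.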

Step 7: Cancel a Job (if needed)

To cancel a job, use:

scancel <jobID>

Step 8: Retrieve Results

Once the job is complete, copy the results from the server to your local machine. Run this from your local machine, and add the -r flag if the results are in a directory:

scp your_username@server_address:/path/to/results /local/destination

Tips for Efficient Usage

Check Available Resources:
sinfo
This displays available partitions and their status.
Test Locally First: Test your script locally before submitting it to SLURM to minimize debugging time.
Use Virtual Environments: Maintain a Python virtual environment to manage dependencies.
Use GPUs Wisely: Request only the necessary number of GPUs to avoid over-allocation.
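
To narrow the sinfo output to partitions that offer GPUs, the listing can be filtered on the generic resources (GRES) column. The sketch below is an illustration under assumptions: list_gpu_partitions is a hypothetical helper name, and it assumes your site's GPU resources have "gpu" in their GRES name. The format fields are %P (partition), %G (generic resources), and %a (availability).

```shell
#!/bin/bash
# list_gpu_partitions is a hypothetical helper name.
# sinfo format fields: %P = partition, %G = generic resources (GRES),
# %a = availability. -h suppresses the header line.
list_gpu_partitions() {
    sinfo -h -o "%P %G %a" | awk '$2 ~ /gpu/ { print $1, $2, $3 }' | sort -u
}

# Example usage:
#   list_gpu_partitions
```

The partition names it prints are the values to use with the --partition option in the batch script.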

By following this tutorial, you can efficiently use a GPU-enabled server with SLURM to run your computationally intensive scripts. Always consult your server’s documentation for partition-specific settings and policies.