Author: Rebecca Vandewalle rcv3@illinois.edu
Created: 8-16-21
This notebook provides an example of running a Monte Carlo style computation using CyberGIS-Compute. CyberGIS-Compute is a service for running High Performance Computing (HPC) jobs from a Jupyter Notebook within CyberGISX. In this example, the FireABM simulation script is run twice, each time as a separate task. This small example demonstrates how to run a serial script with no built-in parallelization multiple times on CyberGIS-Compute, how to pass parameters from a notebook to CyberGIS-Compute, how to access standard HPC variables (such as node_ids) from within a CyberGIS-Compute job, and how to specify the correct working and results directories for running the job script and downloading the results. The goal of this example is to demonstrate how to use CyberGIS-Compute with few or no adjustments to the original serial script. The custom job in this notebook uses this repository: https://github.com/cybergis/cybergis-compute-fireabm.git .
The CyberGIS-Compute client is the middleware that makes it possible to access High Performance Computing (HPC) resources from within a CyberGISX Jupyter Notebook. The first cell loads the client if it has already been installed. If not, it first installs the client and then loads it.
# Try to load CyberGIS-Compute client
try:
    from cybergis_compute_client import CyberGISCompute
# If not already set up, install CyberGIS-Compute in the current Jupyter kernel, then load it
except ImportError:
    import sys
    !{sys.executable} -m pip install git+https://github.com/cybergis/job-supervisor-python-sdk.git@v2
    from cybergis_compute_client import CyberGISCompute
The custom repository used in this example is https://github.com/cybergis/cybergis-compute-fireabm.git .
This repo contains the following files:
manifest.json (https://github.com/cybergis/cybergis-compute-fireabm/blob/main/manifest.json) is a mandatory file. It must be a JSON file named manifest.json and must contain a JSON object of key-value pairs that are used by CyberGIS-Compute. In particular, the "name" value must be set, the "container" value must be set ("cybergisx-0.4" contains the same modules as a CyberGISX notebook at the time this tutorial notebook was created), and the "execution_stage" value must be set. In this case, "bash ./runjobs.sh" tells CyberGIS-Compute to run the shell script runjobs.sh when the job runs.
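For reference, a minimal manifest.json along these lines might look like the sketch below. The "name" value shown is illustrative, and the actual repository's manifest may contain additional keys.

{
  "name": "fireabm",
  "container": "cybergisx-0.4",
  "execution_stage": "bash ./runjobs.sh"
}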
runjobs.sh (https://github.com/cybergis/cybergis-compute-fireabm/blob/main/runjobs.sh) is a shell script that runs when a CyberGIS-Compute job is run. This script does the following actions (a sketch follows this list):

1. creates a $SEED variable value based on the $param_start_value (a value set when the job is constructed within this Notebook) and $SLURM_PROCID (the task ID, a built-in variable populated when the job runs on HPC)
2. accesses the $result_folder (a path set by the CyberGIS-Compute Client when the job is created)
3. runs the run_fireabm.py script with the $SEED value and the $result_folder value
4. copies result files into the $result_folder (note that for real examples, this task is better done in the post_processing_stage)
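As a rough illustration, a script following that outline could look like the sketch below. The seed arithmetic matches the description above, but the command-line flags passed to run_fireabm.py (-sd and -rpath here) are assumptions for illustration and may not match the actual script.

#!/bin/bash
# Hypothetical sketch of runjobs.sh; the run_fireabm.py flag names are illustrative

# Derive a unique seed per task: the start value (from the notebook) plus the task ID (from SLURM)
SEED=$(($param_start_value + $SLURM_PROCID))
echo "Task $SLURM_PROCID running with seed $SEED"

# Run the serial simulation, telling it where to write results
python run_fireabm.py -sd $SEED -rpath $result_folder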
Variables: This shell script uses variables and directories set in a few different places. The $SEED variable is created in runjobs.sh. The $param_start_value variable holds a value passed to the CyberGIS-Compute client from the notebook; it is set in the param dictionary within the .set() function in the next section of this notebook. $SLURM_PROCID is a built-in variable set on the HPC (other available variables can be found here: https://slurm.schedmd.com/srun.html#lbAJ).
Directories: The CyberGIS-Compute client uses two primary directories, which are set when the job is created. The paths to these directories can be accessed through environment variables. Although scripts are run in the $executable_folder, results should be written to the $result_folder. These folders are not in the same location, so you might need to adjust your primary script if it writes result files to the same folder as the script by default. In this example, the $result_folder variable is passed to the Python script, which requires an output path to use when writing results.
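For instance, if an original serial script writes its outputs next to itself, one light-touch adaptation is to accept an output directory argument and fall back to the script's own folder when none is given. The pattern below is a hypothetical sketch, not the actual run_fireabm.py interface:

# Hypothetical sketch: accept an output directory so results can go to $result_folder
import argparse
import os

parser = argparse.ArgumentParser()
# Default to the script's own folder if no output path is given (the original serial behavior)
parser.add_argument("-rpath", "--result_path",
                    default=os.path.dirname(os.path.abspath(__file__)))
args = parser.parse_args()

# Write results under the supplied path instead of alongside the script
with open(os.path.join(args.result_path, "results.txt"), "w") as f:
    f.write("simulation output goes here\n")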
Execution Stages: The CyberGIS-Compute client supports three stages: "pre_processing_stage", "execution_stage", and "post_processing_stage". Each is a key in the manifest.json file that expects a command to run as its value. An example of a manifest.json file that uses all three stages can be found here: https://github.com/cybergis/cybergis-compute-hello-world/blob/main/manifest.json . Ideally, clean-up tasks should be performed in the "post_processing_stage" to ensure that all tasks in the execution stage have finished before clean-up begins.
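As a sketch of how the three stages fit together (the linked hello-world manifest is the authoritative example), a manifest using all three stage keys could look like the following; the pre- and post-processing commands are placeholders:

{
  "name": "example_job",
  "container": "cybergisx-0.4",
  "pre_processing_stage": "echo 'preparing inputs'",
  "execution_stage": "bash ./runjobs.sh",
  "post_processing_stage": "echo 'cleaning up'"
}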
Other files and directories in the repo: The FireABM simulation needs some small input data files and a specific input directory structure. These files and directories are included in the GitHub repo and will be copied to the $executable_folder by the CyberGIS-Compute Client.
In the next step, a CyberGIS-Compute object and a job object are created. See this tutorial notebook for more details on the basic job creation process: https://cybergisxhub.cigi.illinois.edu/notebook/cybergis-compute-tutorial/ .
# Create a CyberGIS-Compute object
cybergis = CyberGISCompute(url="cgjobsup-dev.cigi.illinois.edu",
port=3030, protocol='HTTP', isJupyter=True)
Since this is a custom job, the maintainer will be "community_contribution".
# List available maintainers
cybergis.list_maintainer()
Each custom job requires a GitHub repository, which must be specified when the job is created. After the repository is created, the CyberGISX team must be contacted to review it and, if it is approved, add it to the list of repositories available to CyberGIS-Compute. In this case, the custom repository described above can be seen in the approved repositories list.
# List available git repositories
cybergis.list_git()
Now a 'community_contribution' job object can be created.
# Create base job object
demo_job = cybergis.create_job('community_contribution', hpc='keeling_community')
The .set() function can accept a slurm dictionary of keys that are used to set common HPC variables. The supported keys are listed below.
# Parameters that can be set
# slurm = {
# walltime?: string -> --time
# num_of_node?: number -> --nodes
# num_of_task?: number -> --ntasks
# cpu_per_task?: number -> --cpus-per-task
# memory?: string -> --mem
# memory_per_cpu?: string -> --mem-per-cpu
# memory_per_gpu?: string -> --mem-per-gpu
# gpus?: number -> --gpus
# gpus_per_node?: number | string -> --gpus-per-node
# gpus_per_socket?: number | string -> --gpus-per-socket
# gpus_per_task?: number | string -> --gpus-per-task
# partition?: string -> --partition
# mail_type?: string[] -> --mail-type (ex. "mail_type": ["END", "FAIL"])
# mail_user?: string[] -> --mail-user (ex. "mail_user": ["email@email.com"])
# }
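For example, a slurm dictionary combining several of these keys might look like the following (the values are illustrative):

# Illustrative slurm dictionary combining several supported keys
slurm = {
    "num_of_task": 4,                  # -> --ntasks
    "walltime": "30:00",               # -> --time
    "memory_per_cpu": "2G",            # -> --mem-per-cpu
    "mail_type": ["END", "FAIL"],      # -> --mail-type
    "mail_user": ["email@email.com"],  # -> --mail-user
}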
Now job-specific parameters are set for the job. The slurm dictionary sets HPC values. The param dictionary is used to set a custom variable required by the runjobs.sh shell script. Note that in the param dictionary the variable is set with the key start_value, which is accessed in runjobs.sh as $param_start_value.

The slurm "num_of_task" key value sets the number of tasks requested by the CyberGIS-Compute client when the job runs on HPC. This means that the runjobs.sh shell script will be run twice, once per task. In the runjobs.sh shell script, the $SLURM_PROCID variable, a unique ID given to each task, is used to differentiate between the two runs of the run_fireabm.py script.
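To make the effect concrete, here is a small illustration (runnable locally, not part of the job) of the seeds the two tasks would derive, assuming runjobs.sh adds the task ID to the start value as described above:

# Illustration only: how two tasks derive distinct seeds from one start value,
# assuming runjobs.sh computes SEED = param_start_value + SLURM_PROCID
local_start_value = 20
for task_id in range(2):  # SLURM_PROCID is 0 for the first task, 1 for the second
    print(f"task {task_id} -> seed {local_start_value + task_id}")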
# Set number of tasks and the starting value for the script
task_number = 2
local_start_value = 20
# Sets variables used by HPC
slurm = {
"num_of_task": task_number,
"walltime": "10:00",
}
# Sets specific parameters for the job
demo_job.set(executableFolder="git://fireabm",
param={"start_value": local_start_value}, slurm=slurm)
Now the job can be submitted.
# Submit job!
demo_job.submit()
Once the job has been submitted, the events() and logs() functions can be used to follow the job's progress.
# View job events
demo_job.events(liveOutput=True, refreshRateInSeconds=5)
# View job logs
demo_job.logs(liveOutput=True)
Once the job is complete, any results written to the $result_folder can be downloaded with the downloadResultFolder() function.
# Download results
outfile = demo_job.downloadResultFolder('./')
The results folder is downloaded as a .zip file. The following commands create a new folder to hold the results and unzip the downloaded .zip file into that folder.
# Create a folder for the results and unzip the results to the folder
!mkdir results_dir
!unzip -q -o $outfile -d results_dir
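If shell commands are unavailable in your environment, the same steps can be done with the Python standard library, for example:

# Pure-Python alternative to the shell commands above
import os
import zipfile

os.makedirs("results_dir", exist_ok=True)  # create the results folder
with zipfile.ZipFile(outfile) as zf:       # outfile is the path returned by downloadResultFolder()
    zf.extractall("results_dir")           # unzip into the new folder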
Finally, it can be useful to clean up what has been downloaded. The following lines remove the results folder and the downloaded .zip file.
# Run to clean up results directory
#!rm -r results_dir
# Run to clean up results zip file
#!rm $outfile
If you want to create a custom Monte Carlo style job, you will need to follow these steps: