CyberGIS-Compute FireABM Monte Carlo Notebook

Author: Rebecca Vandewalle rcv3@illinois.edu
Created: 8-16-21

This notebook provides an example of running a Monte Carlo style computation using CyberGIS-Compute. CyberGIS-Compute is a service for running High Performance Computing (HPC) jobs from a Jupyter Notebook within CyberGISX. In this example, the FireABM simulation script is run twice, once in each of two separate tasks. This small example demonstrates how to run a serial script with no built-in parallelization multiple times on CyberGIS-Compute, how to pass parameters from a notebook to CyberGIS-Compute, how to access standard HPC variables (such as node and task IDs) from within a CyberGIS-Compute job, and how to specify the correct working and results directories for running the job script and downloading the results. The goal of this example is to demonstrate how to use CyberGIS-Compute with little or no adjustment to the original serial script. The custom job in this notebook uses this repository: https://github.com/cybergis/cybergis-compute-fireabm.git .

Load the CyberGIS-Compute Client

The CyberGIS-Compute client is the middleware that makes it possible to access High Performance Computing (HPC) resources from within a CyberGISX Jupyter Notebook. The first cell loads the client if it has already been installed. If not, it first installs the client and then loads it.

In [1]:
# Try to load CyberGIS-Compute client

try:
    from cybergis_compute_client import CyberGISCompute

# If not already installed, install CyberGIS-Compute in the current Jupyter kernel
except ImportError:
    import sys
    !{sys.executable} -m pip install git+https://github.com/cybergis/job-supervisor-python-sdk.git@v2
    from cybergis_compute_client import CyberGISCompute

Prepare the GitHub Repository

The custom repository used in this example is https://github.com/cybergis/cybergis-compute-fireabm.git .

This repo contains the following files:

  • README.md: a readme to give information about the repo
  • manifest.json: a file that controls how the CyberGIS-Compute job is run
  • runjobs.sh: a shell script that creates needed directories and runs run_fireabm.py
  • run_fireabm.py: the top level python script that runs the simulation
  • other files and directories: contain data and functions needed to run the simulation

manifest.json (https://github.com/cybergis/cybergis-compute-fireabm/blob/main/manifest.json) is a mandatory file. It must be a JSON file named manifest.json and must contain a JSON object of key-value pairs that are used by CyberGIS-Compute. In particular, the "name" value must be set, the "container" must be set ("cybergisx-0.4" contains the same modules as a CyberGISX notebook at the time this tutorial notebook was created), and the "execution_stage" must be set. In this case, "bash ./runjobs.sh" tells CyberGIS-Compute to run the shell script runjobs.sh when the job runs.
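For illustration, a minimal manifest.json for a job like this one might look like the sketch below. The values shown are illustrative only; see the actual file at the link above for the full contents.

{
    "name": "hello FireABM",
    "container": "cybergisx-0.4",
    "execution_stage": "bash ./runjobs.sh"
}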

runjobs.sh (https://github.com/cybergis/cybergis-compute-fireabm/blob/main/runjobs.sh) is a shell script that runs when a CyberGIS-Compute Job is run. This script does the following actions:

  • sets a $SEED variable value based on the $param_start_value (a value set when the job is constructed within this notebook) and $SLURM_PROCID (the task ID, a built-in variable populated when the job runs on HPC)
  • creates a directory in the $result_folder (a path set by the CyberGIS-Compute Client when the job is created)
  • on one task only: copies files to the $result_folder
  • runs the python script run_fireabm.py (the serial starting script) passing in the $SEED value and the $result_folder value
  • on one task only: after the script is run, removes data files from the $result_folder (note that for real examples, this step is better done in the post_processing_stage)

Variables: This shell script uses variables and directories set in a few different places. The $SEED variable is created in runjobs.sh. The $param_start_value is a value passed to the CyberGIS-Compute client from a notebook; it is set in the param dictionary within the .set() function in the next section of this notebook. $SLURM_PROCID is a built-in variable set on the HPC (other available variables can be found here: https://slurm.schedmd.com/srun.html#lbAJ).
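To make the parameter flow concrete, the core of such a shell script might look like the simplified sketch below. This is not the actual runjobs.sh: the "output" subdirectory name and the arguments passed to run_fireabm.py are placeholders, and the real script in the repository defines its own interface.

# derive a unique seed for this task from the notebook parameter and the SLURM task ID
SEED=$(($param_start_value + $SLURM_PROCID))

# write results to the result folder, not the executable folder ("output" is a placeholder name)
mkdir -p $result_folder/output

# run the serial simulation once per task, passing the seed and the output location
# (argument order and names are placeholders; the real script defines its own interface)
python run_fireabm.py $SEED $result_folder/output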

Directories: The CyberGIS-Compute client uses two primary directories, which are set when the job is created. The paths to these directories can be accessed through environment variables. Although scripts are run in the $executable_folder, results should be written to the $result_folder. These folders are not in the same location, so you might need to adjust your primary script if by default it writes result files to the same folder as the script. In this example, the $result_folder variable is passed to the python script, which requires an output path for writing results.

Execution Stages: The CyberGIS-Compute client supports three stages: "pre_processing_stage", "execution_stage", and "post_processing_stage". Each is a key in the manifest.json file whose value is a command to run. An example of a manifest.json file that uses all three stages can be found here: https://github.com/cybergis/cybergis-compute-hello-world/blob/main/manifest.json . Ideally, clean-up tasks should be performed in the "post_processing_stage" to ensure that all tasks in the execution stage have finished before clean-up begins.

Other files and directories in the repo: The FireABM simulation needs some small input data files and a specific input directory structure. These files and directories are included in the GitHub repo and will be copied to the $executable_folder by the CyberGIS-Compute Client.

Setup the CyberGIS-Compute Job

In the next step, a CyberGIS-Compute object and a job object are created. See this tutorial notebook for more details on the basic job creation process: https://cybergisxhub.cigi.illinois.edu/notebook/cybergis-compute-tutorial/ .

In [2]:
# Create a CyberGIS-Compute object

cybergis = CyberGISCompute(url="cgjobsup-dev.cigi.illinois.edu", 
                           port=3030, protocol='HTTP', isJupyter=True)

Since this is a custom job, the maintainer will be "community_contribution".

In [3]:
# List available maintainers

cybergis.list_maintainer()
maintainer               hpc                                          default_hpc        job_pool_capacity  executable_folder->from_user  executable_folder->must_have
hello_world_singularity  ['keeling_community']                        keeling_community  5                  False                         not specified
community_contribution   ['keeling_community', 'bridges_community']   keeling_community  5                  True                          not specified

Each custom job requires a GitHub repository to be created and specified when the job is created. After the GitHub repository is created, the CyberGISX team must be contacted to review the repository and, if it is approved, add it to the list of repositories that can be used with CyberGIS-Compute. In this case, the custom repository described above can be seen in the approved repositories list.

In [4]:
# List available git repositories

cybergis.list_git()
link                           name                            container      repository                                                                 commit
git://spatial_access_covid-19  COVID-19 spatial accessibility  python         https://github.com/cybergis/cybergis-compute-spatial-access-covid-19.git
git://hello_world              hello world                     python         https://github.com/cybergis/cybergis-compute-hello-world.git
git://fireabm                  hello FireABM                   cybergisx-0.4  https://github.com/cybergis/cybergis-compute-fireabm.git
git://bridge_hello_world       hello world                     python         https://github.com/cybergis/CyberGIS-Compute-Bridges-2.git

Now a 'community_contribution' job object can be created.

In [5]:
# Create base job object

demo_job = cybergis.create_job('community_contribution', hpc='keeling_community')
📃 created constructor file [job_constructor_16291420382ZGRa.json]

Run the CyberGIS-Compute Job

The .set() function accepts a slurm dictionary whose keys set common HPC (SLURM) options. Supported keys are listed below.

In [6]:
# Parameters that can be set

# slurm = {
#    walltime?: string -> --time
#    num_of_node?: number -> --nodes
#    num_of_task?: number -> --ntasks
#    cpu_per_task?: number -> --cpus-per-task
#    memory?: string -> --mem
#    memory_per_cpu?: string -> --mem-per-cpu
#    memory_per_gpu?: string -> --mem-per-gpu
#    gpus?: number -> --gpus
#    gpus_per_node?: number | string -> --gpus-per-node
#    gpus_per_socket?: number | string -> --gpus-per-socket
#    gpus_per_task?: number | string -> --gpus-per-task
#    partition?: string -> --partition
#    mail_type?: string[] -> --mail-type (ex. "mail_type": ["END", "FAIL"])
#    mail_user?: string[] -> --mail-user (ex. "mail_user": ["email@email.com"])
# }

Now job-specific parameters are set for the job. The slurm dictionary sets HPC values. The param dictionary is used to set a custom variable required by the runjobs.sh shell script. Note that in the param dictionary, the value is assigned to the key start_value, which is accessed in runjobs.sh as $param_start_value.

The slurm "num_of_task" key value sets the number of tasks requested by the CyberGIS-Compute client when the job runs on HPC. This means that the runjobs.sh shell script will be run twice, once per each task. In the runjobs.sh shell script, the #SLURM_PROCID variable, a unique id that is given to each task, is used to differentiate between the two times the run_fireabm.py script is run.

In [7]:
# Set number of tasks and the starting value for the script

task_number = 2
local_start_value = 20

# Sets variables used by HPC

slurm = {
    "num_of_task": task_number,
    "walltime": "10:00",
}

# Sets specific parameters for the job

demo_job.set(executableFolder="git://fireabm", 
             param={"start_value": local_start_value}, slurm=slurm)
{'param': {'start_value': 20}, 'env': {}, 'slurm': {'num_of_task': 2, 'walltime': '10:00'}, 'executableFolder': 'git://fireabm'}

Now the job can be submitted.

In [8]:
# Submit job!

demo_job.submit()
✅ job submitted
id:               16291420382ZGRa
maintainer:       community_contribution
hpc:              keeling_community
executableFolder: git://fireabm
param:            {"start_value": 20}
slurm:            {"num_of_task": 2, "walltime": "10:00"}
time:             2021-08-16T14:27:18.000Z
Out[8]:
<cybergis_compute_client.Job.Job at 0x7fc084db2450>

View and Download the CyberGIS-Compute Job Results

Once the job has been submitted, the events() and the logs() functions can be used to follow the job progress.

In [9]:
# View job events

demo_job.events(liveOutput=True, refreshRateInSeconds=5)
📮 Job ID: 16291420382ZGRa
💻 HPC: keeling_community
🤖 Maintainer: community_contribution
types message time
JOB_QUEUED job [16291420382ZGRa] is queued, waiting for registration 2021-08-16T14:27:18.000Z
JOB_REGISTERED job [16291420382ZGRa] is registered with the supervisor, waiting for initialization 2021-08-16T14:27:21.000Z
SLURM_UPLOAD uploading files 2021-08-16T14:27:27.000Z
SSH_UNZIP unzipping /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/executable.zip to /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/executable 2021-08-16T14:27:27.000Z
SSH_RM removing /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/executable.zip 2021-08-16T14:27:27.000Z
SSH_CREATE_FILE create file to /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/executable/job.json 2021-08-16T14:27:27.000Z
SLURM_MKDIR_RESULT creating result folder 2021-08-16T14:27:27.000Z
SLURM_SUBMIT submitting slurm job 2021-08-16T14:27:27.000Z
JOB_INIT job [16291420382ZGRa] is initialized, waiting for job completion 2021-08-16T14:27:27.000Z
SSH_ZIP zipping /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/result to /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/result.zip 2021-08-16T14:28:01.000Z
SSH_SCP_DOWNLOAD get file from /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/result to /job_supervisor/data/root/1629142042S1eT 2021-08-16T14:28:01.000Z
SSH_RM removing /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/result.zip 2021-08-16T14:28:01.000Z
JOB_ENDED job [16291420382ZGRa] finished 2021-08-16T14:28:01.000Z
In [10]:
# View job logs

demo_job.logs(liveOutput=True)
📮 Job ID: 16291420382ZGRa
💻 HPC: keeling_community
🤖 Maintainer: community_contribution
message time
node id: 0, task id: 1, start number: 20, SEED: 21, result folder: /16291420382ZGRa/result /16291420382ZGRa/executable node id: 0, task id: 0, start number: 20, SEED: 20, result folder: /16291420382ZGRa/result /16291420382ZGRa/executable copying over files using FireABM_opt !! starting file parse at: 14:27:38 using FireABM_opt !! starting file parse at: 14:27:38 !! Working Directory: /16291420382ZGRa/executable !! checking input parameters !! Working Directory: /16291420382ZGRa/executab...[download for full log] 2021-08-16T14:28:01.000Z

Once the job is complete, any results written to the $result_folder can be downloaded with the downloadResultFolder() function.

In [11]:
# Download results

outfile = demo_job.downloadResultFolder('./')
file successfully downloaded under: ./1629142042S1eT.zip

The results folder is downloaded as a .zip file. The following commands create a new folder to hold the results and unzip the downloaded .zip file into that folder.

In [12]:
# Create a folder for the results and unzip the results to the folder

!mkdir -p results_dir
!unzip -q -o $outfile -d results_dir

Clean up

Finally, it can be useful to clean up what has been downloaded. Uncommenting and running the following lines removes the results folder and the downloaded .zip file.

In [14]:
# Run to clean up results directory

#!rm -r results_dir

# Run to clean up results zip file

#!rm $outfile 

Steps for Creating Your Own Custom Job

If you want to create a custom Monte Carlo style job, you will need to follow these steps:

  1. Determine what script you want to run.
  2. Create a GitHub repository containing the script and any data needed for it to run.
  3. Create a shell script to create any needed directories and run the script based on input parameters.
  4. Create a manifest.json file containing the job information and specifying which top level script to run.
  5. Contact the CyberGIS team to submit your GitHub repository for approval.
  6. Once your GitHub repository has been approved, attempt to run your job from a notebook.
  7. Look at the job.stdout, job.stderr, and output files for any errors. If there are errors, you can make changes to the files in your GitHub repository and try to run the job again until it runs correctly.