CyberGIS-Compute FireABM Monte Carlo Notebook

Author: Rebecca Vandewalle rcv3@illinois.edu
Created: 8-16-21

This notebook provides an example of running a Monte Carlo style computation using CyberGIS-Compute. CyberGIS-Compute is a service for running High Performance Computing (HPC) jobs from a Jupyter Notebook within CyberGISX. In this example, the FireABM simulation script is run twice, once in each of two separate tasks. This small example demonstrates how to run a serial script with no built-in parallelization multiple times on CyberGIS-Compute, how to pass parameters from a notebook to CyberGIS-Compute, how to access standard HPC variables (such as node and task IDs) from within a CyberGIS-Compute job, and how to specify the correct working and results directories for running the job script and downloading the results. The goal of this example is to demonstrate how to use CyberGIS-Compute with little or no adjustment to the original serial script. The custom job in this notebook uses this repository: https://github.com/cybergis/cybergis-compute-fireabm.git .

Load the CyberGIS-Compute Client

The CyberGIS-Compute client is the middleware that makes it possible to access High Performance Computing (HPC) resources from within a CyberGISX Jupyter Notebook. The first cell loads the client if it has already been installed. If not, it first installs the client and then loads it.

In [1]:
# Try to load CyberGIS-Compute client

try:
    from cybergis_compute_client import CyberGISCompute

# If not already installed, install CyberGIS-Compute in the current Jupyter kernel
except ImportError:
    import sys
    !{sys.executable} -m pip install git+https://github.com/cybergis/job-supervisor-python-sdk.git@v2
    from cybergis_compute_client import CyberGISCompute

Prepare the GitHub Repository

The custom repository used in this example is https://github.com/cybergis/cybergis-compute-fireabm.git .

This repo contains the following files:

  • README.md: a readme to give information about the repo
  • manifest.json: a file that controls how the CyberGIS-Compute job is run
  • runjobs.sh: a shell script that creates needed directories and runs run_fireabm.py
  • run_fireabm.py: the top level python script that runs the simulation
  • other files and directories: contain data and functions needed to run the simulation

manifest.json (https://github.com/cybergis/cybergis-compute-fireabm/blob/main/manifest.json) is a mandatory file. It must be a JSON file named manifest.json and must contain a JSON object of key-value pairs that are used by CyberGIS-Compute. In particular, the "name" value must be set, the "container" must be set ("cybergisx-0.4" contains the same modules as a CyberGISX notebook at the time this tutorial notebook was created), and the "execution_stage" must be set. In this case, "bash ./runjobs.sh" tells CyberGIS-Compute to run the shell script runjobs.sh when the job runs.
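For illustration, a minimal manifest.json for a job like this one might look like the sketch below. The values shown are illustrative only; see the actual file at the link above for the full contents.

{
    "name": "hello FireABM",
    "container": "cybergisx-0.4",
    "execution_stage": "bash ./runjobs.sh"
}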

runjobs.sh (https://github.com/cybergis/cybergis-compute-fireabm/blob/main/runjobs.sh) is a shell script that runs when a CyberGIS-Compute Job is run. This script does the following actions:

  • sets a $SEED variable value based on the $param_start_value (a value set when the job is constructed within this notebook) and $SLURM_PROCID (the task ID, a built-in variable populated when the job runs on HPC)
  • creates a directory in the $result_folder (a path set by the CyberGIS-Compute Client when the job is created)
  • on one task only: copies files to the $result_folder
  • runs the python script run_fireabm.py (the serial starting script) passing in the $SEED value and the $result_folder value
  • on one task only: after the script is run, removes data files from the $result_folder (note that for real examples, this step is better done in the post_processing_stage)

Variables: This shell script uses variables and directories set in a few different places. The $SEED variable is created in runjobs.sh. The $param_start_value is a value passed to the CyberGIS-Compute client from a notebook; it is set in the param dictionary within the .set() function in the next section of this notebook. $SLURM_PROCID is a built-in variable set on the HPC (other available variables can be found here: https://slurm.schedmd.com/srun.html#lbAJ).
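To make the parameter flow concrete, the core of such a shell script might look like the simplified sketch below. This is not the actual runjobs.sh: the "output" subdirectory name and the arguments passed to run_fireabm.py are placeholders, and the real script in the repository defines its own interface.

# derive a unique seed for this task from the notebook parameter and the SLURM task ID
SEED=$(($param_start_value + $SLURM_PROCID))

# write results to the result folder, not the executable folder ("output" is a placeholder name)
mkdir -p $result_folder/output

# run the serial simulation once per task, passing the seed and the output location
# (argument order and names are placeholders; the real script defines its own interface)
python run_fireabm.py $SEED $result_folder/output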

Directories: The CyberGIS-Compute client uses two primary directories, which are set when the job is created. The paths to these directories can be accessed through environment variables. Although scripts are run in the $executable_folder, results should be written to the $result_folder. These folders are not in the same location, so you might need to adjust your primary script if by default it writes result files to the same folder as the script. In this example, the $result_folder variable is passed to the python script, which requires an output path for writing results.

Execution Stages: The CyberGIS-Compute client supports three stages: "pre_processing_stage", "execution_stage", and "post_processing_stage". Each is a key in the manifest.json file whose value is a command to run. An example of a manifest.json file that uses all three stages can be found here: https://github.com/cybergis/cybergis-compute-hello-world/blob/main/manifest.json . Ideally, clean-up tasks should be performed in the "post_processing_stage" to ensure that all tasks in the execution stage have finished before clean-up begins.

Other files and directories in the repo: The FireABM simulation needs some small input data files and a specific input directory structure. These files and directories are included in the GitHub repo and will be copied to the $executable_folder by the CyberGIS-Compute Client.

Setup the CyberGIS-Compute Job

In the next step, a CyberGIS-Compute object and a job object are created. See this tutorial notebook for more details on the basic job creation process: https://cybergisxhub.cigi.illinois.edu/notebook/cybergis-compute-tutorial/ .

In [2]:
# Create a CyberGIS-Compute object

cybergis = CyberGISCompute(url="cgjobsup-dev.cigi.illinois.edu", 
                           port=3030, protocol='HTTP', isJupyter=True)

Since this is a custom job, the maintainer will be "community_contribution".

In [3]:
# List available maintainers

cybergis.list_maintainer()
maintainer               hpc                                          default_hpc        job_pool_capacity  executable_folder->from_user  executable_folder->must_have
hello_world_singularity  ['keeling_community']                        keeling_community  5                  False                         not specified
community_contribution   ['keeling_community', 'bridges_community']   keeling_community  5                  True                          not specified

Each custom job requires a GitHub repository to be created and specified when the job is created. After the GitHub repository is created, the CyberGISX team must be contacted to review the repository and, if it is approved, add it to the list of repositories that can be used with CyberGIS-Compute. In this case, the custom repository described above can be seen in the approved repositories list.

In [4]:
# List available git repositories

cybergis.list_git()
link                           name                            container      repository                                                                 commit
git://spatial_access_covid-19  COVID-19 spatial accessibility  python         https://github.com/cybergis/cybergis-compute-spatial-access-covid-19.git
git://hello_world              hello world                     python         https://github.com/cybergis/cybergis-compute-hello-world.git
git://fireabm                  hello FireABM                   cybergisx-0.4  https://github.com/cybergis/cybergis-compute-fireabm.git
git://bridge_hello_world       hello world                     python         https://github.com/cybergis/CyberGIS-Compute-Bridges-2.git

Now a 'community_contribution' job object can be created.

In [5]:
# Create base job object

demo_job = cybergis.create_job('community_contribution', hpc='keeling_community')
📃 created constructor file [job_constructor_16291420382ZGRa.json]

Run the CyberGIS-Compute Job

The .set() function accepts a slurm dictionary whose keys set common HPC (SLURM) options. Supported keys are listed below.

In [6]:
# Parameters that can be set

# slurm = {
#    walltime?: string -> --time
#    num_of_node?: number -> --nodes
#    num_of_task?: number -> --ntasks
#    cpu_per_task?: number -> --cpus-per-task
#    memory?: string -> --mem
#    memory_per_cpu?: string -> --mem-per-cpu
#    memory_per_gpu?: string -> --mem-per-gpu
#    gpus?: number -> --gpus
#    gpus_per_node?: number | string -> --gpus-per-node
#    gpus_per_socket?: number | string -> --gpus-per-socket
#    gpus_per_task?: number | string -> --gpus-per-task
#    partition?: string -> --partition
#    mail_type?: string[] -> --mail-type (ex. "mail_type": ["END", "FAIL"])
#    mail_user?: string[] -> --mail-user (ex. "mail_user": ["email@email.com"])
# }

Now job-specific parameters are set for the job. The slurm dictionary sets HPC values. The param dictionary is used to set a custom variable required by the runjobs.sh shell script. Note that in the param dictionary, the value is assigned to the key start_value, which is accessed in runjobs.sh as $param_start_value.

The slurm "num_of_task" key value sets the number of tasks requested by the CyberGIS-Compute client when the job runs on HPC. This means that the runjobs.sh shell script will be run twice, once per each task. In the runjobs.sh shell script, the #SLURM_PROCID variable, a unique id that is given to each task, is used to differentiate between the two times the run_fireabm.py script is run.

In [7]:
# Set number of tasks and the starting value for the script

task_number = 2
local_start_value = 20

# Sets variables used by HPC

slurm = {
    "num_of_task": task_number,
    "walltime": "10:00",
}

# Sets specific parameters for the job

demo_job.set(executableFolder="git://fireabm", 
             param={"start_value": local_start_value}, slurm=slurm)
{'param': {'start_value': 20}, 'env': {}, 'slurm': {'num_of_task': 2, 'walltime': '10:00'}, 'executableFolder': 'git://fireabm'}

Now the job can be submitted.

In [8]:
# Submit job!

demo_job.submit()
✅ job submitted
id:               16291420382ZGRa
maintainer:       community_contribution
hpc:              keeling_community
executableFolder: git://fireabm
param:            {"start_value": 20}
slurm:            {"num_of_task": 2, "walltime": "10:00"}
time:             2021-08-16T14:27:18.000Z
Out[8]:
<cybergis_compute_client.Job.Job at 0x7fc084db2450>

View and Download the CyberGIS-Compute Job Results

Once the job has been submitted, the events() and the logs() functions can be used to follow the job progress.

In [9]:
# View job events

demo_job.events(liveOutput=True, refreshRateInSeconds=5)
📮 Job ID: 16291420382ZGRa
💻 HPC: keeling_community
🤖 Maintainer: community_contribution
types message time
JOB_QUEUED job [16291420382ZGRa] is queued, waiting for registration 2021-08-16T14:27:18.000Z
JOB_REGISTERED job [16291420382ZGRa] is registered with the supervisor, waiting for initialization 2021-08-16T14:27:21.000Z
SLURM_UPLOAD uploading files 2021-08-16T14:27:27.000Z
SSH_UNZIP unzipping /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/executable.zip to /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/executable 2021-08-16T14:27:27.000Z
SSH_RM removing /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/executable.zip 2021-08-16T14:27:27.000Z
SSH_CREATE_FILE create file to /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/executable/job.json 2021-08-16T14:27:27.000Z
SLURM_MKDIR_RESULT creating result folder 2021-08-16T14:27:27.000Z
SLURM_SUBMIT submitting slurm job 2021-08-16T14:27:27.000Z
JOB_INIT job [16291420382ZGRa] is initialized, waiting for job completion 2021-08-16T14:27:27.000Z
SSH_ZIP zipping /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/result to /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/result.zip 2021-08-16T14:28:01.000Z
SSH_SCP_DOWNLOAD get file from /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/result to /job_supervisor/data/root/1629142042S1eT 2021-08-16T14:28:01.000Z
SSH_RM removing /data/keeling/a/cigi-gisolve/scratch/dev/16291420382ZGRa/result.zip 2021-08-16T14:28:01.000Z
JOB_ENDED job [16291420382ZGRa] finished 2021-08-16T14:28:01.000Z
In [10]:
# View job logs

demo_job.logs(liveOutput=True)
📮 Job ID: 16291420382ZGRa
💻 HPC: keeling_community
🤖 Maintainer: community_contribution
message time
node id: 0, task id: 1, start number: 20, SEED: 21, result folder: /16291420382ZGRa/result /16291420382ZGRa/executable node id: 0, task id: 0, start number: 20, SEED: 20, result folder: /16291420382ZGRa/result /16291420382ZGRa/executable copying over files using FireABM_opt !! starting file parse at: 14:27:38 using FireABM_opt !! starting file parse at: 14:27:38 !! Working Directory: /16291420382ZGRa/executable !! checking input parameters !! Working Directory: /16291420382ZGRa/executab...[download for full log] 2021-08-16T14:28:01.000Z

Once the job is complete, any results written to the $result_folder can be downloaded with the downloadResultFolder() function.

In [11]:
# Download results

outfile = demo_job.downloadResultFolder('./')
file successfully downloaded under: ./1629142042S1eT.zip

The results folder is downloaded as a .zip file. The following commands create a new folder to hold the results and unzip the downloaded .zip file into that folder.

In [12]:
# Create a folder for the results and unzip the results to the folder

!mkdir -p results_dir
!unzip -q -o $outfile -d results_dir

Clean up

Finally, it can be useful to clean up what has been downloaded. Uncommenting and running the following lines removes the results folder and the downloaded .zip file.

In [14]:
# Run to clean up results directory

#!rm -r results_dir

# Run to clean up results zip file

#!rm $outfile 

Steps for Creating Your Own Custom Job

If you want to create a custom Monte Carlo style job, you will need to follow these steps:

  1. Determine what script you want to run.
  2. Create a GitHub repository containing the script and any data needed for it to run.
  3. Create a shell script to create any needed directories and run the script based on input parameters.
  4. Create a manifest.json file containing the job information and specifying which top level script to run.
  5. Contact the CyberGIS team to submit your GitHub repository for approval.
  6. Once your GitHub repository has been approved, attempt to run your job from a notebook.
  7. Look at the job.stdout, job.stderr, and output files for any errors. If there are errors, you can make changes to the files in your GitHub repository and try to run the job again until it runs correctly.