As you start to use more complicated models in your research (or a couple weeks from now in this course), you’ll eventually get tired of having to leave your computer running overnight just to realize that you forgot to include an important control variable or accidentally included a bunch of observations outside your timeframe of interest. The research process can be much less frustrating the more computing power you have to work with. While supercomputers are the gold standard here, there are still ways to squeeze more performance out of your laptop.

## Parallelization

Parallelization lets you increase the speed of your code by running instructions in parallel (side by side), instead of serially (one after another). However, parallelization is not a magic bullet, and you need to use it intelligently to speed up your code. This is due to two main factors. First, parallel code is more complicated than serial code, so the speed gains you get need to outweigh the time it takes you to parallelize your code. Second, only certain tasks benefit from parallelization. Copy and paste the code below into R. Before you run it, try to guess which version will run faster (the parallel version won’t work on Windows due to the way that mclapply() works, unfortunately).

library(parallel)

fake_data <- as.data.frame(matrix(rbeta(1e6, 1, 2), 1e3, 1e3))

## compare serial and parallel execution
print(microbenchmark::microbenchmark(
  serial = lapply(fake_data, function(x) sqrt(x) / max(x)),
  parallel = mclapply(fake_data, function(x) sqrt(x) / max(x),
                      mc.cores = parallel::detectCores())
))
## Unit: milliseconds
##      expr min lq mean median  uq max neval
##    serial  15 21   27     25  30  65   100
##  parallel  54 62   89     70 108 482   100

Sending instructions to multiple cores takes time. If the length of time it takes to carry out an instruction is sufficiently short, the extra step of sending the instructions to multiple cores can actually make parallel code slower than serial code. This means that parallelization is suited to code where each instruction takes a long time to execute.
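To see the flip side, here is a minimal sketch where parallelization does pay off. The half-second sleep is just a stand-in for any expensive computation, and as above, mclapply() won’t run in parallel on Windows:

```r
library(parallel)

# each task is slow, so the overhead of farming tasks out to cores is worth it
slow_task <- function(x) {
  Sys.sleep(0.5)  # pretend this is an expensive computation
  sqrt(x)
}

system.time(lapply(1:4, slow_task))                  # elapsed: about 2 seconds
system.time(mclapply(1:4, slow_task, mc.cores = 4))  # much faster with free cores
```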

One way to execute code in parallel is using mclapply() in the parallel package like we did above. This is a good approach if the operations you need to carry out can be accomplished with simple anonymous functions, or even preexisting functions.

If you need to carry out more complicated operations, the foreach() function in the foreach package can be a good option. It also gives you more control over exactly how your code is parallelized. For Windows users, this can be your only option, as mclapply() does not work due to the way it parallelizes code. The foreach() function can use parallel backends from multiple different packages, but we’re going to use doParallel. Call registerDoParallel() on the result of makeCluster(), with detectCores() as the number-of-cores argument.

library(doParallel)
registerDoParallel(makeCluster(parallel::detectCores()))

The syntax for parallelized loops is slightly different from regular loops. Be sure to use %dopar% after the arguments; if you use %do% instead, the code will just be executed serially (…like a regular loop).

for (i in 1:10) print(i)
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
foreach(i = 1:10) %dopar% print(i)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
##
## [[4]]
## [1] 4
##
## [[5]]
## [1] 5
##
## [[6]]
## [1] 6
##
## [[7]]
## [1] 7
##
## [[8]]
## [1] 8
##
## [[9]]
## [1] 9
##
## [[10]]
## [1] 10

We also retrieve the results of a foreach() loop differently. Normally we create an object outside a loop, and then fill it in by indexing using our iteration variable e.g. x[i]. If we want to save the results of a foreach loop to an object, we need to assign the entire loop to that object. Note that this means any object we create in the loop won’t make it out of that environment; only the value of the expression at the end of the loop body will be included in the final object.

x <- foreach(i = 1:1e4) %dopar% rnorm(1, 0, 1)

We can also use foreach() to create 2 dimensional objects like matrices and arrays. If we do not change the .combine argument in the function call, the results are returned in a list like above, except each element is now a numeric vector. Setting the .combine argument to rbind instead will yield a matrix.

x <- foreach(i = 1:15) %dopar% rbeta(5, 1, 1)
class(x)
## [1] "list"
x <- foreach(i = 1:15, .combine = rbind) %dopar% rbeta(5, 1, 1)
class(x)
## [1] "matrix"

If we want to make a data frame instead of a matrix, just wrap the elements in a data.frame() call as the last expression in the loop body. Remember not to assign the data.frame() call to anything; it has to be the value the loop body returns!

x <- foreach(i = 1:15, .combine = rbind) %dopar% {

chr <- sample(letters[1:5], 1)
num <- rnorm(1, 2, 5)
data.frame(letter = chr, number = num)

}
x

If your data are exchangeable, meaning that they don’t have labels and the order does not convey any meaningful information, then you can set the .inorder argument to false to improve performance. If you’re carrying out operations on real data you should leave this set to true, but if you are, for example, generating bootstrap measures of uncertainty, then you can set it to false. When it’s set to true, results have to be combined in order: the result from iteration 8 can’t be written to the output until the results from iterations 5, 6, and 7 are in. When it’s set to false, a worker that finishes its task can immediately write its result to the output and begin a new one, instead of having to wait for the other workers to write their results. Compare the speed of two parallel loops, with .inorder = T and .inorder = F.

print(microbenchmark::microbenchmark(
  ordered = foreach(i = 1:1e3, .combine = rbind, .inorder = T) %dopar%
    rbeta(5, 1, 1),
  unordered = foreach(i = 1:1e3, .combine = rbind, .inorder = F) %dopar%
    rbeta(5, 1, 1)
))
## Unit: milliseconds
##       expr min  lq mean median  uq max neval
##    ordered 478 496  507    503 509 582   100
##  unordered 480 498  509    506 513 593   100

The real benefit of parallelization lies in repeating multiple tasks that each take a lot of time to carry out, such as reading in data or saving it to disk storage. Your laptop probably only has four cores, maybe eight if you’re lucky. That will get you a speedup over serial execution, but we can do way better.

## Some basic *nix

Longleaf is UNC’s new cluster. A standard Longleaf node has 24 cores and 256GB of RAM, with access to functionally unlimited storage space. In addition to all of this computing power, Longleaf has another major advantage: when code is running on the cluster, it’s not tying up your computer.

Longleaf runs Linux, which isn’t too different from the software underlying Mac OS. If you’re on a Mac, you can do everything you need through the Terminal. If you’re on Windows, you’ll need an SSH client like PuTTY and an FTP client like PSFTP. If you’re using Linux, you don’t need me to tell you what to do.

cd # change directory
pwd # print working directory
ls # list files
mv # move
cp # copy
mkdir # make directory
rm # remove
rm -r # remove recursively
man # manual

Navigating directories works just like in R: cd is equivalent to setwd() and pwd is equivalent to getwd(). Linux uses forward slashes, so even if you’re on Windows, when you’re on Longleaf, you’ll use / in filepaths. However, you have to escape any spaces in a file path with a backslash e.g. cd 787/Labs/High\ Performance\ Computing. Anytime you want to copy or delete a directory, you need to use the -r option with your command to tell the shell to execute it recursively, otherwise you’ll get an error (mv handles directories without it). The man command is equivalent to ? in R and is incredibly helpful for figuring out options and syntax for other commands. For example, I frequently use ls -lht which shows:

• l long format with file sizes and date of modification
• h human readable filesizes e.g. 22kb or 5.7mb
• t sort by modification time

A very important Linux shortcut is ~, which is your home directory. On a Mac this will be /Users/yourusernamehere. On Longleaf, it is /nas/longleaf/home/onyen. Just like RStudio, the shell has tab completion, so you’ll rarely have to type out an entire filepath.
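Putting a few of these commands together, a typical navigation session looks something like this (the directory names are just examples):

```shell
cd ~                 # jump to your home directory
mkdir -p 787/Labs    # make a directory (-p creates parent directories as needed)
cd 787/Labs          # move into it
pwd                  # confirm where you are
ls -lht              # long format, human-readable sizes, newest first
cd ..                # go up one level
```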

## Logging on

To access Longleaf for the first time, open up your shell of choice and type

ssh <onyen>@longleaf.unc.edu

After a very stern warning about unauthorized access, you’ll get a password prompt that asks you for your UNC password. Enter it and you will connect to the cluster. You can type ls to see what’s in your home directory, but there’s not much to do at the moment. Create a directory for this lab to prevent things from getting cluttered.

Right now we’re on the cluster, but we can’t actually run any R code yet. Longleaf uses a ‘module’ system to manage scientific software, and you need to load the appropriate module before you can use any software. Type module spider r to see all the different versions of R available. We’ll want to use the latest one, so type module load r/3.5.0. Now you can type R in the shell to start R from the command line. Type quit() to exit. To save time having to load R every single time you log on, type module save, and Longleaf will automatically load all the modules you currently have loaded each time you log on. Close your ssh session with the exit command, and it’s time to write some R code!

## Your first job

Code is generally not run interactively on Longleaf. Instead, you write a script that will do everything it needs to completely unattended. Once you’ve done this, you’ll submit it to Longleaf’s scheduler, which will run it as a job. Create an R script that calculates the mean mpg by cyl in the mtcars data and saves the result to a .csv file and creates a plot from the data as well. Don’t forget to save the plot! End the script with quit(save = 'no'). This tells R that you’re finished and it’s time to kill the R process. If you set save = 'yes', then R will save the workspace to a .RData file so that it can be loaded again. This is useful if you’re debugging code and want to be able to start R and figure out exactly where your code failed.

## no setwd() needed: the job runs in the directory you submit it from
library(tidyverse)
cyls <- mtcars %>% group_by(cyl) %>% summarize(mpg = mean(mpg))
write.csv(cyls, 'cyls.csv')
pdf('plot.pdf', width = 8, height = 8)
ggplot(mtcars, aes(x = hp, y = mpg, color = as.factor(gear))) +
geom_smooth(method = 'lm', inherit.aes = F, aes(x = hp, y = mpg)) +
geom_point() +
facet_wrap(~ cyl, scales = 'free_x')
dev.off()
quit(save = 'no')

Save this script as cluster job.R. Now we need to actually run this script on the cluster. Longleaf uses a job scheduler called SLURM, and the job submission command is sbatch. You could submit a job by typing everything out in the shell each time, but this is a pain if you’re debugging code and have to submit several jobs in short succession (it’s not an accident this is the second time I’ve mentioned debugging…). To submit a job to SLURM, you can create a job submission script. This is a simple shell script that tells the cluster what modules to load, what resources your code needs, and how long your code is allowed to run. Open a text editor (RStudio works just fine for this).

Your script needs to begin with the following: #!/bin/sh. The #! is called a ‘hashbang’ or ‘shebang’ and tells the cluster which interpreter to use for the script. In this case, /bin/sh tells it to use the Bourne shell. Note that unlike in R, # isn’t always a comment character: the hashbang and the #SBATCH lines below pass information to the system, but otherwise it usually is. Following that, your script will have multiple appearances of #SBATCH followed by a different option each time. The key ones are:

• --job-name which lets you identify your job amongst all the others running on the cluster
• --ntasks which tells the scheduler how many separate processes your code will spawn
• --cpus-per-task which tells the scheduler how many cores your code will need per process
• --mem which tells the scheduler how much memory to allocate for your code
• --time which tells the scheduler how long to run your code before killing it. This is super important because the more time you request, the longer you’ll have to wait for your code to start. Acceptable time formats are “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds.” The default on Longleaf is one hour.

Finally, you need to tell your job submission script to actually run your code! You do this with normal shell commands.

• module load r/3.5.0 makes sure that you’re running the correct version of R
• Rscript --vanilla cluster\ job.R executes your R script

Note! There’s a bug in R 3.5.0 where any R script with a space in the name will fail if you don’t specify at least one option to Rscript, which is why I have --vanilla; don’t worry about what that does. Putting it all together, our job submission script looks like:

#!/bin/sh

#SBATCH --job-name=cl_test

# one task for threaded parallelization
#SBATCH --ntasks=1

# Number of CPU cores per task
#SBATCH --cpus-per-task=1

# memory request
#SBATCH --mem=1gb

# thirty minute time limit
#SBATCH --time=0:30:0

# load the appropriate R module
module load r/3.5.0

# use Rscript to run the analysis script
Rscript --vanilla cluster\ job.R

Save the file as cluster job.sl with the .sl extension so you know it’s a SLURM script. Now we need to actually get our files on the cluster! We do this using sftp. Connect to Longleaf again, but use sftp this time instead of ssh. NOTE: Do not start sftp from within your ssh session, because then both the local and remote ends will be Longleaf. Either exit ssh or start sftp in a separate shell window.

sftp <onyen>@longleaf.unc.edu

SFTP has a set of ‘local’ commands that you’ll need to use because you’re navigating files on both the cluster and your own system. Use cd to switch to the folder you created for this lab on the cluster. Use lpwd and lls to figure out where you are on your machine, and then use lcd to switch to the directory where you saved your R script and job submission script. Use the put command to put both of these files on the cluster.

put cluster\ job.R
put cluster\ job.sl

Open a new terminal window and SSH into Longleaf. Navigate to the lab directory. We’re finally ready to run our script using the sbatch command.

sbatch cluster\ job.sl

Type squeue and you can see your job running on the cluster…along with every single other job! To solve this issue, type squeue -u <onyen> instead. The -u option tells the command to only return jobs for the specified user, you. To save yourself the hassle of having to specify this every single time, type alias squeue='squeue -u <onyen>' which will alias the squeue command to automatically specify your username.
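An alias only lasts for the current session. If you want it to survive across logins, append it to your shell startup file (a sketch, assuming your login shell is bash; replace `<onyen>` with your own username):

```shell
# works for the current session only
alias squeue='squeue -u <onyen>'

# make it permanent by adding it to ~/.bashrc, which bash reads at login
echo "alias squeue='squeue -u <onyen>'" >> ~/.bashrc
```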

Once your job is done, there will be a log file in the directory called slurm-JOBIDNUMBER.out. We’ll want to take a look at it to make sure everything went as planned. We could download it and then open it up on our computer, but that’s a waste of time if everything went smoothly. Longleaf has a lightweight text editor called Nano that we can use to check our log.

nano slurm-JOBIDNUMBER.out

Scroll through to make sure everything looks good, exit Nano by pressing CTRL+x. Now let’s get our results off of the cluster so we can work with them on our computer. Switch back to your SFTP window and get the .csv and .pdf files from the cluster.

get cyls.csv
get plot.pdf

## Going parallel

To submit a parallel job, you just have to make one small tweak to your job submission script. Set --cpus-per-task equal to the number of cores you want to use e.g. --cpus-per-task=10. Your R script similarly requires just one major adjustment. When registering your parallel backend, you can’t use parallel::detectCores() to set the number of cores you’ll use. This is because code running on Longleaf runs on individual nodes, and this function can sometimes detect cores on the node above and beyond the number you’ve requested from the scheduler. Instead, specify the number of cores you want directly e.g. registerDoParallel(cores = 10). That’s it!
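One way to keep your R script in sync with your submission script is to read the core count from the environment: SLURM sets the SLURM_CPUS_PER_TASK variable inside a job when --cpus-per-task is specified. A sketch (the fallback of 1 core is just for running the script outside a job):

```r
library(doParallel)

# SLURM sets this inside a job; fall back to 1 core elsewhere
n_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
registerDoParallel(cores = n_cores)

x <- foreach(i = 1:100, .combine = c) %dopar% mean(rbeta(50, 1, 1))
```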

## A note on debugging

Notice how we haven’t been directly using R this whole time? That’s because everything we do in the shell is actually running on a login node. Login nodes do not have 24 cores and hundreds of gigabytes of memory. When you submit a job SLURM sends it to a compute node, which is designed to handle high performance tasks. The login nodes are not, so don’t run your code on them to debug. Instead, use the srun command to run R as a job on a compute node. Be sure to specify --vanilla, otherwise R won’t load.

srun R --vanilla

You can even debug parallel code this way since srun supports most of the standard sbatch arguments. Type man srun if you’re unsure whether a specific argument is supported.

srun --cpus-per-task=4 --mem=2gb R --vanilla

## A word of warning…

You can have your cluster privileges suspended if you do something that messes up other people’s jobs. Luckily, clusters are very hard to break these days. Even if you write an infinite loop, it will be confined to your cores on a compute node and then terminated when the time limit is up. Harmful? Yes. Catastrophic? No. Obviously don’t do anything malicious on purpose, but don’t be afraid, either.

## Individual Exercise

Submit the log file generated by running your script on Longleaf.