Checkpointing Overview

Note

The simulations and visualizations in this tutorial were generated using Blender 2.67 and CellBlender 1.0 RC. The steps may or may not work with other versions.

Checkpointing allows you to stop a simulation at a specified iteration and resume it at some later point. This can be beneficial for several different reasons:

  • You are using a multi-user system that requires you to share time with others
  • The computer you are using crashes or is shut down unexpectedly
  • There are parameters you want to change partway through a simulation

We'll cover how to set up checkpointing in the next several sections, starting with a simple case where we change one parameter partway through a simulation.

Creating the MDL

Eventually, it will be possible to use checkpointing directly within CellBlender. Until then, you can still do it by editing some files by hand.

Inside of /home/user/mcell_tutorial, type the following command:

mkdir -p change_dc/change_dc_files/mcell

Note

CellBlender expects a certain directory structure for visualization. That is why we are creating these sub-directories in a very specific way.

Then within that directory (i.e. /home/user/mcell_tutorial/change_dc/change_dc_files/mcell), create a file called change_dc1.mdl. Add the following text to that file:

CHECKPOINT_INFILE = "dc_chkpt"
CHECKPOINT_OUTFILE = "dc_chkpt"
CHECKPOINT_ITERATIONS = 100
ITERATIONS = 200
TIME_STEP = 1E-6

DEFINE_MOLECULES
{
  vol1 {DIFFUSION_CONSTANT_3D = 1E-7}
}

INSTANTIATE World OBJECT
{
  vol1_rel RELEASE_SITE
  {
    SHAPE = SPHERICAL
    LOCATION = [0,0,0]
    SITE_DIAMETER = 0.0
    MOLECULE = vol1
    NUMBER_TO_RELEASE = 100
  }
}

sprintf(seed,"%05g",SEED)

VIZ_OUTPUT
{
  MODE = CELLBLENDER
  FILENAME = "./viz_data/seed_" & seed & "/Scene"
  MOLECULES
  {
    NAME_LIST {vol1}
    ITERATION_NUMBERS {ALL_DATA @ ALL_ITERATIONS}
  }
}

There are three new commands at the top of this file: CHECKPOINT_INFILE, CHECKPOINT_OUTFILE, and CHECKPOINT_ITERATIONS. As we mentioned earlier, checkpointing allows you to stop a simulation and resume it later. This is accomplished by means of a checkpoint file that is written (CHECKPOINT_OUTFILE) when the simulation is temporarily stopped and read back (CHECKPOINT_INFILE) when the simulation is resumed. The value assigned to each of these two commands is the name of the file to write or read. In this case, both use the same name, although that is not required. CHECKPOINT_ITERATIONS sets the iteration at which the simulation is temporarily stopped and the checkpoint file is created. Here, the run stops at iteration 100 of the 200 total set by ITERATIONS.

Now make a copy of change_dc1.mdl called change_dc2.mdl by entering the command:

cp change_dc1.mdl change_dc2.mdl

Then open change_dc2.mdl and change the diffusion constant (DIFFUSION_CONSTANT_3D) from 1E-7 to 1E-5. Once again, save and quit.
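If you would rather not edit the copy by hand, the copy and the edit can be combined into one step with sed. The following is only a sketch: make_second_stage is a hypothetical helper name, and it assumes the diffusion constant appears in the file exactly as DIFFUSION_CONSTANT_3D = 1E-7, as in the listing above.

```shell
# Hypothetical helper: write a second-stage MDL that differs from the
# first only in the diffusion constant.
#   $1: existing first-stage MDL file
#   $2: second-stage MDL file to create (overwritten if it exists)
make_second_stage() {
  # Assumes the text "DIFFUSION_CONSTANT_3D = 1E-7" appears verbatim.
  sed 's/DIFFUSION_CONSTANT_3D = 1E-7/DIFFUSION_CONSTANT_3D = 1E-5/' "$1" > "$2"
}
```

From inside the mcell directory, you would call it as make_second_stage change_dc1.mdl change_dc2.mdl. All other lines, including the checkpoint commands, are copied unchanged.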

Running the Simulation

Now run the first MDL file by entering the command:

mcell change_dc1.mdl

When it is finished running, enter the command:

ls

Notice that a file called dc_chkpt was created. This file stores the information needed to resume the simulation. Let's finish it now by entering the command:

mcell change_dc2.mdl
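The two runs above can also be chained in a small script that refuses to start the second stage unless the checkpoint file exists. This is a minimal sketch under a few assumptions: run_two_stages, MCELL, and CHECKPOINT_FILE are names invented for this example, and MCell is assumed to exit with status 0 after it stops at a checkpoint.

```shell
# Assumption: MCELL names the MCell executable ("mcell" in this tutorial).
MCELL=${MCELL:-mcell}
# Must match CHECKPOINT_OUTFILE in change_dc1.mdl.
CHECKPOINT_FILE=${CHECKPOINT_FILE:-dc_chkpt}

# Hypothetical helper: run stage 1, verify the checkpoint, run stage 2.
# Call it from the directory that holds both MDL files.
run_two_stages() {
  # Stage 1 stops at CHECKPOINT_ITERATIONS and writes the checkpoint file.
  "$MCELL" change_dc1.mdl || return 1

  if [ ! -f "$CHECKPOINT_FILE" ]; then
    echo "checkpoint file $CHECKPOINT_FILE was not written" >&2
    return 1
  fi

  # Stage 2 reads the checkpoint and finishes the remaining iterations.
  "$MCELL" change_dc2.mdl
}
```

After loading these definitions, typing run_two_stages performs both steps in order.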

Visualizing the Results

Start Blender. Save your blend file with the name change_dc.blend in /home/user/mcell_tutorial/change_dc. Be careful to name it correctly, as the directory structure we set up earlier depends upon it. Normally, this is all handled automatically by CellBlender, but we must be careful when hand-editing files. Delete the default Cube now (select and hit x), since it's not actually a part of our simulation. Hit Read Viz Data under the Visualize Simulation Results panel. Hit Alt-a to begin playing back the animation. You will notice that the molecules start off moving rather slowly, and then speed up halfway through the simulation, coinciding with the change in diffusion constant.

This is just a simple example of one parameter you can change. Here is a partial list of other parameters that you could change:

  • TIME_STEP
  • reaction rates
  • SURFACE_CLASS properties (ABSORPTIVE, TRANSPARENT, REFLECTIVE)

Time-Based Checkpointing

Instead of checkpointing at a specific iteration, you can create a checkpoint after the simulation has been running for a set amount of real (wall-clock) time. To do this, replace CHECKPOINT_ITERATIONS with CHECKPOINT_REALTIME. The value assigned to this command is a series of numbers separated by colons. The accepted formats and their units are listed below:

  • days:hours:minutes:seconds
  • hours:minutes:seconds
  • minutes:seconds
  • seconds

For example, if you set CHECKPOINT_REALTIME = 1:30, then the simulation would create a checkpoint after running for 1 minute and 30 seconds. Or if you set CHECKPOINT_REALTIME = 2:6:3:40, then the simulation would create a checkpoint after running for 2 days, 6 hours, 3 minutes, and 40 seconds.
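To check what a colon-separated value amounts to, you can convert it to total seconds. The function below is an illustrative sketch and not part of MCell: realtime_to_seconds is a hypothetical name, and it assumes well-formed input in which every field is a plain non-negative integer without leading zeros.

```shell
# Hypothetical helper: convert a value written as days:hours:minutes:seconds
# (or any of the shorter forms listed above) into total seconds.
# The body runs in a subshell, so the IFS change does not leak out.
realtime_to_seconds() (
  set -f          # no filename expansion while splitting
  IFS=:
  set -- $1       # split the argument on colons into $1..$N
  case $# in
    1) d=0;  h=0;  m=0;  s=$1 ;;
    2) d=0;  h=0;  m=$1; s=$2 ;;
    3) d=0;  h=$1; m=$2; s=$3 ;;
    4) d=$1; h=$2; m=$3; s=$4 ;;
    *) echo "expected 1 to 4 colon-separated fields" >&2; exit 1 ;;
  esac
  # Assumes no leading zeros: shell arithmetic would read "08" as octal.
  echo $(( $d * 86400 + $h * 3600 + $m * 60 + $s ))
)
```

With this definition, realtime_to_seconds 1:30 prints 90, and realtime_to_seconds 2:6:3:40 prints 194620.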

If you want the simulation to automatically continue running after writing a checkpoint file, you have to put the keyword NOEXIT at the end of the CHECKPOINT_REALTIME command, like this: CHECKPOINT_REALTIME = 1:30 NOEXIT.

You will know that a checkpoint file has been created, because MCell will report something like this while it is running:

MCell: time = 1098, writing to checkpoint file chkpt (periodic).

Checkpointing with SIGUSR1 and SIGUSR2

Sometimes you need to end a simulation immediately, but a lot of simulation time can be lost if you haven't checkpointed recently. To deal with this problem, use the kill command to send the SIGUSR1 or SIGUSR2 signal to MCell's PID (process ID). If you send SIGUSR1, MCell will create a checkpoint and continue running. If you send SIGUSR2, MCell will create a checkpoint and end the simulation. You can use the top or ps commands to find MCell's PID. For example, if your MCell executable is called mcell, then type the following command while MCell is running:

ps -e | grep mcell

This will output something similar to this:

7984 pts/4    00:00:10 mcell

The first number listed, 7984, is the PID. Next, enter the following command (using your own PID in place of 7984):

kill -SIGUSR1 7984

This creates a checkpoint and keeps the simulation running. To create a checkpoint and end the simulation instead, enter the following command:

kill -SIGUSR2 7984

You will know that these worked if MCell reports something like this:

MCell: time = 1282, writing to checkpoint file chkpt (user signal detected).
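If you send these signals often, a small wrapper can make the choice between the two explicit. This is only a sketch: request_checkpoint is a hypothetical helper, and it uses the POSIX form kill -s USR1 / kill -s USR2 to send the same two signals described above.

```shell
# Hypothetical helper: ask a running MCell process to write a checkpoint.
#   $1: PID of the MCell process (as reported by ps or top)
#   $2: "continue" to keep running afterwards (SIGUSR1)
#       "stop" to end the simulation afterwards (SIGUSR2)
request_checkpoint() {
  case $2 in
    continue) kill -s USR1 "$1" ;;
    stop)     kill -s USR2 "$1" ;;
    *) echo "usage: request_checkpoint PID continue|stop" >&2; return 2 ;;
  esac
}
```

Using the PID from the example above, request_checkpoint 7984 continue corresponds to kill -SIGUSR1 7984, and request_checkpoint 7984 stop corresponds to kill -SIGUSR2 7984.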