Running TLC in Distributed Mode

Contents
  How Distributed TLC Works
  Running the Master from the Toolbox
  Running the Workers
  Limitations of Distributed Mode 
  Getting and Giving Help

How Distributed TLC Works

TLC consists of a collection of worker threads that do the actual model checking--that is, they compute the graph of reachable states and check invariants and other safety properties.  There is also a master thread that coordinates the worker threads.  In ordinary (non-distributed) mode, these threads all run on the same Java Virtual Machine (JVM) on the same computer as the Toolbox.  In distributed mode, each thread runs under its own JVM, usually on different computers.  The master thread runs on the same computer as the Toolbox.  (In network terminology, the master is a server and the workers are clients.)  The master and the workers are started separately, in any order.

Running the Master from the Toolbox

TLC is run on a model.  To specify that it is to be run in distributed mode, check the Run in distributed mode box in the How to run? section of the Model Overview Page.

Unless you're just trying it out, you're running TLC in distributed mode because your model is quite large.  For any large model, it's a good idea to increase the Maximum JVM heap size in MB value to give TLC as much of the computer's memory as is not needed by the operating system and other programs that will be running at the same time.  (Note that the Number of worker threads parameter in the How to run section applies only to running TLC in ordinary, non-distributed mode.) For distributed mode, you may also want to add special arguments to the command that launches the JVM that runs the master.  As explained below, one such argument may be needed for the master and the workers to communicate.  The arguments are put in the Additional JVM arguments field.

You start a TLC run in distributed mode as usual by clicking on the button, by selecting Run model on the TLC Model Checker menu, or by typing F11.  This should cause the Current status field of the Model Checking Results Page to change to Master is waiting for (remote) workers and then to indicate that one or more workers have registered with the master.

Running the Workers

The following are the basic steps that must be accomplished to start the workers.  The details of how these steps are performed will vary according to the operating system and network configuration that you are using.  For help getting the workers started on your system, go to (web page).

To run the workers, you must have a network of computers that can communicate with one another and with the machine running the master and the Toolbox.  We assume that there is some some remote management/administration system installed on the computers that run the workers.  Examples of such a system are Remote Desktop (Windows), ssh, rsh, and telnet.  A Java Runtime Environment (JRE), version 1.5 or later, must also be installed on the macines. 

There are two basic ways to run a worker thread on a computer that we will call the worker computer.  In the following instructions, master-computer is the name of the computer running the Toolbox and the master.

Method 1
Start the Toolbox on the master computer.  Then execute the following command in a shell on the worker computer:
   wget http://master-computer:10996/files/tla2tools.jar
Then execute this command on that shell:
  java -cp tla2tools.jar tlc2.tool.distributed.TLCWorker master-computer
The wget command downloads the file tla2tools.jar into the current directory on the worker machine, the java command actually starts the worker.  The wget command therefore just has to be executed the first time you run a worker, and then whenever you install a new version of the Toolbox.  The wget command is not part of Windows, but can be installed on Windows as part of Cygwin.
Method 2
Start the Toolbox on the master computer, and then run a Web browser on the worker computer.  In that web browser, enter the url
    http://master-computer:10996
This will display a Web page containing a Launch button labeled Launch worker from browser.  Clicking on that will download the file tla2tools.jar and will execute it, starting the worker.

A Possible Problem

The master-computer name you use in the instructions above for running the workers is usually a name like jones-home-laptop or tla.msr-inria.inria.fr.  However, some computers have multiple network interfaces, with different IP addresses--for example, 192.168.1.10 and 10.10.0.1.  If the master computer is such a computer, then it's possible that using this name as the master-computer name in the instructions above will cause the worker not to be able to communicate with the master because they are using different IP addresses.  To solve this problem, choose one of those IP addresses--say 192.168.1.10. Use that address as the master-computer name in the instructions above for running the workers, and put the following argument in the Additional JVM arguments field of the Model Overview's How to run? section:
   -Djava.rmi.server.hostname=192.168.1.10

Using Multi-Core Worker Machines

Currently, workers are single threaded.  You can take advantage of multi-core computers by running multiple workers on the same computer, using one of the previous methods to start each of the workers.  The major disadvantage of doing this is that a separate copy of the worker code is loaded into memory for each worker.

Limitations of Distributed Mode

It does not run in depth-first mode or Simulation mode, even if one of those modes is selected on the Advanced Options Page.  Coverage information is not provided.  You should therefore check a smaller model in ordinary non-distributed mode to make sure that the specification specifies what you think it does. 

Getting and Giving Help

You may want to run TLC in distributed mode with many workers.  (Very preliminary tests suggest that linear speedup is possible even with hundreds of workers.)  Manually starting so many workers is not practical.  It is possible to implement scripts that can automate the installation of the JRE and one of the two methods described above for copying the tla2tools.jar file and starting the workers.  How this is done will depend on the operating system and network configuration.  If you need help, try going to the Forum on Distributed TLC on the tlaplus.net page.  If you have successfully run TLC in distributed mode, please use that Forum to tell others how you did it on your system.
↑ Model Overview Page