This article is the third in a series on Xgrid (see Part I and Part II). Here we look at a real-life example to see how one can use Xgrid to actually get something done.
The question has come up several times (on various Web forums and on the Apple mailing list): what is Xgrid good for? Xgrid is good for programs that can be broken up into smaller pieces that are independent of each other. One example in science is Monte Carlo calculations, where the same (relatively simple) calculation is repeated several million times. Another is what's called a "parameter study", where the same program is run several times with different parameters. The Mandelbrot map calculation provided with Xgrid.app is another good example: the calculation at a given point is completely independent of the calculations for other points (it is a calculation of the "speed" at which the recursive application of a function diverges). An example that is likely to interest most people is graphic rendering. Each part of an image tends to be independent of the other parts. Hence one can break up an image (or a scene in a movie, for example) into smaller images and render them on several computers. This is exactly what we are going to do here, using a program called Persistence of Vision Raytracer (or POVray for short). I will try to keep the details of POVray out of this article whenever possible.
You will need Xgrid and the command-line version of POVray, plus two simple programs: generate (to generate the .INI files) and combineppm (to stitch together the graphics). Links to those files are available below (in context).
We want to render (that is, create) a complex image using POVray. We need to have the command-line version of POVray installed. This can be done with DarwinPorts, using sudo port install povray. The program gets installed in the /opt/local/ tree, which is assumed in this article. POVray comes with a wide selection of scenes; we will render chess2.pov, available in the scenes/advanced/ directory. We will generate a file at a resolution of 1024 x 768. Because the rendering can take a long time (say, hours), it is advantageous to split it into several subtasks and have each of several computers render a small piece of the image. That's how Xgrid can help.
There is no magic with Xgrid: if you want every agent to do a slightly different task than the other agents, then you need to provide a list of slightly different arguments. What they are and how you generate them depends on the problem at hand. For POVray, we can create .INI files that are passed as arguments to the povray executable. The .INI files have everything POVray needs to do its thing. We generate those files such that each node will render a slightly different slice of the image and save it under a specific name. (I use a trivial Perl script called generate to generate the files.) I arbitrarily decided that 4 slices were enough, but typically you would set up as many tasks as you have agents, as long as those tasks are not too small (there is a point where splitting no longer gains you anything, since Xgrid will spend most of its time copying files over the network):
This can be generated from the Perl script "generate".
Hence, at this point, we have 4 files in Povray_args and nothing else: every file in that directory gets passed as an argument to the agents, so you don't want any other files there.
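The generate script itself is not reproduced here, but the idea can be sketched in a few lines of shell. This is only an illustration under assumptions: the function name make_slices, the file naming (chess2_0.ini, chess2_0.ppm, and so on) and the exact set of options are mine, not necessarily what generate writes. Start_Row and End_Row are the POVray options that restrict rendering to a band of rows, and each file gets its own Output_File_Name so that the slices do not overwrite each other.

```sh
# Sketch only: write one POV-Ray .INI file per horizontal slice.
# Assumption: the option values and the naming scheme are illustrative.

# make_slices SCENE WIDTH HEIGHT SLICES OUTDIR
make_slices() {
    scene=$1
    width=$2
    height=$3
    slices=$4
    outdir=$5
    base=$(basename "$scene" .pov)
    mkdir -p "$outdir"
    i=0
    while [ "$i" -lt "$slices" ]; do
        # Rows are numbered from 1; the ranges are inclusive and do not overlap.
        start=$(( i * height / slices + 1 ))
        end=$(( (i + 1) * height / slices ))
        cat > "$outdir/${base}_$i.ini" <<EOF
Input_File_Name=$scene
Output_File_Name=${base}_$i.ppm
Output_File_Type=P
Width=$width
Height=$height
Start_Row=$start
End_Row=$end
Display=Off
EOF
        i=$(( i + 1 ))
    done
}
```

Calling make_slices scenes/advanced/chess2.pov 1024 768 4 Povray_args should leave four files in Povray_args, covering rows 1-192, 193-384, 385-576 and 577-768. Check how your version of POVray treats End_Row before relying on the exact boundaries.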


We then fill in the form with:

which pretty much says "Run the command /opt/local/bin/povray from (the equivalent of) the working directory /opt/local/share/povray-3.5/ with all the files in ~/Desktop/Povray_args/ as arguments and store the result (I'll get back to that) on the desktop." We need to run from /opt/local/share/povray-3.5/ since povray needs access to all sorts of files that are stored in that directory. At this point, if you click Submit Job (and you don't have any mistakes in the argument files), everything will go through and will start processing. If you have more than 4 machines, you might want to split the job into more than 4 image slices (see the generate script).
A note: I got a Job Timeout several times, but I don't think they were real timeouts: the Tachometer was up, and an excursion to the terminal showed that povray was actually running. Since the output files get created at the very end of the job (when all of the tasks are done), you don't see any output until the end. I never had the patience to wait and see if I got the files eventually.

To stitch the files together, you need a simple program that takes those files and produces one big final file. I found a program called combineppm that does just that (incidentally, the web page I got it from also discusses POVray on a grid).
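To give an idea of what such a stitching program has to do, here is a rough shell sketch. It is not combineppm, and it rests on assumptions you should check against your own files: every slice is a binary PPM (P6) whose header is exactly three lines ("P6", "width height", "255") with no comment lines, and each file contains only the rows of its own slice. The name stitch_ppm is hypothetical.

```sh
# Sketch only: concatenate binary PPM (P6) slices, given top to bottom.
# Assumption: three-line header, no comments, maxval 255.

# stitch_ppm OUTPUT SLICE...
stitch_ppm() {
    out=$1
    shift
    width=""
    total=0
    for slice in "$@"; do
        dims=$(head -n 2 "$slice" | tail -n 1)   # second header line: "width height"
        w=${dims% *}
        h=${dims#* }
        if [ -n "$width" ] && [ "$w" != "$width" ]; then
            echo "stitch_ppm: $slice has width $w, expected $width" >&2
            return 1
        fi
        width=$w
        total=$(( total + h ))                   # heights add up, width stays the same
    done
    {
        printf 'P6\n%s %s\n255\n' "$width" "$total"
        for slice in "$@"; do
            tail -n +4 "$slice"                  # everything after the three header lines
        done
    } > "$out"
}
```

With slices named as in the earlier steps, something like stitch_ppm chess2.ppm followed by the slice files in top-to-bottom order would produce the final image, provided the assumptions above hold.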
You can then open chess2.ppm in GraphicConverter (Preview.app does not open ppm files). You now have a nicely rendered graphic that looks like this:

You don't need to have the executable or any of the "working" directories on the agent machines to make Xgrid work: the binary and the working directory get tarred and extracted into /tmp/xgagent.XXXXX and /tmp/xgagent.YYYYY on the agents (the full directory tree gets extracted). Hence, when the binary is launched from the working directory, all the files are accessible. Moreover, when done, the working directory (which was copied to the agent) is copied back to your computer (via a tar command again, I assume). Hence, at the end of the job, you have in your destination directory (~/Desktop/ in this case) a copy of the working directory in the state it was in at the end of the calculation, including any output files. A side effect: you must make sure that each task produces a file with a different name, because otherwise the files will overwrite each other.
The purpose of BEEP in all this is to provide the underlying protocol between agent, controller and clients. I don't know enough about it to say much, except that it is BEEP that makes it relatively easy to exchange more than just text without having to redefine an entirely new protocol.
Since all Xgrid tasks run in user space as the user nobody (not in the kernel), it is safe. In addition, the communication between agents and controller is well-defined and convenient: agents contact the controller, hence only the controller needs to have its firewall adjusted (open port 4111), not the agents.
Those are exactly the kind of things that are not implemented in a home made solution, and this is why Apple should do them.
Also, a better way to set arguments would help: you can't provide dependent ranges (like 1-10, 11-20, 21-30, etc.). It would have been useful in the present case.
Lots of things could be done (and they have been discussed on the Xgrid mailing list; see the archives). The most important to me is agents for other architectures. Another application, for instance, would be to create (I haven't tried it, but it looks promising) an AppleScript that contacts the local machine (via a remote AppleScript call) under a username and password defined for that machine (provided that user is logged in at the console and the machine accepts remote AppleScripts) to process something using GUI applications. There certainly isn't anything in the current implementation of Xgrid that would prevent this.
Feel free to contact me for comments and questions at dccote_at_novajo.ca about this article.
Getting acquainted with Xgrid. Here are the few things I have found so far.
The first part of this article is available here. The next part is available here.


You should now be able to use this machine as an agent when you start Xgrid.
That sounds reasonable, but I still have a question: you can accomplish that with ssh (connect to a machine, run a command and collect the result), so why Xgrid? I suspect the answer is twofold: 1) to connect to a machine with ssh, you need an account on that machine (with shell access), and if you have one, then you can pretty much do anything you want on that machine (not so good for the owner), and 2) if you use BEEP instead of ssh, you can transfer things other than just text (from reading the documentation at http://www.beepcore.org).
So what I want to know is: can I use Xgrid to upload a given program to the agents before running it? [Note added Jan 11th: yes, see next article]. I wonder if there is some facility in Xgrid to do that, or if you need to do it manually with some kind of remote copying with scp and such. Actually, I tested the remote copy and it looks somewhat complicated: the process on the agent runs as nobody and is kept in /tmp/, as can be found out easily with the shell program of Xgrid.app:

However, I found a way to circumvent that: you could upload a file to a web server and have the Xgrid agent download it, then execute it. For instance if the program echo "program downloaded and ran" is kept in a file called testprog, one can do the following:
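A sketch of the idea in shell, assuming curl is available on the agent. The helper name fetch_and_run and the URL in the usage note are placeholders of mine, not the commands from the original listing:

```sh
# Sketch only: download a small script and run it, the way an agent
# running as nobody with access to /tmp could.

# fetch_and_run URL
fetch_and_run() {
    url=$1
    # Private temporary file under /tmp, where the agent is allowed to write.
    prog=$(mktemp /tmp/testprog.XXXXXX) || return 1
    if curl -fsS -o "$prog" "$url"; then
        sh "$prog"          # run the downloaded script with sh
        status=$?
    else
        status=1            # download failed, nothing to run
    fi
    rm -f "$prog"
    return "$status"
}
```

With testprog uploaded to a web server of your own, fetch_and_run http://www.example.com/testprog (a placeholder address) would download it and run it on the agent.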

This is not a security issue: you are running as the user nobody and hence don't have access to much (not more than you would by running some other command that's already installed).
The problem has the following symptoms: you try to start or stop the controller and you keep getting one error message after another, and when they stop, the agent or controller still hasn't started. The error messages don't say much (they talk about the volume not following permissions, which is not true).
The problem is that the permissions on a few password files are incorrect, or the files don't exist and don't get created (they must be owned by root:wheel with permission 0600, as explained in the document XgridRemoteInstallation). If you start the server or agent manually, you will see a list of warning and error messages that tell you just that.
That can easily be fixed, but the fix below removes the password protection for now. I recommend doing this:
sudo touch /Library/Xgrid/Agent/controller-password
sudo touch /Library/Xgrid/Server/agent-password
sudo touch /Library/Xgrid/Server/client-password
sudo chown root:wheel /Library/Xgrid/Agent/controller-password
sudo chown root:wheel /Library/Xgrid/Server/agent-password
sudo chown root:wheel /Library/Xgrid/Server/client-password
sudo chmod 0600 /Library/Xgrid/Agent/controller-password
sudo chmod 0600 /Library/Xgrid/Server/agent-password
sudo chmod 0600 /Library/Xgrid/Server/client-password
Then in the two files /Library/Preferences/com.apple.xgrid.agent.plist and /Library/Preferences/com.apple.xgrid.controller.plist, change the RequireControllerPassword and RequireClientPassword settings from true to false (if they were true) with a text editor.
Today, Apple introduced Xgrid. What is Xgrid and why should one care? This article describes my findings on Xgrid. Everything is available in the documentation or somewhere on the web, but this article presents a quick overview.
The second part of this article, entitled "Getting acquainted with Xgrid", is available here. The third part "Xgrid: Povray example" is available here.
Xgrid is targeted towards computations that take a very long time (several hours). Typical applications that gain from this are: Monte Carlo calculations, 3D rendering, and other calculations that can be broken in several sub-tasks that don't affect each other. Apple provides a few examples, the most obvious is Mandelbrot: the calculation of the Mandelbrot map at a given point does not depend on the result at another point. Hence, one can split the whole map in sub-maps and ask the agent computers to perform their part of the calculation.
Xgrid does not perform the calculation. Actually, Xgrid does not know squat about math or science. Even worse (or better?), Xgrid does not even know you are trying to "compute" something. Xgrid provides the basic infrastructure so that one computer can talk to several others, run a command and get the result. That's it. It is based on BEEP, which is a (new?) HTTP-like protocol. You can get very good information on it here and there, but I will come back to it later. BEEP is the plumbing to do the talking.
The shell program requires particular attention because the source code is also provided. The shell program runs any command that is available on the agent's machine. The real question I have is this: for the Shell program or the Mandelbrot program, does the agent run its local copy (which it finds in /Library/Xgrid/Plug-ins/Mandelbrot.xgplug/ for instance), or does it receive a copy over the connection from the controller? I suspect it is the former, which would make everything less useful than it appears: you would need to have a local copy of your program installed on all the machines you want to run it on. Hence, if you have some scientific program you've made, you would need to find the agents, copy the files onto them and always make sure they have the proper version. That in itself would defeat the purpose of Rendezvous: you might not even know where the agents are and you very likely don't have access to them anyway, let alone administrator access. [Note, Jan 10th: However, the custom plug-in allows one to set an arbitrary program name and a working directory which may even contain files. Upon completion, the directory is copied back to the "Destination directory". More on that in another article.]
The source code provided by Apple (the Shell program) does not give enough information to get to the guts of Xgrid: one must derive a class from XgJobViewController and override a few functions, and we don't have the code for that class. Hence, the details of the Xgrid protocol are kind of hidden, which makes me scratch my head more than I should. And this brings me to the last section.
14 What about other software clustering technologies (MPI)?
• Xgrid is not a replacement for MPI. MPI is an API that enables programmers to write portable parallel applications, whereas Xgrid is a suite of applications and daemons which enables scientists to run distributed computations using a simple Mac OS X application.
• An Xgrid plug-in could be written and used as a replacement for programs such as mpirun, which coordinate the start and stop of MPI applications on a cluster of computers. However, no such plug-in is included with this release of Xgrid.
10 Can I use Xgrid with other UNIX-based computers?
• The short answer is no.
• The long answer is that Xgrid uses an XML property list protocol built on top of BEEP for all of its inter-computer communication and coordination, and because these protocols are open, it is possible a client, agent, or controller could be written to run on other UNIX-based computers and interoperate with Xgrid. However, no such programs have been written.
(Bold passages by me). MPI (Message Passing Interface) is the standard for parallel computation, at least in academia. It allows you to easily split a computation into sub-tasks and execute the sub-tasks on other computers that you specify manually in some configuration file or on the command line. How MPI talks to the other nodes is irrelevant: it just does and one should not care. However, MPI provides facilities to collect all the results of a calculation and "sum" them, which is something that Xgrid does not provide. Xgrid provides the piping and finds the agents to perform a task, but that's it. What I don't understand is how one can take the current MPI programs (with all the convenient functions for "summing" results) and use them in Xgrid. Apple alludes to the fact that they at least thought of it (I suspect they even have some kind of solution), but I just don't understand, since MPI has its own communication scheme. What do we need here? Some kind of xGridMPI? I am not sure.
But really, what I do know for sure is this: although some of us are lucky enough to have an OS X machine on the desk at work, most people around us don't. Moreover, the really powerful machines for calculations in universities are Unix-based and they aren't running OS X. Hence, it is critical that the protocol that Xgrid implements (what the controller asks the agents to do, and how) be made public so that Xgrid agents can be programmed for Linux, SunOS, IRIX, etc. Since BEEP has been implemented on tons of architectures (see http://www.beepcore.org/), the base plumbing is there for a brave soul to implement the Xgrid client, agent and controller on their machine of choice (and Rendezvous). Mac OS X will be the best machine from which to initiate the calculation, but as long as Xgrid does not interact with other architectures, its adoption in academia will be quite limited. We don't all have 1100 G5s in our labs.
Xgrid looks good and removes a lot of complexity in managing parallel computations, but how one tailors it to suit one's needs is not clear to me. If it is required to recreate the functionality of MPI, then I don't see the gain in using Xgrid (so far) considering the time investment. Moreover, how Xgrid differs from Pooch is also unclear. [Added Jan 8th: Actually Dauger has a FAQ about the difference between Xgrid and Pooch. This is it: Pooch does MPI, Xgrid does not. The discussion above is correct.]
Some simple assumptions: the syntax is for the C-shell (csh or tcsh), not bash or sh. sudo is a command that calls a program as root.
To send a job to the background in Unix, as most people know, you append '&' to the command:
It is in the background, as you can see with the command:
[1] is the job number ([1], [2], [3], etc...). Don't confuse the job number with the PID 23035 (Process ID number). You can bring the job back to the foreground with fg %1 (or just fg):
But what do you do when you have the job in the foreground and you want to send it to the background (i.e. you are "stuck" because you forgot the &, or you simply changed your mind and now you want the job in the background)? Control-C will kill the job, and you don't want that. If you type Control-Z, the job gets "suspended", which means it is not in the foreground anymore, but it is not running either. For instance if you type:
then Control-Z, the shell will respond with:
Listing the jobs will give you the following:
To make it run again, you have two options. The first is to use the bg command (which means "make it RUN in the background" or "change its status from suspended to running"):
Also, and that I did not know until recently, you can use kill -CONT <pid>, where <pid> is the process id number. In this context, kill is actually a pretty bad command name: it does not "kill" the program. It sends a signal to the program (which, if you don't use -CONT(inue), will be by default -TERM(inate)). Similarly, you can send a -STOP signal to suspend a job that is already in the background.
For instance, if you are running a lengthy and CPU consuming job (or disk consuming job), like:
You can see it running in the background:
where [1] is the job number and 21035 is the pid. By issuing:
you will suspend the job, until you use the bg command with the job number, or kill -CONT with the process ID number.
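The whole suspend/resume cycle can be tried out with a harmless stand-in job. This is a minimal sketch written for sh rather than csh (kill -STOP and kill -CONT are spelled the same way in both); the ticking loop, the log file and the variable names are made up for the illustration. The job appends one line per second to a file, so you can tell whether it is running by watching the line count.

```sh
# Sketch only: suspend and resume a background job with kill -STOP / -CONT.
log=$(mktemp)

# Stand-in for a lengthy job: append one line per second to $log.
( while :; do echo tick >> "$log"; sleep 1; done ) &
pid=$!                        # process ID of the background job

# count: number of lines written so far
count() { wc -l < "$log" | tr -d ' '; }

sleep 1
kill -STOP "$pid"             # suspend the job (it is not killed)
sleep 1
before=$(count)
sleep 2
during=$(count)               # same as $before: the job is not running
echo "while stopped: $before -> $during lines"

kill -CONT "$pid"             # resume the job, still in the background
sleep 2
after=$(count)                # larger than $during: the job runs again
echo "after CONT: $after lines"

kill "$pid"                   # no signal given, so TERM: this one ends the job
rm -f "$log"
```

The count stays frozen between the two readings taken while the job is stopped and starts growing again after kill -CONT, which is the same effect as bg with the job number.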