June 21, 2004
XGrid agent for Unix architectures
The official, permanent reference for this site is: "XGrid agent for Unix architectures" available at http://www.novajo.ca/xgridagent/.
In January 2004, Apple released XGrid, a simple system for setting up and using a cluster of OS X machines. It is very simple to use compared to other grid or cluster systems and reduces the learning curve for performing cluster computation. The appeal of XGrid is that it shields the end user from the details of the cluster. What is missing to make it even more powerful is an agent for architectures other than Mac OS X (the agent in Xgrid terminology is the computer performing the computation). This is important since the computer infrastructure available to scientists is not always based on Mac OS X, and universities have a significant investment in various Unix platforms that should not be neglected when running computations in clusters. This article introduces the first working Xgrid agent for Linux and other Unix systems that can be integrated in any XGrid cluster (managed by OS X). The agent will compile and work on Linux (at least Debian and RedHat), Solaris (minimal testing) and Darwin (tested). You still need an OS X machine for the controller and for using the actual XGrid (with XGrid.app). Also, the user currently needs to "be aware" that the cluster is multi-architecture (since the XGrid controller actually does not know). Examples are provided to show you how to deal with this.
This article is divided into the following sections:
- Getting the source code and compiling it
- Usage and examples
- Modifying the source code
- How it all fits together: Xgrid and its various layers
- Conclusion
There are other articles about Xgrid and POV-ray here. For comments, here is my contact info at the Ontario Cancer Institute (University of Toronto), Biophotonics group.
1) Getting the source code and compiling it
Necessary requirements to compile the agent:
- libxml2
- roadrunner (the BEEP library) and its required library glib-2.0 (and libxml2)
- xgridagent.c xgridagent.h xgridagent-profile.c xgridagent-profile.h xgrid.config.xml (see below for download)
I will give instructions to compile everything in your home directory tree (that is: you don't need root privileges). I haven't encountered any problems myself but let me know if you do.
Getting and compiling libxml2
If libxml2 is not installed on your system (check with xml2-config --libs), then you can install and compile it the standard way:
or with DarwinPorts: sudo port install libxml2
Getting and compiling glib-2.0
or with DarwinPorts: sudo port install glib2
Getting and compiling Roadrunner
Roadrunner is not available with DarwinPorts.
Getting and compiling XgridAgent
Thanks to Chris Baker for a gcc 2.95.x patch and Justin Gullingrud for the RedHat9 libxml2 patch.
You will get one warning: /home/dccote/xgridagent-rr/xgridagent.c:346: warning: the use of `tmpnam' is dangerous, better use `mkstemp'. Don't worry for now: that's the least of your problems. Don't run the agent as root: run as a regular user because there are a lot of vulnerabilities in the code. Try to run the agent with:
to connect to a controller (you can start a controller by hand in the terminal of an OS X machine with /usr/libexec/xgrid/GridServer). You will get a lengthy, verbose description of what the agent is doing; adjust the message level in xgrid.config.xml to tone it down. You must not use XGrid passwords on either the controller or the agents (password support is not implemented yet, although I don't think it is hard). You can then connect to the controller using the XGrid.app application, and start testing your cluster with Linux agents (see the limitations below).
Several notes on compilation:
- If you use this for anything other than testing, you are insane.
- The configure script isn't great: it does not check for all compatibility issues and might even fail to run properly without telling you. If you type pkg-config --list-all and you don't see glib-2.0, gthread-2.0, gobject-2.0 and libxml-2.0, you have a problem with the installation of some of the packages and must fix that first. One thing I have noticed: you might need to define PKG_CONFIG_PATH to point to where the various .pc files are (in this example, setenv PKG_CONFIG_PATH $HOME/lib/pkgconfig).
- Libraries are linked dynamically. Make sure that LD_LIBRARY_PATH is defined to include at least $HOME/lib (where the libraries are installed if you followed the instructions above).
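For those using a Bourne-type shell rather than csh, the two environment settings from the notes above can be sketched as follows. This is only a sketch, assuming everything was installed under $HOME as in the instructions; the helper name prepend_path is made up for the example, and the module list does not include Roadrunner because I am not sure under what name it registers itself with pkg-config.

```shell
#!/bin/sh
# Prepend a directory to a colon-separated search path, avoiding a
# dangling ":" when the existing path is empty. Prints the new value.
# (prepend_path is a hypothetical helper, not part of the agent.)
prepend_path() {
    dir=$1
    current=$2
    if [ -z "$current" ]; then
        printf '%s\n' "$dir"
    else
        printf '%s:%s\n' "$dir" "$current"
    fi
}

# Assumption: libraries were configured with --prefix=$HOME.
PKG_CONFIG_PATH=$(prepend_path "$HOME/lib/pkgconfig" "${PKG_CONFIG_PATH:-}")
LD_LIBRARY_PATH=$(prepend_path "$HOME/lib" "${LD_LIBRARY_PATH:-}")
export PKG_CONFIG_PATH LD_LIBRARY_PATH

# Report which of the required modules pkg-config can actually find.
if command -v pkg-config >/dev/null 2>&1; then
    for module in glib-2.0 gthread-2.0 gobject-2.0 libxml-2.0; do
        if pkg-config --exists "$module"; then
            echo "found:   $module"
        else
            echo "missing: $module"
        fi
    done
fi
```

Any module reported as missing should be fixed before running the configure script of the agent.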
2) Overview of usage
The agent will load most of its configuration parameters from xgrid.config.xml. You may modify it at will. The program will write a file called cookie to reuse the same cookie between calls. The actual tasks run in "/tmp/filexxxx/".
When you open XGrid.app, you should obtain something along the lines of:
where the cluster in the picture makes use of three Linux machines, in addition to an OS X agent running as dccote. In your case, you will most likely have only one Linux agent.
The Shell Xgrid plug-in will simply call a shell command (regardless of where it is in the execution path on the agent). For instance, on a cluster with a single Linux agent, one obtains the following result with uname -a:
You may try the XFeed plug-in to send a range of arguments to a command, but because of the way Xgrid.app works, the command's path must be the same on the Linux agent and on the computer from which you run Xgrid.app (if they aren't, it will tell you that the command is invalid). Note the following major restriction: large outputs/files will not get sent back properly and the agent will hang (see bugs below).
Finally, you can make a Custom plug-in: that's where it becomes interesting. If you want to execute a Bourne shell script, then everything is fine (they are portable across Unix platforms):
with Test.sh being:
(make sure Test.sh is executable with chmod +x Test.sh). You will get someFile1.txt copied back in the destination directory, as well as "Some text to stdout" in the stdout file.
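For reference, a script that produces exactly that outcome could look like the sketch below. It is a hypothetical stand-in written for illustration, not necessarily the Test.sh used here; the function name run_task and the text written to the file are assumptions.

```shell
#!/bin/sh
# Hypothetical stand-in for Test.sh.
run_task() {
    # A file created in the working directory gets copied back to the
    # destination directory when the task is done.
    echo "Some text in a file" > someFile1.txt
    # Anything printed on standard output ends up in the stdout file.
    echo "Some text to stdout"
}

run_task
```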
The custom plug-in can also send a binary executable to the agent and execute it, after which it sends the results back. Since you can't know ahead of time which node of your cluster will run what, you must provide a binary for each type of agent you have (or you must compile it each time). Assuming you know that you have both Darwin on PowerPC and Linux on i686 agents, you can do the following:
where cal and ncal are the binaries for each platform and the shell script chooseAndRun.sh is:
The script will figure out what architecture it is running on and call the appropriate binary. This is the starting point for a multi-architecture calculation: one would provide all the binaries for all the possible agents and make a script similar to the one above to carry out the calculation.
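A minimal sketch of such a dispatch script is given below, assuming the platform is identified with uname -s and uname -m. The binary names mytask.darwin-ppc and mytask.linux-i686 and the function pick_binary are hypothetical, chosen only to make the mapping explicit; substitute whatever binaries you actually ship.

```shell
#!/bin/sh
# Hypothetical stand-in for chooseAndRun.sh.
# Map an OS name and machine type (as printed by uname -s and uname -m)
# to the binary built for that platform. The file names are assumptions.
pick_binary() {
    case "$1:$2" in
        Darwin:Power*|Darwin:ppc*) echo "./mytask.darwin-ppc" ;;
        Linux:i?86)                echo "./mytask.linux-i686" ;;
        *)                         return 1 ;;
    esac
}

# Run the matching binary if there is one and it is executable.
if binary=$(pick_binary "$(uname -s)" "$(uname -m)") && [ -x "$binary" ]; then
    exec "$binary" "$@"
else
    # A real script would probably also exit with a non-zero status here.
    echo "no usable binary for $(uname -s) on $(uname -m)" >&2
fi
```

Adding a new architecture then comes down to shipping one more binary in the working directory and adding one more line to the case statement.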
Notes on usage (also known as bugs):
- Again, if you use this for anything other than testing, you are insane.
- The xgridagent provided here isn't working well with multiple simultaneous tasks. I am not too sure why. The timing of the mutex/semaphore must be wrong. For now, 1 task per agent is recommended (and works well).
- Very important bug: if the message sent by the agent is larger than 15k, it will hang. This is a problem due to my poor understanding of BEEP. See code xgridagent-profile.c for problem description in the function xgridagent_SengMSG(). This means that with the Custom plugin, if you generate data in the working directory and it is larger than 15k (tarred and zipped), the agent will hang. Most useful cases fall in that category. Fix the code if you know how, because I don't.
- There are more buffer overflow vulnerabilities than you can count. They will get fixed. In the meantime, don't run this as root.
- If you close the server, the agent will most likely crash.
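Until the large-message bug is fixed, it can help to estimate beforehand whether a working directory will stay under the limit once tarred and zipped. The sketch below is only an estimate under stated assumptions: it takes "15k" to mean 15 * 1024 bytes, and it ignores whatever overhead the agent adds when wrapping the archive in a message. The function names are made up for the example.

```shell
#!/bin/sh
# Assumption: "15k" is taken here as 15 * 1024 bytes.
LIMIT_BYTES=15360

# Print the size in bytes of a directory once tarred and gzipped.
archive_size() {
    tar -cf - -C "$1" . | gzip -c | wc -c | tr -d ' '
}

# Succeed when the compressed directory is smaller than the limit.
fits_in_message() {
    [ "$(archive_size "$1")" -lt "$LIMIT_BYTES" ]
}
```

For instance, fits_in_message /path/to/workdir || echo "too big, the agent will probably hang" gives a quick warning before submitting a job.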
3) Modifying the source code
If you want to modify the code, then here are a few general warnings and comments:
- The code is released under the Apache licence.
- Please send me your modifications at dccote@novajo.ca so I can include them in the main distribution. If you PGP sign your email, it will get through my spam filter for sure (my pgp key is here). Don't encrypt, just sign.
- The code is full of threads. If you don't know threads, then read a tutorial (like http://www.yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html)
- The code makes heavy use of XPath, which is a way to refer to any part of an XML document (it looks like a directory tree but it is more than that). You can simply modify the examples throughout the code, but you can also learn about it.
Here is a graphical overview of the code:
Specific comments and pitfalls:
- The semaphore function sem_init is not implemented on Darwin whereas the functions sem_open/sem_close are not implemented on Linux. That's why there is some juggling with the initialization of the semaphores in the code.
- Take a look at the to do list below.
- The only entry point from the BEEP library is xgridagent_ProcessBeepMSG() (called when a message is received from the controller).
- The only call to a BEEP library function from the code is xgridagent_SengMSG() (called when a message is sent by xgridagent).
- Adjust the xgrid.config.xml file to your liking for the debug level: 4 will spit out quite a bit of stuff, 0 is pretty quiet. (See comments in file and code).
- If for some reason you would prefer to use beepcore-c instead of roadrunner, it is actually quite simple to change and I have another version of the code that uses beepcore-c (I started with beepcore-c and switched in the middle of the development because I had too many problems with it). I don't recommend it.
To do:
- Fix the hang problem when messages are larger than advertised BEEP window. This should be simple, I just don't know enough about BEEP.
- Improve the autoconf, automake scripts. BTW, autoconf 2.59 produces a bogus libtool. I use autoconf 2.57.
- Much better error management needed: the use of a static buffer with LogMessage() is not even thread safe.
- Stop use of unsafe buffers: there are tons and tons of vulnerabilities (buffer overflows) in the code, because I use printf and scanf in finite buffers.
- Add more flexibility in the way the tasks are started. For instance, allow the use of various site-specific commands (e.g. local job management systems) for starting tasks.
- It is assumed everything is in UTF-8 characters. That could be wrong and could lead to incorrect replies, results, etc. I am extremely cavalier with my use of (the aptly named) BAD_CAST operator from Libxml2.
- Check for the presence/absence/compatibility of the various Unix commands called (tar for instance does not always accept -z).
- Better security: jailing the process in /tmp/ and running as nobody could be useful.
- Clean up the code and use better terminology.
- Use passwords and SASL profiles.
- Add support for idle mode.
- Add Rendezvous support. Start here: Apple's Rendezvous code for Linux, FreeBSD and all.
4) How it all fits together: Xgrid and its various layers
There are various layers in the Xgrid agent:
4.1) The Xgrid layer
The XGrid protocol is actually quite simple to understand, since there are only three types of messages that can be passed: a request, a reply (the answer to a request) and a notification (to which there is no need to reply). Each message is identified with a CorrelationID, a name, a type (request/reply/notification) and a payload (which contains something specific to the current message, identified by its name). The XGrid protocol is also the application protocol (that's what the application understands) and has nothing to do with the actual communication protocol (tcp/ip, beep, etc.). Here is a graphical overview of the client registration process as well as the task submission process: View Registration image, View Task Submission image
4.2) The BEEP layer
Each XGrid message is sent as a BEEP MSG, and must be acknowledged when received completely by an empty RPY. MSG's can be sent in smaller chunks (frames). The implementation of BEEP that is used in this xgridagent is Roadrunner, but there is also beepcore-c (which is not as flexible).
4.3) XML
It is convenient but not necessary that both XGrid and BEEP rely on XML. Some BEEP information (in the initiation of the connection for instance) is encoded in XML. XGrid uses XML extensively, which makes it trivial to analyze.
4.4) Lightweight threads
Because two computers are talking to each other over the network, it is convenient to use threads for the BEEP library. This means that there is no "single point" in the code where one can follow the execution: it looks like several parts are running in parallel. To make sure that the various threads can talk to each other, one uses a simple locking mechanism (mutex) or a signalling system (semaphores).
Conclusion
There remain a few important bugs in the agent code, but they should be worked out quickly if others look at the code. It can be used for simple examples for now, involving agents of different architectures on the same cluster. Since the Xgrid application protocol is platform agnostic, this agent can be used to bring any Unix machine into an XGrid cluster. Since XGrid can be tunneled through SSH (see XGrid documentation), it can be integrated in a secure research environment. The official reference for this site is: "XGrid agent for Unix architectures" available at http://www.novajo.ca/xgridagent/. Any question or comment can be sent to Daniel Côté (OCI, U of T) or to dccote@novajo.ca.
This work was done with help and encouragement from the XGrid team and Ernest Prabhakar at Apple.
January 11, 2004
Xgrid example: Parallel graphics rendering in POVray
Running something useful on Xgrid
This article is the third in a series on Xgrid, see Part I and Part II. In the present article, we look at a real life example to see how one can use Xgrid to actually get something done.
What is XGrid good for?
It has come up several times (on the Web, on various forums and on the Apple mailing list): what is Xgrid good for? Xgrid is good for programs that can be broken up into smaller pieces independent of each other. An example in science is Monte Carlo calculations, where the same (relatively simple) calculation is repeated several million times. Another would be what's called a "parameter study", where the same program is run several times with different parameters. The Mandelbrot map calculation provided with Xgrid.app is another good example: the calculation at a given point is completely independent of the calculations for other points (it is a calculation of the "speed" at which the recursive application of a function diverges). An example that is most likely to be interesting to most people is graphic rendering. Each part of an image tends to be independent of other parts. Hence one can break up an image (or a scene in a movie) into smaller images and render them on several computers. This is exactly what we are going to do here, using a program called Persistence of Vision Raytracer (or POVray for short). I will try to keep the details of povray out of this article whenever possible.
Requirements
Xgrid and the command line version of Povray. Two simple programs: generate (to generate the .INI files) and combineppm (to stitch together the graphics). Links to those files are available below (in context).
The task
We want to render (that is, create) a complex image using POVray. We need to have the command line version of povray installed. This can be done with DarwinPorts with sudo port install povray. The program gets installed in the /opt/local/ tree, which is assumed in this article. POVray comes with a wide selection of scenes; we will render chess2.pov, available in the scenes/advanced/ directory. We will generate a file at a resolution of 1024 x 768. Because the rendering can take a long time (say, hours), it is advantageous to split it into several subtasks and have several computers each render a small piece of the image. That's how Xgrid can help.
Setting up the tasks
There is no magic with Xgrid: if you want every agent to do a slightly different task than the other agents, then you need to provide a list of slightly different arguments. What they are and how you generate them depends on the problem at hand. For POVray, we can create .INI files that are passed as arguments to the povray executable. The .INI files have everything POVray needs to do its thing. We generate those files such that each node will generate a slightly different slice of the image and save it under a specific name. (I use a trivial Perl script called generate to generate the files). I arbitrarily decided that 4 slices were enough, but typically, you would set up as many tasks as you have agents, as long as those tasks are not too small (there is a point where splitting does not gain you anything since Xgrid will spend most of its time copying files over the network):
This can be generated from the Perl script "generate".
Hence, at this point, we have 4 files in Povray_args (no other files since each file will get passed as an argument to the agent, you don't want anything else).
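For readers without the Perl script at hand, the slicing can be sketched in plain shell as below. This is not the generate script itself, just an illustration of the idea under a few assumptions: the file names (0.ini, 1.ini, ... and chess2_0000.ppm, ...) and the two function names are made up, and only a minimal set of POV-Ray INI options is written (Start_Row and End_Row select the slice, Output_File_Type=P asks for PPM).

```shell
#!/bin/sh
# Print "first last" (1-based, inclusive rows) for slice $3 out of $2
# equal slices of an image $1 rows high. Slices are numbered from 0.
slice_rows() {
    height=$1; count=$2; index=$3
    first=$(( index * height / count + 1 ))
    last=$(( (index + 1) * height / count ))
    echo "$first $last"
}

# Write one .INI file per slice into directory $1 for an image of
# width $2 and height $3 split into $4 slices.
generate_ini_files() {
    outdir=$1; width=$2; height=$3; count=$4
    mkdir -p "$outdir"
    i=0
    while [ "$i" -lt "$count" ]; do
        rows=$(slice_rows "$height" "$count" "$i")
        # Each slice is saved under its own name so results don't collide.
        output=$(printf 'chess2_%04d.ppm' "$i")
        cat > "$outdir/$i.ini" <<EOF
Input_File_Name=scenes/advanced/chess2.pov
Output_File_Name=$output
Output_File_Type=P
Width=$width
Height=$height
Start_Row=${rows% *}
End_Row=${rows#* }
EOF
        i=$(( i + 1 ))
    done
}
```

Calling generate_ini_files "$HOME/Desktop/Povray_args" 1024 768 4 would then produce the four argument files, with rows 1-192, 193-384, 385-576 and 577-768.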

Enter Xgrid.app: Custom plug-in
Xgrid comes with a custom plug-in that hides all the power behind a simplistic interface. We will perform the rendering, and then I will discuss some important aspects. Start /Applications/Xgrid.app and log in to your "cluster". (If you are testing this on a single computer, make sure you have both the agent and server running in System Preferences:Xgrid). We choose the Custom Plug-in option of Xgrid:

We then fill in the form with:

which pretty much says "Run the command /opt/local/bin/povray from the (equivalent of) the working directory /opt/local/share/povray-3.5/ with all the files in ~/Desktop/Povray_args/ as arguments and store the result (I'll get back to that) on the desktop". We need to run from /opt/local/share/povray-3.5/ since povray needs access to all sorts of files that are stored in that directory. At this point, if you click Submit Job (and you don't have any mistakes in the argument files), everything will go through and will start processing. If you have more than 4 machines, you might want to split the job in more than 4 slices of images (see the generate script).
A note: I had Job Timeout several times, but I don't think they really were timeouts: the Tachometer was up, and an excursion to the terminal showed that povray was actually running. Since the output files get created at the very end of the job (when all of the tasks are done), you don't see any output until the end. I never had the patience to wait and see if I got the files eventually.
Pasting the images together
When the job is done, you will have 4 files and 1 folder on your desktop: the standard output of each of the 4 runs (called something like povray_Users-dccote-Desktop-Povray_args-0.ini.txt) and povray-3.5/. povray-3.5 contains the files chess2_000n.ppm.

To stitch the files together, you need to use a simple program that will take those files and produce one big final file. I found a program called combineppm that does just that (the web page I got it from also discusses POVray on a grid incidentally).
You can then open chess2.ppm in GraphicConverter (Preview.app does not open ppm files). You now have a nicely rendered graphic that looks like this:

Discussion of Xgrid
If you have followed the instructions above, you might be thinking that a few things are missing for a good understanding. This section hopes to provide some enlightenment.
You don't need to have the executable nor all of the "working" directories on the agent machines to make Xgrid work: the binary and the working directory get tarred and extracted into /tmp/xgagent.XXXXX and /tmp/xgagent.YYYYY on the agents (the full directory tree gets extracted). Hence, when the binary is launched from the working directory, all the files are accessible. Moreover, when done, the working directory (which was copied to the agent) is copied back to your computer (via a tar command again, I assume). Hence, at the end of the job, you have in your destination directory (~/Desktop/ in this case) a copy of the working directory in the state it was at the end of the calculation, including any output files. A side effect: you must make sure that each job produces a file with a different name, because if you don't they will get overwritten.
The purpose of BEEP in all this is to provide the underlying protocol between agent, controller and clients. I don't know enough about it to say much, except that it is BEEP that makes it relatively easy to exchange more than just text without having to redefine an entirely new protocol.
Since all XGrid tasks run in user space as nobody (not in the kernel) it is safe. In addition, the communication between agents and controller is well-defined and convenient: agents contact the controller, hence only the controller needs to have its firewall adjusted (open port 4111), not the agents.
What I left unsaid about POVray
POVray is great but has a few quirks and partial rendering is one of them: the generated files are corrupted graphics files. The image size is given as the total image size, not the size of the section you just rendered. That's the reason the PPM format was used, since it is easy to stitch those files together with the combineppm program. If you try to open the individual files (say chess2_0001.png if you had chosen the PNG format with the option +FN) in Preview.app or Photoshop, it will fail. (God did I run out of walls to bang my head on before I figured this one out.)
What's missing from Xgrid
A way to monitor the running processes on all the machines.
A way to recover from a timed-out machine.
A way to monitor the status of the current running job.
A way to monitor the submitted jobs to the grid.
Those are exactly the kind of things that are not implemented in a homemade solution, and this is why Apple should do them.
Also, a better way to set arguments: you can't provide dependent ranges (like 1-10, 11-20, 21-30, etc.). It would have been useful in the present case.
Conclusion
Xgrid is a great app since it simplifies the setting up of a grid, remote access and copying of files to the agent. Contrary to what I initially thought, it is usable out-of-the-box. (I actually ran some of my own Monte Carlo calculations using XGrid, but thought this example was better). One cannot have the agents talk to each other (like MPI would allow you to do), but this is not what grid computing is about (something I had not grasped when I first looked into Xgrid): Grid computing is about independent computations, not interdependent computations.
Lots of things could be done (and they have been discussed on the Xgrid mailing list; username archives, password archives). The most important to me is agents for other architectures. Another application, for instance, is to create (I haven't tried it, but it looks promising) an AppleScript that would contact the local machine (via a remote AppleScript call) under a username and password defined for that machine (assuming that user is logged in at the console and the machine accepts remote AppleScripts) to process something using GUI applications. There certainly isn't anything that would prevent this from happening in the current implementation of Xgrid.
Feel free to contact me for comments and questions at dccote_at_novajo.ca about this article.
Resources
Xgrid mailing list (username archives, password archives), in particular Xgrid architectural overview
Macintouch Xgrid report
Apple Xgrid
BEEP
Some keywords: example, tutorial, Xgrid, Apple, cluster, parallel processing, rendering, render farm, povray, Mac OS X
January 08, 2004
Xgrid, the details
Getting acquainted with Xgrid. Here are the few things I have found so far.
The first part of this article is available here. The next part is available here.
The most important picture
This explains the architecture of Xgrid fairly well (straight out of Apple's documentation ReadmeFirst.pdf):

There are three entities: the client, controller and agents. There is a program associated to each. The agents (which you start in the System Preferences:Xgrid) are the simplest ones to understand: they do whatever they are told to do by the controller to which they connect. The controller (which you also start in the System Preferences:Xgrid) knows about all the agents (they contacted it) and manages the calculation (but does not actually perform it). The client (that's the /Applications/Xgrid.app) contacts the controller and submits a job to the controller, which will in turn send it to the agents and collect the results they send.
External Agents
Although it is quite obvious after the fact, you can get any computer to join your cluster, regardless of where it sits on the internet (since Xgrid is TCP based, via BEEP). Let's assume you have a controller started on your machine. Let's also assume your friend (who is not behind a router, including AirPort, and has his firewall properly set up) has installed Xgrid. He is just willing to give CPU cycles to a good cause (experimenting with Xgrid to do some nice but useless Mandelbrot calculations). In his System Preferences settings, he simply needs to choose "Bind to specific host" and type your IP name or address in order to join your cluster:

You should now be able to use his machine as an agent when you start Xgrid.
External client
He can also start Xgrid.app on his computer and type, for the service name, the IP address or name of the machine where you run the controller (and have gathered all the agents). He is then not using any of his own processing power but is using all of your cluster.
Still missing
By looking at the ShellJobViewController.m (on the Xgrid disk, Shell directory) as well as the Custom Plug-in in Xgrid.app, I am starting to get a better idea of what Xgrid actually does: it makes a list of commands (which must be locally installed [Not true for custom plug-in: added Jan 11. See next article.]) with their parameters (which it can generate, as for instance in the Custom Plug-in of Xgrid.app) as well as a standard input (which I guess is redirected when Xgrid "calls" the agent) and it collects the standard output (which can then get dumped into a file).
That sounds reasonable but I still have a question: you can accomplish that with ssh (connect to a machine, run a command and collect the result), so why Xgrid? I suspect the answer is twofold: 1) to connect to a machine with ssh, you need an account on that machine (with shell access) and if you do, then you can pretty much do anything you want on that machine (not so good for the owner) and 2) if you use BEEP instead of ssh, you can transfer things other than just text (from reading the documentation at http://www.beepcore.org).
So what I want to know is: can I use Xgrid to upload a given program to the agents before running it? [Note added Jan 11th: yes, see next article]. I wonder if there is some facility in Xgrid to do that, or if you need to do that manually with some kind of remote copying with scp and such. Actually, I tested the remote copy and it looks somewhat complicated: the process on the agent runs as nobody and is kept in /tmp/ as can be obtained easily with the shell program of Xgrid.app:

However, I found a way to circumvent that: you could upload a file to a web server and have the Xgrid agent download it, then execute it. For instance if the program echo "program downloaded and ran" is kept in a file called testprog, one can do the following:
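As a rough sketch of that idea, assuming curl is installed on the agent: the function below downloads a script into a fresh temporary directory and runs it with sh. The function name fetch_and_run is made up for the example, and the URL in the usage note is a placeholder rather than a real location.

```shell
#!/bin/sh
# Download a script from URL $1 into a fresh temporary directory and
# run it with sh. Assumes curl is installed on the agent.
fetch_and_run() {
    url=$1
    workdir=$(mktemp -d) || return 1
    # -f: fail on HTTP errors, -sS: quiet except for error messages.
    curl -fsS -o "$workdir/testprog" "$url" || return 1
    sh "$workdir/testprog"
}
```

With testprog uploaded to a web server, fetch_and_run http://example.com/testprog would print "program downloaded and ran" on the standard output of the agent.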

This is not a security issue: you are running as the user nobody and hence don't have access to much (not more than you would by running some other command that's already installed).
Troubleshooting
I have encountered a few problems with Xgrid (nothing that can't be fixed). Here they are.
No connection possible
If you are trying to connect to other machines, for now take your firewall down. If you are really paranoid (and you should be), open ports above 49152 and see if it works. If you run everything on your own machine, then you should be fine and don't even need to worry about the firewall (leave it on).
Broken start/stop
I have had two problems, on two different machines, when trying to start Xgrid (agent and controller). One of them is an iBook with FileVault, the other is a machine with its Home directory on a server (at work).
The problem has the following symptoms: you try to start or stop the controller and you keep getting error message after error message and, when they stop, the agent or controller still hasn't started. The error messages don't say much (they talk about the volume not following permissions, which is not true).
The problem is that permissions on a few password files are incorrect or the files don't exist and don't get created (they must be root:wheel with permission 0600 as explained in the document XgridRemoteInstallation). If you start the server or agent manually, you will see a list of warning and error messages that tell you just that.
sudo /Library/Xgrid/Scripts/agent_start
That can easily be fixed, but the fix removes the password protection for now. I recommend doing this:
sudo rm /Library/Xgrid/Agent/controller-password
sudo rm /Library/Xgrid/Server/agent-password
sudo rm /Library/Xgrid/Server/client-password
sudo touch /Library/Xgrid/Agent/controller-password
sudo touch /Library/Xgrid/Server/agent-password
sudo touch /Library/Xgrid/Server/client-password
sudo chown root:wheel /Library/Xgrid/Agent/controller-password
sudo chown root:wheel /Library/Xgrid/Server/agent-password
sudo chown root:wheel /Library/Xgrid/Server/client-password
sudo chmod 0600 /Library/Xgrid/Agent/controller-password
sudo chmod 0600 /Library/Xgrid/Server/agent-password
sudo chmod 0600 /Library/Xgrid/Server/client-password
Then in the two files /Library/Preferences/com.apple.xgrid.agent.plist and /Library/Preferences/com.apple.xgrid.controller.plist, change the RequireControllerPassword and RequireClientPassword settings from true to false (if they were true) with a text editor.
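To double-check the result of the commands above, something along these lines can be used. It is only a sketch that parses the output of ls -ld; the function names are made up for the example, and it assumes the owner and group appear in the third and fourth columns, as they do with the standard ls on OS X and Linux.

```shell
#!/bin/sh
# Print the ten-character mode string of a file (e.g. -rw-------),
# dropping any trailing ACL or attribute marker that ls may append.
mode_string() {
    ls -ld "$1" | cut -c1-10
}

# Print "owner:group" of a file as reported by ls.
owner_group() {
    ls -ld "$1" | awk '{ print $3 ":" $4 }'
}

# Succeed when file $1 exists with mode 0600 and owner:group equal to $2.
check_password_file() {
    [ -f "$1" ] &&
    [ "$(mode_string "$1")" = "-rw-------" ] &&
    [ "$(owner_group "$1")" = "$2" ]
}

for f in /Library/Xgrid/Agent/controller-password \
         /Library/Xgrid/Server/agent-password \
         /Library/Xgrid/Server/client-password; do
    if check_password_file "$f" root:wheel; then
        echo "ok:           $f"
    else
        echo "needs fixing: $f"
    fi
done
```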
January 07, 2004
XGrid
Today, Apple introduced Xgrid. What is Xgrid and why should one care? This article describes my findings on Xgrid. Everything is available in the documentation or somewhere on the web, but this article presents a quick overview.
The second part of this article, entitled "Getting acquainted with Xgrid", is available here. The third part "Xgrid: Povray example" is available here.
The announcement and files
First, to get Xgrid, go to The Advanced Computation Group and download the disk image. Mount it so you can get access to the documentation files. I will refer to the Xgrid 1.0 disk image that gets mounted as "The Xgrid disk". The press release is also available.
What is XGrid?
XGrid allows you to take a program and run it on various machines in parallel in order to get the result faster. There are three players in an Xgrid calculation: the client is the computer that wants to initiate a calculation (i.e. the one running /Applications/Xgrid.app), the controller is the computer that actually initiates the calculation and the agents are the computers performing the calculation. A given computer can act as any of the three, even at the same time (you set both the controller and agent in System Preferences).
Xgrid is targeted towards computations that take a very long time (several hours). Typical applications that gain from this are: Monte Carlo calculations, 3D rendering, and other calculations that can be broken in several sub-tasks that don't affect each other. Apple provides a few examples, the most obvious is Mandelbrot: the calculation of the Mandelbrot map at a given point does not depend on the result at another point. Hence, one can split the whole map in sub-maps and ask the agent computers to perform their part of the calculation.
Xgrid does not perform the calculation. Actually, Xgrid does not know squat about math or science. Even worse (or better?), Xgrid does not even know you are trying to "compute" something. Xgrid provides the basic infrastructure so that one computer can talk to several others, run a command and get the result. That's it. It is based on BEEP, which is a (new?) HTTP-like protocol. You can get very good information on it here and there, but I will come back to it later. BEEP is the plumbing to do the talking.
The examples
When you follow the documentation of the Read Me First.pdf file (on the Xgrid disk), you can quickly run the Mandelbrot program or a program called shell. There is also a program that allows you to run any arbitrary Unix command on all the agents.
The shell program requires particular attention because the source code is also provided. The shell program runs any command that is available on the agent's machine. The real question I have is this: for the Shell program or the Mandelbrot program, does the agent run its local copy (which it finds in /Library/Xgrid/Plug-ins/Mandelbrot.xgplug/ for instance), or does it receive a copy over the connection from the controller? I suspect it is the former, which would make everything less useful than it appears: you would need to have a local copy of your program installed on all the machines you want to run it on. Hence, if you have some scientific program you've made, you would need to find the agents, copy the files onto them, and always make sure they have the proper version. That in itself would defeat the purpose of rendezvous: you might not even know where the agents are and you most likely don't have access to them anyway, let alone administrator access. [Note, Jan 10th: However, the custom plug-in allows one to set an arbitrary program name and a working directory which may even contain files. Upon completion, the directory is copied back to the "Destination directory". More on that in another article.]
The source code provided by Apple (the Shell program) does not give enough information to get to the guts of Xgrid: one must derive a class from XgJobViewController and override a few functions, and we don't have the code for that class. Hence, the details of the Xgrid protocol are kind of hidden, which makes me scratch my head more than I should. And this brings me to the last section.
What Xgrid is not
If you read the FAQ on the Xgrid1.0 disk image, you will find questions 14 and 10:
14 What about other software clustering technologies (MPI)?
• Xgrid is not a replacement for MPI. MPI is an API that enables programmers to write portable parallel applications, whereas Xgrid is a suite of applications and daemons which enables scientists to run distributed computations using a simple Mac OS X application.
• An Xgrid plug-in could be written and used as a replacement for programs such as mpirun, which coordinate the start and stop of MPI applications on a cluster of computers. However, no such plug-in is included with this release of Xgrid.
10 Can I use Xgrid with other UNIX-based computers?
• The short answer is no.
• The long answer is that Xgrid uses an XML property list protocol built on top of BEEP for all of its inter-computer communication and coordination, and because these protocols are open, it is possible a client, agent, or controller could be written to run on other UNIX-based computers and interoperate with Xgrid. However, no such programs have been written.
(Bold passages by me). MPI (Message Passing Interface) is the standard for parallel computation, at least in academia. It allows you to easily split a computation into sub-tasks and execute them on other computers that you specify manually in some configuration file or on the command line. How MPI talks to the other nodes is irrelevant: it just does and one should not care. However, MPI provides facilities to collect all the results of a calculation and "sum" them, which is something that Xgrid does not provide. Xgrid provides the piping and finds the agents to perform a task, but that's it. What I don't understand is how one can take the current MPI programs (with all the convenient functions for "summing" results) and use them in Xgrid. Apple alludes to the fact that they at least thought of it (I suspect they even have some kind of solution), but I just don't understand, since MPI has its own communication scheme. What do we need here? Some kind of xGridMPI? I am not sure.
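To make the "summing" step concrete, here is a minimal Python sketch of what one has to do by hand when the infrastructure only runs sub-tasks and returns their output. It uses neither MPI nor Xgrid: the Monte Carlo estimate of pi, the function names and the per-task seeds are my own choices for illustration. Each call to count_hits stands for one sub-task that could run on a separate machine; the last lines of estimate_pi are the collection step that MPI would otherwise provide.

```python
import random


def count_hits(n_samples, seed):
    """One independent sub-task: count random points falling inside the unit quarter circle."""
    rng = random.Random(seed)          # separate seed so sub-tasks do not repeat each other
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_tasks, samples_per_task):
    """Run the sub-tasks (sequentially here), then combine their partial results."""
    partial = [count_hits(samples_per_task, seed) for seed in range(n_tasks)]
    total_hits = sum(partial)          # the manual "sum" over all sub-task results
    return 4.0 * total_hits / (n_tasks * samples_per_task)
```

Calling estimate_pi(4, 5000) combines four partial counts into a single estimate close to 3.14. With only a job launcher, this combining code has to live somewhere on the client side.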
But really, what I do know for sure is this: although some of us are lucky enough to have an OS X machine on the desk at work, most people around us don't. Moreover, the really powerful machines for calculations in universities are Unix-based and they aren't running OS X. Hence, it is critical that the protocol that Xgrid implements (what the controller asks the agents to do, and how) be made public so that Xgrid agents can be programmed for Linux, SunOS, IRIX, etc. Since BEEP has been implemented on tons of architectures (see http://www.beepcore.org/), the base plumbing is there for a brave soul to implement the Xgrid client, agent and controller on their machine of choice (and rendezvous). Mac OS X will be the best machine from which to initiate the calculation, but as long as Xgrid does not interact with other architectures, its adoption in academia will be quite limited. We don't all have 1100 G5s in our labs.
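To give a feel for what "an XML property list protocol" (FAQ question 10) could look like from a non-Mac machine, here is a small Python sketch that turns a dictionary into an XML property list and reads it back, using the standard plistlib module. The keys and values in the example message are invented for illustration only; the actual Xgrid message layout is not known here, and the BEEP transport is left out entirely.

```python
import plistlib


def encode_message(message):
    """Serialize a dictionary as XML property list bytes."""
    return plistlib.dumps(message, fmt=plistlib.FMT_XML)


def decode_message(data):
    """Parse property list bytes back into Python objects."""
    return plistlib.loads(data)


# Hypothetical message: these keys are NOT taken from the Xgrid protocol.
request = {"command": "/bin/hostname", "arguments": [], "taskId": 1}
wire = encode_message(request)
```

Printing wire.decode() shows the plist text that would travel over the connection, and decode_message(wire) gives back the original dictionary. Any platform with an XML parser can do the same, which is why the format itself is not an obstacle to a Unix agent.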
Wrap up
Xgrid looks good and removes a lot of complexity in managing parallel computations, but how one tailors it to suit one's needs is not clear to me. If it is required to recreate the functionalities of MPI, then I don't see the gain in using Xgrid (so far) considering the time investment. Moreover, how Xgrid differs from Pooch is also unclear. [Added Jan 8th: Actually Dauger has a FAQ about the difference between Xgrid and Pooch. This is it: Pooch does MPI, Xgrid does not. The discussion above is correct.]
The second part of this article is available here.
January 03, 2004
Jobs control in unix
Some simple assumptions: the syntax is for the C-shell (csh or tcsh), not bash or sh. sudo is a command that calls a program as root.
To send a job to the background in Unix, most people know that you append '&' to the command:
It is in the background, as you can see with the command:
[1] is the job number ([1], [2], [3], etc...). Don't confuse the job number with the PID 23035 (Process ID number). You can bring the job back to the foreground with fg %1 (or just fg):
% fg
But what do you do when you have the job in the foreground and you want to send it to the background (i.e. you are "stuck" because you forgot the &, or you simply changed your mind and now you want the job in the background)? Control-C will kill the job, and you don't want that. If you type Control-Z, the job gets "suspended", which means it is not in the foreground anymore, but it is not running either. For instance if you type:
then Control-Z, the shell will respond with:
Listing the jobs will give you the following:
To make it run again, you have two options: use the bg command (which means "make it RUN in the background" or "change its status from suspended to running"):
Also, and that I did not know until recently, you can use kill -CONT <pid>, where <pid> is the process id number. In this context, kill is actually a pretty bad command name: it does not "kill" the program. It sends a signal to the program (which, if you don't use -CONT(inue), will be by default -TERM(inate)). Similarly, you can send a -STOP signal to suspend a job that is already in the background.
For instance, if you are running a lengthy and CPU-consuming job (or disk-consuming job), like:
You can see it running in the background:
where [1] is the job number and 21035 is the pid. By issuing:
you will suspend the job, until you use the bg command with the job number, or kill -CONT with the process ID number.
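The same STOP and CONT signals can be sent from a program rather than from the shell. Here is a minimal Python sketch, assuming a Unix system with a sleep command; the function name suspend_and_resume is my own. It starts a child process, suspends it with SIGSTOP (what kill -STOP sends), confirms that it is stopped, resumes it with SIGCONT (what kill -CONT sends), and finally terminates it with SIGTERM (the default signal of kill).

```python
import os
import signal
import subprocess


def suspend_and_resume(command):
    """Start command, suspend it, resume it, then terminate it.

    Returns (was_stopped, still_running_after_cont, exit_code).
    """
    proc = subprocess.Popen(command)

    # Equivalent of "kill -STOP <pid>": the process is suspended, not killed.
    os.kill(proc.pid, signal.SIGSTOP)
    # WUNTRACED makes waitpid report a stopped child without reaping it.
    _, status = os.waitpid(proc.pid, os.WUNTRACED)
    was_stopped = os.WIFSTOPPED(status)

    # Equivalent of "kill -CONT <pid>": the process runs again.
    os.kill(proc.pid, signal.SIGCONT)
    still_running = proc.poll() is None

    # Plain "kill <pid>" sends SIGTERM, which does end the process.
    os.kill(proc.pid, signal.SIGTERM)
    exit_code = proc.wait()   # negative value means "ended by that signal"
    return was_stopped, still_running, exit_code
```

For example, suspend_and_resume(["sleep", "30"]) stops and restarts a sleep process before terminating it. Note that this works on the process ID only; job numbers such as %1 exist only inside the shell that started the job.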
