SunGridEngine
From Odwiki
Sun Grid Engine (SGE/SGEE) is a product that has been around for some years. Since being purchased by Sun, it has changed from a strictly commercial product into various other incarnations. Currently SGE is available in two flavours - the Sun Grid Engine and the Sun Grid Engine Enterprise Edition. The latter is a fully supported product that costs money, whereas the former is now open source, freely available and has its own very useful web site, which is highly recommended. For running a render farm there is probably no need to get SGEE - SGE should be good enough for most uses. It is highly scalable, extremely flexible, and gives you the powerful capabilities of a grid engine that has been polished over many years.
So What is this Grid Computing Thing, Anyway?
Grid computing is hot stuff right now. Essentially, any computer sitting idle is wasting CPU cycles. Grid computing attempts to put these computers to work without causing undue stress to anyone who needs to grab one back for interactive use. A well-known application of this basic technology is the SETI@home project, which uses volunteers' computers over the internet to analyse masses of data captured by radio telescopes over the years, in the hope of finding some indication of intelligent life in the universe.
While our needs might not be quite so lofty, there is much to learn from this process since, like SETI@home, we need to process multiple packets of data (hip files or individual render frames) to generate a final output (an image!). Grid computing is thus well suited to render farms.
There is a minimum number of computers you will probably want before going to the trouble of implementing a grid engine render farm. If you have two computers, each with just one CPU, it's probably not worth the trouble unless you anticipate upgrading in the near future. However, if you're running a few machines, possibly multi-CPU, and you find yourself cursing when you arrive in the morning to find one system with 100 frames waiting to render while the other systems lie about wasting electricity, you may want to consider a grid engine. :)
What Do I Need to Run It?
Officially supported platforms that run both Houdini and SGE are currently:
- Linux
- IRIX
- Solaris
- Mac OS X
However, since the code is open source, this list is always changing.
You will need a dedicated machine to serve as the SGE administrator, although this need not be a very powerful machine - the CPU overhead is quite low. I have run it on machines as lowly as a Pentium II without trouble. Additionally, you will need the render machines! These can be anything from dedicated rack-mount multi-processor servers to workstations that are in use during the day. You determine the rules of use for each machine: on a fixed schedule, changing dynamically, or both.
Each machine on the grid will need to be set up with client software that runs in the background. This is well-behaved software with a small footprint that won't adversely affect daily use of any machine. At any time, any given machine can be taken off the grid so that an animator can work away without fear of memory or CPU impact. This can be an automated or manual process.
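Taking a machine off the grid can be scripted. The sketch below is in Python and assumes an SGE 6.x-style setup in which each host has a queue instance named `queue@host`. The default queue name `all.q` and the hostname `anim07` are illustrative assumptions. `qmod -d` disables a queue instance so that no new jobs are dispatched to it, while jobs already running there are allowed to finish. `qmod -e` enables it again.

```python
import subprocess
from typing import List

# Assumption: "all.q" is the cluster queue a default SGE 6.x install creates;
# your site may use different queue names.
DEFAULT_QUEUE = "all.q"


def qmod_command(host: str, enable: bool, queue: str = DEFAULT_QUEUE) -> List[str]:
    """Build the qmod call that enables (-e) or disables (-d) queue@host.

    A disabled queue instance receives no new jobs; jobs already running
    on it are left alone and finish normally.
    """
    flag = "-e" if enable else "-d"
    return ["qmod", flag, f"{queue}@{host}"]


def set_host_available(host: str, available: bool, dry_run: bool = True) -> List[str]:
    """Put a host back on the grid (available=True) or take it off (False).

    With dry_run=True the command is only returned, not executed.
    """
    cmd = qmod_command(host, enable=available)
    if not dry_run:
        # Needs the SGE client tools on PATH and suitable SGE privileges
        # (manager, operator or queue owner).
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `set_host_available("anim07", False, dry_run=False)` from a small cron-driven script in the morning, and the `True` variant in the evening, gives you the automated version. Running the same `qmod` command by hand gives you the manual one.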
Do I Need to Live on Cola and Chips to Install It?
Well, to be honest, being a sysadmin certainly helps, but as a rule you need someone around to configure and set up a multi-user shop anyway, so they'll probably be the person doing it. SGE requires a consistent, logically set up environment to work properly. Hip files, data, texture maps etc. need to be in locations that are accessible by the same path no matter which machine you're on. Permissions need to be right. The majority of configuration problems result from inconsistent work environments. Not all is lost, however, as there is an excellent group of mailing lists to help people with setup and debugging problems. Even though you are using the freely available SGE package, you still have access to a great group of experienced users and coders there who will help you at the drop of a hat.
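One cheap way to catch inconsistent-environment problems is to check every file a job depends on before submitting it. Each path should be absolute and should live under a location that is mounted identically on every machine. The Python sketch below assumes such a shared mount at `/mnt/projects` (a hypothetical path; substitute your own). It flags two kinds of path: relative paths, which depend on the submitter's working directory, and absolute paths outside the shared root, such as a local `/tmp` or home directory.

```python
import posixpath
from typing import Iterable, List

# Hypothetical shared root, assumed to be mounted at exactly this path on the
# submit host and on every render host (e.g. via NFS).
SHARED_ROOT = "/mnt/projects"


def unsafe_paths(paths: Iterable[str], shared_root: str = SHARED_ROOT) -> List[str]:
    """Return the paths a render host might resolve differently, or not at all.

    This is a purely lexical check: it does not touch the filesystem, so it
    cannot catch a missing mount or wrong permissions.
    """
    root = posixpath.normpath(shared_root)
    bad = []
    for path in paths:
        if not posixpath.isabs(path):
            # Relative paths depend on the submitting shell's working directory.
            bad.append(path)
            continue
        # normpath collapses "..", so "/mnt/projects/../x" is not treated as shared;
        # commonpath (unlike a plain string-prefix test) rejects "/mnt/projects2".
        if posixpath.commonpath([root, posixpath.normpath(path)]) != root:
            bad.append(path)
    return bad
```

Running this over the hip file and texture paths before each submission turns a mysterious render-host failure into an error at submit time. Permissions still need to be checked separately.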
What are the Basic Principles Involved?
Typically you will have a hip which represents a Shot (or possibly a portion of a shot). When rendering, you will typically want to render a layer (FG, BG, bugs, UFOs etc.) out to disk. The manual way is to sit at a workstation, create a ROP with all the parameters you wish, hit Render and go for soda. This is of course the braindead method and, unless it's a quickie test, you should be ashamed of yourself! A more advanced method is to write an hscript script which can be run, perhaps on a different system, so you can continue to work at your workstation. Even more clever is to use remote-host rendering with either of these methods to distribute your work. However, all of this is a manual process: highly error-prone and extremely inefficient. Here's how we use SGE at Axyz:
- We have a high-level script which is what the animator calls. This command script has many parameters (see QrenHelpTeaser) that let the user define the shot (and thus the hip), the layers to display, the resolution and lots of other things to ensure that they get what they want. It automatically creates a ROP on the fly, so there's no need for the animator to be concerned about existing ROPs or any particular state they've left their hip in. This is key: no need to worry if you left your pink posey in a matte state, or if you accidentally left Generate Shadow turned on for a light. You must explicitly ask for these things and more when running the script. This call is typically a line in a saved script, so human error is minimized - just run it again for a new pass and the latest hip is used.
- This is submitted to SGE as an hscript job, and SGE immediately looks for an available system to run the job on. This is where SGE shines - it's smart about things and, when allowed to act without overrides, will run jobs first on systems that are faster or less heavily used than others.
- Once this job has started on a system, it begins creating IFDs or RIBs for rendering. Each of these files, as it is created, is in turn immediately submitted to SGE as a render task, while work continues back at the hip to create the next frame. It keeps generating renders while the previously created renders are running.
- Continue spawning until finished.
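The spawning pattern in the steps above can be sketched as a simple loop: generate one frame's IFD, hand it straight to SGE, then move on to the next frame. This is a Python illustration, not the Axyz scripts themselves. In the setup described, the equivalent loop runs inside the hscript job on the grid. `write_ifd` and `submit` are hypothetical callables passed in. For real use, `submit` might be `lambda cmd: subprocess.run(cmd, check=True)`. Rendering with `mantra -f <file.ifd>` assumes the IFD names its own output image. For RIBs, you would swap in your RenderMan renderer's command.

The `qsub` flags used are standard:
- `-b y` submits `mantra` as a binary rather than as a script file.
- `-cwd` runs the job in the current working directory.
- `-N` sets the job name shown by `qstat`.

```python
from typing import Callable, Iterable, List


def render_job_command(ifd_path: str, job_name: str) -> List[str]:
    """qsub command that renders one IFD with mantra as a binary job."""
    return ["qsub", "-b", "y", "-cwd", "-N", job_name, "mantra", "-f", ifd_path]


def spawn_renders(shot: str,
                  layer: str,
                  frames: Iterable[int],
                  write_ifd: Callable[[int], str],
                  submit: Callable[[List[str]], object]) -> List[str]:
    """Generate each frame's IFD and submit it to SGE straight away.

    qsub returns as soon as the job is queued, so earlier frames can be
    rendering on other hosts while later IFDs are still being written.
    """
    job_names = []
    for frame in frames:
        ifd_path = write_ifd(frame)  # hypothetical: drives a ROP to write one IFD
        job_name = f"{shot}_{layer}_{frame:04d}"  # e.g. sh010_fg_0001, shown by qstat
        submit(render_job_command(ifd_path, job_name))
        job_names.append(job_name)
    return job_names
```

Because submission does not wait for the render, the loop's speed is bounded by IFD generation rather than by rendering. That is where the "processes spawning processes" effect comes from.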
What this means is that you have an hscript task spawning renders, usually at high speed. Instead of a linear process where you run frame by frame, you have a marvellous cacophony of processes spawning processes. Monitoring your render shows hscripts running, renders stacking up, and general insanity over the network as frames render simultaneously. The upshot: your renders finish in a fraction of the time. There's no mucking about in scripting, and no worrying about whether frames 34-63 are particularly slow and need to run on separate systems. You simply say "I want this shot, these parameters - do it" and carry on working on your next shot (or you may alternatively go for soda if you wish). The first time you run a job like this, I guarantee it will astound you and you will send me many beers (this is, in fact, encouraged). We can't imagine production without it now.
What's involved in setting up SGE?
I have some low-level scripts that we use here at Axyz, which are freely distributable here. Also included are some basic instructions on setting up SGE and the principles behind it. If you have any difficulties, feel free to post on the mailing lists and someone will help you out - probably me among them!
SGE in its simplest form merely submits a user's script over the grid. Where the job will run is unknown at submission time - and that's the whole point. You want something done; SGE tries to find you somewhere to do it. More sophisticated abilities, such as managing software licenses, are fully implemented and quite flexible. SGE refers to a license as a "resource", along with things like memory, CPU power and many other system attributes. There is a cornucopia of options, including the ability to delay running until a certain date/time, priorities, requiring or preferring certain systems, and running multi-CPU tasks (called Parallel Environments).
It does take some time to set SGE up the way you want to work - but that works as an advantage. Apart from the cost savings, you have your farm software working the way you want it, not the way a third party does.
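The options listed above map onto standard `qsub` switches:
- `-a` delays the start until a date/time given in `[[CC]YY]MMDDhhmm[.SS]` form.
- `-p` sets a priority between -1023 and 1024. Ordinary users may only lower it.
- `-l` requests resources. `-hard` marks those requests as requirements, and `-soft` marks them as preferences.
- `-pe` requests a number of slots in a Parallel Environment.

The Python sketch below simply assembles such a command line. The license resource `mantra_lic` and the parallel environment `smp` are hypothetical names that an administrator would have to define first. Licenses are typically set up as consumable resources. `h_vmem` and `hostname` are built-in SGE resources.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple


def _resource_list(resources: Dict[str, str]) -> str:
    # qsub -l accepts a comma-separated list of name=value pairs
    return ",".join(f"{name}={value}" for name, value in resources.items())


def qsub_args(job_name: str,
              command: List[str],
              start_at: Optional[datetime] = None,
              priority: int = 0,
              hard_resources: Optional[Dict[str, str]] = None,
              soft_resources: Optional[Dict[str, str]] = None,
              pe: Optional[Tuple[str, int]] = None) -> List[str]:
    """Assemble (but do not run) a qsub command line for a binary job."""
    args = ["qsub", "-b", "y", "-N", job_name]
    if start_at is not None:
        # -a: do not start before this time; full CCYYMMDDhhmm form
        args += ["-a", start_at.strftime("%Y%m%d%H%M")]
    if priority != 0:
        if not -1023 <= priority <= 1024:
            raise ValueError("SGE priority must be between -1023 and 1024")
        args += ["-p", str(priority)]
    if pe is not None:
        pe_name, slots = pe
        # -pe: request this many slots in the named Parallel Environment
        args += ["-pe", pe_name, str(slots)]
    if hard_resources:
        # -hard: the job may only run where all of these are available
        args += ["-hard", "-l", _resource_list(hard_resources)]
    if soft_resources:
        # -soft: preferences the scheduler tries to honour but may ignore
        args += ["-soft", "-l", _resource_list(soft_resources)]
    return args + command
```

For example, a frame that needs one license and 2G of memory, prefers the host `render01`, and should not start before 18:30 could be built with `qsub_args("sh010_fg_0001", ["mantra", "-f", ifd], start_at=datetime(2004, 6, 1, 18, 30), hard_resources={"mantra_lic": "1", "h_vmem": "2G"}, soft_resources={"hostname": "render01"})`. Passing the result to `subprocess.run(cmd, check=True)` would submit it.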
JohnColdrick



