Life on the Render Farm

What happens when you move from a world of scarcity to a world of plenty? How does your thinking change? What becomes possible? At Cybera, we're curious about this, and in an attempt to find out, we've started projects to give groups close-to-free access to close-to-unlimited network and computing resources.

A previous post described a project we worked on with a group of independent filmmakers in Edmonton to build something new that had previously been impossible: a cloud-based rendering tool for large animation files.

What's in it for us?

Why in the world would Cybera want to get involved with independent film? Well, it turns out there are several reasons:

  1. We know that a general-purpose Infrastructure-as-a-Service cloud performs perfectly well with general-purpose applications running on it. Things like web services, or off-site backup servers, are able to share the cloud resources quite happily. But other resource-intensive applications, like scientific simulations or data analytics, are more demanding, and their performance suffers if they have to share compute resources with other applications. Graphics rendering is a very CPU-intensive operation, and building a rendering farm in our cloud allowed us to study the effects of very demanding applications on the other cloud tenants, and on the cloud itself.

  2. We spend a lot of time trying to foster a new "digital economy" in Alberta. We like to imagine how the economy of rural Alberta could be improved if everyone had access to the kinds of resources we use. Imagine Vulcan, Alberta, rendering the next Peter Jackson film. Imagine vast quantities of environmental data being analysed in High Prairie. This project is a good example of what ordinary citizens could achieve if cyberinfrastructure were as affordable and available as any other utility, such as water and electricity.

  3. As I mentioned earlier, we're curious to see what becomes possible when constraints are removed. What will people come up with when they can have unlimited network bandwidth? Or have all the computing power they want, when they want it? In this case, a made-in-Alberta film was created that would not have been feasible otherwise.

  4. It was fun.

A simple approach

We took a simple approach to building our rendering farm. We wanted something that wouldn't take long to build, and that could be done by almost anyone with modest technical skills. A small Python application installed on a laptop was given the job of taking the scene to be rendered, slicing it into ranges of frames, and handing each range to a different rendering server. The rendering servers would render each frame in their range as a .png image. When the entire range was complete, the set of images would be uploaded to our object storage system, where they could be retrieved for assembly into the final animation.
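The slicing step can be sketched in a few lines of Python. This is a minimal illustration of the static "carve it up in advance" idea, not our actual controller code; the frame and server counts are made up:

```python
# A minimal sketch of the static slicing the controller performed:
# divide an inclusive frame range into contiguous chunks, one per server.

def split_frames(first_frame, last_frame, num_servers):
    """Return a list of (start, end) frame ranges, one per rendering server."""
    total = last_frame - first_frame + 1
    base, extra = divmod(total, num_servers)
    chunks = []
    start = first_frame
    for i in range(num_servers):
        size = base + (1 if i < extra else 0)  # spread any remainder evenly
        chunks.append((start, start + size - 1))
        start += size
    return chunks

# 1000 frames handed to 4 rendering servers:
print(split_frames(1, 1000, 4))  # → [(1, 250), (251, 500), (501, 750), (751, 1000)]
```

Each chunk would then be dispatched to one server, which rendered its frames and uploaded the results.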

The problem

The problem with this simple approach is that the rendering job is not complete until the slowest server finishes its slowest frame. We discovered that some frames rendered very quickly, sometimes in just a few minutes, while others took hours. As a result, some rendering servers zipped through their chunk of the work in no time at all, then sat idle while another server took ages to grind through its share. We developed techniques to feed idle servers new chunks of work to keep them busy, but it was a tedious and manual chore.

A better way

A better way to build the rendering farm would have been to invert the relationship between the controller and the rendering servers. Instead of the controller saying "here, work on this," it would be better to have the renderers say "I've finished rendering that frame, give me another one to work on." This arrangement is known as a "distributed task queue", a popular solution for problems like this. It would have been (a little) more complicated to build, but would have allowed the renderers to keep themselves busy 100% of the time. Much more efficient, and easier to manage.
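The pull-based arrangement can be sketched with Python's standard-library queue, with a few threads standing in for rendering servers. This is an illustration of the pattern only: the "render" is simulated with a short sleep, and the worker names and frame counts are invented.

```python
# A sketch of a pull-based task queue: workers fetch one frame at a time
# and immediately ask for another, so no worker sits idle while frames remain.

import queue
import threading
import time

def worker(tasks, results, name):
    while True:
        try:
            frame = tasks.get_nowait()   # "give me another one to work on"
        except queue.Empty:
            return                       # no frames left; this worker is done
        time.sleep(0.001)                # stand-in for the actual render
        results.append((name, frame))
        tasks.task_done()

tasks = queue.Queue()
for frame in range(1, 101):              # 100 frames to render
    tasks.put(frame)

results = []
threads = [threading.Thread(target=worker, args=(tasks, results, f"render-{i}"))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f"{len(results)} frames rendered by {len({n for n, _ in results})} workers")
```

The key property is that a worker that draws fast-rendering frames simply pulls more of them, so the load balances itself regardless of how uneven the per-frame render times are. In a real deployment the in-process queue would be replaced by a network-visible broker.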

What did we learn?

In the same way that airlines sometimes overbook a flight, and count on the statistical likelihood that not every passenger will show up, a physical host server can offer more resources to guest virtual machines than it actually has, counting on the fact that the guests won't all need their resources at the same time. In most cases, this is a pretty safe bet. But in the case of a resource-intensive task, such as video rendering, a guest virtual machine can seize so many resources that the other guest VMs are starved. This is known as the "noisy neighbour" problem. For this project, we had to use a couple of tricks to make sure our renderers did not overwhelm the physical hosts and were evenly distributed across our cloud. It was exciting for us to be able to push our cloud to the limit, and it gave us a much better understanding of how to manage similar tasks in cloud environments.
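The overbooking analogy can be made concrete with a little arithmetic. The core count and overcommit ratio below are hypothetical, chosen only to illustrate the effect, not drawn from our actual configuration:

```python
# Illustrative numbers only: a host with 16 physical cores "overbooked"
# at a 2:1 vCPU ratio, as IaaS clouds commonly allow.

physical_cores = 16
overcommit_ratio = 2.0
vcpus_offered = int(physical_cores * overcommit_ratio)  # 32 vCPUs on offer

# While most guests idle, the bet pays off. But if every vCPU goes busy
# at once (say, a rendering tenant pegging all of its cores), each busy
# vCPU only gets a proportional share of a real core:
share_per_vcpu = physical_cores / vcpus_offered
print(f"{vcpus_offered} vCPUs offered; each busy vCPU gets "
      f"{share_per_vcpu:.2f} of a physical core")  # 0.50 of a core here
```

At full load every guest's effective CPU is halved in this example, which is exactly the starvation a noisy neighbour inflicts on its co-tenants.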