
Directing AI: How I Made an Animated Holiday Short
My first taste of generating art with AI was back in 2021 with Wombo Dream. I even used it to create very trippy illustrations for a series I wrote on getting a job as a product designer. To be sure, the generations were weird, if not downright ugly. But it was my first test of getting an image by typing in some words. Both Stable Diffusion and Midjourney gained traction the following year, and I tried them as well. The results were never satisfying. Years upon years of being an art director had made me very, very picky—or put another way, I had developed taste.
I didn’t touch generative AI art again until I saw a series of photos by Lars Bastholm playing with Midjourney.

Lars Bastholm created this in Midjourney, prompting “What if, in the 1970s, they had a ‘Bring Your Monster’ festival in Central Park?”
That’s when I went back to Midjourney and started illustrating my essays with images it generated, usually augmented by me in Photoshop.
In the intervening years, generative AI art tools had developed a common set of functionality that was all very new to me: inpainting, style, chaos, seed, and more. Beyond closed systems like Midjourney and OpenAI’s DALL-E, open source models like Stable Diffusion, Flux, and now a plethora of Chinese releases offer even better prompt adherence and controllability via even more opaque-sounding features like control nets, LoRAs, CFG, and other parameters. It’s funny to me that for such an artistic field, the products that enable these creations are so technical.
To create anything with these models, you can run them from Python on the command line. Or you can use ComfyUI, a GUI that lets you string together building blocks called nodes to wield these models. Weavy, the product that Figma recently acquired, is like ComfyUI but much more user-friendly. And finally, you can use what’s built into ChatGPT and Gemini to prompt images into existence.
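For a sense of what the raw Python route looks like, here is a minimal sketch using Hugging Face’s diffusers library. The checkpoint, prompt, and settings are placeholders, not the exact setup I used for the short.

```python
# A minimal sketch of running an open image model from Python with the
# diffusers library. Checkpoint, prompt, and settings are placeholders.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",   # any diffusers-compatible checkpoint
    torch_dtype=torch.float16,
).to("cuda")                                      # assumes an Nvidia GPU

image = pipe(
    prompt="a cozy suburban kitchen decorated for the holidays, Pixar-style 3D render",
    num_inference_steps=30,   # denoising steps
    guidance_scale=7.0,       # how strongly to follow the prompt (CFG)
).images[0]
image.save("still.png")
```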
Of course, you can do a lot with those simpler, hosted interfaces. Christian Haas, a former colleague of mine and now an ECD at YouTube, used Google’s ImageFX and Flow tools to create a series of “hand-prompted” short films. He detailed his process in a post back in July.
I recently spent two months working on a personal project—an animated holiday short. Inspired by Pixar, enabled by AI, and motivated by my own sentimentality, the resulting video was a labor of love and an opportunity to learn a lot about how to make generative art.
First off, if you haven’t watched the short, here it is…
In this essay, I’ll take you behind the scenes of how I made the video. In shorter follow-up articles, I’ll go over some tips on getting started with ComfyUI, and how to wield generative AI art as a designer.
Start With the Story
I had the privilege of working at Pixar Animation Studios in the early 2000s. I was not an animator, nor did I work on a movie; I worked in marketing. I was heading up the formal design of Pixar.com—its first iteration was a plain HTML site done by technical people—and got to watch up close exactly how their wonderful films were made. If I had to sum it up in one word, it’s story.
So when I got the bright (or dumb) idea to make this short, I knew I had to nail the story.
Coming from a large, extended Chinese immigrant family, I hold family as one of my strongest values. So emotionally, I wanted the short to be a story about my immediate family coming together. My kids are older now: one’s in her third year of college and the other is about to go to university, so a holiday gathering seemed like a good setting. But every story needs some good drama, right? As Kurt Vonnegut liked to say, “Somebody gets into trouble—gets out of it again. People love that story.”
Which is where the idea of meal prep gone awry came from. But I didn’t want to be the hero of the story—I wanted the family to come together and solve it. Hence the other three family members each helping in their own way. I get into trouble, and they get me out of it.

Once I got the outline of the story finished, I typed it up into a script. I then broke down each scene into a shot list (mostly in my head).
Finding the Look
Next up, I had to develop the look of the short. Originally, I was thinking stop-motion, similar to the Rankin/Bass animated classics like Rudolph the Red-Nosed Reindeer and Santa Claus Is Comin’ to Town. I started to experiment with that look but soon realized that to keep it consistent using open source models was going to prove too difficult. I pivoted to a more classic “Pixar” or CG-animated look instead.
I used Weavy first and originally thought I could do the whole project in the app. (More on that later.) I generated character sheets for each of us, based on photos.

After this step, I soon realized a few lessons:
It’s never a one-shot deal. Getting a generation to match what’s in your head never happens with a single prompt; it takes experimentation and prompt engineering. If the model isn’t getting it right, you have to change the prompt.
Randomness can have a huge range. I ended up generating 132 images to land on the design for my daughter’s character. But it only took 21 generations to capture my son. Many of the variations are subtle, but meaningful. The daughter character had to look like a young adult, not a child. Hard to get right.
Run many, choose later. Running each generation one by one is incredibly inefficient. Each image generation in Weavy can take five to 30 seconds, and if it’s on the higher end of that spectrum, there’s a lot of time spent watching the spinner spin. Running four or even 24 generations at a time is better, but certainly costlier.
The Consistency Problem
One thing that AI-generated art has traditionally had issues with is making images with consistent characters and consistent backgrounds or sets from scene to scene. The same prompt will give you different variations of everything in an image, including the characters. The model doesn’t—by default—remember what it generated before.
For a short like this to work, I needed consistency. I found a tutorial by Mick Mumpitz on how to create consistent scenes. While on his YouTube channel, I also discovered how to train a LoRA so that I could have consistent characters.
To have consistent characters, I needed to train what’s called a LoRA. Essentially, it’s a small fine-tuning adapter for a model, keyed to a specific concept or character. To train the LoRA, I first needed my character in a variety of different poses. This is where Mumpitz’s ComfyUI workflow comes into play: from just the starting image of my character in an apron, it generates 20 poses and expressions. Those are then fed into AI Toolkit, a GUI for the training run.
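If you’re curious what a LoRA actually is under the hood, here’s a conceptual sketch: the base model’s weights stay frozen, and training only learns a pair of small low-rank matrices whose product nudges the output toward your character. This is illustrative only; AI Toolkit handles the real training.

```python
# Conceptual sketch of a LoRA adapter: the base layer is frozen and only the
# small low-rank matrices A and B are trained. Illustrative, not the actual
# training code AI Toolkit runs.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)   # original weights stay untouched
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        # frozen base output plus a small learned low-rank correction
        return self.base(x) + (x @ self.A @ self.B) * self.scale
```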
I also gave Mumpitz’s consistent scene workflow a go. This uses a LoRA called Next Scene and also involves creating a 360-degree equirectangular image of the set or location.

Learning to Wrangle ComfyUI
This need for consistency is what drove me from Weavy to ComfyUI. There is just so much more tutorial content on YouTube for the latter. Of course, ComfyUI presents its own unique set of challenges.
First off, it’s local, meaning you run it on your own machine. Unless you’ve been living under a rock for the last three years, you’ll know that Nvidia chips power nearly all of today’s AI. Nvidia GPUs aren’t available in Macs, and I only have a Mac. Thankfully, there’s a new category of hosting providers that will let anyone spin up a machine with a GPU in the cloud. CoreWeave just went public. RunPod is the service that I used, and it’s popular among creators. These GPUs and machines are rented by the hour. As of this writing, the top-of-the-line consumer Nvidia GPU, the RTX 5090, is $0.89 per hour on RunPod. The absolute top-of-the-line datacenter GPU, the B200, is $5.19 per hour.
So to train the LoRAs for my four characters—each run took three to four hours—and to just use ComfyUI, I spun up machines on RunPod.
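Back-of-the-envelope: if those runs were on an RTX 5090 at the price above, four character LoRAs at roughly 3.5 hours each works out to 4 × 3.5 × $0.89, or about $12.50 in GPU time for the training alone, before any image or video generation.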
Setting up and using ComfyUI, for a software designer like myself, is a lesson in patience and anger management. I had mentioned in a previous post that I always liked node-based UIs. I may have changed my mind. There are elegant and user-friendly ways of implementing this—I think Weavy’s take is great. But when every node controls just a small sliver of functionality, workflows quickly become rats’ nests of tangled noodles. They remind me of the old-school switchboards that telephone operators used, or the red-string cork boards of conspiracy theorists.

Truthfully, I never created my own workflows from scratch. I would use workflows from other creators or from the vast template library that ComfyUI provides right in the UI. But I did have to understand a few core parameters that control how diffusion models work. I’ll spare you the deep technical details for now.
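The short version: most workflows funnel down to a sampler node, and the handful of settings on it are the ones worth understanding. Roughly, and with names that vary a little from node to node:

```python
# The core sampler settings you'll see in most ComfyUI workflows, annotated.
# A rough sketch; exact names and defaults vary by node and model.
sampler_settings = {
    "seed": 123456,          # fixes the starting noise; same seed + prompt = same image
    "steps": 30,             # number of denoising iterations
    "cfg": 5.0,              # how strongly the image is pulled toward the prompt
    "sampler_name": "euler", # which denoising algorithm to use
    "scheduler": "simple",   # how noise is distributed across the steps
    "denoise": 1.0,          # 1.0 = generate from scratch; lower keeps more of an input image
}
```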
I’ve bashed ComfyUI’s UI, but the tool is essential for controlling the output.
Building the Shots
Once my setup was up and running, I started to generate stills to represent each scene in the script. Each scene would have different shots, or cuts. A wide shot to establish the location, followed by a medium shot to show the action, and then sometimes a close-up. Classic filmmaking technique.
In the beginning, I used my character LoRAs and got great results. I put my character in the kitchen, my daughter’s character in a car, my son in the restaurant, and my wife on a plane. As I built out the scenes, I added shots to create interest and make the film come alive. All those years of watching movies and consuming behind-the-scenes content were finally paying off!

I did think about assembling the stills together to make what Pixar calls a story reel, but since I didn’t have any dialogue and hadn’t chosen a score yet, flipping through the stills in the Finder did the trick.
More Than Prompting
I ran into issues when I needed multiple characters in a scene. The LoRAs did not work. For example, my prompt could not say, “Roger and Karen are sitting at the dining table.” Apparently stacking character LoRAs can’t be done, or at least I couldn’t figure it out. The model would get very confused, creating hideously weird images.

So while the LoRA technique worked well for single-character scenes, I had to find another way to get the characters in the family into the same scene. I then discovered a model called Qwen Image Edit. Qwen is a family of open source models from Alibaba, the e-commerce giant in China.
Inspired by Mick Mumpitz’s workflow mentioned earlier, and using the default Qwen Image Edit workflow found in ComfyUI, I was able to generate a still with a single character, then take that result and add additional characters into the same scene. I did the same with the food the characters are holding.
Inpainting was another technique I had to incorporate. Users of Adobe Photoshop will know this as Generative Fill: you mask the area you want the AI to affect and leave the rest of the image alone. So if I loved a generated image but one thing was off, I could use inpainting to fix just what I didn’t like.
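In ComfyUI this is a matter of wiring a mask into the sampler, but the idea is easy to see in plain Python. A minimal sketch with the diffusers library, assuming a generic inpainting checkpoint and placeholder file names:

```python
# Minimal inpainting sketch: the mask marks the region to regenerate and the
# rest of the image is preserved. Checkpoint and file names are placeholders;
# my actual fixes went through ComfyUI nodes.
import torch
from PIL import Image
from diffusers import AutoPipelineForInpainting

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("shot.png")        # the render I mostly like
mask = Image.open("shot_mask.png")    # white where the fix should happen

fixed = pipe(
    prompt="a full moon in a clear night sky",
    image=image,
    mask_image=mask,
).images[0]
fixed.save("shot_fixed.png")
```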
I also tried a technique called control nets. This is when you use another image as a guide for the generation. For example, if there was a specific pose I wanted, I downloaded a photo from Google Images and fed it into the workflow. The subject’s pose is extracted as a stick figure, which then makes my character render in that pose. A depth map is another type of control net. I used one to compose the driving scene, since I had a specific camera angle in mind. I found a film still from Flim—a database of film, TV, and commercial stills and videos—that worked, and I used it as my depth map to guide the output.
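Here’s a rough sketch of the pose version of this idea in diffusers. In practice I used ComfyUI nodes, and the model and repo names below are examples rather than my exact setup, but the shape is the same: a preprocessor turns the reference photo into a stick-figure pose map, and the ControlNet constrains the generation to follow it.

```python
# Pose-guided generation with a ControlNet: extract a stick-figure pose from a
# reference photo, then let it steer the composition. Model names are examples.
import torch
from PIL import Image
from controlnet_aux import OpenposeDetector
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
pose_map = openpose(Image.open("reference_photo.jpg"))   # the stick figure

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

result = pipe(
    prompt="a man in an apron reaching into a refrigerator, Pixar-style 3D render",
    image=pose_map,   # the pose map constrains the composition
).images[0]
result.save("posed_shot.png")
```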
Directing the AI
I felt very accomplished when the stills were done. I generated over 2,500 images and ended up with 50 frames for my short. These formed either the first or last frame of each shot. Many image-to-video AI models have first-to-last frame workflows, and all have at least a starting image workflow. So naturally, the next step was to plug these images into a video model to add motion.
There were two primary video models that I used: Wan 2.2 and Hunyuan Video. Wan is probably the most popular and capable open source model at the moment, but it has a tendency to make characters talk, despite negative prompts like “no talking, no speaking, no mouth open.” There’s no dialogue in my short, so that was a problem. Hunyuan followed the “no talking” direction much better.
One of the first test shots using Wan 2.2 video.
For the first ten or so shots, I wrote the prompts manually, but eventually I got wise and enlisted the help of Claude to craft the prompts. I uploaded stills to Claude, described what I wanted, and it tailored the prompts according to the model I was using.
Currently, these models can only generate between 89 and 129 frames of video, depending on the model. At 24 frames per second, that’s three to five seconds of video. Each generation took an average of 24 minutes, with the longest ones taking over an hour. My previous technique of doing a bunch of runs at once wasn’t tenable. Instead, I queued up a bunch of different shots and came back a few hours later to see what worked and what didn’t. Keeping a log of which shots were completed and which were in progress helped me stay organized.
For each shot, it took an average of four tries, or takes, to get to a good one. I likened this process to how an actual movie gets made—the actors play every take a little differently, and the director decides which take is the good one. After every take, if it wasn’t quite right, I’d ask Claude to tweak the prompt. Sometimes, I would have to tweak the parameters.
I said earlier that the maximum length per generation was about five seconds. Some of my shots needed more time, so I split them into two generations, taking the last frame of the first take and using it as the starting image for the next.
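Pulling that last frame is trivial to script. A small sketch with OpenCV, with placeholder file names; since my workflows were already writing out frames, it can also just be a matter of grabbing the final PNG.

```python
# Grab the final frame of a generated clip so it can seed the next generation.
# File names are placeholders.
import cv2

cap = cv2.VideoCapture("shot_part1.mp4")
last_index = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) - 1
cap.set(cv2.CAP_PROP_POS_FRAMES, last_index)   # jump to the last frame
ok, frame = cap.read()
cap.release()

if ok:
    cv2.imwrite("shot_part2_start.png", frame)  # starting image for take two
```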
Of course, this was never perfect. You can tell in a couple of spots in the final video. The rhythm of my daughter’s movement swaying to music in the car changes halfway through. And when my son’s character is wiping the table, the wiping motion changes because it’s two takes.
Finally, there was one shot that eluded me: the opening shot. I just wanted a simple locked-off shot with rolling clouds and slightly rustling leaves and grass. But the models I tried couldn’t get the movement—or lack thereof—right. The clouds moved too fast, or not at all. The model interpreted “breeze” as gale-force winds. Or the garage door would open. Or someone would run out of the house.
Outtakes from the establishing shot. None of the generations worked because of random flaws.
I ended up animating the shot by hand in After Effects by cutting up the still into layers—foreground, background, sky—and moving a sky plate horizontally. I wasn’t going for AI purity in this video.
The Edit

One thing to note: the default export from the video workflows is a compressed video file. The compression is very visible to my eyes, so I modified the workflows to also output the frames as a PNG sequence.
The final edit was done in DaVinci Resolve using the uncompressed PNGs. The first assembly was incredibly easy because I had the good takes downloaded. I simply had to put them in the right order and trim the clips a little. Resolve’s Cut page made it really, really fast.
The text messages were built in Figma and animated in After Effects. I dropped those on top of the footage in Resolve.
There were a couple of glitches in the final clips that I had to fix. For example, in the shot where my character is resigned and texts the family that “Dinner is ruined,” the moon inexplicably fades in and out. I had to manually mask the moon to keep it there throughout the shot.
Another more glaring error is in the scene immediately after, where my daughter is looking at her phone while driving (unsafe! I know, but hey, it’s a cartoon). Because it was one of those split shots where picking up her phone and putting it down were spliced together, the model couldn’t work out what was behind the phone through the steering wheel. So it just made it a white circle. I had to patch that in post (and poorly).

Sound effects and the final track came from Epidemic Sound. I love their library. Unlike Christian Haas, I didn’t use AI for my audio. Maybe that’ll be the next experiment.
Reflections on What I Made
While this short film was created by just one man, not the 260 talented artists who made Pixar’s 2018 short “Bao,” a huge amount of care and effort went into making it. It’s no AI slop.
As a creative, I had to tap into my skills and experience to choose the right 50 frames from 2,500, to direct the AI to animate what was in my imagination. While I didn’t draw, model, or render the characters in 3D, I was the director. I spelled out my vision to a team of AI models and shaped this short into what it became.
I’m incredibly proud of the work and happy with the result. It’s not perfect, nor do I pretend it’s as good as what professional animators could have done. I think the characters’ movements could have been more expressive. In many parts, the film is a series of moving tableaus, stills brought to life. The middle section, where my character checks the fuse box and runs to the grill, is the most Pixar-like part of the short. I would also have liked to compose the shots more deliberately and make the camera more dynamic.
But deadlines force acceptance. Ideally, the short would have been finished before Christmas, but it had to be done by New Year’s Eve. And so it was.
The “world premiere” was in our living room on December 30. I played it for my family on our TV and they loved it. And ultimately, that’s why I created the short in the first place—to express the love I have for them via a piece of art.


