Last month, Google’s GameNGen AI model showed that general image diffusion techniques can be used to generate a passable, playable version of Doom. Now, researchers are using some similar techniques with a model called MarioVGG to see if AI can generate plausible video of Super Mario Bros. in response to user inputs.
The results of the MarioVGG model—available as a preprint paper published by crypto-adjacent AI company Virtuals Protocol—still show plenty of glaring glitches, and the model is too slow for anything approaching real-time gameplay. But the results show how even a limited model can infer some impressive physics and gameplay dynamics just from studying a bit of video and input data.
The researchers hope this represents a first step toward “creating and demonstrating a reliable and controllable video game generator,” or possibly even “replacing game development and game engines entirely using video generation models” in the future.
Viewing 737,000 Frames of Mario
To train their model, the MarioVGG researchers (GitHub users Erniechew and Brian Lim are listed as contributors) started with a public dataset of Super Mario Bros. gameplay containing 280 “levels’” worth of input and image data arranged for machine-learning purposes (level 1-1 was removed from the training data so images from it could be used in evaluation). The more than 737,000 individual frames in that dataset were “preprocessed” into 35-frame chunks so the model could begin to predict what the immediate results of various inputs would generally look like.
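As a rough illustration (not the authors’ actual pipeline), that chunking step might look something like the following sketch; the 35-frame window comes from the paper, while the function and variable names are assumptions:

```python
# Hypothetical sketch: split a gameplay recording into fixed-length
# training chunks, pairing each chunk with its controller inputs.
# The 35-frame window size comes from the MarioVGG paper; the names
# and data layout here are assumed for illustration.

def chunk_gameplay(frames, actions, frames_per_chunk=35):
    """Yield (frame_window, action_window) pairs for training."""
    chunks = []
    for start in range(0, len(frames) - frames_per_chunk + 1, frames_per_chunk):
        end = start + frames_per_chunk
        chunks.append((frames[start:end], actions[start:end]))
    return chunks
```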
To “simplify the gameplay situation,” the researchers decided to focus on just two potential inputs in the dataset: “run right” and “run right and jump.” Even this limited range of motion presented some difficulties for the machine-learning system, though, because the preprocessor had to look back a few frames before a jump to figure out if and when the “run” started. Any jumps that involved mid-air adjustments (i.e., the “left” button) also had to be discarded, because “this would introduce noise into the training dataset,” the researchers wrote.
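A minimal sketch of that kind of filtering, assuming per-frame button sets and a made-up lookback window (neither is specified in the article), might look like this:

```python
# Hypothetical sketch of the action-filtering step described above.
# Button names and the lookback window are assumptions, not the
# authors' actual preprocessing code.

def label_chunk(action_window, lookback=5):
    """Return 'run', 'jump', or None (discard) for a chunk of inputs.

    action_window is a list of per-frame button sets, e.g. {'right', 'A'}.
    """
    # Discard chunks with mid-air "left" presses, per the paper.
    if any("left" in buttons for buttons in action_window):
        return None
    # Find the first jump (NES "A" button) and look back a few frames
    # to confirm the run was already in progress when it started.
    for i, buttons in enumerate(action_window):
        if "A" in buttons:
            prior = action_window[max(0, i - lookback):i]
            if all("right" in b for b in prior):
                return "jump"
            return None  # ambiguous start; discard
    return "run"
```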
After preprocessing (and about 48 hours of training on an RTX 4090 graphics card), the researchers used a standard convolution and denoising process to generate new video frames from a static starting game image and a text input (either “run” or “jump” in this limited case). While these generated sequences only last a few frames, the last frame of one sequence can be used as the first frame of a new sequence, potentially creating gameplay videos of any length that still show “coherent and consistent gameplay,” according to the researchers.
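The chaining idea the researchers describe (the last frame of each generated clip seeding the next) can be sketched as a simple loop; generate_clip here is a placeholder standing in for the actual diffusion model, which the article does not detail:

```python
# Hypothetical sketch of the autoregressive chaining described above:
# each generated clip's final frame becomes the starting frame of the
# next clip. generate_clip is a placeholder for the diffusion model.

def generate_gameplay(start_frame, action_texts, generate_clip):
    """Stitch short generated clips into an arbitrarily long video."""
    video = [start_frame]
    current = start_frame
    for action in action_texts:  # e.g. ["run", "run", "jump", ...]
        clip = generate_clip(first_frame=current, action=action)
        video.extend(clip[1:])   # drop the duplicated seed frame
        current = clip[-1]       # last frame seeds the next clip
    return video
```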
Super Mario 0.5
Even with all this setup, MarioVGG doesn’t exactly generate silky-smooth video that’s indistinguishable from a real NES game. For efficiency, the researchers downscaled the output frames from the NES’ 256×240 resolution to a much muddier 64×48. They also condensed 35 frames’ worth of video time into just seven generated frames distributed “at even intervals,” creating “gameplay” video that looks much rougher than the actual game output.
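Those two reductions might be sketched as follows; the resolutions and frame counts come from the article, while the library choice (Pillow) and the function names are assumptions:

```python
# Hypothetical sketch of the two efficiency tricks described above:
# downscaling frames and sampling 7 of every 35 at even intervals.
# The numbers come from the article; the code itself is illustrative.

from PIL import Image

def downsample_frame(frame: Image.Image) -> Image.Image:
    """Shrink a 256x240 NES frame to the model's 64x48 working size."""
    return frame.resize((64, 48))

def sample_frames(frames, keep=7):
    """Pick `keep` frames at even intervals from a 35-frame chunk."""
    step = len(frames) / keep  # 35 / 7 = 5
    return [frames[int(i * step)] for i in range(keep)]
```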
Even with those concessions, the MarioVGG model still can’t approach real-time video generation at this point. The single RTX 4090 used by the researchers took a full six seconds to generate a six-frame video sequence, which represents just over half a second of video even at the model’s very limited frame rate. The researchers admit this is “not practical and friendly for interactive video games” but hope that future optimizations in weight quantization (and perhaps more computing power) could improve that rate.
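For a sense of scale, that “just over half a second” figure can be checked with a quick back-of-the-envelope calculation, assuming the NES’ 60 fps output rate and the 35-real-frames-per-chunk figure from earlier in the piece:

```python
# Back-of-the-envelope check of the throughput claim, assuming 60 fps
# NES output and the 35-frame chunks described earlier (both figures
# from the article).

real_frames_per_chunk = 35
seconds_of_gameplay = real_frames_per_chunk / 60   # ~0.58 s of game time
generation_time = 6.0                              # seconds on one RTX 4090
slowdown = generation_time / seconds_of_gameplay   # ~10x slower than real time
print(f"{seconds_of_gameplay:.2f} s of video per {generation_time:.0f} s "
      f"of compute (~{slowdown:.0f}x slower than real time)")
```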
Limitations aside, though, MarioVGG can create some believable videos of Mario running and jumping from a static starting image, much like Google’s Genie game maker. The model was able to “learn game physics purely from video frames in the training data without any explicit hard-coded rules,” the researchers wrote. That includes inferring behaviors such as Mario falling when he runs off the edge of a cliff (with believable gravity) and (usually) halting Mario’s forward motion when he’s adjacent to an obstacle.
While MarioVGG focused on simulating Mario’s movements, the researchers found that the system could effectively hallucinate new obstacles for Mario as the video scrolls through an imagined level. Those obstacles are “coherent with the graphical language of the game,” the researchers wrote, but can’t currently be influenced by user prompts (e.g., put a pit in front of Mario and make him jump over it).
Just Make It Up
Still, like all probabilistic AI models, MarioVGG has a frustrating habit of sometimes giving completely unhelpful results. Sometimes that means simply ignoring user inputs (“we find that the input action text is not followed all the time,” the researchers wrote). Other times, it means hallucinating obvious visual glitches: Mario sometimes lands inside obstacles, runs through obstacles and enemies, flashes different colors, shrinks or grows from frame to frame, or disappears entirely for several frames before reappearing.
One particularly absurd video shared by the researchers shows Mario falling through a bridge, becoming a Cheep-Cheep, then flying back up through the bridge and transforming back into Mario. That’s the kind of thing we’d expect from a Wonder Flower, not an AI video of the original Super Mario Bros.
The researchers speculate that training for longer on “more diverse gameplay data” could help with these significant problems and help their model simulate more than just running and jumping inexorably to the right. Even so, MarioVGG stands as a fun proof of concept that even limited training data and algorithms can produce some decent starting models of basic games.
This story originally appeared on Ars Technica.