Mirror World
A "visual transformation game" that turns a Raspberry Pi into a window to a parallel dimension, re-imagining friends into fantastical avatars in real-time.
Project Details
I built a "visual transformation game" that turns a Raspberry Pi into a window to a parallel dimension. By pointing the camera at a friend, the system captures reality, hallucinates a new layer of context (turning them into a medieval knight or a cyberpunk runner), and streams it back to the display in near real-time. It plays out like a Pokemon battle or a scene from Ready Player One, where the person standing in front of you is mundane, but the person on the screen is a fantastical avatar generated on the fly.
Inspiration
The driving force behind Mirror World is the concept of the "OASIS" from Ready Player One: specifically, the disconnect between the drabness of the physical world and the vibrancy of the digital overlay. We are increasingly living in a world of augmented perception, but AR headsets are isolating; they put a screen between you and the other person.
I wanted to invert this. Instead of a headset that shuts the world out, I wanted a "magic mirror" that invites the world in but changes its rules. I was inspired by the idea of roleplaying in Dungeons and Dragons and MMORPGs. If I see you as a Knight, and you see me as a Cyborg, we aren't just looking at screens; we are agreeing to inhabit a shared fiction. The project asks: can we gamify face-to-face interaction not by adding points or scores, but by fundamentally rewriting the visual texture of the other person?
Overview
Mirror World is a distributed system for shared reality distortion. It explores the concept of "VR without the headset." Instead of isolating users, we use handheld portals (RPi + screen) to reinterpret the shared physical space. The system uses a locally hosted VLM (Vision Language Model) to understand the scene, an LLM to "dream" a new style based on that understanding, and SDXL Turbo to render that dream onto the video feed.
The experience is symmetrical. I hold a screen that films you; you hold a screen that films me. As we move, the AI re-renders us frame-by-frame. The goal is to synthesize a concurrent theme where two players could look at each other and see a coherent, transformed world, creating a feedback loop where my physical pose influences your digital perception of me.
Proposal
I propose Mirror World:
A pipeline that offloads the heavy lifting of generative reality to a central GPU server while keeping the edge devices (RPis) lightweight.
The challenge is latency. To make a "game" feel responsive, the loop between seeing and rendering needs to be tight.
I experimented with two distinct modes of interaction, illustrated in code below:
- The Raw Stream: Direct VLM-to-Image translation. Fast, but chaotic.
- The Narrative Stream: VLM-to-LLM-to-Image. Slower, but narratively consistent (e.g., ensuring the person stays a "Cyberpunk Knight" rather than morphing into a generic robot).
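To make the difference concrete, here is a minimal illustration of the two prompt paths, assuming the `ollama` Python client and the `qwen2.5` model tag; the style string and function names are placeholders rather than the actual implementation:

```python
import ollama  # pip install ollama; assumes an Ollama server with qwen2.5 pulled

STYLE = "cyberpunk knight"  # the current theme of the mirror (placeholder)

def raw_prompt(caption: str) -> str:
    # Raw Stream: the VLM caption goes to SDXL almost untouched, lightly tagged
    # with the style. Fast, but the character drifts frame to frame.
    return f"{caption}, {STYLE} style"

def narrative_prompt(caption: str) -> str:
    # Narrative Stream: an LLM "director" rewrites the caption so the subject
    # stays a coherent character instead of morphing into a generic robot.
    response = ollama.chat(
        model="qwen2.5",
        messages=[{
            "role": "user",
            "content": (
                f"Rewrite this scene as a {STYLE} scene, keeping the subject's "
                f"pose and objects: {caption}. Reply with a single image prompt."
            ),
        }],
    )
    return response["message"]["content"]
```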
Here is a sketch of what I imagine the system to be like:
The Battle for Real-Time
The biggest hurdle in this project was not generating the image, but generating it fast enough to feel like a reflection. I initially attempted to run the entire pipeline on the Raspberry Pi 5. This was a failure. Generating a single image using OnnxStream on the Pi took 1-3 minutes. For a game that relies on reaction and movement, a 60-second lag isn't just a delay; it's a broken interaction.
I pivoted to a "Thin Client" model. The RPi became a dumb terminal, acting only as an eye (camera) and a canvas (screen). The "brain" was moved to a PC running an RTX 3090. Even then, we hit bottlenecks. Running a local LLM (Ollama) on the RPi for style transfer added a 15-second delay. I realized that latency is the enemy of immersion. A lower-quality image that updates instantly feels more "real" than a high-fidelity masterpiece that lags by 10 seconds. I optimized the pipeline by compressing the SDXL Turbo output and using MQTT for lightweight transport, aiming for that "Ready Player One" fluidity.
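As a rough sketch of that optimization (broker address, topic name, and JPEG quality are assumptions, not the exact values we used), the generated frame is squeezed into a small base64-encoded JPEG before it ever touches the network:

```python
import base64
import io

import paho.mqtt.client as mqtt  # pip install "paho-mqtt>=2.0"
from PIL import Image

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.connect("broker.local", 1883)  # placeholder broker address
client.loop_start()

def publish_frame(frame: Image.Image, topic: str = "mirrorworld/frames") -> None:
    """Compress an SDXL Turbo output and blast it to the Pi screens as base64 JPEG."""
    buf = io.BytesIO()
    # A small, slightly crunchy JPEG that arrives instantly beats a pristine image that lags.
    frame.convert("RGB").save(buf, format="JPEG", quality=70)
    client.publish(topic, base64.b64encode(buf.getvalue()))
```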
Here is the architecture for the distributed system:
Components
The architecture is essentially a study in distributed computing and bottleneck management. I wanted the RPi to handle as much as possible, but the reality of diffusion models forced a distributed approach.
- The Edge (Raspberry Pi 5): Handles camera capture via OpenCV and displays the returned base64 stream. It runs a lightweight node script to handle the handshake with the server.
- The Brain (PC with RTX 3090):
- FastVLM: Provides scene descriptions in ~0.5s. This is the "eyes" of the system.
- SDXL Turbo: The "imagination." Generates images in ~0.3s (sketched in code after this list).
- Ollama (QWEN 2.5): The "director." Takes the plain caption ("A man holding a cup") and rewrites it ("A cybernetic warlord holding a glowing orb of plasma").
- The Transport (MQTT + Cloudflare): We used Cloudflare tunnels to expose the localhost servers to the RPis over the internet, and MQTT to blast base64 image frames back to the screens.
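For the "imagination" stage, here is a minimal sketch of single-step SDXL Turbo generation using Hugging Face diffusers; the 512x512 resolution is an assumption for speed, and the prompt is whatever comes out of the raw or narrative path above:

```python
import torch
from diffusers import AutoPipelineForText2Image  # pip install diffusers transformers accelerate

# Load SDXL Turbo once at startup; fp16 on the RTX 3090.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

def imagine(prompt: str):
    # SDXL Turbo is distilled for single-step sampling with guidance disabled,
    # which is what brings generation down to roughly a third of a second.
    return pipe(
        prompt=prompt,
        num_inference_steps=1,
        guidance_scale=0.0,
        height=512,
        width=512,
    ).images[0]
```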
Performance Benchmarks:

| Stage | PC (RTX 3090) | Raspberry Pi 5 |
|---|---|---|
| FastVLM Server | ~0.5s | ~30s |
| SDXL Generation | ~0.3s | 1-3 min |
| Style Transfer | 0.2s | ~15s |
I found that MQTT was the most efficient way to blast frames to multiple devices. The PC churns out generated images, publishes each base64 string to an MQTT topic, and the RPi reads it and renders it immediately using Pillow.
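On the Pi side, the receive-and-render loop can be as small as a single MQTT callback. This is a hedged sketch: the broker host and topic are placeholders, and the final blit to the attached screen depends on the display setup, so `show()` stands in for it here:

```python
import base64
import io

import paho.mqtt.client as mqtt
from PIL import Image

def on_message(client, userdata, msg):
    # The payload is a base64-encoded JPEG pushed by the PC.
    frame = Image.open(io.BytesIO(base64.b64decode(msg.payload)))
    frame.show()  # stand-in for drawing to the attached screen

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.on_message = on_message
client.connect("broker.local", 1883)    # placeholder broker address
client.subscribe("mirrorworld/frames")  # same topic the PC publishes to
client.loop_forever()
```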
I also found that generating images locally on the Pi took 1-3 minutes per frame, which defeated the purpose of a "real-time" interaction. Even running just the LLM style transfer on the Pi added a painful 10-15 second delay. The sweet spot was using the PC as a central brain and the Pis as sensory extremities.
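For completeness, here is what the "sensory extremity" side might look like: OpenCV grabs a frame on the Pi, shrinks it, and ships it upstream to the brain. How camera frames reach the PC is not shown here, so the MQTT uplink topic is purely an assumption:

```python
import base64
import time

import cv2
import paho.mqtt.client as mqtt

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.connect("broker.local", 1883)  # placeholder broker address
client.loop_start()

cap = cv2.VideoCapture(0)  # USB webcam attached to the Pi
while True:
    grabbed, frame = cap.read()
    if not grabbed:
        continue
    frame = cv2.resize(frame, (512, 512))  # keep upstream payloads small
    encoded, jpg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 70])
    if encoded:
        client.publish("mirrorworld/camera", base64.b64encode(jpg.tobytes()))
    time.sleep(0.3)  # no point capturing faster than the brain can imagine (~3 fps)
```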
SDXL Turbo Pipeline Demo (PC View): This shows the backend generating ~3 frames per second before transmission.
Results
The experience works best in that first moment of realization. When a player looks at the screen and sees their friend morph into a Cyberpunk character or a medieval warrior, there is a visceral "whoa" moment. The "glitchiness" of the AI initially adds to the charm: objects spontaneously morph, chairs become thrones, cups start glowing. It feels like a fever dream that two people are sharing.
However, the technical limitations eventually shape the behavior.
- The "Ollama Freeze": When we tried stylistic re-writing on the RPI, the 10-second freeze broke the illusion. Users stopped moving and started waiting. We found that users preferred the faster, chaotic raw stream over the slow, consistent styled stream.
- The Zoom Problem: The webcams had a narrow FOV and automatic digital zoom. To get a full-body generation (which looks best), players had to stand far apart, which reduced the intimacy of the interaction.
- Hallucination Drift: The model struggled with consistency. A character might change armor styles three times in three seconds. While funny at first, it prevents deep narrative immersion.
Here is the demo for the interaction without style transformation (Raw VLM caption -> SDXL):
Here is the demo with style transformation (VLM -> Ollama Style Prompt -> SDXL):
Discussion
| Aspect | Observation |
|---|---|
| Immersion | Participants felt it was "like VR without the headset." The fact that they could see a transformed version of their partner created a shared digital space without isolating them from the physical room. The physical world anchored the digital hallucination. |
| Playfulness | Once the novelty settled, it became a game of "Trick the AI." Players raised arms, struck poses, or held objects just to see how the VLM would misinterpret them into something epic. The interaction shifted from passive viewing to active performance. |
| Latency | The server load was the ultimate boss. With one feed, we hit 3fps. With two feeds, it dropped to 1.5fps. The "Distributed" aspect works, but it scales poorly on a single GPU. As more people joined, the "world" slowed down. |
Mirror World suggests that the "Metaverse" might not be a place we go to, but a filter we apply to the place we already are. The most compelling part of the experience wasn't the fidelity of the graphics, but the re-enchantment of the mundane. When the AI hallucinates a folding chair as a throne, or a coffee cup as a glowing artifact, it reveals the latent potential of physical objects. It turns the room into a stage and the users into improvisational actors. The "glitches" in the style transfer, where reality and fantasy momentarily clash, act as a reminder that perception is malleable.
The project also highlights a critical shift in how we relate to cameras. Usually, a camera is a device for documentation: it captures what is there. In Mirror World, the camera is a device for interpretation: it captures what could be there. This shift from capture to generation changes the social contract. You aren't consenting to be photographed; you are consenting to be re-imagined. The joy participants felt in "tricking" the VLM into generating cool armor suggests a future where we curate our physical appearance specifically to influence how algorithms interpret us.
Ultimately, this is a lesson in the physics of digital magic. Magic relies on immediacy. In this context, latency is not just a performance metric; it is the threshold of belief. Above a certain delay, the image is just a video filter. Below that delay, it becomes a new layer of reality. The next phase of this research is not about better graphics, but about tightening that feedback loop until the generated world feels as responsive and inevitable as the physical one.
Special Thanks
Special thanks to Thomas Knoepffler for his work on live camera filters, which served as an inspiration for the real-time transformation aspect of this project. I also want to thank Wendy Ju for pushing me to think about the "multiplayer" aspect of this: moving it from a solo art installation to a dyadic social game.