Visum Dei
An experimental device that explores contextual privacy by using real-time facial recognition to identify and personally address users, sparking a dialogue on AI, surveillance, and consent.
Project Details
Inspiration for this project came from discussions around contextual privacy and the ongoing controversy over facial recognition. It shares many similarities with a recent Harvard project and with other face-recognition controversies (e.g., Clearview AI). Please refer to this document to learn more about how to protect your privacy.
Overview
I created an interactive installation about contextual privacy and face recognition. The system listens for speech, takes a quick webcam photo, matches facial landmarks against a small precomputed database, and uses the matched name to generate a personalized, spoken response. The purpose is to make ambient recognition visible and ask practical questions about consent, value, and risk in everyday interactions.
Proposal
I propose Visum Dei (the sight of a god):
A focused, end-to-end demo that shows how identity can be captured and used in real time: speech detection → STT → webcam snapshot → facial-landmark match → name retrieval → prompt conditioning → TTS reply → repeat. The build runs locally (Raspberry Pi + USB webcam), uses cached embeddings (computed with InsightFace) from a small, precomputed face set, and calls a lightweight local LLM via Ollama for response text. The goal is not high accuracy but a clear, observable interaction that makes recognition, and its social impact, immediately visible.
We will compare variants to study comfort and consent:
- Identity use: direct name vs. indirect reference vs. anonymous mode
- Disclosure: with vs. without upfront consent text; inline “why” explanations
- Transparency: show/hide match confidence; visible opt-out toggle
Data handling is intentionally minimal: the face set is small and local, temporary snapshots are deleted after matching, and no cloud storage is used. Participants are informed of the process, can opt out at any time, and can request deletion of their entry from the local set.
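To make these variants concrete, here is a minimal, hypothetical configuration sketch; the flag names are my own illustration, not part of the actual build, but they show how the identity-use, disclosure, and transparency conditions and the data-handling rules above could be toggled per session.

```python
from dataclasses import dataclass

# Hypothetical study configuration; field names are illustrative, not from the real code.
@dataclass
class StudyConfig:
    identity_mode: str = "direct_name"        # "direct_name" | "indirect_reference" | "anonymous"
    upfront_consent_text: bool = True         # show consent text before the interaction starts
    inline_why_explanations: bool = True      # explain why identity is being used in each reply
    show_match_confidence: bool = False       # surface the face-match similarity score
    visible_opt_out: bool = True              # keep an opt-out toggle visible at all times
    delete_snapshot_after_match: bool = True  # temporary webcam frames are removed after matching
    local_only: bool = True                   # no cloud storage; the face set stays on the device
```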
Notice:
This project bears heavy similarities to a recent Harvard project. The idea is not especially original and has appeared in many contexts over the years (e.g., Clearview AI). For my part, I came up with my own version of this idea two years ago during undergrad, while working with traffic cameras, when the newest Python facial-landmark detection library came out, but I never did anything with it. In the spirit of MIT's slogan "demo or die": I never created a demo. The Harvard project is well documented and has some good videos of interaction with it; I recommend you check out the following video and support their privacy-safety efforts.
Sketch
Below is the interaction loop: the system listens, captures a frame, identifies the user, personalizes the reply, speaks, and repeats. A minimal orchestration sketch follows the list.
Interaction loop:
- Listens for the user's speech; once speech stops being detected, uses STT to transcribe what the user said.
- Takes a temporary webcam picture of the user.
- Compares the temporary picture against a precomputed facial-landmark database.
- Finds the most similar face (no similarity threshold was implemented in my version) and returns the name associated with it.
- Feeds that name, along with the user's transcribed prompt, into a system prompt that guides the model's response.
- Uses TTS to give the user a personalized spoken response that includes their name.
- Returns to step 1 and repeats indefinitely.
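As mentioned above, here is a minimal orchestration sketch of that loop, in the spirit of master.py. The helper functions are placeholders for the STT, webcam, and TTS modules, and the server endpoints (/match, /reply) are assumptions for illustration rather than the exact code.

```python
import os
import requests

FACE_SERVER = "http://localhost:7860"  # face_server.py (endpoint path below is assumed)
LOVE_SERVER = "http://localhost:7861"  # love_server.py (endpoint path below is assumed)

def listen_and_transcribe() -> str:
    """Placeholder for the STT step: block until speech stops, return the transcript."""
    raise NotImplementedError

def capture_frame(path: str = "/tmp/snapshot.jpg") -> str:
    """Placeholder for the webcam step: save one temporary frame and return its path."""
    raise NotImplementedError

def speak(text: str) -> None:
    """Placeholder for the TTS step: synthesize the reply and play it through the speaker."""
    raise NotImplementedError

def run_loop() -> None:
    while True:
        user_text = listen_and_transcribe()              # 1. speech -> text
        frame = capture_frame()                          # 2. temporary snapshot
        with open(frame, "rb") as f:                     # 3. face match -> name
            name = requests.post(f"{FACE_SERVER}/match", files={"image": f}).json()["name"]
        reply = requests.post(                           # 4. LLM reply conditioned on the name
            f"{LOVE_SERVER}/reply", json={"prompt": user_text, "name": name}
        ).json()["text"]
        speak(reply)                                     # 5. personalized spoken response
        os.remove(frame)                                 # 6. delete the snapshot, then repeat
```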
Here is a picture of the device (with webcam attached):
Components
To create the facial-landmark database, we needed some faces to work with. For educational and research purposes only, I created a small database of the LinkedIn headshots of people in my Interactive Device Design class with Wendy Ju. The database was built by scraping the list of names from the Canvas class page and then running a LinkedIn headshot Selenium scraper. The automated browser used a logged-in LinkedIn session, searched for each name plus "Cornell Tech", picked the most relevant person, and downloaded their LinkedIn headshot (saved as FIRST_LAST_LINKEDINID.jpeg).
Notice:
It is important to note that users without a LinkedIn headshot (or with an avatar instead of a real face) could not be identified, since no facial landmarks could be computed for them.
I then used the Python library InsightFace with the LinkedIn headshot folder and wrote a script that matched an input image to the most likely face from that folder. Each call was a bit slow (around 10 s), since it recomputed the embeddings for the entire folder as well as the embedding for the input image before comparing them. I brought this down to under 2 s by caching the embeddings of the headshot folder (since every query is compared against the same folder).
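Here is a rough sketch of that caching idea with InsightFace; the folder layout and name parsing follow the FIRST_LAST_LINKEDINID.jpeg convention described above, but the cache path and other details are my own reconstruction rather than the exact script.

```python
import os
import pickle

import cv2
import numpy as np
from insightface.app import FaceAnalysis

HEADSHOT_DIR = "headshots"      # FIRST_LAST_LINKEDINID.jpeg files
CACHE_PATH = "embeddings.pkl"   # cache file name is an assumption

app = FaceAnalysis(name="buffalo_l")  # default detection + recognition models
app.prepare(ctx_id=-1)                # CPU; use ctx_id=0 on a GPU

def embed(path):
    """Return the normalized embedding of the first detected face, or None."""
    faces = app.get(cv2.imread(path))
    return faces[0].normed_embedding if faces else None

# Compute the folder embeddings once and reuse them; only the query image is embedded per call.
if os.path.exists(CACHE_PATH):
    with open(CACHE_PATH, "rb") as f:
        db = pickle.load(f)
else:
    db = {}
    for fname in os.listdir(HEADSHOT_DIR):
        emb = embed(os.path.join(HEADSHOT_DIR, fname))
        if emb is not None:
            db["_".join(fname.split("_")[:2])] = emb   # FIRST_LAST from the filename
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(db, f)

def match(query_path):
    """Return the name whose cached embedding is most similar to the query (no threshold)."""
    q = embed(query_path)
    if q is None:
        return None
    return max(db, key=lambda name: float(np.dot(db[name], q)))
```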
For privacy reasons, I have not included the source code for the LinkedIn headshot Selenium scraper. Please reach out to me by email if you would like it.
My system involves the pipeline described in the interaction loop above, consisting of the following files:
- master.py: an infinitely running script which orchestrates the following module Python scripts in the pipeline order of the interaction loop.
- face_server.py: a server running on port 7860 which uses the precomputed embeddings and InsightFace to find the most similar headshot and retrieve the associated name.
- love_server.py: a server running on port 7861 which queries Ollama with the user's input, a system prompt, and the name returned by face_server, then outputs a response to the user's query.
- greet_name_piper.py: an executable one-shot script which uses the Piper TTS library to immediately speak the generated response through the speaker; it is the final module to be called.
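For illustration, here is a minimal sketch of what love_server.py might look like as a small Flask service in front of Ollama's local REST API. The endpoint name, model, and system-prompt wording are my own assumptions; only the port and the overall role match the description above.

```python
import requests
from flask import Flask, jsonify, request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "llama3.2"                                  # any small local model; name is an assumption

app = Flask(__name__)

@app.route("/reply", methods=["POST"])  # endpoint name assumed
def reply():
    data = request.get_json()
    name = data.get("name") or "friend"
    system = (
        "You are a friendly interactive installation. "
        f"The person speaking to you is named {name}. "
        "Address them by name and answer their question briefly."
    )
    r = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "system": system,
        "prompt": data["prompt"],
        "stream": False,
    })
    return jsonify({"text": r.json()["response"]})

if __name__ == "__main__":
    app.run(port=7861)
```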
Results
Since the interaction for my device was simple (just an Ollama chatbot with the additional information of the user's name), I put more emphasis on the discussion and debriefing after the interaction. I believe the importance of this project, like the Harvard one, is to encourage discussion around contextual privacy.
The interaction was somewhat slow: even though the facial-landmark recognition finished in under 2 s on average, text generation and the subsequent TTS often took 10 s or more. This created a discrepancy between the user asking a question and waiting around 10 s for an answer, which felt awkward. I believe this could easily be fixed by offloading text generation and similar work to a powerful PC, so that the local RPi 5 handles only minimal computation. Another issue was that some LinkedIn headshots showed a face from a different angle than the live view of the person facing the camera, causing occasional misidentification. As always, it is important to mention that bias still exists in face recognition models and could sometimes be seen even in this toy example, with certain ethnicities more prone to misidentification.
Below I compare interactions and describe some lessons learned and future improvements to the current system:
- WoZ interactions, and autonomous ones for that matter, can feel quite unrealistic if the pacing and timing are off. This was readily apparent in my system with its ~10 s response times, which made it less of an interaction and more of a waiting game. The simplest fix would be to offload heavy computation to a more powerful PC.
- As noted by several participants and spectators, having the system say the user's name directly is quite blunt and immediately puts the user on the defensive, wondering "How did this robot know my name?". To that end, better 'social engineering' experiments could be conducted. See the Harvard video above for a good example, where they interact with Boston residents not by saying their name directly but by mentioning an association or a relative of that person. This lowers the user's defensive wall and elicits a more natural reaction than one where the user is immediately put on guard. This is a bit deceitful, of course, so please note that these thoughts are expressed for educational reasons only, and I do not encourage these kinds of insincere interactions.
- The system sometimes produced low-quality responses to user queries, which led to confusion on the user's side. This could be solved with a higher-quality language model (at the cost of speed).
- It was readily apparent that the webcam was being used for face recognition, and participants quickly gleaned this once they heard their name spoken. I wonder if there is a way to disguise the webcam so that participants do not realize face recognition is happening. What different interactions could occur from this? Would the participant believe that their name is simply in the LLM's knowledge base, that some voice recognition is being done, and so on? I think it would be interesting to explore what reasons users come up with for why a privacy breach is occurring. It might give insight into what people believe are the most privacy-breaching technologies of the current day.
- A few participants raised that they would be okay with the interaction if they had "consented" to it. They defined this consent as an explicit ask by the robot about opting into recognition, followed by affirming that the robot could use sensor information to guess who they were (a minimal opt-in sketch follows this list).
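One way to implement that opt-in is a spoken consent gate that runs before any frame is captured. This is only a sketch: speak(), listen_and_transcribe(), capture_frame(), and match() are the placeholder helpers from the earlier sketches, and the wording and keyword list are illustrative.

```python
import os

AFFIRMATIVE = {"yes", "yeah", "sure", "okay", "ok", "go ahead"}

def ask_for_consent() -> bool:
    """Speak an explicit opt-in request and return True only on a clear yes."""
    speak("I can use my camera to guess who you are and greet you by name. "
          "Is that okay? You can say no and stay anonymous.")
    answer = listen_and_transcribe().lower()
    return any(phrase in answer for phrase in AFFIRMATIVE)

def identify_or_anonymous() -> str:
    """Run face matching only if the participant opts in; otherwise stay anonymous."""
    if not ask_for_consent():
        return "friend"            # anonymous mode: no snapshot is taken at all
    frame = capture_frame()        # temporary snapshot, as in the main loop
    name = match(frame)            # cached-embedding match from the earlier sketch
    os.remove(frame)
    return name or "friend"
```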
Along with the above reflections, I believe an LLM might be able to deny the privacy breach if fine-tuned on a specific dataset. Similar to how current LLMs rely on RL over human-LLM conversations to build larger training sets, I believe the same could be done with privacy-breach human-LLM conversations. Conversations in which the LLM has breached someone's privacy are quite different from normal conversations, and they would be an interesting dataset to create and use for developing models that interact better in scenarios where there is a lessened sense of privacy.
Another, and perhaps more important, sensing modality would be voice recognition. A camera can only see within a certain field of view, and several actors may exist within that view, while sound can come from anywhere and from any actor. An obvious failure case is two people standing in front of the camera while only one of them is speaking. The same goes for a person interacting with the robot from outside the camera's field of view. There is no is_speaking detection and no rotation of the camera toward the audio source, so the model tries to identify whichever individual is in view once the speech has ended and could be completely wrong about whom to address. With voice recognition, the model could recognize and assign names to each voice and converse personally even while blind (with no camera); a rough sketch of this follows.
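As a sketch of how that could work (something I did not build), a speaker-embedding library such as resemblyzer could follow the same enroll-then-match pattern as the face pipeline; the enrollment clips and names below are purely illustrative.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()  # pretrained speaker-embedding model

def voice_embedding(wav_path: str) -> np.ndarray:
    """Return a normalized speaker embedding for one utterance."""
    return encoder.embed_utterance(preprocess_wav(wav_path))

# Enrollment: one short reference clip per known person (paths and names are illustrative).
voice_db = {
    "alice": voice_embedding("enroll/alice.wav"),
    "bob": voice_embedding("enroll/bob.wav"),
}

def identify_speaker(utterance_path: str) -> str:
    """Return the enrolled name whose voice is most similar to the new utterance."""
    q = voice_embedding(utterance_path)
    return max(voice_db, key=lambda name: float(np.dot(voice_db[name], q)))
```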
Video Demo:
Discussion
Since my interaction was intended more as an experimental demo to encourage discussion around contextual privacy, I have recorded feedback from several users below.
| Person | Viewpoint |
|---|---|
| Thomas Knoepffler | Identity use should be earned with a clear value exchange. Instead of just saying a name, the system should explain why it's using the identity to provide a tangible benefit, like loading preferences. Tie recognition to a goal, show the payoff, and make opt-out easy. |
| Nana Takada | The creepy factor begins when a system recognizes you without prior consent. She draws a line between announced public surveillance and secret recognition in personal devices. The fear is about hidden data collection by powerful entities. Good practice starts with upfront disclosure and clear rules. |
| Miriam Alex | Recognition should only be used when necessary for the task. For many queries, identity adds no value. She prefers in-the-moment consent, like a prompt asking for permission. While less awkward, silent recognition is ethically weaker because it lacks transparency. The best approach is to minimize data by default and make the benefit of recognition obvious. |
| Anonymous | Prefers softer identity cues over direct naming, such as referencing a public detail. This feels responsive without being overly invasive. They also noted the power gap created by social engineering tactics and suggested keeping recognition ephemeral and clearly optional to bridge that gap. |
Visum Dei stages the machine’s gaze as a public act. I called it “the sight of a god” to suggest absolute certainty, yet what you encounter is a regression model that claims to know you. When the device speaks your name, it collapses the space between seeing and knowing; perception arrives as authority. That authority is fragile. Each identification is only a statistical guess wrapped in confidence. The piece uses that tension, between the performance of omniscience and the reality of approximation, to make recognition feel less like a neutral sensor reading and more like a social event with stakes.
In this setting, naming is the critical gesture. To name is to position, to fix, to exert power. The system's match is not truth but hypothesis, and misrecognitions become part of the meaning rather than defects to hide. They reveal how identity, under computation, is produced from fragments: pose, lighting, priors. The installation is therefore not asking "does it work?" so much as "what does it do to us when it works, or almost works?" Consent here is not just a checkbox but the duality of before and after, seen and acknowledged, guessed and declared.
The project points toward a broader question: do we want environments that greet us, or environments that index us? A humane answer might lie in cultivating a right to anonymity, where recognition is a reversible affordance. Visum Dei holds the mirror uncomfortably close: it shows that the machine’s knowledge is neither divine nor empty, but something in between, powerful enough to shape behavior, but brittle enough to demand doubt.