StreamTwin
A decentralized digital-twin framework that uses public webcams and browser-based edge computing to visualize real-time traffic conditions.
Project Details
I kept running into the same uncomfortable pattern with “smart city” digital twins. The concept is powerful: a time-synchronized replica of the real world that can monitor, predict, and help control complex systems like traffic. But the implementations that actually work at high fidelity tend to assume expensive infrastructure. Dense deployments of calibrated sensors. Dedicated edge boxes. Centralized compute clusters. Huge volumes of upstream video. If you are a major city with a major budget, you can justify it. If you are not, the “twin” stays a slide deck.
StreamTwin started with a smaller, more opportunistic question: what if we stop treating sensing and compute as things we have to install, and start treating them as things the world already has? There are thousands of publicly accessible webcams pointed at streets and intersections. People already watch these streams. Their laptops and phones already have capable CPUs, and a browser tab can already put that compute to work. I wanted to see whether the simple act of watching a stream could become part of a city-scale sensing pipeline, without shipping raw pixels to a central server.
That is the core trick. When you open StreamTwin, your browser loads a lightweight object detector (YOLOv5n), exported to ONNX and executed by ONNX Runtime Web on its WebAssembly backend. Inference runs client-side on the live stream. Raw video never leaves the client. Instead, the browser emits only structured detections over a WebSocket: a timestamp, a class label, bounding box coordinates, and a coarse camera identifier. The system becomes cheaper by construction, and the privacy story changes shape immediately, because the server never receives frames to store, leak, subpoena, or mishandle.
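To make that concrete, here is a minimal sketch of the client loop. The helpers toCHW (RGBA pixels to normalized CHW floats), decodeDetections (confidence filtering plus NMS), CAMERA_ID, and the ingest URL are all hypothetical placeholders, not names from the real codebase, and the real pipeline also needs letterboxing and reconnection handling:

```typescript
// Minimal client-side sketch: run YOLOv5n in the tab, ship only detections.
import * as ort from "onnxruntime-web";

type Detection = {
  ts: number;                             // capture timestamp (ms since epoch)
  cls: string;                            // class label, e.g. "car", "person"
  box: [number, number, number, number];  // x, y, w, h in stream pixels
  cam: string;                            // coarse camera identifier
};

declare function toCHW(rgba: Uint8ClampedArray): Float32Array; // hypothetical: RGBA -> normalized CHW
declare function decodeDetections(
  out: Record<string, ort.Tensor>,
): { label: string; box: [number, number, number, number] }[]; // hypothetical: filter + NMS
declare const CAMERA_ID: string;                               // hypothetical

const ws = new WebSocket("wss://streamtwin.example/ingest");   // placeholder URL
const session = await ort.InferenceSession.create("yolov5n.onnx");

async function processFrame(video: HTMLVideoElement, canvas: HTMLCanvasElement) {
  const ctx = canvas.getContext("2d")!;
  ctx.drawImage(video, 0, 0, 640, 640);               // resize to model input
  const { data } = ctx.getImageData(0, 0, 640, 640);
  const input = new ort.Tensor("float32", toCHW(data), [1, 3, 640, 640]);
  const output = await session.run({ images: input }); // "images" is YOLOv5's exported input name
  for (const det of decodeDetections(output)) {
    const msg: Detection = { ts: Date.now(), cls: det.label, box: det.box, cam: CAMERA_ID };
    ws.send(JSON.stringify(msg));                      // kilobytes per second, never pixels
  }
}
```

Each message is on the order of a hundred bytes, which is what makes the bandwidth numbers later in this write-up possible.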
Overview
StreamTwin is a decentralized digital twin built from two crowds. The first crowd is the cameras: public webcams that already exist and already cover streets at city scale. The second crowd is the compute: viewers who load the web app and become transient, heterogeneous edge nodes. The server’s job is not to run vision. The server’s job is to fuse many small, noisy, asynchronous observations into a single coherent world model, then stream that model back as an interactive visualization.
I built that fusion layer as the Aggregate Spatiotemporal Cache (ASC). It is the difference between “lots of people running a detector” and “a usable twin.”
Proposal
I propose StreamTwin as a digital twin framework that lowers the cost and technical barrier of deployment by pushing inference to the browser and transmitting only anonymized detection metadata. The system trades metric-perfect calibration for scalability and participation: it is designed to be good enough for monitoring congestion, trends, and situational awareness, while being deployable without installing new hardware and without streaming continuous video upstream.
System Architecture
The architecture is intentionally simple. Public webcams provide live streams. A viewer visits the StreamTwin web application, and their browser runs YOLOv5n via ONNX Runtime Web on WASM. The browser outputs bounding boxes for vehicles and pedestrians and sends only those detections to the server. The server maintains the ASC, fusing detections across viewers and cameras into a single world state held in memory. The client periodically fetches the fused world model and renders it interactively in WebGL using deck.gl, so the twin feels like a live map you can explore rather than a static report.
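The rendering side is close to a textbook deck.gl scatterplot over the fused state. A sketch, assuming a hypothetical /world endpoint that returns fused hypotheses as {lon, lat, confidence} records:

```typescript
// Render-loop sketch: poll the fused world model and draw it with deck.gl.
import { Deck } from "@deck.gl/core";
import { ScatterplotLayer } from "@deck.gl/layers";

const deck = new Deck({
  initialViewState: { longitude: -122.42, latitude: 37.77, zoom: 13 },
  controller: true,
  layers: [],
});

setInterval(async () => {
  const world = await (await fetch("/world")).json(); // hypothetical endpoint
  deck.setProps({
    layers: [
      new ScatterplotLayer({
        id: "twin",
        data: world,
        getPosition: (d: any) => [d.lon, d.lat],
        getRadius: 4,
        radiusMinPixels: 3,
        // map confidence to opacity so uncorroborated hypotheses fade
        // instead of popping in and out
        getFillColor: (d: any) => [255, 80, 0, Math.round(255 * d.confidence)],
      }),
    ],
  });
}, 1000); // poll the fused world model once per second
```

Mapping confidence to opacity is a small choice that matters for trust: objects fade gracefully rather than flickering.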
Aside from the in-memory ASC, the system is stateless, which makes it resilient to restarts: if a server node dies, it rebuilds state quickly by replaying recent detection logs. The more important scaling property is that the heavy compute lives at the edge: capacity grows with crowd size, because each viewer runs their own inference and sends only kilobytes per second.
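The recovery path can reuse the live ingest path. A sketch, assuming a hypothetical ingestDetection() entry point and the Detection shape from the client sketch above:

```typescript
// Illustrative restart recovery: replay logged detections through the same
// code path that live WebSocket traffic takes (ingestDetection is hypothetical).
async function rebuildFromLog(
  log: AsyncIterable<Detection>,
  asc: { ingestDetection(d: Detection): void },
): Promise<void> {
  for await (const det of log) {
    asc.ingestDetection(det);
  }
}
```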
Aggregate Spatiotemporal Cache (ASC)
ASC is built for the messy reality of crowdsourced inference: detections arrive asynchronously, out of order, and from clients that can disappear at any moment. The cache maintains a set of hypotheses, each representing a candidate object in the world model with a ground-plane position, a velocity, and a confidence score. Incoming detections are projected onto a common ground plane using approximate homographies when camera pose is available, or a flat-earth assumption when it is not. Then association happens: each new detection can match at most one existing hypothesis, gated by spatial distance, time window, and class consistency.
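In code, projection plus gating looks roughly like the following, where Detection is the client message from earlier, H is a row-major 3×3 homography per camera, and the gate thresholds are illustrative rather than tuned values from the evaluation:

```typescript
// Ground-plane projection and association gating (thresholds illustrative).
type GroundPoint = { x: number; y: number };

type Hypothesis = {
  cls: string;
  x: number; y: number;     // ground-plane position (m)
  vx: number; vy: number;   // velocity estimate (m/s)
  confidence: number;       // in [0, 1]
  lastSeen: number;         // timestamp (ms) of last matched detection
};

// Apply a 3x3 homography H (row-major) to an image point.
function projectToGround(H: number[], px: number, py: number): GroundPoint {
  const w = H[6] * px + H[7] * py + H[8];
  return {
    x: (H[0] * px + H[1] * py + H[2]) / w,
    y: (H[3] * px + H[4] * py + H[5]) / w,
  };
}

// Bottom-center of the box is a cheap proxy for the ground-contact point.
const bottomCenter = (b: [number, number, number, number]): [number, number] =>
  [b[0] + b[2] / 2, b[1] + b[3]];

declare function cameraHomography(cam: string): number[]; // hypothetical per-camera lookup

function gate(det: Detection, hyp: Hypothesis, now: number): boolean {
  const [px, py] = bottomCenter(det.box);
  const p = projectToGround(cameraHomography(det.cam), px, py);
  return (
    det.cls === hyp.cls &&                     // class consistency
    now - hyp.lastSeen < 2000 &&               // time window (ms, illustrative)
    Math.hypot(p.x - hyp.x, p.y - hyp.y) < 5   // spatial gate (m, illustrative)
  );
}
```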
When a detection matches, a Kalman filter updates the hypothesis state, and confidence increases because the match counts as another independent confirmation. When a hypothesis is not observed, it is propagated forward by its motion model and its confidence decays. If confidence falls below a threshold, the hypothesis is removed. This gives the system a useful kind of common sense: corroborated objects stick around through brief occlusions, while uncorroborated noise tends to die out.
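The lifecycle is where the filtering happens. Below is a deliberately simplified stand-in, reusing the Hypothesis and GroundPoint types from the previous sketch: the real system runs a proper Kalman filter with covariance updates, while this version blends prediction and observation with fixed weights, and the decay and pruning constants are illustrative:

```typescript
// Simplified hypothesis lifecycle: confirm on match, decay when unobserved.
function onMatch(h: Hypothesis, obs: GroundPoint, now: number): void {
  const dt = (now - h.lastSeen) / 1000;
  const px = h.x + h.vx * dt;                        // predict forward
  const py = h.y + h.vy * dt;
  if (dt > 0) {
    h.vx = 0.7 * h.vx + 0.3 * ((obs.x - h.x) / dt);  // fixed-weight velocity blend
    h.vy = 0.7 * h.vy + 0.3 * ((obs.y - h.y) / dt);
  }
  h.x = 0.6 * px + 0.4 * obs.x;                      // blend prediction and observation
  h.y = 0.6 * py + 0.4 * obs.y;
  h.confidence = Math.min(1, h.confidence + 0.15);   // another independent confirmation
  h.lastSeen = now;
}

// Called on a fixed tick; dt is the tick length in seconds. In this simplified
// version every hypothesis is propagated and decayed, then corrected on match.
function onTick(hyps: Hypothesis[], now: number, dt: number): Hypothesis[] {
  for (const h of hyps) {
    h.x += h.vx * dt;                     // propagate by motion model
    h.y += h.vy * dt;
    h.confidence *= Math.exp(-0.5 * dt);  // decay while unobserved
  }
  return hyps.filter((h) => h.confidence > 0.1); // prune uncorroborated noise
}
```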
Experiments and Results
I evaluated StreamTwin across ten publicly accessible traffic cameras in San Francisco. To control crowd size, I emulated viewers using headless Chrome clients, ranging from 1 to 50 viewers per camera. The key question was whether the decentralized approach could match the accuracy of centralized baselines without paying the video bandwidth cost.
The short answer is that it gets surprisingly close. StreamTwin achieved a scene reconstruction IoU of 0.73, compared to 0.78 for a centralized cloud baseline and 0.77 for an edge-server-per-camera baseline, while radically reducing bandwidth. End-to-end latency from capture to twin visualization measured around 90 ms for StreamTwin, compared to roughly 200 ms for centralized cloud processing and 120 ms for the edge-server baseline. Precision and recall were strong in the reported setup (0.91 precision and 0.84 recall), which matters because false alarms are exactly what make a live dashboard feel untrustworthy.
The bandwidth difference is where the architecture stops being a neat idea and becomes a different deployment category. A raw 720p stream commonly consumes on the order of 5 Mbps; StreamTwin reduces per-stream bandwidth to about 20 kbps by transmitting detections instead of pixels. In cost terms, that reduction translates into more than a 20× cut in monthly operating costs in the reported analysis. It also removes the central network bottleneck that video analytics pipelines repeatedly crash into.
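A back-of-envelope check makes the 20 kbps figure plausible. The message size and scene density below are assumptions, not measured values; note also that the reported 20× figure is about monthly operating cost, while the raw bandwidth ratio is much larger:

```typescript
// Order-of-magnitude check on per-stream bandwidth (all inputs assumed).
const bytesPerDetection = 100;  // one small JSON message: ts, class, box, camera id
const fps = 10;                 // client inference rate
const objectsPerFrame = 2;      // assumed average scene density
const detectionKbps = (bytesPerDetection * fps * objectsPerFrame * 8) / 1000;
const videoKbps = 5_000;        // raw 720p stream at ~5 Mbps

console.log(detectionKbps);             // 16 kbps, same order as the reported ~20 kbps
console.log(videoKbps / detectionKbps); // ~300x less raw bandwidth than shipping video
```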
Client Overhead (Does this actually run in a tab?)
I also measured what it costs the viewer to participate. On a typical laptop, running YOLOv5n at about 10 FPS used roughly 20% of a single CPU core and about 150 MB of memory. On a Pixel 6 smartphone, I observed around 5 FPS under 60% CPU utilization. Those numbers are not free, but they land in the zone of “feels plausible” for opt-in participation, especially because the browser is already open for viewing and because inference rates can be throttled dynamically based on device capability.
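Throttling can be as simple as timing each inference call and backing off on weak devices. A sketch with illustrative bounds, not the values behind the measurements above:

```typescript
// Adaptive inference throttling: slower devices automatically run fewer FPS.
async function throttledLoop(runInference: () => Promise<void>): Promise<void> {
  let intervalMs = 100; // start near 10 FPS
  for (;;) {
    const t0 = performance.now();
    await runInference();
    const cost = performance.now() - t0;
    // Back off when inference is expensive (weak device), speed up when cheap;
    // the factor of 3 and the 50-1000 ms clamp are illustrative.
    intervalMs = Math.min(1000, Math.max(50, cost * 3));
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```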
Scalability and Robustness
One of StreamTwin’s nicer properties is that its accuracy improves with crowd size. With only 1–2 viewers per camera, objects get missed due to occlusion or detection gaps and the twin looks brittle. With 10+ viewers, redundancy starts filling those holes. In the reported experiments, IoU climbed from around 0.4 with a single viewer to above 0.7 with 50 viewers, then plateaued around 0.75–0.78 beyond that point because additional viewers mostly add confirmation rather than new information.
The system also handles churn in a way that feels honest. When I randomly dropped 30% of viewers at a time, some hypotheses lost support and disappeared, then re-instantiated when observed again. The overall IoU dipped during the dropout and recovered afterward, which matches the intuition that redundancy is doing real work here.
I also probed how ASC behaves under noisy inputs. When 10% of viewers sent random bounding boxes, the confidence mechanism largely filtered them out because false detections were rarely corroborated by honest clients. A coordinated, colluding attack is still a real threat, though. If an attacker controls enough clients to manufacture consensus, they can bias the world model. Security hardening (attestation, anomaly detection, rate limits, governance) is future work, not something I pretend is solved.
Visualization
I cared a lot about how the system feels to use, because “accuracy” is not the only thing users perceive. If the twin lags, users think it’s broken. If objects flicker, users think it’s untrustworthy. The StreamTwin dashboard makes camera coverage legible and lets you zoom into different vantage points while your browser automatically runs inference. The fused detections become a living layer you can explore, rather than a black-box metric.
Limitations
StreamTwin’s biggest limitation is the same thing that makes it scalable: it depends on participation. During low viewership hours, fidelity degrades because fewer independent observers exist to corroborate detections and bridge occlusions. Coverage gaps are also structural: areas with no public cameras cannot appear in the twin. And there is a fairness issue that is easy to miss at first: if participation clusters in certain neighborhoods, the twin becomes more accurate there, potentially reinforcing uneven attention unless incentives or deliberate coverage strategies compensate.
Ethical Considerations
I think it is important to say the quiet part out loud. A fused network of public cameras begins to resemble a city-scale surveillance system, even if the individual streams are publicly accessible. StreamTwin improves privacy by never transmitting raw video, but detection metadata can still be abused, especially if it enables trajectory-level tracking. Real deployments would need governance and scope boundaries: traffic analytics, not individual tracking. They would also need stronger privacy protection, potentially including differential privacy noise on counts or positions to provide provable guarantees.
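For aggregate counts, the classic Laplace mechanism is a natural starting point. A minimal sketch; the epsilon and sensitivity values are placeholders, and a real deployment would need a privacy budget accounted across queries:

```typescript
// Laplace mechanism sketch for per-camera vehicle counts.
// Inverse-CDF sampling: x = -b * sign(u) * ln(1 - 2|u|), u uniform in (-0.5, 0.5).
function laplaceNoise(scale: number): number {
  const u = Math.random() - 0.5;
  return -scale * Math.sign(u) * Math.log(1 - 2 * Math.abs(u));
}

// Adding one vehicle changes a count by at most 1, so sensitivity = 1.
function privateCount(trueCount: number, epsilon = 0.5, sensitivity = 1): number {
  return Math.max(0, Math.round(trueCount + laplaceNoise(sensitivity / epsilon)));
}
```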
Bias is also real. Detection performance varies by camera quality, angle, lighting, and weather, and those factors correlate with neighborhood infrastructure. A twin that is unevenly accurate can become a tool that reinforces uneven service quality. Transparency and ongoing evaluation are part of the system, not a footnote.
Future Work
The next improvements are less about “bigger models” and more about making the twin graceful when the crowd is imperfect. Predictive models could bridge sparse observation windows using historical patterns when viewership is low. Federated learning could adapt the detector to specific camera viewpoints and weather conditions while preserving privacy. On the fusion side, extending ASC toward more consistent 3D localization by incorporating approximate camera pose and map constraints could tighten geometry. And on the security side, robust outlier detection and client authentication matter if you want to deploy this beyond a research demo.
Discussion
StreamTwin is ultimately a proposal about where infrastructure can live. A digital twin does not have to be built only on dense sensor grids, dedicated edge appliances, and terabytes of upstream video. It can be built on what already exists: open streams and the everyday act of watching them. By treating public webcams as sensors and viewer browsers as ephemeral edge nodes, and by using ASC to fuse noisy detections into a coherent traffic scene, StreamTwin approaches centralized accuracy while reducing bandwidth, cost, and privacy risk by design.