
The hard part of Swift on Jetson was not Swift


In March 2026 I was invited to a virtual Swift meetup to talk about Swift running on edge and embedded devices. My experience at WendyLabs served me well for this challenge, and the talk gave me the excuse to port a vision pipeline to Swift.

Jetson Orin Nano 8 GB Super on my desk

The first version was a small Python project: object detection with YOLO26n, then selected frames sent to a quantized Qwen model for a description. The description stage was slow and not especially practical, but it proved the system could combine vision, detection metadata, and a local language model on the Jetson.

I wanted to find out whether Swift could be a practical host language for a Jetson vision pipeline.

Not “can Swift replace DeepStream?” That would be the wrong goal. The Jetson already has hardware video decode, TensorRT, GStreamer, and DeepStream. The useful question was narrower:

Can Swift sit around that stack, configure the pipeline, read typed inference metadata, serve the results, and avoid becoming the bottleneck?

The answer was yes. The interesting part was not the Swift code. The interesting part was getting the Jetson container runtime and Swift cross-compilation SDK into the right shape.

The project runs on a Jetson Orin Nano 8 GB running WendyOS. It reads a 1080p H.264 RTSP stream from a TP-Link Tapo camera, decodes it with Jetson hardware decode, runs object detection through DeepStream/TensorRT, tracks objects with NvDCF, and sends typed detection metadata to browser clients over WebSockets.

For this post, I focused on the detection pipeline itself and ported that part to Swift. NVIDIA’s supported DeepStream ecosystems are Python and C/C++, but Swift has good C interop, and memory management is much less painful than in C++. That made it a good candidate for the control-plane code around a DeepStream graph.

In this post:

  • the final pipeline shape;
  • what Swift actually does;
  • how the Swift version compared with the Python version;
  • what WendyOS gave me;
  • the two fixes I had to make: CDI video plugin paths and a rebuilt Swift SDK sysroot.

The pipeline shape

Pipeline diagram: Tapo C110 → mediamtx splits into two paths — WebRTC pixels to the browser video element, and RTSP into the detector container's GStreamer graph (rtspsrc → nvv4l2decoder → nvstreammux → nvinfer → nvtracker → fakesink) where a pad probe extracts metadata into Swift, then to the browser canvas via WebSocket; the two reunite via RVFC rtpTimestamp.

The stack is:

  • Jetson Orin Nano 8 GB;
  • WendyOS 0.10.5;
  • DeepStream 7.1;
  • 1080p H.264 RTSP camera;
  • mediamtx as an RTSP/WebRTC relay;
  • YOLO26n FP16 via TensorRT (run by nvinfer inside the DeepStream graph, not called from Swift);
  • NvDCF tracker;
  • Swift + GStreamer + a small C shim for DeepStream metadata.

The final GStreamer graph is a single pipeline:

rtspsrc → nvv4l2decoder → nvstreammux → nvinfer → nvtracker → fakesink

The shape matters. Video enters through RTSP, gets decoded by nvv4l2decoder, batched by nvstreammux, processed by nvinfer, tracked by nvtracker, and discarded at fakesink.
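To make the shape concrete, here is a minimal sketch that assembles the same graph as a gst-launch style description string. It is an illustration, not the project's code: the post describes setting properties on individual elements, and the PipelineConfig type, the latency value, and the rtph264depay and h264parse stages (which an H.264 RTSP source typically needs before the decoder) are assumptions added here.

```swift
/// Settings for the graph. All names and defaults here are illustrative.
struct PipelineConfig {
    var rtspURL: String
    var inferConfigPath: String   // passed to nvinfer as config-file-path
    var trackerLibPath: String    // passed to nvtracker as ll-lib-file
    var width = 1920
    var height = 1080
}

/// Builds a gst-launch style description of the single-stream graph.
func pipelineDescription(_ c: PipelineConfig) -> String {
    // Everything from the muxer onward: batch, infer, track, discard.
    let downstream = [
        "nvstreammux name=mux batch-size=1 width=\(c.width) height=\(c.height) live-source=1",
        "nvinfer config-file-path=\(c.inferConfigPath)",
        "nvtracker name=tracker ll-lib-file=\(c.trackerLibPath)",
        "fakesink sync=false",
    ].joined(separator: " ! ")

    // The source branch ends by linking into the muxer's first request pad.
    let upstream = [
        "rtspsrc location=\(c.rtspURL) latency=200",
        "rtph264depay",
        "h264parse",
        "nvv4l2decoder",
        "mux.sink_0",
    ].joined(separator: " ! ")

    return downstream + " " + upstream
}
```

A string like this can be handed to gst_parse_launch or pasted after gst-launch-1.0 on the device, which is a quick way to check the graph before any Swift is involved.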

The browser video path is separate. mediamtx serves the camera stream over WebRTC. Swift only publishes detection metadata.

That separation is the whole point: pixels stay in the NVIDIA video/GPU path; Swift reads metadata.

What Swift actually does

Swift does five jobs in this app.

First, it builds and owns the GStreamer pipeline. It sets properties on elements like rtspsrc, nvv4l2decoder, nvstreammux, nvinfer, and nvtracker, then installs a pad probe on the tracker's source pad.

Second, it reads DeepStream metadata. The C shim walks:

NvDsBatchMeta → NvDsFrameMeta → NvDsObjectMeta

and copies the fields I care about into plain Swift values: class ID, confidence, bounding box, tracker ID, and frame timestamp.
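A sketch of what those plain values could look like on the Swift side. Only DetectionFrame is a name from this post; the other type names, the field names, and the JSON encoding helper are assumptions for illustration. The integer widths follow the DeepStream struct fields they would be copied from (class_id, object_id, rect_params, buf_pts).

```swift
import Foundation

struct BoundingBox: Codable, Sendable, Equatable {
    var left: Float
    var top: Float
    var width: Float
    var height: Float
}

struct Detection: Codable, Sendable, Equatable {
    var classID: Int32        // NvDsObjectMeta.class_id
    var confidence: Float     // NvDsObjectMeta.confidence
    var box: BoundingBox      // NvDsObjectMeta.rect_params
    var trackerID: UInt64     // NvDsObjectMeta.object_id
}

struct DetectionFrame: Codable, Sendable, Equatable {
    var timestampNanoseconds: UInt64   // NvDsFrameMeta.buf_pts
    var detections: [Detection]
}

/// Hypothetical helper: serializes one frame as JSON text for a WebSocket message.
func encodeForWebSocket(_ frame: DetectionFrame) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]   // stable key order for easier diffing
    let data = try encoder.encode(frame)
    return String(decoding: data, as: UTF8.self)
}
```

Because these are value types with no pointers into GStreamer buffers, nothing in them can outlive the pad probe callback and pin NVIDIA memory.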

Third, it turns those detections into an AsyncStream<DetectionFrame>. The pad probe runs on the GStreamer streaming thread; the rest of the app consumes a typed async sequence. Multiple WebSocket clients each get their own continuation through a DetectionBroadcaster actor.
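The fan-out can be sketched roughly as follows. DetectionBroadcaster and DetectionFrame are names from this post, but the method names, the buffering policy, and the stand-in frame type are assumptions; AsyncStream.makeStream needs Swift 5.9 or later.

```swift
import Foundation

// Minimal stand-in for the app's frame type; the field is illustrative.
struct DetectionFrame: Sendable, Equatable {
    var frameNumber: Int
}

actor DetectionBroadcaster {
    private var continuations: [UUID: AsyncStream<DetectionFrame>.Continuation] = [:]

    /// Each subscriber gets its own stream backed by its own continuation.
    func subscribe() -> AsyncStream<DetectionFrame> {
        let id = UUID()
        // Assumption: keep only the newest frames so a slow client cannot grow memory.
        let (stream, continuation) = AsyncStream.makeStream(
            of: DetectionFrame.self,
            bufferingPolicy: .bufferingNewest(8)
        )
        // Drop the continuation when the client stops iterating or disconnects.
        continuation.onTermination = { [weak self] _ in
            Task { await self?.remove(id) }
        }
        continuations[id] = continuation
        return stream
    }

    /// Fans one frame out to every current subscriber.
    func publish(_ frame: DetectionFrame) {
        for continuation in continuations.values {
            continuation.yield(frame)
        }
    }

    private func remove(_ id: UUID) {
        continuations[id] = nil
    }
}
```

A WebSocket handler would then call subscribe() and loop with for await over the returned stream, while the pad probe side hands frames to publish(_:). How the synchronous probe callback hops onto the actor is left out here.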

Fourth, it serves HTTP and WebSocket clients using Hummingbird on SwiftNIO. Detection events fan out to the browser, and the same process exposes health checks and Prometheus-style metrics.
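For the metrics half, a minimal sketch of rendering counters and gauges in the Prometheus text exposition format. The Metric type and the metric name in the usage note are hypothetical, and the Hummingbird route that would serve this string is omitted.

```swift
struct Metric {
    enum Kind: String { case counter, gauge }
    var name: String
    var help: String
    var kind: Kind
    var value: Double
}

/// Renders metrics as Prometheus text: a HELP line, a TYPE line, then the sample.
func renderPrometheus(_ metrics: [Metric]) -> String {
    metrics.map { m in
        """
        # HELP \(m.name) \(m.help)
        # TYPE \(m.name) \(m.kind.rawValue)
        \(m.name) \(m.value)
        """
    }.joined(separator: "\n") + "\n"
}
```

Calling renderPrometheus([Metric(name: "detector_frames_total", help: "Frames seen", kind: .counter, value: 42)]) yields three lines ending in "detector_frames_total 42.0", which a Prometheus scraper accepts as a float sample.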

Fifth, it handles reconnect. rtspsrc does not magically recover from every upstream camera failure, so Swift watches the GStreamer bus for EOS and ERROR messages, tears down the graph, backs off, and rebuilds it. WebSocket subscribers stay connected while the pipeline cycles.
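The back-off decision can be isolated from GStreamer entirely. This sketch assumes exponential back-off with a cap, which the post does not specify. The BusEvent enum, the ReconnectPolicy type, and the 1 s and 30 s constants are all illustrative.

```swift
import Foundation

/// Simplified view of the bus messages the reconnect logic cares about.
enum BusEvent {
    case endOfStream
    case error(String)
    case stateChanged
}

struct ReconnectPolicy {
    var base: TimeInterval = 1    // first retry delay in seconds (assumed)
    var cap: TimeInterval = 30    // upper bound on the delay (assumed)
    var attempt = 0

    /// Returns how long to wait before rebuilding the graph,
    /// or nil when the event does not require a rebuild.
    mutating func delay(after event: BusEvent) -> TimeInterval? {
        switch event {
        case .stateChanged:
            return nil
        case .endOfStream, .error:
            // Double the delay on each consecutive failure, up to the cap.
            let delay = min(cap, base * pow(2, Double(attempt)))
            attempt += 1
            return delay
        }
    }

    /// Call once the rebuilt pipeline is running again to reset the back-off.
    mutating func pipelineIsPlaying() {
        attempt = 0
    }
}
```

Keeping the policy a plain value type means the teardown and rebuild code only has to ask it for a delay, and the WebSocket side never needs to know a cycle happened.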

An earlier version of this code mapped decoded frames into Swift directly through gst_buffer_map. NVMM leaked at ~65 MB/s until the process hit OOM in under two minutes: Swift had no CUDA context, so the surfaces were never released. Reading metadata instead of pixels makes that class of leak structurally impossible.

A Swift process wired in at this plugin point — metadata in, no pixels out — is the cleanest version of “Swift on Jetson” I know of.

Swift versus Python on the same graph

I also built the same DeepStream graph in Python.

Same camera. Same nvinfer engine. Same tracker. Same parser. Same pad-probe location. Both versions ran at about 21 FPS.

The difference was the host process around the graph:

Measurement      Swift                Python
Process CPU      ~26.6% of one core   ~52.1% of one core
Resident memory  676 MB               797 MB
nvmap pool       421,520 kB           421,520 kB

The CPU numbers were steady-state measurements over 5-minute windows and reproduced across two independent samples: 1.93× and 1.96× difference between the Python and Swift host processes.
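As a quick check on the arithmetic, the rounded values from the measurement table reproduce a ratio in the same range. This is only a recomputation from the published figures, not additional measurement data.

```swift
import Foundation

let swiftCPU = 26.6, pythonCPU = 52.1     // percent of one core
let swiftRSS = 676.0, pythonRSS = 797.0   // resident memory in MB

let cpuRatio = pythonCPU / swiftCPU       // Python host CPU relative to Swift
let rssDelta = pythonRSS - swiftRSS       // extra resident memory in the Python host

print(String(format: "CPU ratio: %.2fx", cpuRatio))   // CPU ratio: 1.96x
print(String(format: "RSS delta: %.0f MB", rssDelta)) // RSS delta: 121 MB
```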

The GPU-side nvmap pool matched to the kilobyte across runs, which is what I expected: that pool belongs to DeepStream, not to Swift or Python.

The CPU difference appears to come from the metadata path. The Swift version flattens DeepStream structs through a small C shim into fixed-size Swift values. The Python version walks the same metadata through pyds/pybind11 under the GIL.

That is not a universal benchmark. It is one workload, one stream, one graph. The next useful experiment is multi-stream scale.

The Swift version has crossed 26 hours of continuous operation with no OOMs since the last deploy.

What WendyOS gave me

The Jetson was bootstrapped headless with WendyOS and deployed with the Wendy CLI.

That got me most of the way:

  • a JetPack-derived base;
  • wendy run for OCI containers;
  • CDI specs generated by nvidia-ctk;
  • a builder path for cross-compiling Swift to Jetson Linux.

The remaining work was not Swift language work. It was integration work:

  1. fix the CDI spec so the container could discover the Jetson video plugin stack;
  2. rebuild the Swift SDK sysroot so the C shim could compile and link against DeepStream.

CDI was missing Jetson video plugin paths

On JetPack 6 / L4T R36.x, Jetson hardware decode is not just “pass /dev/nvhost-nvdec into the container and call it done.” JetPack 6 moved the public decoder node to /dev/v4l2-nvdec, and NVIDIA’s own forum guidance says to use /dev/v4l2-nvdec instead of the older /dev/nvhost-nvdec.

On R36.5, NVIDIA still ships the Jetson V4L/libv4l2 userspace stack, including libv4l2_nvvideocodec.so, libnvv4l2.so, and libtegrav4l2.so. The Jetson Linux R36.5 package manifest describes libtegrav4l2.so as a helper library for multimedia hardware acceleration and also lists nvdec_t234_prod.fw, the NVDEC firmware for T234.

The trap is that this path depends on userspace discovery as much as device-node injection. nvv4l2decoder needs the NVIDIA GStreamer pieces and the NVIDIA V4L/libv4l2 plugin files to be visible inside the container at the paths the container actually scans.

In my case, the container started cleanly but the hardware decode path was still broken. nvv4l2decoder failed during buffer setup; depending on the pipeline, GStreamer either errored out or fell back to software decode such as avdec_h264.

The underlying CDI issue is in how the Tegra spec gets generated. nvidia-ctk cdi generate --mode=csv builds it from NVIDIA’s l4t.csv, devices.csv, and drivers.csv files under /etc/nvidia-container-runtime/host-files-for-container.d, but the Tegra CSV parser in nvidia-container-toolkit 1.16.2 only accepts dev, dir, lib, and sym rows. An env,... row therefore never reaches the generated CDI YAML. The same parser shape is still present in 1.18.2.

My workaround was to post-process the generated CDI YAML and add the missing containerEdits.env entries for the Jetson video path — in particular the GStreamer plugin path and the V4L/libv4l2 plugin path used by this image. CDI itself supports environment edits in containerEdits.env, so this is a schema-valid patch. NVIDIA’s Tegra CSV-to-CDI conversion just does not emit it.
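The post-processing step could look something like this line-based sketch. It is not the script used for this project: the function name is hypothetical, it assumes the spec has a top-level containerEdits: key in block style, and it edits text rather than parsing YAML. The entries to add are left to the caller because the exact variables and paths depend on the image.

```swift
import Foundation

/// Returns a copy of a CDI YAML spec with extra entries in the top-level
/// `containerEdits.env` list, or nil if there is no top-level `containerEdits:` key.
func addingEnv(_ entries: [String], toCDISpec yaml: String) -> String? {
    var lines = yaml.components(separatedBy: "\n")
    guard let edits = lines.firstIndex(of: "containerEdits:") else { return nil }

    // The top-level block ends at the next non-empty line that starts in column zero.
    let blockEnd = lines[(edits + 1)...].firstIndex { line in
        !line.isEmpty && !line.hasPrefix(" ")
    } ?? lines.endIndex
    let block = lines[(edits + 1)..<blockEnd]

    func indent(of line: String) -> String {
        String(line.prefix(while: { $0 == " " }))
    }
    // Match the indentation the generator used for keys under containerEdits.
    let childIndent = block.first(where: { !$0.isEmpty }).map(indent(of:)) ?? "  "

    if let envKey = block.firstIndex(of: childIndent + "env:") {
        // An env list already exists: reuse the indentation of its first item.
        let next = envKey + 1
        let itemIndent = next < blockEnd ? indent(of: lines[next]) : childIndent
        lines.insert(contentsOf: entries.map { itemIndent + "- " + $0 }, at: next)
    } else {
        // No env list yet: create one directly under containerEdits.
        let added = [childIndent + "env:"] + entries.map { childIndent + "- " + $0 }
        lines.insert(contentsOf: added, at: edits + 1)
    }
    return lines.joined(separator: "\n")
}
```

For the GStreamer half, an entry would take the form GST_PLUGIN_PATH=<plugin directory>, using the directory where this image keeps the NVIDIA GStreamer plugins. Matching only at the top-level indentation matters because CDI hooks can carry their own env lists deeper in the same block.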

It is not elegant, but it is the least surprising fix: make the generated CDI spec describe both halves of the Jetson video stack — the device/library mounts and the userspace plugin discovery paths.

The Swift SDK sysroot needed DeepStream headers

The second fix was build-time only.

Wendy builds apps in a dedicated builder container, so Swift needs a target SDK/sysroot that looks like the Jetson Linux system it is compiling for. The old Wendy SDK was enough for normal Swift code, but not for my DeepStream C shim.

A Swift SDK is the bundle of target libraries, headers, and configuration needed for cross-compilation. In my case, the SDK had to expose the DeepStream headers imported by the shim and enough link-time symbols for libnvdsgst_meta and libnvds_meta.
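On the package side, the link-time requirement might be expressed like this. It is a sketch of a Package.swift, not the repo's actual manifest: the package and target names are hypothetical, and it assumes the SDK sysroot already places the DeepStream headers and libraries on the toolchain's default search paths (otherwise extra include and library path flags would be needed).

```swift
// swift-tools-version: 5.9
import PackageDescription

let package = Package(
    name: "Detector",                       // hypothetical package name
    targets: [
        // C target holding the shim that includes the DeepStream metadata headers.
        .target(
            name: "CDeepStreamShim",        // hypothetical target name
            linkerSettings: [
                // Resolved against the sysroot at build time; the real
                // libraries are supplied by the container at runtime.
                .linkedLibrary("nvdsgst_meta"),
                .linkedLibrary("nvds_meta"),
            ]
        ),
        // Swift executable that imports the shim as a Clang module.
        .executableTarget(
            name: "Detector",
            dependencies: ["CDeepStreamShim"]
        ),
    ]
)
```

With a manifest like this, a sysroot missing either the headers or the two libraries fails at compile or link time in the builder container, which is the failure the rebuilt SDK is meant to remove.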

The standard WendyOS Swift SDK did not include those pieces. I cloned the WendyOS Yocto project, added the DeepStream development surface, rebuilt the SDK image, extracted a corrected sysroot, and generated a new Swift SDK from that sysroot.

At runtime, nothing changes. The real DeepStream libraries still come from the Jetson/DeepStream runtime path and the container/CDI setup. The custom Swift SDK exists only to make the builder container capable of producing the Jetson binary.

I also stopped pulling the full DeepStream runtime into the SDK image. The SDK only needs headers and link-time artifacts; the runtime comes from the Jetson/DeepStream container path.

The reproducible shape is:

1. extract target sysroot
2. package sysroot as a Swift SDK artifact bundle
3. install it with swift sdk install
4. build with swift build --swift-sdk <jetson-sdk-id>

SwiftPM provides swift sdk commands for managing SDK bundles, and swift-sdk-generator is the natural starting point for generating custom cross-compilation SDKs.

If I publish a prebuilt binary SDK, I’ll verify the redistribution situation first, because the generated sysroot may contain NVIDIA-delivered files.

What I learned

Swift was not the hard part.

Once the GStreamer graph was correct, Swift made a good host language for the parts around DeepStream: configuration, typed metadata, reconnect logic, metrics, and WebSockets.

The hard parts were below the language:

  • the container had to see the Jetson video plugin stack, not just the device nodes;
  • the builder needed a real Swift SDK sysroot with the DeepStream development surface;
  • the runtime libraries and build-time link artifacts had to be treated as separate problems.

None of this is Swift-the-language. It is what a Swift-on-Jetson stack needs around the language: a container runtime that describes the NVIDIA video stack completely, a sysroot that can compile against the headers your C shim imports, and an image/build recipe that keeps runtime and SDK concerns separate.

With those in place, wendy run against the Swift detector container is uneventful — which is the point.

Swift on Jetson works best when it is not pretending to be CUDA, DeepStream, or GStreamer. Let NVIDIA’s stack move pixels and use unified memory effectively. Let Swift own the control plane.


Repo: github.com/mihai-chiorean/deepstream-swift-detector — the standalone Swift detector. The Python sibling (same engine, same parser, same tracker) and the broader WendyOS sample collection it came out of are in wendylabsinc/samples under samples/deepstream-vision/.