ODINO · NATIVE INFERENCE ENGINE · JETSON THOR

Built because the alternatives did not fit the mission.

Odino is a purpose-built inference engine for NVIDIA Jetson AGX Thor, designed to host the reasoning core of the Tin Man cognitive cluster within the embedded memory envelope. FP8 native execution. Zero-copy memory mapping. Sub-two-second cold start. Built so that the cluster's Vision, Audio, and Real-time cores have the unified memory they need to function.

THE CONSTRAINT

128 GiB unified memory. Twelve cores wanting in.

Jetson AGX Thor 5000 delivers 128 GiB of LPDDR5x unified memory — substantial for an edge node, finite for a 12-core cognitive cluster. The first iteration of Tin Man used vLLM as the LLM serving engine for the Chat Core. vLLM is excellent in its native environment (data center, dedicated GPU per process). On Thor, with shared unified memory, vLLM's preallocated KV-cache and idle-time memory footprint consumed approximately 67 GiB even at idle. The Vision Core crashed under memory pressure, the Real-time loop accumulated jitter from page faults, and the Audio Core could not run reliably alongside them.

Odino was built to solve a specific problem: deliver the same reasoning capability inside roughly 15 GiB of memory and start in under two seconds, so that the other cores of the cluster have the headroom they require to operate.

PHASE 1 · VALIDATED PERFORMANCE

The numbers that justified the build.

The table below compares Odino Phase 1 to the prior vLLM configuration on the same Jetson AGX Thor 5000 hardware, executing the same FP8-quantized chat model. Validated 2025-12-06.

Metric                       vLLM (prior)          Odino Phase 1 (validated)   Δ
Memory occupation (idle)     ~67 GiB               381 MiB                     −99.4%
Memory occupation (loaded)   ~70 GiB (saturated)   ~15 GiB                     −78%
Cold-start time              30–60 s               < 2 s                       −96%

The recovered memory — approximately 55 GiB in the loaded state (~70 GiB down to ~15 GiB) — is what allows Tin Man v2.1 to operate Vision (RADIO ViT-H/16), Audio (Parakeet Conformer), and Real-time perception cores alongside the reasoning core on a single Jetson Thor node.

ENGINEERING PRINCIPLES

Native compilation. Direct memory. No intermediate framework.

Three architectural decisions, applied uniformly:

  • FP8 native compilation: TensorRT static compilation to a single binary engine plan, targeting Blackwell sm_110a Tensor Core kernels. No runtime framework reinterpretation layer.
  • Zero-copy memory mapping: the engine plan file (~15 GiB) is mmap’d directly into the process address space. Thor’s unified memory architecture lets the GPU access pages backed by the NVMe device, with paging managed by the OS. No load phase.
  • Direct execution dispatch: GPU kernels are launched from bare-metal pointers to device buffers. No framework tensor abstraction in the hot path.

Each decision trades portability for performance — accepted, because Odino is purpose-built for Tin Man's Chat Core on this specific hardware, not a general-purpose inference framework.

DETERMINISM

Same input. Same output. Auditable.

Beyond raw throughput, Odino's static compilation and bare-metal execution produce a property that mattered architecturally: deterministic output. For a given input and a fixed random seed, Odino produces the same output every time — unlike cloud inference services, which introduce non-deterministic variance through load balancing, retries, and silent A/B model swaps.

This determinism is what allows the Shield Brain Guardian Core to enforce strict safety gates — including AI Act compliance rules — on the reasoning output, knowing that each input maps to one reproducible response rather than a statistical distribution across inferences.

INTEGRATION

A component, not a product.

Odino is not delivered as a standalone product. It is the Chat Core of the Tin Man cognitive cluster — the reasoning layer that interprets text-mode inputs and produces the multimodal prompt assembly that downstream cores consume. Its value is realized through the integration with Vision, Audio, Brainstem, Memory, and Realtime cores under the Shield Brain control architecture.

Phase 1 of Odino is single-token stateless execution (full attention recomputation per token, chosen to validate the FP8 native pipeline end-to-end). Phase 2 introduces KV-cache management to extend operational range to long-context interaction; sequencing is governed by the cluster roadmap rather than published in isolation.

View Tin Man specifications →

ENGAGE

Engineering deep dives are released under partnership.

The full Odino technical report, integration notes for the Tin Man cluster, Phase 2 caching architecture, and the embedded-AI engineering pathway are released under non-disclosure agreement. Engage to begin a conversation.

Direct: engage@reinventy-solutions.ca