Optimal LLM Inference on Every Accelerator
From custom kernels to distributed serving, we build the full-stack software that unlocks peak inference performance on AMD GPUs, Tenstorrent chips, and heterogeneous clusters.
1.68×
vs ROCm vLLM
DeepSeek R1 on a single server
20,000+
tok/s per node
DeepSeek R1 on MI300X cluster
1.7×
with cross-vendor GPUs
NVIDIA + AMD prefill/decode (PD) disaggregation
2.2×
throughput on 40% fewer servers
Prefix cache-aware routing
Full-Stack Inference Software
From Kernels to Clusters
Moreh covers the entire inference stack across heterogeneous accelerators — from chip-level kernels to distributed serving.
MoAI Inference Framework
Routing & Scheduling · Auto Scaling · SLO-Driven Optimization · KV Cache
Moreh vLLM
SOTA Model Optimization · Quantization · Graph Execution
Native vLLM
Moreh Libraries
Custom Kernels · GEMM/Attention/MoE · Communication
AMD Instinct GPUs
Tenstorrent Chips
NVIDIA GPUs
Why Moreh
Three ways our inference software creates value for your AI infrastructure.
Inference on Non-NVIDIA Accelerators
Full-stack software, from kernels to a cluster-level framework, optimized for AMD GPUs and enabling inference on Tenstorrent chips.
Heterogeneous GPU Inference
Unify GPUs across vendors, architectures, and generations into a single inference cluster — maximizing the efficiency of every chip in your data center.
Inference Cost Optimization
Maximize tokens per dollar through chip-level optimization, communication optimization, and multi-vendor infrastructure utilization.
From Our Blog
View all ›
Cross-Vendor Disaggregated Inference: GPT-OSS 120B across NVIDIA H100 and AMD MI300X
March 18, 2026
The MoAI Inference Framework enables cross-vendor disaggregation, with H100 handling prefill and MI300X handling decode, achieving up to 43% lower latency and 67% higher throughput vs. a single-vendor cluster.

Multi-Node Disaggregated Inference: DeepSeek R1 671B on AMD Instinct MI300X GPUs
March 17, 2026
Moreh’s Disaggregated Inference achieves up to 1.84× lower end-to-end latency and a 12–51× reduction in P99 inter-token latency for DeepSeek R1 671B on a five-node AMD MI300X cluster.

Moreh Unlocks AMD MI300X Potential: 1.5× Faster DeepSeek R1 Inference vs. SGLang (InferenceMAX)
March 16, 2026
Moreh’s optimized inference engine achieves a 1.47× improvement in end-to-end latency and per-GPU throughput for DeepSeek R1 on AMD MI300X, compared to the InferenceMAX baseline.
Ecosystem & Open Source
We contribute to the open-source ecosystem and partner with leading chip vendors.