Inference software company

Optimal LLM Inference on Every Accelerator

From custom kernels to distributed serving, we build the full-stack software that unlocks peak inference performance on AMD GPUs, Tenstorrent chips, and heterogeneous clusters.

1.68× vs ROCm vLLM: DeepSeek R1 on a single server
20,000+ tok/s per node: DeepSeek R1 on MI300X cluster
1.7× with cross-vendor GPUs: NVIDIA + AMD PD disaggregation
2.2× throughput on 40% fewer servers: Prefix cache-aware routing

Full-Stack Inference Software

From Kernels to Clusters

Moreh covers the entire inference stack across heterogeneous accelerators — from chip-level kernels to distributed serving.

MoAI Inference Framework

Routing & Scheduling · Auto Scaling · SLO-Driven Optimization · KV Cache

Moreh vLLM

SOTA Model Optimization · Quantization · Graph Execution

Native vLLM

Moreh Libraries

Custom Kernels · GEMM/Attention/MoE · Communication

AMD Instinct GPUs

Tenstorrent Chips

NVIDIA GPUs

Value

Why Moreh

Three ways our inference software creates value for your AI infrastructure.

Inference on Non-NVIDIA Accelerators

Full-stack software, from kernels to a cluster-level framework, that is optimized for AMD GPUs and enables inference on Tenstorrent chips.

Heterogeneous GPU Inference

Unify GPUs across vendors, architectures, and generations into a single inference cluster — maximizing the efficiency of every chip in your data center.

Inference Cost Optimization

Maximize tokens per dollar through chip-level optimization, communication optimization, and multi-vendor infrastructure utilization.

Ecosystem

Ecosystem & Open Source

We contribute to the open-source ecosystem and partner with leading chip vendors.

AMD ROCm
llm-d
Tenstorrent Metalium
SGLang
SkyPilot