Resources

Blog

Technical Report | March 18, 2026
Cross-Vendor Disaggregated Inference: GPT-OSS 120B across NVIDIA H100 and AMD MI300X
MoAI Inference Framework enables cross-vendor disaggregation with H100 for prefill and MI300X for decode, achieving up to 43% lower latency and 67% higher throughput vs. a single-vendor cluster.
Technical Report | March 17, 2026
Multi-Node Disaggregated Inference: DeepSeek R1 671B on AMD Instinct MI300X GPUs
Moreh’s Disaggregated Inference achieves up to 1.84x lower end-to-end latency and a 12–51x reduction in P99 inter-token latency for DeepSeek R1 671B on a 5-node AMD MI300X cluster.
Blog | March 16, 2026
Moreh Unlocks AMD MI300X Potential: 1.5× Faster DeepSeek R1 Inference vs. SGLang (InferenceMax)
Moreh’s optimized inference engine achieves a 1.47x improvement in end-to-end latency and throughput per GPU for DeepSeek R1 on AMD MI300X, compared to the InferenceMAX baseline.
Technical Report | February 5, 2026
TIDE: Temporal Incremental Draft Engine for Self-Improving LLM Inference
TIDE continuously improves inference speed by training a lightweight draft model in the background on idle GPUs in the cluster, with no extra data preparation or downtime required.
Technical Report | January 30, 2026
HetCCL: Accelerating LLM Training with Heterogeneous GPUs
HetCCL is the first cross-vendor collective communication library enabling GPUDirect RDMA communication across NVIDIA and AMD GPUs.
Customer Case | December 29, 2025
Step3 Inference Optimization on AMD Instinct MI308X: 1.30× Higher Decode Throughput vs. NVIDIA H20
Moreh optimized StepFun’s Step3 321B MoE model for AMD Instinct MI308X GPUs, achieving 1.30× higher decode throughput and 23% lower decode latency compared to NVIDIA H20.
Blog | December 26, 2025
Optimizing Long-Context Prefill on Multiple (Older-Generation) GPU Nodes
SLOPE Engine improves long-context prefill performance by applying context parallelism across multiple GPU servers. This also helps efficiently utilize older-generation GPUs.
Customer Case | November 25, 2025
Telco LLM Inference Optimization on AMD MI300X: 1.38× Higher Serving Capacity
Moreh optimized a Korean telco’s affiliate-developed 7.8B LLM for AMD MI300X, achieving 1.38× higher SLO-compliant serving capacity and 1.30× higher single-request throughput vs. NVIDIA H100.
Technical Report | November 18, 2025
Moreh-Tenstorrent AI Data Center Solution System Architecture
Moreh combines Tenstorrent’s lightweight and scalable hardware with its proprietary software stack to deliver an efficient and flexible solution for large-scale AI data centers.
Technical Report | November 13, 2025
21K Output Tokens Per Second DeepSeek Inference on AMD Instinct MI300X GPUs with Expert Parallelism
Moreh demonstrated that DeepSeek-R1 inference can run at a decoding throughput of more than 21,000 tokens/sec by implementing expert parallelism (EP) on the ROCm software stack.
Blog | November 10, 2025
Runtime Draft Model Training: Adapting Speculative Decoding to Real-World Workloads
TIDE provides a method to optimize inference computation on newer GPUs by utilizing older or idle GPUs for runtime draft model training, resulting in better overall cost-performance at the system level.
Blog | September 23, 2025
Distributed Inference on Heterogeneous Accelerators Including GPUs, Rubin CPX, and AI Accelerators
MoAI Inference Framework supports automatic and efficient distributed inference on heterogeneous accelerators such as AMD MI300X + MI308X and NVIDIA Rubin CPX + GPU.
Technical Report | August 30, 2025
Moreh vLLM Performance Evaluation: Llama 3.3 70B on AMD Instinct MI300X GPUs
Moreh vLLM achieves 1.68x higher output TPS, 2.02x lower TTFT, and 1.59x lower TPOT compared to the original vLLM for Meta’s Llama 3.3 70B model.
Technical Report | August 29, 2025
Moreh vLLM Performance Evaluation: DeepSeek V3/R1 671B on AMD Instinct MI300X GPUs
Moreh vLLM achieves 1.68x higher output TPS, 1.75x lower TTFT, and 1.70x lower TPOT compared to the original vLLM for the DeepSeek V3/R1 671B model.
Blog | February 20, 2025
DeepSeek V3 and R1 on MoAI: 1. Fine-Tuning on AMD GPU Clusters
MoAI provides a PyTorch-compatible environment that makes LLM fine-tuning on hundreds of AMD GPUs easy, including for the DeepSeek 671B MoE model.
Blog | December 2, 2024
Introducing Motif: A High-Performance Open-Source Korean LLM by Moreh
Moreh announces the release of Motif, a high-performance 102B-parameter Korean large language model (LLM), which will be made available as an open-source model.
Blog | September 3, 2024
Fine-tuning Llama 3.1 405B on AMD GPUs
There are no barriers to fine-tuning Llama 3.1 405B on the MoAI platform. The Moreh team has demonstrated fine-tuning the model on 192 AMD GPUs.
Blog | August 19, 2024
GPU Virtualization in the MoAI Platform
The MoAI platform provides comprehensive GPU virtualization including fine-grained resource allocation, multi-GPU scaling, and heterogeneous GPU support.
Blog | August 14, 2023
Training 221B Parameter Korean LLM on 1,200 AMD MI250 GPU Cluster
Moreh trained the largest-ever Korean LLM, with 221B parameters, on the MoAI platform and a cluster of 1,200 AMD MI250 GPUs.
Blog | November 11, 2022
KT’s Success Stories in AI Cloud Service and Large AI Model Training on AMD Instinct MI250 and Moreh AI Platform
KT has collaborated with Moreh and AMD to overcome challenges in public cloud services and in-house AI model development.