The best LLM inference performance on AMD GPUs

Moreh’s optimized version of vLLM improves inference throughput and latency with our proprietary libraries, highly tuned for AMD GPUs, while retaining the broad model support and optimization techniques of the original open-source vLLM.

Showcase

DeepSeek-R1 671B Inference

Moreh vLLM delivers industry-leading performance for the DeepSeek-R1 model on AMD Instinct MI300 series GPUs.

Chart: Normalized output tokens per second (ROCm vLLM = 1)

All numbers were measured on an 8x AMD Instinct MI300X GPU server using vLLM’s benchmark_serving tool.

Performance Evaluation Reports

3 Ways to Get Started

Install on Existing AMD GPU Servers

Moreh vLLM is packaged in a Docker container and can be easily deployed to your server.
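Once the container is running, it can be queried like a standard vLLM server. Below is a minimal sketch assuming the container exposes vLLM’s usual OpenAI-compatible API on port 8000 and serves DeepSeek-R1; the host, port, and model name are illustrative assumptions, not taken from the original text.

```python
# Minimal sketch: query a running Moreh vLLM container via the
# OpenAI-compatible API that vLLM normally exposes.
# Assumptions (not from the original text): the server listens on
# localhost:8000 and serves the DeepSeek-R1 model under this name.
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Summarize what vLLM does."}],
        "max_tokens": 128,
    },
    timeout=60,
)
response.raise_for_status()
# Print the generated reply from the first choice.
print(response.json()["choices"][0]["message"]["content"])
```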

Request On-Demand Optimization for Private AI Models

Moreh can deliver on-demand vLLM optimization so that customers’ proprietary AI models can be optimized and served on AMD GPUs.

Try on Moreh Cloud