AMD Rolls Out Gemma 4 Model Support Across Full Range of GPUs & CPUs
AMD has rolled out official support for Google's Gemma 4 family of open-weights AI models across its full range of GPUs & CPUs.
AMD Radeon GPUs & Ryzen AI CPUs Fully Support Google’s Gemma 4 AI Model
Google has rolled out its latest family of open-weights AI models, called Gemma 4, which spans a range of sizes from 2B to 31B parameters. With this announcement, AMD is rolling out support across its entire Radeon GPU and Ryzen AI CPU lineup.
Press Release: AMD is proud to provide Day Zero support for the full set of Gemma 4 models across our portfolio of AI-enabled hardware.
This includes AMD Instinct GPUs for cloud and enterprise datacenters, AMD Radeon GPUs for AI workstations, and AMD Ryzen AI processors for AI PCs. Support includes integration with popular AI applications such as LM Studio, as well as open-source software projects including vLLM, SGLang, llama.cpp, Ollama, and Lemonade.
Deploying with vLLM
Gemma 4 can be deployed on AMD GPUs using vLLM to take advantage of the many optimizations in this inference framework, particularly around serving multiple concurrent requests. The whole range of AMD GPUs supported by vLLM, spanning multiple generations of both Instinct and Radeon GPUs, can be used with the Gemma 4 models. This support is planned for both the Gemma 4 launch build of upstream vLLM and future nightly builds, installable as either a Docker image or a Python package using the process documented at https://vllm.ai/.
The Docker image can be pulled as follows:

```
docker pull vllm/vllm-openai-rocm:gemma4
```

For all AMD GPUs, vLLM can be invoked with the TRITON_ATTN backend:

```
vllm serve vllm/vllm-openai-rocm:gemma4 --attention-backend TRITON_ATTN
```
Support for other attention backends with additional optimizations on MI300 and MI350-series GPUs is planned to be available soon.
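Once the server is up, vLLM exposes an OpenAI-compatible HTTP API (by default at http://localhost:8000/v1/chat/completions). As a minimal sketch, the helper below builds a chat-completions request body; the model name "gemma-4" is a placeholder for whatever name the server registers:

```python
import json

# Build a chat-completions request body for vLLM's OpenAI-compatible
# endpoint (POST http://localhost:8000/v1/chat/completions).
# The model name passed in is a placeholder, not an official identifier.
def chat_payload(model: str, prompt: str, max_tokens: int = 128) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = chat_payload("gemma-4", "Summarize the Gemma 4 model family.")
print(json.dumps(body, indent=2))
```

The same body works against any of the OpenAI-compatible servers mentioned in this article, which is what makes the various deployment paths interchangeable from the client's point of view.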
Deploying with SGLang
Gemma 4 can also be deployed on AMD MI300X/MI325X/MI35X GPUs using SGLang, which provides high-performance serving.
SGLang supports the full Gemma 4 family, including the dense models (E2B, E4B, 31B) and the MoE variant (26B-A4B). This support is available in the Gemma 4 launch build of SGLang, distributed as a Docker image; see https://cookbook.sglang.io/ for setup instructions.
All Gemma 4 models require the Triton attention backend for bidirectional image-token attention.
SGLang can be invoked as follows:
```
python3 -m sglang.launch_server --model-path <gemma-4-checkpoint> --attention-backend triton --tp 1
```

where `<gemma-4-checkpoint>` stands for the path or identifier of the chosen Gemma 4 model.
The Gemma 4 model fits on a single MI300X GPU (192 GB HBM) at TP=1 with the full context length. For higher-throughput workloads, tensor parallelism can be increased (e.g., --tp 2).
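The single-GPU claim is easy to sanity-check with a back-of-the-envelope estimate. As a rough sketch, assuming bf16 (2-byte) weights and ignoring KV-cache and activation overhead:

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumes bf16 (2 bytes/param); KV cache and activations add more on top.
def weight_gib(params_billions: float, bytes_per_param: int = 2) -> float:
    return params_billions * 1e9 * bytes_per_param / 2**30

# Weights of the 31B dense model alone, leaving ample headroom
# within a single MI300X's 192 GB of HBM:
print(f"{weight_gib(31):.1f} GiB")
```

Roughly 58 GiB of weights leaves well over 100 GB for KV cache, which is why a long context still fits at TP=1.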
Deploying on local hardware with LM Studio
Gemma 4 models can be easily and performantly deployed on AMD hardware through the open-source llama.cpp project and LM Studio. Users can quickly spin up these models on supported hardware, such as AMD Ryzen AI and Ryzen AI Max processors, as well as Radeon and Radeon PRO graphics cards, by downloading the popular LM Studio application and pairing it with the latest AMD Software: Adrenalin Edition drivers.
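LM Studio can also expose the loaded model through a local OpenAI-compatible server (by default at http://localhost:1234/v1). As a sketch, the helper below filters an OpenAI-style GET /v1/models listing for Gemma entries; the sample response and model IDs are illustrative, not official names:

```python
# Pick out Gemma models from an OpenAI-style GET /v1/models response.
def gemma_models(models_json: dict) -> list[str]:
    return [m["id"] for m in models_json["data"] if "gemma" in m["id"].lower()]

# Illustrative response body, as a local LM Studio server might return it:
sample = {"data": [{"id": "gemma-4-e4b-it"}, {"id": "llama-3.1-8b"}]}
print(gemma_models(sample))  # ['gemma-4-e4b-it']
```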
Deploying on local hardware with Lemonade Server
Lemonade Server enables deployment of Gemma 4 models on AMD hardware through an open-source local LLM server with OpenAI‑compatible APIs. It supports acceleration on AMD Radeon and Radeon PRO GPUs via ROCm, and on AMD Ryzen AI processors using the XDNA 2 NPU.
GPU deployment with Lemonade and ROCm
To run Gemma 4 on AMD GPUs with ROCm acceleration:
- Install Lemonade and download the preview ROCm build of llama.cpp for your GPU architecture from the release artifacts (e.g., llama-windows-rocm-gfx1151-x64 for Radeon 8060S).
- Point Lemonade to the ROCm build by setting the environment variable:

```
export LEMONADE_LLAMACPP_ROCM_BIN=/path/to/llama-server
```
- Start Lemonade and load the Gemma 4 model via the API:
```
lemonade-server serve

curl http://localhost:8000/api/v1/pull \
  -H "Content-Type: application/json" \
  -d '{"model_name": "user.Gemma-4-E4B-IT", "checkpoint": "", "recipe": "llamacpp"}'
```
- Chat with the model via the OpenAI-compatible API:
```
curl http://localhost:8000/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "user.Gemma-4-E4B-IT", "messages": [{"role": "user", "content": "Hello!"}], "llamacpp": "rocm"}'
```
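The reply comes back in the standard OpenAI chat-completions shape, so extracting the assistant's text works the same as with any OpenAI-compatible server. A minimal sketch (the sample response below is illustrative, trimmed to the fields actually used):

```python
# Pull the assistant's text out of an OpenAI-style chat-completions response.
def first_reply(response: dict) -> str:
    return response["choices"][0]["message"]["content"]

# Illustrative response body:
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello! How can I help?"}}
    ]
}
print(first_reply(sample))  # Hello! How can I help?
```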
NPU deployment with Ryzen AI
Developers will be able to deploy Gemma 4 models on the NPU through Lemonade Server, which supports the latest AMD XDNA 2 NPU. NPU support for the Gemma 4 E2B and E4B models will arrive with the next Ryzen AI software update. The update will be integrated into Lemonade and will also be available to developers directly through OnnxRuntime APIs.
