By Nayan Lad, Yuchen Fama, Zhiming Shen
The future of compute is open: Intel and Exostellar lead the way
As AI and machine learning applications evolve, organizations are demanding greater compute power and smarter orchestration to train and deploy models at scale. The architecture of the Intel® Gaudi® AI accelerator is built to meet this need, delivering high-performance, cost-effective compute tailored for AI workloads. But unlocking the full value of these systems requires more than cutting-edge hardware; it demands intelligent resource management, workload prioritization, and a seamless user experience.
That’s why Intel and Exostellar are excited to announce a strategic collaboration. By combining Intel® Gaudi® with Exostellar’s advanced orchestration solution, Multi-Cluster Operator, we’re enabling customers to maximize utilization, control access, and streamline the sharing of Intel Gaudi compute resources across teams and projects. Together, we deliver an end-to-end solution with support for quota enforcement, dynamic borrowing, fair queuing, and priority-based scheduling, bringing cloud-like agility and efficiency to on-prem or hybrid AI infrastructure.
This collaboration parallels what the industry has seen with NVIDIA and run.ai, but extends the model to a broader, more competitive AI hardware ecosystem. With this collaboration, we’re not just improving performance; we’re empowering organizations to build and scale AI initiatives faster, more efficiently, and more cost-effectively.
The Problem: GPU Infrastructure Bottlenecks
According to findings by Exostellar, enterprise AI infrastructure teams face mounting challenges:
- High upfront costs and significant GPU resource waste – Building large GPU clusters can require hundreds of millions in capital investment, yet real-world utilization often falls below 50%, and in many cases under 15%.[1]
- Longer wait times – Developers often wait days or weeks for GPU availability
- Low utilization – Without dynamic orchestration, large portions of GPU fleets remain unused for extended periods.
- Vendor lock-in – Limits flexibility in hardware and software choices
The Solution: Intelligent Orchestration Meets Open Architecture
This solution combines Intel Gaudi’s price-performance with Exostellar's advanced orchestration capabilities.
Intel Gaudi 3 AI accelerators are purpose-built for high-efficiency AI training and inference across deep learning, large language models (LLMs), and generative AI workloads:
- Scalability: Processing power and networking bandwidth enable scale across large clusters without bottlenecks.
- Advanced Memory Architecture: Optimized for transformer models, with up to 128 GB of high-bandwidth memory for large models and datasets.
- Developer-Friendly Software Stack: Supports frameworks like PyTorch and TensorFlow, and inferencing tools like vLLM, TensorRT-LLM, and Ray, offering flexible integration for AI development.
- Model Optimization: Optimized versions of popular models, available on Hugging Face, ensure they run as fast as possible on Intel Gaudi hardware.
- Greater Diversity: Expands options for deploying LLMs beyond traditional GPUs.
Exostellar Multi-Cluster Operator: Kubernetes-Native AI Orchestration
- xPU software-defined virtualization for unified resource pooling and sharing across heterogeneous clusters and hardware.
- Hierarchical quota management mirroring organizational structure, with an easy-to-use UI for non-technical quota managers.
- Cross-team quota borrowing and resource sharing maximize utilization and minimize idle time.
- Priority-based preemption ensures critical workloads get resources while preserving fairness across teams.
Enterprise Benefits
1. Cost Optimization
- 2-3x improvement in developer iteration cycles, accelerating time-to-market
- Autonomous resource optimization with overquota, binpacking, and advanced scheduling
2. Operational Simplicity
- Single pane of glass for multi-cluster and multi-vendor environments
- 40-80% reduction in manual quota management and resource allocation
3. Future-Proof Architecture
- Vendor-agnostic design supports Intel Gaudi, NVIDIA, AMD, and emerging accelerators
- Kubernetes-native integration with existing workflows as a drop-in infra middleware
- Open standards preventing vendor lock-in
Technical Deep Dive: Multi-Cluster Operator Features
Software-Defined xPU Virtualization: Multi-Cluster Operator unifies diverse hardware including Intel Gaudi, Xeon CPUs, and traditional GPUs into pooled resources with granular allocation controls.
1. Advanced Resource Pooling Features:
- Non-overlapping pool enforcement preventing resource conflicts across hardware types
- Logical grouping of nodes with specific labels (group by vendor/model, network proximity, etc.)
- Quota borrowing allowing temporary resource sharing during peak demand
- Namespace isolation enabling secure multi-tenant environments across hardware architectures
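As a rough illustration (names and data shapes here are hypothetical, not Exostellar's actual API), non-overlapping pool enforcement can be thought of as grouping nodes by label selectors and rejecting any node that matches more than one pool:

```python
from collections import defaultdict

def build_pools(nodes, pool_selectors):
    """Group nodes into named pools by label selectors.

    Raises if any node matches more than one pool, enforcing the
    non-overlapping-pool invariant across hardware types.
    """
    assigned = {}                 # node name -> pool it already joined
    pools = defaultdict(list)
    for pool, selector in pool_selectors.items():
        for node in nodes:
            if all(node["labels"].get(k) == v for k, v in selector.items()):
                if node["name"] in assigned:
                    raise ValueError(f"{node['name']} is in both "
                                     f"{assigned[node['name']]} and {pool}")
                assigned[node["name"]] = pool
                pools[pool].append(node["name"])
    return dict(pools)
```

Selectors over labels like accelerator vendor/model or rack location give the logical grouping described above; the overlap check is what keeps allocations conflict-free.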
2. Hierarchical Quota Tree Management:
- Multi-level quota trees provide the flexibility to create arbitrary hierarchies that mirror organizational structure.
- An intuitive UI allows non-technical program managers to allocate resources.
- Configurable user access control for each quota node, delegated by the infrastructure admin to department admins.
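Conceptually (the names below are illustrative, not the product's schema), a hierarchical quota tree mirrors the org chart, with the invariant that a parent's quota covers its children:

```python
from dataclasses import dataclass, field

@dataclass
class QuotaNode:
    name: str
    quota: int                              # accelerators guaranteed at this level
    children: list["QuotaNode"] = field(default_factory=list)

    def validate(self):
        """Every parent must have at least as much quota as its children combined."""
        child_total = sum(c.quota for c in self.children)
        if child_total > self.quota:
            raise ValueError(f"{self.name}: children request {child_total}, "
                             f"but only {self.quota} available")
        for child in self.children:
            child.validate()

# An org with 64 accelerators split across two departments and two research teams.
org = QuotaNode("org", 64, [
    QuotaNode("research", 40, [QuotaNode("llm", 24), QuotaNode("vision", 16)]),
    QuotaNode("platform", 24),
])
org.validate()   # 24 + 16 <= 40 and 40 + 24 <= 64, so this passes
```

Delegation falls out naturally: a department admin can rearrange quota among their own children without touching any other subtree.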
3. Overquota: Temporary borrowing of unused resources
- Teams can exceed their assigned quota when another team's quota has idle resources
- Automatically reclaim borrowed resources through preemption back to the lender
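The borrowing mechanic can be sketched roughly like this (a toy model with invented names, not the operator's actual scheduler):

```python
def schedule(team, request, quotas, usage):
    """Grant from the team's own quota first, then borrow idle quota elsewhere.

    Anything granted beyond the team's own quota is preemptible: when a
    lending team needs its resources back, the borrowed share is reclaimed.
    """
    own_free = quotas[team] - usage[team]
    if request <= own_free:
        usage[team] += request
        return "granted"
    idle_elsewhere = sum(max(0, quotas[t] - usage[t]) for t in quotas if t != team)
    if request <= own_free + idle_elsewhere:
        usage[team] += request        # the excess over quotas[team] is borrowed
        return "granted-borrowed"
    return "queued"

def reclaimable(team, quotas, usage):
    """How much a lender can preempt back: whatever others run above their own quota."""
    return sum(max(0, usage[t] - quotas[t]) for t in quotas if t != team)
```

The key design point is that borrowed capacity is never permanent: it is tracked separately so preemption can return it to the lender on demand.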
4. Oversubscription: Strategic over-allocation of total resources
- System allocates more combined quota than physical GPUs available
- Works because workloads rarely all peak simultaneously
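In simplified form (with made-up numbers and function names): handed-out quota can exceed the hardware, while admission still caps concurrently running work at physical capacity:

```python
def oversubscribe(physical, teams, ratio=1.5):
    """Hand out ratio-times more combined quota than physical accelerators exist."""
    per_team = int(physical * ratio) // len(teams)
    return {team: per_team for team in teams}

def admit(pending, usage, physical):
    """Quotas may be oversubscribed, but running work never exceeds the hardware."""
    return sum(usage.values()) + pending <= physical

# 64 physical devices, 96 total quota: each team sees a generous ceiling,
# and the admission check prevents them all cashing it in at once.
quotas = oversubscribe(64, ["research", "platform"])
```

This works precisely because teams rarely peak simultaneously; the quota is a ceiling on entitlement, not a reservation of hardware.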
5. Coming Soon: Topology-aware scheduling (TAS) for affordable distributed training/inference
Intel’s cost-effective Ethernet and high-speed interconnects combined with Exostellar’s TAS will optimize pod-to-pod placement based on network topology, maximizing bandwidth and minimizing network hops for distributed training and inference. The result? Accelerated performance at a fraction of the cost.
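The core idea behind topology-aware placement can be shown in miniature (a toy distance model, not Exostellar's TAS implementation):

```python
import itertools

def hops(a, b):
    """Toy network distance: 0 = same node, 1 = same rack, 2 = across the spine."""
    if a["node"] == b["node"]:
        return 0
    if a["rack"] == b["rack"]:
        return 1
    return 2

def best_placement(free_slots, k):
    """Choose k accelerator slots minimizing total pairwise hops between pods."""
    return min(
        itertools.combinations(free_slots, k),
        key=lambda group: sum(hops(a, b) for a, b in itertools.combinations(group, 2)),
    )
```

The brute-force search here is only illustrative; a production scheduler would use heuristics, since the number of combinations grows combinatorially with cluster size. The point is the objective: co-locate communicating pods to maximize bandwidth and minimize hops.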
Market Timing: Open AI Infrastructure
The strategic importance of orchestration software is clear. But tying that software to a single vendor's stack raises red flags about lock-in at a time when enterprises demand choice and control.
Our collaboration offers a compelling alternative:
- Open ecosystem vs. proprietary stack
- Multi-vendor support vs. single-vendor dependency
- 50%+ cost savings vs. premium pricing
Getting Started
Intel and Exostellar are committed to making enterprise-grade AI infrastructure accessible.
This collaboration signals the future of AI infrastructure: open, intelligent and cost-effective. As AI workloads evolve from experimental to production, enterprises need orchestration platforms that boost ROI while maintaining flexibility.
Intel Gaudi is available through major cloud providers and system integrators. Exostellar Multi-Cluster Operator launches in July 2025, offering enterprise features like multi-cluster management, quota enforcement, and advanced scheduling.
For more information, reach out at https://www.exostellar.ai/contact
More information on Gaudi AI accelerators can be found here: https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi.html
Intel and Exostellar are collaborating on joint customer deployments, technical integration, and go-to-market initiatives. Contact your Intel or Exostellar representative for pilot program opportunities.