By Nayan Lad, Yuchen Fama, Zhiming Shen
The future of compute is open: Intel and Exostellar lead the way
As AI and machine learning applications evolve, organizations are demanding greater compute power and smarter orchestration to train and deploy models at scale. The architecture of the Intel® Gaudi® AI accelerator is built to meet this need, delivering high-performance, cost-effective compute tailored for AI workloads. But unlocking the full value of these systems requires more than cutting-edge hardware; it demands intelligent resource management, workload prioritization, and a seamless user experience.
That’s why Intel and Exostellar are excited to announce a strategic collaboration. By combining Intel® Gaudi® with Exostellar’s advanced orchestration solution, Multi-Cluster Operator, we’re enabling customers to maximize utilization, control access, and streamline the sharing of Intel Gaudi compute resources across teams and projects. Together, we deliver an end-to-end solution with support for quota enforcement, dynamic borrowing, fair queuing, and priority-based scheduling, bringing cloud-like agility and efficiency to on-prem or hybrid AI infrastructure.
This collaboration parallels what the industry has seen with NVIDIA and run.ai, but extends the model to a broader, more competitive AI hardware ecosystem. With this collaboration, we’re not just improving performance; we’re empowering organizations to build and scale AI initiatives faster, more efficiently, and more cost-effectively.
The Problem: GPU Infrastructure Bottlenecks
According to findings by Exostellar, enterprise AI infrastructure teams face mounting challenges:
- High upfront costs and significant GPU resource waste – Building large GPU clusters can require hundreds of millions in capital investment, yet real-world utilization often falls below 50%, and in many cases under 15%.[1]
- Longer wait times – Developers often wait days or weeks for GPU availability
- Low utilization – Without dynamic orchestration, large portions of GPU fleets remain unused for extended periods.
- Vendor lock-in – Limits flexibility in hardware and software choices
The Solution: Intelligent Orchestration Meets Open Architecture
This solution combines Intel Gaudi’s price-performance with Exostellar's advanced orchestration capabilities.
Intel Gaudi 3 AI accelerators are purpose-built for high-efficiency AI training and inference across deep learning, large language models (LLMs), and generative AI workloads:
- Scalability: Processing power and networking bandwidth enable scale across large clusters without bottlenecks.
- Advanced Memory Architecture: Optimized for transformer models, with up to 128 GB of high-bandwidth memory for large models and datasets.
- Developer-Friendly Software Stack: Supports frameworks like PyTorch and TensorFlow, and inferencing tools like vLLM, TensorRT-LLM, and Ray, offering flexible integration for AI development.
- Model Optimization: Optimized versions of popular models, available on Hugging Face, ensure they run as fast as possible on Intel Gaudi hardware.
- Greater Diversity: Expands options for deploying LLMs beyond traditional GPUs.
Exostellar Multi-Cluster Operator: Kubernetes-Native AI Orchestration
- xPU software-defined virtualization for unified resource pooling and sharing across heterogeneous clusters and hardware.
- Hierarchical quota management mirroring organizational structure, with an easy-to-use UI for non-technical quota managers.
- Cross-team quota borrowing and resource sharing maximize utilization and minimize idle time.
- Priority-based preemption ensures critical workloads get resources while preserving fairness across teams.
Enterprise Benefits
1. Cost Optimization
- 2-3x improvement in developer iteration cycles, accelerating time-to-market
- Autonomous resource optimization with overquota, binpacking, and advanced scheduling
2. Operational Simplicity
- Single pane of glass for multi-cluster and multi-vendor environments
- 40-80% reduction in manual quota management and resource allocation
3. Future-Proof Architecture
- Vendor-agnostic design supports Intel Gaudi, NVIDIA, AMD, and emerging accelerators
- Kubernetes-native integration with existing workflows as a drop-in infra middleware
- Open standards preventing vendor lock-in
Technical Deep Dive: Multi-Cluster Operator Features
Software-Defined xPU Virtualization: Multi-Cluster Operator unifies diverse hardware including Intel Gaudi, Xeon CPUs, and traditional GPUs into pooled resources with granular allocation controls.
1. Advanced Resource Pooling Features:
- Non-overlapping pool enforcement preventing resource conflicts across hardware types
- Logical grouping of nodes with specific labels (group by vendor/model, network proximity, etc.)
- Quota borrowing allowing temporary resource sharing during peak demand
- Namespace isolation enabling secure multi-tenant environments across hardware architectures
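As a rough illustration (names and data shapes here are hypothetical, not Exostellar's actual API), non-overlapping pool enforcement can be thought of as grouping nodes by label selectors and rejecting any node that matches more than one pool:

```python
from collections import defaultdict

def build_pools(nodes, pool_selectors):
    """Group nodes into named pools by label selectors.

    Raises if any node matches more than one pool, enforcing the
    non-overlapping-pool invariant across hardware types.
    """
    assigned = {}                 # node name -> pool it already joined
    pools = defaultdict(list)
    for pool, selector in pool_selectors.items():
        for node in nodes:
            if all(node["labels"].get(k) == v for k, v in selector.items()):
                if node["name"] in assigned:
                    raise ValueError(f"{node['name']} is in both "
                                     f"{assigned[node['name']]} and {pool}")
                assigned[node["name"]] = pool
                pools[pool].append(node["name"])
    return dict(pools)
```

Selectors over labels like accelerator vendor/model or rack location give the logical grouping described above; the overlap check is what keeps allocations conflict-free.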
2. Hierarchical Quota Tree Management:
- Multi-level quota trees provide the flexibility to create arbitrary hierarchies that mirror organizational structure.
- An intuitive UI allows non-technical program managers to allocate resources.
- Configurable user access control for each quota node, delegated by the infrastructure admin to department admins.
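Conceptually (the names below are illustrative, not the product's schema), a hierarchical quota tree mirrors the org chart, with the invariant that a parent's quota covers its children:

```python
from dataclasses import dataclass, field

@dataclass
class QuotaNode:
    name: str
    quota: int                              # accelerators guaranteed at this level
    children: list["QuotaNode"] = field(default_factory=list)

    def validate(self):
        """Every parent must have at least as much quota as its children combined."""
        child_total = sum(c.quota for c in self.children)
        if child_total > self.quota:
            raise ValueError(f"{self.name}: children request {child_total}, "
                             f"but only {self.quota} available")
        for child in self.children:
            child.validate()

# An org with 64 accelerators split across two departments and two research teams.
org = QuotaNode("org", 64, [
    QuotaNode("research", 40, [QuotaNode("llm", 24), QuotaNode("vision", 16)]),
    QuotaNode("platform", 24),
])
org.validate()   # 24 + 16 <= 40 and 40 + 24 <= 64, so this passes
```

Delegation falls out naturally: a department admin can rearrange quota among their own children without touching any other subtree.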
3. Overquota: Temporary borrowing of unused resources
- Teams can exceed their assigned quota when another team's quota has idle resources
- Automatically reclaim borrowed resources through preemption back to the lender
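The borrowing mechanic can be sketched roughly like this (a toy model with invented names, not the operator's actual scheduler):

```python
def schedule(team, request, quotas, usage):
    """Grant from the team's own quota first, then borrow idle quota elsewhere.

    Anything granted beyond the team's own quota is preemptible: when a
    lending team needs its resources back, the borrowed share is reclaimed.
    """
    own_free = quotas[team] - usage[team]
    if request <= own_free:
        usage[team] += request
        return "granted"
    idle_elsewhere = sum(max(0, quotas[t] - usage[t]) for t in quotas if t != team)
    if request <= own_free + idle_elsewhere:
        usage[team] += request        # the excess over quotas[team] is borrowed
        return "granted-borrowed"
    return "queued"

def reclaimable(team, quotas, usage):
    """How much a lender can preempt back: whatever others run above their own quota."""
    return sum(max(0, usage[t] - quotas[t]) for t in quotas if t != team)
```

The key design point is that borrowed capacity is never permanent: it is tracked separately so preemption can return it to the lender on demand.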
4. Oversubscription: Strategic over-allocation of total resources
- System allocates more combined quota than physical GPUs available
- Works because workloads rarely all peak simultaneously
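In simplified form (with made-up numbers and function names): handed-out quota can exceed the hardware, while admission still caps concurrently running work at physical capacity:

```python
def oversubscribe(physical, teams, ratio=1.5):
    """Hand out ratio-times more combined quota than physical accelerators exist."""
    per_team = int(physical * ratio) // len(teams)
    return {team: per_team for team in teams}

def admit(pending, usage, physical):
    """Quotas may be oversubscribed, but running work never exceeds the hardware."""
    return sum(usage.values()) + pending <= physical

# 64 physical devices, 96 total quota: each team sees a generous ceiling,
# and the admission check prevents them all cashing it in at once.
quotas = oversubscribe(64, ["research", "platform"])
```

This works precisely because teams rarely peak simultaneously; the quota is a ceiling on entitlement, not a reservation of hardware.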
5. Coming Soon: Topology-aware scheduling (TAS) for affordable distributed training/inference
Intel’s cost-effective Ethernet and high-speed interconnects combined with Exostellar’s TAS will optimize pod-to-pod placement based on network topology, maximizing bandwidth and minimizing network hops for distributed training and inference. The result? Accelerated performance at a fraction of the cost.
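The core idea behind topology-aware placement can be shown in miniature (a toy distance model, not Exostellar's TAS implementation):

```python
import itertools

def hops(a, b):
    """Toy network distance: 0 = same node, 1 = same rack, 2 = across the spine."""
    if a["node"] == b["node"]:
        return 0
    if a["rack"] == b["rack"]:
        return 1
    return 2

def best_placement(free_slots, k):
    """Choose k accelerator slots minimizing total pairwise hops between pods."""
    return min(
        itertools.combinations(free_slots, k),
        key=lambda group: sum(hops(a, b) for a, b in itertools.combinations(group, 2)),
    )
```

The brute-force search here is only illustrative; a production scheduler would use heuristics, since the number of combinations grows combinatorially with cluster size. The point is the objective: co-locate communicating pods to maximize bandwidth and minimize hops.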
Market Timing: Open AI Infrastructure
The strategic importance of orchestration software is clear. But tying that software to a single vendor's stack raises red flags about lock-in at a time when enterprises demand choice and control.
Our collaboration offers a compelling alternative:
- Open ecosystem vs. proprietary stack
- Multi-vendor support vs. single-vendor dependency
- 50%+ cost savings vs. premium pricing
Getting Started
Intel and Exostellar are committed to making enterprise-grade AI infrastructure accessible.
This collaboration signals the future of AI infrastructure: open, intelligent and cost-effective. As AI workloads evolve from experimental to production, enterprises need orchestration platforms that boost ROI while maintaining flexibility.
Intel Gaudi is available through major cloud providers and system integrators. Exostellar Multi-Cluster Operator launches in July 2025, offering enterprise features like multi-cluster management, quota enforcement, and advanced scheduling.
For more information, reach out at https://www.exostellar.ai/contact
More information on Gaudi AI accelerators can be found here: https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi.html
Intel and Exostellar are collaborating on joint customer deployments, technical integration, and go-to-market initiatives. Contact your Intel or Exostellar representative for pilot program opportunities.