June 24, 2024

What You Need To Know About The New World Of Cloud-Based Superclusters

Greg Pavlik, Senior Vice President and Chief Technology Officer, Oracle Cloud Platform.

In the first wave of cloud computing, companies moved many common workloads from expensive private data centers onto the shared compute, networking and storage of the public cloud, paying only for what they used.

That, of course, was a great start, but going forward, there’s a huge opportunity to solve thorny scientific and engineering problems by running clusters of compute resources on the public cloud. These are the types of jobs that typically require pricey on-premises clusters or even supercomputers. Drug discovery, medical research, oil and gas exploration, weather forecasting, cryptology and quantum mechanics are all applications that can take advantage of superclusters.

The problem until recently has been that computing clusters tend to be expensive, hard to design and difficult to scale, putting their capabilities beyond the reach of many organizations. Now, however, better ways to manage the massive resources of the public cloud, along with breakthroughs in artificial intelligence (AI), are putting computing clusters within reach of businesses and agencies that could never afford big on-premises installations or supercomputer time.

Deploying AI on the public cloud can help organizations make sense of gigantic volumes of data by facilitating the construction of “trained models.” To use a sports analogy, just as athletes train to achieve peak performance for an upcoming race or game, machine learning (ML) models are trained to perform well by absorbing lots of relevant data.
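To make the training analogy concrete, here is a minimal, purely illustrative sketch of what “absorbing relevant data” means: a one-weight model is nudged repeatedly toward its training examples until it fits them. (This toy example is not from the article; real ML training uses frameworks and vastly more data and compute.)

```python
# Toy illustration of model training: gradient descent fits a single
# weight w so that w * x approximates the data's true relationship y = 3 * x.

def train(samples, lr=0.01, epochs=200):
    w = 0.0  # start with an untrained model
    for _ in range(epochs):
        for x, y in samples:
            pred = w * x
            grad = 2 * (pred - y) * x  # derivative of squared error w.r.t. w
            w -= lr * grad             # nudge the model toward the data
    return w

data = [(x, 3.0 * x) for x in range(1, 6)]  # "relevant data" with slope 3
print(round(train(data), 3))  # converges near 3.0
```

The same loop, scaled up to billions of parameters and examples, is what cloud clusters accelerate by running many such updates in parallel.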

The use of cloud-based clusters of compute nodes, or superclusters (clusters of clusters), speeds the development of AI. Many AI startups—including Mosaic ML, Adept AI, Aleph Alpha, Cohere, Character AI, Vector Space Biosciences, Altair DesignAI and Twelve Labs (the first three are Oracle partners)—already test and build powerful AI models cost-effectively using cloud-based clusters.

Going forward, more companies of all sizes will be able to deploy cloud-based clusters to train models useful for natural language processing (NLP) or building consumer recommendation engines that factor in billions of parameters and analyze vast troves of data.

So, what should customers look for in a cloud provider when they want to run cloud-based superclusters?

Fast, Smart Networking

As in any form of computing, hardware is important, but in the case of superclusters, the ability of connected processors to share data quickly over the network is critical.

For that reason, a cloud provider’s underlying network technology is a key factor. If the network is slow, work is delayed. InfiniBand is a common approach to eliminating network bottlenecks, but Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE v2) networking also reduces latency and minimizes packet loss—two key concerns in clustered configurations. RoCE v2, in other words, keeps data moving quickly to feed all that complex computation.

This need for speed is especially applicable in large language model (LLM) training, which requires a huge amount of data crunching and transfer. Low latency ensures that the cloud cluster takes full advantage of super-fast processors and network links.
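A rough back-of-envelope calculation shows why the network matters so much here. The figures below are assumptions for illustration, not from the article: a 7-billion-parameter model, 2-byte (16-bit) gradients, and a ring all-reduce, which moves roughly 2 × (N−1)/N × model-size bytes per worker on every training step.

```python
# Estimate the data each worker must exchange per training step when
# synchronizing gradients with a ring all-reduce (assumed topology).

def allreduce_bytes_per_step(params, bytes_per_grad=2, workers=64):
    model_bytes = params * bytes_per_grad          # gradient payload size
    return 2 * (workers - 1) / workers * model_bytes  # ring all-reduce volume

gb = allreduce_bytes_per_step(7_000_000_000) / 1e9
print(f"~{gb:.1f} GB exchanged per worker per training step")
```

Tens of gigabytes per worker on every step is why a slow or lossy interconnect, not the GPUs, often becomes the bottleneck in LLM training.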

Best And Brightest Hardware

As noted, superclusters require fast hardware: a large number of powerful network interface cards and processors and, increasingly, high-performance GPUs. Cloud computing, by its nature, harnesses tens of thousands of processors or more to act as distributed computing nodes.

GPUs, because they contain many more small cores than traditional CPUs, are well suited for this highly parallel work. NVIDIA GPUs, for example, have become extremely popular for running these sorts of jobs (NVIDIA is a partner of my company).
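The reason many small cores help is the data-parallel pattern: the same operation applied independently to many elements at once. This CPU-only sketch mimics that pattern with a thread pool. (It is a hypothetical illustration of the pattern only; a real GPU job would use CUDA or a deep learning framework, and Python threads do not actually speed up CPU-bound work.)

```python
# Data-parallel pattern: split the input into chunks and apply the same
# operation to each chunk on a separate worker, then recombine the results.
from concurrent.futures import ThreadPoolExecutor

def square_chunk(chunk):
    return [x * x for x in chunk]  # the same operation for every element

def parallel_square(data, workers=4):
    size = len(data) // workers
    chunks = [data[i * size:(i + 1) * size] for i in range(workers - 1)]
    chunks.append(data[(workers - 1) * size:])  # last chunk takes the remainder
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(square_chunk, chunks)  # one worker per chunk
    return [y for chunk in results for y in chunk]

print(parallel_square(list(range(10))))  # same result as a sequential loop
```

A GPU applies this idea at a vastly larger scale, running thousands of such "workers" in hardware simultaneously.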

Depending on their workload, customers may want the full firepower of bare metal GPUs in a cluster for heavy-duty scientific or engineering jobs, or they may opt for virtualized compute for more general-purpose applications. Either way, they should make sure their cloud provider offers both options.

Another key consideration is how the cloud provider in question gets its chips. Some are pushing their own proprietary microprocessors. But one advantage to sticking with experienced third-party chip makers is that their silicon comes with a supporting ecosystem of popular frameworks and tools, not to mention a sizeable talent pool that knows how to get the most out of the hardware.

That, and the fact that third-party chips run in more than one cloud, as well as on-premises, may also comfort businesses that do not want to lock themselves into one technology provider for everything from silicon all the way up the software stack.

Continued Cluster Commitment

Prospective customers should also make sure that their preferred cloud provider sees supercluster hosting as a continuing priority. Does it plan to grow the size of its clusters and support capabilities? Can your provider of choice run several tens of thousands of nodes in a cluster? Will it continue to staff that effort?

Cloud-based superclusters are less expensive and can be more flexible than their on-premises analogs. It is, for example, much easier for a business to spin up—or down—a supercluster on a cloud as its needs change, compared to building out—and paying for—more corporate data center resources that may lie fallow part of the time.

Companies that do their homework in evaluating a cloud computing partner will be more likely to successfully deploy superclusters to solve their toughest problems. Part of that prep work should include confirming, before proceeding, that the cloud provider is fully committed to helping them build, test, optimize and deploy the supercluster resources they need.

To sum up, the successful adoption of cloud superclusters requires the convergence of ultra-fast networks and the most powerful hardware suited for distributed operations. In this world, a cloud must scale up and down as needed, offer clever caching and quickly deploy workloads where they will run optimally.

That wish list—along with the cloud provider’s in-house support expertise—can help more businesses reap the biggest rewards from cloud-based superclusters.


Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.

