Can Intel's XPU vision guide the industry into an era of heterogeneous computing?

This article is part of the Technology Insight series, made possible with funding from Intel.

As data sprawls out from the network core to the intelligent edge, increasingly diverse compute resources follow, balancing power, performance, and response time. Historically, graphics processors (GPUs) were the offload target of choice for data processing. Today, field-programmable gate arrays (FPGAs), vision processing units (VPUs), and application-specific integrated circuits (ASICs) also bring unique strengths to the table. Intel refers to those accelerators (and anything else to which a CPU can send processing tasks) as XPUs.

The challenge software developers face is determining which XPU is best for their workload; arriving at an answer often involves lots of trial and error. Faced with a growing list of architecture-specific programming tools to support, Intel spearheaded a standards-based programming model called oneAPI to unify code across XPU types. Simplifying software development for XPUs can't happen soon enough. After all, the move to heterogeneous computing (processing on the best XPU for a given application) seems inevitable, given evolving use cases and the many devices vying to address them.

KEY POINTS

- Intel's strategy faces headwinds from NVIDIA's incumbent CUDA platform, which assumes you're using NVIDIA graphics processors exclusively. That walled garden may not be as impenetrable as it once was. Intel already has a design win with its upcoming Xe-HPC GPU, code-named Ponte Vecchio. The Argonne National Laboratory's Aurora supercomputer, for example, will feature more than 9,000 nodes, each with six Xe-HPC GPUs, totaling more than 1 exaFLOP/s of sustained double-precision performance.
- Time will tell if Intel can deliver on its promise to streamline heterogeneous programming with oneAPI, lowering the barrier to entry for hardware vendors and software developers alike. A compelling XPU roadmap certainly gives the industry a reason to look more closely.

The total volume of data spread between internal data centers, cloud repositories, third-party data centers, and remote locations is expected to increase by more than 42% from 2020 to 2022, according to The Seagate Rethink Data Survey. The value of that information depends on what you do with it, where, and when. Some data can be captured, classified, and stored to drive machine learning breakthroughs. Other applications require a real-time response.

The compute resources needed to satisfy those use cases look nothing alike. GPUs optimized for server platforms consume hundreds of watts each, while VPUs in the single-watt range might power smart cameras or computer vision-based AI appliances. In either example, a developer must decide on the best XPU for processing data as efficiently as possible. This isn't a new phenomenon. Rather, it's an evolution of a decades-long trend toward heterogeneity, where applications can run control, data, and compute tasks on the hardware architecture best suited to each specific workload.

"Transitioning to heterogeneity is inevitable for the same reasons we went from single-core to multicore CPUs," says James Reinders, an engineer at Intel specializing in parallel computing. "It's making our computers more capable, and able to solve more problems and do things they couldn't do in the past, but within the constraints of hardware we can design and build."

As with the adoption of multicore processing, which forced developers to start thinking about their algorithms in terms of parallelism, the biggest obstacle to making computers more heterogeneous today is the complexity of programming them.

It used to be that developers programmed close to the hardware using low-level languages, providing very little abstraction. The code was often fast and efficient, but not portable. These days, higher-level languages extend compatibility across a broader swathe of hardware while hiding a lot of unnecessary details. Compilers, runtimes, and libraries underneath the code make the hardware do what you want. It makes sense that we're seeing more specialized architectures enabling new functionality through abstracted languages.
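To make the contrast concrete, consider a minimal sketch (an illustration, not from the article) of the same element-wise addition written both ways: once against x86 SSE intrinsics, fast but tied to a single instruction set, and once as a standard C++ parallel algorithm that the compiler and runtime can map to whatever hardware is underneath.

```cpp
#include <immintrin.h>   // x86-only SIMD intrinsics
#include <algorithm>
#include <execution>
#include <vector>

// Low level: explicit 128-bit SSE registers. Fast, but this code
// only compiles for and runs on x86 CPUs.
void add4_sse(float* a, const float* b) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(a, _mm_add_ps(va, vb));
}

// High level: a C++17 parallel algorithm. The toolchain decides how to
// vectorize and parallelize it for the target hardware.
void add_portable(std::vector<float>& a, const std::vector<float>& b) {
    std::transform(std::execution::par_unseq,
                   a.begin(), a.end(), b.begin(), a.begin(),
                   [](float x, float y) { return x + y; });
}
```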

Even now, new accelerators require their own software stacks, gobbling up the hardware vendors' time and money. From there, developers make their own investment into learning new tools so they can determine the best architecture for their application.

Instead of spending time rewriting and recompiling code using different libraries and SDKs, imagine an open, cross-architecture model that can be used to migrate between architectures without leaving performance on the table. That's what Intel is proposing with its oneAPI initiative.

oneAPI comprises a high-level language (Data Parallel C++, or DPC++), a set of APIs and libraries, and a hardware abstraction layer for low-level XPU access. On top of the open specification, Intel offers its own suite of toolkits for various development tasks. The Base Toolkit, for example, includes the DPC++ compiler, a handful of libraries, a compatibility tool for migrating NVIDIA CUDA code to DPC++, the optimization-oriented VTune profiler, and the Advisor analysis tool, which helps identify the best kernels to offload. Other toolkits home in on more specific segments, such as HPC, AI and machine learning acceleration, IoT, rendering, and deep learning inference.
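As a rough illustration of the programming model (a sketch following the SYCL 2020 specification, not an official Intel sample), the DPC++ kernel below can be retargeted by changing only the device selector behind the queue:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<float> a(1024, 1.0f), b(1024, 2.0f);

    // The queue binds the kernel to an XPU. Swapping the selector
    // (cpu_selector_v, gpu_selector_v, accelerator_selector_v)
    // retargets the same code without rewriting it.
    sycl::queue q{sycl::default_selector_v};
    {
        sycl::buffer<float> buf_a(a.data(), sycl::range<1>(a.size()));
        sycl::buffer<float> buf_b(b.data(), sycl::range<1>(b.size()));

        q.submit([&](sycl::handler& h) {
            sycl::accessor A(buf_a, h, sycl::read_write);
            sycl::accessor B(buf_b, h, sycl::read_only);
            h.parallel_for(sycl::range<1>(a.size()),
                           [=](sycl::id<1> i) { A[i] += B[i]; });
        });
    }   // buffer destructors copy results back into the host vectors
}
```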

"When we talk about oneAPI at Intel, it's a pretty simple concept," says Intel's Reinders. "I want as much as possible to be the same. It's not that there's one API for everything. Rather, if I want to do fast Fourier transforms, I want to learn the interface for an FFT library, then I want to use that same interface for all my XPUs."
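Reinders' FFT example maps naturally onto oneMKL's discrete Fourier transform interface. The sketch below is approximate (check the oneMKL documentation for exact signatures), but it shows the idea: the descriptor and compute calls stay the same, and only the device behind the sycl::queue changes.

```cpp
#include <oneapi/mkl/dfti.hpp>
#include <sycl/sycl.hpp>
#include <complex>
#include <vector>

namespace dft = oneapi::mkl::dft;

// In-place 1D complex-to-complex forward FFT on whichever XPU backs q.
void fft_1d(sycl::queue& q, std::vector<std::complex<float>>& data) {
    dft::descriptor<dft::precision::SINGLE, dft::domain::COMPLEX>
        desc(static_cast<std::int64_t>(data.size()));
    desc.commit(q);   // bind the plan to this queue's device

    sycl::buffer<std::complex<float>, 1> buf(data.data(),
                                             sycl::range<1>(data.size()));
    dft::compute_forward(desc, buf);
}
```

Constructing the queue from a CPU, GPU, or accelerator selector is the only change needed to move the transform between XPUs; the library interface stays put.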

Intel isn't putting its clout behind oneAPI for purely selfless reasons. The company already has a rich portfolio of XPUs that stand to benefit from a unified programming model (in addition to the host processors tasked with commanding them). If each XPU were treated as an island, the industry would end up stuck where it was before oneAPI: with independent software ecosystems, marketing resources, and training for each architecture. By making as much common as possible, developers can spend more time innovating and less time reinventing the wheel.

An enormous share of the world's FLOP/s, or floating-point operations per second, comes from GPUs. NVIDIA's CUDA is the dominant platform for general-purpose GPU computing, and it assumes you're using NVIDIA hardware. Because CUDA is the incumbent technology, developers are reluctant to change software that already works, even if they'd prefer more hardware choice.

If Intel wants the community to look beyond proprietary lock-in, it needs to build a better mousetrap than its competition, and that starts with compelling GPU hardware. At its recent Architecture Day 2021, Intel disclosed that a pre-production implementation of its Xe-HPC architecture is already producing more than 45 TFLOPS of FP32 throughput, more than 5 TB/s of fabric bandwidth, and more than 2 TB/s of memory bandwidth. At least on paper, that's higher single-precision performance than NVIDIA's fastest data center processor.

The world of XPUs is more than just GPUs, though, which is exhilarating or terrifying, depending on who you ask. Supported by an open, standards-based programming model, a panoply of architectures might enable time-to-market advantages, dramatically lower power consumption, or workload-specific optimizations. But without oneAPI (or something like it), developers are stuck learning new tools for every accelerator, stymieing innovation and overwhelming programmers.

Fortunately, we're seeing signs of life beyond NVIDIA's closed platform. As an example, the team responsible for RIKEN's Fugaku supercomputer recently used Intel's oneAPI Deep Neural Network Library (oneDNN) as a reference to develop its own deep learning library. Fugaku employs Fujitsu A64FX CPUs, based on Armv8-A with the Scalable Vector Extension (SVE) instruction set, which didn't have a deep learning library yet. Optimizing Intel's code for Armv8-A processors enabled an up to 400x speed-up compared to simply recompiling oneDNN without modification. Incorporating those changes into the library's main branch makes the team's gains available to other developers.
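For a sense of the interface the Fugaku team worked from, here is a minimal oneDNN sketch (v3-style C++ API; treat the details as approximate): the engine abstracts the backend, so a port like the A64FX one slots in beneath the same primitive calls.

```cpp
#include <dnnl.hpp>

int main() {
    using namespace dnnl;

    // The engine abstracts the backend; Fugaku's port puts
    // A64FX-optimized kernels behind this same CPU engine.
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    // A batch of 64-channel 56x56 activations (contents uninitialized here).
    memory::desc md({1, 64, 56, 56}, memory::data_type::f32,
                    memory::format_tag::nchw);
    memory tensor(md, eng);

    // ReLU as a representative primitive: describe, create, execute in place.
    eltwise_forward::primitive_desc pd(
        eng, prop_kind::forward_inference, algorithm::eltwise_relu,
        md, md, /*alpha=*/0.f);
    eltwise_forward relu(pd);
    relu.execute(s, {{DNNL_ARG_SRC, tensor}, {DNNL_ARG_DST, tensor}});
    s.wait();
}
```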

Intel's Reinders acknowledges the whole thing sounds a lot like open source. However, the XPU philosophy goes a step further, affecting the way code is written so that it's ready for different types of accelerators running underneath it. "I'm not worried that this is some type of fad," he says. "It's one of the next major steps in computing. It is not a question of whether an idea like oneAPI will happen, but rather when it will happen."

