Introducing Triton: Open-Source GPU Programming for Neural Networks |

We're releasing Triton 1.0, an open-source Python-like programming language which enables researchers with no CUDA experience to write highly efficient GPU codemost of the time on par with what an expert would be able to produce. Triton makes it possible to reach peak hardware performance with relatively little effort; for example, it can be used to write FP16 matrix multiplication kernels that match the performance of cuBLASsomething that many GPU programmers can't doin under 25 lines of code. Our researchers have already used it to produce kernels that are up to 2x more efficient than equivalent Torch implementations, and we're excited to work with the community to make GPU programming more accessible to everyone.

Novel research ideas in the field of Deep Learning are generally implemented using a combination of native framework operators. While convenient, this approach often requires the creation (and/or movement) of many temporary tensors, which can hurt the performance of neural networks at scale. These issues can be mitigated by writing specialized GPU kernels, but doing so can be surprisingly difficult due to the many intricacies of GPU programming. And, although a variety of systems have recently emerged to make this process easier, we have found them to be either too verbose, lack flexibility or generate code noticeably slower than our hand-tuned baselines. This has led us to extend and improve Triton, a recent language and compiler whose original creator now works at OpenAI.

The architecture of modern GPUs can be roughly divided into three major componentsDRAM, SRAM and ALUseach of which must be considered when optimizing CUDA code:

Basic architecture of a GPU.

Reasoning about all these factors can be challenging, even for seasoned CUDA programmers with many years of experience. The purpose of Triton is to fully automate these optimizations, so that developers can better focus on the high-level logic of their parallel code. Triton aims to be broadly applicable, and therefore does not automatically schedule work across SMs -- leaving some important algorithmic considerations (e.g. tiling, inter-SM synchronization) to the discretion of developers.

Compiler optimizations in CUDA vs Triton.

Out of all the Domain Specific Languages and JIT-compilers available, Triton is perhaps most similar to Numba: kernels are defined as decorated Python functions, and launched concurrently with different program_ids on a grid of so-called instances. However, as shown in the code snippet below, the resemblance stops there: Triton exposes intra-instance parallelism via operations on blockssmall arrays whose dimensions are powers of tworather than a Single Instruction, Multiple Thread (SIMT) execution model. In doing so, Triton effectively abstracts away all the issues related to concurrency within CUDA thread blocks (e.g., memory coalescing, shared memory synchronization/conflicts, tensor core scheduling).

Vector addition in Triton.

While this may not be particularly helpful for embarrassingly parallel (i.e., element-wise) computations, it can greatly simplify the development of more complex GPU programs.

Consider for example the case of a fused softmax kernel (below) in which each instance normalizes a different row of the given input tensor $X in mathbb{R}^{M times N}$. Standard CUDA implementations of this parallelization strategy can be challenging to write, requiring explicit synchronization between threads as they concurrently reduce the same row of $X$. Most of this complexity goes away with Triton, where each kernel instance loads the row of interest and normalizes it sequentially using NumPy-like primitives.

Fused softmax in Triton.

Note that the Triton JIT treats X and Y as pointers rather than tensors; we felt like retaining low-level control of memory accesses was important to address more complex data structures (e.g., block-sparse tensors).

Importantly, this particular implementation of softmax keeps the rows of $X$ in SRAM throughout the entire normalization process, which maximizes data reuse when applicable (~<32K columns). This differs from PyTorchs internal CUDA code, whose use of temporary memory makes it more general but significantly slower (below). The bottom line here is not that Triton is inherently better, but that it simplifies the development of specialized kernels that can be much faster than those found in general-purpose libraries.

A100 performance of fused softmax for M=4096.

The lower performance of the Torch (v1.9) JIT highlights the difficulty of automatic CUDA code generation from sequences of high-level tensor operations.

Fused softmax with the Torch JIT.

Being able to write fused kernels for element-wise operations and reductions is important, but not sufficient given the prominence of matrix multiplication tasks in neural networks. As it turns out, Triton also works very well for those, achieving peak performance with just ~25 lines of Python code. On the other hand, implementing something similar in CUDA would take a lot more effort and would even be likely to achieve lower performance.

Matrix multiplication in Triton.

One important advantage of handwritten matrix multiplication kernels is that they can be customized as desired to accommodate fused transformations of their inputs (e.g., slicing) and outputs (e.g., Leaky ReLU). Without a system like Triton, non-trivial modifications of matrix multiplication kernels would be out-of-reach for developers without exceptional GPU programming expertise.

V100 tensor-core performance of matrix multiplication with appropriately tuned values for BLOCK$_M$, BLOCK$_N$, BLOCK$_K$, GROUP$_M$.

The good performance of Triton comes from a modular system architecture centered around Triton-IR, an LLVM-based intermediate representation in which multi-dimensional blocks of values are first-class citizens.

High-level architecture of Triton.

The @triton.jit decorator works by walking the Abstract Syntax Tree (AST) of the provided Python function so as to generate Triton-IR on-the-fly using a common SSA construction algorithm. The resulting IR code is then simplified, optimized and automatically parallelized by our compiler backend, before being converted into high-quality LLVM-IRand eventually PTXfor execution on recent NVIDIA GPUs. CPUs and AMD GPUs are not supported at the moment, but we welcome community contributions aimed at addressing this limitation.

We have found that the use of blocked program representations via Triton-IR allows our compiler to automatically perform a wide variety of important program optimizations. For example, data can be automatically stashed to shared memory by looking at the operands of computationally intensive block-level operations (e.g., tl.dot)and allocated/synchronized using standard liveness analysis techniques.

The Triton compiler allocates shared memory by analyzing the live range of block variables used in computationally intensive operations.

On the other hand, Triton programs can be efficiently and automatically parallelized both (1) across SMs by executing different kernel instances concurrently, and (2) within SMs by analyzing the iteration space of each block-level operation and partitioning it adequately across different SIMD units, as shown below.

Element-wise

FP16 matrix multiplication

Vectorized

Tensorized

GPU

Element-wise

FP16 matrix mult.multiplication

Vectorized

Tensorized

GPU

Automatic parallelization in Triton. Each block-level operation defines a blocked iteration space that is automatically parallelized to make use of the resources available on a Streaming Multiprocessor (SM).

We intend for Triton to become a community-driven project. Feel free to fork our repository on GitHub!

If youre interested in joining our team and working on Triton & GPU kernels, were hiring!

Go here to see the original:
Introducing Triton: Open-Source GPU Programming for Neural Networks

Research, Evaluation and Learning at the International Rescue Committee - World - ReliefWeb [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
Conserving Biodiversity with AI - BBN Times [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
DevOps Fundamentals You Ever Wanted To Know - hackernoon.com [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
Another Perspective on Evictions - Bacon's Rebellion [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
Amitabh Bachchan on fans alternate job suggestion: My job is now insured - The Indian Express [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
Will You Soon Download Packaging Machine Controls from the Internet? - Packaging Digest [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
5 free resources every data scientist should start using today - The Next Web [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
Who's hoping to make an Epic impact on Green Bay area music scene with a new concert venue? | Streetwise - Green Bay Press Gazette [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
Industrial robots are dominating but are they safe from cyber-attacks? - TechHQ [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
Friday Rant - Rise of the Rogue-Bots? - Diginomica [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
Important Reasons Why You Should Pick RoR As Your Web-Based Development Project - Customer Think [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
Portrait of the software developer as an artist - ComputerWeekly.com [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
Python may be your safest bet for a career in coding - Gadgets Now [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
1Password is coming to Linux - ZDNet [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
IBM creates an open source tool to simplify API documentation - TechRepublic [Last Updated On: August 10th, 2020] [Originally Added On: August 10th, 2020]
Mastercard : Accelerate Ignites Next Generation of Fintech Disruptors and Partners to Build the Future of Commerce - Marketscreener.com [Last Updated On: August 12th, 2020] [Originally Added On: August 12th, 2020]
Expanding the Universe of Haptics | by Lofelt | Aug, 2020 - Medium [Last Updated On: August 12th, 2020] [Originally Added On: August 12th, 2020]
UX Designer Salary: 5 Important Things to Know - Dice Insights [Last Updated On: August 12th, 2020] [Originally Added On: August 12th, 2020]
Persistent memory reshaping advanced analytics to improve customer experiences - IT World Canada [Last Updated On: August 12th, 2020] [Originally Added On: August 12th, 2020]
NextCorps and SecondMuse Open Application Period for Programs that Help Climate Technology Startups Accelerate Hardware Manufacturing - GlobeNewswire [Last Updated On: August 12th, 2020] [Originally Added On: August 12th, 2020]
Buried deep in the ice is the GitHub code vault humanity's safeguard against devastation - ABC News [Last Updated On: August 12th, 2020] [Originally Added On: August 12th, 2020]
Top 12 Most Used Tools By Developers In 2020 - Analytics India Magazine [Last Updated On: August 12th, 2020] [Originally Added On: August 12th, 2020]
Facebook's React 17 JavaScript library: Here's why its top feature is 'no new features' - ZDNet [Last Updated On: August 12th, 2020] [Originally Added On: August 12th, 2020]
CORRECTING and REPLACING Anyscale Hosts Inaugural Ray Summit on Scalable Python and Scalable Machine Learning - Business Wire [Last Updated On: August 12th, 2020] [Originally Added On: August 12th, 2020]
Google: Here's how much we give to open source through our GitHub activity - ZDNet [Last Updated On: August 12th, 2020] [Originally Added On: August 12th, 2020]
How Chriselle Lim And Joan Nguyen Created Bmo, The Coworking Space And Virtual Classroom Of The Future (With A Childcare Twist) - Forbes [Last Updated On: August 13th, 2020] [Originally Added On: August 13th, 2020]
How Will Public Libraries Adapt To New School Year Norms? - Book Riot [Last Updated On: August 13th, 2020] [Originally Added On: August 13th, 2020]
Google: We'll test hiding the full URL in Chrome 86 to combat phishing - ZDNet [Last Updated On: August 13th, 2020] [Originally Added On: August 13th, 2020]
How to install Python 3 and PIP 3 on Ubuntu 20.04 LTS - Linux Shout - H2S Media [Last Updated On: August 13th, 2020] [Originally Added On: August 13th, 2020]
What are Bitcoin Wallets: Everything You Need to Know - Programming Insider [Last Updated On: August 13th, 2020] [Originally Added On: August 13th, 2020]
JSHint is Now Free Software after Updating License to MIT Expat - WP Tavern [Last Updated On: August 13th, 2020] [Originally Added On: August 13th, 2020]
How to learn JavaScript: These are the best online courses - Mashable [Last Updated On: August 13th, 2020] [Originally Added On: August 13th, 2020]
What developers need to know about inter-blockchain communication - ComputerWeekly.com [Last Updated On: August 14th, 2020] [Originally Added On: August 14th, 2020]
Introducing the CDK construct library for the serverless LAMP stack - idk.dev [Last Updated On: August 14th, 2020] [Originally Added On: August 14th, 2020]
IBM asked software developers to take on the wrath of Mother Nature - The Drum [Last Updated On: August 14th, 2020] [Originally Added On: August 14th, 2020]
Aspire Technology Launches First Truly Secure Public Blockchain for Creation of Digital Assets - GlobeNewswire [Last Updated On: August 14th, 2020] [Originally Added On: August 14th, 2020]
GM Creates And Shares New Workplace Safety Technologies - Pulse 2.0 [Last Updated On: August 14th, 2020] [Originally Added On: August 14th, 2020]
Key Considerations and Tools for IP Protection of Computer Programs in Europe and Beyond - Lexology [Last Updated On: August 14th, 2020] [Originally Added On: August 14th, 2020]
The state of application security: What the statistics tell us - CSO Online [Last Updated On: August 14th, 2020] [Originally Added On: August 14th, 2020]
Open Source: What's the delay on the former high/middle school on North Mulberry? - knoxpages.com [Last Updated On: August 14th, 2020] [Originally Added On: August 14th, 2020]
The Risks Associated with OSS and How to Mitigate Them - Security Boulevard [Last Updated On: August 14th, 2020] [Originally Added On: August 14th, 2020]
news digest: Microsoft launches open source website, TensorFlow Recorder released, and Stackery brings serverless to the Jamstack - SD Times -... [Last Updated On: August 14th, 2020] [Originally Added On: August 14th, 2020]
Build Your Own PaaS with Crossplane: Kubernetes, OAM, and Core Workflows - InfoQ.com [Last Updated On: August 17th, 2020] [Originally Added On: August 17th, 2020]
ISRO Is Recruiting For Vacancies with Salary Upto Rs 54000: How to Apply - The Better India [Last Updated On: August 17th, 2020] [Originally Added On: August 17th, 2020]
Does technology increase the problem of racism and discrimination? - TechTarget [Last Updated On: August 17th, 2020] [Originally Added On: August 17th, 2020]
CORRECTING and REPLACING Anyscale Hosts Inaugural Ray Summit on Scalable Python and Scalable Machine Learning - Yahoo Finance [Last Updated On: August 17th, 2020] [Originally Added On: August 17th, 2020]
In the City: Take advantage of open recreation, cultural and park amenities - Coloradoan [Last Updated On: August 17th, 2020] [Originally Added On: August 17th, 2020]
Exploring the future of modern software development - ComputerWeekly.com [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
Hadoop Developer Interview Questions: What to Know to Land the Job - Dice Insights [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
SiFive Opens Business Unit to Build Chips With Arm and RISC-V Inside - Electronic Design [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
Use Pulumi and Azure DevOps to deploy infrastructure as code - TechTarget [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
Why ASP.NET Core Is Regarded As One Of The Best Frameworks For Building Highly Scalable And Modern Web Applications - WhaTech [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
NITK figures 4th in Google Summer of Code ranking - BusinessLine [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
Learn More About Dynamo for Revit: Features, Functions, and News - ArchDaily [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
Linux Foundation showcases the greater good of open source - ComputerWeekly.com [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
Programming language Kotlin 1.4 is out: This is how it's improved quality and performance - ZDNet [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
Top 10 Languages That Paid Highest Salaries Worldwide In 2020 - Analytics India Magazine [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
Programming language Rust: Mozilla job cuts have hit us badly but here's how we'll survive - ZDNet [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
In-App Bidding Gathers Steam, But Adoption Looks Nothing Like Header Bidding On The Web - AdExchanger [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
13 thoughts on Fitting Snake Into A QR Code - Hackaday [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
Newham test and trace app was designed by man who grew up in the borough - Newham Recorder [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
'Trapped in a code' the fight over our algorithmic future - Open Democracy [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
Telegram launches one-on-one video calls on iOS and Android - The Verge [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
AWS Controllers for Kubernetes Will Be A 'Boon For Developers' - CRN: Technology news for channel partners and solution providers [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
Coding within company constraints - ComputerWeekly.com [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
Open Source and Open Standards: The Recipe for Success Featured - The Fast Mode [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
How Intel helped give the worlds first cyborg a voice - The Next Web [Last Updated On: August 21st, 2020] [Originally Added On: August 21st, 2020]
Tiger Woods, Rory McIlroy near bottom of field at The Northern Trust - ESPN [Last Updated On: August 22nd, 2020] [Originally Added On: August 22nd, 2020]
Intel Owl OSINT tool automates the intel-gathering process using a single API - The Daily Swig [Last Updated On: August 22nd, 2020] [Originally Added On: August 22nd, 2020]
IOTA Foundation presents the current projects in the mobility industry - Crypto News Flash [Last Updated On: August 22nd, 2020] [Originally Added On: August 22nd, 2020]
How 'Fortnite' and 'Second Life' Shaped the Future of Indian Market - Santa Fe Reporter [Last Updated On: August 22nd, 2020] [Originally Added On: August 22nd, 2020]
Apple Enters $ 2 Trillion Club, Github's Chinese Counterpart And More In This Week's Top News - Analytics India Magazine [Last Updated On: August 22nd, 2020] [Originally Added On: August 22nd, 2020]
As world grapples with pandemic, schools are the epicenter - ABC News [Last Updated On: August 24th, 2020] [Originally Added On: August 24th, 2020]
Why Businesses Should Embrace Modernizing Their Legacy Applications - TechBullion [Last Updated On: August 24th, 2020] [Originally Added On: August 24th, 2020]
Is It Time To Rename RPG? - IT Jungle [Last Updated On: August 24th, 2020] [Originally Added On: August 24th, 2020]
Phantasy Star Online programmers on breaking new ground and their Diablo-style isometric prototype - Polygon [Last Updated On: August 24th, 2020] [Originally Added On: August 24th, 2020]
How To Learn To Program In Python By Playing Videogames - Analytics India Magazine [Last Updated On: August 24th, 2020] [Originally Added On: August 24th, 2020]
New Microsoft program to help develop the quantum computing workforce of the future in India - Microsoft [Last Updated On: August 24th, 2020] [Originally Added On: August 24th, 2020]
How the Docker Revolution Will Change Your Programming, Part 1 - Walter Bradley Center for Natural and Artificial Intelligence [Last Updated On: August 24th, 2020] [Originally Added On: August 24th, 2020]
The art of developing happy customers - ComputerWeekly.com [Last Updated On: August 24th, 2020] [Originally Added On: August 24th, 2020]