Microsoft and Nvidia team up to train one of the world's largest language models


Microsoft and Nvidia today announced that they trained what they claim is the largest and most capable AI-powered language model to date: Megatron-Turing Natural Language Generation (MT-NLG). The successor to the companies' Turing NLG 17B and Megatron-LM models, MT-NLG contains 530 billion parameters and, Microsoft and Nvidia say, achieves unmatched accuracy across a broad set of natural language tasks, including reading comprehension, commonsense reasoning, and natural language inference.

"The quality and results that we have obtained today are a big step forward in the journey towards unlocking the full promise of AI in natural language. The innovations of DeepSpeed and Megatron-LM will benefit existing and future AI model development and make large AI models cheaper and faster to train," Nvidia's senior director of product management and marketing for accelerated computing, Paresh Kharya, and group program manager for the Microsoft Turing team, Ali Alvi, wrote in a blog post. "We look forward to how MT-NLG will shape tomorrow's products and motivate the community to push the boundaries of natural language processing (NLP) even further. The journey is long and far from complete, but we are excited by what is possible and what lies ahead."

In machine learning, parameters are the parts of the model that are learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well. Language models with more parameters, more data, and more training time have been shown to acquire a richer, more nuanced understanding of language, for example gaining the ability to summarize books and even complete programming code.
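As a rough illustration of what a parameter is, the sketch below (assuming PyTorch is available) counts the learnable weights in a toy network; MT-NLG's 530 billion parameters are the same kind of quantity, just at an enormously larger scale.

```python
# Minimal sketch of what "parameters" means: the learnable weights a model
# adjusts during training. The layer sizes here are arbitrary.
import torch.nn as nn

tiny_model = nn.Sequential(
    nn.Linear(128, 256),  # 128*256 weights + 256 biases
    nn.ReLU(),
    nn.Linear(256, 10),   # 256*10 weights + 10 biases
)

num_params = sum(p.numel() for p in tiny_model.parameters())
print(f"{num_params:,} parameters")  # 35,594 -- MT-NLG has roughly 530,000,000,000
```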

To train MT-NLG, Microsoft and Nvidia say that they created a training dataset of 270 billion tokens from English-language websites. Tokens, the units into which text is split before it is fed to a language model, can be words, characters, or parts of words. Like all AI models, MT-NLG had to be trained on a set of examples to learn patterns among data points, such as grammatical and syntactic rules.
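To make the idea concrete, here is a minimal Python sketch of those three granularities. It is only an illustration: the announcement does not detail MT-NLG's actual tokenizer, and large models typically use a learned subword scheme such as byte-pair encoding rather than the crude chunking shown here.

```python
# Minimal sketch of tokenization granularities (not MT-NLG's actual tokenizer).
text = "Language models learn from tokens."

word_tokens = text.split()   # ['Language', 'models', 'learn', 'from', 'tokens.']
char_tokens = list(text)     # ['L', 'a', 'n', 'g', ...]

# Crude stand-in for subword splitting: break each word into fixed-size chunks.
def toy_subwords(word, size=4):
    return [word[i:i + size] for i in range(0, len(word), size)]

subword_tokens = [piece for w in word_tokens for piece in toy_subwords(w)]
print(word_tokens, char_tokens[:5], subword_tokens, sep="\n")
```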

The dataset largely came from The Pile, an 835GB collection of 22 smaller datasets created by the open source AI research effort EleutherAI. The Pile spans academic sources (e.g., arXiv, PubMed), communities (Stack Exchange, Wikipedia), code repositories (GitHub), and more, which Microsoft and Nvidia say they curated and combined with filtered snapshots of the Common Crawl, a large collection of webpages including news stories and social media posts.

Training took place across 560 Nvidia DGX A100 servers, each containing 8 Nvidia A100 80GB GPUs.

In benchmarks, Microsoft says, MT-NLG can infer basic mathematical operations even when the symbols are badly obfuscated. While not extremely accurate, the model appears to go beyond memorization for arithmetic and manages to complete tasks containing questions that prompt it for an answer, a major challenge in NLP.
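The exact evaluation prompts are not included in the announcement, but a test of this kind might look something like the hypothetical sketch below, in which digits and operators are swapped for unfamiliar symbols so that success requires generalizing arithmetic rather than reciting memorized strings.

```python
# Hypothetical illustration only: the real evaluation prompts are not public.
# Digits and operators are replaced with arbitrary tokens before the
# expression is shown to the model.
obfuscation = {"1": "@", "2": "#", "3": "%", "+": "plus", "=": "gives"}

def obfuscate(expression):
    return " ".join(obfuscation.get(ch, ch) for ch in expression if ch != " ")

prompt = obfuscate("1 + 2 =")   # "@ plus # gives"
print(prompt)                    # a model would then be asked to continue this
```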

It's well established that models like MT-NLG can amplify the biases in the data on which they were trained, and indeed, Microsoft and Nvidia acknowledge that the model "picks up stereotypes and biases from the [training] data." That's likely because a portion of the dataset was sourced from communities with pervasive gender, race, physical, and religious prejudices, which curation can't completely address.

In a paper, the Middlebury Institute of International Studies' Center on Terrorism, Extremism, and Counterterrorism claims that GPT-3 and similar models can generate "informational and influential" text that might radicalize people into far-right extremist ideologies and behaviors. A group at Georgetown University has used GPT-3 to generate misinformation, including stories around a false narrative, articles altered to push a bogus perspective, and tweets riffing on particular points of disinformation. Other studies, like one published in April by researchers at Intel, MIT, and the Canadian AI initiative CIFAR, have found high levels of stereotypical bias in some of the most popular open source models, including Google's BERT, XLNet, and Facebook's RoBERTa.

Microsoft and Nvidia claim that they're committed to "working on addressing [the] problem" and encourage "continued research to help in quantifying the bias of the model." They also say that any use of Megatron-Turing in production "must ensure that proper measures are put in place to mitigate and minimize potential harm to users" and follow tenets such as those outlined in Microsoft's Responsible AI Principles.

"We live in a time [when] AI advancements are far outpacing Moore's law. We continue to see more computation power being made available with newer generations of GPUs, interconnected at lightning speeds. At the same time, we continue to see hyper-scaling of AI models leading to better performance, with seemingly no end in sight," Kharya and Alvi continued. "Marrying these two trends together are software innovations that push the boundaries of optimization and efficiency."

Projects like MT-NLG, AI21 Labs' Jurassic-1, Huawei's PanGu-Alpha, Naver's HyperCLOVA, and the Beijing Academy of Artificial Intelligence's Wu Dao 2.0 are impressive from an academic standpoint, but building them doesn't come cheap. For example, the training dataset for OpenAI's GPT-3, one of the world's largest language models, was 45 terabytes in size, enough to fill ninety 500GB hard drives.

AI training costs dropped 100-fold between 2017 and 2019, according to one source, but the totals still exceed the compute budgets of most startups. The inequity favors corporations with extraordinary access to resources at the expense of small-time entrepreneurs, cementing incumbent advantages.

For example, OpenAI's GPT-3 required an estimated 3.14 x 10^23 floating-point operations (FLOPs) of compute during training. In computer science, FLOPS, or floating-point operations per second, is a measure of raw processing performance, typically used to compare different types of hardware. Assuming OpenAI reserved 28 teraflops (28 trillion floating-point operations per second) of compute across a bank of Nvidia V100 GPUs, a common GPU available through cloud services, a single training run would cost $4.6 million. One Nvidia RTX 8000 GPU, with 15 teraflops of compute, would be substantially cheaper, but it'd take 665 years to finish the training.
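For readers who want to check those figures, the back-of-the-envelope arithmetic looks roughly like this; the $1.50-per-V100-hour cloud rate is an assumption used for illustration and is not taken from the article.

```python
# Rough check of the figures above. The cloud rate is an assumption.
total_flops = 3.14e23          # estimated total training compute for GPT-3
v100_flops_per_sec = 28e12     # 28 teraflops sustained, as assumed above
rtx8000_flops_per_sec = 15e12  # 15 teraflops

v100_gpu_hours = total_flops / v100_flops_per_sec / 3600
print(f"V100 cost at $1.50/GPU-hour: ${1.50 * v100_gpu_hours:,.0f}")
# -> roughly $4.7M, in line with the ~$4.6M estimate

rtx8000_years = total_flops / rtx8000_flops_per_sec / 3600 / 24 / 365
print(f"Single RTX 8000: ~{rtx8000_years:.0f} years")
# -> about 664 years, which the article rounds to 665
```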

Microsoft and Nvidia say that they observed between 113 and 126 teraflops per GPU while training MT-NLG. The cost is likely to have been in the millions of dollars.
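Combining the reported per-GPU throughput with the hardware described above gives a rough sense of the cluster's aggregate performance; the totals below are derived from the article's figures, not numbers Microsoft or Nvidia published.

```python
# Aggregate throughput implied by the reported figures (derived, not official).
servers = 560
gpus_per_server = 8
per_gpu_teraflops = (113, 126)   # reported sustained range per A100

total_gpus = servers * gpus_per_server                      # 4,480 GPUs
low, high = (t * total_gpus / 1000 for t in per_gpu_teraflops)
print(f"{total_gpus} GPUs, ~{low:.0f}-{high:.0f} petaflops aggregate")
# -> 4480 GPUs, ~506-564 petaflops aggregate
```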

A Synced report estimated that a fake news detection model developed by researchers at the University of Washington cost $25,000 to train, and that Google spent around $6,912 to train a language model called BERT that it used to improve the quality of Google Search results. Storage costs also mount quickly when dealing with datasets at the terabyte or petabyte scale. To take an extreme example, one of the datasets accumulated by Tesla's self-driving team, 1.5 petabytes of video footage, would cost over $67,500 to store in Azure for three months, according to CrowdStorage.
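The implied storage rate can be sanity-checked with simple arithmetic; actual Azure pricing varies by tier, region, and redundancy, so this is only an approximation.

```python
# Rough check of the storage figure; real cloud pricing is tiered and varies.
petabytes = 1.5
months = 3
total_cost = 67_500                       # reported lower bound, in USD

gb = petabytes * 1_000_000                # 1.5 PB = 1,500,000 GB
per_gb_month = total_cost / (gb * months)
print(f"~${per_gb_month:.3f} per GB per month")   # ~$0.015/GB-month
```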

The effects of AI and machine learning model training on the environment have also been brought into relief. In June 2020, researchers at the University of Massachusetts at Amherst released a report estimating that the amount of power required for training and searching a certain model involves the emission of roughly 626,000 pounds of carbon dioxide, equivalent to nearly five times the lifetime emissions of the average U.S. car. OpenAI itself has conceded that models like Codex require significant amounts of compute, on the order of hundreds of petaflops per day, which contributes to carbon emissions.

In a sliver of good news, the cost of FLOPS and basic machine learning operations has been falling over the past few years. A 2020 OpenAI survey found that since 2012, the amount of compute needed to train a model to the same performance on classifying images in a popular benchmark, ImageNet, has been decreasing by a factor of two every 16 months. Other recent research suggests that large language models aren't always more complex than smaller models, depending on the techniques used to train them.
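That "factor of two every 16 months" trend can be expressed as a simple halving curve; the sketch below is purely an illustration of the cited claim.

```python
# Illustration of the OpenAI efficiency trend cited above: compute needed for
# a fixed level of ImageNet accuracy halving roughly every 16 months since 2012.
def relative_compute(months_since_2012, halving_period_months=16):
    """Compute required relative to the 2012 baseline."""
    return 0.5 ** (months_since_2012 / halving_period_months)

for years in (0, 4, 8):
    print(f"{2012 + years}: {relative_compute(years * 12):.3f}x the 2012 compute")
# 2012: 1.000x, 2016: 0.125x, 2020: 0.016x
```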

Maria Antoniak, a natural language processing researcher and data scientist at Cornell University, says that when it comes to natural language, it's an open question whether larger models are the right approach. While some of the best benchmark performance scores today come from large datasets and models, the payoff from dumping enormous amounts of data into models is uncertain.

"The current structure of the field is task-focused, where the community gathers together to try to solve specific problems on specific datasets," Antoniak told VentureBeat in a previous interview. "These tasks are usually very structured and can have their own weaknesses, so while they help our field move forward in some ways, they can also constrain us. Large models perform well on these tasks, but whether these tasks can ultimately lead us to any true language understanding is up for debate."
