Posted by Jiefeng Chen, Student Researcher, and Jinsung Yoon, Research Scientist, Cloud AI Team
In the fast-evolving landscape of artificial intelligence, large language models (LLMs) have revolutionized the way we interact with machines, pushing the boundaries of natural language understanding and generation to unprecedented heights. Yet, the leap into high-stakes decision-making applications remains a chasm too wide, primarily due to the inherent uncertainty of model predictions. Traditional LLMs generate responses recursively, yet they lack an intrinsic mechanism to assign a confidence score to these responses. Although one can derive a confidence score by summing up the probabilities of individual tokens in the sequence, traditional approaches typically fall short in reliably distinguishing between correct and incorrect answers. But what if LLMs could gauge their own confidence and only make predictions when they're sure?
Selective prediction aims to do this by enabling LLMs to output an answer along with a selection score, which indicates the probability that the answer is correct. With selective prediction, one can better understand the reliability of LLMs deployed in a variety of applications. Prior research, such as semantic uncertainty and self-evaluation, has attempted to enable selective prediction in LLMs. A typical approach is to use heuristic prompts like Is the proposed answer True or False? to trigger self-evaluation in LLMs. However, this approach may not work well on challenging question answering (QA) tasks.
In "Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs", presented at Findings of EMNLP 2023, we introduce ASPIRE a novel framework meticulously designed to enhance the selective prediction capabilities of LLMs. ASPIRE fine-tunes LLMs on QA tasks via parameter-efficient fine-tuning, and trains them to evaluate whether their generated answers are correct. ASPIRE allows LLMs to output an answer along with a confidence score for that answer. Our experimental results demonstrate that ASPIRE significantly outperforms state-of-the-art selective prediction methods on a variety of QA datasets, such as the CoQA benchmark.
Imagine teaching an LLM to not only answer questions but also evaluate those answers akin to a student verifying their answers in the back of the textbook. That's the essence of ASPIRE, which involves three stages: (1) task-specific tuning, (2) answer sampling, and (3) self-evaluation learning.
Task-specific tuning: ASPIRE performs task-specific tuning to train adaptable parameters (p) while freezing the LLM. Given a training dataset for a generative task, it fine-tunes the pre-trained LLM to improve its prediction performance. Towards this end, parameter-efficient tuning techniques (e.g., soft prompt tuning and LoRA) might be employed to adapt the pre-trained LLM on the task, given their effectiveness in obtaining strong generalization with small amounts of target task data. Specifically, the LLM parameters () are frozen and adaptable parameters (p) are added for fine-tuning. Only p are updated to minimize the standard LLM training loss (e.g., cross-entropy). Such fine-tuning can improve selective prediction performance because it not only improves the prediction accuracy, but also enhances the likelihood of correct output sequences.
Answer sampling: After task-specific tuning, ASPIRE uses the LLM with the learned p to generate different answers for each training question and create a dataset for self-evaluation learning. We aim to generate output sequences that have a high likelihood. We use beam search as the decoding algorithm to generate high-likelihood output sequences and the Rouge-L metric to determine if the generated output sequence is correct.
Self-evaluation learning: After sampling high-likelihood outputs for each query, ASPIRE adds adaptable parameters (s) and only fine-tunes s for learning self-evaluation. Since the output sequence generation only depends on and p, freezing and the learned p can avoid changing the prediction behaviors of the LLM when learning self-evaluation. We optimize s such that the adapted LLM can distinguish between correct and incorrect answers on their own.
In the proposed framework, p and s can be trained using any parameter-efficient tuning approach. In this work, we use soft prompt tuning, a simple yet effective mechanism for learning soft prompts to condition frozen language models to perform specific downstream tasks more effectively than traditional discrete text prompts. The driving force behind this approach lies in the recognition that if we can develop prompts that effectively stimulate self-evaluation, it should be possible to discover these prompts through soft prompt tuning in conjunction with targeted training objectives.
After training p and s, we obtain the prediction for the query via beam search decoding. We then define a selection score that combines the likelihood of the generated answer with the learned self-evaluation score (i.e., the likelihood of the prediction being correct for the query) to make selective predictions.
To demonstrate ASPIREs efficacy, we evaluate it across three question-answering datasets CoQA, TriviaQA, and SQuAD using various open pre-trained transformer (OPT) models. By training p with soft prompt tuning, we observed a substantial hike in the LLMs' accuracy. For example, the OPT-2.7B model adapted with ASPIRE demonstrated improved performance over the larger, pre-trained OPT-30B model using the CoQA and SQuAD datasets. These results suggest that with suitable adaptations, smaller LLMs might have the capability to match or potentially surpass the accuracy of larger models in some scenarios.
When delving into the computation of selection scores with fixed model predictions, ASPIRE received a higher AUROC score (the probability that a randomly chosen correct output sequence has a higher selection score than a randomly chosen incorrect output sequence) than baseline methods across all datasets. For example, on the CoQA benchmark, ASPIRE improves the AUROC from 51.3% to 80.3% compared to the baselines.
An intriguing pattern emerged from the TriviaQA dataset evaluations. While the pre-trained OPT-30B model demonstrated higher baseline accuracy, its performance in selective prediction did not improve significantly when traditional self-evaluation methods Self-eval and P(True) were applied. In contrast, the smaller OPT-2.7B model, when enhanced with ASPIRE, outperformed in this aspect. This discrepancy underscores a vital insight: larger LLMs utilizing conventional self-evaluation techniques may not be as effective in selective prediction as smaller, ASPIRE-enhanced models.
Our experimental journey with ASPIRE underscores a pivotal shift in the landscape of LLMs: The capacity of a language model is not the be-all and end-all of its performance. Instead, the effectiveness of models can be drastically improved through strategic adaptations, allowing for more precise, confident predictions even in smaller models. As a result, ASPIRE stands as a testament to the potential of LLMs that can judiciously ascertain their own certainty and decisively outperform larger counterparts in selective prediction tasks.
In conclusion, ASPIRE is not just another framework; it's a vision of a future where LLMs can be trusted partners in decision-making. By honing the selective prediction performance, we're inching closer to realizing the full potential of AI in critical applications.
Our research has opened new doors, and we invite the community to build upon this foundation. We're excited to see how ASPIRE will inspire the next generation of LLMs and beyond. To learn more about our findings, we encourage you to read our paper and join us in this thrilling journey towards creating a more reliable and self-aware AI.
We gratefully acknowledge the contributions of Sayna Ebrahimi, Sercan O Arik, Tomas Pfister, and Somesh Jha.
View original post here:
Introducing ASPIRE for selective prediction in LLMs Google Research Blog - Google Research
- Is Google Advertising Revenue 70%, 80%, Or 90% Of Alphabets Total Revenue? - Forbes [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Google My Business Photos Being Added To Google Posts Without Option To Delete - Search Engine Roundtable [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Even amid the affluence of tech capital in Silicon Valley, local news struggles - CNBC [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Where in the world was Santa? It depended on which online tracker you were following - The Boston Globe [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Huawei, Facebook, and Oracle Put Pressure on Google - Market Realist [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Huawei and Google Diverge in Their Treatment of ToTok - Market Realist [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Google Maps: Aftermath of plane crash in Somalia discovered - what happened? - Express [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Why Apple, Google, and other big tech companies create their own fonts - Mashable [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- ProBeat: Google only updated Android distribution data once in 2019 - VentureBeat [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- 10 things to try with your new Google Nest smart speaker - VentureBeat [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Google workers exposed to chemical that causes birth defects - City A.M. [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- The most popular products of 2019, according to Google - TODAY [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Google Chromes five security features that every user should know - Hindustan Times [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Googles YouTube Goes To War With Bitcoin And Crypto [Updated] - Forbes [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Google is poised to make another blitz at CES 2020 - CNET [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- These Were The Top Google Searches And Trends Of 2019 - Forbes [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Google Search now lets you add movies and shows to a 'Watchlist' - Engadget [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- 31-year-old Google executive says reading this one book has had a huge influence on her career - CNBC [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Obama praises book that slams his White House for its Google relationship - Mashable [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Why Google was the most important brand marketer of the 2010s - Fast Company [Last Updated On: December 30th, 2019] [Originally Added On: December 30th, 2019]
- Amazon and Facebook Are the Most 'Evil' Tech Companies, According to Experts. Google Isn't Far Behind - Inc. [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Google Rich Results testing tool now reports on unloadable embedded resources - Search Engine Land [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Google Assistant routines haven't worked on Android Auto for over a year, still no fix in sight (Update: Google acknowledges) - Android Police [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Jussie Smollett is probably toast now that Google is handing his data to the special prosecutor - Washington Examiner [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Americans trust Amazon and Google more than the police or the government - MarketWatch [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Using Google Authenticator? Here's why you should get rid of it - ZDNet [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Googles hidden AR tool will blow your mind - Creative Bloq [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Kids, Want to Win a $30,000 Scholarship and Show Your Art to Billions? Googles Annual Doodle Contest Is Now Open - artnet News [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- 1 Reason 2020 Will Be a Big Year for Google and Facebook - The Motley Fool [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Google Health Exec Defends Controversial Partnership With Ascension: Were Super Proud Of It - Forbes [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Labs arrive in Google app to let you experiment with features like pinch-to-zoom - 9to5Google [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Sorry, Alexa and Siri, but only Google Home can do these 5 things - CNET [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Kittle photobombed by The Rock in roster Google search - NBCSports.com [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- This Is How Your iPhone Is A Cool New Way To Access Google - Forbes [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Googles Takeover of Fitbit Faces Another Regulatory Hurdle - Motley Fool [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Google Health VP on Ascension partnership: 'The press has made this into something it's not' - Healthcare IT News [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Google Maps keeps a detailed record of everywhere you go here's how to stop it - CNBC [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Will Googles more-efficient Reformer mitigate or accelerate the arms race in AI? - ZDNet [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Rachel Bovard: Congress has a role to play in regulating Google - Home - WSFX [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Why Google added little logos next to search results this week - CNBC [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Report: Google wants to bring the Steam game store to Chrome OS? - Ars Technica [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- BT partners with Google to bundle free Stadia with broadband deals in the UK - The Verge [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Google Play [Last Updated On: January 18th, 2020] [Originally Added On: January 18th, 2020]
- Google Photos app for Android will soon phase out the hamburger menu - GSMArena.com news - GSMArena.com [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- What Is Google Coral And Do You Need It? - Lifehacker Australia [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- Google and Amazon limit employees travel because of coronavirus fears - The Verge [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- Google, Toyota Tsusho invest in WhereIsMyTransport to map transport in emerging cities - TechCrunch [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- This Is Huaweis Alarming New Surprise For Google: Heres Why You Should Be Concerned - Forbes [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- Google and Microsoft offer free teleconferencing tools to combat coronavirus - TechRadar [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- Google bans on-site job interviews for the foreseeable future due to coronavirus - The Verge [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- AWS to double sales droids as Google, Microsoft's growing clouds threaten to gobble larger slices of Bezos' pie - The Register [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- Google's Exposure To Travel Will Impact Revenue, BofA Says - Benzinga [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- Google Cloud goes after the telco business with Anthos for Telecom and its Global Mobile Edge Cloud - TechCrunch [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- Apple, Microsoft, Google look to move production away from China. That's not going to be easy - CNBC [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- Google will lose its John Legend Google Assistant voice on March 23rd - The Verge [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- Google and Microsoft are giving away enterprise conferencing tools due to coronavirus - The Verge [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- Google Stadia now supports 4K streaming on the web - The Verge [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- Star Engineer Who Crossed Google Is Ordered to Pay $179 Million to Company - The New York Times [Last Updated On: March 5th, 2020] [Originally Added On: March 5th, 2020]
- Why companies like Microsoft and Google are betting big on Africa - CNBC [Last Updated On: March 8th, 2020] [Originally Added On: March 8th, 2020]
- Google Announces A Coronavirus Incentive For G SuiteAnd Other Small Business Tech News - Forbes [Last Updated On: March 8th, 2020] [Originally Added On: March 8th, 2020]
- Microsoft, Google, and Twitter Are Telling Employees to Work From Home Because of Coronavirus. Should You? - Inc. [Last Updated On: March 8th, 2020] [Originally Added On: March 8th, 2020]
- Facebook, Google among those kicking some cash over to Silicon Valley communities affected by coronavirus cancellations - CNBC [Last Updated On: March 8th, 2020] [Originally Added On: March 8th, 2020]
- Google now giving away three months of Stadia access to Chromecast owners - The Verge [Last Updated On: March 8th, 2020] [Originally Added On: March 8th, 2020]
- Google location data turned a random biker into a burglary suspect - The Verge [Last Updated On: March 8th, 2020] [Originally Added On: March 8th, 2020]
- Apple, Google and others partner with Ad Council and US govt to expand coronavirus messaging - The Drum [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- Google Has No Plans To Postpone Killing Third-Party Cookies In Chrome - AdExchanger [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- Why Zoom is winning so much hype over Microsoft and Google - Business Insider [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- Logged On From the Laundry Room: How the C.E.O.s of Google, Pfizer and Slack Work From Home - The New York Times [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- Google cancels its infamous April Fools jokes this year - The Verge [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- Google Tests Audience Buying In ADH, A Big Step From Analytics To Activation - AdExchanger [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- Googles new Pixel Buds could hit spring release date, as they may have just hit the FCC - The Verge [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- Google Removes Infowars Android App From Online Store Over Coronavirus Misinformation - Variety [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- Cruising Through South Central Los Angeles With Google Street View : The Picture Show - NPR [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- Google ups Duo group calling limit from eight to twelve - The Verge [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- Outside China, Android isnt Android without Google - The Verge [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- Google has banned the Infowars Android app over false coronavirus claims - The Verge [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- My top 3 Google Home pet peeves and how to fix them - CNET [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- Google Unveiled a Massive Stimulus Program of Its Own - Inc. [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- Facebook, Google and Twitter Struggle to Handle Novembers Election - The New York Times [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]
- Test and trace with Apple and Google - TechCrunch [Last Updated On: March 30th, 2020] [Originally Added On: March 30th, 2020]