Mind the (Accent) Gap: DefinedCrowd Contributing to More Inclusive Speech Technology – PRNewswire

Posted: July 29, 2021 at 8:47 pm

SEATTLE, July 29, 2021 /PRNewswire/ -- DefinedCrowd, the one-stop-shop for high-quality artificial intelligence training data, today released the first of a series of free Spanish-accented English speech datasets to allow AI developers to test how well their speech recognition models understand nonnative English speakers, a demographic represented by over 35 million people in the United States.

"There is an accent gap in speech technology. Research shows that speech recognition technologies are not nearly as accurate in understanding nonnative accents as they are in understanding white, non-immigrant, upper-middle-class Americans," said Dr. Daniela Braga, founder and CEO of DefinedCrowd.

It is not a surprising phenomenon; it is this demographic that had access to and trained the technology from the beginning. To address the bias present in speech recognition technology, DefinedCrowd has released the first of four sets of Spanish-accented English speech datasets, which developers can use to test or benchmark their models to identify bias and areas that need more training data.

"Unfortunately, it has resulted in models that are more useful to some people than to others. And that must change," said Dr. Braga.

However, many companies do not have the resources to train or test their systems with different accents, meaning that speech recognition systems are likely to provide an unresponsive, inaccurate, and even isolating experience to nonnative English speakers.

This is clearly bad for business: according to the U.S. Census, over 35 million people in the United States are native speakers of a language other than English. Sixty percent of these people speak Spanish at home.

"For companies with AI solutions to compete in the large nonnative English-speaking market in the U.S., speech models need to be able to understand a wide range of different Spanish accents, originating from all the Americas," said Christopher Shulby, Director of Machine Learning Engineering at DefinedCrowd.

The first dataset, released in two phases, includes Spanish-accented English data from the Americas, including Argentina, Brazil, Canada, Chile, Colombia, Dominican Republic, Guatemala, Honduras, Mexico, Nicaragua, Panama, Peru, the United States, Uruguay and Venezuela.

Subsequent releases will include datasets from native Spanish speakers from around the world, including Australia, China, Finland, France, Germany, India, Israel, Italy, Norway, Portugal, Russia, Spain, Sweden, and the United Kingdom.

The datasets represent speakers aged 18 to 40, with an equal distribution of male and female speakers.

To access the data, developers will need to register on DefinedCrowd's Marketplace here, after which they will receive a link to download the dataset.
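
As a rough illustration of the kind of benchmarking described above, the sketch below computes word error rate (WER) per accent group, so that groups where a model underperforms stand out. It is a minimal sketch under stated assumptions, not DefinedCrowd's method: the release does not describe the dataset's file format, so the (accent, reference, transcript) tuples, the function names, and the sample sentences are all hypothetical.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming Levenshtein distance computed over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_accent(samples):
    """samples: iterable of (accent_label, reference_text, model_transcript).

    The tuple layout is an assumption made for this example only.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for accent, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[accent] += e
        words[accent] += n
    return {a: errors[a] / words[a] for a in words if words[a] > 0}


# Hypothetical transcripts; real input would come from running a model
# on the downloaded audio and pairing its output with the reference text.
samples = [
    ("Mexico", "turn on the kitchen lights", "turn on the kitchen lights"),
    ("Mexico", "call my sister", "call my system"),
    ("Chile", "what is the weather today", "what is whether today"),
]
print(wer_by_accent(samples))  # {'Mexico': 0.125, 'Chile': 0.4}
```

Comparing these per-group rates against a baseline measured on native-speaker audio would give one simple indication of where more training data may be needed.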

Contact: [email protected]


SOURCE DefinedCrowd Corp.
