Microsoft has announced the general availability of Custom Neural Voice, a Text-to-Speech (TTS) feature of Speech in Azure Cognitive Services that lets brands create a one-of-a-kind, customised synthetic voice.

What Is It?

Custom Neural Voice uses deep neural networks and a powerful base model, built with speech data from many different speakers, to create a neural text-to-speech (TTS) model. Rather than relying on classical programming or statistical methods, the model learns the way phonetics are combined in natural human speech. The result is a very natural-sounding voice.

Microsoft is now inviting customers to apply for approval to use Custom Neural Voice. Alternatively, developers can add TTS capabilities to their apps by creating an Azure Speech instance and selecting from over 200 pre-built TTS and Neural TTS voices across 54 languages/locales.
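As a rough illustration of what "selecting a pre-built voice" can look like in practice, the sketch below builds a minimal SSML (Speech Synthesis Markup Language) document that names the voice to use. It relies only on the Python standard library. The helper name build_ssml and the voice name "en-US-JennyNeural" are assumptions chosen for the example, not details from Microsoft's announcement, and the step of actually sending the SSML to an Azure Speech instance is deliberately left out because it needs a subscription key.

```python
from xml.sax.saxutils import escape, quoteattr

# W3C namespace used by SSML documents.
SSML_NAMESPACE = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text: str, voice_name: str, locale: str = "en-US") -> str:
    """Wrap plain text in a minimal SSML document that requests one voice.

    build_ssml is a hypothetical helper written for this article.
    """
    return (
        f'<speak version="1.0" xmlns="{SSML_NAMESPACE}" '
        f"xml:lang={quoteattr(locale)}>"
        # escape() protects characters such as & and < in the spoken text.
        f"<voice name={quoteattr(voice_name)}>{escape(text)}</voice>"
        f"</speak>"
    )


if __name__ == "__main__":
    # "en-US-JennyNeural" is an assumed example of a pre-built neural voice.
    print(build_ssml("Welcome back. How can we help today?", "en-US-JennyNeural"))
```

A string like this could then be handed to a speech client, for example the speak_ssml_async method of a SpeechSynthesizer in the Azure Speech SDK for Python. Swapping the voice name is all it takes to try a different pre-built voice.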


The benefit of this synthetic voice system is that, thanks to the extra power of the deep neural networks and base model, it does not require a large volume of voice data to produce a fluent, natural sound. Users can, therefore, expect to be able to build realistic voices with just a small number of training audio recordings. Companies can spend a fraction of the effort traditionally needed to prepare training data, while at the same time getting more natural synthetic speech output than traditional training methods produce.

Why Have A Brand Voice?

According to Microsoft, we are now in a world where voice-based interactions are increasingly becoming the norm and, therefore, “your voice is your brand”. Microsoft says that a recognisable digital brand voice can help customers connect with a brand in new ways.

Microsoft points out that it has received interest in customised synthetic brand voices from a range of businesses across the Media and Entertainment, Telecom, Automobile, Education, and Hospitality sectors. Examples of where and how a brand voice can be used include in apps, on a website (customer service chatbots), in videos, on the telephone (call centre operations combined with conversational AI), on a range of devices (e.g. phones, speakers, TV/cable boxes), in cars as a key interaction point with customers, in smart voice assistants, in online learning materials and audio books, for public service announcements (stations, airports and venues), or as assistive technology to help with accessibility.

What Does This Mean For Your Business?

Bots are commonplace these days and, as Microsoft’s announcement demonstrates, both the technology to quickly create a realistic ‘brand voice’ and the opportunities for companies to use one are now much more widespread. Realistic, AI-powered voices can be really helpful to companies that want to scale up customer service without huge expense, and a brand voice is a flexible tool that can help companies to reinforce their brand in a very modern way. Because Microsoft is giving access to the power of its deep neural networks and base model, companies wanting to use its synthetic voice (which they can apply to do) can get a really professional-sounding brand voice together much more quickly, and for a fraction of the effort, than if they used traditional methods. Making this technology available, albeit by application, means that many more smaller businesses can now seriously consider having their own realistic-sounding voice/bot. The pace at which this kind of technology is developing is good news for all kinds of companies looking to use it as an element of their service in the near future.