Microsoft Research Blog

Azure AI milestone: New neural text-to-speech models more closely mirror natural speech

Published December 17, 2021

By Sheng Zhao, Partner Group Engineering Manager



Neural Text-to-Speech—along with recent milestones in computer vision and question answering—is part of a larger Azure AI mission to provide relevant, meaningful AI solutions and services that work better for people because they better capture how people learn and work—with improved vision, knowledge understanding, and speech capabilities. At the center of these efforts is XYZ-code, a joint representation of three cognitive attributes: monolingual text (X), audio or visual sensory signals (Y), and multilingual (Z). For more information about these efforts, read the XYZ-code blog post.

Neural Text-to-Speech (Neural TTS), a powerful speech synthesis capability of Azure Cognitive Services, enables developers to convert text to lifelike speech. It is used in voice assistant scenarios, content read-aloud capabilities, accessibility tools, and more. Neural TTS has now reached a significant milestone in Azure with a new generation of Neural TTS model, Uni-TTSv4, whose quality shows no significant difference from sentence-level natural speech recordings.

Microsoft debuted the original technology three years ago with close to human-parity quality. This resulted in TTS audio that was more fluid, natural sounding, and better articulated. Since then, Neural TTS has been incorporated into Microsoft flagship products such as Edge Read Aloud, Immersive Reader, and Word Read Aloud. It has also been adopted by many customers, such as AT&T, Duolingo, and Progressive. Users can choose from multiple pre-set voices or record and upload their own samples to create custom voices. Over 110 languages are supported, including a wide array of language variants, also known as locales.

The latest version of the model, Uni-TTSv4, is now shipping in production for a first set of eight voices (shown in the table below). We will continue to roll out the new model architecture to the remaining 110-plus languages and Custom Neural Voice in coming milestones. Users will automatically get significantly better-quality TTS through the Azure TTS API, Microsoft Office, and the Edge browser.

Measuring TTS quality

Text-to-speech quality is measured by the Mean Opinion Score (MOS), a widely recognized scoring method for speech quality evaluation. For MOS studies, participants rate speech characteristics of both recordings of people's voices and TTS voices on a five-point scale. These characteristics include sound quality, pronunciation, speaking rate, and articulation. For any model improvement, we first conduct a side-by-side comparative MOS test (CMOS) against the production model. Then, we run a blind MOS test on the held-out recording set (recordings not used in training) and the TTS-synthesized audio, and measure the difference between the two MOS scores.

During research on the new model, Microsoft submitted the Uni-TTSv4 system to Blizzard Challenge 2021 under its code name, DelightfulTTS. Our paper, "DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021," provides in-depth detail on our research and the results. The Blizzard Challenge is a well-known TTS benchmark organized by world-class experts in the TTS field, and it conducts large-scale MOS tests on multiple TTS systems with hundreds of listeners. Results from Blizzard Challenge 2021 demonstrate that the voice built with the new model shows no significant difference from natural speech on the common dataset.


Measurement results for Uni-TTSv4 and comparison

The MOS scores below are based on samples produced by the Uni-TTSv4 model under the constraints of real-time performance requirements.

A Wilcoxon signed-rank test was used to determine whether the MOS scores differed significantly between the held-out recordings and TTS. A p-value of 0.05 or less indicates a statistically significant difference; a p-value above 0.05 does not. A positive CMOS number indicates a gain over the production model, meaning that judges preferred the new voice in terms of naturalness.

Locale (voice)     Human recording (MOS)   Uni-TTSv4 (MOS)   Wilcoxon p-value   CMOS vs. production
en-US (Jenny)      4.33 (±0.04)            4.29 (±0.04)      0.266              +0.116
en-US (Sara)       4.16 (±0.05)            4.12 (±0.05)      0.41               +0.129
zh-CN (Xiaoxiao)   4.54 (±0.05)            4.51 (±0.05)      0.44               +0.181
it-IT (Elsa)       4.59 (±0.04)            4.58 (±0.03)      0.34               +0.25
ja-JP (Nanami)     4.44 (±0.04)            4.37 (±0.05)      0.053              +0.19
ko-KR (Sun-hi)     4.24 (±0.06)            4.15 (±0.06)      0.11               +0.097
es-ES (Alvaro)     4.36 (±0.05)            4.33 (±0.04)      0.312              +0.18
es-MX (Dalia)      4.45 (±0.05)            4.39 (±0.05)      0.103              +0.076
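To make the reported numbers concrete, here is a minimal Python sketch of how a MOS value and its ± interval can be computed from raw ratings, and how the p-value threshold is applied. The ratings are invented sample data, and the ± value is assumed to be a 95% confidence interval (1.96 times the standard error), a common convention for reporting MOS; the study's actual raw scores are not public.

```python
import math

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with a confidence interval (z=1.96 -> ~95%)."""
    n = len(ratings)
    mean = sum(ratings) / n
    # Sample variance (n - 1 in the denominator).
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    ci = z * math.sqrt(var) / math.sqrt(n)
    return mean, ci

def significantly_different(p_value, alpha=0.05):
    """The decision rule applied to the Wilcoxon p-values above."""
    return p_value <= alpha

# Invented example ratings on the 1-5 MOS scale.
ratings = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5]
mean, ci = mos_with_ci(ratings)
print(f"MOS {mean:.2f} (±{ci:.2f})")        # MOS 4.30 (±0.42)
print(significantly_different(0.266))        # False: no significant difference
```

A real study would compute the p-value itself from the paired ratings (for example with a Wilcoxon signed-rank implementation); this sketch only shows how the summary statistics in the table are assembled.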

A comparison of human and Uni-TTSv4 audio samples 

Listen to the recording and TTS samples below to hear the quality of the new model. Note that the recording is not part of the training set.

These voices have been updated to the new model in the Azure TTS online service. You can also try the demo with your own text. More voices will be upgraded to Uni-TTSv4 later.

En-US (Jenny)

The visualizations of the vocal quality continue in a quartet and octet.

Human recording

En-US (Sara)

Like other visitors, he is a believer.

Zh-CN (Xiaoxiao)

另外,也要规避当前的地缘局势风险,等待合适的时机介入。(English: "In addition, we should also avoid the current geopolitical risks and wait for the right moment to get involved.")

It-IT (Elsa)

La riunione del Consiglio di Federazione era prevista per ieri. (English: "The meeting of the Federation Council was scheduled for yesterday.")

Ja-JP (Nanami)

責任はどうなるのでしょうか? (English: "What happens to the responsibility?")

Ko-KR (Sun-hi)

그는 마지막으로 이번 앨범 활동 각오를 밝히며 인터뷰를 마쳤다 (English: "He closed the interview by sharing his resolve for this album's promotions.")

Es-ES (Alvaro)

Al parecer, se trata de una operación vinculada con el tráfico de drogas. (English: "Apparently, it is an operation linked to drug trafficking.")

Es-MX (Dalia)

Haber desempeñado el papel de Primera Dama no es una tarea sencilla. (English: "Having served in the role of First Lady is not an easy task.")

How Uni-TTSv4 works to better represent human speech

Over the past three years, Microsoft has been improving its engine to produce TTS that more closely aligns with human speech. While the quality of typical Neural TTS synthesized speech has been impressive, the perceived quality and naturalness still leave room for improvement compared with human speech recordings. We found this is particularly the case when people listen to TTS for a while. It is in the very subtle nuances, such as variations in tone or pitch, that people can tell whether speech was generated by AI.

Why is it so hard for a TTS voice to reflect human vocal expression more closely? Human speech is usually rich and dynamic. With different emotions and in different contexts, a word is spoken differently. And in many languages this difference can be very subtle. The expressions of a TTS voice are modeled with various acoustic parameters. Currently it is not very efficient for those parameters to model all the coarse-grained and fine-grained details on the acoustic spectrum of human speech. TTS is also a typical one-to-many mapping problem where there could be multiple varying speech outputs (for example, pitch, duration, speaker, prosody, style, and others) for a given text input. Thus, modeling such variation information is important to improve the expressiveness and naturalness of synthesized speech. 

To achieve these improvements in quality and naturalness, Uni-TTSv4 introduces two significant updates in acoustic modeling. In general, transformer models learn the global interaction while convolutions efficiently capture local correlations. First, there’s a new architecture with transformer and convolution blocks, which better model the local and global dependencies in the acoustic model. Second, we model variation information systematically from both explicit perspectives (speaker ID, language ID, pitch, and duration) and implicit perspectives (utterance-level and phoneme-level prosody). These perspectives use supervised and unsupervised learning respectively, which ensures end-to-end audio naturalness and expressiveness. This method achieves a good balance between model performance and controllability, as illustrated below:

Acoustic model and vocoder diagram, described from left to right. Text is input into a text encoder. An arrow points from the text encoder to a spectrum decoder. Both implicit and explicit information are input between the encoder and decoder stages. From the spectrum decoder, an arrow points to a vocoder. The vocoder points to an audio wave visual representation, representing conversion from mel spectrum into audio samples.

To achieve better voice quality, the basic modeling block needs fundamental improvement. Global and local interactions are especially important for non-autoregressive TTS, considering that it has a longer output sequence in the decoder than machine translation or speech recognition, and each frame in the decoder cannot see its history the way an autoregressive model does. So, we designed a new modeling block that combines the best of the transformer and convolution, where self-attention learns the global interaction while convolutions efficiently capture the local correlations.

Improved conformer module diagram from bottom to top. Four layers represented by boxes are each joined by a sub-layer labeled "Add and Norm." Two arrows at the bottom of each layer point to the base layer as well as the sub-layer boxes. The first layer is labeled "Conv Feed Forward." The second layer is "Depthwise Convolution." The third is "Self Attention." The fourth is "Conv Feed Forward."
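The layer layout described above can be sketched in code. The following is a toy numpy illustration of the data flow only, four sub-layers (conv feed-forward, depthwise convolution, self-attention, conv feed-forward), each wrapped in a residual add-and-norm. All weights are random and the sizes are invented, so this shows the structure of the block, not the actual trained DelightfulTTS model.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 16, 32  # toy sizes: frames and hidden dimension

def layer_norm(x, eps=1e-5):
    m = x.mean(-1, keepdims=True)
    s = x.std(-1, keepdims=True)
    return (x - m) / (s + eps)

def depthwise_conv(x, k):
    # Each channel is convolved independently with its own kernel (k: [kernel, D]).
    pad = k.shape[0] // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (xp[t:t + k.shape[0]] * k).sum(0)
    return out

def self_attention(x, wq, wk, wv):
    # Scaled dot-product attention: captures global interactions across frames.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

def conv_ff(x, w1, w2):
    # Position-wise feed-forward (1x1 convolution) with ReLU.
    return np.maximum(x @ w1, 0) @ w2

# Random toy parameters.
w1a, w2a = rng.normal(size=(D, 4 * D)), rng.normal(size=(4 * D, D))
w1b, w2b = rng.normal(size=(D, 4 * D)), rng.normal(size=(4 * D, D))
kernel = rng.normal(size=(3, D))
wq, wk, wv = (rng.normal(size=(D, D)) for _ in range(3))

x = rng.normal(size=(T, D))
# Four sub-layers, each in a residual "add & norm", bottom to top as in the diagram.
x = layer_norm(x + conv_ff(x, w1a, w2a))
x = layer_norm(x + depthwise_conv(x, kernel))
x = layer_norm(x + self_attention(x, wq, wk, wv))
x = layer_norm(x + conv_ff(x, w1b, w2b))
print(x.shape)  # (16, 32)
```

The residual connections keep the sequence shape unchanged, which is what lets the block be stacked repeatedly in the acoustic model.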

The new variance adaptor, based on FastSpeech 2, introduces a hierarchical implicit-information modeling pipeline covering utterance-level and phoneme-level prosody, together with explicit information such as duration, pitch, speaker ID, and language ID. Modeling these variations effectively mitigates the one-to-many mapping problem and improves the expressiveness and fidelity of synthesized speech.

Variance adaptor diagram from bottom to top. Along the left side, an arrow moves from bottom to top, ending in a circle labeled “LR” and showing the full modeling process. To the right of the vertical arrow, Language and Speaker ID are added to hidden embeddings. Next, vectors are predicted with an utterance-level prosody predictor and then a phoneme-level prosody predictor. Then, a pitch predictor is used, and the hidden representation is expanded with a duration predictor.
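The variance adaptor's data flow can be followed in a toy scalar sketch. In the real model each quantity below is a learned embedding or predictor network; here they are plain numbers and hand-supplied values, so only the order of operations from the diagram is illustrated (explicit speaker/language conditions, implicit utterance- and phoneme-level prosody, pitch, then the length regulator "LR").

```python
def variance_adaptor(hidden, speaker, language, utt_prosody,
                     phn_prosody, pitch, durations):
    """Toy scalar version of the variance adaptor data flow."""
    # Explicit conditions: speaker and language embeddings are added first.
    h = [v + speaker + language for v in hidden]
    # Implicit conditions: one utterance-level prosody value for the whole
    # sequence, then one phoneme-level prosody value per phoneme.
    h = [v + utt_prosody for v in h]
    h = [v + p for v, p in zip(h, phn_prosody)]
    # Explicit pitch, per phoneme.
    h = [v + p for v, p in zip(h, pitch)]
    # Length regulator ("LR"): repeat each phoneme's hidden value for its
    # predicted duration, expanding from phoneme to frame resolution.
    frames = []
    for v, d in zip(h, durations):
        frames += [v] * d
    return frames

# Two phonemes, expanded to 2 and 3 frames respectively.
out = variance_adaptor([1.0, 2.0], 0.1, 0.2, 0.01,
                       [0.0, 0.0], [0.0, 0.0], [2, 3])
print(len(out))  # 5 frames
```

At training time the prosody, pitch, and duration values are predicted from the text (and supervised where labels exist); the sketch simply passes them in.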

We use our previously proposed HiFiNet, a new generation of Neural TTS vocoder, to convert the spectrum into audio samples.

  • Publication: DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021

For more details about the system, refer to the paper.

Working to advance AI with XYZ-code in a responsible way

We are excited about the future of Neural TTS with human-centric and natural-sounding quality under the XYZ-code AI framework. Microsoft is committed to the advancement and use of AI grounded in principles that put people first and benefit society. We are putting these Microsoft AI principles into practice throughout the company and strongly encourage developers to do the same. For guidance on deploying AI responsibly, visit Responsible use of AI with Cognitive Services.

Get started with Neural TTS in Azure

Neural TTS in Azure offers over 270 neural voices across over 110 languages and locales. In addition, the capability enables organizations to create a unique brand voice in multiple languages and styles. To explore the capabilities of Neural TTS with some of its different voice offerings, try the demo.
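Getting speech out of a neural voice takes only a few lines with the Speech SDK. The following is a minimal Python sketch, not a full application: the function wrapper is illustrative, `key` and `region` are your Azure Speech resource credentials, and "en-US-JennyNeural" is one of the voices discussed above.

```python
def synthesize_to_default_speaker(key: str, region: str, text: str):
    """Speak `text` through the default speaker with a neural voice.

    Requires: pip install azure-cognitiveservices-speech
    (imported lazily so the sketch can be loaded without the SDK installed).
    """
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.SpeechConfig(subscription=key, region=region)
    # Pick a neural voice; Jenny is one of the eight Uni-TTSv4 voices.
    config.speech_synthesis_voice_name = "en-US-JennyNeural"
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=config)
    # Synthesize and play; returns a result with audio data and status.
    return synthesizer.speak_text_async(text).get()
```

Calling `synthesize_to_default_speaker(key, region, "Hello, world")` plays the audio on the local machine; the SDK can also write to a file or stream via an `AudioOutputConfig`.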

For more information:  

  • Read our documentation.
  • Check out our sample code.
  • Check out the code of conduct for integrating Neural TTS into your apps.

Acknowledgments

The research behind Uni-TTSv4 was conducted by a team of researchers from across Microsoft, including Yanqing Liu, Zhihang Xu, Xu Tan, Bohan Li, Xiaoqiang Wang, Songze Wu, Jie Ding, Peter Pan, Cheng Wen, Gang Wang, Runnan Li, Jin Wu, Jinzhu Li, Xi Wang, Yan Deng, Jingzhou Yang, Lei He, Sheng Zhao, Tao Qin, Tie-Yan Liu, Frank Soong, Li Jiang, and Xuedong Huang, with support from the Azure Speech and Cognitive Services, Integrated Training Platform, and ONNX Runtime teams, who made this accomplishment possible.

Related publications

  • FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
  • DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021

Meet the authors

Sheng Zhao, Partner Group Engineering Manager


Use voice typing to talk instead of type on your PC

With voice typing, you can enter text on your PC by speaking. Voice typing uses online speech recognition, which is powered by Azure Speech services.

How to start voice typing

To use voice typing, you'll need to be connected to the internet, have a working microphone, and have your cursor in a text box.

Once you turn on voice typing, it will start listening automatically. Wait for the "Listening..." alert before you start speaking.

Turn on voice typing

Press Windows logo key + H on a hardware keyboard

Select the microphone key next to the Spacebar on the touch keyboard

To stop voice typing, say a voice typing command such as "stop listening," or select the microphone button in the voice typing menu.

Note:  Press Windows logo key + Alt + H to navigate through the voice typing menu with your keyboard. 

Install a voice typing language

You can use a voice typing language that's different from the one you've chosen for Windows. Here's how:

1. Select Start > Settings > Time & language > Language & region.

2. Find Preferred languages in the list and select Add a language.

3. Search for the language you'd like to install, then select Next.

4. Select Next, or install any optional language features you'd like to use. These features, including speech recognition, aren't required for voice typing to work.

To see this feature's supported languages, see the list in this article.

Switch voice typing languages

To switch voice typing languages, you'll need to change the input language you use. Here's how:

  • Select the language switcher in the corner of your taskbar.
  • Press Windows logo key + Spacebar on a hardware keyboard.
  • Press the language switcher in the bottom right of the touch keyboard.

Supported languages

These languages support voice typing in Windows 11:

  • Chinese (Simplified, China)
  • Chinese (Traditional, Hong Kong SAR)
  • Chinese (Traditional, Taiwan)
  • Dutch (Netherlands)
  • English (Australia)
  • English (Canada)
  • English (India)
  • English (New Zealand)
  • English (United Kingdom)
  • English (United States)
  • French (Canada)
  • French (France)
  • Italian (Italy)
  • Norwegian (Bokmål)
  • Portuguese (Brazil)
  • Portuguese (Portugal)
  • Romanian (Romania)
  • Spanish (Mexico)
  • Spanish (Spain)
  • Swedish (Sweden)
  • Tamil (India)

Dictation commands

Use dictation commands to tell your PC what to do, like "delete that" or "select the previous word."

The following table tells you what you can say. If a word or phrase is in bold, it's an example. Replace it with similar words to get the result you want.

Clear a selection: Clear selection; unselect that

Delete the most recent dictation result or currently selected text: Delete that; strike that

Delete a unit of text, such as the current word: Delete

Move the cursor to the first character after a specified word or phrase: Go after that; move after; go to the end of; move to the end of that

Move the cursor to the end of a unit of text: Go after; move after; go to the end of that; move to the end of

Move the cursor backward by a unit of text: Move back to the previous; go up to the previous

Move the cursor to the first character before a specified word or phrase: Go to the start of the

Move the cursor to the start of a text unit: Go before that; move to the start of that

Move the cursor forward to the next unit of text: Move forward to the; go down to the

Move the cursor to the end of a text unit: Move to the end of the; go to the end of the

Enter one of the following keys (Tab, Enter, End, Home, Page up, Page down, Backspace, Delete): Tap; press

Select a specific word or phrase: Select

Select the most recent dictation result: Select that

Select a unit of text: Select the; select the

Turn spelling mode on and off: Start spelling; stop spelling

Dictating letters, numbers, punctuation, and symbols

You can dictate most numbers and punctuation by saying the number or punctuation character. To dictate letters and symbols, say "start spelling." Then say the symbol or letter, or use the ICAO phonetic alphabet.

To dictate an uppercase letter, say “uppercase” before the letter. For example, “uppercase A” or “uppercase alpha.” When you’re done, say “stop spelling.”

Here are the punctuation characters and symbols you can dictate.

@ : at symbol; at sign

# : pound symbol; pound sign; number symbol; number sign; hash symbol; hash sign; hashtag symbol; hashtag sign; sharp symbol; sharp sign

$ : dollar symbol; dollar sign; dollars symbol; dollars sign

% : percent symbol; percent sign

^ : caret

& : and symbol; and sign; ampersand symbol; ampersand sign

* : asterisk; times; star

( : open paren; left paren; open parenthesis; left parenthesis

) : close paren; right paren; close parenthesis; right parenthesis

_ : underscore

- : hyphen; dash; minus sign

~ : tilde

\ : backslash; whack

/ : forward slash; divided by

, : comma

. : period; dot; decimal; point

; : semicolon

' : apostrophe; open single quote; begin single quote; close single quote; end single quote

= : equal symbol; equal sign; equals symbol; equals sign

(space) : space

| : pipe

: : colon

? : question mark; question symbol

[ : open bracket; open square bracket; left bracket; left square bracket

] : close bracket; close square bracket; right bracket; right square bracket

{ : open curly brace; open curly bracket; left curly brace; left curly bracket

} : close curly brace; close curly bracket; right curly brace; right curly bracket

+ : plus symbol; plus sign

< : open angle bracket; open less than; left angle bracket; left less than

> : close angle bracket; close greater than; right angle bracket; right greater than

" : open quotes; begin quotes; close quotes; end quotes; open double quotes; begin double quotes; close double quotes; end double quotes
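The symbol list above is, in effect, a mapping from spoken names to characters. A small Python sketch of such a lookup, covering a subset of the names from the list (the function name and token format are illustrative, not how Windows implements it):

```python
# Subset of the spoken-name -> character mapping from the symbol list above.
SPOKEN_SYMBOLS = {
    "at sign": "@",
    "hash sign": "#",
    "dollar sign": "$",
    "percent sign": "%",
    "caret": "^",
    "ampersand sign": "&",
    "asterisk": "*",
    "open paren": "(",
    "close paren": ")",
    "underscore": "_",
    "hyphen": "-",
    "tilde": "~",
    "backslash": "\\",
    "forward slash": "/",
    "comma": ",",
    "period": ".",
    "semicolon": ";",
    "equal sign": "=",
    "space": " ",
    "pipe": "|",
    "colon": ":",
    "question mark": "?",
}

def dictate(tokens):
    """Map spoken tokens to characters; unknown tokens pass through as-is."""
    return "".join(SPOKEN_SYMBOLS.get(t, t) for t in tokens)

print(dictate(["a", "at sign", "b", "period", "c"]))  # a@b.c
```

In spelling mode, a recognizer would produce tokens like these from speech and then apply exactly this kind of lookup.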

Dictation commands are available in US English only.

You can dictate basic text, symbols, letters, and numbers in these languages:

  • Simplified Chinese
  • English (Australia, Canada, India, United Kingdom)
  • French (France, Canada)
  • Spanish (Mexico, Spain)

To dictate in other languages, see Use voice recognition in Windows.



What is speech to text?


Azure AI Speech service offers advanced speech to text capabilities. This feature supports both real-time and batch transcription, providing versatile solutions for converting audio streams into text.

Core Features

The speech to text service offers the following core features:

  • Real-time transcription: Instant transcription with intermediate results for live audio inputs.
  • Fast transcription: Fastest synchronous output for situations with predictable latency.
  • Batch transcription: Efficient processing for large volumes of prerecorded audio.
  • Custom speech: Models with enhanced accuracy for specific domains and conditions.

Real-time speech to text

Real-time speech to text transcribes audio as it's recognized from a microphone or file. It's ideal for applications requiring immediate transcription, such as:

  • Transcriptions, captions, or subtitles for live meetings: Real-time audio transcription for accessibility and record-keeping.
  • Diarization: Identifying and distinguishing between different speakers in the audio.
  • Pronunciation assessment: Evaluating and providing feedback on pronunciation accuracy.
  • Call center agent assist: Providing real-time transcription to assist customer service representatives.
  • Dictation: Transcribing spoken words into written text for documentation purposes.
  • Voice agents: Enabling interactive voice response systems to transcribe user queries and commands.

Real-time speech to text is available via the Speech SDK, the Speech CLI, and REST APIs such as the fast transcription API, allowing integration into various applications and workflows.
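The Speech SDK route can be sketched in a few lines of Python. This is a minimal, single-utterance example, not a production pattern: the function wrapper is illustrative, and `key` and `region` are the credentials of your Azure Speech resource.

```python
def recognize_once_from_microphone(key: str, region: str) -> str:
    """Recognize a single utterance from the default microphone.

    Requires: pip install azure-cognitiveservices-speech
    (imported lazily so the sketch can be loaded without the SDK installed).
    """
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
    # Blocks until one utterance is recognized (or silence/timeout).
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    return ""
```

For continuous transcription with intermediate results, the SDK's `start_continuous_recognition` with event callbacks is the usual pattern; the single-shot call above is just the smallest working example.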

Fast transcription (Preview)

The fast transcription API transcribes audio files, returning results synchronously and faster than real-time audio. Use fast transcription in scenarios where you need the transcript of an audio recording as quickly as possible with predictable latency, such as:

  • Quick audio or video transcription and subtitles: Quickly get a transcription of an entire video or audio file in one go.
  • Video translation: Immediately get new subtitles for a video if you have audio in different languages.

Fast transcription API is only available via the speech to text REST API version 2024-05-15-preview and later.

To get started with fast transcription, see Use the fast transcription API (preview).
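Because fast transcription is REST-only, the request is just an HTTPS call. The helper below builds the endpoint URL; the host naming is an assumption based on the standard Cognitive Services endpoint shape, so verify it against the REST reference for your resource.

```python
def fast_transcription_url(region: str,
                           api_version: str = "2024-05-15-preview") -> str:
    """Build the fast transcription endpoint URL for a given region.

    Assumed endpoint shape; the versioned path matches the API version
    mentioned above. Authentication (Ocp-Apim-Subscription-Key header)
    and the multipart body with the audio file are added to the request.
    """
    return (f"https://{region}.api.cognitive.microsoft.com/speechtotext/"
            f"transcriptions:transcribe?api-version={api_version}")

print(fast_transcription_url("eastus"))
```

A client would POST the audio file (plus a small JSON definition with the locale) to this URL and read the transcript directly from the synchronous response.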

Batch transcription API

Batch transcription is designed for transcribing large amounts of audio stored in files. This method processes audio asynchronously and is suited for:

  • Transcriptions, captions, or subtitles for prerecorded audio: Converting stored audio content into text.
  • Contact center post-call analytics: Analyzing recorded calls to extract valuable insights.
  • Diarization: Differentiating between speakers in recorded audio.

Batch transcription is available via:

  • Speech to text REST API: Facilitates batch processing with the flexibility of RESTful calls. To get started, see How to use batch transcription and Batch transcription samples.
  • Speech CLI: Supports both real-time and batch transcription, making it easy to manage transcription tasks. For Speech CLI help with batch transcriptions, run the following command:
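On the REST side, submitting a batch job means POSTing a small JSON description of the audio to the transcriptions endpoint. The sketch below builds such a request with the standard library; the v3.2 path and body fields follow the speech to text REST API but should be treated as assumptions and checked against the reference for your API version. The request is returned rather than sent, since sending requires a real key.

```python
import json
from urllib import request

def create_batch_transcription(key: str, region: str,
                               audio_url: str, locale: str = "en-US"):
    """Build (but do not send) a batch transcription job request."""
    url = (f"https://{region}.api.cognitive.microsoft.com/"
           f"speechtotext/v3.2/transcriptions")
    body = json.dumps({
        "contentUrls": [audio_url],   # where the prerecorded audio lives
        "locale": locale,
        "displayName": "example batch job",
    }).encode()
    return request.Request(url, data=body, method="POST", headers={
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    })

req = create_batch_transcription("KEY", "eastus",
                                 "https://example.com/audio.wav")
print(req.full_url)
```

A caller would `urlopen(req)`, read the job URL from the response, and poll it until the transcription status is `Succeeded`, then download the result files.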

Custom speech

With custom speech, you can evaluate and improve the accuracy of speech recognition for your applications and products. A custom speech model can be used for real-time speech to text, speech translation, and batch transcription.

A hosted deployment endpoint isn't required to use custom speech with the Batch transcription API. You can conserve resources if the custom speech model is only used for batch transcription. For more information, see Speech service pricing.

Out of the box, speech recognition utilizes a Universal Language Model as a base model that is trained with Microsoft-owned data and reflects commonly used spoken language. The base model is pretrained with dialects and phonetics representing various common domains. When you make a speech recognition request, the most recent base model for each supported language is used by default. The base model works well in most speech recognition scenarios.

Custom speech allows you to tailor the speech recognition model to better suit your application's specific needs. This can be particularly useful for:

  • Improving recognition of domain-specific vocabulary: Train the model with text data relevant to your field.
  • Enhancing accuracy for specific audio conditions: Use audio data with reference transcriptions to refine the model.

For more information about custom speech, see the custom speech overview and the speech to text REST API documentation.

For details about customization options per language and locale, see the language and voice support for the Speech service documentation.

Usage Examples

Here are some practical examples of how you can utilize Azure AI speech to text:

  • A virtual event platform needs to provide real-time captions for webinars: integrate real-time speech to text using the Speech SDK to transcribe spoken content into captions displayed live during the event.
  • A call center wants to assist agents by providing real-time transcriptions of customer calls: use real-time speech to text via the Speech CLI to transcribe calls, enabling agents to better understand and respond to customer queries.
  • A video-hosting platform wants to quickly generate a set of subtitles for a video: use fast transcription to quickly get a set of subtitles for the entire video.
  • An e-learning platform aims to provide transcriptions for video lectures: apply batch transcription through the speech to text REST API to process prerecorded lecture videos, generating text transcripts for students.
  • A healthcare provider needs to document patient consultations: use real-time speech to text for dictation, allowing healthcare professionals to speak their notes and have them transcribed instantly; use a custom model to enhance recognition of specific medical terms.
  • A media company wants to create subtitles for a large archive of videos: use batch transcription to process the video files in bulk, generating accurate subtitles for each video.
  • A market research firm needs to analyze customer feedback from audio recordings: employ batch transcription to convert audio feedback into text, enabling easier analysis and insights extraction.

Responsible AI

An AI system includes not only the technology, but also the people who use it, the people who are affected by it, and the environment in which it's deployed. Read the transparency notes to learn about responsible AI use and deployment in your systems.

  • Transparency note and use cases
  • Characteristics and limitations
  • Integration and responsible use
  • Data, privacy, and security

Related content

  • Get started with speech to text
  • Create a batch transcription
  • For detailed pricing information, visit the Speech service pricing page.


Microsoft Sam TTS Generator is an online interface for part of Microsoft Speech API 4.0 which was released in 1998.

  • Select your voice. Note that BonziBUDDY voice is actually an "Adult Male #2" with a specific pitch and speed.
  • Select your pitch and speed. All voices have lower and upper pitch and speed limits.
  • Enter your text and press "Say it". Wait for the generated audio to appear in the audio player. It should be done nearly instantly, as the interface tries to generate audio at 16777215× real-time.
  • To save the generated audio, right-click on the audio player and press "Save audio as..."


Google's former CEO says the tech giant is losing out to OpenAI and Anthropic because staff are working from home

  • OpenAI and Anthropic have been giving Google a run for its money in the AI race.
  • Ex-Google CEO Eric Schmidt says the company's work-from-home policy is hurting its competitiveness.
  • "The reason the startups work is because the people work like hell," Schmidt said.

Insider Today

Remote working has blunted Google's competitiveness in the AI race , says the company's former CEO and chairman, Eric Schmidt .

Schmidt was speaking to students at Stanford University during a lecture in April when he was asked about the lead that startups like OpenAI and Anthropic currently have over Google when it comes to AI .

A recording of Schmidt's lecture was published on Stanford Online's YouTube channel on Tuesday.

"Google decided that work-life balance and going home early and working from home was more important than winning," Schmidt said."And the reason the startups work is because the people work like hell."

"I'm sorry to be so blunt, but the fact of the matter is, if you all leave the university and go found a company, you're not going to let people work from home and only come in one day a week if you want to compete against the other startups."

Representatives for Google didn't immediately respond to a request for comment from Business Insider sent outside regular business hours.

Schmidt was Google's CEO and chairman from 2001 and 2011 before handing the reins back to the search giant's co-founder Larry Page .

He went on to serve as Google's executive chairman and technical advisor before finally departing the company in early 2020.

Schmidt isn't the only executive who thinks remote working has hurt businesses. JPMorgan CEO Jamie Dimon , for one, has been an outspoken advocate for staff to head back to the office.

"It doesn't work for younger kids in apprenticeships, it doesn't really work for creativity and spontaneity, it doesn't really work for management teams," Dimon told The Economist in an interview that aired in July 2023.

Related stories

To be sure, Google has been easing away from a 100% remote work arrangement for employees since 2022.

Google employees are currently working on a hybrid model where they spend "approximately three days in the office and two days wherever they work best—whether that's at the office or at home," per the company's 2022 Diversity Annual Report.

Google also began tracking office badge attendance and using it as a metric in performance reviews, CNBC reported in June 2023, citing internal memos it had seen.

"Of course, not everyone believes in 'magical hallway conversations,' but there's no question that working together in the same room makes a positive difference," Google's chief people officer, Fiona Cicconi, wrote in an employee email obtained by CNBC.


