Struggling to understand a mumbling colleague? Can’t make out what the support representative on the other end of the phone is saying? Technology comes to the rescue. It turns out that parsing an unfamiliar accent can significantly increase your cognitive load (and thus the amount of energy you spend trying to understand someone). Sayso aims to solve this problem by giving developers an API that converts spoken English from one accent to another in near real time.
As someone who speaks with an accent, I have mixed feelings about this technology. I love the variety of voices around me, and it’s easy to see how this technology could be misused; imagine if everyone who speaks with a certain accent were automatically “corrected” to the same pronunciation. On the other hand, people happily use Zoom backgrounds and TikTok filters, and with careful handling it’s easy to see how a light cosmetic touch on heavily accented speech could help for reasons of accessibility or intelligibility. And there is no shortage of people who can’t use speech recognition systems because of their accent. The memes may be funny, but for someone who can’t operate their car’s voice controls, it’s a real problem.
Many speech-to-text technologies use natural language processing (NLP) to approximate what a person is saying. Sayso’s technology doesn’t care about the actual words; it takes individual sounds and modifies them to make them more intelligible.
“We don’t do anything with words and sentences. Instead, we edit the signal directly; we work with the component elements of speech. I mean things like voice, intonation, speech content and pronunciation, and we can work with fillers like ‘mm’ and ‘ah.’ We can change one component or multiple components at the same time, and if we want, we can do it in real time,” explains Ganna Timeko, founder and CEO of Sayso. “When we started, our goal was to help people understand each other more easily. But that vision has since expanded to connecting people to technology. It’s a big, broad vision in which speech recognition and smart speaker technologies work for every speaker.”
The company says it approaches speech organically: the way the mouth, tongue and lips shape sound, and the way the vocal cords add spice to the mix.
“Articulatory gestures are just groups of sounds. Interestingly, they are independent of language and accent. Our mouths can only produce a certain number of sounds, no matter what language we speak. Our voices are filtered through these articulatory gestures, and the result is much more complex. We take this sound wave and cut it into very small pieces, milliseconds long,” explains Timeko. “That makes it suitable for real-time processing. We match speech from one utterance with another utterance, so we have parallel data, and we teach our system what one speaker’s sound wave looks like compared to another speaker’s. Then we reshape the sound wave to bring it closer to the desired pronunciation. The advantage is that it’s universal; it doesn’t depend on the accent.”
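Sayso hasn’t published its algorithm, but the frame-by-frame processing Timeko describes can be sketched in broad strokes: slice the waveform into millisecond-scale frames, apply a learned source-to-target mapping to each frame, and stitch the result back together. In the sketch below, the frame length, sample rate and the identity `mapping` are all illustrative assumptions, not the company’s actual method.

```python
# Rough sketch of frame-by-frame waveform processing.
# FRAME_MS and SAMPLE_RATE are illustrative assumptions, not Sayso's values.

FRAME_MS = 20        # frame length in milliseconds (assumption)
SAMPLE_RATE = 16000  # samples per second (assumption)

def slice_into_frames(samples, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Cut a waveform into short, fixed-length frames."""
    frame_len = sample_rate * frame_ms // 1000
    return [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]

def map_frame(frame, mapping):
    """Placeholder for a learned source-to-target transformation.
    A real system would predict the target accent's waveform shape here."""
    return [mapping(s) for s in frame]

def convert(samples, mapping):
    """Slice, transform each frame, and reassemble the waveform."""
    frames = slice_into_frames(samples)
    return [s for frame in frames for s in map_frame(frame, mapping)]

# Toy example: 100 ms of a dummy signal with an identity "mapping".
signal = [0.1] * (SAMPLE_RATE // 10)
out = convert(signal, lambda s: s)
print(len(slice_into_frames(signal)))  # 100 ms / 20 ms = 5 frames
```

Because each frame is only milliseconds long, processing can keep pace with live speech, which is what makes the real-time claim plausible.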
The company started by matching specific pairs of accents: Sayso began teaching its system with a Hindi-English and American-English accent pair, but has since expanded to Chinese, Spanish and Japanese accents. The system leaves cadence, word choice, intonation and stress untouched. In fact, the company prides itself on changing the audio as little as possible, nudging just a few sounds to make the accent easier to understand. It may seem problematic (not to mention dreadfully boring) to morph everyone’s voice into Brad Pitt’s or Angelina Jolie’s, but the founder assured me there’s a lot more nuance to it. With a future version of the company’s tech, if I prefer everyone I talk to to sound like me, quirky Dutch accent and all, that becomes possible. It would also be possible to map every accent to the one each listener is most familiar with, meaning each person on a phone call could hear a different accent, one similar to their own.
“Diversity, inclusion and access are at the heart of everything I do here. I started this because I have an accent and people don’t understand it. I worked for a very large company here in Silicon Valley,” explains Timeko, declining to name the company. “I made a video for them and used my own voice for the voiceover. They liked the video and didn’t want to change anything, but they said my voice wasn’t right. I’m like, hey, what’s wrong with my voice? I wondered whether there was any software that would let me change the accent. There wasn’t, so they had to hire an actor and redo everything. But it made me think very deeply about it.”
The company says that people who are used to each other’s accents understand each other better. A New Zealander, for example, will find it easier to understand fellow Kiwis than someone speaking with a Scottish accent.
“We really want people to understand each other more easily, and the person who is easiest to understand is the one we are most familiar with. We’re starting with something relatively universal as an MVP,” Timeko explains. “But we can turn anything into anything. The goal is that when you listen to someone, you choose the accent that’s easiest for you. I love accents and don’t want to remove them.”
While shifting accents could turn into a morally and/or ethically fraught scenario, Sayso’s technology also has more practical applications. For example, when I interview business owners, I record the conversations and use a transcription service to make sure I have a written transcript. There is a very strong correlation between how close a founder’s accent is to standard Hollywood English and how good the transcription turns out. For someone with a strong Dutch or Indian accent, the transcription is pretty bad; running the audio through a Sayso-like filter before transcribing it could result in a better transcript.
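As a sketch of that workflow, consider a pre-processing step ahead of any speech-to-text call. Both functions below are hypothetical stand-ins (neither `accent_filter` nor `transcribe` is a real Sayso or transcription-service API); the point is only the ordering: convert the accent first, then transcribe.

```python
# Illustrative pipeline only: accent_filter and transcribe are hypothetical
# stand-ins, not real Sayso or speech-to-text APIs.

def accent_filter(audio: bytes) -> bytes:
    """Stand-in for a Sayso-like accent-conversion pass.
    Here it simply returns the audio unchanged."""
    return audio

def transcribe(audio: bytes) -> str:
    """Stand-in for any speech-to-text service."""
    return f"<transcript of {len(audio)} bytes>"

def transcribe_with_filter(audio: bytes) -> str:
    # Normalize the accent first, then hand the audio to the STT engine,
    # which tends to perform best on accents close to its training data.
    return transcribe(accent_filter(audio))

result = transcribe_with_filter(b"\x00" * 1600)
print(result)
```

The design choice worth noting is that the filter is a drop-in stage: the transcription service sees ordinary audio and needs no changes of its own.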
“[Transcription] is part of our business strategy,” explains Timeko. “Take automatic subtitles, for example. It often surprises me how bad they are, and no one checks them manually. Our technology is certainly applicable to transcription.”
The company provided a demo to show a snapshot of what the modified speech sounds like: