Are Speech Technologies and AI Replacements for Human Captioners?

[Image: a screen filled with code in the foreground, with a female doll behind it]

A month ago I attended an event about speech technologies. The organizer who invited me assumed that I would not need a live CART human captioner to follow the panelists, or a sign language interpreter for the networking part, even though I had asked her for both. She assumed this despite having known me for many years as an experienced accessibility consultant, book author, and public speaker, and despite having seen me use human communication access service providers at events.

Eventually the event organizers realized the limitations of speech technologies and informed me that live captions would be projected on a screen by a human CART steno captioner and that a sign language interpreter would be arranged for me. There were networking parts before and after the event. I talked to one of the panelists, who is deaf. He didn’t know sign language, so he communicated with me through a sign language interpreter and read live captions of my signs as the interpreter voiced them. Those captions were streamed to his mobile device by a remote human steno captioner. I am familiar with that service, as I have sometimes used it for remote captioning. When I asked him why he was not using speech recognition to communicate with people during networking, he told me that it was not good enough, which didn’t surprise me. However, he claimed that YouTube auto captions are great. Many deaf people would disagree with that; they even call auto captions “craptions”. The problem is not with YouTube auto captions themselves, but with video producers’ lack of accountability and their limited understanding of captioning quality.

The panel was interesting. The panelists discussed various topics related to speech technologies, such as voice commands and text to speech and speech to text services. The deaf panelist described the limitations of speech recognition for live captions and explained why the live captions projected on the screen at the event were provided by an on-site human steno captioner, who was sitting right next to the screen. Many attendees didn’t even realize that the captions were not machine generated! However, he also claimed that YouTube auto captions were great. During the Q&A session I raised my hand to agree that live captioning can only be provided by a trained human professional. I commended YouTube for offering a great captioning tool that makes it easier and faster for video producers to create captions, while also explaining that this feature is not meant for publishing videos with unedited auto generated captions, as there are many quality captioning guidelines that only humans can follow. That’s why video producers need to ensure that captions are accurate, and either clean up auto captions or hire someone to create good quality captions (especially for professional videos). To my surprise, the audience members showed their agreement by clapping, and even some panelists admitted that they often think only about the technical side of speech technologies and not about the user experience side.

Sadly, that event was not the only example of the lack of awareness about the difference in quality between machine and human speech to text translation. There is still a misconception that speech technologies are a “perfect solution” for deaf and hard of hearing people to access aural information. At a recent conference by Google, for example, many attendees were exclaiming how “great” the AI was at providing the live captioning that was projected on screens during presentations, without realizing that the captions were actually provided by human steno captioners! The attendees made that assumption because Google is among the organizations that work on speech technologies.

On top of this, there has been new hype lately about how AI can “beat” humans at lipreading and provide captions based on lipreading. For the same reasons that apply to understanding human speech, machines cannot be better than humans at lipreading. Lipreading is not an exact science: it provides only around 30% of the information visually, and the rest is guesswork that depends on various factors. Some people are better at lipreading than others, but even experienced lipreaders cannot catch 100% of what is said. I say this based on my personal experience as a deaf person and on the experiences of many other deaf people who share the same sentiment.

Captioning is more than just showing words. Good quality text translation also includes proper grammar, spelling, punctuation, speaker identification, sound descriptions, and many other elements of quality speech to text guidelines. A captioner also needs to be able to deal with accents, background noise, poor audio, and so on. Good quality live captions have a minimum accuracy of 98%. Good quality video captions are 100% accurate. Trying to understand bad captions is like trying to read an unedited book full of grammatical mistakes, or like struggling with bad audio. Research indicates that a caption error rate of more than 3% decreases comprehension of the content. Many speech technologies cannot meet these minimum accuracy requirements, and for that reason they will not be beating or replacing humans for years to come.
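To make the 98% figure concrete, here is a minimal sketch of how caption accuracy could be estimated by comparing a caption against a verified transcript. It assumes accuracy is measured as one minus the word error rate (word-level edit distance divided by the number of words in the reference); the sources behind the figures above may define accuracy differently, and the function names are made up for illustration. It counts only word-level differences, so it says nothing about speaker identification, sound descriptions, or timing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from caption
                curr[j - 1] + 1,     # extra word in caption
                prev[j - 1] + cost,  # match or wrong word
            ))
        prev = curr
    return prev[-1] / len(ref)


def meets_minimum_accuracy(reference: str, hypothesis: str,
                           minimum_accuracy: float = 0.98) -> bool:
    """Assumed check: accuracy = 1 - word error rate, compared to a minimum."""
    return 1.0 - word_error_rate(reference, hypothesis) >= minimum_accuracy


spoken = "please take the elevator to floor eleven"
caption = "please take the alligator to floor eleven"
print(word_error_rate(spoken, caption))
print(meets_minimum_accuracy(spoken, caption))
```

In this made-up example a single wrong word in a seven-word sentence already puts the caption far below the 98% minimum, which shows how little room for error that threshold leaves.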

Does this mean that speech technologies are a bad idea and should be discouraged? No. I actually think they are a great technology, but only if used in the right way. For example, I would love to use them for informal in-person conversations with family, friends, or anyone else. Lipreading is not easy, and communicating in writing may be too slow. In those situations I would not care about the quality of the text produced by speech technologies, as long as I can get as many words as possible and can ask the people I am talking with to repeat or clarify certain words. It would also facilitate my lipreading. Interestingly, many hearing people seem to worry more about perfect grammar, spelling, and punctuation when communicating with me in person via writing or typing, while they don’t seem to care about this when adding captions to their videos!

For formal settings such as professional videos or formal events, however, speech technologies are not appropriate. That is where good quality human communication access providers, such as live captioners, video captioners, or sign language interpreters, are needed. This is also mandated by disability laws. I find it sad that developers seem to focus more on how to replace human providers with speech technologies in formal settings than on the communication access needs of many deaf and hard of hearing people during informal conversations. We want to be able to strike up a personal conversation with anyone using speech technologies, regardless of how perfect they are. That is when we need them most, to facilitate our lipreading and our listening with hearing devices. There are applications out there like Dragon or Siri, but they often do not work well in real time. That’s why we would like developers to focus more on improving access during informal personal conversations.

To get a better idea of the shortcomings of speech technology, I suggest you check out a couple of videos: “Scottish Elevator – Eleven!” and “Caption Fail: Jamaican Vacation Hoax”.

I hope you consider quality when adding captions to your videos or thinking about how to make your events accessible. To learn more about how to do this, contact us for consulting services, training sessions, or workshops.