Virtual Humans: Communication

Chapter 5 deals with what we've defined as symbolic and non-symbolic communication. Symbolic communication is essentially any communication that uses a well-codified and structured language: speech and writing, but also sign language. Non-symbolic communication covers less structured, uncodified methods such as gesture, expression and body language.

On this page we summarise work on the first two topics listed below. The other three have their own pages, as they are so key to virtual human (VH) work, and to our own interests!

Speech Recognition (Speech to Text)
Speech Generation/Synthesis (Text to Speech)

Natural Language Understanding and Communication
Natural Language Generation
Internal Dialogue

We treat the first two relatively lightly because we think any flexible virtual human system should simply be able to use whatever automatic speech recognition (ASR) and text-to-speech (TTS) systems are available through an API; they aren't key to virtual human work and are being driven forward by plenty of other use cases. We do recognise, though, that speech recognition performance in particular can be improved by a tight feedback loop between the audio detection and the user intent derived so far from the conversation.
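As a rough illustration of these two points, the sketch below hides ASR and TTS behind small adapter interfaces and re-ranks a recogniser's N-best list using terms the dialogue currently expects. Everything named here (SpeechRecogniser, SpeechSynthesiser, Hypothesis, rescore, CannedRecogniser and the additive weight) is hypothetical and not taken from any vendor's API. Real services return results in their own formats, and a production system might bias the recogniser itself rather than re-rank afterwards.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass
class Hypothesis:
    text: str          # candidate transcript from the recogniser
    confidence: float  # recogniser's own score, assumed to lie in 0..1


class SpeechRecogniser(Protocol):
    """Hypothetical adapter: each vendor API would get a thin wrapper."""

    def recognise(self, audio: bytes) -> list[Hypothesis]: ...


class SpeechSynthesiser(Protocol):
    """Hypothetical adapter for any TTS service."""

    def speak(self, text: str) -> bytes: ...


def rescore(
    hypotheses: list[Hypothesis],
    expected_terms: set[str],
    weight: float = 0.1,  # assumed bonus per matching term, chosen arbitrarily
) -> list[Hypothesis]:
    """Re-rank N-best hypotheses, boosting those that mention expected terms."""

    def score(h: Hypothesis) -> float:
        words = set(h.text.lower().split())
        return h.confidence + weight * len(words & expected_terms)

    return sorted(hypotheses, key=score, reverse=True)


class CannedRecogniser:
    """Stand-in recogniser returning fixed results, for illustration only."""

    def recognise(self, audio: bytes) -> list[Hypothesis]:
        return [
            Hypothesis("I'd like to book a fright", 0.62),
            Hypothesis("I'd like to book a flight", 0.58),
        ]
```

If the dialogue manager believes the user is arranging travel, calling rescore(CannedRecogniser().recognise(b""), {"flight", "ticket", "airport"}) moves the "flight" transcript ahead of the acoustically preferred "fright". With an empty set of expected terms the recogniser's original order is kept.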

Speech Recognition (ASR)

Interesting links include:

Google Web Speech API demo - also GitHub of demos
IBM Watson ASR demo
Microsoft Bing ASR demo
JavaScript Speech Recognition How-To
Speech Recognition Anywhere - Chrome extension, works pretty well

References of note include:

Besacier, L., Barnard, E., Karpov, A., & Schultz, T. (2014). Automatic speech recognition for under-resourced languages: A survey. Speech Communication, 56, 85-100.
Cutajar, M., Gatt, E., Grech, I., Casha, O., & Micallef, J. (2013). Comparative study of automatic speech recognition techniques. Signal Processing, IET, 7(1), 25-46.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., ... & Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6), 82-97.
Johnson, M., Lapkin, S., Long, V., Sanchez, P., Suominen, H., Basilakis, J., & Dawson, L. (2014). A systematic review of speech recognition technology in health care. BMC Medical Informatics and Decision Making, 14(1), 94.
Kelly, S. D. (2001). Broadening the units of analysis in communication: Speech and nonverbal behaviours in pragmatic comprehension. Journal of Child Language, 28(2), 325-349.
Koolagudi, S. G., & Rao, K. S. (2012). Emotion recognition from speech: A review. International Journal of Speech Technology, 15(2), 99-117.

Speech Generation/Synthesis

Interesting links and demos include:

SSML: A speech synthesis markup language and on Alexa
Sitepal - good range of "traditional" voices with easy-to-use JavaScript API
Lyrebird - voice "cloning" - so-so results at the moment (Nov 18)
VocalID - voice "cloning"
iSpeech - TTS service

(we'll add to this or post in the blog as we find new ones)
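To give a feel for what SSML looks like in practice, here is a minimal sketch that builds a short SSML fragment with Python's standard library. The speak, emphasis, break and prosody elements come from the W3C SSML specification; the function name build_ssml and its parameters are our own illustrative choices, and which elements and attributes a given TTS service accepts varies, so check the service's documentation before relying on any of them.

```python
import xml.etree.ElementTree as ET


def build_ssml(greeting: str, message: str, pause_ms: int = 400) -> str:
    """Return an SSML string: emphasised greeting, a pause, then a slower message."""
    speak = ET.Element("speak")

    # Stress the greeting.
    emphasis = ET.SubElement(speak, "emphasis", level="strong")
    emphasis.text = greeting

    # Insert a short silence between greeting and message.
    ET.SubElement(speak, "break", time=f"{pause_ms}ms")

    # Deliver the main message at a slower speaking rate.
    prosody = ET.SubElement(speak, "prosody", rate="slow")
    prosody.text = message

    # ElementTree escapes characters such as & and < in the text for us.
    return ET.tostring(speak, encoding="unicode")
```

Building the markup through an XML library rather than string concatenation means user-supplied text cannot accidentally break the document structure. The resulting string would then be passed to whichever TTS service is in use in place of plain text.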

References of note include:

Black, A. W., Bunnell, H. T., Dou, Y., Muthukumar, P. K., Metze, F., Perry, D., ... & Vaughn, C. (2012). Articulatory features for expressive speech synthesis. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on (pp. 4005-4008). IEEE.
Kisner, J. (2018, Jan 23). The technology giving voice to the voiceless. The Guardian. Available online: https://www.theguardian.com/news/2018/jan/23/voice-replacement-technology-adaptive-alternative-communication-vocalid
Taylor, P. (2009). Text-to-speech synthesis. Cambridge: Cambridge University Press.
Yamagishi, J., Veaux, C., King, S., & Renals, S. (2012). Speech synthesis technologies for individuals with vocal disabilities: Voice banking and reconstruction. Acoustical Science and Technology, 33(1), 1-5.
