deepglugs June 23, 2021 Are AI voices in games the future?

Are AI voices in games the future?

Over a year ago, I developed a VN voiced by an AI voice actress. With the permission of a real-life voice actress, I carefully prepared dozens of minutes of audio, meticulously transcribed it, and set out to train an AI model in the hope that I could use it in my game. After weeks of training, I finally had a voice that sounded similar to the original human voice actress. It was the first adult VN to use the technique, and although the voice wasn’t very good, I felt the technique had advantages. Later, I switched to a newer, better AI voice framework and, after more months of trial and error, figured out how to produce decent-quality AI voices for my current game. By adding several other voice actresses and actors, I am now able to produce a VN that is over 90% voiced with either AI or real voices. Having used both techniques in a single game, I feel I can expound on the pros and cons of the method.

Pros

  • Rapid prototyping
  • Can closely resemble human voice and emotion
  • Much better than traditional text-to-speech
  • Potentially Cheaper

Cons

  • Massive up-front work
  • Limited emotions
  • Occasional bad voice production

Pro: Rapid Prototyping

## The version of the game.
define config.version = "0.5.0"

## Automatically play voices/<line id>.ogg for each line of dialog, if that file exists.
define config.auto_voice = "voices/{id}.ogg"

With each build of the game, I extract the dialog using Ren’Py’s tools and generate a new voiced audio file for each line. I then use Ren’Py’s auto-voice feature (the config.auto_voice setting above) to play the matching voice file automatically during the game. The result is that I can rapidly prototype changes to both the dialog and the voices just by editing the Ren’Py script.

With the traditional method, I’d have to write the Ren’Py dialog, extract it, send it to a voice actress, wait for her to finish, and then manually process the files into the form Ren’Py needs. If I change the dialog, I have to repeat the whole process.
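To make the automated loop concrete, here is a minimal Python sketch of the "extract, then generate only what is missing" step. It is an illustration, not the script actually used for the game. It assumes the dialog has been exported as a tab-delimited file whose header includes "Identifier" and "Dialogue" columns, and that voice files are named after the line identifier, matching the "voices/{id}.ogg" pattern above. The function name plan_voice_jobs and the sample identifiers are hypothetical.

```python
import csv
import io
from pathlib import Path


def plan_voice_jobs(dialogue_tab_text, voice_dir="voices", existing=()):
    """Return (line_id, text, output_path) for every line that still needs audio.

    dialogue_tab_text: contents of a tab-delimited dialog export (assumed to
                       have "Identifier" and "Dialogue" columns).
    existing:          output paths that already have a voice file.
    """
    # QUOTE_NONE keeps quotation marks inside dialog lines untouched.
    reader = csv.DictReader(
        io.StringIO(dialogue_tab_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    existing = set(existing)
    jobs = []
    for row in reader:
        line_id = row["Identifier"]
        text = row["Dialogue"].strip()
        out_path = (Path(voice_dir) / f"{line_id}.ogg").as_posix()
        # Skip empty lines and lines that already have audio.
        if text and out_path not in existing:
            jobs.append((line_id, text, out_path))
    return jobs


# Hypothetical sample export with two lines, one of which is already voiced.
sample = (
    "Identifier\tCharacter\tDialogue\n"
    "start_a1b2c3d4\te\tWelcome back.\n"
    "start_e5f6a7b8\te\tDid you miss me?\n"
)
jobs = plan_voice_jobs(sample, existing={"voices/start_a1b2c3d4.ogg"})
```

Each job in the returned list would then be handed to whatever text-to-speech model is in use, with the result saved at the job's output path so the auto-voice setting can pick it up.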

Pro: Can closely resemble human voice and emotion

Here are a few samples from my VN. Some of the voices are real, some are generated by AI. Can you tell the difference?

Mixed AI and Real voice samples from the VN Euryale’s Gambit

I currently use an open-source framework called “Flowtron”. It’s probably state of the art as far as AI voices are concerned. It produces pretty clear audio and can even produce “emotive” samples (although I haven’t had much luck with that so far).

Pro: Potentially Cheaper

I paid the voice actors and actresses I worked with for the data used to train the model, as well as for permission to use the “likeness” of their voice in my game. For characters who have a lot of dialog, it was more economical to use AI voices than traditional voices. This is not always true. It doesn’t make economic sense to pay for all the training data (dozens of minutes) for just 1 minute of dialog; in that case it’s cheaper to pay the voice actress to voice the character directly. I also pay for real voices when I need more emotion than the AI voice can deliver. There usually aren’t many of these situations, so it isn’t overly expensive.
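The trade-off can be put into rough numbers. The sketch below is a simplification under stated assumptions: recorded audio is paid at one flat per-minute rate whether it is used as training data or as final dialog, the likeness permission is a one-time fee, and compute time and my own labor are ignored. The function name and every figure are hypothetical, not what I actually paid.

```python
def voice_costs(dialog_minutes, rate_per_minute, training_minutes, likeness_fee=0.0):
    """Return (human_cost, ai_cost) for voicing one character.

    human_cost: the actress records every minute of dialog.
    ai_cost:    the actress records only the training data, plus a one-time
                fee for using the likeness of her voice.
    """
    human_cost = dialog_minutes * rate_per_minute
    ai_cost = training_minutes * rate_per_minute + likeness_fee
    return human_cost, ai_cost


# Hypothetical figures: 30 minutes of training data at 10 per minute, fee of 50.
minor_character = voice_costs(1, 10, 30, likeness_fee=50)    # human is cheaper
main_character = voice_costs(120, 10, 30, likeness_fee=50)   # AI is cheaper
```

Under these assumptions the AI cost is fixed per voice, so it only wins once a character has more dialog than the training data plus fee would have bought from the actress directly.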

Con: Massive up-front work

This cannot be overstated. Preparing hours of audio data, making sure the transcriptions are correct, and then training the model was a lot of work. I’ve been working with Flowtron for almost a year, and only with my next release do I feel I’ve been able to get voices of a quality that does justice to the great voice actresses I work with. After preparing the data, it takes several days of training before you get good results. Most of the time, the results are garbage and you need to tweak the learning rate or use only a subset of the data. I found that using data samples that are more than 5 seconds long produces the best results, but again, it took almost a year of trial and error to figure that out.
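Selecting a subset by clip length is easy to automate. The following is a minimal sketch, assuming the training data is listed in a text file with one "audio_path|transcription|speaker_id" entry per line and that the clips are uncompressed WAV files. The function names are hypothetical, and the 5-second threshold is simply the value that worked for me.

```python
import wave


def wav_duration_seconds(path):
    """Length of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def filter_filelist(lines, min_seconds=5.0, duration_of=wav_duration_seconds):
    """Keep only entries whose audio clip is longer than min_seconds.

    lines:       iterable of "audio_path|transcription|speaker_id" strings
                 (an assumed layout).
    duration_of: function mapping an audio path to its length in seconds;
                 replaceable so the filter can be tried without real audio.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        audio_path = line.split("|", 1)[0]
        if duration_of(audio_path) > min_seconds:
            kept.append(line)
    return kept
```

Passing the duration lookup in as a parameter is a deliberate choice: the same filter can be dry-run against a table of known lengths before pointing it at the real recordings.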

There is a faster way, which is what I used from versions 0.1 to 0.4: “few-shot” fine-tuning. With this method, you take an existing pre-trained voice model (I used the one provided by Flowtron) and fine-tune only the speaker embedding (the part of the network that primarily determines who a voice sounds like). This produced generally mistake-free voices, but they only vaguely resembled the original voice actresses.

Either way, the result is a trained AI voice model you can use for as many games as you like. In the long run, it can be worth the effort.

Con: Limited emotions

It is still very difficult for me to get good emotional voice audio out of my models. For now, I would say that if you need emotion or a very specific delivery (e.g. sarcasm or pauses in the right places), use a real voice actress. I see no reason that you can’t mix AI and real voice actresses, so long as the AI voice is patterned after the real one. I believe this can still produce consistency within the game.

Con: Occasional bad voice production

Sometimes the voices the AI actress produces have stutters or just sound bad. This can really take away from the player experience. One way to work around it is to reword or rephrase the dialog so that the AI voice can more easily produce quality results.

Conclusion

Are AI voices the future? If the technology continues to improve, I don’t see why not, but probably not for every situation. There are still many situations that require a human touch, either for technical or sentimental reasons. I believe that will always be true no matter how good we make AI.

What do you think? Are AI voices good enough? Do you think AI voices in games have a future? Let me know in the comments below.

Batman
27 days ago

This approach is very interesting, but my first rule of playing adult games (or porn in general) is to mute the sound before I start, so it doesn’t really matter to me if there’s any voices, only that there are subtitles.