What can AI do in other domains?
Generative AI models have demonstrated impressive capabilities across other modalities, including sound, music, video, and 3D shapes. In the audio domain, models can synthesize natural speech, generate original music compositions, and even mimic a speaker’s voice along with its prosody (the patterns of rhythm and sound). Speech-to-text systems, also known as automatic speech recognition (ASR), convert spoken language into text. For video, AI systems can create photorealistic footage from text prompts and perform sophisticated editing such as object removal. 3D models have learned to reconstruct scenes from images and generate intricate objects from textual descriptions.
The following table summarizes some recent models in these domains:
| Model | Organization | Year | Domain | Architecture | Performance |
| --- | --- | --- | --- | --- | --- |
| 3D-GQN | DeepMind | 2018 | 3D | Deep, iterative, latent variable density models | 3D scene generation from 2D images |
| Jukebox | OpenAI | 2020 | Music | VQ-VAE + transformer | High-fidelity music generation in different styles |
| Whisper | OpenAI | 2022 | Sound/speech | Transformer | Near human-level speech recognition |
| Imagen Video | Google | 2022 | Video | Frozen text transformers + video diffusion models | High-definition video generation from text |
| Phenaki | Google & UCL | 2022 | Video | Bidirectional masked transformer | Realistic video generation from text |
| TecoGAN | U. Munich | 2022 | Video | Temporal coherence module | High-quality, smooth video generation |
| DreamFusion | Google | 2022 | 3D | NeRF + Diffusion | High-fidelity 3D object generation from text |
| AudioLM | Google | 2023 | Sound/speech | Tokenizer + transformer LM + detokenizer | High linguistic quality speech generation maintaining speaker’s identity |
| AudioGen | Meta AI | 2023 | Sound/speech | Transformer + text guidance | High-quality conditional and unconditional audio generation |
| Universal Speech Model (USM) | Google | 2023 | Sound/speech | Encoder-decoder transformer | State-of-the-art multilingual speech recognition |
Table 1.1: Models for audio, video, and 3D domains
Underlying many of these innovations are advances in deep generative architectures like GANs, diffusion models, and transformers. Leading AI labs at Google, OpenAI, Meta, and DeepMind are pushing the boundaries of what’s possible.