2024 is the Year of Multimodal AI

19 Dec 2023 writing

Retroactive Grade: Correct. 2024 will be remembered as the year when multimodal AI became mainstream, with all frontier chatbots and APIs integrating multimodal inputs and outputs.

2023: A Year of Triumph for Natural Language Interfaces

Last year, I predicted that 2023 would be the year of natural language interfaces, and in retrospect that looks like a wildly conservative guess. The tech world has been abuzz with the proliferation of Copilots, “ChatGPT for X” products, and OpenAI’s announcement of the GPT Store. It's been a rollercoaster since ChatGPT came onto the scene about 13 months ago: a blink of an eye in tech years, yet a real leap in technological evolution.

The Dawn of Multimodal AI in 2024

Imagine designing a building, composing a symphony, or planning a health regimen, all facilitated by just a few photos or spoken words. This isn't sci-fi; it's the imminent reality powered by multimodal AI. My prediction for 2024 is that multimodal generative AI will go mainstream. In a nutshell, multimodal AI is about moving beyond text in generative AI to embrace inputs and outputs in varied formats like images, video, and audio.

Recent Multimodal Explorations

In recent months, major chatbots like ChatGPT, Bard, and Bing have added multimodal capabilities, and I’ve been playing with them a lot since then. Some of my favorite use cases so far:

App Development: While aiding a friend with their app, I input UX wireframes and a brief description – voilà, out came a UML diagram for the required data model.
Creative Assistance: Working on a holiday card, I fed the AI a color palette and design brief, and it suggested an array of compatible colors and palette alterations.
Real-World Perception: Unsure of what to buy for a new recipe, I uploaded a recipe screenshot and a fridge photo – the AI listed the ingredients I needed.
Visual Understanding: At a French restaurant, I snapped a photo of the menu, added my dietary preferences, and the AI recommended dishes to match.

The Unleashing of Multimodal Foundation Models

The real game-changer will be the widespread availability of multimodal foundation models via APIs in early 2024, paving the way for fine-tuning by mid-year. As a product builder and software developer, I anticipate four widespread use cases for these APIs:

Image Classification: Input an image corpus and a set of candidate labels, get a text classification for each image.
Design Advice: Submit a photo of a bedroom and remodeling instructions, receive actionable interior design suggestions.
Creative Augmentation: Feed a low-fidelity CAD file with a brief to envision a building, and get multiple floor plan variations.
Real World Intelligence: Input a work site photo with a request to identify safety hazards, and receive a list of potential risks.
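To make the shape of these use cases concrete, here is a minimal sketch of how the work site example might be packaged for a multimodal API. It only builds the request body and does not call any service. The function name and every field name ("inputs", "type", "mime_type", "data") are hypothetical and do not match any particular vendor's schema; a real API will define its own format.

```python
import base64
import json


def build_multimodal_request(image_bytes: bytes, instruction: str,
                             mime_type: str = "image/jpeg") -> str:
    """Bundle one image and one text instruction into a JSON request body.

    Assumption: the field names below are invented for illustration only.
    """
    payload = {
        "inputs": [
            {"type": "text", "text": instruction},
            {
                "type": "image",
                "mime_type": mime_type,
                # Binary image data is base64-encoded so it can travel inside JSON.
                "data": base64.b64encode(image_bytes).decode("ascii"),
            },
        ]
    }
    return json.dumps(payload)


# Placeholder bytes stand in for a real work site photo.
photo = b"\xff\xd8\xff\xe0 not a real photo"
body = build_multimodal_request(
    photo, "List the safety hazards visible in this photo."
)
```

The same pattern covers the other three use cases: only the instruction text and the attached file change, which is a large part of why a single multimodal endpoint is appealing to product builders.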

Beyond the Text Box: A Renaissance in Human-Computer Interaction

The evolution towards more conversational, domain-specific interactions marks a Renaissance era in human-computer interaction and product design. This shift from mere 'natural language' to a more holistic conversational model mirrors the depth of human professional interactions.

A Cambrian Explosion of Creativity

Just as text-to-text generative AI amplified creativity across numerous fields, I envision multimodal generative AI sparking a Cambrian explosion of creative applications. Its impact will ripple across an even broader spectrum of industries and disciplines, redefining the boundaries of what's possible.