2024: the year of multimodal AI

Retroactive Grade (February 2025): Correct. My prediction was spot on. 2024 became the year when multimodal AI went mainstream, with every major chatbot and API now handling images, video, and audio as naturally as text.


I called 2023 the year of natural language interfaces. I was wrong; it was bigger than that. ChatGPT spawned thousands of copycats and specialized tools, and the tech world transformed in 13 months.

Today, I see something bigger coming: AI that works with images, video, and sound as naturally as it does with text.

I tested the early versions:

  • I showed an AI my fridge and a recipe. It told me exactly what to buy.
  • I gave it rough wireframes for an app. It produced a complete data model.
  • At a French restaurant, I photographed the menu and listed my allergies. It found the safe dishes. (A sketch of this kind of request follows the list.)
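
For the curious, here is a minimal sketch of what that last request looks like in code, using the OpenAI Python client. The model name, the image path, and the allergy prompt are illustrative assumptions on my part; any vision-capable chat model accepts the same shape of request.

```python
# Minimal sketch: send a photo plus a text question to a vision-capable
# chat model. The model name ("gpt-4o") and file path ("menu.jpg") are
# assumptions for illustration, not specifics from this post.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the menu photo as a base64 data URL.
with open("menu.jpg", "rb") as f:
    menu_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "I'm allergic to shellfish and peanuts. "
                     "Which dishes on this menu are safe for me?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{menu_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```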

I predict that by mid-2024, every developer will have access to these capabilities. They'll build tools that:

  • Look at photos of your house and suggest specific renovation ideas
  • Watch construction sites for safety violations in real-time
  • Turn rough sketches into detailed architectural plans
  • Analyze entire image libraries and find patterns humans miss (a batch-tagging sketch follows this list)
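
To make that last item concrete, here is a hedged sketch of batch-tagging a photo library with the same kind of API, collecting short descriptions you could then mine for patterns. The directory name, model, and prompt are again my assumptions, not a specific product's interface.

```python
# Hedged sketch: loop over a (hypothetical) folder of photos, ask a
# vision-capable chat model to tag each one, and collect the answers
# for later pattern-finding or search.
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe(image_path: Path) -> str:
    """Send one photo to the model and return its tags as text."""
    b64 = base64.b64encode(image_path.read_bytes()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model; any vision-capable model works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List the objects, setting, and anything unusual "
                         "in this photo as short comma-separated tags."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Tag every photo in the library directory and print the results.
notes = {p.name: describe(p) for p in Path("photos").glob("*.jpg")}
for name, tags in notes.items():
    print(f"{name}: {tags}")
```

The point of the sketch is how little ceremony is involved: one helper function and a loop turn a general-purpose vision model into a rudimentary library analyzer.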

Remember how ChatGPT changed writing and coding? This shift will be broader. Architects, designers, doctors, and builders will all have AI assistants that can see and hear the world as they do.

We're moving from reading and writing with AI to showing and seeing. The keyboard was just the beginning.