A multimodal model is an AI that doesn't just understand text — it can also see images, hear audio, or even watch videos. For example, you can show a photo to Claude and ask it to describe it, or send a chart to GPT-4 and ask it to analyze it. It's as if the AI has multiple senses instead of just one.