Nguyen Le Phong

Multimodal AI: Text, Vision, and Audio

A practical introduction to multimodal AI: how text, images, audio, and video change product workflows, what becomes possible, and why evaluation, privacy, accessibility, and human review matter more as inputs become richer.

A teammate once sent a bug report with a screenshot, a short screen recording, and one tired sentence: "This flow feels wrong after checkout." A few years ago, that report would still have required a person to watch, interpret, write steps, guess the broken state, and translate the feeling into a ticket. Now it is increasingly possible for AI to read the text, inspect the image, follow the video, and produce a first structured explanation of what might be happening.

That is the simple promise behind multimodal AI. Instead of treating language as the only input, the system can work with several forms of information: text, images, audio, video, diagrams, documents, screenshots, charts, and sometimes sensor-like data. The world does not arrive to us as clean text. Work arrives as a messy mixture. A customer sends a voice note. A QA tester records a video. A warehouse worker takes a photo. A doctor reads an image. A designer reviews a screen. Multimodal AI tries to meet the work closer to its natural shape.

The shift matters because many business processes lose information when everything must be converted into text first. A screenshot contains layout, spacing, visual state, and error messages. A voice note contains hesitation, sequence, and sometimes emotion. A chart contains relationships that are hard to describe line by line. When AI can inspect these inputs directly, the workflow can become less dependent on a human doing the first translation step.

For product teams, this opens useful possibilities. Support systems can summarize a customer video into reproduction steps. QA workflows can compare expected and actual UI states. Field teams can document damage, inventory, or installation issues with photos. Learning tools can explain a diagram. Accessibility tools can describe visual content. Meeting tools can connect audio, transcript, slides, and decisions. The value is not that AI becomes magical. The value is that context becomes richer.

But richer context also means richer risk. Text prompts can already leak sensitive data. Images and audio can leak much more: faces, screens, locations, background documents, customer names, private conversations, or details nobody meant to include. A multimodal workflow needs clear rules about what can be uploaded, where it is processed, how long it is retained, who can view outputs, and whether the model provider can use the data for improvement. Privacy becomes more concrete when the input is no longer only words.
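One way to make those rules concrete is to write them down as a check that runs before anything is uploaded. The sketch below is only an illustration under my own assumptions: the names `UploadPolicy`, `Upload`, and `check_upload`, and the policy values, are hypothetical and not taken from any real product or library. It covers media type, retention, and consent; who can view outputs and how the provider uses the data would need their own controls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UploadPolicy:
    allowed_types: frozenset      # media types the team has agreed to accept
    max_retention_days: int       # longest the raw file may be kept
    require_consent: bool         # whether recorded consent is mandatory


@dataclass(frozen=True)
class Upload:
    media_type: str
    retention_days: int
    consent_recorded: bool


def check_upload(upload: Upload, policy: UploadPolicy) -> list:
    """Return policy violations; an empty list means the upload may proceed."""
    problems = []
    if upload.media_type not in policy.allowed_types:
        problems.append(f"media type not allowed: {upload.media_type}")
    if upload.retention_days > policy.max_retention_days:
        problems.append("retention exceeds policy limit")
    if policy.require_consent and not upload.consent_recorded:
        problems.append("no consent recorded")
    return problems


# Hypothetical policy values, for illustration only.
POLICY = UploadPolicy(frozenset({"image/png", "audio/wav"}), 90, True)

print(check_upload(Upload("image/png", 30, True), POLICY))
print(check_upload(Upload("video/mp4", 400, False), POLICY))
```

Returning a list of problems rather than a single yes or no keeps the refusal explainable to the person who tried to upload the file.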

Evaluation also becomes harder. With text, we can often compare an answer against a source document. With images or audio, correctness can be more subtle. Did the model read the small number correctly? Did it understand the speaker's intent or only transcribe the words? Did it miss the important part of the screenshot because the error was visually small? A fluent explanation can still be wrong. Multimodal AI needs tests that match the actual task, not only demos that look impressive.
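A small, task-matched test set can be as simple as labeled cases grouped by the condition that tends to break the model. The sketch below assumes each case already has an expected answer and the model's answer as strings; the slice names and the sample values are hypothetical, and exact string match is a deliberately strict stand-in for whatever comparison the real task needs.

```python
from collections import defaultdict


def accuracy_by_slice(cases):
    """Compute exact-match accuracy separately for each named slice."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for case in cases:
        totals[case["slice"]] += 1
        if case["predicted"].strip() == case["expected"].strip():
            correct[case["slice"]] += 1
    return {name: correct[name] / totals[name] for name in totals}


# Hypothetical cases: reading an order total from checkout screenshots.
CASES = [
    {"slice": "large_text", "expected": "49.90", "predicted": "49.90"},
    {"slice": "large_text", "expected": "12.00", "predicted": "12.00"},
    {"slice": "small_text", "expected": "7.15", "predicted": "7.15"},
    {"slice": "small_text", "expected": "3.80", "predicted": "8.80"},
]

print(accuracy_by_slice(CASES))
```

Reporting per slice matters because an overall average can hide the exact failure the paragraph above worries about: the small number that was read wrong.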

There is another practical constraint: multimodal AI changes the review habit. A human reviewer may need to check not only the final answer, but also whether the model looked at the right evidence. In a support workflow, the model might summarize the video but miss the step before the failure. In a design workflow, it might identify visual inconsistency but misunderstand the product rule. In a safety-sensitive workflow, that difference matters. Human review should be placed where the cost of being wrong is high.
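As a rough sketch of that placement rule, a workflow could route an output to a person whenever the stakes are high or the model cited no evidence at all. Everything here is an assumption for illustration: the `ModelOutput` fields, the risk labels, and the idea that the workflow (not the model) assigns the risk level.

```python
from dataclasses import dataclass, field


@dataclass
class ModelOutput:
    summary: str
    risk: str                                           # "low", "medium", or "high", set by the workflow
    evidence_refs: list = field(default_factory=list)   # e.g. video timestamps the model cited


def needs_human_review(output: ModelOutput) -> bool:
    """Send high-risk or evidence-free outputs to a human reviewer."""
    if output.risk == "high":
        return True
    if not output.evidence_refs:
        return True
    return False


print(needs_human_review(ModelOutput("Button overlaps footer", "low", ["frame 00:12"])))
print(needs_human_review(ModelOutput("Valve appears closed", "high", ["photo 3"])))
print(needs_human_review(ModelOutput("Checkout fails", "low")))
```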

I also think multimodal AI can help non-technical people participate more naturally. Not everyone explains problems well in written form. Some people point, record, draw, speak, or show. If a system can accept those forms, it may reduce the communication tax. A warehouse operator should not need to write a perfect ticket to report a recurring issue. A customer should not need to learn internal vocabulary to show that a checkout step is confusing. The interface can listen more broadly.

For engineers, the design question is not only which model supports vision or audio. It is how the whole workflow behaves. What input formats are allowed? How are files normalized? What metadata is captured? What is stored and what is discarded? How are outputs linked back to source evidence? How do we test performance across accents, lighting, screen sizes, noisy recordings, and ambiguous images? The model is one piece. The operating system around it is the product.
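Two of those questions, which formats are allowed and how outputs link back to source evidence, can be sketched with a small intake record. This is a minimal illustration under my own assumptions: the allowed formats, the `SourceFile` and `Finding` types, and the locator strings are hypothetical. It uses a SHA-256 hash of the file bytes as a stable identifier so that a finding can always be traced to the exact file it came from.

```python
import hashlib
from dataclasses import dataclass

ALLOWED_FORMATS = {"png", "jpg", "wav", "mp4"}  # hypothetical allow-list


@dataclass(frozen=True)
class SourceFile:
    file_id: str      # SHA-256 hex digest of the file contents
    fmt: str          # normalized, lowercase extension
    size_bytes: int


@dataclass(frozen=True)
class Finding:
    text: str
    source_id: str    # file_id of the evidence this finding relies on
    locator: str      # where to look, e.g. a timestamp or screen region


def register_file(name: str, data: bytes) -> SourceFile:
    """Validate the format and capture basic metadata for an uploaded file."""
    fmt = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {name!r}")
    return SourceFile(hashlib.sha256(data).hexdigest(), fmt, len(data))


def link_finding(text: str, source: SourceFile, locator: str) -> Finding:
    """Attach a model finding to the specific file and location it came from."""
    return Finding(text, source.file_id, locator)


recording = register_file("checkout.MP4", b"fake video bytes")
finding = link_finding("Spinner never stops after payment", recording, "00:42")
print(finding.source_id == recording.file_id, finding.locator)
```

A reviewer who doubts the finding can then open the referenced file at the referenced location instead of taking the summary on trust.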

Multimodal AI feels important because it moves AI closer to the way people actually experience work. We do not think only in paragraphs. We see, hear, point, sketch, compare, and remember. The opportunity is to build tools that respect that reality without becoming careless with privacy or truth.

The more input types AI can understand, the more disciplined we need to be about consent, verification, and human judgment. Richer context should make work clearer, not less accountable. If you are already using screenshots, voice notes, or recordings in an AI workflow, I would be interested in what became easier and what became riskier than expected.
