Computer vision projects rarely go exactly as planned, and this one was no exception. The idea was simple: build a model that could look at a photo of a laptop and identify any physical damage, such as cracked screens, missing keys, or broken hinges. It seemed like a straightforward use case for image models and large language models (LLMs), but it quickly turned into something more complicated.
Along the way, we ran into issues with hallucinations, unreliable outputs, and images that were not even laptops. To solve these, we ended up applying an agentic framework in an atypical way: not for task automation, but to improve the model's performance.
In this post, we will walk through what we tried, what did not work, and how a combination of approaches ultimately helped us build something reliable.
Where we started: Monolithic prompting
Our initial approach was fairly standard for a multimodal model. We used a single, large prompt to pass an image into an image-capable LLM and asked it to identify visible damage. This monolithic prompting strategy is simple to implement and works decently for clean, well-defined tasks. But real-world data rarely plays along.
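As a rough illustration, a monolithic request bundles the entire task into one prompt plus the raw image. This sketch assumes an OpenAI-style chat payload; the model name and prompt wording are illustrative, not our production setup:

```python
import base64

def build_damage_prompt(image_bytes: bytes) -> dict:
    """Assemble a single multimodal request: one big prompt plus the image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "image-capable-llm",  # placeholder model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("You are inspecting a laptop photo. List any visible "
                          "physical damage: cracked screen, missing keys, "
                          "broken hinges. If none, say 'no damage'.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

request = build_damage_prompt(b"\xff\xd8fake-jpeg-bytes")
```

Everything, from damage taxonomy to output format, rides on that one prompt, which is exactly why the failure modes below were so hard to isolate.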
We ran into three major issues early on:
- Hallucinations: The model would sometimes invent damage that did not exist or mislabel what it was seeing.
- Junk image detection: It had no reliable way to flag images that were not even laptops. Pictures of desks, walls, or people occasionally slipped through and received nonsensical damage reports.
- Inconsistent accuracy: The combination of these problems made the model too unreliable for operational use.
This was the point when it became clear we would need to iterate.
First fix: Mixing image resolutions
One thing we noticed was how much image quality affected the model's output. Users uploaded all kinds of images, ranging from sharp and high-resolution to blurry. This led us to consult research highlighting how image resolution affects deep learning models.
We trained and tested the model using a mix of high- and low-resolution images. The idea was to make the model more resilient to the wide range of image qualities it would encounter in practice. This helped improve consistency, but the core issues of hallucination and junk image handling persisted.
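One simple way to build such a resolution mix is to pair every sharp training image with progressively downsampled copies. This is a toy sketch using 2x2 average pooling on a grayscale array in pure Python; a real pipeline would use a library such as Pillow or torchvision transforms instead:

```python
def downsample_2x(img):
    """Halve each dimension by averaging 2x2 pixel blocks (naive box filter)."""
    h, w = len(img), len(img[0])
    return [
        [
            (img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]

def resolution_mix(img, levels=2):
    """Return the original image plus `levels` progressively lower-res copies."""
    out = [img]
    for _ in range(levels):
        out.append(downsample_2x(out[-1]))
    return out

hi_res = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
variants = resolution_mix(hi_res, levels=2)  # 4x4 -> 2x2 -> 1x1
```

Training on all the variants together exposes the model to the quality range it will see from real uploads.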
The multimodal detour: Text-only LLM goes multimodal
Encouraged by recent experiments in combining image captioning with text-only LLMs, like the approach covered in The Batch, where captions are generated from images and then interpreted by a language model, we decided to give it a try.
Here is how it works:
- The LLM starts by generating several possible captions for an image.
- Another model, called a multimodal embedding model, checks how well each caption fits the image. In this case, we used SigLIP to score the similarity between the image and the text.
- The system keeps the top few captions based on these scores.
- The LLM uses these top captions to write new ones, trying to get closer to what the image actually shows.
- It repeats this process until the captions stop improving, or it hits a set limit.
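The loop above can be sketched as follows. The caption generator and the image-text scorer are stubbed out here; a real version would call an LLM for generation and a SigLIP checkpoint (for example via the `transformers` library) for scoring. The point is the keep-top-k, regenerate-until-plateau control flow:

```python
def refine_captions(image, generate, score, k=3, max_rounds=5, eps=1e-3):
    """Iteratively keep the k best-scoring captions and regenerate from them."""
    captions = generate(image, seeds=None)           # initial candidates
    best = -float("inf")
    for _ in range(max_rounds):
        ranked = sorted(captions, key=lambda c: score(image, c), reverse=True)
        top = ranked[:k]
        top_score = score(image, top[0])
        if top_score - best < eps:                   # captions stopped improving
            break
        best = top_score
        captions = top + generate(image, seeds=top)  # rewrite from the best ones
    return top

# Stubs standing in for the LLM and the SigLIP similarity model:
def fake_generate(image, seeds):
    base = seeds or ["a device"]
    return [c + " on a desk" for c in base]

def fake_score(image, caption):
    return len(caption)  # pretend longer = closer match

result = refine_captions("laptop.jpg", fake_generate, fake_score, k=2)
```

With real models, `score` would be the SigLIP image-text similarity, so each round pulls the caption pool toward captions the embedding model agrees describe the image.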
While clever in theory, this approach introduced new problems for our use case:
- Persistent hallucinations: The captions themselves sometimes included imaginary damage, which the LLM then confidently reported.
- Incomplete coverage: Even with multiple captions, some issues were missed entirely.
- Increased complexity, little benefit: The added steps made the system more complicated without reliably outperforming the previous setup.
It was an interesting experiment, but ultimately not a solution.
A creative use of agentic frameworks
This was the turning point. While agentic frameworks are usually used for orchestrating task flows (think agents coordinating calendar invites or customer service actions), we wondered if breaking the image interpretation task down into smaller, specialized agents might help.
We built an agentic framework structured like this:
- Orchestrator agent: It checked the image and identified which laptop components were visible (screen, keyboard, chassis, ports).
- Component agents: Dedicated agents inspected each component for specific damage types; for example, one for cracked screens, another for missing keys.
- Junk detection agent: A separate agent flagged whether the image was even a laptop in the first place.
This modular, task-driven approach produced far more precise and explainable results. Hallucinations dropped dramatically, junk images were reliably flagged, and each agent's task was simple and focused enough to control quality well.
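A stripped-down sketch of that structure. Each agent is reduced to a plain callable here; in the real system each one wrapped its own LLM prompt, and the tag names, fields, and return shapes below are illustrative only:

```python
def junk_agent(image):
    """Flag images that are not laptops at all (stubbed heuristic)."""
    return "laptop" not in image["tags"]

def orchestrator(image):
    """Decide which component agents to run based on visible parts."""
    return [part for part in ("screen", "keyboard", "chassis", "ports")
            if part in image["visible_parts"]]

COMPONENT_AGENTS = {
    "screen":   lambda img: ["cracked screen"] if img.get("screen_cracked") else [],
    "keyboard": lambda img: ["missing keys"] if img.get("keys_missing") else [],
    "chassis":  lambda img: [],
    "ports":    lambda img: [],
}

def inspect(image):
    """Junk gate first, then only the agents the orchestrator selects."""
    if junk_agent(image):
        return {"junk": True, "damage": []}
    damage = []
    for part in orchestrator(image):
        damage += COMPONENT_AGENTS[part](image)
    return {"junk": False, "damage": damage}

report = inspect({
    "tags": ["laptop"],
    "visible_parts": ["screen", "keyboard"],
    "screen_cracked": True,
})
```

Because each agent answers one narrow question, a wrong answer is easy to trace back to a single prompt rather than to one giant monolithic one.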
The blind spots: Trade-offs of an agentic approach
As effective as this was, it was not perfect. Two main limitations showed up:
- Increased latency: Running multiple sequential agents added to the total inference time.
- Coverage gaps: Agents could only detect issues they were explicitly programmed to look for. If an image showed something unexpected that no agent was tasked with identifying, it would go unnoticed.
We needed a way to balance precision with coverage.
The hybrid solution: Combining agentic and monolithic approaches
To bridge the gaps, we created a hybrid system:
- The agentic framework ran first, handling precise detection of known damage types and junk images. We limited the number of agents to the most essential ones to improve latency.
- Then, a monolithic image LLM prompt scanned the image for anything else the agents might have missed.
- Finally, we fine-tuned the model using a curated set of images for high-priority use cases, like frequently reported damage scenarios, to further improve accuracy and reliability.
This combination gave us the precision and explainability of the agentic setup, the broad coverage of monolithic prompting, and the confidence boost of targeted fine-tuning.
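The hybrid flow boils down to: run the agents first, skip the broad scan on junk, then let a monolithic pass fill in anything the agents missed and deduplicate the merged findings. A sketch with both detection stages stubbed out as placeholders:

```python
def hybrid_inspect(image, agentic_pass, monolithic_pass):
    """Agentic pass first for precision; monolithic pass second for coverage."""
    agent_report = agentic_pass(image)
    if agent_report["junk"]:
        return agent_report                       # no broad scan on junk images
    extra = monolithic_pass(image)                # catch anything the agents missed
    merged = list(dict.fromkeys(agent_report["damage"] + extra))  # de-dupe, keep order
    return {"junk": False, "damage": merged}

# Stubs standing in for the two detection stages:
agentic = lambda img: {"junk": False, "damage": ["cracked screen"]}
monolithic = lambda img: ["cracked screen", "dented corner"]

report = hybrid_inspect("laptop.jpg", agentic, monolithic)
```

Ordering matters here: the cheap, precise agents gate the expensive broad scan, so junk images never pay for both stages.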
What we learned
A few things became clear by the time we wrapped up this project:
- Agentic frameworks are more versatile than they get credit for: While they are usually associated with workflow management, we found they could meaningfully improve model performance when applied in a structured, modular way.
- Mixing different approaches beats relying on just one: The combination of precise, agent-based detection alongside the broad coverage of LLMs, plus a bit of fine-tuning where it mattered most, gave us far more reliable results than any single method on its own.
- Visual models are prone to hallucinations: Even the more advanced setups can jump to conclusions or see things that are not there. It takes thoughtful system design to keep these errors in check.
- Image quality variety makes a difference: Training and testing with both clean, high-resolution images and everyday, lower-quality ones helped the model stay resilient when faced with unpredictable, real-world photos.
- You need a way to catch junk images: A dedicated check for junk or unrelated pictures was one of the simplest changes we made, and it had an outsized impact on overall system reliability.
Final thoughts
What started as a simple idea, using an LLM prompt to detect physical damage in laptop images, quickly turned into a much deeper experiment in combining different AI techniques to tackle unpredictable, real-world problems. Along the way, we learned that some of the most useful tools were ones not originally designed for this type of work.
Agentic frameworks, often seen as workflow utilities, proved surprisingly effective when repurposed for tasks like structured damage detection and image filtering. With a bit of creativity, they helped us build a system that was not just more accurate, but easier to understand and manage in practice.
Shruti Tiwari is an AI product manager at Dell Technologies.
Vadiraj Kulkarni is a data scientist at Dell Technologies.