Researchers from MIT and the MIT-IBM Watson AI Lab have introduced a new training method that teaches vision-language models to localize personalized objects in a scene.
For instance, identifying a specific French Bulldog, Bowser, among other dogs at the dog park can be challenging for generative AI models like GPT-5. These models excel at recognizing general objects but struggle with personalized objects.
The method uses video-tracking data in which the same object is followed across multiple frames, forcing the model to focus on contextual clues rather than rely on prior knowledge.
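As a rough illustration of how tracking data could be turned into training examples, the sketch below groups per-frame annotations by tracked object, then picks a few frames as context and holds one out as the query the model must localize. The names `TrackedBox` and `build_episodes`, the episode layout, and the choice of three context frames are assumptions made for illustration; the article does not describe the researchers' actual data format.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackedBox:
    """One annotation: where a tracked object appears in one video frame (hypothetical format)."""
    track_id: str      # identifies the same object across frames
    frame_index: int   # position of the frame in the video
    box: tuple         # (x_min, y_min, x_max, y_max) in pixels


def build_episodes(annotations, num_context=3, seed=0):
    """Group annotations by track and split each track into context frames plus one query frame."""
    rng = random.Random(seed)

    # Collect every annotation that belongs to the same tracked object.
    by_track = defaultdict(list)
    for ann in annotations:
        by_track[ann.track_id].append(ann)

    episodes = []
    for track_id in sorted(by_track):
        frames = by_track[track_id]
        # Skip tracks that are too short to supply context frames and a query frame.
        if len(frames) < num_context + 1:
            continue
        # Sample distinct frames; the last sampled frame becomes the held-out query.
        chosen = rng.sample(frames, num_context + 1)
        context, query = chosen[:-1], chosen[-1]
        episodes.append({
            "track_id": track_id,
            "context": [(a.frame_index, a.box) for a in context],
            "query_frame": query.frame_index,
            "target_box": query.box,  # the location the model should predict
        })
    return episodes


if __name__ == "__main__":
    # Toy data: one object followed over five frames.
    toy = [TrackedBox("bowser", i, (i, i, i + 10, i + 10)) for i in range(5)]
    for episode in build_episodes(toy):
        print(episode)
```

Because the context and query come from the same track, the only reliable way to find the target in the query frame is to use the context frames, which matches the article's point about contextual clues over prior knowledge.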
Without such training, a model can fail at this basic task; with the new method, it can learn to identify personalized objects like Bowser the French Bulldog.
The researchers designed the training dataset to improve the model's ability to recognize personalized objects, addressing a significant shortcoming in vision-language models.
Author's summary: Researchers introduce a new method to teach AI models to locate personalized objects.