As AI tools become more common at home and at work, people are thinking more creatively about how to apply large language models to new frontiers. Almost anyone can use ChatGPT or, increasingly, Microsoft’s Copilot for everyday tasks: rewriting an email for tone, putting together a bit of PowerShell code, finding a recipe for banana waffles, or handling just about any advanced search task. It’s still not a perfect solution (Google’s AI search tool once suggested that eating rocks is part of a balanced diet, based on an article by The Onion), but AI is becoming an important part of daily workflows and daily life.
The more researchers work with AI tools, the more questions come up about how to implement them. One area that is advancing quickly is video analysis. In general, for AI to understand what is happening in a particular video clip, an actual person has to go through and annotate the video, breaking it into sections and describing what actions are taking place. Researchers at MIT and the MIT-IBM Watson AI Lab have been working on a model to make this process more efficient. The technique has a fancy, sci-fi sounding name: spatio-temporal grounding. The model they have built performs this analysis using only the plain video and its automatically generated transcript (itself the product of another, independent AI function). By ‘viewing’ a video for subtle information about where objects are located (spatial data) and looking at the overall scene to figure out when an action is taking place (temporal data), the model can more accurately break down the whats and whens of longer videos with multiple activities. One area where this technology could be especially valuable is instructional video: medical procedures for teaching hospitals, chefs learning to make a particular pastry, and even vehicle mechanics are all potential real-world applications.
With this new potential comes a new set of problems to solve. Using a mechanic’s shop as an example, if more than one vehicle is in frame, the AI will need to understand which one is being worked on. If a video has a section on changing the oil filter, when does the process of changing the filter actually start? If the mechanic talks about changing the oil filter but doesn’t do it until later in the video, how does the model reconcile the misaligned timelines? These are all complicated problems, but using a mix of annotations and instructions to help analyze multi-step tasks has streamlined the process further.
So before long, you will be able to ask AI tools for more than a recipe for banana waffles: you will get step-by-step instructions and coaching to make a delicious breakfast from start to finish.
If you have any questions about how your business can put AI tools to work, reach out to us for a consultation!