Technical Abstract (English)
This technology builds on a single-image (image-based) Vision-Language Pretrained model (VLP or VLM), integrating a Large Language Model (LLM) with temporal-sequence logical deduction. It is applied to action recognition for Standard Operating Procedure (SOP) execution captured from a first-person (egocentric) perspective. In contrast to the video-based VLM cube embeddings discussed in the literature, our approach significantly reduces computational requirements, addressing the practical challenge of costly compute resources.
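To make the contrast concrete, the sketch below shows one way such a pipeline could be wired together, under assumptions the abstract does not state: BLIP image captioning stands in for the single-image VLM, frames are sampled at a fixed stride, and the LLM's temporal deduction is driven by a hand-written prompt. The video filename, the SOP step list, and every helper function here are hypothetical stand-ins, not the disclosed implementation.

```python
# Hypothetical sketch of the described pipeline: sample frames from an
# egocentric video, caption each frame with a single-image VLM (BLIP is
# an assumed stand-in), then hand the time-ordered captions to an LLM
# prompt for temporal-sequence deduction over the expected SOP steps.
import cv2
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def sample_frames(video_path: str, stride: int = 30) -> list[Image.Image]:
    """Keep every `stride`-th frame; no video (cube) encoder is involved."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()
    return frames

def caption_frame(image: Image.Image) -> str:
    """Encode one frame with the image-based VLM and decode a caption."""
    inputs = processor(images=image, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

def build_sop_prompt(captions: list[str], sop_steps: list[str]) -> str:
    """Assemble an LLM prompt (assumed format) asking for step
    identification and an ordering check over time-ordered captions."""
    frame_lines = "\n".join(f"t={i}: {c}" for i, c in enumerate(captions))
    step_lines = "\n".join(f"{j + 1}. {s}" for j, s in enumerate(sop_steps))
    return (
        "Per-frame descriptions of an egocentric video, in time order:\n"
        f"{frame_lines}\n\n"
        "Expected SOP steps:\n"
        f"{step_lines}\n\n"
        "Which SOP step is being performed, and are the steps executed "
        "in the required order? Explain briefly."
    )

if __name__ == "__main__":
    frames = sample_frames("sop_recording.mp4")  # hypothetical input file
    captions = [caption_frame(f) for f in frames]
    prompt = build_sop_prompt(
        captions,
        ["Put on gloves", "Disinfect the tool", "Assemble part A"],  # example SOP
    )
    print(prompt)  # send to any LLM chat/completions endpoint
```

In this reading, each sampled frame passes through a 2D image encoder on its own, so compute grows linearly with the number of frames and no 3D (cube) video embedding is ever constructed, which is one plausible source of the claimed cost reduction.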