KEPP: Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos

Mohamed bin Zayed University of Artificial Intelligence1, NEC Laboratories, USA2, University of Auckland3, Weizmann Institute of Science4
CVPR 2024

Overview

In our project, we explore the capability of an agent to construct a logical sequence of action steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from an initial visual observation to a target visual outcome, as depicted in real-life instructional videos. Existing works have attained partial success by extensively leveraging various sources of information available in the datasets, such as dense intermediate visual observations, procedure names, or natural language step-by-step instructions, as features or supervision signals. However, the task remains formidable due to the implicit causal constraints in the sequencing of steps and the variability inherent in multiple feasible plans. To tackle these intricacies that previous efforts have overlooked, we propose to enhance the agent's capabilities by infusing it with procedural knowledge. This knowledge, sourced from training procedure plans and structured as a directed weighted graph, equips the agent to better navigate the complexities of step sequencing and its potential variations. We coin our approach KEPP, a novel Knowledge-Enhanced Procedure Planning system, which harnesses a probabilistic procedural knowledge graph extracted from training data, effectively acting as a comprehensive textbook for the training domain. Experimental evaluations across three widely-used datasets under settings of varying complexity reveal that KEPP attains superior, state-of-the-art results while requiring only minimal supervision.

Methodology


Overview of our methodology.

We introduce KEPP, a Knowledge-Enhanced Procedure Planning system for instructional videos, leveraging a Probabilistic Procedural Knowledge Graph (P2KG). KEPP decomposes procedure planning into two sub-problems: first, predicting the initial and final action steps from the start and goal visual states; second, generating the full procedure plan conditioned on these predicted steps and on the procedural knowledge (path plans) retrieved from the P2KG. KEPP requires minimal annotations and enhances planning effectiveness.
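The graph-based retrieval above can be sketched in a few lines. The snippet below is a simplified, hypothetical illustration (function names and the exhaustive path search are ours, not the paper's implementation): it builds a probabilistic transition graph from training plans, then retrieves the most probable fixed-length path between a predicted first and last step.

```python
from collections import defaultdict

def build_p2kg(training_plans):
    """Build a probabilistic procedural knowledge graph: a directed edge
    a -> b is weighted by the transition probability P(b | a) estimated
    from consecutive step pairs in the training procedure plans."""
    counts = defaultdict(lambda: defaultdict(int))
    for plan in training_plans:
        for a, b in zip(plan, plan[1:]):
            counts[a][b] += 1
    graph = {}
    for a, nbrs in counts.items():
        total = sum(nbrs.values())
        graph[a] = {b: c / total for b, c in nbrs.items()}
    return graph

def retrieve_plan(graph, first_step, last_step, horizon):
    """Return the most probable path of length `horizon` that starts at
    first_step and ends at last_step, via exhaustive depth-first search
    (a small-scale stand-in for the actual retrieval procedure)."""
    best_path, best_prob = None, 0.0

    def dfs(path, prob):
        nonlocal best_path, best_prob
        if len(path) == horizon:
            if path[-1] == last_step and prob > best_prob:
                best_path, best_prob = list(path), prob
            return
        for nxt, p in graph.get(path[-1], {}).items():
            dfs(path + [nxt], prob * p)

    dfs([first_step], 1.0)
    return best_path

# Toy example with three training plans of kitchen steps:
plans = [["pour", "stir", "bake"],
         ["pour", "stir", "bake"],
         ["pour", "mix", "serve"]]
g = build_p2kg(plans)
print(retrieve_plan(g, "pour", "bake", 3))  # ['pour', 'stir', 'bake']
```

In KEPP, the retrieved path plans are not used verbatim; they condition the plan generator together with the predicted first and last steps, which lets the model handle multiple feasible orderings of the same task.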

Results

Conference Poster

BibTeX

@InProceedings{Nagasinghe_2024_CVPR,
    author    = {Nagasinghe, Kumaranage Ravindu Yasas and Zhou, Honglu and Gunawardhana, Malitha and Min, Martin Renqiang and Harari, Daniel and Khan, Muhammad Haris},
    title     = {Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {18816-18826}
}
    