This repository contains the official implementation of KEPP: Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos (CVPR 2024).
In our project, we explore the capability of an agent to construct a logical sequence of action steps, thereby assembling a strategic procedural plan. This plan is crucial for navigating from an initial visual observation to a target visual outcome, as depicted in real-life instructional videos. Existing works have attained partial success by extensively leveraging various sources of information available in the datasets, such as heavy intermediate visual observations, procedural names, or natural language step-by-step instructions, for features or supervision signals. However, the task remains formidable due to the implicit causal constraints in the sequencing of steps and the variability inherent in multiple feasible plans. To tackle these intricacies that previous efforts have overlooked, we propose to enhance the agent’s capabilities by infusing it with procedural knowledge. This knowledge, sourced from training procedure plans and structured as a directed weighted graph, equips the agent to better navigate the complexities of step sequencing and its potential variations. We coin our approach KEPP, a novel Knowledge-Enhanced Procedure Planning system, which harnesses a probabilistic procedural knowledge graph extracted from training data, effectively acting as a comprehensive textbook for the training domain. Experimental evaluations across three widely-used datasets under settings of varying complexity reveal that KEPP attains superior, state-of-the-art results while requiring only minimal supervision. The main architecture of our model is illustrated in the paper; using this repository involves the following steps:
1) Setup
2) Data preparation
3) Train the step model
4) Generate paths from the procedure knowledge graph
5) Inference
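Before diving into the commands, here is a minimal, illustrative sketch of the core idea behind the probabilistic procedural knowledge graph (PKG): step transitions observed in the training plans are counted and normalized into edge weights of a directed weighted graph. This is not the repository's actual implementation; all names below are hypothetical.

```python
from collections import defaultdict

def build_pkg(training_plans):
    """Build a directed, weighted graph from training step sequences.

    training_plans: list of step-index sequences, e.g. [[3, 7, 2], [3, 2, 5]].
    Returns a dict mapping step a -> {step b: P(b follows a)}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for plan in training_plans:
        for a, b in zip(plan, plan[1:]):
            counts[a][b] += 1  # count each observed step transition

    pkg = {}
    for a, successors in counts.items():
        total = sum(successors.values())
        # normalize counts into transition probabilities (edge weights)
        pkg[a] = {b: c / total for b, c in successors.items()}
    return pkg

# Example with two toy training plans over step indices:
print(build_pkg([[3, 7, 2], [3, 2, 5]]))
# {3: {7: 0.5, 2: 0.5}, 7: {2: 1.0}, 2: {5: 1.0}}
```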
In a conda environment with CUDA available, run:
pip install -r requirements.txt
cd {root}/dataset/crosstask
bash download.sh
Then move the provided JSON files and the one-hot encoding file into `{root}/dataset/crosstask/crosstask_release/`:
mv *.json crosstask_release
mv actions_one_hot.npy crosstask_release
cd {root}/dataset/coin
bash download.sh
cd {root}/dataset/NIV
bash download.sh
Set the `--json_path_train` and `--json_path_val` arguments in `args.py` to the dataset JSON file paths.
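For orientation, these arguments might be defined in `args.py` roughly as follows, assuming an argparse-based configuration (the default paths below are placeholders, not the repository's actual values):

```python
import argparse

parser = argparse.ArgumentParser()
# Paths to the dataset JSON files produced during data preparation
# (placeholder defaults; point them at your generated files).
parser.add_argument('--json_path_train', type=str, default='path/to/train.json')
parser.add_argument('--json_path_val', type=str, default='path/to/val.json')
```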
cd {root}/step
python loading_data.py
Dimensions for different datasets are listed below:
Dataset | observation_dim | action_dim | class_dim |
---|---|---|---|
CrossTask | 1536 (how) / 9600 (base) | 105 | 18 |
COIN | 1536 | 778 | 180 |
NIV | 1536 | 48 | 5 |
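If it helps to keep track of these values while editing the configuration, the table can be summarized as a small lookup (purely illustrative; the repository sets these values through `args.py`):

```python
# Dimensions from the table above. CrossTask provides two observation feature
# variants: 1536-d ("how") and 9600-d ("base").
DATASET_DIMS = {
    "CrossTask": {"observation_dim": (1536, 9600), "action_dim": 105, "class_dim": 18},
    "COIN":      {"observation_dim": 1536,         "action_dim": 778, "class_dim": 180},
    "NIV":       {"observation_dim": 1536,         "action_dim": 48,  "class_dim": 5},
}
```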
python main_distributed.py --multiprocessing-distributed --num_thread_reader=8 --cudnn_benchmark=1 --pin_memory --checkpoint_dir=whl --resume --batch_size=256 --batch_size_val=256 --evaluate
The trained models will be saved in {root}/step/save_max.
Modify the `--json_path_val`, `--steps_path`, and `--step_model_output` arguments in `args.py` to generate the step-predicted dataset JSON files for the train and test datasets separately. Run the following command for the train and test datasets separately, modifying the arguments as described above:
python inference.py --multiprocessing-distributed --num_thread_reader=8 --cudnn_benchmark=1 --pin_memory --checkpoint_dir=whl --resume --batch_size=256 --batch_size_val=256 --evaluate > output.txt
cd {root}/PKG
python graph_creation.py
Select mode "train_out_n"
Trained graphs for the CrossTask, COIN, and NIV datasets are available in `{root}/PKG/graphs`. Change `graph_save_path` (line 13 of `graph_creation.py`) to load the procedure knowledge graph trained on the desired dataset.
Set the input path in `graph_creation.py` to the output of the step model (`--step_model_output`), and set the output path in `graph_creation.py` for the generated procedure-knowledge-graph-conditioned train and test dataset JSON files. Run the following command for the train and test datasets separately, modifying `graph_creation.py` as described above:
python graph_creation.py
Select mode "validate"
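Conceptually, generating paths from the PKG amounts to enumerating high-probability step sequences between the predicted start and end actions. The sketch below illustrates this with the toy graph from the earlier example; it is not the logic of `graph_creation.py`, and all names are hypothetical.

```python
def top_paths(pkg, start, goal, horizon, k=3):
    """Enumerate length-`horizon` paths from `start` to `goal` in the PKG and
    return the k most probable ones (probability = product of edge weights).

    pkg: dict mapping step -> {next_step: transition probability}.
    """
    results = []

    def dfs(node, path, prob):
        if len(path) == horizon:
            if node == goal:
                results.append((prob, path))
            return
        for nxt, p in pkg.get(node, {}).items():
            dfs(nxt, path + [nxt], prob * p)

    dfs(start, [start], 1.0)
    results.sort(key=lambda x: -x[0])
    return results[:k]

# Example with the toy PKG from the earlier sketch:
pkg = {3: {7: 0.5, 2: 0.5}, 7: {2: 1.0}, 2: {5: 1.0}}
print(top_paths(pkg, start=3, goal=2, horizon=3))
# [(0.5, [3, 7, 2])]
```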
Set the `json_path_train` and `json_path_val` arguments of `args.py` in the plan model to the outputs generated from the procedure knowledge graph for the train and test data, respectively. Modify the `--num_seq_PKG` parameter in `args.py` to match the number of generated PKG conditions. (Modify `--num_seq_LLM` to the same number as well if LLM conditions are not used separately.)
cd {root}/plan
python main_distributed.py --multiprocessing-distributed --num_thread_reader=8 --cudnn_benchmark=1 --pin_memory --checkpoint_dir=whl --resume --batch_size=256 --batch_size_val=256 --evaluate
For metrics, modify the max checkpoint path (line 339) in `inference.py` to point to the evaluated model and run:
python inference.py --multiprocessing-distributed --num_thread_reader=8 --cudnn_benchmark=1 --pin_memory --checkpoint_dir=whl --resume --batch_size=256 --batch_size_val=256 --evaluate > output.txt
Results of given checkpoints:
Dataset | SR (%) | mAcc (%) | mIoU (%) |
---|---|---|---|
CrossTask (T=4) | 21.02 | 56.08 | 64.15 |
COIN (T=4) | 15.63 | 39.53 | 53.27 |
NIV (T=4) | 22.71 | 41.59 | 91.49 |
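For context, SR, mAcc, and mIoU are the standard procedure planning metrics: SR counts a plan as successful only if every predicted action matches the ground truth, mAcc is the per-timestep accuracy, and mIoU measures the overlap between the predicted and ground-truth action sets. The following is a simplified sketch of how they can be computed; the exact evaluation code lives in `inference.py`.

```python
import numpy as np

def procedure_planning_metrics(pred, gt):
    """Rough sketch of SR / mAcc / mIoU over batches of plans.

    pred, gt: integer arrays of shape (num_plans, T) holding action indices.
    Returns the three metrics as percentages.
    """
    pred, gt = np.asarray(pred), np.asarray(gt)
    sr = np.all(pred == gt, axis=1).mean()    # whole plan exactly correct
    macc = (pred == gt).mean()                # per-timestep accuracy
    ious = []
    for p, g in zip(pred, gt):                # order-insensitive overlap
        inter = len(set(p) & set(g))
        union = len(set(p) | set(g))
        ious.append(inter / union)
    return 100 * float(sr), 100 * float(macc), 100 * float(np.mean(ious))

print(procedure_planning_metrics([[1, 2, 3, 4]], [[1, 2, 4, 4]]))
# (0.0, 75.0, 75.0)
```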
Here we present qualitative examples of our proposed method. Intermediate steps are shown as padding in the step model's output because it predicts only the start and end actions.
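For instance, a T=4 output of the step model could be represented as follows (purely illustrative; the padding token shown here is hypothetical):

```python
# The step model predicts only the first and last actions of a T = 4 plan;
# the two intermediate slots are filled with a padding token.
PAD = -1
step_model_plan = [17, PAD, PAD, 42]  # [start action, pad, pad, end action]
```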
Checkpoint links will be uploaded soon
@InProceedings{Nagasinghe_2024_CVPR,
author = {Nagasinghe, Kumaranage Ravindu Yasas and Zhou, Honglu and Gunawardhana, Malitha and Min, Martin Renqiang and Harari, Daniel and Khan, Muhammad Haris},
title = {Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
pages = {18816-18826}
}
In case of any queries, please create an issue or contact ravindunagasinghe1998@gmail.com.