Which stage should I train for videos?

I only have videos with descriptions, and I want to do video understanding. Can I start fine-tuning directly from stage 3? If yes, where do I get the weights trained through stage 2? When I run the command below, no weights are downloaded; only a few other files are.
python scripts/convert_hf_checkpoint.py --model_path DAMO-NLP-SG/VideoLLaMA3-7B --save_path weights/videollama3_7b_local
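
A quick way to confirm whether any weight files actually ended up in the save path is a check along these lines. This is only a minimal sketch: the helper name `find_weight_files` and the list of extensions are assumptions for illustration, not something taken from the VideoLLaMA3 repo.

```python
from pathlib import Path

# Assumption: these are the usual checkpoint extensions; adjust if the
# conversion script writes something else.
WEIGHT_SUFFIXES = {".safetensors", ".bin", ".pt", ".pth"}


def find_weight_files(save_path):
    """Return weight-like files found anywhere under save_path, sorted by path."""
    root = Path(save_path)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix in WEIGHT_SUFFIXES
    )


if __name__ == "__main__":
    # Same save path as in the command above.
    files = find_weight_files("weights/videollama3_7b_local")
    if not files:
        print("no weight files found")
    for f in files:
        print(f, f.stat().st_size, "bytes")
```

If this prints only "no weight files found" while config or tokenizer files are present, then the directory really holds no checkpoint shards.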