VideoLLaVA

class grid.model.perception.vlm.video_llava.VideoLLaVA(*args, **kwargs)

VideoLLaVA: Visual Question Answering Model

This class implements a wrapper for the Video-LLaVA model, which answers natural-language questions about visual media (images and videos).

Credits:

https://github.com/PKU-YuanGroup/Video-LLaVA

License:

This code is licensed under the Apache 2.0 License.

__init__()
Return type:

None

preprocess_video(video_path)

Preprocess a video file into frames.

Parameters:

video_path (str) -- Path to the video file to be processed.

Returns:

An array of processed video frames of shape (num_frames, height, width, 3).

Return type:

np.ndarray
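A minimal usage sketch (the file path is a placeholder; the expected shape follows the return description above):

>>> video_llava = VideoLLaVA()
>>> frames = video_llava.preprocess_video("path/to/video.mp4")
>>> frames.shape  # (num_frames, height, width, 3)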

run(query, image=None, video=None)

Given a medium (image or video) and an accompanying query, return the answer to the query. Note that videos must be preprocessed with the preprocess_video method before being passed to this method.

Parameters:
  • query (str) -- task instruction or question about the media

  • image (np.ndarray) -- the image to query, if any

  • video (np.ndarray) -- the preprocessed video frames to query, if any; produce these with preprocess_video()

Returns:

response to the query about the given media

Return type:

str

Example

>>> from grid.model.perception.vlm.video_llava import VideoLLaVA
>>> video_llava = VideoLLaVA()
>>> video = video_llava.preprocess_video("path/to/video.mp4")
>>> outputs = video_llava.run(query="What is happening in the video?", video=video)
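
run() can also be called with an image instead of a video. The sketch below assumes any RGB np.ndarray of shape (height, width, 3) is accepted, per the parameter description; the zero array is only a placeholder for a real image:

>>> import numpy as np
>>> image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder for a real RGB image
>>> outputs = video_llava.run(query="What objects are visible?", image=image)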