OpenVLA

class grid.model.perception.vla.openvla.OpenVLA(*args, **kwargs)

OpenVLA: Vision-Language-Action Model

This class implements a wrapper around the OpenVLA model, which predicts robot actions from an image observation and a natural-language task instruction.

Credits:

https://github.com/OpenVLA/OpenVLA

License:

This code is licensed under the MIT License.

__init__()

Initializes the OpenVLA model and processor.

Loads the model and processor from the Hugging Face Hub, configured for efficient memory usage with 4-bit quantization.

Return type:

None
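
The sketch below illustrates how a processor and a 4-bit-quantized OpenVLA checkpoint are typically loaded from the Hugging Face Hub with transformers and bitsandbytes. It is a minimal sketch only: the checkpoint name, dtype, and keyword arguments shown are assumptions and may differ from the wrapper's actual implementation.

>>> import torch
>>> from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig
>>> model_id = "openvla/openvla-7b"  # assumed checkpoint name
>>> quant_config = BitsAndBytesConfig(load_in_4bit=True)  # 4-bit weights for lower GPU memory
>>> processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
>>> model = AutoModelForVision2Seq.from_pretrained(
...     model_id,
...     quantization_config=quant_config,
...     torch_dtype=torch.bfloat16,
...     trust_remote_code=True,
... )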

run(image, query)

Given an image and a natural-language task instruction, return a predicted action.

The action is represented as a 7-DoF vector; the model's normalized output needs to be un-normalized for BridgeData V2.

Parameters:
  • image (np.ndarray) -- The input image observation used to predict the action.

  • query (str) -- Task instruction.

Returns:

Predicted action based on the query and image, represented as a 7-DoF vector.

Return type:

List[float]

Example

>>> openvla = OpenVLA()
>>> outputs = openvla.run(img, "What action should the robot take to close the laptop?")
>>> print(outputs)  # Action: [-0.00826106, 0.01349755, -0.01063425, -0.03462297, 0.04744966, 0.0756878, 0.99607843]
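
For reference, a BridgeData V2-style 7-DoF action is commonly interpreted as three translation deltas, three rotation deltas, and a gripper command. The unpacking below is an illustrative assumption about that layout, not something the wrapper guarantees; verify the ordering against your robot setup.

>>> # Assumed layout: [dx, dy, dz, droll, dpitch, dyaw, gripper]
>>> dx, dy, dz, droll, dpitch, dyaw, gripper = outputs
>>> print(f"translation delta: ({dx:.4f}, {dy:.4f}, {dz:.4f})")
>>> print(f"rotation delta: ({droll:.4f}, {dpitch:.4f}, {dyaw:.4f})")
>>> print(f"gripper command: {gripper:.4f}")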