
GRIT: GPU workload checkpointing and restoration

GRIT is a prototype designed to automate GPU workload migration in a Kubernetes cluster. It enables users to checkpoint the state of a GPU workload and restore it later on a different node with no impact on the final result of the workload.

Its key features include:

  • Least intrusive to Kubernetes core components – Currently, only containerd is slightly modified to support the new Pod start workflow.
  • No application code changes – Applications can be checkpointed and restored without altering their source code.
  • Pod-based migration – GRIT supports migrating all containers in a Pod.
  • Efficient checkpoint distribution – Checkpoints are distributed using custom Persistent Volumes (PVs), offering flexibility and efficiency compared to OCI-image-based checkpoints.
  • NVIDIA GPU workload support – GRIT leverages CRIU and cuda-checkpoint to enable checkpointing and restoration of NVIDIA GPU state.

Architecture

The above diagram shows the architecture of GRIT. The main components are:

  • GRIT-Manager: The control-plane component that orchestrates all checkpointing and restoration workflows. It includes controllers and admission webhooks required for lifecycle management.
  • GRIT-Agent: Runs as a Job Pod created by the GRIT-Manager. It is responsible for uploading/downloading checkpoint data and communicating with the GRIT-runtime.
  • Containerd (shim): A modified containerd (diff) and a new containerd-shim that receive control-plane signals from the GRIT-Agent and ultimately call CRIU tools to checkpoint and restore the container process.

Note: GRIT currently works only with NVIDIA GPUs; support for AMD GPUs will be added in the future. In addition, GRIT does not preserve the Pod IP during migration, so the workload must tolerate an IP change. Job-type, computation-intensive workloads are good candidates for migration.

Quick start

After installing the GRIT CRDs and controller, you can use the following commands to checkpoint and restore your workloads.
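A typical installation applies the manifests or Helm chart shipped with the repository. The command below is only a sketch that assumes a chart exists at charts/grit in a local clone; check the repository's installation instructions for the actual procedure.

$ helm install grit ./charts/grit --namespace grit-system --create-namespace   # chart path and namespace are assumptions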

First, create a PersistentVolumeClaim (PVC) to store the checkpoint data. In this example, Azure Files cloud storage is used:

$ cat examples/checkpoint-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ckpt-store
  namespace: default
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-csi-premium
  resources:
    requests:
      storage: 256Gi

$ kubectl apply -f examples/checkpoint-pvc.yaml
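Optionally, verify that the claim is provisioned before creating a Checkpoint. The storage class and size above are only examples; any ReadWriteMany-capable PVC should work.

$ kubectl get pvc ckpt-store -n default   # STATUS should be Bound before checkpointing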

Then start making the checkpoint:

$ cat examples/checkpoint.yaml

apiVersion: kaito.sh/v1alpha1
kind: Checkpoint
metadata:
  name: demo
  namespace: default
spec:
  autoMigration: false
  podName: $YOUR_POD
  volumeClaim:
    claimName: "ckpt-store"

$ kubectl apply -f examples/checkpoint.yaml

After the target Pod is checkpointed, the status of the Checkpoint CR is set to Checkpointed.
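You can poll the Checkpoint resource to follow its progress. The exact printed columns and condition names are defined by the GRIT CRDs, so inspect the full object if they differ from what you expect.

$ kubectl get checkpoint demo -n default
$ kubectl get checkpoint demo -n default -o yaml   # full status, including conditions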

When the original Pod is deleted, the newly created Pod will be associated with a Restore custom resource (created manually or automatically by the GRIT-Manager) and annotated with a special annotation. The GRIT-Agent identifies the Pod based on this annotation and restores it from the checkpoint data (a sketch of a manually created Restore manifest is shown below). See the demo below for a better understanding of the workflow.
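If you create the Restore resource manually, it uses the same API group and version as the Checkpoint above. The manifest below is only an illustrative sketch: the spec fields mirror the Checkpoint example and may not match the actual Restore schema, so consult the CRD definition or the examples directory for the authoritative fields.

$ cat restore.yaml

apiVersion: kaito.sh/v1alpha1
kind: Restore
metadata:
  name: demo
  namespace: default
spec:
  podName: $YOUR_NEW_POD        # assumption: the newly created Pod to restore into
  volumeClaim:
    claimName: "ckpt-store"     # PVC that holds the checkpoint data

$ kubectl apply -f restore.yaml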

Live Demo

This demo shows how to use GRIT to migrate a Kaito fine-tuning job from one GPU node to another without disrupting the tuning job's execution.

License

See MIT LICENSE.

Contact

"Kaito devs" kaito-dev@microsoft.com
