Observe infrastructure alerts, such as GPU failures and thermal violations, during machine learning experiments that you log to W&B. When you run on a supported CoreWeave Kubernetes Service (CKS) cluster and satisfy the prerequisites on this page, CoreWeave Mission Control monitors your compute infrastructure during a W&B run.

Prerequisites

The following must be true for this integration to work end to end.
Prerequisite | Details
CoreWeave platform | Available only on CoreWeave Kubernetes Service (CKS) clusters. Not available on CoreWeave bare metal clusters or CoreWeave Classic.
SUNK (Slurm on Kubernetes) | SUNK is deployed in a CKS cluster and ties into the observability systems built into CKS. Training jobs that run through SUNK on CKS therefore satisfy the cluster requirement above. See About SUNK in the CoreWeave documentation.
W&B Python SDK | For training jobs, use the wandb package version 0.20.1 or later when you log a run.
W&B Server (Dedicated Cloud or Self-Managed) | If using a W&B Dedicated Cloud or W&B Self-Managed deployment, use W&B Server version 0.73.0 or later. Set the SERVER_FLAG_ENABLE_CORE_WEAVE_OBSERVABILITY environment variable on the W&B app pod so the server can accept CoreWeave observability data.
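One way to confirm the W&B Python SDK prerequisite before you launch a training job is to compare the installed wandb version against the 0.20.1 minimum. The following is a minimal sketch, not an official W&B utility: the helper names (release_tuple, meets_minimum, installed_wandb_version) are hypothetical, and the comparison looks only at the numeric release components, so a pre-release such as 0.20.1rc1 is treated the same as 0.20.1.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Minimum wandb SDK version listed in the prerequisites.
MIN_WANDB = "0.20.1"


def release_tuple(version: str) -> Tuple[int, ...]:
    """Return up to three leading numeric components, e.g. '0.20.1' -> (0, 20, 1).

    Pre-release or dev suffixes (such as 'rc1') are ignored in this sketch.
    """
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_WANDB) -> bool:
    """Return True if the installed version is at least the minimum."""
    return release_tuple(installed) >= release_tuple(minimum)


def installed_wandb_version() -> Optional[str]:
    """Return the installed wandb version, or None if wandb is not installed."""
    try:
        return metadata.version("wandb")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_wandb_version()
    if found is None:
        print("wandb is not installed")
    elif meets_minimum(found):
        print(f"wandb {found} satisfies >= {MIN_WANDB}")
    else:
        print(f"wandb {found} is older than {MIN_WANDB}; upgrade before logging runs")
```

Running the script in the same environment as the training job reports whether the SDK requirement is met. It does not check the cluster type or the W&B Server version, which must be verified separately.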
This feature is in Preview. Contact your W&B representative for access.
If an error occurs, CoreWeave sends that information to W&B. W&B then adds the infrastructure information to your run’s plots in your project’s workspace. CoreWeave attempts to resolve some issues automatically, and W&B surfaces that information on the run’s page.