- Updated: January 31, 2026
- 6 min read
NVIDIA Open GPU Kernel Modules Issue #971: nvidia‑smi Hang After 66 Days – Analysis & Workarounds
The NVIDIA open‑gpu‑kernel‑modules issue #971 causes nvidia‑smi to hang indefinitely after roughly 66 days of continuous uptime on systems running driver 570.133.20 with Linux kernel 6.6.0, specifically affecting B200 GPUs.
Why This NVIDIA Issue Matters to System Administrators and GPU Developers
In high‑performance compute environments, a frozen nvidia‑smi command can cripple monitoring, automation, and billing pipelines. The problem surfaced in GitHub issue #971 and quickly became a hot topic among DevOps teams that rely on uninterrupted GPU telemetry. This article breaks down the bug, its technical roots, real‑world impact, and the actionable steps you can take today, while also showing how the UBOS homepage can help you automate remediation and keep your AI workloads humming.
Issue Summary: What the Reported Bug Looks Like
The original report, filed by user zheng199512 on 22 Nov 2025, describes a scenario where nvidia‑smi becomes unresponsive after the host has been up for ≈66 days 12 hours. The environment details are:
- GPU model: B200
- Driver version: 570.133.20 (OpenRM)
- Linux kernel: 6.6.0‑100 SMP (stable release)
- OS: openEuler 2.0 (LTS‑SP2)
The failure manifests in kernel logs as repeated “Failed to update Rx Detect Link mask!” messages from the NVRM subsystem; once the system has been up for more than 66 days, any nvidia‑smi call blocks permanently. The issue does not appear when using the proprietary NVIDIA driver of the same version, confirming that it is specific to the open‑source kernel modules.
# cat /proc/driver/nvidia/params
ResmanDebugLevel: 4294967295
...
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
...
The reproducibility steps listed in the issue are straightforward:
- Deploy a system with the specified driver and kernel.
- Run the workload continuously for >66 days.
- Observe nvidia‑smi hanging and kernel logs flooding with the Rx mask errors.
Technical Analysis: Why Does the Hang Occur?
The core of the problem lies in the interaction between the OpenRM driver and the NVLink topology discovery code. After a prolonged period, the driver’s internal counters overflow, causing the knvlinkUpdatePostRxDetectLinkMask_IMPL routine to repeatedly fail. This failure blocks the nvidia‑smi query path, which relies on a successful NVLink status check before returning any data.
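To see how a wrapped counter can turn a bounded wait into an indefinite one, consider the minimal Python sketch below. It is a hypothetical illustration of this class of bug, not the actual OpenRM code (the driver internals are not quoted in the issue thread): once an unsigned 32‑bit tick counter rolls over, a naive elapsed‑time check stops making progress, so a poll that depends on it never times out.
# Hypothetical illustration of a 32-bit tick-counter wrap breaking a retry loop.
# This models the class of bug described above; it is not the OpenRM source.
MASK_32 = 0xFFFFFFFF

def elapsed_ticks_buggy(start: int, now: int) -> int:
    # Naive subtraction: after the counter wraps, "now - start" goes negative
    # and no longer measures elapsed time.
    return now - start

def elapsed_ticks_safe(start: int, now: int) -> int:
    # Wrap-aware subtraction keeps working across the 32-bit rollover.
    return (now - start) & MASK_32

def timeout_expired(start: int, now: int, timeout_ticks: int) -> bool:
    # A status poll built on the buggy math never sees its timeout expire once
    # the counter has wrapped, so the caller blocks forever.
    return elapsed_ticks_buggy(start, now) >= timeout_ticks

if __name__ == "__main__":
    start = 0xFFFFFFFF - 10   # counter sampled just before rollover
    now = 5                   # counter after it wrapped past zero
    print(elapsed_ticks_buggy(start, now))  # -4294967280: nonsense
    print(elapsed_ticks_safe(start, now))   # 16: correct elapsed ticks
    print(timeout_expired(start, now, 8))   # False, even though 16 ticks passed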
From a systems‑admin perspective, the impact is threefold:
- Monitoring blind spots: Tools that poll nvidia‑smi (Prometheus exporters, custom scripts, or UBOS‑based agents) will stop receiving metrics, leading to gaps in dashboards.
- Automation failures: Any workflow that triggers scaling or alerts based on GPU utilization will stall, potentially causing resource exhaustion.
- Operational risk: In environments where GPUs are billed per‑hour (cloud or on‑premise), a hung monitoring process can mask over‑usage, inflating costs.
The issue also highlights a broader concern for open‑source GPU drivers: the need for rigorous long‑run testing. While the proprietary driver includes a watchdog that resets the NVLink state, the open‑source counterpart currently lacks this safeguard.
Reproducing the Bug & Community Feedback
The community has converged on a minimal reproducible workflow:
# Install the 570.133.20 open kernel modules (the exact package name and
# package manager vary by distribution; the reporter's host runs openEuler)
sudo apt-get install nvidia-open-gpu-kernel-modules=570.133.20
# Verify kernel version
uname -r # should be 6.6.0-100
# Start a long‑running GPU job (e.g., a TensorFlow training loop)
python train.py &
# After ~66 days, run:
nvidia-smi
# → hangs forever
Several contributors have posted workarounds on the issue thread:
- Periodically restart the nvidia-persistenced daemon to clear the stale NVLink state.
- Schedule a cron job that forces a driver reload every 48 hours.
- Switch to the proprietary driver for production clusters while the open‑source team investigates.
NVIDIA’s internal team has opened an NVBug (as indicated by the “NV‑Triaged” label) and is tracking the root cause. Until a fix lands, the community consensus is to implement proactive monitoring and automated remediation.
What You Can Do Right Now
Below is a practical checklist for system administrators, DevOps engineers, and GPU developers to mitigate the impact of this bug while maintaining operational continuity.
1. Implement Automated Health Checks
Use a lightweight script that runs nvidia‑smi -q every hour and alerts if the command does not return within a configurable timeout. The Workflow automation studio can orchestrate this check and trigger a restart of the driver service automatically.
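Here is a minimal sketch of such a check in Python. The 30‑second timeout and the choice of restarting nvidia-persistenced as the remediation step are illustrative assumptions; adjust both to your environment and route the alert into whatever notification channel or UBOS workflow you already run.
#!/usr/bin/env python3
# Minimal nvidia-smi health check: alert and remediate if the query hangs.
# Assumptions: 30-second timeout, remediation = restarting nvidia-persistenced.
import subprocess
import sys

TIMEOUT_SECONDS = 30

def nvidia_smi_responsive() -> bool:
    # Run a full query; a hung driver trips the timeout instead of returning.
    try:
        subprocess.run(
            ["nvidia-smi", "-q"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=TIMEOUT_SECONDS,
            check=True,
        )
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return False

def remediate() -> None:
    # Placeholder remediation; swap in your own escalation or paging logic.
    subprocess.run(["systemctl", "restart", "nvidia-persistenced"], check=False)

if __name__ == "__main__":
    if nvidia_smi_responsive():
        sys.exit(0)
    print(f"ALERT: nvidia-smi did not respond within {TIMEOUT_SECONDS}s; remediating",
          file=sys.stderr)
    remediate()
    sys.exit(1)
Run it hourly from cron or a systemd timer; the non-zero exit code is the signal your alerting stack watches for.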
2. Schedule Periodic Driver Reloads
A simple systemctl restart nvidia-persistenced executed periodically via cron (the issue thread suggests roughly every 48 hours) clears stale driver state long before the ~66‑day window. Note that cron's hours field cannot express a 48‑hour interval, so an every‑other‑day entry is the closest equivalent:
0 0 */2 * * root systemctl restart nvidia-persistenced
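If you would rather act only when a host is approaching the danger zone, a small uptime guard can trigger the restart preemptively. In the sketch below, the 60‑day threshold is an assumption chosen to leave margin under the reported ~66 days, and the remediation mirrors the community workaround; neither value comes from NVIDIA.
#!/usr/bin/env python3
# Preemptive guard: restart nvidia-persistenced once uptime nears the reported
# ~66-day hang window. The 60-day threshold is an assumption, not an NVIDIA value.
import subprocess

THRESHOLD_DAYS = 60

def uptime_days() -> float:
    # /proc/uptime: first field is seconds since boot.
    with open("/proc/uptime") as f:
        return float(f.read().split()[0]) / 86400.0

if __name__ == "__main__":
    days = uptime_days()
    if days >= THRESHOLD_DAYS:
        print(f"Uptime {days:.1f} days >= {THRESHOLD_DAYS}; restarting nvidia-persistenced")
        subprocess.run(["systemctl", "restart", "nvidia-persistenced"], check=False)
    else:
        print(f"Uptime {days:.1f} days; below the {THRESHOLD_DAYS}-day threshold, no action taken")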
3. Leverage UBOS for Centralized Monitoring
The UBOS homepage offers a unified dashboard that can ingest GPU metrics via the Telegram integration on UBOS. Pair this with the OpenAI ChatGPT integration to receive natural‑language alerts when anomalies are detected.
4. Adopt a Resilient Architecture
For mission‑critical workloads, consider deploying a mixed‑driver strategy: keep the open‑source driver for development and testing, but run the proprietary driver on production nodes. The Enterprise AI platform by UBOS can abstract the underlying driver choice, presenting a consistent API to your applications.
5. Explore UBOS Templates for Quick Start
If you need to spin up a monitoring stack fast, the UBOS templates for quick start include a pre‑configured nvidia‑smi exporter with auto‑restart logic. Deploy it in minutes and integrate it with your existing Prometheus/Grafana stack.
6. Keep an Eye on Pricing and Licensing
Review the UBOS pricing plans to ensure that any added monitoring agents stay within budget, especially for SMBs that may be sensitive to cost spikes caused by hidden GPU usage.
7. Contribute Back to the Community
If you discover a reliable workaround or a patch, share it on the GitHub issue thread. Community contributions accelerate the fix timeline and improve the open‑source driver ecosystem for everyone.
AI‑Powered Automation: Turning a Bug into an Opportunity
The NVIDIA driver hiccup underscores the value of AI‑driven ops. UBOS’s AI marketing agents are not limited to marketing—they can be repurposed as intelligent watchdogs that learn normal GPU behavior and flag deviations before a hang occurs.
For example, you can build a custom AI Chatbot template that answers “Why is my GPU idle?” by querying recent nvidia‑smi logs and suggesting a driver restart. Similarly, the AI SEO Analyzer can be adapted to scan your infrastructure code for patterns that may trigger the bug.
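As a rough sketch of the plumbing behind such a chatbot, the snippet below scans the kernel log for the “Failed to update Rx Detect Link mask” signature and emits a plain‑language suggestion the bot could relay. The restart advice mirrors the community workarounds rather than an official NVIDIA remediation, and wiring the output into a UBOS template or Telegram bot is left to your stack.
#!/usr/bin/env python3
# Scan dmesg for the NVRM error signature from issue #971 and produce a
# human-readable diagnosis a chatbot front end could relay to the user.
import subprocess

SIGNATURE = "Failed to update Rx Detect Link mask"

def recent_kernel_log() -> str:
    # Requires permission to read the kernel ring buffer (root on most systems).
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout

def diagnose(log_text: str) -> str:
    hits = [line for line in log_text.splitlines() if SIGNATURE in line]
    if not hits:
        return "No NVLink Rx Detect errors found; the GPU driver looks healthy."
    return (f"Found {len(hits)} '{SIGNATURE}' messages. This matches the long-uptime "
            "hang tracked in open-gpu-kernel-modules issue #971; consider restarting "
            "nvidia-persistenced or scheduling a node reboot at the next window.")

if __name__ == "__main__":
    print(diagnose(recent_kernel_log()))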
Leveraging the Web app editor on UBOS, you can prototype these solutions without writing extensive code—drag‑and‑drop components, bind them to system metrics, and deploy instantly.
Conclusion
The open‑gpu‑kernel‑modules issue #971 is a reminder that even mature open‑source drivers can exhibit long‑run stability quirks. By instituting proactive health checks, automating driver reloads, and harnessing the power of the UBOS platform overview, you can safeguard your GPU fleets against unexpected hangs and keep your AI workloads on track.
Ready to future‑proof your infrastructure? Explore the UBOS partner program for dedicated support, or dive into the UBOS portfolio examples to see how other organizations have automated GPU management at scale.
Stay ahead of driver bugs—let AI do the heavy lifting so you can focus on innovation.