New NVIDIA AI platform cuts downtime for supercomputing datacentres

The new NVIDIA Mellanox UFM Cyber-AI platform minimises downtime in InfiniBand datacentres by using AI-powered analytics to detect security threats and operational issues, and to predict network failures.

It applies AI to learn a datacentre’s operational cadence and network workload patterns, drawing on both real-time and historical telemetry and workload data. Against this baseline, it tracks the system’s health and network modifications, and detects performance degradation as well as changes in usage and profiles.
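The baseline-and-deviation idea described above can be illustrated with a minimal sketch. It is not NVIDIA's implementation, which the announcement does not detail; it assumes a simple statistical baseline (mean and standard deviation per metric) in place of whatever models the platform actually uses. The metric names, the functions `build_baseline` and `find_anomalies`, and the threshold of 3.0 are all hypothetical.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarise historical telemetry as (mean, standard deviation) per metric.

    `history` maps a hypothetical metric name to a list of past readings.
    """
    return {name: (mean(values), stdev(values)) for name, values in history.items()}


def find_anomalies(baseline, sample, threshold=3.0):
    """Return metrics in `sample` that deviate from the baseline.

    A metric is flagged when its z-score (distance from the mean, measured
    in standard deviations) exceeds `threshold`. The threshold is an
    illustrative assumption, not a value taken from the platform.
    """
    anomalies = {}
    for name, value in sample.items():
        if name not in baseline:
            continue  # no history for this metric, so nothing to compare against
        mu, sigma = baseline[name]
        if sigma == 0:
            # A metric that never varied: any change at all is a deviation.
            if value != mu:
                anomalies[name] = float("inf")
            continue
        z_score = abs(value - mu) / sigma
        if z_score > threshold:
            anomalies[name] = z_score
    return anomalies


if __name__ == "__main__":
    # Made-up readings, for illustration only.
    history = {
        "link_errors": [0, 1, 0, 2, 1, 0, 1, 1],
        "bandwidth_gbps": [90, 92, 91, 89, 90, 93, 91, 90],
    }
    baseline = build_baseline(history)
    latest = {"link_errors": 40, "bandwidth_gbps": 91}
    print(find_anomalies(baseline, latest))
```

In this sketch a sudden jump in link errors is flagged while a bandwidth reading close to its historical mean is not. A production system would also need to account for time-of-day patterns and workload changes, which a single mean and standard deviation cannot capture.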

The platform alerts administrators to abnormal system and application behaviour and to potential system failures and threats, and can also perform corrective actions. It issues security alerts when attackers attempt to hack a system to host undesired applications, such as cryptocurrency mining.

The result is reduced datacentre downtime, which, according to ITIC, can cost more than US$300,000 an hour.

“The UFM Cyber-AI platform determines a datacentre’s unique vital signs and uses them to identify performance degradation, component failures and abnormal usage patterns,” said Gilad Shainer, Senior Vice President of Marketing for Mellanox Networking at NVIDIA.

“It allows system administrators to quickly detect and respond to potential security threats and address upcoming failures, saving cost and ensuring consistent service to customers,” he added.

NVIDIA has also added a third member to the UFM family, the UFM Telemetry platform. This tool captures real-time network telemetry data and streams it to an on-premises or cloud-based database, where it is used to monitor network performance and validate the network configuration.