Monitoring GPU temperatures with nvidia-smi and Check MK (OMD)

The Nvidia monitoring setup described at https://elwe.rhrk.uni-kl.de/howto/ worked in Check MK 1.2.8, but fails in 1.4. After some modification things now work – it required some modification of the check script /omd/yoursite/local/share/check_mk/checks/nvidia_smi. The two modifications needed were:

Remove the grouping of nvidia_smi.errors1 and 2 (I can live with this as our GTX1070 doesn’t report this anyway).

Remove the unicode degree characters from the temperature output, as this seems to cause the system to choke on the textual output.

Needed to delete and recreate the host to get it to work properly – possibly unicode characters hanging around in the generated graph definitions or similar?

#!/usr/bin/python
# -*- encoding: utf-8; py-indent-offset: 4 -*-
# +------------------------------------------------------------------+
# | ____ _ _ __ __ _ __ |
# | / ___| |__ ___ ___| | __ | \/ | |/ / |
# | | | | '_ \ / _ \/ __| |/ / | |\/| | ' / |
# | | |___| | | | __/ (__| < | | | | . \ | # | \____|_| |_|\___|\___|_|\_\___|_| |_|_|\_\ | # | | # | Copyright Mathias Kettner 2012 mk@mathias-kettner.de | # +------------------------------------------------------------------+ # # This file is part of Check_MK. # The official homepage is at http://mathias-kettner.de/check_mk. # # check_mk is free software; you can redistribute it and/or modify it # under the terms of the GNU General Public License as published by # the Free Software Foundation in version 2. check_mk is distributed # in the hope that it will be useful, but WITHOUT ANY WARRANTY; with- # out even the implied warranty of MERCHANTABILITY or FITNESS FOR A # PARTICULAR PURPOSE. See the GNU General Public License for more de- # ails. You should have received a copy of the GNU General Public # License along with GNU Make; see the file COPYING. If not, write # to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, # Boston, MA 02110-1301 USA. ####################################### # Check developed by ####################################### # Dr. Markus Hillenbrand # University of Kaiserslautern, Germany # hillenbr@rhrk.uni-kl.de ####################################### # Tweaked by Jamie Scott # University of Glasgow # Jamie.Scott@glasgow.ac.uk ####################################### # the inventory functions def inventory_nvidia_smi_fan(info): inventory = [] for line in info: if line[2] != 'N/A': inventory.append( ("GPU"+line[0], "", None) ) return inventory def inventory_nvidia_smi_gpuutil(info): inventory = [] for line in info: if line[3] != 'N/A': inventory.append( ("GPU"+line[0], "", None) ) return inventory def inventory_nvidia_smi_memutil(info): inventory = [] for line in info: if line[4] != 'N/A': inventory.append( ("GPU"+line[0], "", None) ) return inventory def inventory_nvidia_smi_errors1(info): inventory = [] for line in info: if line[5] != 'N/A': inventory.append( ("GPU"+line[0], "", None) ) return inventory def inventory_nvidia_smi_errors2(info): inventory = [] for line in info: if line[6] != 'N/A': inventory.append( ("GPU"+line[0], "", None) ) return inventory def inventory_nvidia_smi_temp(info): inventory = [] for line in info: if line[7] != 'N/A': inventory.append( ("GPU"+line[0], "", None) ) return inventory def inventory_nvidia_smi_power(info): inventory = [] for line in info: if line[8] != 'N/A' and line[9] != "N/A": inventory.append( ("GPU"+line[0], "", None) ) return inventory # the check functions def check_nvidia_smi_fan(item, params, info): for line in info: if "GPU"+line[0] == item: value = int(line[2]) perfdata = [('fan', value, 90, 95, 0, 100 )] if value > 95:
return (2, "CRITICAL - %s fan speed is %d%%" % (line[1], value), perfdata)
elif value > 90:
return (1, "WARNING - %s fan speed is %d%%" % (line[1], value), perfdata)
else:
return (0, "OK - %s fan speed is %d%%" % (line[1], value), perfdata)
return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_gpuutil(item, params, info):
for line in info:
if "GPU"+line[0] == item:
value = int(line[3])
perfdata = [('gpuutil', value, 100, 100, 0, 100 )]
return (0, "OK - %s utilization is %s%%" % (line[1], value), perfdata)
return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_memutil(item, params, info):
for line in info:
if "GPU"+line[0] == item:
value = int(line[4])
perfdata = [('memutil', value, 100, 100, 0, 100 )]
if value > 95:
return (2, "CRITICAL - %s memory utilization is %d%%" % (line[1], value), perfdata)
elif value > 90:
return (1, "WARNING - %s memory utilization is %d%%" % (line[1], value), perfdata)
else:
return (0, "OK - %s memory utilization is %d%%" % (line[1], value), perfdata)
return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_errors1(item, params, info):
for line in info:
if "GPU"+line[0] == item:
value = int(line[5])
if value > 500:
return (2, "CRITICAL - %s single bit error counter is %d" % (line[1], value))
if value > 100:
return (1, "WARNING - %s single bit error counter is %d" % (line[1], value))
else:
return (0, "OK - %s single bit error counter is %d" % (line[1], value))
return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_errors2(item, params, info):
for line in info:
if "GPU"+line[0] == item:
value = int(line[6])
if value > 500:
return (2, "CRITICAL - %s double bit error counter is %d" % (line[1], value))
if value > 100:
return (1, "WARNING - %s double bit error counter is %d" % (line[1], value))
else:
return (0, "OK - %s double bit error counter is %d" % (line[1], value))
return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_temp(item, params, info):
for line in info:
if "GPU"+line[0] == item:
value = int(line[7])
perfdata = [('temp', value, 80, 90, 0, 95 )]
if value > 90:
return (2, "CRITICAL - %s temperature is %dC" % (line[1], value), perfdata)
elif value > 80:
return (1, "WARNING - %s temperature is %dC" % (line[1], value), perfdata)
else:
return (0, "OK - %s temperature is %dC" % (line[1], value), perfdata)
return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_power(item, params, info):
for line in info:
if "GPU"+line[0] == item:
draw = float(line[8])
limit = float(line[9])
value = draw * 100.0 / limit
perfdata = [('power', draw, limit * 0.8, limit * 0.9, 0, limit )]
if value > 90:
return (2, "CRITICAL - %s power utilization is %d%% of %dW" % (line[1], value, limit), perfdata)
elif value > 80:
return (1, "WARNING - %s power utilization is %d%% of %dW" % (line[1], value, limit), perfdata)
else:
return (0, "OK - %s power utilization is %d%% of %dW" % (line[1], value, limit), perfdata)
return (3, "UNKNOWN - GPU %s not found in agent output" % item)

# declare the check to Check_MK

check_info['nvidia_smi.fan'] = (check_nvidia_smi_fan, "%s fan speed" , 1, inventory_nvidia_smi_fan)
check_info['nvidia_smi.gpuutil'] = (check_nvidia_smi_gpuutil, "%s utilization" , 1, inventory_nvidia_smi_gpuutil)
check_info['nvidia_smi.memutil'] = (check_nvidia_smi_memutil, "%s memory" , 1, inventory_nvidia_smi_memutil)
#check_info['nvidia_smi.errors1'] = (check_nvidia_smi_errors1, "%s errors single" , 0, inventory_nvidia_smi_errors1)
#check_info['nvidia_smi.errors2'] = (check_nvidia_smi_errors2, "%s errors double" , 0, inventory_nvidia_smi_errors2)
check_info['nvidia_smi.temp'] = (check_nvidia_smi_temp, "%s temperature" , 1, inventory_nvidia_smi_temp)
check_info['nvidia_smi.power'] = (check_nvidia_smi_power, "%s power" , 1, inventory_nvidia_smi_power)

#checkgroup_of['nvidia_smi.errors1'] = 'hw_errors'
#checkgroup_of['nvidia_smi.errors2'] = 'hw_errors'

Notes on getting Ubuntu 16.04 to work with NIS

Note – this only sets up the system to use user and group logons, not automounting home directories. I haven’t figured out how to make this work in Ubuntu 16.

Install package nis

Probably a good idea to set network address statically in /etc/network/interfaces (NetworkManager should recognise this and then leave it alone)

Probably also a good idea to check that /etc/hosts has the domain name for the system, i.e.

127.0.1.1 domain.name.machinename machinename

Add yp server to /etc/yp.conf

Edit /etc/nsswitch.conf to add nis for passwd, group and shadow. Note that compat should include nis by default.

Add a dependency to make the rpcbind service start at boot

systemctl add-wants multi-user.target rpcbind.service

(See this Debian bug report or this Ubuntu one)

Note that this is not a complete fix – it is reported that if the network does not come up fast enough things still break.

For users that need to log on to the system, create home directories

mkhomedir_helper <username>

Remember to reboot to check everything is working:

yptest

if that fails check if the bind services are running

systemctl status rpcbind
systemctl status ypbind

Raspberry Pi networking notes for connecting directly to laptop

If running normal raspbian edit the file /etc/network/interfaces to add in two blocks:

iface eth0 inet dhcp

and

iface eth0 inet static
    address 169.254.210.230
    netmask 255.255.0.0

Comment out the block not in use. For networks that have a dhcp server running use the first block, for connecting to a Windows computer via a private network use the second (If Windows connects to a network, is set to obtain an address automatically and can’t get one then it self-assigns a 169.254.x.y address). Static address can be anything of the form 169.254.x.y.

Full example file:

# For static IP, consult /etc/dhcpcd.conf and 'man dhcpcd.conf'

# Include files from /etc/network/interfaces.d:
source-directory /etc/network/interfaces.d

auto lo
iface lo inet loopback

auto eth0
allow-hotplug eth0

# Comment out this block when connected to laptop
#iface eth0 inet dhcp

# Comment out this block when connected to building network (or any network with dhcp)
iface eth0 inet static
   address 169.254.210.230
   netmask 255.255.0.0


allow-hotplug wlan0
iface wlan0 inet manual
wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf

allow-hotplug wlan1
iface wlan1 inet manual
wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf

Proxmox clustering and multicast (also, DNS)

After much problems with getting a new Proxmox cluster up and running two things have helped:

Putting the SAN IP addresses in the hosts file, avoiding DNS dependancies (especially when one of the systems isn’t in there yet…). This put me on track: http://blog.rhavenindustrys.com/2013/04/curious-proxmox-clustering-fix.html

Important bit:

127.0.0.1 localhost.localdomain localhost
169.254.0.1 proxmox1.local proxmox1 pvelocalhost
169.254.0.2 proxmox2.local proxmox2

10.10.5.101 proxmox1.example.com
10.10.5.102 proxmox2.example.com

This associates the short aliases with the network used by corosync, while leaving the long addresses to the outside world.

Then tried tests detailed at: https://pve.proxmox.com/wiki/Troubleshooting_multicast,_quorum_and_cluster_issues

The multicast test failed – i.e. running

omping -c 10000 -i 0.001 -F -q <list of all nodes>

on both nodes at the same time resulted in 100% loss. Fixed this by disabling IGMP snooping on the SAN VLAN. ExtremeOS command is:

disable igmp snooping <vlanname>

Hey presto, after getting the second node to join properly:

pvecm add 192.168.xxx.xxx -force

it gets quorum immediately. I suspect this issue was causing a lot of the historical issues with getting quorum to work on this switch.