Monitoring GPU temperatures with nvidia-smi and Check MK (OMD)

The Nvidia monitoring setup described elsewhere worked in Check MK 1.2.8, but fails in 1.4. It works again after two modifications to the check script /omd/yoursite/local/share/check_mk/checks/nvidia_smi:

Remove the grouping of nvidia_smi.errors1 and nvidia_smi.errors2, along with the two checks themselves (I can live with this as our GTX1070 doesn’t report these counters anyway).

Remove the unicode degree characters from the temperature output, as this seems to cause the system to choke on the textual output.

I also needed to delete and recreate the host to get it to work properly – possibly unicode characters were still hanging around in the generated graph definitions or similar.
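For the degree-character fix, here is a minimal sketch of the kind of clean-up involved. The helper name strip_non_ascii and the sample status line are hypothetical; the actual script below simply writes "C" instead of the degree sign.

```python
# -*- coding: utf-8 -*-

def strip_non_ascii(text):
    """Return text with every non-ASCII character (e.g. the degree sign) removed."""
    return "".join(ch for ch in text if ord(ch) < 128)


# Hypothetical status line in the style the check produced before the change.
message = u"OK - GTX1070 temperature is 45\u00b0C"
print(strip_non_ascii(message))
```

This drops anything outside plain ASCII, so it would also remove accented characters from a GPU name. That is acceptable for this check but not a general-purpose approach.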

# -*- encoding: utf-8; py-indent-offset: 4 -*-
# +------------------------------------------------------------------+
# |             ____ _               _        __  __ _  __           |
# |            / ___| |__   ___  ___| | __   |  \/  | |/ /           |
# |           | |   | '_ \ / _ \/ __| |/ /   | |\/| | ' /            |
# |           | |___| | | |  __/ (__|   <    | |  | | . \            |
# |            \____|_| |_|\___|\___|_|\_\___|_|  |_|_|\_\           |
# |                                                                  |
# | Copyright Mathias Kettner 2012                                   |
# +------------------------------------------------------------------+
#
# This file is part of Check_MK.
# The official homepage is at
#
# check_mk is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation in version 2. check_mk is distributed
# in the hope that it will be useful, but WITHOUT ANY WARRANTY; with-
# out even the implied warranty of MERCHANTABILITY or FITNESS FOR A
# PARTICULAR PURPOSE. See the GNU General Public License for more de-
# ails. You should have received a copy of the GNU General Public
# License along with GNU Make; see the file COPYING. If not, write
# to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor,
# Boston, MA 02110-1301 USA.

#######################################
# Check developed by
#######################################
# Dr. Markus Hillenbrand
# University of Kaiserslautern, Germany
#
#######################################
# Tweaked by Jamie Scott
# University of Glasgow
#
#######################################

# the inventory functions

def inventory_nvidia_smi_fan(info):
    inventory = []
    for line in info:
        if line[2] != 'N/A':
            inventory.append( ("GPU"+line[0], "", None) )
    return inventory

def inventory_nvidia_smi_gpuutil(info):
    inventory = []
    for line in info:
        if line[3] != 'N/A':
            inventory.append( ("GPU"+line[0], "", None) )
    return inventory

def inventory_nvidia_smi_memutil(info):
    inventory = []
    for line in info:
        if line[4] != 'N/A':
            inventory.append( ("GPU"+line[0], "", None) )
    return inventory

def inventory_nvidia_smi_errors1(info):
    inventory = []
    for line in info:
        if line[5] != 'N/A':
            inventory.append( ("GPU"+line[0], "", None) )
    return inventory

def inventory_nvidia_smi_errors2(info):
    inventory = []
    for line in info:
        if line[6] != 'N/A':
            inventory.append( ("GPU"+line[0], "", None) )
    return inventory

def inventory_nvidia_smi_temp(info):
    inventory = []
    for line in info:
        if line[7] != 'N/A':
            inventory.append( ("GPU"+line[0], "", None) )
    return inventory

def inventory_nvidia_smi_power(info):
    inventory = []
    for line in info:
        if line[8] != 'N/A' and line[9] != "N/A":
            inventory.append( ("GPU"+line[0], "", None) )
    return inventory

# the check functions

def check_nvidia_smi_fan(item, params, info):
    for line in info:
        if "GPU"+line[0] == item:
            value = int(line[2])
            perfdata = [('fan', value, 90, 95, 0, 100 )]
            if value > 95:
                return (2, "CRITICAL - %s fan speed is %d%%" % (line[1], value), perfdata)
            elif value > 90:
                return (1, "WARNING - %s fan speed is %d%%" % (line[1], value), perfdata)
            return (0, "OK - %s fan speed is %d%%" % (line[1], value), perfdata)
    return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_gpuutil(item, params, info):
    for line in info:
        if "GPU"+line[0] == item:
            value = int(line[3])
            perfdata = [('gpuutil', value, 100, 100, 0, 100 )]
            return (0, "OK - %s utilization is %s%%" % (line[1], value), perfdata)
    return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_memutil(item, params, info):
    for line in info:
        if "GPU"+line[0] == item:
            value = int(line[4])
            perfdata = [('memutil', value, 100, 100, 0, 100 )]
            if value > 95:
                return (2, "CRITICAL - %s memory utilization is %d%%" % (line[1], value), perfdata)
            elif value > 90:
                return (1, "WARNING - %s memory utilization is %d%%" % (line[1], value), perfdata)
            return (0, "OK - %s memory utilization is %d%%" % (line[1], value), perfdata)
    return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_errors1(item, params, info):
    for line in info:
        if "GPU"+line[0] == item:
            value = int(line[5])
            if value > 500:
                return (2, "CRITICAL - %s single bit error counter is %d" % (line[1], value))
            if value > 100:
                return (1, "WARNING - %s single bit error counter is %d" % (line[1], value))
            return (0, "OK - %s single bit error counter is %d" % (line[1], value))
    return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_errors2(item, params, info):
    for line in info:
        if "GPU"+line[0] == item:
            value = int(line[6])
            if value > 500:
                return (2, "CRITICAL - %s double bit error counter is %d" % (line[1], value))
            if value > 100:
                return (1, "WARNING - %s double bit error counter is %d" % (line[1], value))
            return (0, "OK - %s double bit error counter is %d" % (line[1], value))
    return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_temp(item, params, info):
    for line in info:
        if "GPU"+line[0] == item:
            value = int(line[7])
            perfdata = [('temp', value, 80, 90, 0, 95 )]
            if value > 90:
                return (2, "CRITICAL - %s temperature is %dC" % (line[1], value), perfdata)
            elif value > 80:
                return (1, "WARNING - %s temperature is %dC" % (line[1], value), perfdata)
            return (0, "OK - %s temperature is %dC" % (line[1], value), perfdata)
    return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_power(item, params, info):
    for line in info:
        if "GPU"+line[0] == item:
            draw = float(line[8])
            limit = float(line[9])
            value = draw * 100.0 / limit
            perfdata = [('power', draw, limit * 0.8, limit * 0.9, 0, limit )]
            if value > 90:
                return (2, "CRITICAL - %s power utilization is %d%% of %dW" % (line[1], value, limit), perfdata)
            elif value > 80:
                return (1, "WARNING - %s power utilization is %d%% of %dW" % (line[1], value, limit), perfdata)
            return (0, "OK - %s power utilization is %d%% of %dW" % (line[1], value, limit), perfdata)
    return (3, "UNKNOWN - GPU %s not found in agent output" % item)

# declare the check to Check_MK

check_info[''] = (check_nvidia_smi_fan, "%s fan speed" , 1, inventory_nvidia_smi_fan)
check_info['nvidia_smi.gpuutil'] = (check_nvidia_smi_gpuutil, "%s utilization" , 1, inventory_nvidia_smi_gpuutil)
check_info['nvidia_smi.memutil'] = (check_nvidia_smi_memutil, "%s memory" , 1, inventory_nvidia_smi_memutil)
#check_info['nvidia_smi.errors1'] = (check_nvidia_smi_errors1, "%s errors single" , 0, inventory_nvidia_smi_errors1)
#check_info['nvidia_smi.errors2'] = (check_nvidia_smi_errors2, "%s errors double" , 0, inventory_nvidia_smi_errors2)
check_info['nvidia_smi.temp'] = (check_nvidia_smi_temp, "%s temperature" , 1, inventory_nvidia_smi_temp)
check_info['nvidia_smi.power'] = (check_nvidia_smi_power, "%s power" , 1, inventory_nvidia_smi_power)

#checkgroup_of['nvidia_smi.errors1'] = 'hw_errors'
#checkgroup_of['nvidia_smi.errors2'] = 'hw_errors'
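
To make the data flow easier to follow, here is a rough sketch of one parsed agent row and the warn/crit pattern that the fan, memory, temperature and power checks share. The column order is inferred from the indices used in the script. The sample values and the classify helper are hypothetical and not part of the check.

```python
# Column layout inferred from the indices used by the check functions:
# 0 index, 1 name, 2 fan %, 3 GPU util %, 4 memory util %,
# 5 single bit errors, 6 double bit errors, 7 temperature,
# 8 power draw (W), 9 power limit (W)
SAMPLE_INFO = [
    ["0", "GTX1070", "35", "12", "8", "N/A", "N/A", "52", "38.5", "151.0"],
]


def classify(value, warn, crit):
    """Map a reading to a Nagios-style state: 0 OK, 1 WARNING, 2 CRITICAL."""
    if value > crit:
        return 2
    if value > warn:
        return 1
    return 0


for row in SAMPLE_INFO:
    item = "GPU" + row[0]
    temp_state = classify(int(row[7]), 80, 90)
    # Power is judged as a percentage of the configured limit.
    power_pct = float(row[8]) * 100.0 / float(row[9])
    power_state = classify(power_pct, 80, 90)
    print(item, temp_state, power_state)
```

Note that the comparisons are strict, so a reading exactly on a threshold still counts as the lower state.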

ResourceSpace cron and database notes

Ran into a couple of issues today:

Note: system setup is Debian 9 with standard options (Apache 2.4, PHP 7.0, MariaDB 10.1)


The documentation implies you should run cron_copy_hitcount.php as a cron job. However, the current approach seems to be to run batch/cron.php, which in turn runs a number of sub-jobs. I’ve got this set up in cron.daily as:

wget -q -r http://localhost/resourcespace/batch/cron.php

We’ll see if this works. Certainly running it directly by browsing to it seems to work.
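
As an alternative to wget, the same request could be made from a small script. This is only a sketch, assuming Python 3 is available on the server; the function name run_resourcespace_cron and the 300 second timeout are my own choices, and the opener parameter exists only so the function can be exercised without a live server.

```python
from urllib.request import urlopen

CRON_URL = "http://localhost/resourcespace/batch/cron.php"


def run_resourcespace_cron(url=CRON_URL, opener=urlopen, timeout=300):
    """Request the cron page and return the HTTP status code."""
    response = opener(url, timeout=timeout)
    try:
        return response.status
    finally:
        response.close()

# Example use from a cron.daily script:
#     status = run_resourcespace_cron()
#     if status != 200:
#         raise SystemExit("cron.php returned %d" % status)
```

Unlike wget -r, this does not follow links or save anything to disk; it just triggers the page once.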


Trying to activate the simpleldap plugin threw up two problems:

php-ldap wasn’t installed – easy enough. Note apache needs a restart after installing…

Second error was a problem with the database – the plugin couldn’t create a table, with error

Specified key was too long; max key length is 767 bytes

This seems to be because the database was created with the utf8mb4 character set (collation utf8mb4_general_ci), which in the worst case uses 4 bytes per character. If you try to create an index key with 255 characters, that needs up to 1020 bytes and you run into this limit.

The solution was to change the database to use utf8_general_ci. This allowed the plugin to create the simpleldap_groupmap table with utf8_general_ci. The rest of the database is still utf8mb4_general_ci, but as it has been created already without an issue we should be ok.
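
The arithmetic behind this can be checked quickly. This is a small sketch using the 767 byte limit from the error message and the 255 character key mentioned above; the worst-case widths of 4 bytes (utf8mb4) and 3 bytes (utf8) per character are the standard figures for those MySQL/MariaDB character sets.

```python
MAX_KEY_BYTES = 767  # limit quoted in the error message


def index_key_bytes(chars, bytes_per_char):
    """Worst-case size in bytes of an index key of the given character length."""
    return chars * bytes_per_char


for charset, width in (("utf8mb4", 4), ("utf8", 3)):
    needed = index_key_bytes(255, width)
    fits = needed <= MAX_KEY_BYTES
    print("%s: 255 chars -> %d bytes, fits: %s" % (charset, needed, fits))
```

With 3 bytes per character a 255 character key comes to 765 bytes, which squeezes in just under the limit.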

Notes on getting Ubuntu 16.04 to work with NIS

Note – this only sets up the system to use user and group logons, not automounting home directories. I haven’t figured out how to make that work in Ubuntu 16.04.

Install package nis

Probably a good idea to set network address statically in /etc/network/interfaces (NetworkManager should recognise this and then leave it alone)

Probably also a good idea to check that /etc/hosts has the domain name for the system, i.e. machinename

Add yp server to /etc/yp.conf

Edit /etc/nsswitch.conf to add nis for passwd, group and shadow. Note that compat should include nis by default.

Add a dependency to make the rpcbind service start at boot (systemctl add-wants expects the target first, then the unit to pull in)

systemctl add-wants rpcbind.service

(See this Debian bug report or this Ubuntu one)

Note that this is not a complete fix – it is reported that if the network does not come up fast enough things still break.

For users that need to log on to the system, create home directories

mkhomedir_helper <username>

Remember to reboot to check everything is working:


If that fails, check whether the rpcbind and ypbind services are running

systemctl status rpcbind
systemctl status ypbind
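
When debugging, it can also help to confirm the nsswitch.conf step took effect. Below is a minimal sketch that reports whether passwd, group and shadow list nis (or compat, which per the note above should include nis). The function name nis_enabled is hypothetical.

```python
def nis_enabled(nsswitch_text, databases=("passwd", "group", "shadow")):
    """Return {database: bool}: True if that entry lists nis or compat."""
    sources = {}
    for raw in nsswitch_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")
        sources[name.strip()] = rest.split()
    return {
        db: any(src in ("nis", "compat") for src in sources.get(db, []))
        for db in databases
    }

# Example use on a real system:
#     with open("/etc/nsswitch.conf") as fh:
#         print(nis_enabled(fh.read()))
```

This only inspects the configuration file; it says nothing about whether ypbind can actually reach the server.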

WordPress login time with the wpDirAuth plugin

The WordPress wpDirAuth plugin currently has a hard-coded session time of 1 hour for directory-authenticated (LDAP etc.) users. Hopefully at some point in the future this will become configurable. Discussion here.

On a related note, inserting

define( 'AUTOSAVE_INTERVAL', 60 ); // Seconds

in wp-config.php changes the autosave interval (default is 60 seconds).

Edit: Fixed in V1.9.3 thanks to patch submitted by Sean Leavey – time is now configurable.

User missing from login screen – OSX with FileVault

Situation: new MacBook with OSX Sierra. Set up with an admin account, enable FileVault (taking note of recovery key obviously!) and install the necessary. Create account for end user and give it to them. All is well (after getting some USB-A to USB-C converters…)

User restores all his stuff from a Time Machine backup to the account on the new system – this overwrites all the current user settings. After rebooting the system, his account has disappeared from the login screen.

Solution: Log on as the other administrative user (luckily we have one!) and open System Preferences – Security & Privacy – FileVault. A notice at the bottom of the dialog box informs you that some users are not enabled to use FileVault, with a button to enable them. This brings up a list showing the missing user. To enable the user, their password needs to be entered.

Checking out SVN in a new directory and getting a ‘working copy too old’ (or similar) error

Had a situation today where we were trying to check out an SVN repository and kept getting

Check Out: Cleanup with an older 1.7 client before upgrading with this client

both with SmartSVN and the OSX command line svn – into a new clean directory.

The problem turned out to be an old .svn metadata folder in the directory above which should have been deleted when rearranging folders. This seemed not to affect existing working copies below this, but it looks like it did cause problems with creating new working copies. Deleting the rogue .svn directory made things work.
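
To spot this kind of leftover quickly, a small script can walk up from the intended checkout location and list any .svn directories in the parent folders. This is a sketch only; the function name find_ancestor_svn_dirs is hypothetical.

```python
import os


def find_ancestor_svn_dirs(path):
    """Return every .svn directory found in the ancestors of path, nearest first."""
    found = []
    current = os.path.abspath(path)
    while True:
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        candidate = os.path.join(parent, ".svn")
        if os.path.isdir(candidate):
            found.append(candidate)
        current = parent
    return found

# Example use:
#     for stray in find_ancestor_svn_dirs("/path/to/new/checkout"):
#         print("possible stray metadata:", stray)
```

It only reports candidates. Whether a given .svn folder is really stale still needs a manual check before deleting it.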

Opening Dell P2241Hb TFT monitor

Note that you get into this via the front bezel (there are no handy pry gaps or slots, unfortunately). The grey surround and the back are not meant to come apart.

Note that the electronics box is attached to the LCD by a couple of bits of tape only. It’s attached to the back by four screws.