Monitoring GPU temperatures with nvidia-smi and Check MK (OMD)

The Nvidia monitoring setup described at https://elwe.rhrk.uni-kl.de/howto/ worked in Check MK 1.2.8, but fails in 1.4. After some modification things now work – it required some modification of the check script /omd/yoursite/local/share/check_mk/checks/nvidia_smi. The two modifications needed were:

Remove the grouping of nvidia_smi.errors1 and 2 (I can live with this as our GTX1070 doesn’t report this anyway).

Remove the unicode degree characters from the temperature output, as this seems to cause the system to choke on the textual output.

Needed to delete and recreate the host to get it to work properly – possibly unicode characters hanging around in the generated graph definitions or similar?

#!/usr/bin/python
# -*- encoding: utf-8; py-indent-offset: 4 -*-
# +------------------------------------------------------------------+
# | ____ _ _ __ __ _ __ |
# | / ___| |__ ___ ___| | __ | \/ | |/ / |
# | | | | '_ \ / _ \/ __| |/ / | |\/| | ' / |
# | | |___| | | | __/ (__| < | | | | . \ | # | \____|_| |_|\___|\___|_|\_\___|_| |_|_|\_\ | # | | # | Copyright Mathias Kettner 2012 mk@mathias-kettner.de | # +------------------------------------------------------------------+ # # This file is part of Check_MK. # The official homepage is at http://mathias-kettner.de/check_mk. # # check_mk is free software; you can redistribute it and/or modify it # under the terms of the GNU General Public License as published by # the Free Software Foundation in version 2. check_mk is distributed # in the hope that it will be useful, but WITHOUT ANY WARRANTY; with- # out even the implied warranty of MERCHANTABILITY or FITNESS FOR A # PARTICULAR PURPOSE. See the GNU General Public License for more de- # ails. You should have received a copy of the GNU General Public # License along with GNU Make; see the file COPYING. If not, write # to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, # Boston, MA 02110-1301 USA. ####################################### # Check developed by ####################################### # Dr. Markus Hillenbrand # University of Kaiserslautern, Germany # hillenbr@rhrk.uni-kl.de ####################################### # Tweaked by Jamie Scott # University of Glasgow # Jamie.Scott@glasgow.ac.uk ####################################### # the inventory functions def inventory_nvidia_smi_fan(info): inventory = [] for line in info: if line[2] != 'N/A': inventory.append( ("GPU"+line[0], "", None) ) return inventory def inventory_nvidia_smi_gpuutil(info): inventory = [] for line in info: if line[3] != 'N/A': inventory.append( ("GPU"+line[0], "", None) ) return inventory def inventory_nvidia_smi_memutil(info): inventory = [] for line in info: if line[4] != 'N/A': inventory.append( ("GPU"+line[0], "", None) ) return inventory def inventory_nvidia_smi_errors1(info): inventory = [] for line in info: if line[5] != 'N/A': inventory.append( ("GPU"+line[0], "", None) ) return inventory def inventory_nvidia_smi_errors2(info): inventory = [] for line in info: if line[6] != 'N/A': inventory.append( ("GPU"+line[0], "", None) ) return inventory def inventory_nvidia_smi_temp(info): inventory = [] for line in info: if line[7] != 'N/A': inventory.append( ("GPU"+line[0], "", None) ) return inventory def inventory_nvidia_smi_power(info): inventory = [] for line in info: if line[8] != 'N/A' and line[9] != "N/A": inventory.append( ("GPU"+line[0], "", None) ) return inventory # the check functions def check_nvidia_smi_fan(item, params, info): for line in info: if "GPU"+line[0] == item: value = int(line[2]) perfdata = [('fan', value, 90, 95, 0, 100 )] if value > 95:
return (2, "CRITICAL - %s fan speed is %d%%" % (line[1], value), perfdata)
elif value > 90:
return (1, "WARNING - %s fan speed is %d%%" % (line[1], value), perfdata)
else:
return (0, "OK - %s fan speed is %d%%" % (line[1], value), perfdata)
return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_gpuutil(item, params, info):
for line in info:
if "GPU"+line[0] == item:
value = int(line[3])
perfdata = [('gpuutil', value, 100, 100, 0, 100 )]
return (0, "OK - %s utilization is %s%%" % (line[1], value), perfdata)
return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_memutil(item, params, info):
for line in info:
if "GPU"+line[0] == item:
value = int(line[4])
perfdata = [('memutil', value, 100, 100, 0, 100 )]
if value > 95:
return (2, "CRITICAL - %s memory utilization is %d%%" % (line[1], value), perfdata)
elif value > 90:
return (1, "WARNING - %s memory utilization is %d%%" % (line[1], value), perfdata)
else:
return (0, "OK - %s memory utilization is %d%%" % (line[1], value), perfdata)
return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_errors1(item, params, info):
for line in info:
if "GPU"+line[0] == item:
value = int(line[5])
if value > 500:
return (2, "CRITICAL - %s single bit error counter is %d" % (line[1], value))
if value > 100:
return (1, "WARNING - %s single bit error counter is %d" % (line[1], value))
else:
return (0, "OK - %s single bit error counter is %d" % (line[1], value))
return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_errors2(item, params, info):
for line in info:
if "GPU"+line[0] == item:
value = int(line[6])
if value > 500:
return (2, "CRITICAL - %s double bit error counter is %d" % (line[1], value))
if value > 100:
return (1, "WARNING - %s double bit error counter is %d" % (line[1], value))
else:
return (0, "OK - %s double bit error counter is %d" % (line[1], value))
return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_temp(item, params, info):
for line in info:
if "GPU"+line[0] == item:
value = int(line[7])
perfdata = [('temp', value, 80, 90, 0, 95 )]
if value > 90:
return (2, "CRITICAL - %s temperature is %dC" % (line[1], value), perfdata)
elif value > 80:
return (1, "WARNING - %s temperature is %dC" % (line[1], value), perfdata)
else:
return (0, "OK - %s temperature is %dC" % (line[1], value), perfdata)
return (3, "UNKNOWN - GPU %s not found in agent output" % item)

def check_nvidia_smi_power(item, params, info):
for line in info:
if "GPU"+line[0] == item:
draw = float(line[8])
limit = float(line[9])
value = draw * 100.0 / limit
perfdata = [('power', draw, limit * 0.8, limit * 0.9, 0, limit )]
if value > 90:
return (2, "CRITICAL - %s power utilization is %d%% of %dW" % (line[1], value, limit), perfdata)
elif value > 80:
return (1, "WARNING - %s power utilization is %d%% of %dW" % (line[1], value, limit), perfdata)
else:
return (0, "OK - %s power utilization is %d%% of %dW" % (line[1], value, limit), perfdata)
return (3, "UNKNOWN - GPU %s not found in agent output" % item)

# declare the check to Check_MK

check_info['nvidia_smi.fan'] = (check_nvidia_smi_fan, "%s fan speed" , 1, inventory_nvidia_smi_fan)
check_info['nvidia_smi.gpuutil'] = (check_nvidia_smi_gpuutil, "%s utilization" , 1, inventory_nvidia_smi_gpuutil)
check_info['nvidia_smi.memutil'] = (check_nvidia_smi_memutil, "%s memory" , 1, inventory_nvidia_smi_memutil)
#check_info['nvidia_smi.errors1'] = (check_nvidia_smi_errors1, "%s errors single" , 0, inventory_nvidia_smi_errors1)
#check_info['nvidia_smi.errors2'] = (check_nvidia_smi_errors2, "%s errors double" , 0, inventory_nvidia_smi_errors2)
check_info['nvidia_smi.temp'] = (check_nvidia_smi_temp, "%s temperature" , 1, inventory_nvidia_smi_temp)
check_info['nvidia_smi.power'] = (check_nvidia_smi_power, "%s power" , 1, inventory_nvidia_smi_power)

#checkgroup_of['nvidia_smi.errors1'] = 'hw_errors'
#checkgroup_of['nvidia_smi.errors2'] = 'hw_errors'

ResourceSpace cron and database notes

Ran into a couple of issues today:

Note: system setup is Debian 9 with standard options (Apache 2.4, PHP 7.0, MariaDB 10.1)

Cron

The documentation implies you should run cron_copy_hitcount.php as a cron job. However, the new correct way seems to be to run batch/cron.php, which runs a bunch of sub-jobs. I’ve got this set up in cron.daily as:

#!/bin/sh
wget -q -r http://localhost/resourcespace/batch/cron.php

We’ll see if this works. Certainly running it directly by browsing to it seems to work.

LDAP

Trying to activate the simpleldap plugin threw up two problems:

php-ldap wasn’t installed – easy enough. Note apache needs a restart after installing…

Second error was a problem with the database – the plugin couldn’t create a table, with error

Specified key was too long; max key length is 767 bytes

This seems to be because when I created the database the character set used was utf8mb4_general_ci, which in the worst case uses 4 bytes per character. If you try to create a index key with 255 characters you run into this limit.

The solution was to change the database to use utf8_general_ci. This allowed the plugin to create the simpleldap_groupmap table with utf8_general_ci. The rest of the database is still utf8mb4_general_ci, but as it has been created already without an issue we should be ok.

WordPress login time with the wpDirAuth plugin

The WordPress wpDirAuth plugin currently has a hard coded session time of 1 hour for directory authenticated (LDAP etc.) users. Hopefully at some point in the future this will become configurable. Discussion here.

On a related note, inserting

define( 'AUTOSAVE_INTERVAL', 60 ); // Seconds

in wp-config.php changes the autosave interval (default is 60 seconds).

Edit: Fixed in V1.9.3 thanks to patch submitted by Sean Leavey – time is now configurable.

Checking out SVN in a new directory and getting a ‘working copy too old’ (or similar) error

Had a situation today where we were trying to check out a SVN repository and kept getting

Check Out: Cleanup with an older 1.7 client before upgrading with this client

both with SmartSVN and the OSX command line svn – into a new clean directory.

The problem turned out to be an old .svn metadata folder in the directory above which should have been deleted when rearranging folders. This seemed not to affect existing working copies below this, but it looks like it did cause problems with creating new working copies. Deleting the rogue .svn directory made things work.

Office 365 install without activating

If you need to install Office 365 on a user’s laptop but not activate it (because you can only do it five times yourself), read on…

Get the deployment tool (link for Office 365 2016)

https://www.microsoft.com/en-us/download/details.aspx?id=49117

Run the executable – this will extract the setup.exe and configuration.xml files. Put these in a handy directory.

The default configuration.xml should be ok for downloading the installation sources (32-bit Office and Visio). Run setup.exe /download – this will download the office sources to a subdirectory Office\Data (there’s about 1.25 Gb of files). Note there is no progress indicator!

To install, edit the configuration.xml file to uncomment and change the Display option as follows;

<Display Level="None" AcceptEULA="FALSE" />

Run the setup tool again to install:

setup.exe /configure

(If no other options are given it will use the configuration.xml in the same directory and the install data sources in the Office\Data subdirectory).

Note: no progress is shown, except for the setup program eventually exiting.

This should result in a default install of Office and Visio, ready to be activated on first run.

If you are doing this for a system image, do not run any of the Office programs! Even if you cancel the activation dialog and exit, it still generates a unique ID for the install that you probably don’t want to clone.

To see the various other things the tool can do see the documentation linked from the download page. (One interesting thing – you apparently can install a version that allows multiple people on a system to use Office without it counting against their 5-system limit, such as on a terminal server. In this case we’d probably use the Office 2016 install activated against the KMS server, but someone might find this useful).

Volume licenced Office 2016 Repeated Activation Prompts

Had a situation that seems to be fairly common – get laptop, uninstall “Get Office” program, install Office 2016 and activate using KMS server, which seems to work fine. Then when you fire up one of the programs you get the “Lets get started” screen. You can close this and it works ok, but it is annoying.

To fix this see https://support.microsoft.com/en-us/kb/3170450 (deleting a couple of registry keys that make Office think it’s still in OEM mode).

HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\Office\16.0\Common\OEM

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office\16.0\Common\OEM

Dell OMSA install on Server 2012 R2 issues

  • Did install first as network admin – couldn’t get page to display and update patch (from 7.4.0 to 7.4.0.2) didn’t complete. Scrubbed and reinstalled as local admin – no problems.
  • firewall exception may be required for remote access.
  • https listener warning isn’t relevant for the https web interface!
  • send test email button didn’t work in ie11, does in chrome.