How to monitor CPU and GPU usage amongst users?
I have several users (with user accounts as per /etc/passwd) who SSH onto an Ubuntu machine that I manage. This is my system info:
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
The hardware is a two-socket Intel Xeon E3 (16 cores in total) with two NVIDIA GTX 970 GPU cards. The machine has approximately 6 TB of internal HDD space.
Each user can use tmux to run a process that persists after they've logged off. Please note, I don't have anything sophisticated such as a job manager like SLURM; I'm far from that, so please don't suggest one.
Could anyone recommend software to monitor CPU and GPU usage per user and to report totals (e.g., 1000 CPU/GPU hours) over a given period of time? The software must be able to record the user, the CPU and GPU usage, and, if possible, the process/software executed.
Solution 1:
Something like sar and sadf (part of the sysstat package) can do full CPU accounting.
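As a rough illustration of turning sadf output into a "CPU hours" figure, here is a minimal Python sketch. It assumes the text comes from something like `sadf -d -- -u`, that the output is semicolon-separated with a `#`-prefixed header containing `interval`, `CPU` and `%idle` columns, and that the all-CPU row is marked `-1` (or `all`); check these against your sysstat version. The function name `cpu_hours_from_sadf` is made up for this example. Note that this gives machine-wide usage; attributing it to individual users would need a per-process data source.

```python
import csv
import io


def cpu_hours_from_sadf(sadf_text, n_cpus):
    """Estimate busy CPU-hours from semicolon-separated sadf output.

    Assumption: the header line starts with '#' and names the columns,
    including 'interval' (seconds), 'CPU' and '%idle'.
    """
    header = None
    busy_seconds = 0.0
    for row in csv.reader(io.StringIO(sadf_text), delimiter=";"):
        if not row:
            continue
        if row[0].startswith("#"):
            # Header line: strip the leading "# " from the first column name.
            header = [name.lstrip("# ").strip() for name in row]
            continue
        if header is None:
            continue  # data before any header; cannot interpret it
        rec = dict(zip(header, row))
        # Keep only the all-CPU average row (assumed to be "-1" or "all");
        # this also skips restart markers and per-core rows.
        if rec.get("CPU") not in ("-1", "all"):
            continue
        busy_fraction = (100.0 - float(rec["%idle"])) / 100.0
        busy_seconds += busy_fraction * float(rec["interval"]) * n_cpus
    return busy_seconds / 3600.0
```

You would feed it the captured output of sadf for the period of interest, with `n_cpus` set to the number of cores (16 on the machine described in the question).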
However, there are currently few, if any, tools that give good GPU accounting. Slurm can only do it by restricting and tightly scheduling GPU access, not by measuring actual use.
If one were to write a system like this, it would need to use the NVIDIA NVML library. The API for GPU monitoring has changed completely several times in the last few years, so such a tool would require frequent rewrites to keep up with changes in the NVIDIA driver and the NVML library.
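To give an idea of what such a tool might look like, here is a minimal Python sketch, not a finished solution. It assumes the third-party `pynvml` bindings for NVML are installed, and it only records which users hold a compute context on each GPU at each sample; that measures occupancy, not actual GPU utilization. The function names (`sample_gpu_pids`, `owner_of`, `accumulate`) are invented for this example.

```python
import os
import pwd
from collections import defaultdict


def sample_gpu_pids():
    """Return a set of (gpu_index, pid) pairs for current compute processes."""
    import pynvml  # third-party NVML bindings; assumed to be installed

    pynvml.nvmlInit()
    try:
        pairs = set()
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
                pairs.add((i, proc.pid))
        return pairs
    finally:
        pynvml.nvmlShutdown()


def owner_of(pid):
    """Map a PID to a user name via the owner of /proc/<pid>."""
    try:
        return pwd.getpwuid(os.stat(f"/proc/{pid}").st_uid).pw_name
    except (FileNotFoundError, KeyError):
        return None  # process exited, or the uid has no passwd entry


def accumulate(totals, gpu_pids, interval_s, lookup=owner_of):
    """Charge each user interval_s seconds per GPU they occupy in this sample."""
    # Deduplicate so several processes of one user on one GPU count once.
    busy = {(gpu, lookup(pid)) for gpu, pid in gpu_pids}
    for _gpu, user in busy:
        if user is not None:
            totals[user] += interval_s
    return totals
```

A small loop or cron job could call `accumulate(totals, sample_gpu_pids(), 60)` once a minute with `totals = defaultdict(float)` and write the totals to a file; dividing by 3600 gives GPU hours per user. As noted above, the NVML calls may need adjusting as the driver and library change.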