24 Commits

Author SHA1 Message Date
Ingo Oppermann
d6ce3a5891
Add option to include child processes in observation 2025-06-25 16:53:27 +02:00
Ingo Oppermann
5a90c3ce20
Add more efficient way to find children of a process 2025-06-25 14:35:54 +02:00
Ingo Oppermann
f241f3f531
Accumulate cpu and memory usage of child processes 2025-06-12 14:52:58 +02:00
Ingo Oppermann
d54512d927
Improve error message 2025-03-24 10:16:34 +01:00
Ingo Oppermann
893f8c2b1f
Choose the GPU with the least overall usage 2024-12-10 15:47:07 +01:00
Ingo Oppermann
64a2136501
Fix nvidia-smi parsing 2024-12-09 16:21:41 +01:00
Ingo Oppermann
a79004388f
Fix potential CPU leak 2024-10-31 22:11:53 +01:00
Ingo Oppermann
abc821fe4b
Create GPU index in actual driver 2024-10-31 15:23:24 +01:00
Ingo Oppermann
d591a2383e
Fix GPU index numbering, promote the GPU ID 2024-10-31 14:59:22 +01:00
Ingo Oppermann
aa3a5b4978
Prevent panic if index is out of bounds 2024-10-31 12:17:53 +01:00
Ingo Oppermann
55015bcf6f
Read out GPU specs at util start 2024-10-30 17:12:29 +01:00
Ingo Oppermann
de9a30a108
Add internal mock modules 2024-10-29 14:55:55 +01:00
Ingo Oppermann
2ee7fa7e41
Make resources the only direct user of psutil 2024-10-29 12:25:39 +01:00
Ingo Oppermann
fbf62bf7e5
Remove Start() function, rename Stop() to Cancel() 2024-10-28 17:12:31 +01:00
Ingo Oppermann
412fbedea3
Make psutil a submodule of resources, remove default psutil 2024-10-28 16:13:13 +01:00
Ingo Oppermann
2dbe5b5685
Add GPU support 2024-10-24 15:08:26 +02:00
Ingo Oppermann
644185dd50
Merge branch 'vod' into psutil_gpu 2024-08-19 12:43:47 +02:00
Ingo Oppermann
d391e274d7
Fix wrong memory limit, add total memory, add cpu and memory consumed by core itself to node resources 2024-07-25 21:13:49 +02:00
Ingo Oppermann
7fa47a962a
Add basic nvidia-smi parser 2024-07-16 08:14:19 +02:00
Ingo Oppermann
480dbb7f53
Refactor cluster node code 2024-07-09 12:26:02 +02:00
Ingo Oppermann
022c5c1a6d
Emit warnings 2023-09-11 14:42:46 +02:00
Ingo Oppermann
51d8b30e8f
Fix MaxCPU and MaxMemory semantics
If a limit of 0 (or negative) is given for both cpu and memory, then
no limiting will be triggered. If any value between 1 and 100 (inclusive)
is given, then limiting will be triggered when that limit is reached.

I.e. giving a limit of 100 doesn't mean unlimited.
2023-07-12 11:53:39 +02:00
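
The semantics described in the commit message above can be sketched in Go as follows. This is a minimal illustration, not the code from the commit: the `Limits` type and the `Exceeded` method are hypothetical names, and only `HasLimits`, `MaxCPU`, and `MaxMemory` are taken from the commit messages. The sketch also assumes that each limit is evaluated independently, that any positive value counts as an active limit, and that "reached" means usage greater than or equal to the limit.

```go
package main

// Limits is a hypothetical container for the two limits named in the
// commit message. Both values are percentages.
type Limits struct {
	MaxCPU    float64 // 0 or negative: no CPU limiting (assumed per-limit)
	MaxMemory float64 // 0 or negative: no memory limiting (assumed per-limit)
}

// HasLimits reports whether at least one limit is active.
// A value of 100 is an active limit; it does not mean "unlimited".
func (l Limits) HasLimits() bool {
	return l.MaxCPU > 0 || l.MaxMemory > 0
}

// Exceeded reports whether the given usage (in percent) has reached an
// active limit. Limits that are 0 or negative are ignored.
func (l Limits) Exceeded(cpu, memory float64) bool {
	if l.MaxCPU > 0 && cpu >= l.MaxCPU {
		return true
	}
	if l.MaxMemory > 0 && memory >= l.MaxMemory {
		return true
	}
	return false
}
```

Under these assumptions, `Limits{MaxCPU: 100}` triggers limiting once CPU usage reaches 100 percent, while `Limits{}` never triggers limiting.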
Ingo Oppermann
519f39b217
Fix returning wrong value for HasLimits 2023-07-12 11:38:40 +02:00
Ingo Oppermann
fc03bf73a2
Make resource manager a main module and expose more details 2023-06-06 21:28:08 +02:00