Q: When copying memory to the GPU I got the error message CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES.
A: This happens because of the way the GPU code is compiled. Every block has a limited number of registers (a hardware property). If a thread's code is compiled to use many registers, then launching it with too many threads (this HW requires 1024, which is the maximum) leaves too few registers for all of the threads. The solution is either to write code that compiles to fewer registers (shorter and simpler), or to limit the number of registers per thread: @cuda.jit(max_registers=xxx)
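To see why 1024 threads can exhaust the register file, here is a quick sketch of the arithmetic, assuming 65536 registers per block (a common figure for recent NVIDIA GPUs; check the spec of the GPU you actually get):

```python
# Register budget per block -- an assumed, typical value; the real number
# is a hardware property of the specific GPU.
REGISTERS_PER_BLOCK = 65536
THREADS_PER_BLOCK = 1024  # the maximum, as required in this HW

# With a full block, each thread may use at most this many registers:
max_registers_per_thread = REGISTERS_PER_BLOCK // THREADS_PER_BLOCK
print(max_registers_per_thread)  # -> 64

def launch_fits(registers_per_thread,
                threads_per_block=THREADS_PER_BLOCK,
                registers_per_block=REGISTERS_PER_BLOCK):
    """True if a block of this size has enough registers for every thread."""
    return registers_per_thread * threads_per_block <= registers_per_block

print(launch_fits(32))   # -> True:  32 * 1024 = 32768  <= 65536
print(launch_fits(100))  # -> False: 100 * 1024 = 102400 > 65536
```

This is also why @cuda.jit(max_registers=...) helps: capping registers per thread keeps the product under the per-block budget, at the cost of spilling the excess to slower local memory.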
Q: Sometimes my code succeeds, and sometimes it fails.
A: There is more than one type of GPU on the server; code that uses many registers per thread might work on the stronger GPUs but not on the weaker ones.
Q: I got an error that looks like this: SLURM_NNODES environment variable conflicts with allocated node count (2 != 1).
A: There are two ways to run your code on the server. The first is to ask for resources to work with, and then execute your code using those resources: srun -c<xxx> --gres=gpu:<xxx> --pty bash. The second is to submit your code to the job queue managed by the server: srun -K -c<xxx> --gres=gpu:<xxx> --pty python3 <main-file>.py. This error happens when you mix the two options: asking for resources using the first option, and then trying to submit your code to the job queue from inside that allocation.
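The two options, and the broken mix, can be sketched as below. The core/GPU counts and the file name main.py are placeholder values, not the course's required settings:

```shell
# Option 1: interactive allocation, then run your code inside it
srun -c2 --gres=gpu:1 --pty bash       # allocates resources, opens a shell
python3 main.py                        # runs inside that allocation

# Option 2: submit the program itself as a job
srun -K -c2 --gres=gpu:1 --pty python3 main.py

# Broken mix: calling srun from INSIDE an srun shell. The inner srun
# inherits SLURM_* variables (e.g. SLURM_NNODES) from the outer
# allocation, and they conflict with the new allocation's node count.
srun -c2 --gres=gpu:1 --pty bash
srun -K -c2 --gres=gpu:1 --pty python3 main.py   # triggers the error
```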