Good morning Johannes,
I know I have been bothering you a lot lately: my model is quite expensive and I keep running into issues with it. I apologize for that, and thank you for always replying so quickly!
For now, I have decided to lower my model's precision so that I can at least obtain a posterior and learn how to better constrain the model to get results in a reasonable time. To that end I decreased the number of live points to 400. Since this yields only a small number of effective samples, I increased min_ess to 10000.
I assume this is the right approach, since it should add live points near the posterior region rather than spending them on sampling from the prior.
My issue is that I am running the model on a cluster, using 128 nodes with 128 cores each, for a total of 16384 cores.
Each node has 256 GB of RAM, which is plenty for the sampling. However, while saving the results the code is always killed, by what I assume is the OOM killer. What surprised me is that it works fine when I decrease the number of cores. So I went into the code and found that the problem was in an mpi_comm.gather call. I replaced the gather with plain send and recv, which worked, but then the code crashed during the bcast. I therefore also replaced the bcast with a send/recv loop over 512 MB chunks to stay within the MPI memory limit. That step worked as well, but the run still crashed further on.
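To illustrate the kind of chunked transfer described above, here is a minimal sketch and not the actual patch. It assumes `comm` is an mpi4py-style communicator (for example `MPI.COMM_WORLD`). The function name `chunked_bcast`, the tags, and the `CHUNK_BYTES` constant are hypothetical names chosen for this example.

```python
import pickle

# 512 MB per message, matching the chunk size mentioned above.
CHUNK_BYTES = 512 * 1024 * 1024


def chunked_bcast(comm, obj=None, root=0, chunk_bytes=CHUNK_BYTES):
    """Send `obj` from `root` to every other rank in bounded-size pieces.

    Assumption: `comm` provides Get_rank, Get_size, send and recv with
    mpi4py semantics. Non-root ranks ignore their `obj` argument.
    """
    rank = comm.Get_rank()
    size = comm.Get_size()

    if rank == root:
        # Serialize once, then slice the byte string into chunks.
        payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
        nchunks = (len(payload) + chunk_bytes - 1) // chunk_bytes
        for dest in range(size):
            if dest == root:
                continue
            # Tell the receiver how many chunks to expect, then send them.
            comm.send(nchunks, dest=dest, tag=0)
            for i in range(nchunks):
                chunk = payload[i * chunk_bytes:(i + 1) * chunk_bytes]
                comm.send(chunk, dest=dest, tag=1)
        return obj

    # Receivers reassemble the chunks in order and deserialize.
    nchunks = comm.recv(source=root, tag=0)
    parts = [comm.recv(source=root, tag=1) for _ in range(nchunks)]
    return pickle.loads(b"".join(parts))
```

Note that this only bounds the size of each individual message. Every rank still ends up holding a full copy of the object, plus the serialized bytes while it is being reassembled, so the total memory per node is not reduced.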
So, to avoid every core holding the full saved_logwt_bs and logZ_bs, I gathered them on rank 0 only and tried to bcast the results dictionary from rank 0, as I assumed this would be lighter.
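As a rough sketch of collecting per-rank data on rank 0 only with plain send and recv, assuming an mpi4py-style communicator `comm`: the helper name `gather_to_root` and the tag value are hypothetical, and this is not the library's own code.

```python
def gather_to_root(comm, local_obj, root=0):
    """Collect one object per rank on `root` using point-to-point messages.

    Assumption: `comm` provides Get_rank, Get_size, send and recv with
    mpi4py semantics. Returns a list indexed by rank on `root`, and None
    on every other rank, so only one process holds the combined data.
    """
    rank = comm.Get_rank()
    size = comm.Get_size()

    if rank != root:
        comm.send(local_obj, dest=root, tag=2)
        return None

    gathered = [None] * size
    gathered[root] = local_obj
    # Receive from one rank at a time, in rank order.
    for src in range(size):
        if src != root:
            gathered[src] = comm.recv(source=src, tag=2)
    return gathered
```

With this pattern the memory cost of the combined list is paid once, on the root process, instead of once per rank.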
However, during what I assume is the rank order test, the code iteratively generates the results dictionary and keeps it in memory. This is fine as long as the results dictionary is small. With 1 core, the results dictionary, saved_logwt_bs and logZ_bs are all relatively light, but with 16384 cores they are on the order of gigabytes, which results in massive dictionaries and the code crashing.
I assume the information is being duplicated across the cores: I am resuming an already completed run, and with 1 to 128 cores everything is fine, but once I go to thousands of cores the RAM usage grows out of control and the run crashes.
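To make the duplication argument concrete, here is a back-of-the-envelope sketch. It assumes one MPI rank per core (128 ranks per node). The 1 GB object size and the factor of two copies per rank (for example, a serialized buffer plus the deserialized object) are purely illustrative, and the helper name is hypothetical.

```python
def replicated_memory_gb(object_gb, ranks_per_node, copies_per_rank=1):
    """Memory used on one node when every rank holds its own copies."""
    return object_gb * ranks_per_node * copies_per_rank


NODE_RAM_GB = 256      # RAM per node, from the setup described above
RANKS_PER_NODE = 128   # assumption: one rank per core

# Memory budget available to each rank if the node is shared evenly.
print(NODE_RAM_GB / RANKS_PER_NODE)  # 2.0

# Illustrative case: a 1 GB object held twice by each of the 128 ranks.
print(replicated_memory_gb(1.0, RANKS_PER_NODE, copies_per_rank=2))  # 256.0
```

Under these assumptions a single gigabyte-sized object replicated on every rank is already enough to fill a node, which would be consistent with the crashes appearing only at large core counts.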
Once again, sorry for troubling you. I was wondering whether you could help me look into this, since I am not very familiar with what the code is actually doing.
With best regards,
João