Duplicated information in netiter.combine_results

Good morning Johannes,

I know I have been a bit of a nuisance lately, since my model is quite expensive and I keep running into issues with it, so I want to apologize, and to thank you for always replying so quickly!


For now, I have decided to reduce my model's precision, at least to get a posterior and to understand how to better constrain the model so that it gives results in a reasonable time. For this reason I decreased the number of live points to 400. However, this produces a small number of effective samples, so I increased **min_ess** to 10,000.

I assume this is the right approach, since it should add live points near the posterior region rather than spend them on sampling from the prior.
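To make "effective samples" concrete, here is a minimal sketch of the Kish effective sample size computed from log-weights. It assumes this is the quantity that **min_ess** targets, which I have not verified against the code; `kish_ess` is a hypothetical helper, not a library function.

```python
import math


def kish_ess(log_weights):
    """Kish effective sample size, (sum w)^2 / sum(w^2), from log-weights."""
    m = max(log_weights)
    # Subtract the maximum before exponentiating to avoid underflow;
    # the ratio is unchanged by a common scale factor.
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    return total * total / sum(x * x for x in w)


# Equal weights: every sample counts fully.
equal = kish_ess([0.0] * 1000)

# One sample dominates the weights: the ESS collapses towards 1,
# even though there are still 1000 samples in the list.
dominated = kish_ess([0.0] + [-20.0] * 999)
```

The point is that the ESS depends on how evenly the weights are spread, not only on how many samples are stored.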

My issue is that I'm running my model on a cluster using 128 nodes, each with 128 cores, for a total of 16384 cores.
Each node has 256 GB of RAM, which is plenty for the sampling, but while saving the results the code is always killed by what I assume is the **OOM** killer. What surprised me is that it works just fine if I decrease the number of cores. So I went into the code and found that the issue was in an **mpi_comm.gather**. I removed the gather and used plain **send** and **recv** instead. This worked, but then it crashed during the **bcast**, so I also changed the **bcast** into a **send** and **recv** loop over 512 MB chunks to obey the MPI memory limit. Once again this worked, but the run still crashed afterwards.
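For reference, the chunked **send**/**recv** replacement looks roughly like the sketch below. This is a simplified reconstruction, not the exact patch: the helper names (`iter_chunks`, `send_chunked`, `recv_chunked`, `bcast_chunked`) are made up for illustration, and `comm` is assumed to be an mpi4py communicator such as `MPI.COMM_WORLD`.

```python
import pickle

CHUNK_BYTES = 512 * 1024 * 1024  # 512 MB per message


def iter_chunks(payload, chunk_bytes=CHUNK_BYTES):
    """Yield consecutive slices of a bytes object, each at most chunk_bytes long."""
    for start in range(0, len(payload), chunk_bytes):
        yield payload[start:start + chunk_bytes]


def send_chunked(comm, obj, dest, chunk_bytes=CHUNK_BYTES, tag=0):
    """Pickle obj and send it to rank dest as a chunk count followed by the chunks."""
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    n_chunks = -(-len(payload) // chunk_bytes)  # ceiling division
    comm.send(n_chunks, dest=dest, tag=tag)
    for chunk in iter_chunks(payload, chunk_bytes):
        comm.send(chunk, dest=dest, tag=tag)


def recv_chunked(comm, source, tag=0):
    """Receive the chunk count, then the chunks, and rebuild the original object."""
    n_chunks = comm.recv(source=source, tag=tag)
    parts = [comm.recv(source=source, tag=tag) for _ in range(n_chunks)]
    return pickle.loads(b"".join(parts))


def bcast_chunked(comm, obj, root=0, chunk_bytes=CHUNK_BYTES):
    """Stand-in for comm.bcast: root sends the object to every other rank in turn."""
    if comm.Get_rank() == root:
        for dest in range(comm.Get_size()):
            if dest != root:
                send_chunked(comm, obj, dest, chunk_bytes)
        return obj
    return recv_chunked(comm, source=root)
```

Note that this only keeps each individual message small. Every receiving rank still ends up holding the complete object, so it does nothing to reduce the total memory used per node.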

So, to prevent every core from holding the full **saved_logwt_bs** and **logZ_bs**, I gathered them on rank 0 and tried to **bcast** the results dictionary from rank 0, as I assumed this would be lighter.

However, during what I assume is the rank order test, the code iteratively generates the results dictionary and stores it in memory. This is mostly fine if the results dictionary is small. The issue is that when I use 1 core, the results dictionary is relatively light, as are **saved_logwt_bs** and **logZ_bs**, but when I run with 16384 cores these are on the order of gigabytes, which results in massive dictionaries and in the code crashing.



I assume the information must then be duplicated across the cores, because I'm resuming an already completed run, and with 1 to 128 cores everything is fine, but once I go into the thousands of cores the RAM usage blows up and the run crashes.
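A rough back-of-the-envelope check of why per-rank copies would hurt at this scale, assuming one MPI rank per core and that every rank holds its own full copy of the results. The per-rank sizes below are hypothetical, not measured.

```python
NODE_RAM_GB = 256      # RAM per node, as described above
RANKS_PER_NODE = 128   # assumption: one MPI rank per core


def node_footprint_gb(per_rank_copy_gb, ranks_per_node=RANKS_PER_NODE):
    """Memory used on one node if every rank holds its own full copy."""
    return per_rank_copy_gb * ranks_per_node


# Largest copy each rank could hold if nothing else were in memory.
budget_per_rank_gb = NODE_RAM_GB / RANKS_PER_NODE

# Hypothetical per-rank sizes, for illustration only.
for size_gb in (0.1, 1.0, 2.0, 4.0):
    total = node_footprint_gb(size_gb)
    fits = total <= NODE_RAM_GB
    print(f"{size_gb} GB per rank -> {total} GB per node, fits: {fits}")
```

Under these assumptions each rank has a budget of only 2 GB, before counting anything else the sampler keeps in memory, so a results dictionary of a few GB duplicated on every rank would be enough to exhaust a node.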


Once again, sorry for troubling you, but I was wondering if this is something you could help me look into, since I'm not that familiar with what the code is actually doing.

With best regards,

João

Duplicated information in netiter.combine_results #173