I’m testing on a large-ish server (not my target machine, but I wanted to test on this before I bit the bullet on something “really big”): 20 cores + HT = 40 logical cores / 256 GB RAM.
WM defaults to 40 workers on setup, but I have to change the setup to tell it to use 90% of the RAM (instead of the default 70%). As soon as I do that, the UI refreshes and caps the slider at 31 threads, and it’s impossible to change. Could you fix the UI (I’m pretty sure it’s only a UI bug and not an internal one) so that it doesn’t scale from 1 to 31 but from 1 to, say, the detected logical core count times 2, so that everyone can set whatever they feel is appropriate?
After more testing, this is more than a UI issue. I found where the setting is stored and set it to 40/40 inside the file, but WM then used up to 32 threads and no more. Is there a hard limit in the code, and is it easy to lift? I’m assuming this isn’t by design, since we talked long ago and you didn’t think there’d be an issue on a massive 80-core server.
Note: all of this is on the latest dev version.
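For what it’s worth, purely as a hypothetical illustration (this is not WM’s actual code, and the constant is an assumption), a hard-coded clamp like the one below, applied after the settings file is read, would produce exactly the behaviour I’m seeing: 40 requested, 32 running.

```cpp
// Hypothetical sketch only -- not World Machine's actual code.
// A clamp like this, applied after reading the config file, would explain
// why a setting of 40 workers still results in only 32 threads running.
#include <algorithm>
#include <cstdio>
#include <thread>

int main() {
    const int kHardCap = 32;                        // assumed internal ceiling
    int requested = 40;                             // value read from the settings file
    int detected  = (int)std::thread::hardware_concurrency(); // 40 on a 20c/40t box

    int workers = std::min({requested, detected, kHardCap});
    std::printf("requested=%d detected=%d -> workers=%d\n",
                requested, detected, workers);      // prints workers=32
    return 0;
}
```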
Does that even work at all? My project uses maybe 10-15 threads at first, but once the slow devices appear, which are only handled single-threaded, the rest stalls. Of course that might be something I can optimize in my project, but still.
Throwing more cores at the problem certainly does start hitting Amdahl’s Law after a while (see the rough numbers after the list below).
The extent to which WM can extract parallelism depends on:
- What types of devices are used in your network
- How many parallel chains exist in your network. WM can work simultaneously on different devices that don’t depend on each other’s inputs, so a “fat, short” network is much faster than a “long, skinny” one.
- Whether you’re using tiled or normal builds. Tiled builds are inherently embarrassingly parallel.
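To put a rough number on the Amdahl’s Law point, here’s a minimal sketch. The 10% serial fraction is purely an assumption for illustration, not a measurement of any real WM build:

```cpp
// Amdahl's Law: speedup(N) = 1 / (s + (1 - s) / N), where s is the fraction
// of the build that runs single-threaded (e.g. devices with no parallel path).
// The serial fraction below is assumed, not measured.
#include <cstdio>

int main() {
    const double s = 0.10;                  // assume 10% of the build is serial
    for (int n : {8, 16, 32, 64, 128}) {
        double speedup = 1.0 / (s + (1.0 - s) / n);
        std::printf("%3d threads -> %.1fx speedup\n", n, speedup);
    }
    return 0;
}
```

With even a 10% serial portion, 128 threads only get you about 8.8x, which is why the last chunk of cores often seems to do very little.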
Hi Stephen,
I just looked at Amdahl’s law as mentioned above. The theory is sound, but it seems the problem is still the heterogeneous structure of the software core. Some devices are parallel, some are not. Some work with tiling, some do not. What’s the problem with making all devices, even filters, parallel? Even a millisecond saved would be worth it. I don’t know how complex this is to code, but could it be done?
Off topic: I just used a renderer called Redshift. It uses all the resources of the system, like CPUs, GPUs, GPU memory, and system memory, simultaneously on a single render. Could that be implemented for World Machine devices?
I’m not Stephen, but I am a professional SW developer with more than two decades of multiprocessor/multicore experience. The short answer is “It can be impractically difficult for some devices.” Some things are trivially easy to parallelize. Some things have no known (useful) parallel algorithm, which means you start with a PhD-level research program in mathematics and spend a couple of years on that before you ever get to coding. Lots of things are between those extremes, where there are serious practical limits on how much effective parallelism you can throw at the problem, and where you for damned sure can’t just wave a CUDA library at it and say “problem solved, let the HW do it.”
So, no, not really.
I know about the problems with parallel computing when no algorithm has previously been written for a particular problem. But terrain simulation has been around long enough, and some projects have indeed ported it to the GPU. I just don’t know much about how Stephen coded the WM devices. That’s why I was asking him if it could be done. So my question still stands, Stephen: could it be done?
I guess my point was that “terrain simulation” is not a uniform topic, so there’s no sensible answer for all of “terrain simulation”, only sensible answers for specific devices (or related classes of device).
I’m currently working on a geological fault (crack) generator based on a grid, which superficially looks like something quite easy to parallelize. Unfortunately, 1) the effects of creating or extending a crack can propagate arbitrarily far, 2) the semi-random nature of the process means cracks starting at different seed points (by different threads) won’t agree when their render extents reach each other, and 3) simple blending across render extents/tiles produces grossly visible artifacts. Those factors are totally different from what you get with a Perlin noise generator or a blur filter. But it’s certainly included in “terrain simulation”.
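To make point (2) concrete, here’s a toy sketch (hypothetical code, not the actual fault generator): two tiles each grow a random-walk “crack” toward their shared edge using their own per-tile RNG seed, and the rows where they arrive at that edge generally don’t match, which is exactly the seam problem.

```cpp
// Toy illustration of point (2): two tiles grow a random-walk "crack"
// toward their shared edge, each with its own per-tile RNG seed. The crack
// heights where they meet generally disagree, so naive tiling leaves a seam.
// (Hypothetical sketch -- not the actual fault-generator code.)
#include <cstdio>
#include <random>

// Walk a crack horizontally across a tile of the given width, starting at
// startRow and randomly stepping up/down/straight each column. Returns the
// row where the crack exits the tile at its right-hand edge.
int walkCrack(int width, int startRow, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> step(-1, 1);
    int row = startRow;
    for (int x = 0; x < width; ++x)
        row += step(rng);
    return row;
}

int main() {
    const int tileWidth = 256;
    // Both tiles should meet at the same row on the shared edge, but
    // independent seeds mean they almost never do.
    int exitA = walkCrack(tileWidth, /*startRow=*/128, /*seed=*/1234);
    int exitB = walkCrack(tileWidth, /*startRow=*/128, /*seed=*/5678);
    std::printf("tile A reaches shared edge at row %d\n", exitA);
    std::printf("tile B reaches shared edge at row %d\n", exitB);
    std::printf("seam mismatch: %d rows\n", exitA - exitB);
    return 0;
}
```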
I just tried dev 3 and things have improved substantially: there’s no longer a “hard cap” on threads, but some devices seem to have an internal cap (for example, if I take the default world and use 128 threads, Advanced Perlin will only use 48 but Terrace will use all 128). Any chance this can get looked at, or is this another hard limit coded into the device?