
Release notes for MCX 0.4.9

MCX version 0.4.9 serves as the beta release for the next major milestone, v0.5, and is full of exciting new features. The software has become easier to use, more robust, and more thoroughly validated. In response to feedback from early adopters, the author has made several major changes and added new features such as saving detected photon info and accepting a photon number as input.

In this release, you can find 5 pre-compiled MCX executables in the bin folder. Choose the one that fits your simulation. The following list summarizes the differences between these binaries and the tasks each is suited for:

  • mcx (Vanilla MCX) - This is the original flavor of MCX. By default it uses a Logistic-Lattice (LL) RNG and non-atomic global-memory writes. It is the fastest of all the variants. However, it does not save photon partial-path-length info at the detectors, and when running a large number of threads (>10000), the accuracy of the solution near the source may degrade because of the non-atomic writes. It is recommended to run this binary with fewer than 5000 threads. Note, however, that on a Fermi GPU, small thread numbers may not give you the highest speed-up.

  • mcx_det (Detective MCX) - This is a new variant of MCX, which supports saving partial path length info at detectors. The most requested feature from users, saving path info enables fast rescaling of the measurement without rerunning MCX with new absorption values. Please use "-d 1" to make mcx_det dump a .mch (MC history) file for all the captured photons (the loader, loadmch.m, is under the utils/ directory). The default buffer in mcx_det can hold up to 10^6 photons; if you need more, please use the "-H" option.

  • mcx_cached (Cachebox MCX) - This is another new variant of MCX. It serves as a replacement for the slow and non-scalable atomic variant, and targets high accuracy near the source. In mcx_cached, we use shared memory to record a sub-zone near the source, with a half-edge length determined by the "-R" option. When a photon propagates inside this sub-zone, atomic operations are used to record fluence in the shared memory. This is significantly (>5x) faster than running atomics in the global memory. You can run over 40000 threads with mcx_cached and the accuracy near the source is almost intact.

  • mcx_det_cached (Cachebox Detective MCX) - If you need the detected photons and also care about the region very close (<4mm) to the source, you should use this variant. Be aware, however, that it carries the overheads of both additions.

  • mcx_atomic (Atomic MCX) - This is the original fully atomic version of MCX. However, given mcx_cached, the fully atomic version is not really needed in most cases. It remains the most accurate variant, if only slightly, but it is also the slowest of all.
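The rescaling that mcx_det makes possible can be sketched as follows. This is only a minimal illustration, assuming that each detected photon record holds one partial path length (in mm) per medium, that photons are launched with unit weight, and that absorption is the only source of weight loss. The function name and the sample data are hypothetical; the actual .mch layout is not described here and should be read with loadmch.m.

```python
import math

def rescale_detected_weight(partial_paths_mm, mua_per_mm):
    """Sum exp(-sum_i mua_i * L_i) over all detected photons.

    partial_paths_mm: list of per-photon lists, one path length (mm) per medium.
    mua_per_mm: absorption coefficient (1/mm) of each medium, in the same order.
    """
    total = 0.0
    for paths in partial_paths_mm:
        # Beer-Lambert attenuation accumulated along this photon's path.
        attenuation = sum(mua * length for mua, length in zip(mua_per_mm, paths))
        total += math.exp(-attenuation)
    return total

# Hypothetical data: three detected photons in a two-medium domain.
photons = [[10.0, 0.0], [12.5, 3.0], [20.0, 5.5]]

baseline = rescale_detected_weight(photons, [0.005, 0.01])
perturbed = rescale_detected_weight(photons, [0.005, 0.02])

# Relative change of the detected signal when the second medium absorbs more.
print(perturbed / baseline)
```

Because the path lengths do not depend on absorption, the same records can be reused for any number of absorption values, which is why no new simulation is needed.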

Generally speaking, mcx_det is about 20% slower than mcx; mcx_cached is also about 20% slower than mcx; mcx_det_cached is roughly 30% slower than mcx when both '-d' and '-R' are turned on. Although a 20~30% slowdown may seem large, it is acceptable overall given the high speed-up of mcx. I am sure this will improve with future advances in hardware and software.

To give an idea of the speed, for the standard benchmark example (60x60x60 mm domain, mua=0.005/mm, mus=1/mm, g=0.01, tmax=10 ns with no reflection, see example/quicktest/), Vanilla MCX can simulate 12000 photons/ms with a GTX 470, 6300 photons/ms with one core of a GTX 295 and over 3000 photons/ms with an 8800GT card. It can even run 220 photons/ms on an entry-level Quadro NVS 295 card. In comparison, tMCimg can run 20 photons/ms with a single core of a Xeon 5120, and 41 photons/ms with a single core of a Xeon E5520.
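To put these figures in perspective, the short sketch below turns the quoted throughputs into speed-up ratios and run times. The numbers come from the paragraph above; the dictionary keys and helper names are made up for this illustration, and the 8800GT entry uses the lower bound of "over 3000".

```python
# Throughput in photons per millisecond, as quoted in the text above.
throughput = {
    "GTX 470": 12000,
    "GTX 295 (one core)": 6300,
    "8800GT": 3000,  # the text says "over 3000"
    "Quadro NVS 295": 220,
    "Xeon 5120 (tMCimg)": 20,
    "Xeon E5520 (tMCimg)": 41,
}

def speedup(fast, slow):
    """Ratio of the two throughputs."""
    return throughput[fast] / throughput[slow]

def seconds_for(photons, device):
    """Run time in seconds for the given photon count at the quoted rate."""
    return photons / throughput[device] / 1000.0

print(speedup("GTX 470", "Xeon 5120 (tMCimg)"))    # 600.0
print(seconds_for(10**7, "Xeon 5120 (tMCimg)"))    # 500.0
print(seconds_for(10**7, "GTX 470"))
```

These ratios only hold for the benchmark settings listed above; other domains and optical properties will give different figures.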

One should also note that Vanilla MCX can be used on the first generation of CUDA devices (compute_10); mcx_det and mcx_atomic require at least "compute_11" devices such as the good old 8800GT; mcx_cached and mcx_det_cached need atomic operations on shared memory, which requires at least "compute_12" devices such as the GTX 2xx. Of course, if you have a higher model, such as a GTX 275/285/295 (compute_13) or a GTX 470/480 (compute_20), all of them should work with no problem. To find out the capability of your card, please read Appendix A in [1], or run "mcx -L".
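The requirements above can be collected into a small lookup. The sketch below is only a convenience illustration: the table restates the minimum compute capability given in the text, and the helper name is hypothetical and not part of MCX.

```python
# Minimum compute capability (major, minor) per variant, as stated above.
MIN_CAPABILITY = {
    "mcx": (1, 0),
    "mcx_det": (1, 1),
    "mcx_atomic": (1, 1),
    "mcx_cached": (1, 2),
    "mcx_det_cached": (1, 2),
}

def usable_variants(capability):
    """Return the variants a device with this (major, minor) capability can run."""
    return sorted(
        name for name, needed in MIN_CAPABILITY.items() if capability >= needed
    )

print(usable_variants((1, 0)))  # ['mcx']
print(usable_variants((1, 1)))  # ['mcx', 'mcx_atomic', 'mcx_det']
print(usable_variants((2, 0)))
```

Tuples compare element by element, so (2, 0) correctly ranks above (1, 3).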

In addition to a number of new variants, there are other new features as well:

  1. MCX now accepts "-n photonnumber" to accurately specify the total number of simulated photons; the old way, i.e. "-m moves-per-thread", is disabled. Although the "-m" approach had been advocated as "as effective as the photon number", many users still prefer to report the total photon number in their papers, and the "-n" option enables comparisons with other codes.
  2. A simple automatic load-balancing scheme, i.e. the autopilot mode, is implemented via the "-A" option. A simple formula is used to determine the thread/block size automatically. It works ok if you have a dedicated GPU; it does not work on non-dedicated GPUs, where you need to try different "-r repeatcount" values to reduce a single kernel call to 5~10 seconds. This strategy shall be refined in the future.
  3. A random number is now used to determine transmission/reflection at the boundary. This makes MCX 10x faster than the earlier version with reflections, where the reflected photons were always kept alive.
  4. MCX has been more rigorously validated. In a recent paper [2], it was compared with analytical solutions in heterogeneous media (a 3D sphere inclusion). Although this specific example discusses the limitation of MCX when representing curved surfaces with voxels, the overall accuracy of MCX elsewhere is indeed quite satisfactory.
  5. The graphical interface, MCX Studio, has also been updated to allow users to select the new features and tweak parameters [3].
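The random-number treatment of the boundary in item 3 can be illustrated with a small sketch. It is not the MCX kernel code: these notes do not say which reflectance formula the kernel uses, so the Fresnel reflectance for unpolarized light is assumed here, and the function names and refractive indices are chosen only for the example.

```python
import math
import random

def fresnel_reflectance(cos_i, n1, n2):
    """Unpolarized Fresnel reflectance for light going from index n1 to n2."""
    sin_t2 = (n1 / n2) ** 2 * (1.0 - cos_i ** 2)
    if sin_t2 >= 1.0:
        return 1.0  # total internal reflection
    cos_t = math.sqrt(1.0 - sin_t2)
    r_s = ((n1 * cos_i - n2 * cos_t) / (n1 * cos_i + n2 * cos_t)) ** 2
    r_p = ((n1 * cos_t - n2 * cos_i) / (n1 * cos_t + n2 * cos_i)) ** 2
    return 0.5 * (r_s + r_p)

def photon_reflects(cos_i, n1, n2, rng):
    """Draw one random number: reflect with probability R, otherwise transmit."""
    return rng.random() < fresnel_reflectance(cos_i, n1, n2)

# Example: photons hitting a tissue/air boundary (assumed indices 1.37 and 1.0)
# at normal incidence. Most of them leave the domain and stop being tracked.
rng = random.Random(42)
trials = 100000
reflected = sum(photon_reflects(1.0, 1.37, 1.0, rng) for _ in range(trials))
print(reflected / trials, fresnel_reflectance(1.0, 1.37, 1.0))
```

With this scheme each photon either reflects or exits as a whole, so a photon that is transmitted needs no further tracking, and on average the reflected fraction still matches the reflectance.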

For developers, the core kernel of MCX has also been modified significantly, hopefully improving its readability.

The pre-compiled binaries for 32bit/64bit Linux and 32bit (compatible with 64bit) Windows have been uploaded to the download site. All of the binaries were compiled against CUDA 2.3, and should be compatible with the later 3.0 and 3.1 libraries. Also uploaded is the latest source package for the released software.

The author of MCX would greatly appreciate it if you could download and test the software and share your feedback for future improvement of MCX. The final release of v0.5 is expected to be announced in the next few months, once all the features are solid and bug-free.

Enjoy the new software and let us know your thoughts and experience.
