Release notes for MCX 0.4.9

MCX version 0.4.9 serves as the beta release for the next major milestone, v0.5, and is packed with new features. The software has become easier to use, more robust, and more thoroughly validated. In response to feedback from early adopters, the author made several major changes and added new features, such as saving detected photon info and accepting a photon number as input.

In this release, you will find five pre-compiled MCX executables in the bin folder. You need to choose the right one for your simulations, depending on your emphasis. The following list summarizes the differences between these binaries and the tasks each is suited for:

mcx (Vanilla MCX) - This is the original flavor of MCX. By default it uses a Logistic-Lattice (LL) RNG and non-atomic global memory access. It is the fastest of all the variants. However, it does not save photon partial path length info at the detectors, and when running a large number of threads (>10000), the accuracy of the solution near the source may degrade due to non-atomic writes. It is recommended to run this code with fewer than 5000 threads. Note, however, that on a Fermi GPU, small thread numbers may not give you the highest speed-up.

mcx_det (Detective MCX) - This is a new variant of MCX, which supports saving partial path length info at detectors. As the most-requested feature from users, saving path info enables fast rescaling of the measurement without rerunning MCX at new absorption values. Please use "-d 1" to trigger mcx_det to dump a .mch (MC history) file for all the captured photons. The default buffer in mcx_det can hold 10^6 photons; if you need more, please use the -H option to set a higher limit.
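
The rescaling that saved path lengths make possible can be sketched as follows. This is a minimal illustration, assuming each detected photon's partial path lengths (one per medium, in mm) have already been loaded into plain Python lists and that each photon starts with unit weight. The function name, the data layout, and the sample numbers are hypothetical and do not describe the .mch file format. Each photon's weight under a new set of absorption coefficients follows the Beer-Lambert factor exp(-sum_i mua_i * L_i).

```python
import math

def rescale_weights(partial_paths, mua):
    """Return one weight per detected photon for absorption coefficients mua.

    partial_paths: list of per-photon lists; entry [p][i] is the path length
                   (mm) photon p travelled in medium i (hypothetical layout).
    mua:           absorption coefficient (1/mm) of each medium.
    """
    weights = []
    for paths in partial_paths:
        if len(paths) != len(mua):
            raise ValueError("one path length is needed per medium")
        # Beer-Lambert attenuation accumulated over all media
        attenuation = sum(m * length for m, length in zip(mua, paths))
        weights.append(math.exp(-attenuation))
    return weights

# Two made-up photons in a two-medium domain
paths = [[10.0, 0.0], [4.0, 6.0]]
low = rescale_weights(paths, [0.01, 0.02])   # original absorption
high = rescale_weights(paths, [0.02, 0.04])  # doubled absorption, no rerun
```

The same stored path lengths are reused for both absorption settings, which is why no new simulation is needed.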

mcx_cached (Cachebox MCX) - This is another new variant of MCX. It serves as a replacement for the slow and non-scalable atomic variant, and targets high accuracy near the source. In mcx_cached, shared memory is used to record a sub-zone near the source, with a half-edge length determined by the "-R" option. When a photon propagates inside this sub-zone, atomic operations are used to accumulate fluence in the shared memory. This is significantly (100x) faster than running atomics in the global memory. You can run over 40000 threads with mcx_cached and the accuracy near the source remains almost intact.
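
A minimal sketch of the sub-zone test, assuming the cached region is a cube of voxels centered on the source voxel with half-edge length R (the value given to "-R") and that the cube's faces are included. The function name and the integer voxel coordinates are illustrative assumptions and are not taken from the MCX kernel, where this decision would select between shared-memory and global-memory accumulation.

```python
def in_cached_subzone(voxel, source, half_edge):
    """True if voxel lies within half_edge voxels of source along every axis."""
    return all(abs(v - s) <= half_edge for v, s in zip(voxel, source))

source = (30, 30, 0)                                      # hypothetical source voxel
near = in_cached_subzone((32, 29, 3), source, half_edge=4)  # inside the cube
far = in_cached_subzone((40, 30, 0), source, half_edge=4)   # outside the cube
```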

mcx_det_cached (Cachebox Detective MCX) - If you want both to save the detected photons and to keep the solution accurate in the region very close (<4mm) to the source, you should use this variant. However, you should be aware that it carries the overhead of both additional computations.

mcx_atomic (Atomic MCX) - This is the original fully atomic version of MCX. However, given mcx_cached, the fully atomic version is not really needed in most cases. It remains the most accurate variant (if only slightly), but of course also the slowest of all.
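
The guidance in the list above can be condensed into a small decision helper. This is only a sketch of that guidance; the function and its argument names are hypothetical, and because these notes do not say whether mcx_atomic saves detected photons, that combination is rejected rather than guessed.

```python
def choose_variant(save_detected=False, accurate_near_source=False,
                   fully_atomic=False):
    """Pick one of the five binaries from the needs described in the notes."""
    if fully_atomic:
        if save_detected:
            # Not covered by the release notes, so refuse instead of guessing
            raise ValueError("notes do not describe detection in mcx_atomic")
        return "mcx_atomic"
    if save_detected and accurate_near_source:
        return "mcx_det_cached"
    if save_detected:
        return "mcx_det"
    if accurate_near_source:
        return "mcx_cached"
    return "mcx"  # fastest option when neither feature is needed

choice = choose_variant(save_detected=True, accurate_near_source=True)
```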

Generally speaking, mcx_det is about 20% slower than mcx; mcx_cached is also about 20% slower than mcx; mcx_det_cached is roughly 30% slower than mcx when both '-d' and '-R' are turned on. Although 20~30% may seem like a big penalty in speed, given the high speed-up of mcx, overall it is not bad at all. I am sure this will improve with further advances in hardware and software.

One should also note that Vanilla MCX can be used even on first-generation CUDA devices (compute_10); mcx_det and mcx_atomic require at least "compute_11" devices such as the good old 8800GT; mcx_cached and mcx_det_cached need atomic operations on shared memory, which requires at least "compute_12" devices such as the GTX 2xx. Of course, if you have a higher-end model, such as a GTX 275/285/295 (compute_13) or a GTX 470/480 (compute_20), all of the variants should work with no problem. To find out the capability of your card, please read Appendix A in [1].
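
The requirements above amount to a lookup table. The sketch below encodes them as (major, minor) tuples and lists which binaries a given card can run; the dictionary and function names are illustrative only, and the capability of your card must still be looked up as described above.

```python
# Minimum compute capability per variant, as stated in these notes
MIN_COMPUTE = {
    "mcx": (1, 0),
    "mcx_det": (1, 1),
    "mcx_atomic": (1, 1),
    "mcx_cached": (1, 2),
    "mcx_det_cached": (1, 2),
}

def runnable_variants(capability):
    """Return the variants usable on a card, e.g. capability=(1, 3) for compute_13."""
    return sorted(name for name, need in MIN_COMPUTE.items()
                  if capability >= need)

first_gen = runnable_variants((1, 0))   # only Vanilla MCX qualifies
fermi = runnable_variants((2, 0))       # all five variants qualify
```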

In addition to a number of new variants, there are other new features as well:

MCX now accepts "-n photonnumber" to accurately specify the total number of simulated photons; the old way, i.e. "-m moves-per-thread", is disabled. Although the "-m" approach had been advocated as being "as effective as the photon number", many users still report the total photon number in their work, and the "-n" option enables comparisons with other software.
A simple automatic load-balancing scheme, i.e. the autopilot mode, is implemented via the "-A" option. A simple formula is used to determine the thread/block size automatically. It works well if you have a dedicated GPU; it does not work on a non-dedicated GPU, where you need to try different "-r repeatcount" values to reduce a single kernel call to 5~10 seconds. This strategy shall be refined in the future.
A random number is used to determine transmission/reflection at the boundary. This makes MCX 20x faster than the earlier version, where the reflected photons were always kept alive.
MCX has been more rigorously validated. In a recent paper [2], it was compared with analytical solutions in heterogeneous media (a 3D sphere inclusion). Although this specific example discusses the limitation of MCX when representing curved surfaces with voxels, the overall accuracy of MCX elsewhere is indeed quite satisfactory.
The graphical interface, MCX Studio, has also been updated to allow users to select the new features and tweak parameters.
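
The random transmission/reflection decision mentioned in the list above can be illustrated with a short sketch. It assumes the reflectance at the boundary (a probability between 0 and 1, for example from the Fresnel equations) has already been computed. The function name and the use of Python's random module are illustrative and unrelated to the RNG used inside MCX.

```python
import random

def reflects(reflectance, rng=random):
    """Decide stochastically whether a photon is reflected at a boundary.

    reflectance: probability of reflection, between 0 and 1 (assumed given).
    rng:         any object with a random() method returning a float in [0, 1).
    """
    if not 0.0 <= reflectance <= 1.0:
        raise ValueError("reflectance must lie in [0, 1]")
    # One draw settles the photon's fate: reflected, or transmitted and gone
    return rng.random() < reflectance

# Over many trials the reflected fraction approaches the reflectance
rng = random.Random(0)
trials = 10000
fraction = sum(reflects(0.3, rng) for _ in range(trials)) / trials
```

With a single draw per boundary hit, each photon either continues as a reflected photon or leaves the domain, so there is no need to keep following a reflected copy after every boundary hit.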

For developers, the core kernel of MCX has also been modified significantly, hopefully with improved readability.

The pre-compiled binaries for 32bit/64bit Linux and 32bit (compatible with 64bit) Windows have been uploaded to the download site. All of the binaries were compiled against CUDA 2.3, and should be compatible with the later 3.0 and 3.1 libraries. The latest source package for the released software has also been uploaded.

The author of MCX would greatly appreciate it if you could download and test the software and share your feedback for the future improvement of MCX. The final release of v0.5 is expected to be announced in the next few months, once all the features are solid and bug-free.

Enjoy the new software and let us know your thoughts and experience.

[1] CUDA Programming Guide 3.0, Appendix A.
[2] Fang, Biomed. Opt. Express 1, 165-175 (2010), http://www.opticsinfobase.org/boe/abstract.cfm?uri=boe-1-1-165