MCX Speed Challenge (Round 2)

The Round 2 of the MCX Speed Challenge is now open for submissions. The submission window will close on October 8, 2017.

: 1. Introduction
: 2. How to participate
: 3. Rules
: 4. Terms
: 5. Guidance for code optimization

1. Introduction

MCX is an open-source software package for simulating photon propagation inside arbitrary 3D complex media. It is widely used among the bio-photonics community for developing innovative optical imaging methods. Because of the accuracy and generality of the Monte Carlo (MC) algorithm, MCX is one of the most accurate and quantitative photon simulators and is often used to provide the "gold-standard" solutions.

Since the initial publication in 2009, the original MCX paper has accrued over 390 citations. It has been successfully applied in studies of human brains functions, cancer imaging, as well as the development of novel imaging methods, such as photoacoustic imaging, fluorescence tomography, laminar optical tomography, diffuse optical tomography and more. Moreover, MCX has been used as a benchmark in GPU architecture research for compiler development and optimization (Paper 1, Paper 2).

High computational efficiency is vital to the wide-spread utility of MCX. With the GPU acceleration, MCX is one of the fastest photon transport simulators known to-date. Yet, with the rapid advance in GPU architecture, we strongly believe that significant enhancement of computational efficiency is possible.

In this challenge, we invite all CUDA programmers and GPU researchers around the world to help us find inefficient codes and further improve MCX computational efficiency. We encourage innovations in both algorithms and programming strategies, particularly those tailored to achieve high performance on NVIDIA GPUs.

For each round of the challenge, we publish a baseline code. Anyone who can improve the code simulation speed by significant portion (>50%) will receive cash award.

2. How to participate

Please fork our Github branch, entitled "speedchallenge_base2" in the mcx git repository, as the baseline code.

https://github.com/fangq/mcx/tree/speedchallenge_base2

To fork, please click on the "Fork" button on the right-top of the above page. You may then modify/optimize the mcx code inside your forked branch. Once you have completed your code optimization and verified the speed improvement, you may submit your changes to the upstream authors by either a "pull request", or submitting an Issue tracker entitled "MCX Speed Challenge - your name".

The upstream developers will review & test your updated software. If the updated software satisfies all the requirements, as specified in the Rules section below, a cash award will be announced and mailed to the participant.

In the event that a valid improvement is accepted, the upstream author may update the base code to the optimized software, and start a new round of the challenge.

We provided 3 benchmark problems to serve as the reference in simulation speed. To run these standard benchmarks, you need to first compile MCX to generate executable. If you are running Linux, you may simply type "make" inside the mcx/src folder. Once the binary is generated, you can then navigate into the mcx/example/benchmark folder, and run the benchmark scripts run_benchmark{1,2,3}.sh.

3. Rules

The rules to determine the validity of a submission include

The accelerated code must produce the numerically accurate outputs when running on the same GPU. One can use the "absorption fraction", printed at the bottom of the command line output, to measure the accuracy.
A winning submission must be able to improve MCX simulation speed (in the unit of photon/millisecond) by at least 50% in any one of the 3 provided standard benchmarks;
The over 50% speed improvement must be demonstrated between the baseline and the updated software using the same GPU (or GPUs);
In Round 1, valid GPU architectures include Fermi, Kepler and Maxwell;
The updated codes should not significantly sacrifice the readability and future maintainability of the software;
The average simulation speed is determined by running each benchmark script (run_benchmark*.sh) for >3 times and take the median.
Speed improvements due to different compiler versions, driver versions, compiler settings and other issues that are not directly related to the MCX code implementation are not considered.

The cash award will be made with the following rules:

The cash award starts at $500 (for speed improvement equal to or greater than 50%);
Cash award will increase with an increment of $250, corresponding to every 25% additional speed-up;
The maximum award per round is $2,000 (for 2x or higher speed acceleration);
A maximum of two awards can be made for each specific award amount ($500+N*$250, N=0,1,2,...); if more than two submissions are considered for the same speed-up bracket, priorities will be given to the ones with better readability and portability.

The Round 1 baseline code simulation speed for the 3 provide benchmarks are listed below for your reference:

GPU	benchmark 1	benchmark 2	benchmark 3
GTX 1080Ti	59207 †	24938	30665
GTX 1080	42790 †	22873	28514
GTX 980Ti	29146 †	18591	23810

Desired % Absorption	17.7%	27.0%	18.7%

† speed is measured in photon/ms, reported near the bottom of the MCX command line log.

4. Terms

The participants of this challenge must accept the below terms in order to receive the award:

The participant, either individuals or teams, must be the copyright owner of the updated software;
The updated software must be released in the form of open-source software with a license that is compatible to the upstream MCX software license, GNU Public License Version 3 (GPL v3) or later.

The upstream authors have the following responsibilities:

The upstream authors will acknowledge the contribution from the participant, if the participant's submission is incorporated into the upstream source codes;
If a participant's work is accepted into the upstream, if a publication is derived from (or partially from) the updated software, the participant's contribution should be properly acknowledged (or, if significant, listed as co-authors).
The project maintainer reserves the rights to interpret the rules and change the rules without notice.

5. Guidance for code optimization

MCX uses a complex, compute-bound, register-demanding monolithic CUDA kernel for running photon simulations in a Persistent Thread (PT) structure. This kernel is contained in a global function, called mcx_main_loop in the mcx_core.cu unit:

https://github.com/fangq/mcx/blob/speedchallenge_base1/src/mcx_core.cu#L460-L765

Each thread contains a while-loop (lines#537-742) for continuously launching, propagating and depositing photon packet (a group of photons) energy into a volume stored in the global memory. The overall MCX algorithm flow chart can be found at

http://mcx.sourceforge.net/cgi-bin/index.cgi?Doc/Workflow

The simulations in each thread is largely independent to other threads, except that when depositing photon packet energy loss to the global memory, atomic operations is required.

MCX uses a voxelated domain to represent 3D heterogeneous random media. The main computation in MCX focuses on performing a "random-walk" of the photons inside a uniform voxel-space.

In each propagation step, MCX runs a ray-box ray-tracing algorithm (in function hitgrid) to determine the entry position to the next voxel. If the photon's scattering path is shorter than the distance to the entry position of the next voxel, the photon will be moved to the end of the scattering path, and then determines a new propagation direction/distance; otherwise, photon will be moved to the entry-position of the next voxel. In either case, a repeated ray-box ray-tracing calculation is performed to iteratively advance the photon across the domain - until a photon escapes the domain, or exceeds the maximum simulation time window.

Although MCX has shown poor global memory load efficiency (~3%) due to the random-walk nature of the algorithm, the latency due to memory access is marginal, because the latency is well hidden by the large parallel threads. Moreover, the hitgrid function accounts for ~10-15% overall latency; about 1/3 of the latency in hitgrid is a result of execution dependency. A profiler output of an earlier version of MCX can be found in the below attached screenshot (Attachment 1).

In the baseline code, a random number generator (RNG) based on the xorshift128+ algorithm is used. The RNG is self-contained in each thread and consumes about 8 32bit registers. Another built-in RNG is the Logistic-Lattice 5-ring RNG, as described in the Appendix of this paper. To enable the Logistic lattice RNG instead, one needs to recompile mcx binary using "make maxwell".

In the 3 provided benchmarks, we meant to show representative computations for many practical simulations. Here are the main algorithm aspects tested in each benchmark

in Benchmark 1, the computation focuses on the ray-box ray-tracing using the hitgrid function;
in Benchmark 2, we enables boundary reflection calculations. The extended run-time is partly because of the additional reflection/transmission computation, by mostly due to extended photon propagation path due to internal reflection (this is also indicated by the increased absorption fraction from 17% to 27%);
in Benchmark 3, we test the efficiency of the launchphoton function for simulating "wide-field" illuminations - photons launched from outside of the cubic voxelated domain.

As long as you can ensure the accuracy of the simulation and the continuation of the main features, you may optimize all aspects of the kernel, including, but not limited to, memory allocation, use of math functions, RNGs, code control flows, thread/block/grid configurations, compiler optimization options, GPU features. Use of duplicated kernels and assembly codes are acceptable, but not encouraged due to the increased maintenance cost (unless the speed enhancement is significant).

To submit your patches to the upstream repository, you MUST set your code editor to use a Tab width of 8 spaces. If misaligned indentations are found in the submitted code revisions, the upstream authors may request the submitter to reformat the code.

Be aware, when compiling MCX with CUDA 7.5 using "make", the generated binary show 10x slow-down on Maxwell GPUs. This is recently discovered as a bug in CUDA 7.5/driver. An updated driver is on the way. As a work-around, one can use CUDA 6.5 or CUDA 7.0 to compile MCX.