
μMCX

The shortest GPU-accelerated 3D Monte Carlo photon transport simulator — in <1000 lines of portable C++11

v0.5 · NVIDIA GPU · AMD GPU · CPU Multi-core · OpenMP 4.5 · OpenACC 2.0 · C++11 · JSON I/O · GPL v3 · NIH Funded

🔬 What is uMCX?

uMCX (μMCX) is a miniaturized, maximally portable Monte Carlo photon transport simulator. It simulates light propagation in 3D voxelated turbid media such as biological tissues, with GPU acceleration — all in under 1000 lines of clean, astyle-formatted C++11 code.

🏆 MCX Project

Part of the MCX project, supported by US NIH grant R01-GM114365. uMCX is backward-compatible with MCX JSON input files — existing MCX simulations run with minimal modification.

<1000 lines of code · 24× fewer lines than MCX · 3 GPU vendors

💡 Photon Transport Physics

uMCX implements the gold-standard stochastic Monte Carlo method for simulating photon migration in turbid media. Each photon is tracked from launch to absorption or escape.

🎯 Scattering

  • Henyey-Greenstein phase function
  • Per-voxel anisotropy factor g
  • xorshift128+ RNG (one per thread)

🔀 Boundaries

  • Refractive index mismatch
  • Fresnel reflection / refraction
  • Configurable reflective boundaries

📈 Output

  • Fluence rate / fluence / energy deposition
  • Time-gated 4D output (Nx×Ny×Nz×Nt)
  • Detected photon partial path lengths

💡 Sources

  • Pencil, isotropic, cone beam
  • Disk (top-hat), planar widefield
  • Widefield launch from outside domain

⚡ Performance

uMCX achieves competitive performance on all hardware via OpenMP 4.5 GPU offloading. Benchmarked on cube60 with 100M photons:

| Hardware | Build | Speed |
| --- | --- | --- |
| NVIDIA RTX 5090 | make nvc | 175075 photon/ms (vs mcxcl: 204499 p/ms, mcx: 201613 p/ms) |
| NVIDIA RTX 4090 | make nvc | 133894 photon/ms (vs mcxcl: 166945 p/ms, mcx: 168350 p/ms) |
| NVIDIA RTX 3090 | make nvc | 63331 photon/ms (vs mcxcl: 123001 p/ms, mcx: 114679 p/ms) |
| AMD Radeon VII (gfx906) | make amdclang | ~1780 photon/ms |
| AMD Threadripper 3990X (128 threads) | make multi | ~3034 photon/ms |

📝 Forward-compatible NVIDIA builds

GCC/Clang NVIDIA builds embed PTX (virtual ISA). The CUDA driver JIT-compiles PTX at runtime, so a binary compiled with SM=sm_50 runs correctly on any newer NVIDIA GPU generation.

Key Features

Readable · Portable · GPU-accelerated · MCX-compatible

📄 Shortest Implementation

Under 1000 lines of clean C++11 after astyle formatting — 24× fewer than MCX while covering its core functionality.

🆕 Multi-Vendor GPU

Single source compiles for NVIDIA (nvc++, GCC nvptx, Clang), AMD (ROCm clang++), and CPU via OpenMP 4.5 / OpenACC 2.0.

🔄 MCX-Compatible JSON I/O

Reads MCX JSON input files unchanged. Outputs fluence volumes and detected photon data as Binary JData (BNII / JDB) files.

📊 Time-Gated Simulation

Configurable time windows produce 4D fluence volumes (Nx×Ny×Nz×Nt) for time-domain optical imaging.

📶 Photon Detection

Circular detectors record partial path lengths per medium, exit position, and exit direction for DRS/DCS analysis.

🌐 NeuroJSON.io Cloud

Browse and download MCX simulation datasets from neurojson.io directly with umcx -N colin27.

👀 Shape-Based Domain

Build 3D tissue models from geometric primitives: sphere, box, cylinder, X/Y/Z layer slabs — or supply a pre-computed label volume.

🔧 CMake + Makefile

Both build systems supported. One-line compile for each target. Auto-formatting with astyle ensures a canonical line count.

Comparison with MCX

Same physics core — 24× fewer lines of code

  • 843 uMCX lines (after astyle formatting)
  • 20,188 MCX core lines (C/CUDA source)
  • 24× code reduction (same core physics)
  • 3 GPU vendors (vs NVIDIA-only MCX)

📄 Code size by module

| MCX file | MCX lines | uMCX equivalent | uMCX lines | Reduction |
| --- | --- | --- | --- | --- |
| mcx.c | 90 | main() | 8 | 11× |
| mcx_shapes.c/.h | 940 | MCX_userio::initdomain() | 43 | 21× |
| mcx_tictoc.c/.h | 313 | MCX_clock | 8 | 32× |
| mcx_utils.c/.h | 6036 | MCX_userio | 247 | 24× |
| mcx_core.cu/.h | 4875 | MCX_kernel / MCX_photon / MCX_detect | 382 | 13× |
| mcx_rand_xorshift128p.cu | 140 | MCX_rand | 29 | |
| Total (core) | 20,188 | Total | 843 | 24× |

✓ Feature coverage

| Feature | MCX | uMCX |
| --- | --- | --- |
| 3D voxelated domain | ✔ | ✔ |
| Time-resolved simulation | ✔ | ✔ |
| Detected photon data | ✔ | ✔ |
| Boundary reflection | ✔ | ✔ |
| JSON input / BJData output | ✔ | ✔ |
| Shape-based media | ✔ | ✔ |
| Widefield source launch | ✔ | ✔ |
| NVIDIA GPU | ✔ | ✔ |
| AMD / Intel GPU | via mcxcl | ✔ |
| Source types | ✔ | 5/15 |
| Det. photon output fields | ✔ | 4/8 |
| Built-in benchmarks | ✔ | 7/10 |
| Multi-GPU simulation | ✔ | ✘ |
| Photon replay / sharing | ✔ | ✘ |
| Patterned / multi-source | ✔ | ✘ |
| Polarized light / SVMC | ✔ | ✘ |
| MATLAB / Python bindings | ✔ | trivial |

Legend

✔ fully supported  ·  p/n partial (fraction of total variants)  ·  trivial: easily implementable but omitted to minimize code length  ·  ✘ not implemented

Usage

Multiple ways to run a simulation

# Run a built-in benchmark (default: 1M photons)
umcx cube60
umcx -Q skinvessel

# Override photon count
umcx -Q cube60 -n 1e7

# Print benchmark configuration as JSON (no simulation)
umcx --bench cube60 --dumpjson

# Available benchmarks:
#   cube60       60^3 homogeneous cube, pencil source
#   cube60b      same + Fresnel boundary reflection
#   cube60planar same + planar 40x40 mm widefield source
#   cubesph60b   60^3 cube with embedded sphere (r=15)
#   sphshells    three concentric spherical shells
#   spherebox    sphere in cube with 0.1 ns time gate
#   skinvessel   200^3 realistic skin + cylindrical vessel

# Run from a JSON input file
umcx myinput.json

# Minimal JSON structure:
{
  "Session":  { "ID": "mysim", "Photons": 1000000 },
  "Forward":  { "T0": 0, "T1": 5e-9, "Dt": 5e-9 },
  "Domain": {
    "Dim": [60, 60, 60],
    "LengthUnit": 1,
    "Media": [
      { "mua": 0.00, "mus": 0.0, "g": 1.00, "n": 1.00 },
      { "mua": 0.02, "mus": 9.0, "g": 0.89, "n": 1.37 }
    ]
  },
  "Optode": {
    "Source": { "Type": "pencil", "Pos": [30,30,0], "Dir": [0,0,1] }
  },
  "Shapes": [ { "Grid": { "Tag": 1, "Size": [60,60,60] } } ]
}

# Outputs: mysim.bnii  (fluence volume)
#          mysim_detp.jdb  (detected photon data)

# Common command-line options
umcx -f input.json          # load JSON file
umcx -Q cube60              # built-in benchmark
umcx -n 1e7                 # photon count
umcx -s myresult            # output file name prefix
umcx -b 1                   # enable boundary reflection
umcx -d 0                   # disable detected photon saving
umcx -U 0                   # disable output normalization
umcx -O f                   # output type: f=fluence, x=fluence-rate, e=energy
umcx -G 2                   # select GPU device ID (1-based)
umcx -t 200000 -T 64        # total threads and block size
umcx -w 53                  # savedetflag: 1+4+16+32
umcx -E 12345               # RNG seed

# Override JSON fields inline
umcx -Q cube60 -j '{"Session":{"Photons":5000000}}'
umcx myinput.json --json '{"Optode":{"Source":{"Type":"isotropic"}}}'

# List all available MCX simulations on NeuroJSON.io (requires curl to be installed)
umcx -N

# Download and immediately run a simulation from the cloud
umcx -N colin27
umcx -N skinvessel

# Download, override photon count, and run
umcx -N colin27 -n 5e6

# Fetches from https://neurojson.io:7777/mcx/<name>
# Executed locally — no data upload.

# 1. Compile (CPU multi-core, default)
cd src && make

# 2. Inspect a built-in benchmark as JSON
./bin/umcx --bench cube60 --dumpjson > mycube.json

# 3. Edit mycube.json (geometry, media, source, photon count)

# 4. Run the simulation
./bin/umcx mycube.json -n 1e7

# Console output example:
#   simulated energy 10000000, speed 3245 photon/ms,
#   duration 3082 ms, normalizer 2e-10,
#   detected 4123, absorbed 17.3%

# 5. Read outputs in MATLAB (JData toolbox)
#    vol = loadjson('mycube.bnii');       % fluence-rate volume
#    det = loadjson('mycube_detp.jdb');   % detected photon data

# 6. Read outputs in Python (jdata package)
#    import jdata; vol = jdata.load('mycube.bnii')

Hardware Support

Tested on Linux x86-64 — OpenMP 4.5 / OpenACC 2.0

| Make target | Compiler | Hardware | Status | Notes |
| --- | --- | --- | --- | --- |
| make / make multi | g++ ≥12 | CPU (multi-core) | ✔ Works | Standard OpenMP threading; default build |
| make single | g++ ≥12 | CPU (single-core) | ✔ Works | No threading; useful for debugging |
| make nvc | nvc++ | NVIDIA GPU (OpenMP) | ✔ Works | Best NVIDIA perf; requires CUDA driver at runtime |
| make nvc ACC=on | nvc++ | NVIDIA GPU (OpenACC) | ✔ Works | OpenACC path; similar performance to nvc |
| make nvidia | g++ ≥12 | NVIDIA GPU | ✔ Works | GCC nvptx; PTX forward compat; CPU fallback if no GPU |
| make nvidiaclang | clang++ ≥16 | NVIDIA GPU | ✔ Works | Clang nvptx64; set CUDA_PATH |
| make amdclang | ROCm clang++ ≥17 | AMD GPU | ✔ Works | ROCm ≥6.1; set GFX=gfx906 etc. |
| make amd | g++ ≥12 | AMD GPU | ✘ Broken | GCC 13 libgomp-plugin-amdgcn runtime bug |

🔄 NVIDIA GPU

make nvc (NVIDIA HPC SDK): full GPU support. Requires libcuda.so at runtime — no CPU fallback if the CUDA driver is absent.

make nvidia / make nvidiaclang: embed PTX (NVIDIA virtual ISA). The driver JIT-compiles PTX at runtime → forward-compatible with newer GPU generations. Falls back to CPU if no GPU detected.

Architecture: make nvidia SM=sm_86   Default: SM=sm_50

🔄 AMD GPU

make amdclang (ROCm clang++ ≥17): full AMD GPU support via OpenMP target offload. Override compiler with AMDCXX=.

make amd (GCC): broken — the GCC 13 libgomp-plugin-amdgcn runtime crashes on even a trivial kernel. Use make amdclang instead.

No forward compatibility: a gfx906 binary will not run on gfx1010 or newer. Set: make amdclang GFX=gfx1100   Default: GFX=gfx906

💻 CPU

make or make multi: parallel execution via OpenMP on all CPU cores. No GPU required.

Works on any x86-64 Linux / macOS / Windows with GCC ≥12 or compatible compiler.

Benchmarked: ~3034 photon/ms on 128-thread AMD Threadripper 3990X (10M photons, cube60).

🌎 Portability Notes

Unlike OpenCL, which JIT-compiles kernels at runtime, OpenMP/OpenACC code is compiled ahead of time (AOT) for a specific ISA, so a single binary targets one GPU architecture per vendor.

GCC supports fat binaries with multiple -foffload= targets to cover several GPU generations in one binary.

Standards: OpenMP 4.5 (pointer attachment de facto; formal in 5.0)  ·  OpenACC 2.0

Compilation

All builds start from the src/ directory

💻 CPU (default)

GCC ≥12 with OpenMP. Runs on any x86-64 Linux / macOS / Windows.

sudo apt-get install g++-14
cd src && make

🔄 NVIDIA GPU — nvc++

Download NVIDIA HPC SDK from developer.nvidia.com/hpc-sdk, then:

export PATH=/opt/nvidia/hpc_sdk/.../bin:$PATH
cd src && make nvc

🔄 NVIDIA GPU — GCC

Requires GCC with nvptx offload libraries:

sudo apt-get install g++-14 gcc-14-offload-nvptx
cd src && make nvidia

🔄 AMD GPU — ROCm clang++

Requires ROCm ≥6.1. Specify your GPU architecture:

cd src
make amdclang GFX=gfx1100    # RDNA 3 (RX 7000)
make amdclang GFX=gfx906     # Radeon VII / Vega 20
# rocminfo | grep gfx        # find your arch

⚙️ CMake

CMake ≥3.5 is supported as an alternative build system:

cd src
cmake -B ../build -DBACKEND=OMP
cmake --build ../build

# NVIDIA via nvc++:
cmake -B ../build \
    -DCMAKE_CXX_COMPILER=nvc++ -DBACKEND=NVC

# AMD via ROCm clang++:
cmake -B ../build \
    -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++ \
    -DBACKEND=AMD_CLANG -DGFX=gfx1100

🗎️ From GitHub

Clone the repository and compile:

git clone https://github.com/fangq/umcx.git
cd umcx/src
make            # CPU
make nvc        # NVIDIA GPU (nvc++)
make amdclang   # AMD GPU (ROCm)

Binary placed in ../bin/umcx

Get uMCX

Open source · Single file · No dependencies beyond a C++11 compiler

Current version 0.5 · <1000 lines of code · 24× fewer than MCX · 3 GPU vendors

Supported by NIH Grant R01-GM114365  ·  GPL v3 License

Copyright © 2025–2026 Qianqian Fang <q.fang@neu.edu>
