
μMCX

The shortest GPU-accelerated 3D Monte Carlo photon transport simulator — in <1000 lines of portable C++11

v0.5 · NVIDIA GPU · AMD GPU · CPU Multi-core · OpenMP 4.5 · OpenACC 2.0 · C++11 · JSON I/O · GPL v3 · NIH Funded

🔬 What is uMCX?

uMCX (μMCX) is a miniaturized, maximally portable Monte Carlo photon transport simulator. It simulates light propagation in 3D voxelated turbid media such as biological tissues, with GPU acceleration — all in under 1000 lines of clean, astyle-formatted C++11 code.

🏆 MCX Project

Part of the MCX project, supported by US NIH grant R01-GM114365. uMCX is backward-compatible with MCX JSON input files — existing MCX simulations run with minimal modification.

<1000 lines of code · 24× fewer lines than MCX · 3 GPU vendors

💡 Photon Transport Physics

uMCX implements the gold-standard stochastic Monte Carlo method for simulating photon migration in turbid media. Each photon is tracked from launch to absorption or escape.

🎯 Scattering

  • Henyey-Greenstein phase function
  • Per-voxel anisotropy factor g
  • xorshift128+ RNG (one per thread)

🔀 Boundaries

  • Refractive index mismatch
  • Fresnel reflection / refraction
  • Configurable reflective boundaries

📈 Output

  • Fluence rate / fluence / energy deposition
  • Time-gated 4D output (Nx×Ny×Nz×Nt)
  • Detected photon partial path lengths

💡 Sources

  • Pencil, isotropic, cone beam
  • Disk (top-hat), planar widefield
  • Widefield launch from outside domain

⚡ Performance

uMCX achieves competitive performance on all hardware via OpenMP 4.5 GPU offloading. Benchmarked on cube60 with 100M photons:

| Hardware | Build | Speed |
| --- | --- | --- |
| NVIDIA RTX 5090 | make nvc | 175075 photon/ms (vs mcxcl: 204499 p/ms, mcx: 201613 p/ms) |
| NVIDIA RTX 4090 | make nvc | 133894 photon/ms (vs mcxcl: 166945 p/ms, mcx: 168350 p/ms) |
| NVIDIA RTX 3090 | make nvc | 63331 photon/ms (vs mcxcl: 123001 p/ms, mcx: 114679 p/ms) |
| AMD Radeon VII (gfx906) | make amdclang | ~1780 photon/ms |
| AMD Threadripper 3990X (128 threads) | make multi | ~3034 photon/ms |

📝 Forward-compatible NVIDIA builds

GCC/Clang NVIDIA builds embed PTX (virtual ISA). The CUDA driver JIT-compiles PTX at runtime, so a binary compiled with SM=sm_50 runs correctly on any newer NVIDIA GPU generation.

Key Features

Readable · Portable · GPU-accelerated · MCX-compatible

📄 Shortest Implementation

Under 1000 lines of clean C++11 after astyle formatting — 24× fewer than MCX while covering its core functionality.

🆕 Multi-Vendor GPU

Single source compiles for NVIDIA (nvc++, GCC nvptx, Clang), AMD (ROCm clang++), and CPU via OpenMP 4.5 / OpenACC 2.0.

🔄 MCX-Compatible JSON I/O

Reads MCX JSON input files unchanged. Outputs fluence volumes and detected photon data as Binary JData (BNII / JDB) files.

📊 Time-Gated Simulation

Configurable time windows produce 4D fluence volumes (Nx×Ny×Nz×Nt) for time-domain optical imaging.

📶 Photon Detection

Circular detectors record partial path lengths per medium, exit position, and exit direction for DRS/DCS analysis.

🌐 NeuroJSON.io Cloud

Browse and download MCX simulation datasets from neurojson.io directly with umcx -N colin27.

👀 Shape-Based Domain

Build 3D tissue models from geometric primitives: sphere, box, cylinder, X/Y/Z layer slabs — or supply a pre-computed label volume.

🔧 CMake + Makefile

Both build systems supported. One-line compile for each target. Auto-formatting with astyle ensures a canonical line count.

Comparison with MCX

Same physics core — 24× fewer lines of code

  • 843 uMCX lines (after astyle formatting)
  • 20,188 MCX core lines (C/CUDA source)
  • 24× code reduction (same core physics)
  • 3 GPU vendors (vs NVIDIA-only MCX)

📄 Code size by module

| MCX file | MCX lines | uMCX equivalent | uMCX lines | Reduction |
| --- | --- | --- | --- | --- |
| mcx.c | 90 | main() | 8 | 11× |
| mcx_shapes.c/.h | 940 | MCX_userio::initdomain() | 43 | 21× |
| mcx_tictoc.c/.h | 313 | MCX_clock | 8 | 32× |
| mcx_utils.c/.h | 6036 | MCX_userio | 247 | 24× |
| mcx_core.cu/.h | 4875 | MCX_kernel / MCX_photon / MCX_detect | 382 | 13× |
| mcx_rand_xorshift128p.cu | 140 | MCX_rand | 29 | |
| Total (core) | 20,188 | Total | 843 | 24× |

✓ Feature coverage

| Feature | MCX | uMCX |
| --- | --- | --- |
| 3D voxelated domain | ✔ | ✔ |
| Time-resolved simulation | ✔ | ✔ |
| Detected photon data | ✔ | ✔ |
| Boundary reflection | ✔ | ✔ |
| JSON input / BJData output | ✔ | ✔ |
| Shape-based media | ✔ | ✔ |
| Widefield source launch | ✔ | ✔ |
| NVIDIA GPU | ✔ | ✔ |
| AMD / Intel GPU | via mcxcl | ✔ |
| Source types | ✔ | 5/15 |
| Det. photon output fields | ✔ | 4/8 |
| Built-in benchmarks | ✔ | 7/10 |
| Multi-GPU simulation | ✔ | ✘ |
| Photon replay / sharing | ✔ | ✘ |
| Patterned / multi-source | ✔ | ✘ |
| Polarized light / SVMC | ✔ | ✘ |
| MATLAB / Python bindings | ✔ | trivial |

Legend

✔ fully supported  ·  p/n partial (fraction of total variants)  ·  trivial: easily implementable but omitted to minimize code length  ·  ✘ not implemented

Usage

Multiple ways to run a simulation

# Run a built-in benchmark (default: 1M photons)
umcx cube60
umcx -Q skinvessel

# Override photon count
umcx -Q cube60 -n 1e7

# Print benchmark configuration as JSON (no simulation)
umcx --bench cube60 --dumpjson

# Available benchmarks:
#   cube60       60^3 homogeneous cube, pencil source
#   cube60b      same + Fresnel boundary reflection
#   cube60planar same + planar 40x40 mm widefield source
#   cubesph60b   60^3 cube with embedded sphere (r=15)
#   sphshells    three concentric spherical shells
#   spherebox    sphere in cube with 0.1 ns time gate
#   skinvessel   200^3 realistic skin + cylindrical vessel

# Run from a JSON input file
umcx myinput.json

# Minimal JSON structure:
{
  "Session":  { "ID": "mysim", "Photons": 1000000 },
  "Forward":  { "T0": 0, "T1": 5e-9, "Dt": 5e-9 },
  "Domain": {
    "Dim": [60, 60, 60],
    "LengthUnit": 1,
    "Media": [
      { "mua": 0.00, "mus": 0.0, "g": 1.00, "n": 1.00 },
      { "mua": 0.02, "mus": 9.0, "g": 0.89, "n": 1.37 }
    ]
  },
  "Optode": {
    "Source": { "Type": "pencil", "Pos": [30,30,0], "Dir": [0,0,1] }
  },
  "Shapes": [ { "Grid": { "Tag": 1, "Size": [60,60,60] } } ]
}

# Outputs: mysim.bnii  (fluence volume)
#          mysim_detp.jdb  (detected photon data)

# Common command-line options
umcx -f input.json          # load JSON file
umcx -Q cube60              # built-in benchmark
umcx -n 1e7                 # photon count
umcx -s myresult            # output file name prefix
umcx -b 1                   # enable boundary reflection
umcx -d 0                   # disable detected photon saving
umcx -U 0                   # disable output normalization
umcx -O f                   # output type: f=fluence, x=fluence-rate, e=energy
umcx -G 2                   # select GPU device ID (1-based)
umcx -t 200000 -T 64        # total threads and block size
umcx -w 53                  # savedetflag: 1+4+16+32
umcx -E 12345               # RNG seed

# Override JSON fields inline
umcx -Q cube60 -j '{"Session":{"Photons":5000000}}'
umcx myinput.json --json '{"Optode":{"Source":{"Type":"isotropic"}}}'

# List all available MCX simulations on NeuroJSON.io (requires curl to be installed)
umcx -N

# Download and immediately run a simulation from the cloud
umcx -N colin27
umcx -N skinvessel

# Download, override photon count, and run
umcx -N colin27 -n 5e6

# Fetches from https://neurojson.io:7777/mcx/<name>
# Executed locally — no data upload.

# 1. Compile (CPU multi-core, default)
cd src && make

# 2. Inspect a built-in benchmark as JSON
./bin/umcx --bench cube60 --dumpjson > mycube.json

# 3. Edit mycube.json (geometry, media, source, photon count)

# 4. Run the simulation
./bin/umcx mycube.json -n 1e7

# Console output example:
#   simulated energy 10000000, speed 3245 photon/ms,
#   duration 3082 ms, normalizer 2e-10,
#   detected 4123, absorbed 17.3%

# 5. Read outputs in MATLAB (JData toolbox)
#    vol = loadjson('mycube.bnii');       % fluence-rate volume
#    det = loadjson('mycube_detp.jdb');   % detected photon data

# 6. Read outputs in Python (jdata package)
#    import jdata; vol = jdata.load('mycube.bnii')

Hardware Support

Tested on Linux x86-64 — OpenMP 4.5 / OpenACC 2.0

| Make target | Compiler | Hardware | Status | Notes |
| --- | --- | --- | --- | --- |
| make / make multi | g++ ≥12 | CPU (multi-core) | ✔ Works | Standard OpenMP threading; default build |
| make single | g++ ≥12 | CPU (single-core) | ✔ Works | No threading; useful for debugging |
| make nvc | nvc++ | NVIDIA GPU (OpenMP) | ✔ Works | Best NVIDIA perf; requires CUDA driver at runtime |
| make nvc ACC=on | nvc++ | NVIDIA GPU (OpenACC) | ✔ Works | OpenACC path; similar performance to nvc |
| make nvidia | g++ ≥12 | NVIDIA GPU | ✔ Works | GCC nvptx; PTX forward compat; CPU fallback if no GPU |
| make nvidiaclang | clang++ ≥16 | NVIDIA GPU | ✔ Works | Clang nvptx64; set CUDA_PATH |
| make amdclang | ROCm clang++ ≥17 | AMD GPU | ✔ Works | ROCm ≥6.1; set GFX=gfx906 etc. |
| make amd | g++ ≥12 | AMD GPU | ✘ Broken | GCC 13 libgomp-plugin-amdgcn runtime bug |

🔄 NVIDIA GPU

make nvc (NVIDIA HPC SDK): full GPU support. Requires libcuda.so at runtime — no CPU fallback if the CUDA driver is absent.

make nvidia / make nvidiaclang: embed PTX (NVIDIA virtual ISA). The driver JIT-compiles PTX at runtime → forward-compatible with newer GPU generations. Falls back to CPU if no GPU detected.

Architecture: make nvidia SM=sm_86   Default: SM=sm_50

🔄 AMD GPU

make amdclang (ROCm clang++ ≥17): full AMD GPU support via OpenMP target offload. Override compiler with AMDCXX=.

make amd (GCC): broken — the GCC 13 libgomp-plugin-amdgcn runtime crashes on even a trivial kernel. Use make amdclang instead.

No forward compatibility: a gfx906 binary will not run on gfx1010 or newer. Set: make amdclang GFX=gfx1100   Default: GFX=gfx906

💻 CPU

make or make multi: parallel execution via OpenMP on all CPU cores. No GPU required.

Works on any x86-64 Linux / macOS / Windows with GCC ≥12 or compatible compiler.

Benchmarked: ~3034 photon/ms on 128-thread AMD Threadripper 3990X (10M photons, cube60).

🌎 Portability Notes

Unlike OpenCL, which JIT-compiles kernels at runtime, OpenMP/OpenACC code is compiled ahead of time (AOT) for a specific ISA, so a single binary targets one GPU architecture per vendor.

GCC supports fat binaries with multiple -foffload= targets to cover several GPU generations in one binary.

Standards: OpenMP 4.5 (pointer attachment de facto; formal in 5.0)  ·  OpenACC 2.0

Compilation

All builds start from the src/ directory

💻 CPU (default)

GCC ≥12 with OpenMP. Runs on any x86-64 Linux / macOS / Windows.

sudo apt-get install g++-14
cd src && make

🔄 NVIDIA GPU — nvc++

Download NVIDIA HPC SDK from developer.nvidia.com/hpc-sdk, then:

export PATH=/opt/nvidia/hpc_sdk/.../bin:$PATH
cd src && make nvc

🔄 NVIDIA GPU — GCC

Requires GCC with nvptx offload libraries:

sudo apt-get install g++-14 gcc-14-offload-nvptx
cd src && make nvidia

🔄 AMD GPU — ROCm clang++

Requires ROCm ≥6.1. Specify your GPU architecture:

cd src
make amdclang GFX=gfx1100    # RDNA 3 (RX 7000)
make amdclang GFX=gfx906     # Radeon VII / Vega 20
# rocminfo | grep gfx        # find your arch

⚙️ CMake

CMake ≥3.5 is supported as an alternative build system:

cd src
cmake -B ../build -DBACKEND=OMP
cmake --build ../build

# NVIDIA via nvc++:
cmake -B ../build \
    -DCMAKE_CXX_COMPILER=nvc++ -DBACKEND=NVC

# AMD via ROCm clang++:
cmake -B ../build \
    -DCMAKE_CXX_COMPILER=/opt/rocm/llvm/bin/clang++ \
    -DBACKEND=AMD_CLANG -DGFX=gfx1100

🗎️ From GitHub

Clone the repository and compile:

git clone https://github.com/fangq/umcx.git
cd umcx/src
make            # CPU
make nvc        # NVIDIA GPU (nvc++)
make amdclang   # AMD GPU (ROCm)

Binary placed in ../bin/umcx

Get uMCX

Open source · Single file · No dependencies beyond a C++11 compiler

Current version 0.5 · <1000 lines of code · 24× fewer than MCX · 3 GPU vendors

Supported by NIH Grant R01-GM114365  ·  GPL v3 License

Copyright © 2025–2026 Qianqian Fang <q.fang@neu.edu>
