# SPEEDUP - OPTIMIZATION AND PORTING OF PATH INTEGRAL MC CODE TO NEW COMPUTING ARCHITECTURES

V. SLAVNIĆ, A. BALAŽ, D. STOJILJKOVIĆ, A. BELIĆ, A. BOGOJEVIĆ

SCIENTIFIC COMPUTING LABORATORY, INSTITUTE OF PHYSICS BELGRADE, SERBIA

HTTP://WWW.SCL.RS/





#### OVERVIEW

- INTRODUCTION
- SPEEDUP CODE
- TESTED HARDWARE ARCHITECTURES
- RESULTS
  - SETUP
  - SERIAL SPEEDUP CODE
  - MPI SPEEDUP CODE
  - MODIFIED SPEEDUP CODE
  - CELL SPEEDUP CODE
- COMPARISON OF HARDWARE PERFORMANCE RESULTS
- CONCLUSIONS



SEP 08, 2009

#### INTRODUCTION

- SPEEDUP CODE IS USED FOR NUMERICAL STUDIES OF QUANTUM MECHANICAL SYSTEMS, PROPERTIES OF BECS AND ULTRA-COLD ATOMIC GASES
- PORTING OF THE CODE ENABLES ITS USE ON A BROADER SET OF COMPUTING RESOURCES
- CODE OPTIMIZATION ALLOWS US TO
  - FULLY UTILIZE COMPUTING RESOURCES
  - ELIMINATE BOTTLENECKS IN THE CODE
  - USE DIFFERENT ARCHITECTURES IN A PROPER WAY
  - BUT, IT MUST BE DONE CAREFULLY (VERIFICATION!)
- POSSIBILITY OF BENCHMARKING OF DIFFERENT HARDWARE PLATFORMS
- USE RESULTS FOR PLANNING OF HARDWARE UPGRADES



# SPEEDUP CODE (1/2)

- MONTE CARLO SIMULATIONS ARE A NATURAL CHOICE FOR NUMERICAL STUDIES OF RELEVANT PHYSICAL SYSTEMS IN THE FUNCTIONAL FORMALISM: PATH INTEGRAL MONTE CARLO
- SPEEDUP CODE CALCULATES TRANSITION AMPLITUDES USING THE EFFECTIVE ACTION APPROACH

$$A_{N}(i;f;T) = \left(\frac{1}{2\pi\varepsilon_{N}}\right)^{N/2} \int dq_{1}...dq_{N-1}e^{-S_{N}}$$

- IT IS ABLE TO CALCULATE PARTITION FUNCTIONS AND EXPECTATION VALUES
- IT CAN ALSO BE USED TO EXTRACT INFORMATION ABOUT THE LOW-LYING ENERGY SPECTRA OF QUANTUM SYSTEMS
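As an illustration of how such an amplitude can be estimated by Monte Carlo, the sketch below samples discretized paths between fixed end points and averages the potential part of $e^{-S_N}$. It is a minimal sketch under stated assumptions, not the SPEEDUP implementation: it uses the naive discretized action instead of the effective action, units with $\hbar = m = 1$, Python's `random` module instead of SPRNG, and all function names are hypothetical. The kinetic part of the action is sampled exactly as a Brownian bridge, so the estimate is the free-particle amplitude times the average of $e^{-\varepsilon_N \sum V}$.

```python
import math
import random


def bridge_path(a, b, T, N, rng):
    """Sample q_0..q_N of a discrete Brownian bridge from a (t=0) to b (t=T)."""
    eps = T / N
    walk = [0.0]
    for _ in range(N):
        walk.append(walk[-1] + math.sqrt(eps) * rng.gauss(0.0, 1.0))
    # Pin the free random walk to the end points a and b.
    return [a + (b - a) * n / N + walk[n] - (n / N) * walk[N]
            for n in range(N + 1)]


def amplitude_mc(a, b, T, N, nmc, potential, seed=12345):
    """Estimate A_N(a, b; T) using the naive discretized action (assumption)."""
    rng = random.Random(seed)
    eps = T / N
    total = 0.0
    for _ in range(nmc):
        q = bridge_path(a, b, T, N, rng)
        # Trapezoidal rule for the potential part of the action.
        s_pot = eps * sum(0.5 * (potential(q[n]) + potential(q[n + 1]))
                          for n in range(N))
        total += math.exp(-s_pot)
    # Free-particle amplitude: the exact integral of the kinetic part.
    free = math.exp(-(b - a) ** 2 / (2.0 * T)) / math.sqrt(2.0 * math.pi * T)
    return free * total / nmc


def harmonic(q):
    """Harmonic oscillator potential with unit frequency (zero anharmonicity)."""
    return 0.5 * q * q


def exact_harmonic_amplitude(a, b, T):
    """Closed-form Euclidean amplitude of the unit-frequency oscillator."""
    s, c = math.sinh(T), math.cosh(T)
    exponent = -((a * a + b * b) * c - 2.0 * a * b) / (2.0 * s)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * s)
```

Calling `amplitude_mc(0.0, 1.0, 1.0, 32, 5000, harmonic)` and comparing with `exact_harmonic_amplitude(0.0, 1.0, 1.0)` gives a quick check of the estimator. With the naive action the discretization error decreases only slowly with N, which is the motivation for using an effective action of higher level.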



# SPEEDUP CODE (2/2)

#### ALGORITHM






#### GOOD RNG IS ESSENTIAL - WE USE SPRNG

#### TESTED ARCHITECTURES

- IBM BLADECENTER WITH 3 KINDS OF SERVER SYSTEMS IN THE HPC H-TYPE CHASSIS:
  - HS21 XM BLADE SERVER
    - INTEL XEON BASED
    - 2 QUADCORE 5405 PROCESSORS (SSE4.1)
    - ICC AND GCC COMPILERS USED
  - JS22 BLADE SERVER
    - POWER6 BASED
    - 2 DUALCORE PROCESSORS SUPPORTING MULTITHREADING AND ALTIVEC EXTENSIONS
    - IBM XL C/C++ AND GCC COMPILERS USED
  - QS22 BLADE SERVER
    - CELL B/E ARCHITECTURE, 2 POWERXCELL 8I PROCESSORS ON BOARD
    - 1 POWERPC PROCESSOR ELEMENT (PPE) PER PROCESSOR
    - 8 SYNERGISTIC PROCESSING ELEMENTS (SPES) PER PROCESSOR
    - IBM XL C/C++ COMPILER FOR MULTICORE ACCELERATION AND GCC COMPILERS USED




#### PERFORMED TESTS

- NMC=5120000 MC SAMPLES
- BOUNDARY CONDITIONS FOR THE TRANSITION AMPLITUDE
  - $q(t=0)=0$
  - $q(t=T=1)=1$
  - ZERO ANHARMONICITY
  - LEVEL OF EFFECTIVE ACTION P=9 FOR THE QUARTIC ANHARMONIC OSCILLATOR
- SAME SEED FOR SPRNG GENERATOR USED FOR EASY VERIFICATION OF THE OBTAINED RESULTS




### SERIAL SPEEDUP RESULTS

| PLATFORM / COMPILER | GCC          | ICC          | XLC          |
|---------------------|--------------|--------------|--------------|
| INTEL               | (13760±50) s | (10160±30) s |              |
| POWER6              | (17000±10) s |              | (1900±10) s  |
| CELL                | (49410±50) s |              | (14020±20) s |

- SIGNIFICANT INCREASE IN THE SPEED WHEN PLATFORM-SPECIFIC COMPILER IS USED
- POWER6 PERFORMANCE DOMINATES IN THIS BENCHMARK
- CELL IS NO MATCH WHEN ONLY PPE IS USED (WITHOUT THE USE OF SPES)
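The gain from the platform-specific compiler can be quantified directly from the table. The short sketch below divides the GCC time by the vendor-compiler time for each platform, using the central values only (uncertainties are ignored); the variable and function names are illustrative.

```python
# Serial run times in seconds, copied from the table above (central values).
serial_times = {
    "INTEL": {"GCC": 13760.0, "ICC": 10160.0},
    "POWER6": {"GCC": 17000.0, "XLC": 1900.0},
    "CELL": {"GCC": 49410.0, "XLC": 14020.0},
}


def compiler_gain(times):
    """Ratio of the GCC time to the platform-specific compiler time."""
    vendor = next(name for name in times if name != "GCC")
    return times["GCC"] / times[vendor]


gains = {platform: compiler_gain(t) for platform, t in serial_times.items()}
for platform, gain in gains.items():
    print(f"{platform}: {gain:.2f}x faster with the platform-specific compiler")
```

The ratios are about 1.35 for Intel (ICC), 8.95 for POWER6 (XLC) and 3.52 for the Cell PPE (XLC).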



# MPI SPEEDUP RESULTS



- EXCELLENT SCALABILITY WITH THE NUMBER OF MPI PROCESSES
- INTERESTING BEHAVIOR WHEN THE NUMBER OF MPI PROCESSES $\geq 9$
- MINIMAL EXECUTION TIME OF 1320s

XXII International Symposium on Nuclear Electronics & Computing Bulgaria, Varna, 07-14 September, 2009



### MODIFIED SPEEDUP RESULTS



- IMPLEMENTED AS A THREADED VERSION USING POSIX THREADS (PTHREADS)
- EACH THREAD CALCULATES NMC/NUM\_THREADS MC SAMPLES
- INTEL SHOWS A LARGER RELATIVE INCREASE IN SPEED (2.8X, COMPARED TO 1.3X FOR POWER6)
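The work partitioning described above can be sketched as follows. This is an illustration only, under several assumptions: Python's `ThreadPoolExecutor` stands in for POSIX threads, a per-thread `random.Random` stream stands in for independent SPRNG streams, the integrand is a toy function, and all names are hypothetical. CPython threads do not run this loop in parallel, so the sketch shows only how NMC samples are divided and recombined, not the speed gain.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_samples(nmc, num_threads):
    """Divide nmc samples as evenly as possible among the threads."""
    base, extra = divmod(nmc, num_threads)
    return [base + (1 if i < extra else 0) for i in range(num_threads)]


def worker(thread_id, n_samples, seed):
    """Accumulate a toy integrand over this thread's share of the samples."""
    # Hypothetical stream setup: one independent generator per thread.
    rng = random.Random(seed * 1000003 + thread_id)
    acc = 0.0
    for _ in range(n_samples):
        x = rng.random()
        acc += x * x  # toy integrand with exact mean 1/3
    return acc, n_samples


def threaded_mc(nmc, num_threads, seed=2009):
    """Run all workers and combine their partial sums into one average."""
    counts = split_samples(nmc, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(worker, i, counts[i], seed)
                   for i in range(num_threads)]
        partial = [f.result() for f in futures]
    total = sum(p[0] for p in partial)
    done = sum(p[1] for p in partial)
    return total / done, done
```

For example, `split_samples(5120000, 8)` assigns 640000 samples to each of 8 threads. Because every thread has its own seeded stream, a run with the same seed and thread count reproduces the same result, which keeps verification simple.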






# CELL SPEEDUP RESULTS (1/3)

- HETEROGENEITY OF THE ARCHITECTURE REQUIRED A SLIGHT REARRANGEMENT OF THE CODE
- SAME CODE IS EXECUTED ON ALL SPES
- EACH SPE PERFORMS NMC/NUMBER\_OF\_SPES MC STEPS
- NO SPRNG LIBRARY IS AVAILABLE FOR SPES
- PTHREADS ON PPE FOR CONTROL OF SPES AND RANDOM NUMBER GENERATION
- DMA TRANSFERS OF GENERATED RANDOM TRAJECTORIES FROM PPES TO SPES
- SYNCHRONIZATION WITH MAILBOX TECHNIQUE
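The division of labor listed above is a producer-consumer pattern: one side generates random data, the other side consumes it in batches. The sketch below mimics that structure only. It is not Cell code: a bounded `queue.Queue` stands in for DMA transfers plus mailbox synchronization, Python threads stand in for the PPE thread and the SPEs, and the batch size, names and toy accumulation are assumptions made for illustration.

```python
import queue
import random
import threading

BATCH = 256   # hypothetical number of random values per transfer
STOP = None   # sentinel telling a consumer that no more data will arrive


def producer(work_queue, n_batches, n_consumers, seed):
    """PPE-like role: generate random batches and hand them over."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        # put() blocks while the queue is full, like waiting for a free buffer.
        work_queue.put([rng.gauss(0.0, 1.0) for _ in range(BATCH)])
    for _ in range(n_consumers):
        work_queue.put(STOP)


def consumer(work_queue, results, index):
    """SPE-like role: process batches until the stop signal is received."""
    count, acc = 0, 0.0
    while True:
        batch = work_queue.get()
        if batch is STOP:
            break
        acc += sum(x * x for x in batch)  # toy computation on the batch
        count += len(batch)
    results[index] = (count, acc)


def run_pipeline(n_batches=200, n_consumers=4, seed=2009):
    """Start the consumers, feed them, and combine their partial results."""
    work_queue = queue.Queue(maxsize=2 * n_consumers)
    results = [None] * n_consumers
    threads = [threading.Thread(target=consumer, args=(work_queue, results, i))
               for i in range(n_consumers)]
    for t in threads:
        t.start()
    producer(work_queue, n_batches, n_consumers, seed)
    for t in threads:
        t.join()
    total_count = sum(c for c, _ in results)
    total_acc = sum(a for _, a in results)
    return total_count, total_acc / total_count
```

If the consumers finish a batch faster than the producer can generate the next one, they sit idle in `get()`. In this structure the generator, not the number of consumers, then sets the total run time.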




# CELL SPEEDUP RESULTS (2/3)



- SATURATION OF THE PERFORMANCE AROUND 4 SPES CAUSED BY RNG
- COMMUNICATION DOES NOT HAVE SIGNIFICANT IMPACT ON THE EXECUTION TIME
  - TESTED WITH RNG ONLY, FOR VERIFICATION
  - TEST RESULT: 750s; IDEAL TIME: 250s
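A simple way to picture this saturation is a two-stage model in which random number generation on the PPE overlaps with the SPE computation, so the run time is set by the slower of the two. The sketch below is such a model, not a fit to the measurements: it reads the 750 s figure as the RNG-only time, and the total SPE work of 3000 s is a hypothetical value chosen only so that the crossover falls at 4 SPEs.

```python
def cell_run_time(n_spes, t_rng, t_spe_total):
    """Run time when PPE random number generation overlaps SPE computation."""
    return max(t_rng, t_spe_total / n_spes)


def saturation_point(t_rng, t_spe_total):
    """Number of SPEs beyond which the RNG limits the run time."""
    return t_spe_total / t_rng


T_RNG = 750.0          # RNG-only time in seconds, as read from the slide
T_SPE_TOTAL = 3000.0   # HYPOTHETICAL total SPE work in seconds (illustration)

times = {n: cell_run_time(n, T_RNG, T_SPE_TOTAL) for n in (1, 2, 4, 8, 16)}
for n, t in times.items():
    print(f"{n:2d} SPEs: {t:.0f} s")
```

In this model, adding SPEs beyond the saturation point does not reduce the run time. Only a faster generator or more computation per random number moves the crossover to a larger number of SPEs.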




# CELL SPEEDUP RESULTS (3/3)



TO FULLY UTILIZE ALL SPE CAPABILITIES, ONE HAS TO EXTEND SPE CALCULATION TIME

- INCREASE IN THE EFFECTIVE ACTION LEVEL P
- WE DEMONSTRATE THIS BY COMPILING THE CODE WITHOUT OPTIMIZATION

PERFECT SCALING WHEN PPES HAVE ENOUGH TIME FOR RNG




#### COMPARISON OF RESULTS

| INTEL | POWER6 | CELL | CELL IDEAL |
|-------|--------|------|------------|
| 460s  | 250s   | 750s | 250s       |

- RESULTS FOR INTEL AND POWER6 ARE OBTAINED USING MODIFIED SPEEDUP CODE
- CELL IDEAL TIME CORRESPONDS TO THE FULL UTILIZATION OF SPES (ESTIMATED)




#### CONCLUSIONS

- POWER6 AND INTEL OPTIMIZATION IS DONE USING A THREADED VERSION OF THE CODE
- CELL PLATFORM REQUIRES MORE COMPLEX CHANGES OF THE CODE
- PLATFORM-SPECIFIC COMPILERS GAVE MUCH BETTER PERFORMANCE ON ALL TESTED PLATFORMS
- SPEEDUP IS EASILY OPTIMIZED ON THE POWER6 PLATFORM, WITH SUPERIOR PERFORMANCE
- GOOD PERFORMANCE AND SCALABILITY FOR THE INTEL PLATFORM
- CELL REACHES THE SAME LEVEL OF PERFORMANCE AS POWER6 WHEN SPE CALCULATION TIMES ARE HIGHER
- FUTURE WORK: PORTING OF SPRNG LIBRARY TO SPES AND IMPLEMENTATION OF PLATFORM-SPECIFIC INSTRUCTIONS (VECTORIZATION) FOR EACH TESTED PLATFORM

