Particle Container to Pure SoA Again #4653
Conversation
idcpu_data.push_back(0);
amrex::ParticleIDWrapper{idcpu_data.back()} = ParticleType::NextID();
amrex::ParticleCPUWrapper(idcpu_data.back()) = ParallelDescriptor::MyProc();
Let's use AMReX-Codes/amrex#3733
## Summary

Update `ParticleCopyPlan::build` for pure SoA particle layout.

## Additional background

- [x] testing on GPU in BLAST-WarpX/warpx#4653

## Checklist

The proposed changes:
- [x] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX users
- [ ] include documentation in the code and/or rst files, if appropriate

Co-authored-by: Andrew Myers <atmyers2@gmail.com>
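For context, here is a minimal, self-contained sketch of the pattern in the snippet above: each pure-SoA particle carries a single packed 64-bit idcpu word, and the AMReX wrappers assign the id and the owning rank into it separately. `append_idcpu` and `next_id` are illustrative names, not WarpX code; `next_id` stands in for `ParticleType::NextID()`.

```cpp
// Sketch only: mirrors the review snippet above with illustrative names.
#include <AMReX_Particle.H>
#include <AMReX_ParallelDescriptor.H>

#include <cstdint>
#include <vector>

void append_idcpu (std::vector<std::uint64_t>& idcpu_data, amrex::Long next_id)
{
    // one packed id/cpu word per particle in the pure-SoA layout
    idcpu_data.push_back(0);

    // each wrapper takes a reference to the packed word and, on assignment,
    // only touches its own bit field
    amrex::ParticleIDWrapper{idcpu_data.back()}  = next_id;
    amrex::ParticleCPUWrapper{idcpu_data.back()} = amrex::ParallelDescriptor::MyProc();
}
```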
Source/Particles/Collision/BinaryCollision/DSMC/SplitAndScatterFunc.H (outdated, resolved)
Source/Particles/Collision/BinaryCollision/ParticleCreationFunc.H (outdated, resolved)
Force-pushed from 1437d6e to 9897a23.
Force-pushed from cf9dd03 to 2ac1993.
More pure SoA and id handling goodness.
Transition to new, purely SoA particle containers. This was originally merged in BLAST-WarpX#3850 and reverted in BLAST-WarpX#4652, since we discovered issues losing particles & laser particles on GPU.
Force-pushed from 2ac1993 to 593221d.
- faster: fewer emitted operations, no jumps (see the sketch below)
- cheaper: fewer registers used
- safer: no read-before-write warnings
- cooler: no explanation needed
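A hedged sketch of what these bullets refer to, assuming the `amrex::SetParticleIDandCPU` helper is available in the AMReX version in use: composing the id and rank into the packed word in one expression lets the compiler emit a single store instead of a zero-fill followed by two read-modify-write wrapper assignments.

```cpp
// Sketch only: assumes amrex::SetParticleIDandCPU is available; names are illustrative.
#include <AMReX_Particle.H>
#include <AMReX_ParallelDescriptor.H>

#include <cstdint>
#include <vector>

void append_idcpu_packed (std::vector<std::uint64_t>& idcpu_data, amrex::Long next_id)
{
    // build the full 64-bit id/cpu word once and store it in a single write:
    // no placeholder zero, no partial bit-field updates, no branches
    idcpu_data.push_back(
        amrex::SetParticleIDandCPU(next_id, amrex::ParallelDescriptor::MyProc()));
}
```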
Force-pushed from 593221d to 3c6cbfd.
GPU Tests (CUDA, A100 on Perlmutter)

diff --git a/Examples/analysis_default_openpmd_regression.py b/Examples/analysis_default_openpmd_regression.py
index 3aadc49ac5..3e9fb98789 100755
--- a/Examples/analysis_default_openpmd_regression.py
+++ b/Examples/analysis_default_openpmd_regression.py
@@ -15,6 +15,6 @@ test_name = os.path.split(os.getcwd())[1]
# Run checksum regression test
if re.search( 'single_precision', fn ):
- checksumAPI.evaluate_checksum(test_name, fn, output_format='openpmd', rtol=2.e-6)
+ checksumAPI.evaluate_checksum(test_name, fn, output_format='openpmd', rtol=4.)
else:
- checksumAPI.evaluate_checksum(test_name, fn, output_format='openpmd')
+ checksumAPI.evaluate_checksum(test_name, fn, output_format='openpmd', rtol=4.)
diff --git a/Examples/analysis_default_regression.py b/Examples/analysis_default_regression.py
index 453f650be0..6fa855df3d 100755
--- a/Examples/analysis_default_regression.py
+++ b/Examples/analysis_default_regression.py
@@ -15,6 +15,6 @@ test_name = os.path.split(os.getcwd())[1]
# Run checksum regression test
if re.search( 'single_precision', fn ):
- checksumAPI.evaluate_checksum(test_name, fn, rtol=2.e-6)
+ checksumAPI.evaluate_checksum(test_name, fn, rtol=4.)
else:
- checksumAPI.evaluate_checksum(test_name, fn)
+ checksumAPI.evaluate_checksum(test_name, fn, rtol=4.)
diff --git a/Regression/WarpX-tests.ini b/Regression/WarpX-tests.ini
index 3310e642dd..84133add09 100644
--- a/Regression/WarpX-tests.ini
+++ b/Regression/WarpX-tests.ini
@@ -40,7 +40,7 @@ use_ctools = 0
# sections.
#MPIcommand = mpiexec -host @host@ -n @nprocs@ @command@
-MPIcommand = mpiexec -n @nprocs@ @command@
+MPIcommand = srun -n @nprocs@ @command@
MPIhost =
reportActiveTestsOnly = 1
@@ -64,7 +64,7 @@ branch = 24.02
[source]
dir = /home/regtester/AMReX_RegTesting/warpx
branch = development
-cmakeSetupOpts = -DAMReX_ASSERTIONS=ON -DAMReX_TESTING=ON -DWarpX_PYTHON_IPO=OFF -DpyAMReX_IPO=OFF
+cmakeSetupOpts = -DAMReX_ASSERTIONS=ON -DAMReX_TESTING=ON -DWarpX_PYTHON_IPO=OFF -DpyAMReX_IPO=OFF -DWarpX_COMPUTE=CUDA
# -DPYINSTALLOPTIONS="--disable-pip-version-check"
 # individual problems follow

cat test.sbatch

Tests that pass within a 10hr walltime in
currSpecies["position"]["z"].storeChunk(z, {offset}, {numParticleOnTile64}); | ||
} | ||
|
||
// reconstruct x and y from polar coordinates r, theta |
Oopsi, reconstruction re-added in #4686
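For reference, a minimal sketch of what that comment does, assuming the WarpX RZ convention that positions are stored as (r, z) with theta carried as a separate real component: Cartesian x and y are rebuilt for output as r*cos(theta) and r*sin(theta). Plain C++ with illustrative names, not the actual WarpX openPMD plot code.

```cpp
// Sketch only: reconstruct Cartesian x, y from cylindrical (r, theta) for output.
#include <cmath>
#include <cstddef>
#include <vector>

void reconstruct_xy (std::vector<double> const& r,
                     std::vector<double> const& theta,
                     std::vector<double>& x,
                     std::vector<double>& y)
{
    x.resize(r.size());
    y.resize(r.size());
    for (std::size_t ip = 0; ip < r.size(); ++ip) {
        x[ip] = r[ip] * std::cos(theta[ip]);   // x = r cos(theta)
        y[ip] = r[ip] * std::sin(theta[ip]);   // y = r sin(theta)
    }
}
```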
@@ -1084,7 +1083,7 @@ PhysicalParticleContainer::AddPlasma (PlasmaInjector const& plasma_injector, int
const int max_new_particles = Scan::ExclusiveSum(counts.size(), counts.data(), offset.data());

// Update NextID to include particles created in this function
Long pid;
int pid;
Long!
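The reasoning behind "Long!": particle ids are global counters that can exceed the 2^31-1 range of `int` in large runs, so `pid` needs to stay `amrex::Long` (64-bit). A self-contained toy (not WarpX code) showing what the narrowing does:

```cpp
// Toy demonstration of the narrowing problem; not WarpX code.
#include <cstdint>
#include <iostream>

int main ()
{
    // a plausible next particle id after ~3e9 particles have been created
    std::int64_t const next_id = 3'000'000'000LL;

    // storing it in an int wraps around (well past INT_MAX = 2147483647)
    int const truncated = static_cast<int>(next_id);

    std::cout << "64-bit id: " << next_id << "\n"
              << "as int:    " << truncated << "\n";   // prints a negative value
    return 0;
}
```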
auto& p = pp[ip];
p.id() = pid+ip;
p.cpu() = cpuid;
auto const new_id = ip + old_size;
Long!
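Same point here: the id arithmetic should be carried out in 64 bits before being written through the id wrapper. A hedged sketch with illustrative names (`idcpu` stands for the tile's packed id/cpu array, `first_id` for the reserved starting id):

```cpp
// Sketch only: 64-bit id arithmetic for the pure-SoA assignment; names illustrative.
#include <AMReX_Particle.H>

#include <cstdint>

void stamp_ids (std::uint64_t* idcpu, amrex::Long first_id, int cpuid, amrex::Long np)
{
    for (amrex::Long ip = 0; ip < np; ++ip) {
        // keep the sum in amrex::Long: first_id + ip may not fit in an int
        amrex::ParticleIDWrapper{idcpu[ip]}  = first_id + ip;
        amrex::ParticleCPUWrapper{idcpu[ip]} = cpuid;
    }
}
```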
Transition to new, purely SoA particle containers.
This was originally merged in #3850 and reverted in #4652, since we discovered issues losing particles & laser particles on GPU.
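For readers new to the terminology, a conceptual sketch (plain C++, not the actual AMReX classes) of what the pure-SoA transition changes: instead of an array of per-particle structs, every component, including the packed id/cpu word, lives in its own contiguous array, which is what makes memory access coalesced on GPU.

```cpp
// Conceptual layouts only; the real containers are AMReX templates.
#include <cstdint>
#include <vector>

// Legacy layout: one struct per particle (array of structs, AoS).
struct ParticleAoS {
    double x, y, z;
    std::uint64_t idcpu;
};
using LegacyTile = std::vector<ParticleAoS>;

// Pure SoA layout: one contiguous array per component.
struct TileSoA {
    std::vector<double> x, y, z;
    std::vector<std::uint64_t> idcpu;   // packed id + owning rank per particle
};
```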
#4654
ParticleIDWrapper::make_invalid() AMReX-Codes/amrex#3735

Fun Mini-Benchmarks on CPU, DP
Hardware: 12th Gen Intel(R) Core(TM) i9-12900H
cpu_legacy.txt, cpu_soa.txt
Overall speed: the difference is within the noise level of repeated runs (as expected).
A few noteworthy details in the top 10 functions by exclusive runtime:
- ParticleContainer::RedistributeCPU: 8% slower 👀 -> ParticleContainer::RedistributeCPU for Pure SoA AMReX-Codes/amrex#3744
- WarpXParticleContainer::ApplyBoundaryConditions: 5% faster
- WarpX::OneStep_nosub: 2% slower 👀
: 2% slower 👀Fun Mini-Benchmarks on A100 GPU, DP
Hardware: Perlmutter (NERSC) A100 GPU
Overall speed: 1.4% faster
A few noteworthy details in the top 10 functions by exclusive runtime:
- GatherAndPush: 1.2% faster
- Redistribute_partition: 4% faster
- AddPlasma: 2.6% faster
- ApplyBoundaryConditions: 1% faster
- SortParticlesForDeposition: 231% faster 🚀 🚀 ✨
- PermutationForDeposition: 3% faster
- InitData: 15% faster 🚀