Software Coordinating Committee
FNAL Workshop
November 7-8, 2008

Recorder: C. DeTar

Participants: Brower, DeTar, Oktay, Holmgren, Efstathiadis, Babich,
Clark, Scholz, Witzel, Jung, Osborn, Mawhinney, Mackenzie, Simone,
Fowler, Dreher, DiPierro, Joo, Edwards, Luciano Piccoli,
Jim Kowalkowski, Xian-He Sun, Nirmal Seenu, Amitoj Singh

=====================================================================
** Action items
======================================================================
Friday, November 7
======================================================================

Brower:
  A parallel agenda for today is preparation for the upcoming SciDAC-II review in January.
  We will be updating our progress report as well.
  ** I will send out last year's report so we can work from there.

----------------------------------------
MILC SciDAC integration

DeTar:
  The MILC code supports all the Asqtad and gauge force QOP modules needed for HMC and the Wilson inverter.
  Development plans for QOP include completing the clover inverter, adding the EigCG inverter, adding an improved gauge force module, and in the long term adding a multigrid solver and improved heavy quark inverter.
  QIO
    The partfile mode of writing from I/O partitions consisting of four cores each can give substantial improvement in I/O times.
Mawhinney: Having concurrency control would help.
Chulwoo: We should also consider a reorganization of some of the QMP modules so it is clear which ones belong with which architecture.

----------------------------------------
SciDAC QMP, QLA, QDP update

Osborn:
  BG/P utilization
    Soon USQCD will reach the 300 M core-hour mark
    We may be asked to write something for ASCR news
    I also need some BG/P stories for SC08.
  QMP updates
    Set allocated machine on BG/P
    Also -qmp-alloc-map to permute axes
    Balint added TopoMgr API to get personality
    Balint moved CVS to SVN
    Proposals
      Chulwoo - change base address of sends
      Pochinsky - nonblocking collectives
      Idea - one-sided communications (e.g. BG/P SPI)
      Pavlos - split single MPI into multiple independent QMP
  QLA updates
    Parallel makes
    OpenMP "parallel for" loops
    Proposals
      Merge BG/L branch with main branch
      Improve OMP support
      Performance tuning QA0/BAGEL?
      Mike: 'Q' type (long double) support
      Mike: Allow variable sized N color objects (currently max is 5 without recompiling)
      Mike: more vector-type operations
      Mike: rectangular matrices
      Mike:
  QDP:
  QOP proposals:
    Testing eigCG - works well for Wilson, not staggered (too chiral)
    BLAS option - double precision works better here - 20% over QLA on BG/P
    Eigensolvers - Lanczos and Kalkreuter/Simma
    Multigrid preconditioners
    CUDA inverters
  qinstall: parallel makes, easier customization

----------------------------------------
Threading

Osborn:
  Threading tests:
    Within QLA
      Uses OpenMP pragmas
      For large problem size, threading cuts performance by 25% because the compiler devectorizes the loops
      Tried splitting the loop and calling a separate subroutine; now the asymptotic performance is the same as not using OpenMP.
    Within QDP
      Not much different from QLA
    Within QOP
      Perhaps
    Starting threads at application level: the performance is the same.
Osborn: It would be good to avoid barriers as much as possible
Edwards: In QMT we removed barriers, but then copied the parameters to a private stack to preserve them.
Osborn: We may need to support a variety of options at first and let the user choose what works best.
Balint: Agreed

----------------------------------------
CPS SciDAC integration

Mawhinney:
  BG/Q
    200 TFlops/rack (all I can say about it)
  CPS for BG/Q:
    Threading will be an issue.
    We will have an aggressively threaded Dslash
    We are concerned about non-Dslash parts of the code.  If they aren't threaded, they will become a more significant cost.
    Perhaps CPS will switch to a threaded QDP, QLA
    It is conceivable that we can gain substantially by this approach.
DeTar: Will the QCDOC-2 be distinct from a commercial BG/Q?
Mawhinney: To a first approximation, they will be the same.
Fowler: NVidia is talking about having a general purpose core.

----------------------------------------
CPS SciDAC integration

Jung:
  To put CPS on the BG/P, we wrote an interface to QMP
  Multicore experience on the BG/P
    Tried a scheme similar to James's and saw no performance gain.
Osborn: IBM's OpenMP should be fairly efficient -- it depends on how you use it.
Joo: To put QMT on the BG/P requires some rewriting because there is some assembly coding
Jung:
  Also, I wrote a version of QMP that adds QMP_change_address to allow sending of different data of the same size and destination.
  [ We agreed this should be made part of the head version of QMP ]
  For QIO I have a concurrency control version of QIO for partfiles.
  ** I will check it in as a branch.
  Also having a barrel shift version of QIO would be good.
  It would be good to update QMP.
Mawhinney: Can someone update the source tree?
DeTar: I think anyone making substantial changes to a code should be doing some housekeeping as well.

----------------------------------------
Workflow

Piccoli:
  Uses Ruote.
  We are testing the system with a lattice generation project
  Current projects:
    Adding the ability to recover from a failed step.
    Integrate with the cluster reliability system
Holmgren: We are looking for volunteers to test the workflow.
Brower: Could this system be moved to Argonne?
Simone: Yes, and we are thinking of moving it to other FNAL projects.

----------------------------------------
Cluster Reliability

Seenu: We are currently monitoring health, user and system processes.
Mawhinney: Are you gaining some understanding of what information is most useful, so the databases don't overwhelm?
Seenu: Yes
Singh: We are rolling over the databases and compressing them in an archive.

----------------------------------------
Performance Tools

Fowler:
  Current work: Database, on-node performance analysis, multicore
  Future: Workflow, Blue Waters
  We have been using LQCD codes.
  Chroma expression templates have driven development of performance tools.
  Multicore Dslash - working with Balint
Mawhinney: Can you distinguish performance on separate threads?
Fowler: Not yet.  We have a proposal to do this.
DeTar: Can we run these tools on Ranger and BG/P?
Fowler: The compute node kernel on the BG/P doesn't support it.  The issue is access to appropriate hardware counters.  As for Ranger, I can find out.

----------------------------------------
JLab/RENCI Collaboration

Joo:
  Threads
    Using RENCI performance tools to help code development.
    Threading Dslash
    QDP++ threading is in progress (Xu Guo at EPCC)
    Still needs double precision SSE and collectives
  Temporal preconditioning
    Combining with red-black preconditioning helps for a large anisotropy.
    Paper to appear soon.
    This seems to be cost effective.
  Faster Dslash
    Some techniques helped and some not.
    Could get a 2X improvement.

----------------------------------------
Future clusters

Holmgren: By June 2009 JPsi will be a > 8 TF cluster

======================================================================
Saturday, November 8
======================================================================
Visualization

DiPierro:
  QCD code + vtk + isee
  also interface with analysis code
  Workflow takes a lattice file, converts to a scalar field, then renders.
  Future: Web-based interface?
Dreher:
  Use Paraview to render data
    Gives 3D shading
    Shows side-by-side
  RENCI has a surround multiscreen visualization room (Social Vis Room)
Mawhinney: These tools are probably mainly useful for algorithm tuning.  For most physics questions, we are looking at quantum noise.  As for algorithms, when we encounter a large force in the molecular dynamics, it might be interesting to take a snapshot, diagnose the cause and possibly suggest remedies.
Mawhinney: Showed another way to render the topo charge density.

----------------------------------------
GPUs

Babich:
  100s of cores
  10,000 threads active
  much more general purpose components now
  high memory bandwidth, but large latency
  threads scheduled according to availability of data
  TF/card, but bandwidth limited
  local "cache" is addressable
  double precision is full IEEE
  native single may not be fully IEEE?
    The only choice for G80
  Language is CUDA "compute unified device architecture"
    C with extensions, essentially
  Architecture is multithread oriented
  16K bytes in shared memory per G80
  4 GB on-board memory per processor
  Code generator needed for Dslash to unroll loops, etc.
  Implemented a Wilson Dslash, Inverters (CG, BiCGstab)
  Get ~100 GF single precision plus emulated DP for global sums
  Need memory saving tricks, such as 3rd row reconstruction and temporal gauge fixing to drop most of the time-links
Brower: Very cost-effective.
Babich: Future: Wilson clover.
Gottlieb: staggered.
Reliability: No ECC memory and parity on buses.  May need redundancy.
Further ahead: OpenCL may be a replacement for CUDA

----------------------------------------
Multigrid

Clark:
  16^3 x 32
    time to solution is about 4x better than BiCGstab
    somewhat better on this size than deflation
  QCDMG
    slow
    QLA for linear algebra, but no optimization of MG specific operations
    Extend QLA? QA0 Bagel
    Almost ready for production use.
  Future:
    Add support for clover.
    Do staggered
    Chiral fermions
    Port to CUDA
  SciDAC modifications that could help:
    QDP support for multiple lattices
    QDP support for arbitrary colors
    Add QCDMG to standard code base
    Multiple precision could help in preconditioning - 16-bit + 32-bit + 64-bit
==========================================================
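
Illustration of "3rd row reconstruction" (GPU discussion above)

A minimal sketch of the memory saving trick named in the GPU notes, assuming each gauge link is an SU(3) matrix stored row by row. For such a matrix the third row equals the complex conjugate of the cross product of the first two rows, so only two rows (12 of 18 real numbers) need to be kept in memory. The function names reconstruct_third_row, matmul and example_su3 are hypothetical and are not taken from any SciDAC library or from the code presented at the workshop; Python is used only for brevity.

```python
import cmath
import math


def reconstruct_third_row(a, b):
    """Return conj(a x b), the third row of an SU(3) matrix whose
    first two rows are the complex 3-vectors a and b."""
    return [
        (a[1] * b[2] - a[2] * b[1]).conjugate(),
        (a[2] * b[0] - a[0] * b[2]).conjugate(),
        (a[0] * b[1] - a[1] * b[0]).conjugate(),
    ]


def matmul(x, y):
    """Plain 3x3 matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def example_su3():
    """Hypothetical test link: a product of three simple SU(3) factors."""
    # Diagonal phases that sum to zero, so the determinant is 1.
    d = [[cmath.exp(0.3j), 0, 0],
         [0, cmath.exp(-1.1j), 0],
         [0, 0, cmath.exp(0.8j)]]
    # Real rotation mixing rows 0 and 1.
    c, s = math.cos(0.7), math.sin(0.7)
    r01 = [[c, s, 0], [-s, c, 0], [0, 0, 1]]
    # SU(2) block mixing rows 1 and 2.
    c, s = math.cos(0.4), math.sin(0.4)
    r12 = [[1, 0, 0], [0, c, 1j * s], [0, 1j * s, c]]
    return matmul(matmul(d, r01), r12)


if __name__ == "__main__":
    u = example_su3()
    stored = u[:2]  # keep only the first two rows
    third = reconstruct_third_row(*stored)
    err = max(abs(x - y) for x, y in zip(third, u[2]))
    print("max reconstruction error:", err)
```

The trade-off is a few extra multiplications per link in exchange for a one-third reduction in gauge field memory traffic, which matters when a kernel is bandwidth limited as described in the notes.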