Software Coordinating Committee Conference Call
April 12, 2007 1:00 PM EDT
Recorder: C. DeTar

Present: DeTar, Levkova, Khoriaty, Holmgren, Simone, Pochinski, Fowler, Alan Porterfield, Efstathiadis, Brower, Mawhinney, Gottlieb, Basak, Zhang, Osborn, Edwards, Joo, Jung, Scholz,

Absent (??): Clark, Renner, Watson,

=====================================================================

** Action items

0. Brower: Heads up USQCD review in May:

The USQCD project is being reviewed by the DOE at JLab on May 14-15, 2007. While the SciDAC software component is not the major focus of the review, it is still important. I will give a short report on SciDAC software and Kostas will report on user experience. For this meeting I would like to collect up-to-date information --- particularly performance & code readiness for the Cray XT4 and the BlueGene/L, workflow plans and threaded software requirements.

1. Finalize Propagator File Format --- see appended note from Carleton, also posted at http://super.bu.edu/~brower/ccs as usual (Carleton, Osborn, Bob M, et al)

Brower: I would suggest working off line and bringing this up next week.
** DeTar: OK

2. Multi-threading/Multi-core plans: (Robert, Jie, Andrew, Rob et al)

Edwards: We should not be waiting for information about BG/Q and BG/P before starting in on multithreading. We already have Intel and AMD processors that could benefit from it.
Pochinski: The Cray will have multithreaded Linux support for multicore.
Joo: Next fall. There will be Linux support on the back-end nodes. There is OpenMP support. They will need proper support for memory traffic, etc.
Brower: I agree we should start now.
Joo: For the Cray we could implement QMP in Cray shmem or Cray portals. We are just discussing this at JLab.
Brower: Keep us informed if you implement it and test it.
Pochinski: We need to identify the issues first.
Brower: Please see Robert and Balint's document:
http://super.bu.edu/~brower/doc/thread_vs_multiproc.txt
Edwards: This document lays out the general issues. Multiprocess is really not any more difficult than multithread. With multithread, we assumed one process is the master and handles all the I/O.
Pochinski: On BG/L one core is the reader, one the writer.
Edwards: I suppose that could also work.
Pochinski: On the Opteron, different memory has different latency.
Holmgren: We would have to have a strategy for doing the mapping that is independent of the implementation. There one specific process should do the I/O, since it is closest to the PCI bus.
Pochinski: Will the strategy be scalable to 8 cores or higher?
Holmgren: We might be seeing 16-core machines next year. Intel and AMD will be quad core very soon. A four-processor box is conceivable, so 16 altogether.
Brower: At what point would the master be overwhelmed by message passing, so can't keep up?
Pochinski: That could be an issue. There should be routines that declare how long a core can idle.
Edwards: What is the near-term number of cores we should target? The Intel version of 128 cores is just a proof of concept. We won't see that on the market for some time.
Fowler: Dual socket blades will be the most likely building block. The exercise should identify the control mechanisms that we need to bind cores to processors.
Holmgren: Yes, that is a requirement. It is already possible with Opterons.
Pochinski: Can we assume a single memory space?
DeTar: What is "lexicographic" in your document?
Edwards: Actually, I meant the operations should involve sequential memory references.

[ Edwards led us through the two examples in the document above, one multithread and one multiprocess. ]

My assessment is that multiprocess requires more changes in our packages. Some further multithread strategies: have half the cores handle red sites and the other half black.
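
The red/black strategy mentioned above can be illustrated with a minimal sketch. It is not taken from the thread_vs_multiproc document. It assumes a periodic lattice with even extents in every direction and labels sites with an even coordinate sum as "red" (the labeling is an arbitrary choice made here). The function names are hypothetical. The point it shows is that every nearest neighbor of a red site is black, so one group of cores can update red sites while reading only black ones.

```python
from itertools import product


def parity(site):
    """Return 0 for a 'red' site and 1 for a 'black' site (assumed labeling)."""
    return sum(site) % 2


def split_red_black(dims):
    """Partition all sites of a lattice with the given extents by parity."""
    red, black = [], []
    for site in product(*(range(n) for n in dims)):
        (red if parity(site) == 0 else black).append(site)
    return red, black


def neighbors(site, dims):
    """Yield the nearest neighbors of a site with periodic boundaries."""
    for mu in range(len(dims)):
        for step in (1, -1):
            nbr = list(site)
            nbr[mu] = (nbr[mu] + step) % dims[mu]
            yield tuple(nbr)


if __name__ == "__main__":
    dims = (4, 4, 4, 4)
    red, black = split_red_black(dims)
    print(len(red), len(black))
```

With odd extents the periodic wrap-around would connect sites of the same parity, so the sketch only applies to even lattice sizes.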
Jie has an initial implementation of QMT, slightly different from Andrew's. The master blocks on the QMT exec call. This simplifies the implementation.
Pochinski: That prevents having separate functions on each core.
Edwards: We concluded that making the functions bigger would work as well.
Joo: On the Opteron evaluation machine with four cores involved on one box, I compared 4 threads and 4 MPI (OpenMPI over shared memory.) See my talk at the All Hands Meeting. There was no real difference in performance. I can't say whether there were memory locality issues degrading the multithread, or whether OpenMPI handles shmem extremely well.
Fowler: It would be interesting to run this in an environment with performance counters available.
Joo: Summary from slide

   Global volume     thread vs MPI
   2^4               thread < MPI
   4^4               break even
   4^2 x 8^2         thread slightly > MPI
   10^2 x 8^2        thread 25% gain over MPI
   12^4              thread 8% gain over MPI

Pochinski: Did you try running the same test with only two cores? Just to look at the NUMA feature. Just remove one of the sockets.
Joo: No. This is a very initial run.

libnuma questions

Edwards: How does the malloc work?
Holmgren: libnuma allows you to set the policy for attaching memory lines to a socket. The malloc doesn't actually fix the memory. When you reference it, a TLB miss occurs and the socket gets bound to the page of memory. When you do a free and then re-malloc, you are still stuck with that assignment. So when you malloc, you probably need to touch each page from the appropriate core to get the assignments you want.
Fowler: A lot of this is done at boot time on the Opteron systems I have worked with.
Holmgren: The first Opteron I had allowed such options in BIOS, but subsequent versions had only one option. You want a separate heap for each thread.
Fowler: Zone-based allocators are fairly common.
Joo: Perhaps we need a separate memory allocation module for each core.
Edwards: But we really don't want separate allocators for each core.
How is the lattice addressed?
Holmgren: You are right. A shared memory approach might work better. But we would then need to align fields by page.

Level 3 interaction

Brower: What if we wrote our Level 3 inverters for multithreading and didn't have the same model elsewhere?
Edwards: We don't want to stop and restart QMP to switch between Level 3 and the rest of the code.
Joo: We would have to amortize the startup cost.
Pochinski: On the BG/L you have to choose multiprocess or multithread at startup. You can't switch.
Fowler: Perhaps we should assume a single MPI process per socket, rather than trying to optimize across multiple sockets. That might simplify the implementation. The optimization strategy might involve sharing cache.

Committee conference concluded at 3:00 PM EDT.
Next call Apr 19 at 1:00 PM EDT

======================================================================
============== Propagator Format Note from Carleton ==============

Hi Folks,

From our discussion on Thursday, I believe we were left with three propagator formats for our initial use cases. Let's just look at the record content and the required metadata before we get into byte order. I am not completely happy with the names I have given the file formats, so please suggest better ones if you can.

Note that this proposal does not involve changes to QIO. The higher level codes (QDP, MILC, CPS, QDP++) would need to support them and figure out how to distinguish them. QIO provides enough functionality that the required higher level support should be easy.

Regards,
Carleton

---------------------------------------------------------------------
Record content and order
---------------------------------------------------------------------

1. USQCD_DiracFermion_ScalarSource_12Sink

   Here the source is given by one complex scalar field and the same scalar field is used for each source color and spin.
   Logical record 1: The complex scalar source
      (Do we allow a single-precision source and double precision solutions?)
   Logical records 2 - 13: DiracFermion sink fields
      Solutions for each choice of source color and spin

2. USQCD_DiracFermion_Source_Sink_Pairs

   Here we specify the full source field and its solution field. There can be any number of such pairs.

   Logical record 1: DiracFermion field (source)   Source #1
   Logical record 2: DiracFermion field (sink)     Solution #1
   Logical record 3: DiracFermion field (source)   Source #2
   Logical record 4: DiracFermion field (sink)     Solution #2
   etc.

3. LHPC_DiracPropagator

   Logical record 1: Here a single record contains the full propagator.
      (Should we require a complex scalar field for the source as we did in type 1?)

---------------------------------------------------------------------
Required metadata
---------------------------------------------------------------------

The QIO format has file and logical record XML strings. We would use both.

The file XML string should have the following required fields:

   file type (see name above)
   prevailing precision (32 or 64 -- or we could use F and D)

Then for the individual file formats above we would have

1. USQCD_DiracFermion_ScalarSource_12Sink

   Nothing required for the source

   For the sink field, for safety, we probably need the source spin color even though we adopt a standard order for the records.

2. USQCD_DiracFermion_Source_Sink_Pairs

   In the file XML do we need to specify the number of source/sink pairs expected? QIO has the capability of appending to a file, so if we specify it in the file XML, that number would not be updated.
   Since all the records are DiracFermion fields, it is probably a good idea to distinguish: "sink" or "source"

---------------------------------------------------------------------
Optional metadata
---------------------------------------------------------------------

Beyond these required fields for any of the user XML, we would allow an additional optional tag that nests any collaboration metadata. For example, in the file XML, it would be a good idea to identify the corresponding gauge field file and specify any gauge fixing, etc.
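
To make the metadata proposal concrete, here is a minimal sketch of what the file and record XML strings might look like for the first format. It uses only the Python standard library and does not call QIO. Every tag name (usqcdPropFile, fileType, precision, collaborationInfo, usqcdPropRecord, role, sourceSpin, sourceColor) is hypothetical, since the note does not fix any names. The zero-based indices and the spin-major record order are also assumptions, because the note refers to a standard order without stating it.

```python
import xml.etree.ElementTree as ET

# The three file type names proposed in the note.
FILE_TYPES = (
    "USQCD_DiracFermion_ScalarSource_12Sink",
    "USQCD_DiracFermion_Source_Sink_Pairs",
    "LHPC_DiracPropagator",
)


def file_xml(file_type, precision, collab=None):
    """Build a file XML string with the two required fields.

    All tag names here are hypothetical placeholders.
    """
    if file_type not in FILE_TYPES:
        raise ValueError(f"unknown file type: {file_type}")
    if precision not in (32, 64):
        raise ValueError("precision must be 32 or 64")
    root = ET.Element("usqcdPropFile")
    ET.SubElement(root, "fileType").text = file_type
    ET.SubElement(root, "precision").text = str(precision)
    if collab:
        # Optional tag that nests collaboration-specific metadata.
        info = ET.SubElement(root, "collaborationInfo")
        for key, value in collab.items():
            ET.SubElement(info, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def sink_record_xml(spin, color):
    """Build record XML for one sink field, tagged with its source spin and color."""
    if not (0 <= spin < 4 and 0 <= color < 3):
        raise ValueError("spin must be 0..3 and color 0..2")
    root = ET.Element("usqcdPropRecord")
    ET.SubElement(root, "role").text = "sink"
    ET.SubElement(root, "sourceSpin").text = str(spin)
    ET.SubElement(root, "sourceColor").text = str(color)
    return ET.tostring(root, encoding="unicode")


def sink_records_12():
    """Record XML for logical records 2 - 13, in an assumed spin-major order."""
    return [sink_record_xml(s, c) for s in range(4) for c in range(3)]


if __name__ == "__main__":
    print(file_xml(FILE_TYPES[0], 32, {"gaugeFile": "example.lime"}))
    for record in sink_records_12():
        print(record)
```

Carrying the source spin and color in each record XML lets a reader check the record order instead of trusting it, which is the safety argument made in the note.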