Multi-threads versus multi-process

We compare a threaded and a multi-process implementation of a QCD code. SZIN implemented both methods in the late 1990s, so the comparison draws on practical experience. Jie Chen has a currently functioning implementation of QMT. This document does not try to follow the API closely, but it is close in spirit.

I) Definitions

a) Multi-threaded:
   A parallel code communicating among boxes, each with multiple threads in a master/slave model. The master is responsible for the main code flow and for all communications and IO. Threads are used mainly for computations. The master dispatches functions to the threads, which sit waiting to start. The master may itself function as a thread.

b) Multi-process:
   A parallel code communicating among boxes, each with multiple processes running in a shared memory model. The primary process on a box is responsible for all communications and IO. All processes work concurrently.

//--------------------------------------------------------------------------------
II) Assumptions:

a) Multi-threaded:
   The lattice object lives in the memory space of the master. The pieces of the lattice assigned to each thread are divided lexicographically. During QDP-like ops, a thread can read from the part of a source lattice that belongs to another thread. However, it cannot write to the part of the lattice that belongs to another thread. This single-address object might be problematic on Opterons.

b) Multi-process:
   The lattice object lives in a shared memory space known to all processes, e.g., each process has a pointer to the same physical memory address. The pieces of the lattice assigned to each process are divided lexicographically. During QDP-like ops, a process can read from the part of a source lattice that belongs to another process. However, it cannot write to the part of the lattice that belongs to another process. This single-address object might be problematic on Opterons.
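The lexicographic division assumed above can be sketched as follows. This is a minimal illustration, not part of QMT, QMP, or QDP: the names `SiteRange` and `site_range` are hypothetical, and the floor-based split is an assumption chosen to match the `start`/`end` arithmetic used in the multi-process execution model.

```cpp
#include <cassert>
#include <cstddef>

// Half-open range [lo, hi) of lexicographic site indices owned by one worker
// (a thread or a process). Hypothetical type, for illustration only.
struct SiteRange {
  std::size_t lo;  // first site owned by the worker
  std::size_t hi;  // one past the last site owned by the worker
};

// Hypothetical helper: split subgrid_volume sites among num_workers workers.
// Assumption: worker w owns [w*V/N, (w+1)*V/N) with integer (floor) division,
// so consecutive ranges touch but never overlap, and together cover [0, V).
inline SiteRange site_range(std::size_t worker,
                            std::size_t num_workers,
                            std::size_t subgrid_volume)
{
  assert(num_workers > 0 && worker < num_workers);
  SiteRange r;
  r.lo = worker * subgrid_volume / num_workers;
  r.hi = (worker + 1) * subgrid_volume / num_workers;
  return r;
}
```

With this convention the number of sites owned by a worker is `hi - lo`. A dispatch interface that takes an inclusive ending site number would be handed `hi - 1` instead of `hi`.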
//--------------------------------------------------------------------------------
III) Typical execution model (in QDP/C)

//--------------------------------------------------------------------------------
a) Multi-threaded:

// Assumption here: QMT API function dispatch has form
typedef void (*qmt_userfunc_t) (unsigned int lo,   // starting (site) number
                                unsigned int hi,   // ending (site) number
                                int thread_id,     // yes, thread number
                                void *ptr);        // pointer to user data

// Thread: A QLA add function
void QLA_add_fermion(QLA_ferm* dest, QLA_ferm* src1, QLA_ferm* src2, int num);

// Thread: Argument struct
struct Arg_add_ferm
{
  QLA_ferm* dest;
  QLA_ferm* src1;
  QLA_ferm* src2;
};

// Thread: Wrapper for that qla function
// start and end are inclusive site numbers, hence the count end-start+1
void QLA_add_fermion_X(unsigned int start, unsigned int end, int thread_no, void *arg_t)
{
  Arg_add_ferm* arg = (Arg_add_ferm*)arg_t;
  QLA_add_fermion(arg->dest+start, arg->src1+start, arg->src2+start, end-start+1);
}

// Thread: QDP add
void QDP_add_fermion(QDP_ferm& dest, QDP_ferm& src1, QDP_ferm& src2)
{
  // Create arg structure to hand down into wrapper
  Arg_add_ferm arg;
  arg.dest = dest.;   // raw site-data pointer; accessor name not specified here
  arg.src1 = src1.;
  arg.src2 = src2.;

  QMT_exec(QLA_add_fermion_X, QDP_subgrid_volume(), &arg);  // launch threads, does sync
}

// Thread: A QLA function for innerproduct
void QLA_innerproduct(QLA_complex* dest, QLA_ferm* src1, QLA_ferm* src2, int num);

// Thread: Argument struct
struct Arg_inner
{
  QLA_complex* dest;
  QLA_ferm* src1;
  QLA_ferm* src2;
};

// Thread: Wrapper for that qla function
// Each thread writes its partial result into its own slot dest[thread_no]
void QLA_innerproduct_X(unsigned int start, unsigned int end, int thread_no, void *arg_t)
{
  Arg_inner* arg = (Arg_inner*)arg_t;
  QLA_innerproduct(arg->dest+thread_no, arg->src1+start, arg->src2+start, end-start+1);
}

// Thread: QDP innerproduct
void QDP_innerproduct(QLA_complex& dest, QDP_ferm& src1, QDP_ferm& src2)
{
  // Create arg structure to hand down into wrapper
  // NOTE: malloc done by master, threads get address by arg structure
  Arg_inner arg;
  arg.dest = (QLA_complex*)malloc(QMT_num_threads()*sizeof(QLA_complex));
  arg.src1 = src1.;
  arg.src2 = src2.;

  // Do local
  // (within thread) inner product
  QMT_exec(QLA_innerproduct_X, QDP_subgrid_volume(), &arg);  // launch threads, does sync

  // Have to add all the individual thread contributions
  dest = arg.dest[0];
  for(int i=1; i < QMT_num_threads(); ++i)
    dest += arg.dest[i];

  free(arg.dest);

  QMP_global_sum_array(dest,2);  // Now global sum the result
}

// Thread: Main program for multi-threads
int main()
{
  QMP_init();
  QMT_init();

  // IO for parameters
  if (QMP_primary_node())
  {
    fscanf();
  }
  QMP_broadcast();  // parameter now known to all nodes

  // Do QDP add of two fermions
  QDP_ferm dest_ferm, src_ferm_1, src_ferm_2;
  QDP_add_fermion(dest_ferm, src_ferm_1, src_ferm_2);

  // Do QDP innerproduct
  QLA_complex dest_scalar;
  QDP_innerproduct(dest_scalar, src_ferm_1, src_ferm_2);

  // Write some binary
  // NOTE: no changes needed within QIO!
  // Outside of QIO, need a conventional parallel writer
  // No change here from unthreaded case
  // NERSC
  FILE* fp;
  if (QMP_primary_node())
  {
    fp = fopen("my_nersc_file", "wb");
  }

  for(int site=0; site < QDP_volume(); ++site)
  {
    if (QMP_primary_node())
    {
      QMP_receive();   // receive the site from the node that owns it
      fwrite(fp, );
    }
    else if (QDP_node(site) == QMP_get_node_number())
    {
      QMP_send();      // owning node sends the site to the primary node
    }
  }

  QMT_finalize();
  QMP_finalize();
}

//--------------------------------------------------------------------------------
b) Multi-process:

// Process: A QLA add function
void QLA_add_fermion(QLA_ferm* dest, QLA_ferm* src1, QLA_ferm* src2, int num);

// Process: QDP add
void QDP_add_fermion(QDP_ferm& dest, QDP_ferm& src1, QDP_ferm& src2)
{
  // These starting/ending site numbers could be precomputed.
  // end is one past the last site of this process, so the count is end-start.
  int start = (getProcessNumber())*QDP_subgrid_volume() / totalNumProcess();
  int end = (getProcessNumber() + 1)*QDP_subgrid_volume() / totalNumProcess();

  QLA_add_fermion(dest+start, src1+start, src2+start, end-start);
}

// Process: A QLA function
void QLA_innerproduct(QLA_complex* dest, QLA_ferm* src1, QLA_ferm* src2, int num);

// Process: QDP innerproduct
void QDP_innerproduct(QLA_complex& dest, QDP_ferm& src1, QDP_ferm& src2)
{
  // These
  // starting/ending site numbers could be precomputed.
  // end is one past the last site of this process, so the count is end-start.
  int start = (getProcessNumber())*QDP_subgrid_volume() / totalNumProcess();
  int end = (getProcessNumber() + 1)*QDP_subgrid_volume() / totalNumProcess();

  smp_sync();  // sync all processes. NOTE: could be avoided if malloc below blocks

  // Problematic!! A shared memory space must be allocated and all the processes
  // on a node must have this same pointer. This could possibly be pre-allocated.
  // However, there is a malloc_shared_memory call for every lattice object.
  //
  QLA_complex* sum_tmp = (QLA_complex*)malloc_shared_memory(totalNumProcess()*sizeof(QLA_complex));

  // Do local (within process) inner product
  QLA_innerproduct(sum_tmp+getProcessNumber(), src1+start, src2+start, end-start);

  smp_sync();  // sync all processes.

  // Sum the individual contributions.
  // NOTE: because the sum is the same on all processes, the result is
  // bit-wise identical. Thus, no broadcast is needed at this stage.
  dest = sum_tmp[0];
  for(int i=1; i < totalNumProcess(); ++i)
    dest += sum_tmp[i];

  // Sync all processes. If a different temp used below, this could be avoided.
  smp_sync();

  // Global sum, but only on primary process
  if (QDP_primary_process() == true)
  {
    QMP_global_sum_array(dest,2);

    // Will broadcast back out to processes.
    // Can reuse shared memory
    for(int i=0; i < totalNumProcess(); ++i)
      sum_tmp[i] = dest;
  }

  // Do the broadcast
  smp_sync();
  dest = sum_tmp[getProcessNumber()];
  smp_sync();

  free_shared_memory(sum_tmp);
}

// Process: Main program for multi-process
int main()
{
  QMP_init();
  SMP_init();

  //
  // IO for parameters
  //
  float variable;
  if (QMP_primary_node() && QDP_primary_process())
  {
    fscanf(variable);
  }
  smp_sync();  // not needed

  if (QDP_primary_process())
  {
    QMP_broadcast(variable);  // parameter now known to all nodes on primary process
  }
  smp_sync();  // still need some kind of sync/barrier in multi-process

  // Need to broadcast between processes
  // Allocate or acquire a process global shared memory
  float* tmp_variable = (float*)malloc_shared_memory(totalNumProcess()*sizeof(float));
  smp_sync();
  if (QDP_primary_process())
  {
    for(int i=0; i < totalNumProcess(); ++i)
      tmp_variable[i] = variable;
  }
  smp_sync();
  variable = tmp_variable[getProcessNumber()];
  smp_sync();
  free_shared_memory(tmp_variable);

  //
  // Do QDP add of two fermions
  //
  QDP_ferm dest_ferm, src_ferm_1, src_ferm_2;
  QDP_add_fermion(dest_ferm, src_ferm_1, src_ferm_2);

  //
  // Do QDP innerproduct
  //
  QLA_complex dest;
  QDP_ferm src1, src2;
  QDP_innerproduct(dest, src1, src2);

  // Write some binary
  // Within QIO, there would have to be changes!
  // Example of changes here for NERSC format
  FILE* fp;
  if (QMP_primary_node() && QDP_primary_process())
  {
    fp = fopen("my_nersc_file", "wb");
  }

  // NOTE: in multi-process, the pointers to lattice objects are shared
  // with all processes. E.g., each process has the SAME pointer to the same
  // shared address space. So, no communications needed between processes
  // here.
  smp_sync();
  for(int site=0; site < QDP_volume(); ++site)
  {
    if (QMP_primary_node() && QDP_primary_process())
    {
      QMP_receive();   // receive the site from the node that owns it
      fwrite();
    }
    else if (QDP_node(site) == QMP_get_node_number() && QDP_primary_process())
    {
      QMP_send();      // primary process of the owning node sends the site
    }
  }
  smp_sync();

  SMP_finalize();
  QMP_finalize();
}

//--------------------------------------------------------------------------------
IV) Other topics not brought up.

a) Random number generation: how do you share a common seed?

b) Indexing: say you want to grab a lattice site value or poke into a lattice site. This is awkward under multi-process.
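
To illustrate why indexing is awkward under multi-process, here is a hedged sketch of a site "poke" that respects the rule that a process may only write to its own part of the lattice. Everything here is hypothetical: `owner_of_site` and `poke_site` are not QDP or QMP functions, a `std::vector` stands in for the shared lattice, and the ownership formula assumes the floor-based lexicographic split where process p owns sites [p*V/N, (p+1)*V/N).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: which worker owns lexicographic site `site`?
// Assumption: worker p owns [p*V/N, (p+1)*V/N) with integer division.
// The owner is the largest p with p*V/N <= site, which works out to
// floor(((site+1)*N - 1) / V).
inline std::size_t owner_of_site(std::size_t site,
                                 std::size_t num_workers,
                                 std::size_t subgrid_volume)
{
  assert(num_workers > 0 && site < subgrid_volume);
  return ((site + 1) * num_workers - 1) / subgrid_volume;
}

// Hypothetical poke: every worker calls this with the same arguments, but only
// the owner of the site performs the write. Returns true if this worker wrote.
// The vector stands in for the shared lattice; its size is the subgrid volume.
template <typename T>
bool poke_site(std::vector<T>& lattice,
               std::size_t site,
               const T& value,
               std::size_t my_worker,
               std::size_t num_workers)
{
  if (owner_of_site(site, num_workers, lattice.size()) != my_worker)
    return false;          // not ours: writing here would break the ownership rule
  lattice[site] = value;   // owner writes into its own part of the lattice
  return true;
}
```

In a real multi-process setting the poke would presumably need to be followed by a barrier (such as the `smp_sync()` used above) before any other process reads the site. A "peek" needs no ownership test, since reads across process boundaries are allowed.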