But how do I use the peer-to-peer communication? Should I make any changes in my program in order to use several GPUs without arranging OpenMP communication?

Like asynchronous data copies, there isn't a natural way in the Fortran syntax to allow for this, so you'll need to use the CUDA API directly. While I hadn't used this feature until now, I put together an example Vector Add program. It first enables the peer-to-peer communication, copies two arrays to device 0, copies the arrays from device 0 to device 1, performs the vector add on device 1, and copies the data back to the host. We haven't added "cudaMemcpyDefault", "cudaDeviceCanAccessPeer", "cudaDeviceEnablePeerAccess", and "cudaDeviceDisablePeerAccess" to the "cudafor" module yet, so you need to add the interfaces for now. I believe I'm correct, but there could be bugs in my code, so caveat emptor. Hopefully it will steer you in the right direction.

Since we are using standard Fortran, we will need to write the computation on the GPU using CUDA C. When interfacing C and Fortran, it is important to remember that while arguments in C are passed by value, in Fortran they are passed by reference.

Since we are using a standard Fortran 90 compiler, we can't use the built-in allocator (it has no knowledge of pinned memory). We need to do a couple of extra steps: call the CUDA allocator in C, and then pass the C pointer to Fortran using the function C_F_Pointer provided by the ISO C bindings.

A is the input array and C is the output array from the GPU computation. Since we want to use the zero-copy features on these two, we will allocate them with cudaHostAlloc. B is an array that we will use to compute a reference solution on the CPU. We will use the standard Fortran allocator for this one.

The relevant declarations and allocation calls (an excerpt) are:

    integer, parameter :: fp_kind = kind(0.0d0) ! Double precision
    real(fp_kind), pointer, dimension(:) :: A, C
    real(fp_kind), allocatable, dimension(:) :: B

    ! Allocate A and C using cudaHostAlloc and then map the C pointer to Fortran arrays
    err = cudaSetDeviceFlags(cudaDeviceMapHost)
    if (err > 0) print *, "Error in setting cudaSetDeviceFlags=", err
    err = cudaHostAlloc(cptr_A, N*sizeof(fp_kind), cudaHostAllocMapped)
    if (err > 0) print *, "Error in allocating A with cuda HostAlloc =", err
    err = cudaHostAlloc(cptr_C, N*sizeof(fp_kind), cudaHostAllocMapped)
    if (err > 0) print *, "Error in allocating C with cuda HostAlloc =", err
    ! From this point on, we can use A and C as normal Fortran array
    ! Allocate B using standard allocate call
    ! computing the reference solution on the CPU

The GPU code needs device-side pointers to the mapped host arrays A and C. These are obtained with calls to cudaHostGetDevicePointer, and they are the pointers that we will pass to the CUDA kernels.
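To make the pass-by-reference point concrete, here is a minimal sketch of what the C side of such an interface can look like. It is not the code from this post: `host_alloc_stub` and `vadd_ref` are hypothetical names, and the stub uses `calloc` as a stand-in for `cudaHostAlloc` so that it compiles without the CUDA toolkit. It only mirrors the argument shape of `cudaHostAlloc` (a `void **` for the returned pointer, then the size and flags by value).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a pinned-memory allocator (assumption: not a CUDA
   call). Same argument shape as cudaHostAlloc(void **, size_t, unsigned int):
   the caller hands over the address of its pointer, and the allocator fills
   it in. Returns 0 on success, nonzero on failure. */
int host_alloc_stub(void **ptr, size_t nbytes, unsigned int flags)
{
    (void)flags;               /* ignored by the stand-in */
    assert(ptr != NULL);
    *ptr = calloc(1, nbytes);  /* zero-initialised ordinary host memory */
    return (*ptr == NULL) ? 1 : 0;
}

/* Hypothetical Fortran-callable vector add. Every argument is a pointer,
   including the scalar length n, because a Fortran caller that passes its
   arguments by reference hands over addresses, not values. */
void vadd_ref(const double *a, const double *b, double *c, const int *n)
{
    assert(n != NULL && *n >= 0);
    for (int i = 0; i < *n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

In a Fortran BIND(C) interface, a dummy argument declared without the VALUE attribute arrives in C as a pointer, which is why `n` is a `const int *` here. Arguments declared with VALUE correspond to plain by-value C parameters such as `nbytes` and `flags`.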