Simply fortran cuda support

9/13/2023

But how do I use peer-to-peer communication? Should I make any changes to my program in order to use several GPUs without arranging OpenMP communication?

As with asynchronous data copies, there isn’t a natural way in the Fortran syntax to allow for this, so you’ll need to use the CUDA API directly. While I hadn’t used this feature until now, I put together an example Vector Add program. It first enables peer-to-peer communication, copies two arrays to device 0, copies the arrays from device 0 to device 1, performs the vector add on device 1, and copies the data back to the host. We haven’t added “cudaMemcpyDefault”, “cudaDeviceCanAccessPeer”, “cudaDeviceEnablePeerAccess”, and “cudaDeviceDisablePeerAccess” to the “cudafor” module yet, so you need to add the interfaces yourself for now. I believe I’m correct, but there could be bugs in my code, so caveat emptor. Hopefully it will steer you in the right direction.

Since we are using standard Fortran, we will need to write the computation on the GPU using CUDA C. When interfacing C and Fortran, it is important to remember that while arguments in C are passed by value, in Fortran they are passed by reference.
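To illustrate the pass-by-reference point above, here is a minimal sketch in plain C of a routine written to be called from Fortran. The name vector_add_ is hypothetical, and the trailing underscore assumes the name-mangling convention that many (not all) Fortran compilers use for external procedures. The key detail is that every parameter, including the scalar length n, arrives as a pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical routine intended to be called from Fortran as
 *     call vector_add(a, b, c, n)
 * Fortran passes arguments by reference by default, so even the
 * scalar n is received as a pointer on the C side.
 * The trailing underscore is an assumed name-mangling convention. */
void vector_add_(const double *a, const double *b, double *c, const int *n)
{
    /* All arguments are addresses supplied by the caller */
    assert(a != NULL && b != NULL && c != NULL && n != NULL);

    /* Dereference n to get the element count */
    for (int i = 0; i < *n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

A C caller would have to mimic Fortran and pass the address of the length, for example vector_add_(a, b, c, &n). The same rule applies to any C wrapper around a CUDA call: arguments that the CUDA function expects by value must be dereferenced inside the wrapper before being passed on.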

If you are not familiar with the zero-copy feature in CUDA C, it allows compute kernels to share host system memory and provides direct access to host system memory when running on many newer CUDA-enabled graphics processors. The basic idea is to use the original CUDA C functions to allocate host arrays that are page-locked (aka pinned) and have the right attributes to be used by the zero-copy feature of CUDA.

To declare the mapped arrays, we will need to perform the following steps:

Set the device flag for mapping host memory: this is achieved with a call to cudaSetDeviceFlags with the flag cudaDeviceMapHost.
Allocate the host mapped arrays: this is achieved with cudaHostAlloc with the flag cudaHostAllocMapped.
Get the device pointers to the mapped memory: this is achieved with calls to cudaHostGetDevicePointer. These are the pointers that we will pass to the CUDA kernels.

Since we are using a standard Fortran 90 compiler, we can't use the built-in allocator (it has no knowledge of pinned memory). We need to do a couple of extra steps: call the CUDA allocator in C, and then pass the C pointer to Fortran using the function C_F_Pointer provided by the ISO C bindings.

A is the input array, C is the output array from the GPU computation. Since we want to use the zero-copy feature on these two, we will allocate them with cudaHostAlloc. B is an array that we will use to compute a reference solution on the CPU. We will use the standard Fortran allocator for this one.

! N, err, cptr_A and cptr_C (of type c_ptr) and the interfaces to the CUDA calls are assumed to be declared elsewhere
Integer, parameter :: fp_kind = kind(0.0d0) ! Double precision
Real(fp_kind), pointer, dimension(:) :: A, C
Real(fp_kind), allocatable, dimension(:) :: B

! Allocate A and C using cudaHostAlloc and then map the C pointers to Fortran arrays
err = cudaSetDeviceFlags(cudaDeviceMapHost)
If (err > 0) print *, "Error in setting cudaSetDeviceFlags =", err
! sizeof(1.0_fp_kind) is the size in bytes of one element (sizeof(fp_kind) would measure the integer kind parameter instead)
err = cudaHostAlloc(cptr_A, N*sizeof(1.0_fp_kind), cudaHostAllocMapped)
If (err > 0) print *, "Error in allocating A with cudaHostAlloc =", err
call c_f_pointer(cptr_A, A, (/ N /))
err = cudaHostAlloc(cptr_C, N*sizeof(1.0_fp_kind), cudaHostAllocMapped)
If (err > 0) print *, "Error in allocating C with cudaHostAlloc =", err
call c_f_pointer(cptr_C, C, (/ N /))
! From this point on, we can use A and C as normal Fortran arrays

! Allocate B using standard allocate call
allocate(B(N))
! computing the reference solution on the CPU
