Recently I started working with numerical computation and solving mathematical problems numerically, programing in C++ with OpenMP. But now my problem is to big and take days to solve even parallelized. So, I’m thinking in start learning CUDA to reduce the time, but I have some doubts.

The heart of my code is the following function. The entries are two pointes to vectors. `N_mesh_points_x,y,z`

are integers pre-defined, `weights_x,y,z`

are column matrices, `kern_1`

is an exponential function and `table_kernel`

is a function who access a 50 Gb matrix stored in RAM and pre calculated.

```
void Kernel::paralel_iterate(std::vector<double>* K1, std::vector<double>* K2 )
{
double r, sum_1 = 0 , sum_2 = 0;
double phir;
for (int l = 0; l < N_mesh_points_x; l++){
for (int m = 0; m < N_mesh_points_y; m++){
for (int p = 0; p < N_mesh_points_z; p++){
sum_1 = 0;
sum_2 = 0;
#pragma omp parallel for schedule(dynamic) private(phir) reduction(+: sum_1,sum_2)
for (int i = 0; i < N_mesh_points_x; i++){
for (int j = 0; j < N_mesh_points_y; j++){
for (int k = 0; k < N_mesh_points_z; k++){
if (!(i==l) || !(j==m) || !(k==p)){
phir = weights_x[i]*weights_y[j]*weights_z[k]*kern_1(i,j,k,l,m,p);
sum_1 += phir * (*K1)[position(i,j,k)];
sum_2 += phir;
}
}
}
}
(*K2)[ position(l,m,p)] = sum_1 + (table_kernel[position(l,m,p)] - sum_2) * (*K1)[position (l,m,p)];
}
}
}
return;
}
```

My questions are:

- Can I program, at least the central part of this function, in CUDA? I only parallelized with OpenMP the internals loops because was giving the wrong answer when I parallelized all the loops.
- The function
`table_kernel`

who access a big matrix, the matrix is to big to be stored in the memory of my video card, so the file will stay in RAM. This is a problem? The CUDA can access easily the files in RAM? Or this can’t be done and all the files needed to be stored inside video card?

Source: Windows Questions C++