Runge Kutta Fehlbergソルバーでエラーが発生するのはなぜですか？

私の現在のプロジェクトは、C ++での何千ものODEのソリューションを含むタンパク質フォールディングモデルの再プログラミングです。完全にGPUで実行するソルバーを作成しているので、私はいくつかの停止と開始を進めてきました。ようやく統合しましたが、固定ステップサイズh = .001のRKF45アルゴリズムの5次解を使用してdC / dt = -Cを解こうとすると、e ^ {-tの計算値から決定が得られます}は、10 ^ {-12}程度のグローバルエラーが予想される場合、10 ^ {-4}程度です。

これが発生しているのは、適応ステップサイズと4次と5次の両方のソリューションを使用したエラー制御を使用していないためですか？私の考えは、適応アルゴリズムは4次と5次の違いを推測し、基本的には5次を「正しい答え」として扱うため、その答えを使用してインテグレーターを試してみるというものでした。

編集：

それ以来

グローバルエラー=（ポイント数）*（ローカルエラー）

そして

ローカルエラー= h ^ O

ここで、hはステップサイズ、Oは次数です、私は計算します

O = ln（グローバルエラー/ポイント数）/ ln h

私が正しいと思う方法と回答で提案された方法の両方で次数分析を行うと、次の結果が得られます。

Number of Points    h       Global Error    My Way      The Answer's Way
10                  0.1     2.89E-06        6.539E+00   
20                  0.05    7.09E-08        6.495E+00   5.350E+00
40                  0.025   1.76E-07        5.216E+00   -1.315E+00
50                  0.02    1.128E-07       5.089E+00   2.003E+00
80                  0.0125  4.401E-08       4.866E+00   2.002E+00
100                 0.01    2.816E-08       4.775E+00   2.001E+00
500                 0.002   1.126E-09       4.316E+00   2.000E+00
1000                0.001   2.810E-10       4.184E+00   2.003E+00
10000               0.0001  3.000E-12       3.881E+00   1.972E+00

それで、昨夜の私の疑似突破にもかかわらず、実際にプログラミングエラーがあったにもかかわらず、ソルバーの順序はまだオフです。.001ステップサイズに関連する前述のエラーは、O（10 ^ -4）からO（10 ^ -10）に劇的に減少しました。

二次編集

LN（E）-LN（a）とLN（h）のグラフを次に示します。私は線形適合を投げたので、それが私に順序を明らかにすることを期待します。ここで本当に混乱しているのは、メソッドの順序に3つの計算があることです。グラフは〜3と教えてくれ、Godricの方法はほぼ正確に2と教えてくれ、私の方法はステップサイズを増やすために6-> 4から変化することを示しています。グローバルエラーが非常に低く（〜10 ^ -10）、そのステップサイズ（10 ^ -15 * 10 ^ 3 = 10 ^ -12）で私が期待するもの（10 ^ -15 * 10 ^ -12）に最も近いため、私の計算を信じる傾向があります。 ^ -3）。

ここに画像の説明を入力してください

編集3：

私のコード。愚痴を許してください。

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <iostream>
#include <iomanip>                      //display 2 decimal places
#include <math.h>
using namespace std;

/*
Things to do log:
1. Decide how adaptive step sizes is handled by the integrator loop.  We can no longer say how many points to integrate if that is to be determined in the kernel
itself, only what the initial step size is.
2. Figure out how I'm going to graph all this stuff.
3. On paper solve for the step size.
4. Adaptive Step Size
    a. arrMin kernel to get min of array
    b. Decide how to implement tolerances
5. Use multiples of 256 for maxlength 

Important things accomplished:
1. Can create 2D array for storing concentrations over time.
2. Can store a 1D array in 2D array.
3. Can calculate all the k-arrays given static step size.
4. Can take a step given static step size.
5. Can step through an integration from t0 to tf
6. Can accomodate fixed time-step
7. Calculated the current order of integrator using logs.
8. Learned how to profile the program.
9. Fixed the GPU timeout problem with WDDM.

Ideas:
1. Just use large fixed size array
*/
__global__ void rkf5(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, int*, int*, size_t, double*, double*, double*);
__global__ void calcK(double*, double*, double*);
__global__ void k1(double*, double*, double*);
__global__ void k2(double*, double*, double*);
__global__ void k3(double*, double*, double*);
__global__ void k4(double*, double*, double*);
__global__ void k5(double*, double*, double*);
__global__ void k6(double*, double*, double*);
__global__ void arrAdd(double*, double*, double*);
__global__ void arrSub(double*, double*, double*);
__global__ void arrMult(double*, double*, double*);
__global__ void arrInit(double*, double);
__global__ void arrCopy(double*, double*);
__device__ void setup(double , double*, double*, double*, double*, int*);
__device__ double flux(int, double*) ;
__device__ double knowles_flux(int, double*);
__device__ void calcStepSize(double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, double*, int*);
__global__ void storeConcs(double*, size_t, double*, int);
__global__ void takeFourthOrderStep(double*, double*, double*, double*, double*, double*, double*);
__global__ void takeFifthOrderStep(double*, double*, double*, double*, double*, double*, double*, double*);

//Error checking that I don't understand yet.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

//Main program.
int main(int argc, char** argv)
{
    //std::cout << std::fixed;          //display 2 decimal places
    //std::cout << std::setprecision(12);   //display 2 decimal places
    const int maxlength = 1;            //Number of discrete concentrations we are tracking.
    double concs[maxlength];            //Meant to store the current concentrations 
    double temp1[maxlength];                //Used as a bin to store products of Butcher's tableau and k values.
    double temp2[maxlength];                //Used as a bin to store products of Butcher's tableau and k values.
    double tempsum[maxlength];          //Used as a bin to store cumulative sum of tableau and k values
    double k1s[maxlength];
    double k2s[maxlength];
    double k3s[maxlength];
    double k4s[maxlength];
    double k5s[maxlength];
    double k6s[maxlength];
    const int numpoints = 40;       
    double to = 0;
    double tf = 1;
    //double dt = static_cast<double>(.5)/static_cast<double>(64);
    double dt = (tf-to)/static_cast<double>(numpoints);
    double mo = 1;
    double concStorage[maxlength][numpoints];   //Stores concs vs. time                     

    //Initialize all the arrays on the host to ensure arrays of 0's are sent to the device.
    //Also, here is where we can seed the system.
    std::cout<<dt;
    std::cout<<"\n";
    concs[0]=mo;
    std::cout<<concs[0];
    std::cout<<" ";
    for (int i=0; i<maxlength; i++)
    {
        for (int j=0; j<numpoints; j++)
            concStorage[i][j]=0;
        concs[i]=0;
        temp1[i]=0;
        temp2[i]=0;
        tempsum[i]=0;
        k1s[i]=0;
        k2s[i]=0;
        k3s[i]=0;
        k4s[i]=0;
        k5s[i]=0;
        k6s[i]=0;
        std::cout<<concs[i];
        std::cout<<" ";
    }
    concs[0]=mo;
    std::cout<<"\n";

    //Define all the pointers to device array memory addresses. These contain the on-GPU
    //addresses of all the data we're generating/using.
    double *d_concs;
    double *d_temp1;
    double *d_temp2;
    double *d_tempsum;
    double *d_k1s;
    double *d_k2s;
    double *d_k3s;
    double *d_k4s;
    double *d_k5s;
    double *d_k6s;
    double *d_dt;
    int *d_maxlength;
    int *d_numpoints;
    double *d_to;
    double *d_tf;
    double *d_concStorage;

    //Calculate all the sizes of the arrays in order to allocate the proper amount of memory on the GPU.
    size_t size_concs = sizeof(concs);
    size_t size_temp1 = sizeof(temp1);
    size_t size_temp2 = sizeof(temp2);
    size_t size_tempsum = sizeof(tempsum);
    size_t size_ks = sizeof(k1s);
    size_t size_maxlength = sizeof(maxlength);
    size_t size_numpoints = sizeof(numpoints);
    size_t size_dt = sizeof(dt);
    size_t size_to = sizeof(to);
    size_t size_tf = sizeof(tf);
    size_t h_pitch = numpoints*sizeof(double);
    size_t d_pitch;

    //Calculate the "pitch" of the 2D array.  The pitch is basically the length of a 2D array's row.  IT's larger 
    //than the actual row full of data due to hadware issues.  We thusly will use the pitch instead of the data 
    //size to traverse the array.
    gpuErrchk(cudaMallocPitch( (void**)&d_concStorage, &d_pitch, numpoints * sizeof(double), maxlength)); 

    //Allocate memory on the GPU for all the arrrays we're going to use in the integrator.
    gpuErrchk(cudaMalloc((void**)&d_concs, size_concs));
    gpuErrchk(cudaMalloc((void**)&d_temp1, size_temp1));
    gpuErrchk(cudaMalloc((void**)&d_temp2, size_temp1));
    gpuErrchk(cudaMalloc((void**)&d_tempsum, size_tempsum));
    gpuErrchk(cudaMalloc((void**)&d_k1s, size_ks));
    gpuErrchk(cudaMalloc((void**)&d_k2s, size_ks));
    gpuErrchk(cudaMalloc((void**)&d_k3s, size_ks));
    gpuErrchk(cudaMalloc((void**)&d_k4s, size_ks));
    gpuErrchk(cudaMalloc((void**)&d_k5s, size_ks));
    gpuErrchk(cudaMalloc((void**)&d_k6s, size_ks));
    gpuErrchk(cudaMalloc((void**)&d_maxlength, size_maxlength));
    gpuErrchk(cudaMalloc((void**)&d_numpoints, size_numpoints));
    gpuErrchk(cudaMalloc((void**)&d_dt, size_dt));
    gpuErrchk(cudaMalloc((void**)&d_to, size_to));
    gpuErrchk(cudaMalloc((void**)&d_tf, size_tf));

    //Copy all initial values of arrays to GPU.
    gpuErrchk(cudaMemcpy2D(d_concStorage, d_pitch, concStorage, h_pitch, numpoints*sizeof(double), maxlength, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_concs, &concs, size_concs, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_temp1, &temp1, size_temp1, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_temp2, &temp2, size_temp2, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_tempsum, &tempsum, size_tempsum, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_k1s, &k1s, size_ks, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_k2s, &k2s, size_ks, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_k3s, &k3s, size_ks, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_k4s, &k4s, size_ks, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_k5s, &k5s, size_ks, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_k6s, &k6s, size_ks, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_maxlength, &maxlength, size_maxlength, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_numpoints, &numpoints, size_numpoints, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_dt, &dt, size_dt, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_to, &to, size_to, cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_tf, &tf, size_tf, cudaMemcpyHostToDevice));

    //Run the integrator.
    rkf5<<<1,1>>>(d_concs, d_concStorage, d_temp1, d_temp2, d_tempsum, d_k1s, d_k2s, d_k3s, d_k4s, d_k5s, d_k6s, d_maxlength, d_numpoints, d_pitch, d_dt, d_to, d_tf);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );
    cudaDeviceSynchronize();

    //Copy concentrations from GPU to Host.  Almost defunct now that transferring the 2D array works.
    cudaMemcpy(concs, d_concs, size_concs, cudaMemcpyDeviceToHost);
    //Copy 2D array of concentrations vs. time from GPU to Host.
    gpuErrchk( cudaMemcpy2D(concStorage, h_pitch, d_concStorage, d_pitch, numpoints*sizeof(double), maxlength, cudaMemcpyDeviceToHost) );   

    //Print concentrations after the integrator kernel runs.  Used to test that data was transferring to and from GPU correctly.
    std::cout << "\n";
    for (int i=0; i<maxlength; i++)
    {
        std::cout<<concs[i];
        std::cout<<" ";
    }

    double a[10];
    double b[10];
    double c[10];
    for(int i = 0; i< 10; i++)
    {
        a[i]=0;
        b[i]=0;
        c[i]=0;
    }



    //Print out the concStorage array after the kernel runs.  Used to test that the 2D array transferred correctly from host to GPU and back.
    std::cout << "\n\n";
    std::cout << "Calculated Array";
    std::cout << "\n\n";
    for (int i=0; i<maxlength; i++)
    {
        for(int j=0; j<numpoints; j++)
        {
            if (j%(numpoints/10)==0)
            {
                a[j/(numpoints/10)]=concStorage[i][j];
                std::cout<<concStorage[i][j];
                std::cout<<"   ";
            }
        }
        std::cout << "\n";
    }
    std::cout << "\n";
    std::cout << "Exponential";
    std::cout << "\n\n";
    for (int i=0; i<10; i++)
    {
        b[i]=exp(-i*(tf-to)/10);
        std::cout<<exp(-i*(tf-to)/10);
        std::cout<<"   ";
    }
    std::cout << "\n\n";
    std::cout << "Error Array";
    std::cout << "\n\n";
    for (int i=0; i<10; i++)
    {
        c[i]=a[i]-b[i];
        std::cout<<c[i];
        std::cout<<"   ";
    }
    std::cout << "\n\n";
    cudaDeviceReset();  //Clean up all memory.

    return 0;
}
//Main kernel.  This is mean to be run as a master thread that calls all the other functions and thusly "runs" the integrator.
__global__ void rkf5(double* concs, double* concStorage, double* temp1, double* temp2, double* tempsum, double* k1s, double* k2s, double* k3s, double* k4s, double* k5s, double* k6s, int* maxlength, int* numpoints, size_t pitch, double* dt, double* to, double* tf)
{
    /*
    axy variables represent the coefficients in the Butcher's tableau where x represents the order of the step and the y value corresponds to the ky value 
    the coefficient gets multiplied by.  Have to cast them all as doubles, or the ratios evaluate as integers.
    e.g. a21 -> a21 * k1
    e.g. a31 -> a31 * k1 + a32 * k2
    */
    double a21 = static_cast<double>(.25);

    double a31 = static_cast<double>(3)/static_cast<double>(32);
    double a32 = static_cast<double>(9)/static_cast<double>(32);

    double a41 = static_cast<double>(1932)/static_cast<double>(2197);
    double a42 = static_cast<double>(-7200)/static_cast<double>(2197);
    double a43 = static_cast<double>(7296)/static_cast<double>(2197);

    double a51 = static_cast<double>(439)/static_cast<double>(216);
    double a52 = static_cast<double>(-8);
    double a53 = static_cast<double>(3680)/static_cast<double>(513);
    double a54 = static_cast<double>(-845)/static_cast<double>(4104);

    double a61 = static_cast<double>(-8)/static_cast<double>(27);
    double a62 = static_cast<double>(2);
    double a63 = static_cast<double>(-3544)/static_cast<double>(2565);
    double a64 = static_cast<double>(1859)/static_cast<double>(4104);
    double a65 = static_cast<double>(-11)/static_cast<double>(40);

    //for loop that integrates over the specified number of points. Actually, might have to make it a do-while loop for adaptive step sizes 
    for(int k = 0; k < *numpoints; k++)
    {
        if (k!=0)
        {
            arrCopy<<< 1, *maxlength >>>(concs, tempsum);
            cudaDeviceSynchronize();
        }
        arrInit<<< 1, *maxlength >>>(tempsum, 0);
        cudaDeviceSynchronize();
        arrInit<<< 1, *maxlength >>>(temp1, 0);
        cudaDeviceSynchronize();
        arrInit<<< 1, *maxlength >>>(temp2, 0);
        cudaDeviceSynchronize();

        calcK<<< 1, *maxlength >>>(concs, k1s, dt);         //k1 = dt * flux (concs)
        cudaDeviceSynchronize(); //Sync here because kernel continues onto next line before k1 finished

        setup(a21, temp1, tempsum, k1s, concs, maxlength);  //tempsum = a21*k1
        arrAdd<<< 1, *maxlength >>>(concs, tempsum, tempsum);   //tempsum = concs + a21*k1    
        cudaDeviceSynchronize();

        calcK<<< 1, *maxlength >>>(tempsum, k2s, dt);       //k2 = dt * flux (concs + a21*k1)
        cudaDeviceSynchronize();

        arrInit<<< 1, *maxlength >>>(tempsum, 0);
        cudaDeviceSynchronize();
        setup(a31, temp1, tempsum, k1s, concs, maxlength);  //temp1sum = a31*k1
        setup(a32, temp1, tempsum, k2s, concs, maxlength);  //tempsum = a31*k1 + a32*k2
        arrAdd<<< 1, *maxlength >>>(concs, tempsum, tempsum);   //tempsum = concs + a31*k1 + a32*k2
        cudaDeviceSynchronize();

        calcK<<< 1, *maxlength >>>(tempsum, k3s, dt);           //k3 = dt * flux (concs + a31*k1 + a32*k2)
        cudaDeviceSynchronize();

        arrInit<<< 1, *maxlength >>>(tempsum, 0);
        cudaDeviceSynchronize();
        setup(a41, temp1, tempsum, k1s, concs, maxlength);  //tempsum = a41*k1
        setup(a42, temp1, tempsum, k2s, concs, maxlength);  //tempsum = a41*k1 + a42*k2
        setup(a43, temp1, tempsum, k3s, concs, maxlength);  //tempsum = a41*k1 + a42*k2 + a43*k3
        arrAdd<<< 1, *maxlength >>>(concs, tempsum, tempsum);   //tempsum = concs + a41*k1 + a42*k2 + a43*k3
        cudaDeviceSynchronize();

        calcK<<< 1, *maxlength >>>(tempsum, k4s, dt);           //k4 = dt * flux (concs + a41*k1 + a42*k2 + a43*k3)
        cudaDeviceSynchronize();

        arrInit<<< 1, *maxlength >>>(tempsum, 0);
        cudaDeviceSynchronize();
        setup(a51, temp1, tempsum, k1s, concs, maxlength);  //tempsum = a51*k1
        setup(a52, temp1, tempsum, k2s, concs, maxlength);  //tempsum = a51*k1 + a52*k2
        setup(a53, temp1, tempsum, k3s, concs, maxlength);  //tempsum = a51*k1 + a52*k2 + a53*k3
        setup(a54, temp1, tempsum, k4s, concs, maxlength);  //tempsum = a51*k1 + a52*k2 + a53*k3 + a54*k4
        arrAdd<<< 1, *maxlength >>>(concs, tempsum, tempsum);   //tempsum = concs + a51*k1 + a52*k2 + a53*k3 + a54*k4
        cudaDeviceSynchronize();

        calcK<<< 1, *maxlength >>>(tempsum, k5s, dt);           //k5 = dt * flux (concs + a51*k1 + a52*k2 + a53*k3 + a54*k4)
        cudaDeviceSynchronize();

        arrInit<<< 1, *maxlength >>>(tempsum, 0);
        cudaDeviceSynchronize();
        setup(a61, temp1, tempsum, k1s, concs, maxlength);  //tempsum = a61*k1
        setup(a62, temp1, tempsum, k2s, concs, maxlength);  //tempsum = a61*k1 + a62*k2
        setup(a63, temp1, tempsum, k3s, concs, maxlength);  //tempsum = a61*k1 + a62*k2 + a63*k3
        setup(a64, temp1, tempsum, k4s, concs, maxlength);  //tempsum = a61*k1 + a62*k2 + a63*k3 + a64*k4
        setup(a65, temp1, tempsum, **k4s**, concs, maxlength);  //tempsum = a61*k1 + a62*k2 + a63*k3 + a64*k4 + a65*k5
        arrAdd<<< 1, *maxlength >>>(concs, tempsum, tempsum);   //tempsum = concs + a61*k1 + a62*k2 + a63*k3 + a64*k4 + a65*k5
        cudaDeviceSynchronize();

        calcK<<< 1, *maxlength >>>(tempsum, k6s, dt);           //k6 = dt * flux (concs + a61*k1 + a62*k2 + a63*k3 + a64*k4 + a65*k5)
        cudaDeviceSynchronize();

        //At this point, temp1 and tempsum are maxlength dimension arrays that are able to be used for other things.

        //Calculate acceptable step size before storing the concentrations.
        calcStepSize(temp1, temp2, tempsum, concs, k1s, k2s, k3s, k4s, k5s, k6s, dt, maxlength);    //temp1 = 4th Order guess, tempsum = 5th Order guess
        cudaDeviceSynchronize();

        //Store the initial conditions in the first column of the storage array.
        if (k==0)
        {
            storeConcs<<< 1, *maxlength >>>(concStorage, pitch, concs, k);  //Store this step's concentrations in 2D array
            cudaDeviceSynchronize();
        }
        //Store future concentration in next column of storage array.
        storeConcs<<< 1, *maxlength >>>(concStorage, pitch, tempsum, k+1);  //Store this step's concentrations in 2D array
        cudaDeviceSynchronize();
    }
}
//calcStepSize will take in an error tolerance, the current concentrations and the k values and calculate the resulting step size according to the following equation
//e[n+1]=y4[n+1] - y5[n+1]
__device__ void calcStepSize(double* temp1, double*temp2, double* tempsum, double* concs, double* k1s, double* k2s, double* k3s, double* k4s, double* k5s, double* k6s, double* dt, int* maxlength)
{
    //do
    //{
        takeFourthOrderStep<<< 1, *maxlength >>>(temp1, concs, k1s, k2s, k3s, k4s, k5s);            //Store 4th order guess in temp1
        takeFifthOrderStep<<< 1, *maxlength >>>(tempsum, concs, k1s, k2s, k3s, k4s, k5s, k6s);  //Store 5th order guess in tempsum
        cudaDeviceSynchronize();
        arrSub<<< 1, *maxlength >>>(temp1, tempsum, temp2)
        arrMin<<< 1, *maxlength >>>
    //arrMult
    //}
    //while
}
//takeFourthOrderStep is going to overwrite the old temp1 array with the new array of concentrations that result from a 4th order step.  This kernel is meant to be launched 
// with as many threads as there are discrete concentrations to be tracked.
__global__ void takeFourthOrderStep(double* y4, double* concs, double* k1s, double* k2s, double* k3s, double* k4s, double* k5s)
{
    double b41 = static_cast<double>(25)/static_cast<double>(216);
    double b42 = static_cast<double>(0);
    double b43 = static_cast<double>(1408)/static_cast<double>(2565);
    double b44 = static_cast<double>(2197)/static_cast<double>(4104);
    double b45 = static_cast<double>(-1)/static_cast<double>(5);
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    y4[idx] = concs[idx] + b41 * k1s[idx] + b42 * k2s[idx] + b43 * k3s[idx] + b44 * k4s[idx] + b45 * k5s[idx];
}
//takeFifthOrderStep is going to overwrite the old array of concentrations with the new array of concentrations.  As of now, this will be the 5th order step.  Another function can be d
//defined that will take a fourth order step if that is interesting for any reason.  This kernel is meant to be launched with as many threads as there are discrete concentrations
//to be tracked.
//Store b values in register? Constants?
__global__ void takeFifthOrderStep(double* y5, double* concs, double* k1s, double* k2s, double* k3s, double* k4s, double* k5s, double* k6s)
{
    double b51 = static_cast<double>(16)/static_cast<double>(135);
    double b52 = static_cast<double>(0);
    double b53 = static_cast<double>(6656)/static_cast<double>(12825);
    double b54 = static_cast<double>(28561)/static_cast<double>(56430);
    double b55 = static_cast<double>(-9)/static_cast<double>(50);
    double b56 = static_cast<double>(2)/static_cast<double>(55);
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    y5[idx] = concs[idx] + b51 * k1s[idx] + b52 * k2s[idx] + b53 * k3s[idx] + b54 * k4s[idx] + b55 * k5s[idx] + b56 * k6s[idx];
}
//storeConcs takes the current array of concentrations and stores it in the cId'th column of the 2D concStorage array
//pitch = memory size of a row
__global__ void storeConcs(double* cS, size_t pitch, double* concs, int cId)
{
    int tIdx = threadIdx.x;
    //cS is basically the memory address of the first element of the flattened (1D) 2D array.
    double* row = (double*)((char*)cS + tIdx * pitch);
    row[cId] = concs[tIdx];
}
//Perhaps I can optimize by using shared memory to hold conc values.
__global__ void calcK(double* concs, double* ks, double* dt)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    ks[idx]=(*dt)*flux(idx, concs);
}
//Adds two arrays (a + b) element by element and stores the result in array c.
__global__ void arrAdd(double* a, double* b, double* c)
{                                                 
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx]=a[idx]+b[idx];
}
//Subtracts two arrays (a - b) element by element and stores the result in array c.
__global__ void arrSub(double* a, double* b, double* c)
{                                                 
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx]=a[idx]-b[idx];
}
//Multiplies two arrays (a * b) element by element and stores the result in array c.
__global__ void arrMult(double* a, double* b, double* c)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    c[idx]=a[idx]*b[idx];
}
//Will find the min of errors array.
__global__ void arrMin(double* errors)
{
    //extern _shared_ double[7];
}
//Initializes an array a to double value b.
__global__ void arrInit(double* a, double b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx]=b;
}
//Copies array b onto array a.
__global__ void arrCopy(double* a, double* b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    a[idx]=b[idx];
}
//Placeholder function for the flux calculation.  It will take the size of the oligomer and current concentrations as inputs.
__device__ double flux(int r, double *concs) 
{
    return -concs[r];
}
//This function multiplies a tableau value by the corresponding k array and adds the result to tempsum.  Used to
//add all the a*k terms.
//e.g. setup(a21, temp1, tempsum, k1s, concs, maxlength) => tempsum = a21 * k1
__device__ void setup(double tableauValue, double *temp1, double *tempsum, double *ks, double *concs, int *maxlength) 
{
    //Sets tempsum to tabVal * k
    arrInit<<< 1, *maxlength >>>(temp1, tableauValue);      //Set [temp1] to tableau value, temp1 = a
    cudaDeviceSynchronize();
    arrMult<<< 1, *maxlength >>>(ks, temp1, temp1);         //Multiply tableau value by appropriate [k], temp1 = a*k
    cudaDeviceSynchronize();
    arrAdd<<< 1, *maxlength >>>(tempsum, temp1, tempsum);   //Move tabVal*k to [tempsum], tempsum = tempsum+temp1
    cudaDeviceSynchronize();
    //temp1 = tableauValue * kArray
    //tempsum = current sum (tableauValue * kArray)
}

最終編集 私は自分のエラーを見つけました。K6を計算するタイプミスでした。「k5s」であるはずの問題のある「k4s」用語の周りにアスタリスクを付けました。CUDA RKF45インテグレーターを必要とする人は誰でも大歓迎です。現在機能しており、正しい順序で動作するためです。

Number of Points    h       Global Error    Calculated Order
        10          0.1         3.59E-09    
        20          0.05        1.08E-10    5.06E+00
        40          0.025       3.31E-12    5.03E+00
        80          0.0125      1.03E-13    5.01E+00
        160         0.00625     2.20E-15    5.55E+00
        320         0.003125    1.20E-15    8.74E-01
        640         0.0015625   1.40E-15    -2.22E-01
        1280        0.00078125  2.00E-16    2.81E+00

— スリザリンの髪
ソース

このメソッドを実装するライブラリを使用していない理由はありますか？

— Geoff Oxberry 2013年

C ++、CUDA、およびODEソルバーを初めて使用する場合、このタスクは非常に難しいようです。Parallel Programming ToolボックスとNVidiaグラフィックカードを使用してMatLabにアクセスできる場合は、MatLabでこのコードのプロトタイプを作成することをお勧めします。最近はCUDA GPU（http://www.mathworks.com/discovery/matlab-gpu.html）（およびOpenMPとMPI）をサポートしています。コードがMatLabで実行されたら、コードをC ++に移植するか、MeXを使用できます。ちょっとした考え。

— 2013年

odeintライブラリ（Boostの一部）をThrustインターフェースで使用できます。独自のコードを導入する場合は、製造されたソリューションの方法で実装をテストする必要があります。

— Geoff Oxberry 2013年

E (h) = a * h^{n}

$E(h) = a * h^{n}$

\log E = \log a + n \log h

$\log{E} = \log{a} + n \log{h}$

@KarstenChuポイントが取れました。私はPython / Cコンボを好みます。私は心を開いておくのが好きです。最近、MatLabで何ができるかを確認するには、スウェーデンのストックホルムにあるPDCクラスターからこのリンクをチェックしてください。スライドと多数のサンプルMATLABスクリプトがあります。それらには、openmp、MPI、cudaが含まれます。

— 2013年

エラー許容値をに設定していると想定しています1e-12。アダプティブスキームが現在のステップサイズを受け入れる場合、5次スキームがすべての意図と目的で「正しい」答えであると想定していることは正しいです。ただし、これは現在のステップを受け入れる場合のみです。4次と5次のステップの差が大きすぎると、どちらも十分に受け入れられません。このような単一のアルゴリズムをテストするには、複数の時間ステップで実行し、収束の順序を確認してそれを確認する必要があります。

一方で、次のステップ 0.001テスト問題で5次スキームを使用（私は1Dであると想定しています）は、エラーをほぼ機械的イプシロンに駆動するのに十分以上のはずです。私の推測では、インテグレータを1次に減らすタイプミスがどこかにあると思います。1段落で述べたように、複数のタイムステップで実行し、収束順序を確認します。これにより、表示されたエラーが妥当かどうか、またはインテグレーターのどこかにエラーがあるかどうかがわかります。

編集

odeソルバーの収束次数を計算するには、2つの異なるタイムステップについて、既知の解を持つ問題で実行します。 $h$ $\frac{h}{2}$ 。グローバルエラーは次のようになります。

$\frac{E(h)}{E(\frac{h}{2})} = 2^n$

$n=5$

— ゴドリック・シーア
ソース

現時点では、適応ステップアルゴリズムがプログラムされていないため、おそらくエラーの原因となっています。ただし、固定サイズの場合、ソルバーのおおよその次数をそのまま計算するために、対数を少しおもしろく計算して、約2.1を取得しています。5次推定値と許容値の差を「グローバルエラー」として使用して、ステップサイズ.001、.01、および.1に対してO = ln（GlobalError / NumberOfPoints）/ ln（StepSize）を実行しました。私のコードにエラーがあることを示唆していますか？

— スリザリンの髪2013年

注文の計算方法については、私の編集を参照してください。

— Godric Seer 2013年

データとそれに関連するエラーを数回実行し、それらを方程式O = LN（E（h1）/ E（h2））/ LN（h1 / h2）にプラグインするチャートを追加しました。それが私が「あなたの注文」としてチャートに置いたものです

— スリザリンの髪

テーブルの最後の2つのデータポイントを確認します。ステップサイズを10分の1に減らし、エラーは約100分の1に減少しました。この方法は間違いなく2次です。あなたのコードを見なければ、係数を再確認する以外に何を提案すべきかわかりません。

— Godric Seer 2013年

係数はすべて正しいです。私が見なかったのは、さまざまなKを計算するためにdtを変更していたところです（つまり、f（t + a dt、x + a K））。方程式（テストおよび実数）は自律的であるように見えるため、それは問題になりません。

— Godric Seer 2013