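I have two tab-separated columns of integers. The first is a random integer, the second is an integer identifying the group. The data can be generated by this program (generate_groups.cc):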

#include <cstdlib>
#include <iostream>
#include <ctime>

int main(int argc, char* argv[]) {
  int num_values = atoi(argv[1]);
  int num_groups = atoi(argv[2]);

  int group_size = num_values / num_groups;
  int group = -1;

  std::srand(42);

  for (int i = 0; i < num_values; ++i) {
    if (i % group_size == 0) {
      ++group;
    }
    std::cout << std::rand() << '\t' << group << '\n';
  }

  return 0;
}

I then use a second program (sum_groups.cc) to compute the sums per group.

#include <iostream>
#include <chrono>
#include <vector>

// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
  for (size_t i = 0; i < n; ++i) {
    p_out[p_g[i]] += p_x[i];
  }
}

int main() {
  std::vector<int> values;
  std::vector<int> groups;
  std::vector<int> sums;

  int n_groups = 0;

  // Read in the values and calculate the max number of groups
  while(std::cin) {
    int value, group;
    std::cin >> value >> group;
    values.push_back(value);
    groups.push_back(group);
    if (group > n_groups) {
      n_groups = group;
    }
  }
  sums.resize(n_groups);

  // Time grouped sums
  std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
  for (int i = 0; i < 10; ++i) {
    grouped_sum(values.data(), groups.data(), values.size(), sums.data());
  }
  std::chrono::system_clock::time_point end = std::chrono::system_clock::now();

  std::cout << (end - start).count() << std::endl;

  return 0;
}

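When I run these programs on a dataset of a given size, and then shuffle the order of the rows of the same dataset, the shuffled data computes the sums about 2x or more faster than the ordered data.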

g++ -O3 generate_groups.cc -o generate_groups
g++ -O3 sum_groups.cc -o sum_groups
generate_groups 1000000 100 > groups
shuf groups > groups2
sum_groups < groups
sum_groups < groups2
sum_groups < groups2
sum_groups < groups
20784
8854
8220
21006

I would have expected the original data, which is sorted by group, to have better data locality and be faster, but I observe the opposite behavior. Can anyone hypothesize why?

c++ performance

— Jim

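I don't know, but you are writing to out-of-range elements of the sums vector. If you did the normal thing and passed references to the vectors instead of pointers to their data elements, and then used .at() or a debug-mode operator[] that does bounds checking, you would see it.

— Shawn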
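For readers who want to try that suggestion, here is a minimal sketch of a bounds-checked variant. The names grouped_sum_checked and writes_out_of_range are made up for this illustration and are not part of the question's code; the only library behavior relied on is that std::vector::at() throws std::out_of_range for an invalid index.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical bounds-checked variant of grouped_sum: takes the vectors by
// reference and indexes the output with .at(), which throws
// std::out_of_range instead of silently writing past the end.
void grouped_sum_checked(const std::vector<int>& x,
                         const std::vector<int>& g,
                         std::vector<int>& out) {
  assert(x.size() == g.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    out.at(static_cast<std::size_t>(g[i])) += x[i];
  }
}

// Returns true if summing into a vector shaped like `out` would touch an
// element that does not exist. `out` is taken by value so the caller's
// vector is left untouched.
bool writes_out_of_range(const std::vector<int>& x,
                         const std::vector<int>& g,
                         std::vector<int> out) {
  try {
    grouped_sum_checked(x, g, out);
  } catch (const std::out_of_range&) {
    return true;
  }
  return false;
}
```

With group ids 0..2 and an output vector of size 2 (sized by the largest id, as in the question), writes_out_of_range reports true; with size 3 it reports false.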

Have you verified that the "groups2" file has all your data in it, and that it is all being read in and processed? Is there maybe an EOF character in the middle somewhere?

— 1201ProgramAlarm

Your program has undefined behavior because you never resize sums. Instead of sums.reserve(n_groups); you must call sums.resize(n_groups);, which is what @Shawn was hinting at.

— Eugene
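A small sketch of how the input loop and the sizing could be written so that the sums vector really has one element per group. read_rows is a hypothetical helper, not code from the question. It tests the extraction itself, so no extra row is pushed when the read fails at end of input, and it returns the largest group id plus one.

```cpp
#include <cassert>
#include <istream>
#include <sstream>  // only needed for the usage example below
#include <vector>

// Hypothetical helper: reads "value<TAB>group" rows from `in` and returns
// the number of groups, i.e. the largest group id plus one.
int read_rows(std::istream& in, std::vector<int>& values,
              std::vector<int>& groups) {
  int n_groups = 0;
  int value = 0;
  int group = 0;
  // The loop body only runs when both extractions succeeded.
  while (in >> value >> group) {
    assert(group >= 0);
    values.push_back(value);
    groups.push_back(group);
    if (group >= n_groups) {
      n_groups = group + 1;
    }
  }
  return n_groups;
}
```

Usage would be along the lines of `std::istringstream in("10\t0\n20\t0\n30\t1\n"); int n = read_rows(in, values, groups); std::vector<int> sums(n);`. Constructing the vector with a size (or calling resize) creates the elements; reserve only allocates capacity and leaves size() at 0.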

Note (see e.g. here or here) that a vector of pairs, instead of two vectors (values and groups), behaves as expected.

— Bob__
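The linked examples are not reproduced here, so the following is only a guess at what a "vector of pairs" layout could look like, with each value stored next to its group id. Row and grouped_sum_pairs are names invented for this sketch, and it makes no claim about how this layout performs.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Assumed layout: first = value, second = group id.
using Row = std::pair<int, int>;

// Same read-modify-write loop as grouped_sum, but the value and its group
// id come from one contiguous record instead of two parallel arrays.
void grouped_sum_pairs(const std::vector<Row>& rows, std::vector<int>& out) {
  for (const Row& r : rows) {
    assert(r.second >= 0 &&
           static_cast<std::size_t>(r.second) < out.size());
    out[static_cast<std::size_t>(r.second)] += r.first;
  }
}
```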

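You sorted the data on the values, right? But that also sorts the groups, and that has an impact on the expression p_out[p_g[i]] += p_x[i];. Maybe in the original scrambled order, the groups actually exhibit good clustering with regard to access to the p_out array. Sorting the values might cause a poor group-indexed access pattern to p_out.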

— Kaz

Setup / make it slow

First of all, the program runs in about the same time either way:

sumspeed$ time ./sum_groups < groups_shuffled 
11558358

real    0m0.705s
user    0m0.692s
sys 0m0.013s

sumspeed$ time ./sum_groups < groups_sorted
24986825

real    0m0.722s
user    0m0.711s
sys 0m0.012s

Most of the time is spent in the input loop. But since we are interested in grouped_sum(), let's ignore that.

Changing the benchmark loop from 10 to 1000 iterations, grouped_sum() starts dominating the run time:

sumspeed$ time ./sum_groups < groups_shuffled 
1131838420

real    0m1.828s
user    0m1.811s
sys 0m0.016s

sumspeed$ time ./sum_groups < groups_sorted
2494032110

real    0m3.189s
user    0m3.169s
sys 0m0.016s

perf diff

Now we can use perf to find the hottest spots in our program.

sumspeed$ perf record ./sum_groups < groups_shuffled
1166805982
[ perf record: Woken up 1 times to write data ]
[kernel.kallsyms] with build id 3a2171019937a2070663f3b6419330223bd64e96 not found, continuing without symbols
Warning:
Processed 4636 samples and lost 6.95% samples!

[ perf record: Captured and wrote 0.176 MB perf.data (4314 samples) ]

sumspeed$ perf record ./sum_groups < groups_sorted
2571547832
[ perf record: Woken up 2 times to write data ]
[kernel.kallsyms] with build id 3a2171019937a2070663f3b6419330223bd64e96 not found, continuing without symbols
[ perf record: Captured and wrote 0.420 MB perf.data (10775 samples) ]

And the difference between them:

sumspeed$ perf diff
[...]
# Event 'cycles:uppp'
#
# Baseline  Delta Abs  Shared Object        Symbol                                                                  
# ........  .........  ...................  ........................................................................
#
    57.99%    +26.33%  sum_groups           [.] main
    12.10%     -7.41%  libc-2.23.so         [.] _IO_getc
     9.82%     -6.40%  libstdc++.so.6.0.21  [.] std::num_get<char, std::istreambuf_iterator<char, std::char_traits<c
     6.45%     -4.00%  libc-2.23.so         [.] _IO_ungetc
     2.40%     -1.32%  libc-2.23.so         [.] _IO_sputbackc
     1.65%     -1.21%  libstdc++.so.6.0.21  [.] 0x00000000000dc4a4
     1.57%     -1.20%  libc-2.23.so         [.] _IO_fflush
     1.71%     -1.07%  libstdc++.so.6.0.21  [.] std::istream::sentry::sentry
     1.22%     -0.77%  libstdc++.so.6.0.21  [.] std::istream::operator>>
     0.79%     -0.47%  libstdc++.so.6.0.21  [.] __gnu_cxx::stdio_sync_filebuf<char, std::char_traits<char> >::uflow
[...]

It looks like main() has grouped_sum() inlined. Great, thanks, perf.

perf annotate

Is there a difference in where the time is spent inside main()?

Shuffled:

sumspeed$ perf annotate -i perf.data.old
[...]
       │     // This is the function whose performance I am interested in
       │     void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
       │       for (size_t i = 0; i < n; ++i) {
       │180:   xor    %eax,%eax
       │       test   %rdi,%rdi
       │     ↓ je     1a4
       │       nop
       │         p_out[p_g[i]] += p_x[i];
  6,88 │190:   movslq (%r9,%rax,4),%rdx
 58,54 │       mov    (%r8,%rax,4),%esi
       │     #include <chrono>
       │     #include <vector>
       │
       │     // This is the function whose performance I am interested in
       │     void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
       │       for (size_t i = 0; i < n; ++i) {
  3,86 │       add    $0x1,%rax
       │         p_out[p_g[i]] += p_x[i];
 29,61 │       add    %esi,(%rcx,%rdx,4)
[...]

Sorted:

sumspeed$ perf annotate -i perf.data
[...]
       │     // This is the function whose performance I am interested in
       │     void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
       │       for (size_t i = 0; i < n; ++i) {
       │180:   xor    %eax,%eax
       │       test   %rdi,%rdi
       │     ↓ je     1a4
       │       nop
       │         p_out[p_g[i]] += p_x[i];
  1,00 │190:   movslq (%r9,%rax,4),%rdx
 55,12 │       mov    (%r8,%rax,4),%esi
       │     #include <chrono>
       │     #include <vector>
       │
       │     // This is the function whose performance I am interested in
       │     void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
       │       for (size_t i = 0; i < n; ++i) {
  0,07 │       add    $0x1,%rax
       │         p_out[p_g[i]] += p_x[i];
 43,28 │       add    %esi,(%rcx,%rdx,4)
[...]

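No, the same two instructions dominate. So they take a long time in both cases, but are even worse when the data is sorted.

perf stat

Okay. But we should be running them the same number of times, so each instruction must be getting slower for some reason. Let's see what perf stat says.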

sumspeed$ perf stat ./sum_groups < groups_shuffled 
1138880176

 Performance counter stats for './sum_groups':

       1826,232278      task-clock (msec)         #    0,999 CPUs utilized          
                72      context-switches          #    0,039 K/sec                  
                 1      cpu-migrations            #    0,001 K/sec                  
             4 076      page-faults               #    0,002 M/sec                  
     5 403 949 695      cycles                    #    2,959 GHz                    
       930 473 671      stalled-cycles-frontend   #   17,22% frontend cycles idle   
     9 827 685 690      instructions              #    1,82  insn per cycle         
                                                  #    0,09  stalled cycles per insn
     2 086 725 079      branches                  # 1142,639 M/sec                  
         2 069 655      branch-misses             #    0,10% of all branches        

       1,828334373 seconds time elapsed

sumspeed$ perf stat ./sum_groups < groups_sorted
2496546045

 Performance counter stats for './sum_groups':

       3186,100661      task-clock (msec)         #    1,000 CPUs utilized          
                 5      context-switches          #    0,002 K/sec                  
                 0      cpu-migrations            #    0,000 K/sec                  
             4 079      page-faults               #    0,001 M/sec                  
     9 424 565 623      cycles                    #    2,958 GHz                    
     4 955 937 177      stalled-cycles-frontend   #   52,59% frontend cycles idle   
     9 829 009 511      instructions              #    1,04  insn per cycle         
                                                  #    0,50  stalled cycles per insn
     2 086 942 109      branches                  #  655,014 M/sec                  
         2 078 204      branch-misses             #    0,10% of all branches        

       3,186768174 seconds time elapsed

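Only one thing stands out: stalled-cycles-frontend.

Okay, the instruction pipeline is stalling. In the frontend. Exactly what that means probably varies between microarchitectures.

I have a guess, though. If you are generous, you might even call it a hypothesis.

Hypothesis

By sorting the input, you are increasing the locality of the writes. In fact, they will be very local: almost every addition you perform writes to the same location as the previous one.

That is great for the cache, but not great for the pipeline. You are introducing data dependencies, preventing the next addition instruction from proceeding until the previous addition has completed (or has otherwise made its result available to succeeding instructions).

That's your problem.

I think.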
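To put a number on "almost every addition writes to the same location as the previous one", here is a rough sketch that counts how often consecutive rows share a group id. same_slot_fraction and sorted_groups are hypothetical helpers written for this illustration; sorted_groups assumes the group layout produced by generate_groups.cc (equal-length runs, group id = i / group_size).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of iterations (from the second one on) whose group id equals
// the previous iteration's, i.e. additions that hit the same p_out slot
// as the addition right before them.
double same_slot_fraction(const std::vector<int>& groups) {
  if (groups.size() < 2) {
    return 0.0;
  }
  std::size_t repeats = 0;
  for (std::size_t i = 1; i < groups.size(); ++i) {
    if (groups[i] == groups[i - 1]) {
      ++repeats;
    }
  }
  return static_cast<double>(repeats) /
         static_cast<double>(groups.size() - 1);
}

// Group ids in the order generate_groups.cc emits them: num_groups runs
// of equal length (assumes num_values is at least num_groups).
std::vector<int> sorted_groups(int num_values, int num_groups) {
  assert(num_groups > 0 && num_values >= num_groups);
  std::vector<int> g;
  g.reserve(static_cast<std::size_t>(num_values));
  const int group_size = num_values / num_groups;
  for (int i = 0; i < num_values; ++i) {
    g.push_back(i / group_size);
  }
  return g;
}
```

For 1000000 values in 100 groups, the sorted order repeats the previous slot in 999900 of 999999 steps. For a uniformly shuffled order one would expect roughly 1 in 100, although that figure is an estimate rather than something measured here.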

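Fixing it

Multiple sum vectors

Actually, let's try something. What if we used multiple sum vectors, switching between them for each addition, and then summed those up at the end? It costs us a little locality, but should remove the data dependencies.

(The code is not pretty; don't judge me, internet!!)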

#include <iostream>
#include <chrono>
#include <vector>

#ifndef NSUMS
#define NSUMS (4) // must be power of 2 (for masking to work)
#endif

// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int** p_out) {
  for (size_t i = 0; i < n; ++i) {
    p_out[i & (NSUMS-1)][p_g[i]] += p_x[i];
  }
}

int main() {
  std::vector<int> values;
  std::vector<int> groups;
  std::vector<int> sums[NSUMS];

  int n_groups = 0;

  // Read in the values and calculate the max number of groups
  while(std::cin) {
    int value, group;
    std::cin >> value >> group;
    values.push_back(value);
    groups.push_back(group);
    if (group >= n_groups) {
      n_groups = group+1;
    }
  }
  for (int i=0; i<NSUMS; ++i) {
    sums[i].resize(n_groups);
  }

  // Time grouped sums
  std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
  int* sumdata[NSUMS];
  for (int i = 0; i < NSUMS; ++i) {
    sumdata[i] = sums[i].data();
  }
  for (int i = 0; i < 1000; ++i) {
    grouped_sum(values.data(), groups.data(), values.size(), sumdata);
  }
  for (int i = 1; i < NSUMS; ++i) {
    for (int j = 0; j < n_groups; ++j) {
      sumdata[0][j] += sumdata[i][j];
    }
  }
  std::chrono::system_clock::time_point end = std::chrono::system_clock::now();

  std::cout << (end - start).count() << " with NSUMS=" << NSUMS << std::endl;

  return 0;
}

(Oh, and I also fixed the n_groups calculation; it was off by one.)

Results

After configuring my makefile to pass a -DNSUMS=... argument to the compiler, I could do this:

sumspeed$ for n in 1 2 4 8 128; do make -s clean && make -s NSUMS=$n && (perf stat ./sum_groups < groups_shuffled && perf stat ./sum_groups < groups_sorted)  2>&1 | egrep '^[0-9]|frontend'; done
1134557008 with NSUMS=1
       924 611 882      stalled-cycles-frontend   #   17,13% frontend cycles idle   
2513696351 with NSUMS=1
     4 998 203 130      stalled-cycles-frontend   #   52,79% frontend cycles idle   
1116188582 with NSUMS=2
       899 339 154      stalled-cycles-frontend   #   16,83% frontend cycles idle   
1365673326 with NSUMS=2
     1 845 914 269      stalled-cycles-frontend   #   29,97% frontend cycles idle   
1127172852 with NSUMS=4
       902 964 410      stalled-cycles-frontend   #   16,79% frontend cycles idle   
1171849032 with NSUMS=4
     1 007 807 580      stalled-cycles-frontend   #   18,29% frontend cycles idle   
1118732934 with NSUMS=8
       881 371 176      stalled-cycles-frontend   #   16,46% frontend cycles idle   
1129842892 with NSUMS=8
       905 473 182      stalled-cycles-frontend   #   16,80% frontend cycles idle   
1497803734 with NSUMS=128
     1 982 652 954      stalled-cycles-frontend   #   30,63% frontend cycles idle   
1180742299 with NSUMS=128
     1 075 507 514      stalled-cycles-frontend   #   19,39% frontend cycles idle

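The optimal number of sum vectors will probably depend on your CPU's pipeline depth. My 7-year-old ultrabook CPU can probably max out the pipeline with fewer vectors than a fancy new desktop CPU would need.

Clearly, more is not necessarily better. When I went crazy with 128 sum vectors, we started suffering more from cache misses, as evidenced by the shuffled input becoming slower than the sorted one, just as you had originally expected. We've come full circle! :)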
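As a back-of-the-envelope check on the cache-miss explanation, the combined size of the partial-sum arrays can be worked out directly. The helpers below are hypothetical, and the 32 KiB L1 data cache used in the usage note is an assumption about a typical CPU, not a figure taken from the machine used above.

```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied by all partial-sum arrays together.
constexpr std::size_t sums_footprint_bytes(std::size_t nsums,
                                           std::size_t n_groups) {
  return nsums * n_groups * sizeof(int);
}

// True if the partial sums alone would fit in a cache of `cache_bytes`.
// This ignores the values and groups arrays, which stream through the
// cache as well.
bool fits_in_cache(std::size_t nsums, std::size_t n_groups,
                   std::size_t cache_bytes) {
  assert(cache_bytes > 0);
  return sums_footprint_bytes(nsums, n_groups) <= cache_bytes;
}
```

With 100 groups and a 4-byte int, NSUMS=8 needs 3200 bytes, while NSUMS=128 needs 51200 bytes (50 KiB), which would no longer fit in an assumed 32 KiB L1 data cache.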

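Per-group sum in a register

(This was added in an edit.)

Agh, nerd sniped! If you know your input will be sorted and are looking for even more performance, the following rewrite of the function (no extra sum arrays) is even faster, at least on my computer.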

// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int* p_out) {
  int i = n-1;
  while (i >= 0) {
    int g = p_g[i];
    int gsum = 0;
    do {
      gsum += p_x[i--];
    } while (i >= 0 && p_g[i] == g);
    p_out[g] += gsum;
  }
}

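The trick in this one is that it allows the compiler to keep the gsum variable, the sum of the group, in a register. I'm guessing (but may be very wrong) that this is faster because the feedback loop in the pipeline can be shorter here, and/or because there are fewer memory accesses. A good branch predictor will make the extra check for group equality cheap.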
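Since the run-based version walks the input backwards and writes only once per run, it is worth checking that it produces the same sums as the plain loop for both sorted and unsorted input. The sketch below does that. The _simple and _runs suffixes and the same_result helper are names added for this illustration, and the two functions are otherwise copies of the loops shown in this answer with const added.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain version: one read-modify-write of p_out per element.
void grouped_sum_simple(const int* p_x, const int* p_g, int n, int* p_out) {
  for (int i = 0; i < n; ++i) {
    p_out[p_g[i]] += p_x[i];
  }
}

// Run-based version: accumulate each run of equal group ids in a local
// variable, then write to p_out once per run.
void grouped_sum_runs(const int* p_x, const int* p_g, int n, int* p_out) {
  int i = n - 1;
  while (i >= 0) {
    const int g = p_g[i];
    int gsum = 0;
    do {
      gsum += p_x[i--];
    } while (i >= 0 && p_g[i] == g);
    p_out[g] += gsum;
  }
}

// Runs both versions on the same input and compares the per-group sums.
bool same_result(const std::vector<int>& x, const std::vector<int>& g,
                 std::size_t n_groups) {
  assert(x.size() == g.size());
  std::vector<int> a(n_groups), b(n_groups);
  const int n = static_cast<int>(x.size());
  grouped_sum_simple(x.data(), g.data(), n, a.data());
  grouped_sum_runs(x.data(), g.data(), n, b.data());
  return a == b;
}
```

The results agree for any input order. What changes with the order is only how long the runs are, and therefore how often the run-based version has to write to p_out.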

Results

It is terrible for shuffled input...

sumspeed$ time ./sum_groups < groups_shuffled
2236354315

real    0m2.932s
user    0m2.923s
sys 0m0.009s

...but is around 40% faster than my "many sums" solution for sorted input.

sumspeed$ time ./sum_groups < groups_sorted
809694018

real    0m1.501s
user    0m1.496s
sys 0m0.005s

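Lots of small groups will be slower than a few big ones, so whether or not this is the faster implementation will really depend on your data. And, as always, on your CPU model.

Multiple sum vectors, with offset instead of bit masking

Sopel suggested four unrolled additions as an alternative to my bit masking approach. I've implemented a generalized version of their suggestion that can handle different values of NSUMS. I'm counting on the compiler to unroll the inner loop for us (at least for NSUMS=4).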

#include <iostream>
#include <chrono>
#include <vector>

#ifndef NSUMS
#define NSUMS (4) // must be power of 2 (for masking to work)
#endif

#ifndef INNER
#define INNER (0)
#endif
#if INNER
// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int** p_out) {
  size_t i = 0;
  int quadend = n & ~(NSUMS-1);
  for (; i < quadend; i += NSUMS) {
    for (int k=0; k<NSUMS; ++k) {
      p_out[k][p_g[i+k]] += p_x[i+k];
    }
  }
  for (; i < n; ++i) {
    p_out[0][p_g[i]] += p_x[i];
  }
}
#else
// This is the function whose performance I am interested in
void grouped_sum(int* p_x, int *p_g, int n, int** p_out) {
  for (size_t i = 0; i < n; ++i) {
    p_out[i & (NSUMS-1)][p_g[i]] += p_x[i];
  }
}
#endif


int main() {
  std::vector<int> values;
  std::vector<int> groups;
  std::vector<int> sums[NSUMS];

  int n_groups = 0;

  // Read in the values and calculate the max number of groups
  while(std::cin) {
    int value, group;
    std::cin >> value >> group;
    values.push_back(value);
    groups.push_back(group);
    if (group >= n_groups) {
      n_groups = group+1;
    }
  }
  for (int i=0; i<NSUMS; ++i) {
    sums[i].resize(n_groups);
  }

  // Time grouped sums
  std::chrono::system_clock::time_point start = std::chrono::system_clock::now();
  int* sumdata[NSUMS];
  for (int i = 0; i < NSUMS; ++i) {
    sumdata[i] = sums[i].data();
  }
  for (int i = 0; i < 1000; ++i) {
    grouped_sum(values.data(), groups.data(), values.size(), sumdata);
  }
  for (int i = 1; i < NSUMS; ++i) {
    for (int j = 0; j < n_groups; ++j) {
      sumdata[0][j] += sumdata[i][j];
    }
  }
  std::chrono::system_clock::time_point end = std::chrono::system_clock::now();

  std::cout << (end - start).count() << " with NSUMS=" << NSUMS << ", INNER=" << INNER << std::endl;

  return 0;
}

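Results

Time to measure. Note that since I was working in /tmp yesterday, I no longer have the exact same input data. Hence, these results are not directly comparable to the previous ones (but probably close enough).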

sumspeed$ for n in 2 4 8 16; do for inner in 0 1; do make -s clean && make -s NSUMS=$n INNER=$inner && (perf stat ./sum_groups < groups_shuffled && perf stat ./sum_groups < groups_sorted)  2>&1 | egrep '^[0-9]|frontend'; done; done
1130558787 with NSUMS=2, INNER=0
       915 158 411      stalled-cycles-frontend   #   16,96% frontend cycles idle   
1351420957 with NSUMS=2, INNER=0
     1 589 408 901      stalled-cycles-frontend   #   26,21% frontend cycles idle   
840071512 with NSUMS=2, INNER=1
     1 053 982 259      stalled-cycles-frontend   #   23,26% frontend cycles idle   
1391591981 with NSUMS=2, INNER=1
     2 830 348 854      stalled-cycles-frontend   #   45,35% frontend cycles idle   
1110302654 with NSUMS=4, INNER=0
       890 869 892      stalled-cycles-frontend   #   16,68% frontend cycles idle   
1145175062 with NSUMS=4, INNER=0
       948 879 882      stalled-cycles-frontend   #   17,40% frontend cycles idle   
822954895 with NSUMS=4, INNER=1
     1 253 110 503      stalled-cycles-frontend   #   28,01% frontend cycles idle   
929548505 with NSUMS=4, INNER=1
     1 422 753 793      stalled-cycles-frontend   #   30,32% frontend cycles idle   
1128735412 with NSUMS=8, INNER=0
       921 158 397      stalled-cycles-frontend   #   17,13% frontend cycles idle   
1120606464 with NSUMS=8, INNER=0
       891 960 711      stalled-cycles-frontend   #   16,59% frontend cycles idle   
800789776 with NSUMS=8, INNER=1
     1 204 516 303      stalled-cycles-frontend   #   27,25% frontend cycles idle   
805223528 with NSUMS=8, INNER=1
     1 222 383 317      stalled-cycles-frontend   #   27,52% frontend cycles idle   
1121644613 with NSUMS=16, INNER=0
       886 781 824      stalled-cycles-frontend   #   16,54% frontend cycles idle   
1108977946 with NSUMS=16, INNER=0
       860 600 975      stalled-cycles-frontend   #   16,13% frontend cycles idle   
911365998 with NSUMS=16, INNER=1
     1 494 671 476      stalled-cycles-frontend   #   31,54% frontend cycles idle   
898729229 with NSUMS=16, INNER=1
     1 474 745 548      stalled-cycles-frontend   #   31,24% frontend cycles idle

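Yup, the inner loop with NSUMS=8 is the fastest on my computer. Compared to my "local gsum" approach, it also has the added benefit of not becoming terrible for shuffled input.

Interesting to note: NSUMS=16 becomes worse than NSUMS=8. This could be because we are starting to see more cache misses, or because we don't have enough registers to unroll the inner loop properly.

— Snild Dolkow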

This was fun. :)

— Snild Dolkow

That was awesome! I didn't know about perf.

— Tanveer Badar

I wonder whether, in your first approach, manually unrolling 4x with 4 different accumulators would yield better performance. Something like godbolt.org/z/S-PhFm

— Sopel
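The linked code is not reproduced here, so what follows is only one guess at what a manual 4x unroll with four separate accumulator arrays could look like. grouped_sum_unrolled4 and grouped_sum4 are hypothetical names, and this sketch is not claimed to match the linked snippet.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Four consecutive elements go to four different accumulator arrays, so
// neighbouring additions never target the same memory location even when
// their group ids are equal.
void grouped_sum_unrolled4(const int* p_x, const int* p_g, std::size_t n,
                           int* s0, int* s1, int* s2, int* s3) {
  std::size_t i = 0;
  const std::size_t end4 = n - (n % 4);
  for (; i < end4; i += 4) {
    s0[p_g[i]] += p_x[i];
    s1[p_g[i + 1]] += p_x[i + 1];
    s2[p_g[i + 2]] += p_x[i + 2];
    s3[p_g[i + 3]] += p_x[i + 3];
  }
  for (; i < n; ++i) {  // leftover elements when n is not a multiple of 4
    s0[p_g[i]] += p_x[i];
  }
}

// Convenience wrapper: allocates the four accumulators and merges them.
std::vector<int> grouped_sum4(const std::vector<int>& x,
                              const std::vector<int>& g,
                              std::size_t n_groups) {
  assert(x.size() == g.size());
  std::vector<int> s0(n_groups), s1(n_groups), s2(n_groups), s3(n_groups);
  grouped_sum_unrolled4(x.data(), g.data(), x.size(),
                        s0.data(), s1.data(), s2.data(), s3.data());
  for (std::size_t j = 0; j < n_groups; ++j) {
    s0[j] += s1[j] + s2[j] + s3[j];
  }
  return s0;
}
```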

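Thanks for the suggestion. Yes, that did improve performance, and I've added it to the answer.

— Snild Dolkow

Thanks! I had considered a possibility like this but didn't know how to go about confirming it, so I appreciate the detailed answer!

— Jim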

Here is why sorted groups are slower than unsorted groups.

First, here is the assembly code for the summing loop:

008512C3  mov         ecx,dword ptr [eax+ebx]
008512C6  lea         eax,[eax+4]
008512C9  lea         edx,[esi+ecx*4] // &sums[groups[i]]
008512CC  mov         ecx,dword ptr [eax-4] // values[i]
008512CF  add         dword ptr [edx],ecx // sums[groups[i]]+=values[i]
008512D1  sub         edi,1
008512D4  jne         main+163h (08512C3h)

Let's look at the add instruction, which is the main reason for this issue:

008512CF  add         dword ptr [edx],ecx // sums[groups[i]]+=values[i]

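When the processor executes this instruction, it first issues a memory read (load) request to the address in edx, then adds the value of ecx, and then issues a write (store) request to the same address.

There is a processor feature called memory reordering:

To allow performance optimization of instruction execution, the IA-32 architecture allows departures from the strong-ordering model, called processor ordering, in Pentium 4, Intel Xeon, and P6 family processors. These processor-ordering variations (called here the memory-ordering model) allow performance-enhancing operations such as allowing reads to go ahead of buffered writes. The goal of any of these variations is to increase instruction execution speeds, while maintaining memory coherency, even in multiple-processor systems.

And there is a rule:

Reads may be reordered with older writes to different locations but not with older writes to the same location.

So if the next iteration reaches the add instruction before the write request completes, it does not wait as long as the address in edx differs from the previous one: it issues the read request, which is reordered ahead of the older write request, and the add instruction proceeds. But if the address is the same, the add instruction waits until the older write is done.

Note that the loop is short, and the processor can execute it faster than the memory controller can complete the write-to-memory request.

So for sorted groups you read and write the same address many times in a row, and you lose the performance gain of memory reordering. When random groups are used, each iteration probably has a different address, so the read does not wait for the older write and is reordered ahead of it; the add instruction does not wait for the previous one to finish.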
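For anyone who wants to reproduce the comparison without the file-reading overhead, here is a rough in-process harness. Everything in it (Timed, timed_grouped_sum, shuffle_rows) is hypothetical scaffolding rather than code from the question. It uses unsigned sums so that overflow wraps in a defined way, and std::chrono::steady_clock, which is better suited to measuring intervals than system_clock. No timings are quoted because they depend on the CPU and compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

struct Timed {
  std::vector<unsigned> sums;
  std::chrono::steady_clock::duration elapsed;
};

// Runs the read-modify-write loop `reps` times and times only that loop.
Timed timed_grouped_sum(const std::vector<unsigned>& values,
                        const std::vector<int>& groups,
                        std::size_t n_groups, int reps) {
  assert(values.size() == groups.size());
  Timed t{std::vector<unsigned>(n_groups), {}};
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) {
    for (std::size_t i = 0; i < values.size(); ++i) {
      t.sums[static_cast<std::size_t>(groups[i])] += values[i];
    }
  }
  t.elapsed = std::chrono::steady_clock::now() - start;
  return t;
}

// Applies one random permutation to both columns, keeping rows intact.
void shuffle_rows(std::vector<unsigned>& values, std::vector<int>& groups,
                  unsigned seed) {
  assert(values.size() == groups.size());
  std::vector<std::size_t> order(values.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::mt19937 rng(seed);
  std::shuffle(order.begin(), order.end(), rng);
  std::vector<unsigned> v(values.size());
  std::vector<int> g(groups.size());
  for (std::size_t i = 0; i < order.size(); ++i) {
    v[i] = values[order[i]];
    g[i] = groups[order[i]];
  }
  values.swap(v);
  groups.swap(g);
}
```

Calling timed_grouped_sum once on sorted data, then shuffle_rows, then timed_grouped_sum again gives two elapsed values to compare. The sums must be identical in both cases, which doubles as a sanity check that the shuffle kept each value with its group.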

— Ahmed Anter

Why is a grouped sum slower with sorted groups than with unsorted groups?