Does pandas iterrows have performance issues?

Question 1

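I have noticed very poor performance when using iterrows from pandas.

Is this something that others experience? Is it specific to iterrows, and should this function be avoided for data of a certain size (I am working with 2-3 million rows)?

This discussion on GitHub led me to believe the cause is mixed dtypes in the DataFrame. However, the simple example below shows the problem is there even with a single dtype (float64). It takes 36 seconds on my machine: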

import pandas as pd
import numpy as np
import time

s1 = np.random.randn(2000000)
s2 = np.random.randn(2000000)
dfa = pd.DataFrame({'s1': s1, 's2': s2})

start = time.time()
i=0
for rowindex, row in dfa.iterrows():
    i+=1
end = time.time()
print(end - start)

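Why are vectorized operations like apply so much quicker? I imagine there must be some row-by-row iteration going on there too.

I cannot figure out how to avoid iterrows in my case (I will save that for a future question). So I would appreciate hearing whether you have consistently been able to avoid this iteration. I am making calculations based on data in separate DataFrames. Thank you!

---Edit: a simplified version of what I want to run has been added below---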

import pandas as pd
import numpy as np

#%% Create the original tables
t1 = {'letter':['a','b'],
      'number1':[50,-10]}

t2 = {'letter':['a','a','b','b'],
      'number2':[0.2,0.5,0.1,0.4]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

#%% Define optimization (must be defined before the loop that calls it)
def optimize(t2info, t1info):
    # Multiply every number2 by the table1 value and keep the row with the largest product
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow = calculation.index(max(calculation))
    # .ix has been removed from pandas; .loc is the label-based replacement
    return t2info.loc[maxrow]

#%% Create the body of the new table
table3 = pd.DataFrame(np.nan, columns=['letter', 'number2'], index=[0])

#%% Iterate through filtering relevant data, optimizing, returning info
for row_index, row in table1.iterrows():
    t2info = table2[table2.letter == row['letter']].reset_index()
    table3.loc[row_index, :] = optimize(t2info, row['number1'])

Question 2

In general, iterrows should be used only in very specific cases. This is the general order of precedence for performance of various operations:

1) vectorization
2) using a custom cython routine
3) apply
    a) reductions that can be performed in cython
    b) iteration in python space
4) itertuples
5) iterrows
6) updating an empty frame (e.g. using loc one-row-at-a-time)

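Using a custom Cython routine is usually too complicated, so let's skip that for now.

1) Vectorization is ALWAYS, ALWAYS the first and best choice. However, there is a small set of cases (usually involving a recurrence) that cannot be vectorized in obvious ways. Furthermore, on a smallish DataFrame, it may be faster to use other methods.

3) apply can usually be handled by an iterator in Cython space. This is handled internally by pandas, though it depends on what is going on inside the apply expression. For example, df.apply(lambda x: np.sum(x)) will be executed pretty swiftly, though of course df.sum(1) is even better. However, something like df.apply(lambda x: x['b'] + 1) will be executed in Python space, and consequently is much slower.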
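As a rough illustration of the apply cases above, here is a minimal sketch. The frame `df` and its columns `a` and `b` are made up for this example, and `axis=1` is passed explicitly so that each lambda receives a row; that is an assumption about what the expressions above intend.

```python
import numpy as np
import pandas as pd

# Hypothetical frame, for illustration only
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Row sums via apply with a simple reduction ...
row_sums_apply = df.apply(lambda x: np.sum(x), axis=1)
# ... and via the built-in vectorized method, which is the better choice
row_sums_builtin = df.sum(axis=1)

# A lambda that indexes each row by label is evaluated row by row in Python
b_plus_one_apply = df.apply(lambda x: x["b"] + 1, axis=1)
# The vectorized equivalent works on the whole column at once
b_plus_one_vec = df["b"] + 1

print(row_sums_apply.tolist(), row_sums_builtin.tolist())
print(b_plus_one_apply.tolist(), b_plus_one_vec.tolist())
```

Both pairs produce the same values; only the amount of work done in Python space differs.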

4) itertuples does not box the data into a Series. It just returns the data in the form of tuples.

5) iterrows DOES box the data into a Series. Unless you really need this, use another method.
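The difference between points 4 and 5 can be seen by looking at what each iterator yields. This is a small sketch on a made-up two-row frame, not a benchmark.

```python
import pandas as pd

# Hypothetical frame, for illustration only
df = pd.DataFrame({"s1": [0.1, 0.2], "s2": [1.5, 2.5]})

# iterrows yields (index, Series) pairs: a new Series is built for every row
index, row = next(df.iterrows())
print(type(row).__name__)      # Series
print(row["s1"], row["s2"])

# itertuples yields namedtuples: fields are reached by attribute, no Series is built
tup = next(df.itertuples())
print(isinstance(tup, tuple))  # True
print(tup.Index, tup.s1, tup.s2)
```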

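6) Updating an empty frame a single row at a time. I have seen this method used WAY too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some Python structures), but a DataFrame does a fair number of checks on indexing, so updating a row at a time will always be very slow. It is much better to create new structures and concat.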
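One way to follow the advice in point 6 is to collect the rows in a plain Python list and build the DataFrame once at the end. The sketch below uses made-up values; the column names `letter` and `number2` are borrowed from the question only for familiarity.

```python
import pandas as pd

# Accumulate plain Python dicts instead of writing into a DataFrame row by row
rows = []
for i in range(5):
    rows.append({"letter": chr(ord("a") + i), "number2": i * 0.5})

# Build the frame once, after the loop
result = pd.DataFrame(rows, columns=["letter", "number2"])

# If the pieces are already DataFrames, concatenate them in a single call
pieces = [result.iloc[:2], result.iloc[2:]]
combined = pd.concat(pieces)

print(result)
```

Appending to a list is cheap, while every row-level assignment into a DataFrame goes through its indexing checks.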

Question 3

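Vector operations in Numpy and pandas are much faster than scalar operations in vanilla Python for several reasons:

Amortized type lookup: Python is a dynamically typed language, so there is runtime overhead for each element in an array. However, Numpy (and thus pandas) performs calculations in C (often via Cython). The type of the array is determined only once, at the start of the iteration. This saving alone is one of the biggest wins.
Better caching: Iterating over a C array is cache-friendly and thus very fast. A pandas DataFrame is a "column-oriented table", which means that each column is really just an array. So the native actions you can perform on a DataFrame (like summing all the elements in a column) will have few cache misses.
More opportunities for parallelism: A simple C array can be operated on via SIMD instructions. Some parts of Numpy enable SIMD, depending on your CPU and installation process. The benefits of parallelism will not be as dramatic as those of static typing and better caching, but they are still a solid win.

Moral of the story: use the vector operations in Numpy and pandas. They are faster than scalar operations in Python for the simple reason that they are exactly what a C programmer would have written by hand anyway. (Except that the array notation is much easier to read than explicit loops with embedded SIMD instructions.)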
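A quick way to see the gap on your own machine is to time the same reduction done both ways. This is only a sketch: the array size is arbitrary, and the timings depend on your hardware and Numpy build, so no numbers are quoted here.

```python
import time
import numpy as np

values = np.random.randn(200_000)

# Scalar loop: every element is handled as a separate Python object
start = time.perf_counter()
total_loop = 0.0
for v in values:
    total_loop += v
loop_seconds = time.perf_counter() - start

# Vectorized reduction: one call, the loop runs in compiled code
start = time.perf_counter()
total_vec = values.sum()
vec_seconds = time.perf_counter() - start

print(f"python loop: {loop_seconds:.4f} s")
print(f"numpy sum:   {vec_seconds:.4f} s")
```

The two totals agree up to floating-point rounding; only the time taken differs.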

Question 4

Here is the way to solve your problem. This is all vectorized.

In [58]: df = table1.merge(table2,on='letter')

In [59]: df['calc'] = df['number1']*df['number2']

In [60]: df
Out[60]: 
  letter  number1  number2  calc
0      a       50      0.2    10
1      a       50      0.5    25
2      b      -10      0.1    -1
3      b      -10      0.4    -4

In [61]: df.groupby('letter')['calc'].max()
Out[61]: 
letter
a         25
b         -1
Name: calc, dtype: float64

In [62]: df.groupby('letter')['calc'].idxmax()
Out[62]: 
letter
a         1
b         2
Name: calc, dtype: int64

In [63]: df.loc[df.groupby('letter')['calc'].idxmax()]
Out[63]: 
  letter  number1  number2  calc
1      a       50      0.5    25
2      b      -10      0.1    -1
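For convenience, the session above can be collected into a self-contained script. The helper name `best_rows` is made up for this sketch; the steps themselves (merge, multiply, groupby with idxmax) are the ones shown in the session.

```python
import pandas as pd

table1 = pd.DataFrame({"letter": ["a", "b"], "number1": [50, -10]})
table2 = pd.DataFrame({"letter": ["a", "a", "b", "b"],
                       "number2": [0.2, 0.5, 0.1, 0.4]})

def best_rows(t1, t2):
    """For each letter, return the table2 row that maximizes number1 * number2."""
    merged = t1.merge(t2, on="letter")
    merged["calc"] = merged["number1"] * merged["number2"]
    # idxmax gives, per letter, the row label of the largest product
    best = merged.loc[merged.groupby("letter")["calc"].idxmax()]
    return best[["letter", "number2"]].reset_index(drop=True)

table3 = best_rows(table1, table2)
print(table3)
```

The result has the same `letter` and `number2` columns that the original loop was filling into table3.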

Question 5

Another option is to use to_records(), which is faster than both itertuples and iterrows.
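As a minimal sketch of what to_records() gives you, using the small table1 from the question: each record unpacks like a tuple, with the index as the first field.

```python
import pandas as pd

table1 = pd.DataFrame({"letter": ["a", "b"], "number1": [50, -10]})

# to_records() returns a numpy record array; the index becomes the first field
records = table1.to_records()
print(records.dtype.names)  # ('index', 'letter', 'number1')

# Each record unpacks like a tuple
for index, letter, n1 in records:
    print(index, letter, n1)

# Whole columns are also available as attributes of the record array
print(records.number1)
```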

But in your case, there is plenty of room for other kinds of improvement.

Here is my final optimized version:

def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    t2info = table2.to_records()
    for index, letter, n1 in table1.to_records():
        t2 = t2info[grouped.groups[letter].values]
        # np.multiply is in general faster than "x * y"
        maxrow = np.multiply(t2.number2, n1).argmax()
        # `[1:]`  removes the index column
        ret.append(t2[maxrow].tolist()[1:])
    global table3
    table3 = pd.DataFrame(ret, columns=('letter', 'number2'))

Benchmark test:

-- iterrows() --
100 loops, best of 3: 12.7 ms per loop
  letter  number2
0      a      0.5
1      b      0.1
2      c      5.0
3      d      4.0

-- itertuple() --
100 loops, best of 3: 12.3 ms per loop

-- to_records() --
100 loops, best of 3: 7.29 ms per loop

-- Use group by --
100 loops, best of 3: 4.07 ms per loop
  letter  number2
1      a      0.5
2      b      0.1
4      c      5.0
5      d      4.0

-- Avoid multiplication --
1000 loops, best of 3: 1.39 ms per loop
  letter  number2
0      a      0.5
1      b      0.1
2      c      5.0
3      d      4.0

Full code:

import pandas as pd
import numpy as np

#%% Create the original tables
t1 = {'letter':['a','b','c','d'],
      'number1':[50,-10,.5,3]}

t2 = {'letter':['a','a','b','b','c','d','c'],
      'number2':[0.2,0.5,0.1,0.4,5,4,1]}

table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)

#%% Create the body of the new table
table3 = pd.DataFrame(np.nan, columns=['letter','number2'], index=table1.index)


print('\n-- iterrows() --')

def optimize(t2info, t1info):
    calculation = []
    for index, r in t2info.iterrows():
        calculation.append(r['number2'] * t1info)
    maxrow_in_t2 = calculation.index(max(calculation))
    return t2info.loc[maxrow_in_t2]

#%% Iterate through filtering relevant data, optimizing, returning info
def iterthrough():
    for row_index, row in table1.iterrows():   
        t2info = table2[table2.letter == row['letter']].reset_index()
        table3.iloc[row_index,:] = optimize(t2info, row['number1'])

%timeit iterthrough()
print(table3)

print('\n-- itertuple() --')
def optimize(t2info, n1):
    calculation = []
    for index, letter, n2 in t2info.itertuples():
        calculation.append(n2 * n1)
    maxrow = calculation.index(max(calculation))
    return t2info.iloc[maxrow]

def iterthrough():
    for row_index, letter, n1 in table1.itertuples():   
        t2info = table2[table2.letter == letter]
        table3.iloc[row_index,:] = optimize(t2info, n1)

%timeit iterthrough()


print('\n-- to_records() --')
def optimize(t2info, n1):
    calculation = []
    for index, letter, n2 in t2info.to_records():
        calculation.append(n2 * n1)
    maxrow = calculation.index(max(calculation))
    return t2info.iloc[maxrow]

def iterthrough():
    for row_index, letter, n1 in table1.to_records():   
        t2info = table2[table2.letter == letter]
        table3.iloc[row_index,:] = optimize(t2info, n1)

%timeit iterthrough()

print('\n-- Use group by --')

def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    for index, letter, n1 in table1.to_records():
        t2 = table2.iloc[grouped.groups[letter]]
        calculation = t2.number2 * n1
        maxrow = calculation.argsort().iloc[-1]
        ret.append(t2.iloc[maxrow])
    global table3
    table3 = pd.DataFrame(ret)

%timeit iterthrough()
print(table3)

print('\n-- Even Faster --')
def iterthrough():
    ret = []
    grouped = table2.groupby('letter', sort=False)
    t2info = table2.to_records()
    for index, letter, n1 in table1.to_records():
        t2 = t2info[grouped.groups[letter].values]
        maxrow = np.multiply(t2.number2, n1).argmax()
        # `[1:]`  removes the index column
        ret.append(t2[maxrow].tolist()[1:])
    global table3
    table3 = pd.DataFrame(ret, columns=('letter', 'number2'))

%timeit iterthrough()
print(table3)

The final version is almost 10x faster than the original code. The strategy is:

Use groupby to avoid repeated comparison of values.
Use to_records to access raw numpy.records objects.
Don't operate on the DataFrame until you have compiled all the data.

Question 6

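Yes, pandas itertuples() is faster than iterrows(). You can refer to the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html

"To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows."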
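The dtype point in the quote can be checked with a small sketch. The column names `qty` and `price` are made up; the frame mixes an integer column with a float column.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column
df = pd.DataFrame({"qty": [1, 2], "price": [0.5, 1.5]})

# iterrows packs each row into one Series, so the integer is upcast to a float
_, row = next(df.iterrows())
print(row["qty"], isinstance(row["qty"], float))

# itertuples keeps each value as it comes out of its own column
tup = next(df.itertuples(index=False))
print(tup.qty, isinstance(tup.qty, float))
```

With iterrows the `qty` value comes back as a float, while with itertuples it is still an integer.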

Question 7

Details in this video

Benchmark