Data Science descriptive-statistics

1

Why is xgboost so much faster than sklearn GradientBoostingClassifier?

I am trying to train a gradient boosting model on 50k examples with 100 numeric features. On my machine, XGBClassifier handles 500 trees within 43 seconds, while GradientBoostingClassifier handles only 10 trees (!) in 1 minute and 2 seconds :( I did not bother trying to grow 500 trees with it, since that would take far too long. I am using the same learning_rate and max_depth settings for both, see below.

What makes XGBoost so much faster? Does it use some novel implementation of gradient boosting that the sklearn developers do not know about? Or is it "cutting corners" and growing shallower trees?

P.S. I am aware of this discussion: https://www.kaggle.com/c/higgs-boson/forums/t/10335/xgboost-post-competition-survey but could not find the answer there...

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0, learning_rate=0.05, max_delta_step=0, max_depth=10, min_child_weight=1, missing=None, n_estimators=500, nthread=-1, objective='binary:logistic', reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=0, silent=True, subsample=1)

GradientBoostingClassifier(init=None, learning_rate=0.05, loss='deviance', max_depth=10, max_features=None, max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, presort='auto', random_state=None, subsample=1.0, verbose=0, warm_start=False)
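A comparison like the one described in this question could be reproduced with a small timing harness along these lines. This is only a sketch under stated assumptions: the data is synthetic (generated with make_classification rather than taken from the asker's dataset), the sample count is reduced to 5,000 so the run stays short, both models use 10 trees, and the helper name time_fit is made up for illustration. Absolute timings will differ by machine and library version.

```python
import time


def time_fit(model, X, y):
    """Return the wall-clock seconds taken by model.fit(X, y)."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


def main():
    try:
        from sklearn.datasets import make_classification
        from sklearn.ensemble import GradientBoostingClassifier
    except ImportError:
        print("scikit-learn is not installed; nothing to time")
        return

    # Synthetic stand-in for the question's data (assumption), shrunk from
    # 50k to 5k rows so the run stays short.
    X, y = make_classification(n_samples=5000, n_features=100, random_state=0)

    # Settings shared by both models, taken from the question.
    shared = dict(learning_rate=0.05, max_depth=10, n_estimators=10)
    models = {"GradientBoostingClassifier": GradientBoostingClassifier(**shared)}

    try:
        from xgboost import XGBClassifier
        models["XGBClassifier"] = XGBClassifier(n_jobs=-1, **shared)
    except ImportError:
        print("xgboost is not installed; timing scikit-learn only")

    for name, model in models.items():
        print(f"{name}: {time_fit(model, X, y):.2f} s")


if __name__ == "__main__":
    main()
```

Keeping n_estimators identical for both models makes the per-tree cost directly comparable, which the 500-versus-10 setup in the question does not.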

29 scikit-learn xgboost gbm data-mining classification data-cleaning machine-learning reinforcement-learning bigdata dataset nlp language-model stanford-nlp neural-network deep-learning randomized-algorithms beginner career loss-function software-recommendation naive-bayes-classifier feature-selection r random-forest cross-validation python churn clustering k-means sentiment-analysis programming nltk gensim visualization data csv descriptive-statistics supervised-learning text-mining orange parameter-estimation pandas scraping unsupervised-learning

1

How many LSTM cells should I use?

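Are there any rules of thumb (or actual rules) for the minimum, maximum, and "reasonable" number of LSTM cells I should use? Specifically, I am referring to BasicLSTMCell in TensorFlow and its num_units property.

Assume I have a classification problem defined by:

t - number of time steps
n - length of input vector in each time step
m - length of output vector (number of classes)
i - number of training examples

Is it true, for example, that the number of training examples should be larger than

4*((n+1)*m + m*m)*c

where c is the number of cells? I based this on: How to calculate the number of parameters of an LSTM network? As I understand it, this should give the total number of parameters, which should be less than the number of training examples.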
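To make the arithmetic in this question concrete, the sketch below evaluates two expressions side by side. The first is the usual parameter count for a single standard LSTM layer without peepholes: four gates, each with an (n + num_units) x num_units weight matrix and num_units biases. In that count num_units sits where the question writes m, and there is no extra factor c. The second is the expression exactly as the question writes it. The function names and the sizes n=10, m=3, c=32 are hypothetical, chosen only for illustration, and nothing here checks what the linked answer actually states.

```python
def lstm_layer_params(n_inputs, num_units):
    """Weights and biases of one standard LSTM layer (no peepholes).

    Each of the 4 gates has an (n_inputs + num_units) x num_units weight
    matrix plus num_units biases.
    """
    return 4 * ((n_inputs + 1) * num_units + num_units * num_units)


def question_expression(n, m, c):
    """The expression exactly as written in the question."""
    return 4 * ((n + 1) * m + m * m) * c


if __name__ == "__main__":
    n, m, c = 10, 3, 32  # hypothetical sizes, for illustration only
    print(lstm_layer_params(n, c))       # 5504
    print(question_expression(n, m, c))  # 5376
```

The two results differ because the standard count grows quadratically in the number of units, whereas the question's expression is linear in c.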

12 rnn machine-learning r predictive-modeling random-forest python language-model sentiment-analysis encoding deep-learning neural-network dataset caffe classification xgboost multiclass-classification unbalanced-classes time-series descriptive-statistics clustering tensorflow probability scikit-learn svm gradient-descent regression research convnet keras bigdata visualization rstudio pandas pyspark multilabel-classification ensemble-modeling kaggle linear-regression cnn association-rules training model-selection image-classification prediction sampling recommender-system books nlp matlab information-retrieval search search-engine cross-validation

5

When to use mean vs. median

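I am new to data science and statistics, so this may seem like a beginner question. I am working on a dataset of the Twitter followers a user gains per day. I want to measure the user's average growth over a period of time, which I did by taking the mean of the daily growth. However, someone suggested that I use the median instead. Can anyone explain in which use cases the mean should be used and when the median should be used?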
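The difference between the two measures is easiest to see on a series with one unusual day. The numbers below are made up for illustration (they are not from the asker's dataset): six ordinary days plus one spike, as might happen if a post went viral.

```python
from statistics import mean, median

# Hypothetical daily follower gains: six ordinary days and one spike.
daily_gain = [10, 12, 9, 11, 13, 10, 500]

print(round(mean(daily_gain), 1))  # 80.7, pulled up by the single spike
print(median(daily_gain))          # 11, close to a typical day
```

In this made-up series the mean reflects total growth spread evenly over the week, while the median describes what a typical day looked like. Which one is appropriate depends on which of those two questions is being asked.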

7 statistics descriptive-statistics

Questions tagged "descriptive-statistics"
