dabl.search.GridSuccessiveHalving

dabl.search.GridSuccessiveHalving(estimator, param_grid, scoring=None, n_jobs=None, refit=True, verbose=0, cv=5, pre_dispatch='2*n_jobs', random_state=None, error_score=nan, return_train_score=True, max_budget='auto', budget_on='n_samples', ratio=3, r_min='auto', aggressive_elimination=False, force_exhaust_budget=False)

Grid-search with successive halving.

The search strategy for hyper-parameter optimization starts by evaluating all candidates with a small amount of resources, then iteratively selects the best candidates and re-evaluates them with progressively more resources.
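The strategy can be illustrated with a small, library-independent sketch. This is not dabl's implementation: the function successive_halving, the evaluate callback, and the toy scoring rule are hypothetical names chosen for illustration, and keeping the top 1/ratio of the candidates each round is an assumption about how "selects the best candidates" is realized.

```python
import math


def successive_halving(candidates, evaluate, r_min, max_budget, ratio=3):
    """Toy successive halving: keep roughly the top 1/ratio each round.

    `evaluate(candidate, budget)` is a hypothetical callback returning a
    score (higher is better) obtained with `budget` units of resource.
    """
    survivors = list(candidates)
    budget = r_min
    history = []  # (budget, number of candidates evaluated) per iteration
    while True:
        # Evaluate every surviving candidate with the current budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        history.append((budget, len(ranked)))
        # Stop when one candidate is left or the next budget would be too large.
        if len(ranked) == 1 or budget * ratio > max_budget:
            return ranked[0], history
        keep = max(1, math.ceil(len(ranked) / ratio))
        survivors = ranked[:keep]
        budget *= ratio


def toy_score(candidate, budget):
    # Hypothetical objective: best at candidate == 4; the small penalty
    # shrinks as more budget is spent on the evaluation.
    return -abs(candidate - 4) - 1.0 / budget


best, history = successive_halving(range(9), toy_score, r_min=10, max_budget=90)
print(best, history)
```

In this toy run, nine candidates are evaluated with 10 units of resource, the three best with 30 units, and the single remaining candidate with 90 units.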

See also

RandomSuccessiveHalving: Random search over a set of parameters using successive halving.

Notes

The parameters selected are those that maximize the score on the held-out data, according to the scoring parameter.

If n_jobs is set to a value higher than one, the data is copied for each parameter setting (and not n_jobs times). This is done for efficiency reasons if individual jobs take very little time, but may raise errors if the dataset is large and not enough memory is available. A workaround in this case is to set pre_dispatch; the data is then copied only pre_dispatch times. A reasonable value for pre_dispatch is 2 * n_jobs.

Attributes

n_candidates_ : int

The number of candidate parameters that were evaluated at the first iteration.

n_remaining_candidates_ : int

The number of candidate parameters that are left after the last iteration.

max_budget_ : int

The maximum number of resources that any candidate is allowed to use for a given iteration. Note that since the number of resources used at each iteration must be a multiple of r_min_, the actual number of resources used at the last iteration may be smaller than max_budget_.

r_min_ : int

The amount of resources that are allocated for each candidate at the first iteration.

n_iterations_ : int

The actual number of iterations that were run. This is equal to n_required_iterations_ if aggressive_elimination is True. Else, this is equal to min(n_possible_iterations_, n_required_iterations_).

n_possible_iterations_ : int

The number of iterations that are possible starting with r_min_ resources and without exceeding max_budget_.

n_required_iterations_ : int

The number of iterations that are required to end up with fewer than ratio candidates at the last iteration, starting with r_min_ resources. This will be larger than n_possible_iterations_ when there isn’t enough budget.
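The relationship between the three iteration counts can be worked through with a short sketch. It is one plausible reading of the attribute descriptions, not code taken from dabl: the helper iteration_counts is hypothetical, and it assumes that the per-candidate resources are multiplied by ratio and the number of candidates is divided by ratio (rounded down) at every iteration.

```python
def iteration_counts(n_candidates, r_min, max_budget, ratio=3,
                     aggressive_elimination=False):
    """Return (n_possible, n_required, n_iterations) under the stated assumptions."""
    # Iterations possible: start at r_min and multiply by ratio
    # for as long as the result does not exceed max_budget.
    n_possible = 1
    budget = r_min
    while budget * ratio <= max_budget:
        budget *= ratio
        n_possible += 1

    # Iterations required: divide the candidates by ratio until
    # fewer than ratio candidates remain.
    n_required = 1
    remaining = n_candidates
    while remaining >= ratio:
        remaining //= ratio
        n_required += 1

    if aggressive_elimination:
        n_iterations = n_required
    else:
        n_iterations = min(n_possible, n_required)
    return n_possible, n_required, n_iterations


# 27 candidates, 10 resources at the first iteration, at most 100 overall.
tight = iteration_counts(27, r_min=10, max_budget=100)
tight_aggressive = iteration_counts(27, r_min=10, max_budget=100,
                                    aggressive_elimination=True)
roomy = iteration_counts(27, r_min=10, max_budget=1000)
print(tight, tight_aggressive, roomy)
```

With a budget of 100, only the resource levels 10, 30 and 90 fit, so three iterations are possible while four would be needed to go from 27 candidates to 1. Raising the budget to 1000 makes five iterations possible, and the four required ones are run.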

cv_results_ : dict of numpy (masked) ndarrays

A dict with keys as column headers and values as columns, that can be imported into a pandas DataFrame.

For instance, the table below

param_kernel	param_gamma	split0_test_score	…	rank_test_score
'rbf'	0.1	0.80	…	1
'rbf'	0.2	0.90	…	2
'rbf'	0.3	0.70	…	2

will be represented by a cv_results_ dict of:

{
'param_kernel' : masked_array(data = ['rbf', 'rbf', 'rbf'],
                              mask = False),
'param_gamma'  : masked_array(data = [0.1 0.2 0.3], mask = False),
'split0_test_score'  : [0.80, 0.90, 0.70],
'split1_test_score'  : [0.82, 0.50, 0.70],
'mean_test_score'    : [0.81, 0.70, 0.70],
'std_test_score'     : [0.01, 0.20, 0.00],
'rank_test_score'    : [1, 2, 2],
'split0_train_score' : [0.80, 0.92, 0.70],
'split1_train_score' : [0.82, 0.55, 0.70],
'mean_train_score'   : [0.81, 0.74, 0.70],
'std_train_score'    : [0.01, 0.19, 0.00],
'mean_fit_time'      : [0.73, 0.63, 0.43],
'std_fit_time'       : [0.01, 0.02, 0.01],
'mean_score_time'    : [0.01, 0.06, 0.04],
'std_score_time'     : [0.00, 0.00, 0.00],
'params'             : [{'kernel' : 'rbf', 'gamma' : 0.1}, ...],
}

NOTE

The key 'params' is used to store a list of parameter settings dicts for all the parameter candidates.

The mean_fit_time, std_fit_time, mean_score_time and std_score_time are all in seconds.

best_estimator_ : estimator or dict

Estimator that was chosen by the search, i.e. the estimator that gave the highest score (or the smallest loss, if specified) on the left-out data. Not available if refit=False.

best_score_ : float

Mean cross-validated score of the best_estimator.

best_params_ : dict

Parameter setting that gave the best results on the held-out data.

best_index_ : int

The index (of the cv_results_ arrays) which corresponds to the best candidate parameter setting.

The dict at search.cv_results_['params'][search.best_index_] gives the parameter setting for the best model, i.e. the one that gives the highest mean score (search.best_score_).
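How the index ties the cv_results_ columns together can be shown without fitting anything. The sketch below uses a plain dict of lists as a stand-in for cv_results_; its values are illustrative, the variable names are hypothetical, and it assumes, as described above, that the best candidate is simply the one with the highest mean test score.

```python
# Stand-in for search.cv_results_: keys are column headers, values are columns.
cv_results = {
    "mean_test_score": [0.81, 0.70, 0.70],
    "std_test_score": [0.01, 0.20, 0.00],
    "params": [
        {"kernel": "rbf", "gamma": 0.1},
        {"kernel": "rbf", "gamma": 0.2},
        {"kernel": "rbf", "gamma": 0.3},
    ],
}

# Index of the candidate with the highest mean test score
# (the first one wins in case of a tie).
scores = cv_results["mean_test_score"]
best_index = max(range(len(scores)), key=scores.__getitem__)

# The same index selects the matching entry in every column.
best_params = cv_results["params"][best_index]
best_score = scores[best_index]

# Turning the column-oriented dict into one record per candidate.
rows = [dict(zip(cv_results, values)) for values in zip(*cv_results.values())]

print(best_index, best_params, best_score)
print(rows[best_index])
```

Because every column has one entry per candidate, a single integer index is enough to look up the parameters, the scores and the timings of any candidate.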

scorer_ : function or a dict

Scorer function used on the held out data to choose the best parameters for the model.

n_splits_ : int

The number of cross-validation splits (folds/iterations).

refit_time_ : float

Seconds used for refitting the best model on the whole dataset.

This is present only if refit is not False.
