dabl.detect_types

dabl.detect_types(X, type_hints=None, max_cat_cardinality='auto', dirty_float_threshold=0.9, near_constant_threshold=0.95, target_col=None, verbose=0)

Detect types of dataframe columns.

Columns are labeled as one of the following types: ‘continuous’, ‘categorical’, ‘low_card_int’, ‘dirty_float’, ‘free_string’, ‘date’, ‘useless’.

Pandas categorical variables, strings and integers of low cardinality, and float columns with only two distinct values are labeled “categorical”. Integers of high cardinality are labeled “continuous”. Integers of intermediate cardinality are labeled “low_card_int”. Float variables that sometimes take string values are labeled “dirty_float”. String variables with many unique values are labeled “free_string” (and currently not processed by dabl). Date types are labeled “date” (and currently not processed by dabl). Anything that is constant, nearly constant, detected as an integer index, or does not match any of the above categories is labeled “useless”.

Parameters:
X: dataframe

Input data.

max_cat_cardinality: int or ‘auto’, default=‘auto’

Maximum number of distinct integer or string values for a column to be considered categorical. ‘auto’ is max(42, n_samples/100).
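As a quick illustration of the ‘auto’ rule stated above, the following sketch evaluates the formula for a few sample sizes. The helper name auto_max_cat_cardinality is hypothetical and is not part of dabl; it only restates the documented expression.

```python
def auto_max_cat_cardinality(n_samples):
    """Restate the documented 'auto' rule: max(42, n_samples / 100).

    Hypothetical helper for illustration only; not a dabl function.
    """
    return max(42, n_samples / 100)


# Print the threshold for a small, a medium and a large dataset.
for n_samples in (500, 4200, 100_000):
    print(n_samples, auto_max_cat_cardinality(n_samples))
```

Under this formula the floor of 42 applies to any dataset with at most 4,200 rows; the threshold only grows with the data beyond that size.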

dirty_float_threshold: float, default=0.9

Fraction of values that must be floats, after removing the five most frequent string values, for a column to be treated as “dirty_float”. Columns below this fraction are considered “useless” or categorical.
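The following is a minimal sketch of one plausible reading of this check, not dabl's actual implementation. It assumes that the five most frequent non-numeric strings are dropped first and that the float fraction is then computed over the remaining values. The function names and the sample column are made up for illustration.

```python
from collections import Counter


def is_float(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def float_fraction_after_top_strings(values, n_top=5):
    """Fraction of float-like entries once the n_top most frequent
    non-numeric strings are dropped (an assumed reading of the docs)."""
    non_floats = Counter(v for v in values if not is_float(v))
    top_strings = {s for s, _ in non_floats.most_common(n_top)}
    remaining = [v for v in values if v not in top_strings]
    if not remaining:
        return 0.0
    return sum(is_float(v) for v in remaining) / len(remaining)


# Made-up column: mostly numbers, plus two placeholder strings.
column = ["1.5", "2.0", "missing", "3.25", "n/a", "4.0", "missing"]
fraction = float_fraction_after_top_strings(column)

# Whether dabl compares with > or >= is not stated; >= is an assumption.
print(fraction, fraction >= 0.9)
```

In this sketch a column with only a handful of placeholder strings such as "missing" passes the threshold, while a column with many distinct non-numeric strings does not.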

target_col: string, int or None

Specifies the target column in the data, if any. Target columns are never dropped.

verbose: int

How verbose to be.

Returns:
res: dataframe, shape (n_columns, 7)

Boolean dataframe of detected types. Rows are columns in input X, columns are possible types (see above).
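To show how a result of this shape might be read, the sketch below builds a hand-written stand-in for the returned dataframe and collapses it to one label per input column. The data is invented and is not real dabl output; the column order and the assumption that each row has exactly one True flag are illustrative only.

```python
import pandas as pd

# The seven type names listed in the description above.
TYPES = ["continuous", "categorical", "low_card_int", "dirty_float",
         "free_string", "date", "useless"]

# Hand-written stand-in for the returned frame (NOT real dabl output).
# Rows are columns of the input X, columns are the possible types.
mock_res = pd.DataFrame(False, index=["age", "city", "row_id"], columns=TYPES)
mock_res.loc["age", "continuous"] = True
mock_res.loc["city", "categorical"] = True
mock_res.loc["row_id", "useless"] = True

# One label per input column: the name of the first flag that is True.
labels = mock_res.idxmax(axis=1)
print(labels)

# Names of the input columns flagged with a given type.
categorical_cols = mock_res.index[mock_res["categorical"]].tolist()
print(categorical_cols)
```

Selecting by a single type column, as in the last step, is a convenient way to route groups of input columns to different preprocessing steps.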