dabl.detect_types

dabl.detect_types(X, type_hints=None, max_int_cardinality='auto', dirty_float_threshold=0.9, near_constant_threshold=0.95, target_col=None, verbose=0)

Detect types of dataframe columns.

Columns are labeled as one of the following types: ‘continuous’, ‘categorical’, ‘low_card_int’, ‘dirty_float’, ‘free_string’, ‘date’, ‘useless’

Pandas categorical variables, strings and integers of low cardinality, and float columns with only two distinct values are labeled as categorical. Integers of high cardinality are labeled as continuous. Integers of intermediate cardinality are labeled as “low_card_int”. Float variables that sometimes take string values are labeled “dirty_float”. String variables with many unique values are labeled “free_string” (and currently not processed by dabl). Date types are labeled as “date” (and currently not processed by dabl). Anything that is constant, nearly constant, detected as an integer index, or doesn’t match any of the above categories is labeled “useless”.
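To make the categories concrete, here is a small pandas frame with one column per situation described above. The column names and values are made up for illustration, and the labels in the comments are what the description suggests one might expect, not verified output of dabl.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
X = pd.DataFrame({
    # many distinct float values: description suggests "continuous"
    "price": [1.5, 2.25, 3.75, 4.1, 5.9, 6.3],
    # strings of low cardinality: description suggests "categorical"
    "color": ["red", "blue", "red", "blue", "red", "blue"],
    # floats taking only two distinct values: description suggests "categorical"
    "flag": [0.0, 1.0, 0.0, 1.0, 1.0, 0.0],
    # mostly numeric with an occasional string: candidate for "dirty_float"
    "reading": [1.2, 3.4, "missing", 5.6, 7.8, 9.0],
    # datetime values: description suggests "date"
    "when": pd.date_range("2020-01-01", periods=6, freq="D"),
    # a single repeated value: description suggests "useless"
    "source": ["sensor_a"] * 6,
})

# dtype and number of distinct values are the properties the
# description above refers to.
print(X.dtypes)
print(X.nunique())
```

Passing such a frame as `X` to `dabl.detect_types` would be the natural next step. On a frame this small the detected labels may differ from the comments.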

Parameters
X: dataframe

Input data.

max_int_cardinality: int or ‘auto’, default=‘auto’

Maximum number of distinct integers for an integer column to be considered categorical. ‘auto’ is max(42, n_samples/10). The statement that integers are also always considered continuous variables is flagged in the source as possibly no longer true.

dirty_float_threshold: float, default=0.9

The fraction of float values required in a dirty continuous column, measured after removing the top 5 string values. Below this fraction the column is considered “useless” or categorical.

target_col: string, int or None, default=None

Specifies the target column in the data, if any. Target columns are never dropped.

verbose: int, default=0

Verbosity level.
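The two threshold parameters can be illustrated with plain pandas. The sketch below is not dabl's implementation. The helper names `auto_max_int_cardinality` and `float_fraction` are hypothetical, and the logic follows only the wording above: ‘auto’ as max(42, n_samples/10), and the share of values that parse as floats once the 5 most frequent string values are set aside.

```python
import pandas as pd


def auto_max_int_cardinality(n_samples):
    """Hypothetical helper: the 'auto' rule as stated in the text."""
    return max(42, n_samples / 10)


def _is_float(value):
    """Return True if the value can be converted to a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def float_fraction(column, n_top=5):
    """Hypothetical helper: share of float-like values in a column
    after setting aside the n_top most frequent non-float strings."""
    parses = column.map(_is_float).astype(bool)
    top_strings = column[~parses].value_counts().index[:n_top]
    rest = parses[~column.isin(top_strings)]
    return rest.mean()


print(auto_max_int_cardinality(100))    # 42
print(auto_max_int_cardinality(1000))   # 100.0

# "n/a" is the only non-float string, so it is set aside before
# the fraction is computed on the remaining values.
readings = pd.Series(["1.5", "2.0", "n/a", "3.25", "n/a", "4.0"])
frac = float_fraction(readings)
print(frac)  # compare against dirty_float_threshold (default 0.9)
```

Under this reading, a column with a handful of recurring placeholder strings still counts as mostly float, which is the case the “dirty_float” label is meant to capture.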

Returns
res: dataframe, shape (n_columns, 7)

Boolean dataframe of detected types. Rows are columns in input X, columns are possible types (see above).
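Because the result is a boolean frame with one row per input column, a per-column label can be read off with `idxmax(axis=1)`. The frame below is a hand-built stand-in with the documented layout, not real output. With dabl installed it would come from `dabl.detect_types(X)` instead, and the column order shown here is an assumption.

```python
import pandas as pd

types = ["continuous", "categorical", "low_card_int", "dirty_float",
         "free_string", "date", "useless"]

# Hand-built stand-in for the return value
# (assumed: exactly one True per row).
res = pd.DataFrame(False, index=["price", "color", "when"], columns=types)
res.loc["price", "continuous"] = True
res.loc["color", "categorical"] = True
res.loc["when", "date"] = True

# One label per input column: the name of the type column holding True.
labels = res.idxmax(axis=1)
print(labels)

# Names of all input columns detected as a given type.
categorical_cols = res[res["categorical"]].index.tolist()
print(categorical_cols)
```

Selecting by type this way is convenient for routing columns to different preprocessing steps.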