.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot/plot_categorical_types_adult.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_plot_categorical_types_adult.py:


Comparing categorical variable visualizations
=============================================

This example showcases the four visualization types supported for categorical
variables in classification: 'count', 'proportion', 'mosaic' and 'sankey'.

.. GENERATED FROM PYTHON SOURCE LINES 8-14

.. code-block:: Python


    from dabl.plot import plot_classification_categorical
    from dabl.datasets import load_adult

    data = load_adult()

.. GENERATED FROM PYTHON SOURCE LINES 15-20

The 'count' plot is the easiest to understand and the closest to the data, as
it simply shows a bar plot of class counts per category. However, it makes
comparisons between categories difficult. For workclass, for example, the
differences in class proportions among the categories are hard to see.

.. GENERATED FROM PYTHON SOURCE LINES 21-23

.. code-block:: Python


    plot_classification_categorical(data, target_col='income', kind="count")


.. image-sg:: /auto_examples/plot/images/sphx_glr_plot_categorical_types_adult_001.png
   :alt: Categorical Features vs Target, relationship, marital-status, education, occupation, hours-per-week, gender, workclass, native-country, race
   :srcset: /auto_examples/plot/images/sphx_glr_plot_categorical_types_adult_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/dabl/preprocessing.py:172: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
      pd.to_datetime(series[:10])
    /home/circleci/project/~/miniconda/envs/testenv/lib/python3.11/site-packages/seaborn/categorical.py:641: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
      grouped_vals = vals.groupby(grouper)

    array([[, , ], [, , ], [, , ]], dtype=object)



.. GENERATED FROM PYTHON SOURCE LINES 24-29

The 'proportion' plot, on the other hand, shows *only* the proportions, so we
can see that the proportions in state-government, government, and
self-employed are nearly the same. However, 'proportion' does not show how
many samples are in each category, and how strongly each category is
represented in the data can be very important.

.. GENERATED FROM PYTHON SOURCE LINES 30-32

.. code-block:: Python


    plot_classification_categorical(data, target_col='income', kind="proportion")


.. image-sg:: /auto_examples/plot/images/sphx_glr_plot_categorical_types_adult_002.png
   :alt: Categorical Features vs Target, relationship, marital-status, education, education-num, occupation, hours-per-week, gender, workclass, native-country, race
   :srcset: /auto_examples/plot/images/sphx_glr_plot_categorical_types_adult_002.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/dabl/preprocessing.py:172: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
      pd.to_datetime(series[:10])
    /home/circleci/project/dabl/plot/supervised.py:539: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
      df = (X_new.groupby(col)[target_col]

    array([[, , , , ], [, , , , ]], dtype=object)



.. GENERATED FROM PYTHON SOURCE LINES 33-37

The 'mosaic' plot shows both the class proportions within each category (on
the x axis) and the proportion of the category in the data (on the y axis).
The 'mosaic' plot can be a bit busy; in particular, if there are many classes
and many categories, it becomes harder to interpret.

.. GENERATED FROM PYTHON SOURCE LINES 38-39

.. code-block:: Python

    plot_classification_categorical(data, target_col='income', kind="mosaic")


.. image-sg:: /auto_examples/plot/images/sphx_glr_plot_categorical_types_adult_003.png
   :alt: Categorical Features vs Target, relationship, marital-status, education, education-num, occupation, hours-per-week, gender, workclass, native-country, race
   :srcset: /auto_examples/plot/images/sphx_glr_plot_categorical_types_adult_003.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/dabl/preprocessing.py:172: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
      pd.to_datetime(series[:10])

    array([[, , , , ], [, , , , ]], dtype=object)



.. GENERATED FROM PYTHON SOURCE LINES 40-44

The 'sankey' plot is even busier, as it combines the features of the 'count'
plot with an alluvial flow diagram of interactions. By default, only the 5
most common features are included in the sankey diagram; this can be adjusted
by calling the plot_sankey function directly.

.. GENERATED FROM PYTHON SOURCE LINES 45-47

.. code-block:: Python


    plot_classification_categorical(data, target_col='income', kind="sankey")


.. image-sg:: /auto_examples/plot/images/sphx_glr_plot_categorical_types_adult_004.png
   :alt: plot categorical types adult
   :srcset: /auto_examples/plot/images/sphx_glr_plot_categorical_types_adult_004.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/circleci/project/dabl/preprocessing.py:172: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
      pd.to_datetime(series[:10])
    /home/circleci/project/dabl/plot/sankey.py:264: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
      sizes = data.groupby(data.columns.tolist()).size()
    /home/circleci/project/dabl/plot/sankey.py:144: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
      weights = source.groupby(col)[weight_col].sum()
    /home/circleci/project/dabl/plot/sankey.py:179: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '0.0007115026257834999' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
      source.loc[i, coord_col_name] = coord[1]
    /home/circleci/project/dabl/plot/sankey.py:179: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '0.14092156530577674' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
      source.loc[i, coord_col_name] = coord[1]
    /home/circleci/project/dabl/plot/sankey.py:179: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '0.0005420972386921904' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
      source.loc[i, coord_col_name] = coord[1]
    /home/circleci/project/dabl/plot/sankey.py:179: FutureWarning: Setting an item of incompatible dtype is deprecated and will raise in a future error of pandas. Value '0.05476198543113671' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.
      source.loc[i, coord_col_name] = coord[1]




.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 9.226 seconds)


.. _sphx_glr_download_auto_examples_plot_plot_categorical_types_adult.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_categorical_types_adult.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_categorical_types_adult.py `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_
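
The trade-off between the 'count' and 'proportion' views described in this
example can be made concrete with a minimal sketch. It uses only the Python
standard library and a small made-up sample (not the adult dataset); the
helper names ``class_counts`` and ``class_proportions`` are hypothetical and
are not part of dabl.

.. code-block:: Python

    from collections import Counter, defaultdict

    # Hypothetical toy sample of (category, class) pairs; NOT the adult dataset.
    samples = (
        [("private", "<=50K")] * 75 + [("private", ">50K")] * 25
        + [("self-employed", "<=50K")] * 3 + [("self-employed", ">50K")] * 1
    )


    def class_counts(pairs):
        """Class counts per category: the quantity a 'count' view displays."""
        counts = defaultdict(Counter)
        for category, label in pairs:
            counts[category][label] += 1
        return counts


    def class_proportions(counts):
        """Class shares within each category: what a 'proportion' view displays."""
        return {
            category: {label: n / sum(by_class.values())
                       for label, n in by_class.items()}
            for category, by_class in counts.items()
        }


    counts = class_counts(samples)
    proportions = class_proportions(counts)

    # Print both views side by side for each category.
    for category in counts:
        print(category, dict(counts[category]), proportions[category])

In this toy sample both categories have the same share of the ``>50K`` class,
which the proportions show directly, while only the counts reveal that one
category holds 100 samples and the other just 4.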