###############################################
Quickstart to ML with dabl
###############################################

Let's dive right in!

Let's start with the classic. You have the ``titanic.csv`` file and want to
predict whether a passenger survived or not based on the information about the
passenger in that file.

We know that for tabular data like this, pandas is our friend. Clearly we need
to start with loading our data:

>>> import pandas as pd
>>> import dabl
>>> titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))

Let's familiarize ourselves with the data a bit: what's the shape, what are
the columns, and what do they look like?

>>> titanic.shape
(1309, 14)

>>> titanic.head() # doctest: +ELLIPSIS
   pclass  survived  ...  body                        home.dest
0       1         1  ...     ?                     St Louis, MO
1       1         1  ...     ?  Montreal, PQ / Chesterville, ON
2       1         0  ...     ?  Montreal, PQ / Chesterville, ON
3       1         0  ...   135  Montreal, PQ / Chesterville, ON
4       1         0  ...     ?  Montreal, PQ / Chesterville, ON
<BLANKLINE>
[5 rows x 14 columns]

So far so good! There is already a bunch going on in this data that we can see
here, but let's ask dabl what it thinks by cleaning up the data:

>>> titanic_clean = dabl.clean(titanic, verbose=0)

This provides us with lots of information about what is happening in the
different columns. In this case we might have been able to figure this out
quickly from the call to ``head``, but in larger datasets this can be a bit
tricky. For example, we can see that there are several dirty columns with "?"
in them. This is probably a marker for a missing value, and we could go back
and fix our parsing of the CSV (a sketch of that fix follows the plotting
example below), but for now let's continue with what dabl does automatically.

In dabl, we can also get a best guess of the column types in a convenient
format:

>>> types = dabl.detect_types(titanic_clean)
>>> print(types) # doctest: +ELLIPSIS
                      continuous  dirty_float  ...  free_string  useless
pclass                     False        False  ...        False    False
survived                   False        False  ...        False    False
name                       False        False  ...         True    False
sex                        False        False  ...        False    False
sibsp                      False        False  ...        False    False
parch                      False        False  ...        False    False
ticket                     False        False  ...         True    False
cabin                      False        False  ...         True    False
embarked                   False        False  ...        False    False
boat                       False        False  ...        False    False
home.dest                  False        False  ...         True    False
age_?                      False        False  ...        False    False
age_dabl_continuous         True        False  ...        False    False
fare_?                     False        False  ...        False     True
fare_dabl_continuous        True        False  ...        False    False
body_?                     False        False  ...        False    False
body_dabl_continuous        True        False  ...        False    False
<BLANKLINE>
[17 rows x 7 columns]

Having a very rough idea of the shape of our data, we can now start looking at
the actual content. The easiest way to do that is to visualize univariate and
bivariate patterns. With ``plot``, we can create plots of the features deemed
most important for our task.

>>> dabl.plot(titanic, 'survived')
Target looks like classification
Linear Discriminant Analysis training set score: 0.578

.. plot::

    >>> import pandas as pd
    >>> import dabl
    >>> titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))
    >>> dabl.plot(titanic, 'survived')
    Target looks like classification
    Linear Discriminant Analysis training set score: 0.578
    >>> import matplotlib.pyplot as plt; plt.show()
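
As an aside, here is what fixing the parsing ourselves might look like. This
is a minimal sketch rather than part of the dabl workflow, and the
``titanic_raw`` name is ours: pandas' standard ``na_values`` option tells
``read_csv`` to treat "?" as a missing value while parsing, so columns like
``age`` should come back numeric with NaNs instead of as strings.

>>> # Sketch: map "?" to NaN at parse time instead of relying on dabl.clean
>>> titanic_raw = pd.read_csv(
...     dabl.datasets.data_path("titanic.csv"), na_values="?")  # doctest: +SKIP

The rest of this guide keeps working with the automatically cleaned data.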

Finally, we can find an initial model for our data. The ``SimpleClassifier``
does all the work for us. It implements the familiar scikit-learn API of
``fit`` and ``predict``.

>>> fc = dabl.SimpleClassifier(random_state=0)
>>> X = titanic_clean.drop("survived", axis=1)
>>> y = titanic_clean.survived
>>> fc.fit(X, y) # doctest: +SKIP
DummyClassifier(strategy='prior')
accuracy: 0.618 average_precision: 0.382 recall_macro: 0.500 roc_auc: 0.500
new best (using recall_macro):
accuracy             0.618
average_precision    0.382
recall_macro         0.500
roc_auc              0.500
Name: DummyClassifier(strategy='prior'), dtype: float64
GaussianNB()
accuracy: 0.897 average_precision: 0.870 recall_macro: 0.902 roc_auc: 0.919
new best (using recall_macro):
accuracy             0.897
average_precision    0.870
recall_macro         0.902
roc_auc              0.919
Name: GaussianNB(), dtype: float64
MultinomialNB()
accuracy: 0.888 average_precision: 0.981 recall_macro: 0.891 roc_auc: 0.985
DecisionTreeClassifier(class_weight='balanced', max_depth=1)
accuracy: 0.976 average_precision: 0.954 recall_macro: 0.971 roc_auc: 0.971
new best (using recall_macro):
accuracy             0.976
average_precision    0.954
recall_macro         0.971
roc_auc              0.971
Name: DecisionTreeClassifier(class_weight='balanced', max_depth=1), dtype: float64
DecisionTreeClassifier(class_weight='balanced', max_depth=5)
accuracy: 0.957 average_precision: 0.943 recall_macro: 0.953 roc_auc: 0.970
DecisionTreeClassifier(class_weight='balanced', min_impurity_decrease=0.01)
accuracy: 0.976 average_precision: 0.954 recall_macro: 0.971 roc_auc: 0.971
LogisticRegression(C=0.1, class_weight='balanced')
accuracy: 0.963 average_precision: 0.986 recall_macro: 0.961 roc_auc: 0.989
Best model:
DecisionTreeClassifier(class_weight='balanced', max_depth=1)
Best Scores:
accuracy             0.976
average_precision    0.954
recall_macro         0.971
roc_auc              0.971
Name: DecisionTreeClassifier(class_weight='balanced', max_depth=1), dtype: float64
SimpleClassifier(random_state=0, refit=True, verbose=1)

Alternatively, we could also use the same interface as before and pass the
whole data frame and specify the target column, as in the sketch below.
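
A minimal sketch of that variant, assuming your version of dabl accepts a
``target_col`` keyword in ``fit`` (check the API docs for your release if in
doubt):

>>> fc = dabl.SimpleClassifier(random_state=0)
>>> fc.fit(titanic_clean, target_col="survived")  # doctest: +SKIP

This mirrors the ``dabl.plot(titanic, 'survived')`` call above: the data
frame carries both features and target, and the target column is named
explicitly.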