California Housing Regression#

What are we going to do?#

In this quick demo, we will use JAI to:

  • Train and deploy models into a secure and scalable production-ready environment.

  • Regression - Determine how each analyzed attribute influences the value of the house.

  • Model inference - Predict the value of new houses and check the results.


Importing Libraries#

[1]:
import pandas as pd
from jai import Jai

JAI Auth Key#

If you don’t already have an auth key, you can get your auth key here - free forever. Also, please make sure to check your spam folder if you can’t find it in your inbox!

[ ]:
from jai import get_auth_key
get_auth_key(email = 'email@emailnet.com', firstName = 'JAI', lastName = 'Z')
<Response [201]>

Dataset quick look#

This dataset contains data collected by the 1990 U.S. Census concerning housing in the state of California. It contains data such as average number of rooms per dwelling per block group, median income in block group and block group population, a block group being the smallest geographical unit per which the U.S. Census publishes data (typically an area ranging from 600-3000 in population). As the target, the dataset contains the median value of a house per block group, expressed in hundreds of thousands of dollars.

[2]:
from sklearn.datasets import fetch_california_housing
california_housing = fetch_california_housing(as_frame=True)
california_df = pd.DataFrame(data = california_housing.data, columns = california_housing.feature_names)
# target is true median value of house per block group
california_df['price'] = california_housing.target
[3]:
california_df.info()
# More information about what each column represents can be found on https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
 8   price       20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
[4]:
# Showing first five rows of DataFrame
california_df.head()
[4]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude price
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422

Inserting data into Jai#

To be able to use Jai’s functionalities, we first need to turn the data into a Jai collection. The method used to send data to Jai is j.setup (or j.fit; they are the same), which can then be consumed through the methods j.similar and j.predict. By using the setup method you add your raw data to your JAI environment, use the data to train your model based on the chosen model type and your model’s latent vector representation is then stored in the Jai collection.

[5]:
# Instantiate Jai class
j = Jai()

j.setup(
    #JAI collection name
    name = 'california_housing',

    #data to be processed - a Pandas DataFrame is expected
    data = california_df,

    #collection type
    db_type = 'Supervised',

    #verbose 2 -> shows the loss graph at the end of training
    verbose = 2,

    label =
    {
        'task':'regression',
        'label_name':'price'
    },

    # Overwrite allows you to overwrite other existing databases with the same name
    overwrite = True
)
Insert Data: 100%|██████████| 2/2 [00:00<00:00,  2.41it/s]
Training might finish early due to early stopping criteria.

Recognized setup args:
- db_type: Supervised
- label:
  * task             : regression
  * label_name       : price
  * regression_scaler: None
  * quantiles        : []
JAI is working: 100%|██████████|22/22 [00:36]
../../_images/source_examples_California_Regression_9_3.png

Setup Report:
Metrics Regression:
MAE: 0.44369176030158997
MSE: 0.42195257544517517
MAPE: 0.23954549431800842
R2 Score: 0.6779993327463893
Pinball Loss 0.5: 0.22184588015079498

Best model at epoch: 25 val_loss: 0.41
[5]:
({1: {'Task': 'Adding new data for tabular setup',
   'Status': 'Completed',
   'Description': 'Insertion completed.',
   'Interrupted': False},
  0: {'Task': 'Adding new data for tabular setup',
   'Status': 'Completed',
   'Description': 'Insertion completed.',
   'Interrupted': False}},
 {'Task': 'Training Model',
  'Status': 'Job Created',
  'Description': 'Check status after some time!',
  'kwargs': {'db_type': '"Supervised"',
   'label': '{"task": "regression", "label_name": "price", "regression_scaler": "None", "quantiles": []}'}})

Model Inference#

We can use the trained model for inference

[8]:
# every JAI Supervised collection can be used for inference using j.predict()
ans = j.predict(
    # collection to be queried
    name = 'california_housing',
    # Remember to drop the label as it is not a feature!
    # Predicting the value of the first 5 houses in the dataset
    data = california_df.head(1).drop('price',axis = 1),
    as_frame = True
)
Predict: 100%|██████████| 1/1 [00:00<00:00,  3.75it/s]
Predict Processing: 100%|██████████| 1/1 [00:00<00:00, 18893.26it/s]
[9]:
ans['Target'] = california_housing.target
ans
[9]:
predict Target
id
0 4.017475 4.526