Linear Models#

Basics#

In this section, we’ll show how to use Jai to train a linear model.

[1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

california_housing = fetch_california_housing(as_frame=True)
df = california_housing['frame']

# The target is the median house value per block group (in units of $100,000)
target = california_housing.target
[2]:
X_train, X_test, y_train, y_test = train_test_split(df, target)

Linear model#

Here is how to train a linear model using the LinearModel module.

We use scikit-learn’s models in the back end, so you can use most of the parameters as described in the scikit-learn documentation.

Tasks:

- regression: LinearRegression
- sgd_regression: SGDRegressor
- classification: LogisticRegression
- sgd_classification: SGDClassifier
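
As a quick illustration of how those task names map onto instances, here is a minimal sketch. The collection names below are arbitrary placeholders; only the task string changes.

from jai import LinearModel

# Each task string selects the corresponding scikit-learn estimator in the back end
reg = LinearModel("my_regression", "regression")                      # LinearRegression
sgd_reg = LinearModel("my_sgd_regression", "sgd_regression")          # SGDRegressor
clf = LinearModel("my_classification", "classification")              # LogisticRegression
sgd_clf = LinearModel("my_sgd_classification", "sgd_classification")  # SGDClassifier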

[3]:
from jai import LinearModel
model = LinearModel("california_linear", "regression")
report = model.fit(X_train, y_train, overwrite=True)
[4]:
# After training, the model is ready to be consumed
model.predict(X_test)
[4]:
predict
id
0 1.308
1 0.885
2 2.193
3 3.427
4 3.042
... ...
5155 3.442
5156 0.665
5157 1.487
5158 2.750
5159 2.329

5160 rows × 1 columns
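
Since predict() returns a DataFrame with a single predict column (as shown above), you can score the held-out set with standard scikit-learn metrics. This is a minimal sketch; it assumes the predictions come back in the same row order as X_test.

from sklearn.metrics import mean_absolute_error, r2_score

preds = model.predict(X_test)

# Compare against the held-out targets, assuming row order matches X_test
y_pred = preds["predict"].to_numpy()
print("MAE:", mean_absolute_error(y_test.to_numpy(), y_pred))
print("R2: ", r2_score(y_test.to_numpy(), y_pred))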

[5]:
# You can improve the model using one new sample
model.learn(X_test.iloc[[0]], y_test.iloc[[0]])
[5]:
{'before': {'MAE': 2.1094237467877974e-14,
  'MSE': 4.44966854351227e-28,
  'MAPE': 1.6127092865350133e-14},
 'after': {'MAE': 2.1094237467877974e-14,
  'MSE': 4.44966854351227e-28,
  'MAPE': 1.6127092865350133e-14},
 'change': True}
[6]:
# Or you can improve the model using multiple new samples
model.learn(X_test.iloc[1:4], y_test.iloc[1:4])
[6]:
{'before': {'MAE': 1.021405182655144e-14,
  'MSE': 1.3995707226796117e-28,
  'MAPE': 8.507385228034694e-15,
  'R2_Score': 1.0},
 'after': {'MAE': 1.021405182655144e-14,
  'MSE': 1.3995707226796117e-28,
  'MAPE': 8.507385228034694e-15,
  'R2_Score': 1.0},
 'change': True}
[7]:
model.predict(X_test)
[7]:
predict
id
0 1.308
1 0.885
2 2.193
3 3.427
4 3.042
... ...
5155 3.442
5156 0.665
5157 1.487
5158 2.750
5159 2.329

5160 rows × 1 columns

Pretrained Bases application#

In Jai, you can reuse previously trained collections for new tasks.

So if we have trained a model before, we can now build a linear model using the vectors from that collection.

[8]:
from jai import Trainer

# Let's create a collection using only the features of the dataset.
trainer = Trainer("pretrained_california")

trainer.set_parameters(db_type="SelfSupervised")

Recognized fit arguments:
- db_type: SelfSupervised
[9]:
query = trainer.fit(X_train, overwrite=True)
Insert Data: 100%|██████████| 1/1 [00:00<00:00,  1.73it/s]

Recognized fit arguments:
- db_type: SelfSupervised
JAI is working: 100%|██████████|20/20 [02:22]

Setup Report:

Best model at epoch: 63 val_loss: 0.71

Now we’ll build a linear model using that collection.

We’ll first build a mapping to train the new linear model.

In this case, we’ll use only one feature, id_california, which holds the id values from the previous collection. Each id corresponds to a vector stored in Jai, and those vectors are what the linear model is trained on.

[10]:
import pandas as pd
df_train = pd.DataFrame(X_train.index, index=X_train.index, columns=["id_california"])
[11]:
model = LinearModel("california_pretrained", "regression")
report = model.fit(
    df_train,
    y_train,
    pretrained_bases=[{"id_name": "id_california", "db_parent": "pretrained_california"}],
    overwrite=True,
)
[12]:
# You can add the new data to the previous collection
trainer.append(X_test)
Insert Data: 100%|██████████| 1/1 [00:00<00:00,  3.75it/s]
JAI is working: 100%|██████████|20/20 [00:02]
[12]:
({0: {'Task': 'Adding new data for tabular setup',
   'Status': 'Completed',
   'Description': 'Insertion completed.',
   'Interrupted': False}},
 {'Task': 'Adding new data to database',
  'Status': 'Running',
  'Description': 'Task started to run now',
  'Interrupted': False})
[13]:
# And map the ids to make the prediction
df_test = pd.DataFrame(X_test.index, index=X_test.index, columns=["id_california"])
model.predict(df_test)
[13]:
predict
id
0 1.386100
1 0.785691
2 2.185110
3 3.764664
4 3.319453
... ...
5155 4.088492
5156 0.703170
5157 1.311731
5158 2.087266
5159 2.143795

5160 rows × 1 columns

[14]:
# Or, if you don't want to modify the previous collection, you can consume the model using the original raw data
model.predict(X_test)
[14]:
predict
id
0 1.386100
1 0.785691
2 2.185110
3 3.764664
4 3.319453
... ...
5155 4.088492
5156 0.703170
5157 1.311731
5158 2.087266
5159 2.143795

5160 rows × 1 columns
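
To wrap up, you can compare the model trained on the raw features with the one built on the pretrained collection, using the same held-out data. This is a minimal sketch with scikit-learn metrics; it assumes a LinearModel instance pointing at an already-trained collection can be used for predict() without refitting, and that predictions come back in the same row order as X_test.

from jai import LinearModel
from sklearn.metrics import mean_absolute_error

# Reconnect to the model trained directly on the raw features
# (assumption: predict() works on an existing collection without calling fit again)
plain_model = LinearModel("california_linear", "regression")
plain_preds = plain_model.predict(X_test)["predict"].to_numpy()

# `model` is the pretrained-base linear model from the cells above
pretrained_preds = model.predict(X_test)["predict"].to_numpy()

y_true = y_test.to_numpy()
print("Raw-feature model MAE:    ", mean_absolute_error(y_true, plain_preds))
print("Pretrained-base model MAE:", mean_absolute_error(y_true, pretrained_preds))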