US Census Classification#

What are we going to do?#

In this quick demo, we will use JAI to:

  • Train and deploy models into a secure and scalable production-ready environment.

  • Complete a classification task: predict whether household income exceeds $50K/yr based on census data.


Importing libraries#

[1]:
from jai import Jai
import pandas as pd

JAI Auth Key#

If you don’t already have an auth key, you can request one with get_auth_key, as in the cell below; it is free forever. The key arrives by email, so check your spam folder if you can’t find it in your inbox.

[ ]:
from jai import get_auth_key
get_auth_key(email = 'email@emailnet.com', firstName = 'JAI', lastName = 'Z')
<Response [201]>

Dataset quick look#

This dataset contains data collected by the 1994 U.S. Census. It includes personal information such as education, marital status, occupation and sex. In this example we will use that data to predict whether an individual makes more or less than 50k dollars per year.

[2]:
df = pd.read_csv('https://myceliademo.blob.core.windows.net/census-us/adult.csv?sv=2020-04-08&st=2021-05-17T18%3A19%3A59Z&se=2025-01-18T18%3A19%3A00Z&sr=b&sp=r&sig=sH%2B2Za%2FTuXsatqgmRX3eG%2FQfZTh1M2ptMUi8NTXBXF4%3D')
df = df.reset_index().rename(columns={'index':'id'})
[3]:
# Show name of columns and non-null count
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              32561 non-null  int64
 1   age             32561 non-null  int64
 2   workclass       32561 non-null  object
 3   fnlwgt          32561 non-null  int64
 4   education       32561 non-null  object
 5   education.num   32561 non-null  int64
 6   marital.status  32561 non-null  object
 7   occupation      32561 non-null  object
 8   relationship    32561 non-null  object
 9   race            32561 non-null  object
 10  sex             32561 non-null  object
 11  capital.gain    32561 non-null  int64
 12  capital.loss    32561 non-null  int64
 13  hours.per.week  32561 non-null  int64
 14  native.country  32561 non-null  object
 15  income          32561 non-null  object
dtypes: int64(7), object(9)
memory usage: 4.0+ MB
[4]:
# Show first 5 lines of dataframe
df.head()
[4]:
   id  age  workclass  fnlwgt     education  education.num  marital.status         occupation   relationship   race     sex  capital.gain  capital.loss  hours.per.week  native.country  income
0   0   90          ?   77053       HS-grad              9         Widowed                  ?  Not-in-family  White  Female             0          4356              40   United-States   <=50K
1   1   82    Private  132870       HS-grad              9         Widowed    Exec-managerial  Not-in-family  White  Female             0          4356              18   United-States   <=50K
2   2   66          ?  186061  Some-college             10         Widowed                  ?      Unmarried  Black  Female             0          4356              40   United-States   <=50K
3   3   54    Private  140359       7th-8th              4        Divorced  Machine-op-inspct      Unmarried  White  Female             0          3900              40   United-States   <=50K
4   4   41    Private  264663  Some-college             10       Separated     Prof-specialty      Own-child  White  Female             0          3900              40   United-States   <=50K
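
In the preview above, workclass and occupation contain '?' even though df.info() reports no null values, which suggests that missing values are encoded as the string '?'. The following is a minimal sketch, using a small hand-built sample of the rows shown above rather than the full dataset, of how such placeholders could be counted and converted to proper missing values with pandas. Whether JAI needs this step is not stated here; the demo passes the data as-is.

```python
import numpy as np
import pandas as pd

# Small sample copied from the first rows shown above (not the full dataset).
sample = pd.DataFrame({
    "id": [0, 1, 2],
    "workclass": ["?", "Private", "?"],
    "occupation": ["?", "Exec-managerial", "?"],
    "education": ["HS-grad", "HS-grad", "Some-college"],
})

# Count how many '?' placeholders each column holds.
placeholder_counts = (sample == "?").sum()

# Replace the placeholder with NaN so pandas treats it as missing.
cleaned = sample.replace("?", np.nan)
missing_counts = cleaned.isna().sum()

print(placeholder_counts)
print(missing_counts)
```

Applied to the full dataframe, the same two calls would show which columns hide missing values behind the placeholder.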

Inserting data into Jai#

To use Jai’s functionality, we first need to turn the data into a Jai collection. The method that sends data to Jai is j.setup (or j.fit; they are the same), and the resulting collection can then be consumed through the methods j.similar and j.predict. Calling setup adds your raw data to your JAI environment, trains a model of the chosen type on that data, and stores the model’s latent vector representation in the Jai collection.

[5]:
# Instantiate Jai class
j = Jai()

ans = j.setup(
    # Name of Jai collection
    name = 'census',
    # verbose = 2 shows loss graph at the end of the setup
    verbose = 2,
    # data to be inserted into Jai - a pandas dataframe is expected
    data = df,

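    # Collection type; options: Text, FastText, TextEdit, Image, Supervised, Selfsupervised
    db_type = 'Supervised',

    # Supervised task; options: classification, metric_classification, regression, quantile_regression
    label = {'task':'classification', 'label_name':'income'},

    # Stratified train/test split on the income column
    split = {'type':'stratified', 'split_column':'income', 'test_size':0.1},

    # Replace any existing collection with the same name
    overwrite = True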
    )
Insert Data: 100%|██████████| 2/2 [00:01<00:00,  1.07it/s]
Training might finish early due to early stopping criteria.

Recognized setup args:
- db_type: Supervised
- label:
  * task             : classification
  * label_name       : income
  * regression_scaler: None
  * quantiles        : []
JAI is working: 100%|██████████|22/22 [00:36]
[Figure: loss graph shown at the end of the setup (verbose = 2)]

Setup Report:
Metrics classification:
              precision    recall  f1-score   support

       <=50K       0.88      0.94      0.91      4976
        >50K       0.75      0.58      0.65      1536

    accuracy                           0.86      6512
   macro avg       0.82      0.76      0.78      6512
weighted avg       0.85      0.86      0.85      6512


Best model at epoch: 05 val_loss: 0.31
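
The f1-score and average rows of the setup report can be re-derived from its precision, recall and support columns. The sketch below uses the rounded values printed in the report, so it reproduces the report only up to two decimals; the helper name f1_score is introduced here for illustration and is not part of JAI.

```python
# Precision, recall and support copied from the setup report above.
report = {
    "<=50K": {"precision": 0.88, "recall": 0.94, "support": 4976},
    ">50K": {"precision": 0.75, "recall": 0.58, "support": 1536},
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-class F1.
f1 = {label: f1_score(m["precision"], m["recall"]) for label, m in report.items()}

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: mean over classes weighted by support.
total = sum(m["support"] for m in report.values())
weighted_f1 = sum(f1[label] * m["support"] for label, m in report.items()) / total

for label, value in f1.items():
    print(f"{label}: f1 = {value:.2f}")
print(f"macro avg f1 = {macro_f1:.2f}, weighted avg f1 = {weighted_f1:.2f}")
```

The gap between the macro and weighted averages reflects the class imbalance: the >50K class has the lower F1 but only about a quarter of the support.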

Model Inference#

We can use the trained model to run inference on specific rows of the dataset or on new data.

[6]:
j.predict('census',
          data = df.tail(1).drop('income',axis = 1),
          # predict_proba = True shows the probability that the item belongs to each class
          predict_proba = True)
Predict: 100%|██████████| 1/1 [00:00<00:00,  4.16it/s]
[6]:
[{'id': 32560,
  'predict': {'<=50K': 0.9997827410697937, '>50K': 0.0002172834356315434}}]
[7]:
j.predict('census',
          data = df.head().drop('income',axis = 1),
          # predict_proba = True shows the probability that the item belongs to each class
          predict_proba = True)
Predict: 100%|██████████| 1/1 [00:00<00:00,  3.79it/s]
[7]:
[{'id': 0,
  'predict': {'<=50K': 0.9080128073692322, '>50K': 0.09198720753192902}},
 {'id': 1,
  'predict': {'<=50K': 0.7557201981544495, '>50K': 0.24427981674671173}},
 {'id': 2,
  'predict': {'<=50K': 0.7873145937919617, '>50K': 0.21268539130687714}},
 {'id': 3,
  'predict': {'<=50K': 0.9710868000984192, '>50K': 0.02891319803893566}},
 {'id': 4,
  'predict': {'<=50K': 0.725451648235321, '>50K': 0.27454838156700134}}]
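
With predict_proba = True, each item holds one probability per class rather than a single label. A minimal sketch of turning that structure into one label per id by picking the most probable class; the input is the output of the five-row call above, and the helper name to_labels is hypothetical.

```python
# Output of the five-row j.predict call above, copied verbatim.
predictions = [
    {'id': 0, 'predict': {'<=50K': 0.9080128073692322, '>50K': 0.09198720753192902}},
    {'id': 1, 'predict': {'<=50K': 0.7557201981544495, '>50K': 0.24427981674671173}},
    {'id': 2, 'predict': {'<=50K': 0.7873145937919617, '>50K': 0.21268539130687714}},
    {'id': 3, 'predict': {'<=50K': 0.9710868000984192, '>50K': 0.02891319803893566}},
    {'id': 4, 'predict': {'<=50K': 0.725451648235321, '>50K': 0.27454838156700134}},
]


def to_labels(predictions):
    """Map each id to (most probable class, probability of that class)."""
    labels = {}
    for item in predictions:
        probs = item['predict']
        best = max(probs, key=probs.get)  # class with the highest probability
        labels[item['id']] = (best, probs[best])
    return labels


labels = to_labels(predictions)
for row_id, (label, prob) in labels.items():
    print(row_id, label, round(prob, 3))
```

All five rows come out as <=50K, which agrees with the income column shown for these rows in the dataframe preview.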

Requests via REST API#

[8]:
body = df.head().drop('income',axis=1).to_dict(orient='records')
body
[8]:
[{'id': 0,
  'age': 90,
  'workclass': '?',
  'fnlwgt': 77053,
  'education': 'HS-grad',
  'education.num': 9,
  'marital.status': 'Widowed',
  'occupation': '?',
  'relationship': 'Not-in-family',
  'race': 'White',
  'sex': 'Female',
  'capital.gain': 0,
  'capital.loss': 4356,
  'hours.per.week': 40,
  'native.country': 'United-States'},
 {'id': 1,
  'age': 82,
  'workclass': 'Private',
  'fnlwgt': 132870,
  'education': 'HS-grad',
  'education.num': 9,
  'marital.status': 'Widowed',
  'occupation': 'Exec-managerial',
  'relationship': 'Not-in-family',
  'race': 'White',
  'sex': 'Female',
  'capital.gain': 0,
  'capital.loss': 4356,
  'hours.per.week': 18,
  'native.country': 'United-States'},
 {'id': 2,
  'age': 66,
  'workclass': '?',
  'fnlwgt': 186061,
  'education': 'Some-college',
  'education.num': 10,
  'marital.status': 'Widowed',
  'occupation': '?',
  'relationship': 'Unmarried',
  'race': 'Black',
  'sex': 'Female',
  'capital.gain': 0,
  'capital.loss': 4356,
  'hours.per.week': 40,
  'native.country': 'United-States'},
 {'id': 3,
  'age': 54,
  'workclass': 'Private',
  'fnlwgt': 140359,
  'education': '7th-8th',
  'education.num': 4,
  'marital.status': 'Divorced',
  'occupation': 'Machine-op-inspct',
  'relationship': 'Unmarried',
  'race': 'White',
  'sex': 'Female',
  'capital.gain': 0,
  'capital.loss': 3900,
  'hours.per.week': 40,
  'native.country': 'United-States'},
 {'id': 4,
  'age': 41,
  'workclass': 'Private',
  'fnlwgt': 264663,
  'education': 'Some-college',
  'education.num': 10,
  'marital.status': 'Separated',
  'occupation': 'Prof-specialty',
  'relationship': 'Own-child',
  'race': 'White',
  'sex': 'Female',
  'capital.gain': 0,
  'capital.loss': 3900,
  'hours.per.week': 40,
  'native.country': 'United-States'}]
[ ]:
import requests

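# The auth key is sent in the 'Auth' header
header = {'Auth': 'INSERT_YOUR_AUTH_KEY_HERE'}

# Prediction endpoint for the 'census' collection; predict_proba=True returns class probabilities
url_predict = "https://mycelia.azure-api.net/predict/census?predict_proba=True"

# Send the records as the JSON body of a PUT request
ans = requests.put(url_predict, json=body, headers=header)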
ans.json()
[{'id': 0,
  'predict': {'<=50K': 0.9708954095840454, '>50K': 0.0291045643389225}},
 {'id': 1,
  'predict': {'<=50K': 0.9445657134056091, '>50K': 0.05543423444032669}},
 {'id': 2,
  'predict': {'<=50K': 0.7718135714530945, '>50K': 0.22818641364574432}},
 {'id': 3,
  'predict': {'<=50K': 0.9780237078666687, '>50K': 0.021976308897137642}},
 {'id': 4,
  'predict': {'<=50K': 0.6955583095550537, '>50K': 0.3044416606426239}}]
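
The request above can be wrapped in a small helper that fails loudly on HTTP errors instead of trying to parse an error body. This is a sketch under stated assumptions: the URL, the PUT method and the 'Auth' header come from the cell above, while the function name predict_via_rest, the session parameter and the 30-second timeout are choices made here for illustration and are not part of JAI.

```python
import requests

# Endpoint taken from the cell above; 'census' is the collection name.
PREDICT_URL = "https://mycelia.azure-api.net/predict/census"


def predict_via_rest(records, auth_key, session=None, timeout=30):
    """PUT the records to the predict endpoint and return the parsed JSON.

    `session` may be a requests.Session (or any object with a compatible
    `put` method); if omitted, the module-level requests API is used.
    """
    http = session if session is not None else requests
    response = http.put(
        PREDICT_URL,
        params={"predict_proba": True},  # encoded as ?predict_proba=True
        json=records,
        headers={"Auth": auth_key},
        timeout=timeout,  # hypothetical value; avoids hanging indefinitely
    )
    # Raise requests.HTTPError on 4xx/5xx responses.
    response.raise_for_status()
    return response.json()
```

A call such as predict_via_rest(body, 'INSERT_YOUR_AUTH_KEY_HERE') would then return the same list of dictionaries as ans.json() above.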