US Census Classification#

What are we going to do?#

In this quick demo, we will use JAI to:

Train and deploy models into a secure and scalable production-ready environment.
Complete a classification task - Predict whether household income exceeds $50K/yr based on census data

Importing libraries#

[1]:

from jai import Jai
import pandas as pd

JAI Auth Key#

If you don’t already have an auth key, you can get your auth key here - free forever. Also, please make sure to check your spam folder if you can’t find it in your inbox!

[ ]:

from jai import get_auth_key
get_auth_key(email = 'email@emailnet.com', firstName = 'JAI', lastName = 'Z')

<Response [201]>

Dataset quick look#

This dataset contains data collected by the 1994 U.S. Census. It contains personal information such as education, marital status, occupation and sex and in this example we will use that data to predict whether an individual is making more or less that 50k dollars per year.

[2]:

df = pd.read_csv('https://myceliademo.blob.core.windows.net/census-us/adult.csv?sv=2020-04-08&st=2021-05-17T18%3A19%3A59Z&se=2025-01-18T18%3A19%3A00Z&sr=b&sp=r&sig=sH%2B2Za%2FTuXsatqgmRX3eG%2FQfZTh1M2ptMUi8NTXBXF4%3D')
df = df.reset_index().rename(columns={'index':'id'})

[3]:

# Show name of columns and non-null count
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              32561 non-null  int64
 1   age             32561 non-null  int64
 2   workclass       32561 non-null  object
 3   fnlwgt          32561 non-null  int64
 4   education       32561 non-null  object
 5   education.num   32561 non-null  int64
 6   marital.status  32561 non-null  object
 7   occupation      32561 non-null  object
 8   relationship    32561 non-null  object
 9   race            32561 non-null  object
 10  sex             32561 non-null  object
 11  capital.gain    32561 non-null  int64
 12  capital.loss    32561 non-null  int64
 13  hours.per.week  32561 non-null  int64
 14  native.country  32561 non-null  object
 15  income          32561 non-null  object
dtypes: int64(7), object(9)
memory usage: 4.0+ MB

[4]:

# Show first 5 lines of dataframe
df.head()

[4]:

	id	age	workclass	fnlwgt	education	education.num	marital.status	occupation	relationship	race	sex	capital.loss	hours.per.week	native.country	income
0	0	90	?	77053	HS-grad	9	Widowed	?	Not-in-family	White	Female	4356	40	United-States	<=50K
1	1	82	Private	132870	HS-grad	9	Widowed	Exec-managerial	Not-in-family	White	Female	4356	18	United-States	<=50K
2	2	66	?	186061	Some-college	10	Widowed	?	Unmarried	Black	Female	4356	40	United-States	<=50K
3	3	54	Private	140359	7th-8th	4	Divorced	Machine-op-inspct	Unmarried	White	Female	3900	40	United-States	<=50K
4	4	41	Private	264663	Some-college	10	Separated	Prof-specialty	Own-child	White	Female	3900	40	United-States	<=50K

Inserting data into Jai#

To be able to use Jai’s functionalities, we first need to turn the data into a Jai collection. The method used to send data to Jai is j.setup (or j.fit; they are the same), which can then be consumed through the methods j.similar and j.predict. By using the setup method you add your raw data to your JAI environment, use the data to train your model based on the chosen model type and your model’s latent vector representation is then stored in the Jai collection.

[5]:

# Instantiate Jai class
j = Jai()

ans = j.setup(
    # Name of Jai collection
    name = 'census',
    # verbose = 2 shows loss graph at the end of the setup
    verbose = 2,
    # data to be inserted into Jai - a pandas dataframe is expected
    data = df,

    db_type = 'Supervised', #Text, FastText, TextEdit, Image, Supervised, Selfsupervised

    label = {'task':'classification', 'label_name':'income'}, #classification, metric_classification, regression, quantile_regression

    split = {'type':'stratified', 'split_column':'income', 'test_size':0.1},

    overwrite = True
    )

Insert Data: 100%|██████████| 2/2 [00:01<00:00,  1.07it/s]

Training might finish early due to early stopping criteria.

Recognized setup args:
- db_type: Supervised
- label:
  * task             : classification
  * label_name       : income
  * regression_scaler: None
  * quantiles        : []

JAI is working: 100%|██████████|22/22 [00:36]

../../_images/source_examples_US_Census_9_3.png


Setup Report:
Metrics classification:
              precision    recall  f1-score   support

       <=50K       0.88      0.94      0.91      4976
        >50K       0.75      0.58      0.65      1536

    accuracy                           0.86      6512
   macro avg       0.82      0.76      0.78      6512
weighted avg       0.85      0.86      0.85      6512


Best model at epoch: 05 val_loss: 0.31

Model Inference#

We can use the trained model to make inferences on any specific index or in other added new data

[6]:

j.predict('census',
          data = df.tail(1).drop('income',axis = 1),
          # predict_proba = True shows the probability that the item belongs to each class
          predict_proba = True)

Predict: 100%|██████████| 1/1 [00:00<00:00,  4.16it/s]

[6]:

[{'id': 32560,
  'predict': {'<=50K': 0.9997827410697937, '>50K': 0.0002172834356315434}}]

[7]:

j.predict('census',
          data = df.head().drop('income',axis = 1),
          # predict_proba = True shows the probability that the item belongs to each class
          predict_proba = True)

Predict: 100%|██████████| 1/1 [00:00<00:00,  3.79it/s]

[7]:

[{'id': 0,
  'predict': {'<=50K': 0.9080128073692322, '>50K': 0.09198720753192902}},
 {'id': 1,
  'predict': {'<=50K': 0.7557201981544495, '>50K': 0.24427981674671173}},
 {'id': 2,
  'predict': {'<=50K': 0.7873145937919617, '>50K': 0.21268539130687714}},
 {'id': 3,
  'predict': {'<=50K': 0.9710868000984192, '>50K': 0.02891319803893566}},
 {'id': 4,
  'predict': {'<=50K': 0.725451648235321, '>50K': 0.27454838156700134}}]

Requests via Rest API#

[8]:

body = df.head().drop('income',axis=1).to_dict(orient='records')
body

[8]:

[{'id': 0,
  'age': 90,
  'workclass': '?',
  'fnlwgt': 77053,
  'education': 'HS-grad',
  'education.num': 9,
  'marital.status': 'Widowed',
  'occupation': '?',
  'relationship': 'Not-in-family',
  'race': 'White',
  'sex': 'Female',
  'capital.gain': 0,
  'capital.loss': 4356,
  'hours.per.week': 40,
  'native.country': 'United-States'},
 {'id': 1,
  'age': 82,
  'workclass': 'Private',
  'fnlwgt': 132870,
  'education': 'HS-grad',
  'education.num': 9,
  'marital.status': 'Widowed',
  'occupation': 'Exec-managerial',
  'relationship': 'Not-in-family',
  'race': 'White',
  'sex': 'Female',
  'capital.gain': 0,
  'capital.loss': 4356,
  'hours.per.week': 18,
  'native.country': 'United-States'},
 {'id': 2,
  'age': 66,
  'workclass': '?',
  'fnlwgt': 186061,
  'education': 'Some-college',
  'education.num': 10,
  'marital.status': 'Widowed',
  'occupation': '?',
  'relationship': 'Unmarried',
  'race': 'Black',
  'sex': 'Female',
  'capital.gain': 0,
  'capital.loss': 4356,
  'hours.per.week': 40,
  'native.country': 'United-States'},
 {'id': 3,
  'age': 54,
  'workclass': 'Private',
  'fnlwgt': 140359,
  'education': '7th-8th',
  'education.num': 4,
  'marital.status': 'Divorced',
  'occupation': 'Machine-op-inspct',
  'relationship': 'Unmarried',
  'race': 'White',
  'sex': 'Female',
  'capital.gain': 0,
  'capital.loss': 3900,
  'hours.per.week': 40,
  'native.country': 'United-States'},
 {'id': 4,
  'age': 41,
  'workclass': 'Private',
  'fnlwgt': 264663,
  'education': 'Some-college',
  'education.num': 10,
  'marital.status': 'Separated',
  'occupation': 'Prof-specialty',
  'relationship': 'Own-child',
  'race': 'White',
  'sex': 'Female',
  'capital.gain': 0,
  'capital.loss': 3900,
  'hours.per.week': 40,
  'native.country': 'United-States'}]

[ ]:

import requests

header={'Auth': 'INSERT_YOUR_AUTH_KEY_HERE'}

url_predict = f"https://mycelia.azure-api.net/predict/census?predict_proba=True"

ans = requests.put(url_predict, json=body, headers=header)
ans.json()

[{'id': 0,
  'predict': {'<=50K': 0.9708954095840454, '>50K': 0.0291045643389225}},
 {'id': 1,
  'predict': {'<=50K': 0.9445657134056091, '>50K': 0.05543423444032669}},
 {'id': 2,
  'predict': {'<=50K': 0.7718135714530945, '>50K': 0.22818641364574432}},
 {'id': 3,
  'predict': {'<=50K': 0.9780237078666687, '>50K': 0.021976308897137642}},
 {'id': 4,
  'predict': {'<=50K': 0.6955583095550537, '>50K': 0.3044416606426239}}]

US Census Classification

Contents