Contributing Guidelines

Pull Requests, Fixes, New Models

Read the following instructions before adding a new model.

Get started quickly

#### Easy path finding
from mlmodels.util import path_norm
path_withPrefix = path_norm("dataset/timeseries/myfile.csv")   ##  site-packages/mlmodels/dataset/timeseries/myfile.csv



### Run a model from the command line for debugging
cd mlmodels
python optim.py

python model_tch/textcnn.py

python model_keras/textcnn.py



List of TODOs / Issues

https://github.com/arita37/mlmodels/issues?q=is%3Aissue+is%3Aopen+sort%3Aupdated-desc

Index of Functions/Methods

Index

Using Online Editor (pre-installed mlmodels)

Gitpod

Colab

Code Style:

  • You can use up to 120 characters per line, for better code readability.
  • Do NOT follow strict PEP8; make your code EASY TO READ (e.g., align "=" signs together, ....)
  • Do NOT reformat existing files.

Read The Examples


1) Fork

Fork from arita37/mlmodels. Please use the same branch for your development: the dev branch.

2) Configure for Tests (No Test Success, No PR Accepted)

Update these files where needed with your MODEL_NAME and BRANCH NAME:

3) Create Python Script For New Model

Create mlmodels/model_XXXX/yyyyy.py. Check the template.
See the examples model_keras/textcnn.py and transformer_sentence.py.

Please re-use existing functions in util.py

 from mlmodels.util import (os_package_root_path, log,
                            path_norm, get_model_uri, path_norm_dict)

 ### Use path_norm to normalize your path.
 data_path = path_norm("dataset/text/myfile.txt")
    ## --> full path  /home/ubuntu/mlmodels/dataset/text/myfile.txt

 data_path = path_norm("ztest/text/myfile.txt")
    ## --> full path  /home/ubuntu/mlmodels/ztest/text/myfile.txt

4) Create JSON For Parameters

Create the file mlmodels/model_XXXX/yyyy.json following this template.

5) Keep Your Branch Updated

Sync your branch with arita37/mlmodels:dev to reduce conflicts at the final step.

Pull Request : arita37/dev --> your Branch

Run Model

Run/test the newly added model on your local machine, on Gitpod, or on Colab:

source activate py36
cd mlmodels
python model_XXXX/yyyy.py  

Check Your Test Runs

https://github.com/arita37/mlmodels/actions?query=workflow%3Atest_custom_model

Issue A Pull Request

Once you have made the changes, issue a PR.


Manual Installation

### On Linux/MacOS
pip install "numpy<=1.17.0"
pip install -e .  -r requirements.txt
pip install -r requirements_fake.txt


### On Windows
## Use WSL with Linux installed.
pip install "numpy<=1.17.0"
pip install torch==1..1 -f https://download.pytorch.org/whl/torch_stable.html
pip install -e .  -r requirements_wi.txt
pip install -r requirements_fake.txt

Source Code Structure

  • docs: documentation
  • mlmodels: interface wrappers for PyTorch, Keras, Gluon, TF, and Transformer NLP models, covering training and hyper-parameter search.
    • model_xxx: one folder per framework, each implementing the same interface defined in the template folder
    • dataset: dataset files for test runs.
    • template: template interface wrapper defining the common interface across all frameworks
    • ztest: testing output for each sample test in model_xxx

How to define a custom model

1. Create a file mlmodels/model_XXXX/mymodel.py, where XXXX denotes the framework: tch for PyTorch, tf for TensorFlow, keras for Keras, ....

  • Declare the classes/functions below in the created file (a minimal working sketch follows the Infos list):

    Class Model()                                                  : Model definition
          __init__(model_pars, data_pars, compute_pars)            : Build the model from the parameter dicts

    def fit(model, data_pars, model_pars, compute_pars, out_pars)  : Train the model
    def fit_metric(model, data_pars, compute_pars, out_pars)       : Measure the results
    def predict(model, sess, data_pars, compute_pars, out_pars)    : Predict the results

    def get_params(choice, data_path, config_mode)                 : Return the parameters of the model
    def get_dataset(data_pars)                                     : Load the dataset
    def test()                                                     : Example running the model
    def test_api()                                                 : Example running the model in global settings

    def save(model, session, save_pars)                            : Save the model
    def load(load_pars)                                            : Load the trained model
    
  • Infos

    model :         Model(model_pars), an instance of the Model() class
    sess  :         TF session for a TensorFlow model, or the optimizer for a PyTorch model
    model_pars :    dict containing info on the model definition.
    data_pars :     dict containing info on the input data.
    compute_pars :  dict containing info on the compute settings.
    out_pars :      dict containing info on the output folder.
    save_pars/load_pars : dicts for saving or loading a model
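
For orientation, here is a minimal, self-contained sketch of this interface. It uses a scikit-learn regressor and random toy data as stand-ins for a real framework backend and dataset, so everything beyond the required names above is an illustrative assumption, not part of mlmodels:

    import pickle
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    class Model:
        def __init__(self, model_pars=None, data_pars=None, compute_pars=None):
            self.model_pars = model_pars or {}
            self.model      = LinearRegression()   ## stand-in backend model

    def get_params(choice="test", data_path="", config_mode="test"):
        ## A real implementation loads these dicts from the model's JSON config.
        return {}, {}, {}, {}                      ## model_pars, data_pars, compute_pars, out_pars

    def get_dataset(data_pars=None):
        ## Toy data; a real implementation would read data_pars["path"].
        X = np.random.rand(100, 4)
        y = X @ np.array([1.0, 2.0, 0.5, -1.0])
        return X, y

    def fit(model, data_pars=None, model_pars=None, compute_pars=None, out_pars=None):
        X, y = get_dataset(data_pars)
        model.model.fit(X, y)
        return model

    def predict(model, sess=None, data_pars=None, compute_pars=None, out_pars=None):
        X, _ = get_dataset(data_pars)
        return model.model.predict(X)

    def fit_metric(model, data_pars=None, compute_pars=None, out_pars=None):
        X, y = get_dataset(data_pars)
        return {"mse": mean_squared_error(y, model.model.predict(X))}

    def save(model, session=None, save_pars=None):
        with open(save_pars["path"], "wb") as f:
            pickle.dump(model.model, f)

    def load(load_pars):
        model = Model()
        with open(load_pars["path"], "rb") as f:
            model.model = pickle.load(f)
        return model

    def test():
        model = fit(Model())
        print(predict(model)[:5], fit_metric(model))

    if __name__ == "__main__":
        test()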
    

2. Write your code and create test() to test your code.

  • Declare the model definition in class Model()
    self.model = DeepFM(linear_cols, dnn_cols, task=compute_pars['task'])  # mlmodels/model_keras/01_deepctr.py
    # Model parameters such as `linear_cols, dnn_cols` are obtained from `get_params`, which returns `model_pars, data_pars, compute_pars, out_pars`
  • Implement the data pre-processing in get_dataset, which returns data for both the training and testing datasets. Depending on the type of dataset, you can dispatch to a separate function per data type, as in the example below
    if data_type == "criteo":
        df, linear_cols, dnn_cols, train, test, target = _preprocess_criteo(df, **kw)

    elif data_type == "movie_len":
        df, linear_cols, dnn_cols, train, test, target = _preprocess_movielens(df, **kw)
  • Call fit/predict with the initialized model and dataset
    # get the dataset using get_dataset
    data, linear_cols, dnn_cols, train, test, target = get_dataset(**data_pars)

    # fit the model (here m is a dict of fit settings, e.g. taken from compute_pars)
    model.model.fit(train_model_input, train[target].values,
                    batch_size=m['batch_size'], epochs=m['epochs'], verbose=2,
                    validation_split=m['validation_split'])

    # predict
    pred_ans = model.model.predict(test_model_input, batch_size=compute_pars['batch_size'])
  • Calculate metrics from the predicted output (a possible implementation is sketched just after this list)
    # the inputs of metrics are the predicted output and the ground-truth data
    def metrics(ypred, ytrue, data_pars, compute_pars=None, out_pars=None, **kwargs):
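
For instance, a minimal metrics body for a binary classifier could compute log loss and AUC with scikit-learn (the metric choice is an illustrative assumption, not a requirement of the interface):

    from sklearn.metrics import log_loss, roc_auc_score

    def metrics(ypred, ytrue, data_pars=None, compute_pars=None, out_pars=None, **kwargs):
        ## ypred: predicted probabilities, ytrue: ground-truth binary labels
        return {"log_loss": log_loss(ytrue, ypred),
                "auc":      roc_auc_score(ytrue, ypred)}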

3. Create JSON config file

Create a JSON file inside /model_XXX/mymodel.json

  • Provide a separate configuration for each stage (e.g., testing and production); for each stage, declare the specific parameters for the model, the dataset, and the output (a sketch of how this file is loaded follows the example).
  • Examples
    {
        "test": {

              "hypermodel_pars":   {
             "learning_rate": {"type": "log_uniform", "init": 0.01,  "range" : [0.001, 0.1] },
             "num_layers":    {"type": "int", "init": 2,  "range" :[2, 4] },
             "size":    {"type": "int", "init": 6,  "range" :[6, 6] },
             "output_size":    {"type": "int", "init": 6,  "range" : [6, 6] },

             "size_layer":    {"type" : "categorical", "value": [128, 256 ] },
             "timestep":      {"type" : "categorical", "value": [5] },
             "epoch":         {"type" : "categorical", "value": [2] }
           },

            "model_pars": {
                "learning_rate": 0.001,     
                "num_layers": 1,
                "size": 6,
                "size_layer": 128,
                "output_size": 6,
                "timestep": 4,
                "epoch": 2
            },

            "data_pars" :{
              "path"            : 
              "location_type"   :  "local/absolute/web",
              "data_type"   :   "text" / "recommender"  / "timeseries" /"image",
              "data_loader" :  "pandas",
              "data_preprocessor" : "mlmodels.model_keras.prepocess:process",
              "size" : [0,1,2],
              "output_size": [0, 6]              
            },


            "compute_pars": {
                "distributed": "mpi",
                "epoch": 10
            },
            "out_pars": {
                "out_path": "dataset/",
                "data_type": "pandas",
                "size": [0, 0, 6],
                "output_size": [0, 6]
            }
        },
    
        "prod": {
            "model_pars": {},
            "data_pars": {}
        }
    }
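
At runtime, get_params typically reads this file and returns the four parameter dicts for the selected stage. A minimal sketch, assuming the JSON layout above (the file path is illustrative):

    import json

    def get_params(choice="json", data_path="model_XXX/mymodel.json", config_mode="test"):
        with open(data_path) as f:
            config = json.load(f)[config_mode]      ## select the "test" or "prod" block
        return (config.get("model_pars", {}),   config.get("data_pars", {}),
                config.get("compute_pars", {}), config.get("out_pars", {}))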

#######################################################################################

③ Command-line input tools: the package provides the tools below

https://github.com/arita37/mlmodels/blob/dev/docs/README_docs/README_usage_CLI.md

#######################################################################################

④ Interface

models.py

   module_load(model_uri)
   model_create(module)
   fit(model, module, session, data_pars, out_pars)
   metrics(model, module, session, data_pars, out_pars)
   predict(model, module, session, data_pars, out_pars)
   save(model, path)
   load(model)
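
A typical session chains these calls. A sketch assuming the signatures listed above, with placeholder parameter dicts (in practice they come from the model's JSON config):

    from mlmodels import models

    data_pars, compute_pars, out_pars = {}, {}, {}            ## placeholders

    module = models.module_load("model_tf.1_lstm")            ## import the wrapper module
    model  = models.model_create(module)                      ## instantiate its Model()
    models.fit(model, module, None, data_pars, out_pars)      ## session is None here
    ypred  = models.predict(model, module, None, data_pars, out_pars)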

optim.py

   optim(modelname="model_tf.1_lstm.py", model_pars={}, data_pars={},
         compute_pars={"method": "normal/prune"},
         save_folder="/mymodel/", log_folder="", ntrials=2)

   optim_optuna(modelname="model_tf.1_lstm.py", model_pars={}, data_pars={},
                compute_pars={"method": "normal/prune"},
                save_folder="/mymodel/", log_folder="", ntrials=2)
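
A usage sketch following the signature above (the folder name and empty parameter dicts are placeholders):

    from mlmodels.optim import optim

    res = optim(modelname="model_tf.1_lstm.py",
                model_pars={}, data_pars={},
                compute_pars={"method": "normal"},    ## or "prune"
                save_folder="ztest/optim/", log_folder="", ntrials=2)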

Generic parameters

   Defined in models_config.json
   model_pars        :  Relative to the model definition
   compute_pars      :  Relative to the compute process
   data_pars         :  Relative to the input data
   out_pars          :  Relative to the output data

Sometimes, data_pars is required to set up the model (e.g., a CNN needs the image size), as in the sketch below.
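
A minimal sketch of a Model.__init__ that needs data_pars at build time; the "image_size" key and the Keras layers are illustrative assumptions:

    from tensorflow.keras import layers
    from tensorflow.keras.models import Sequential

    class Model:
        def __init__(self, model_pars=None, data_pars=None, compute_pars=None):
            model_pars = model_pars or {}
            h, w, c    = data_pars["image_size"]   ## e.g. [28, 28, 1]; input shape comes from data_pars
            self.model = Sequential([
                layers.Conv2D(model_pars.get("filters", 16), 3,
                              activation="relu", input_shape=(h, w, c)),
                layers.Flatten(),
                layers.Dense(model_pars.get("num_classes", 10), activation="softmax"),
            ])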

#######################################################################################

⑥ Naming convention

Function naming

pd_   :  input is a pandas DataFrame
np_   :  input is a numpy array
sk_   :  input is related to sklearn (e.g. an sklearn model); input is a numpy array
plot_ :  plotting function


col_  :  function related to column lists
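
A few hypothetical helpers, just to illustrate the prefixes (none of these functions exist in mlmodels):

    import pandas as pd

    def pd_drop_na_cols(df):             ## pd_  : input is a pandas DataFrame
        return df.dropna(axis=1)

    def np_standardize(x):               ## np_  : input is a numpy array
        return (x - x.mean()) / x.std()

    def col_numeric(df):                 ## col_ : returns a list of column names
        return df.select_dtypes("number").columns.tolist()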