This is an ensemble training benchmark consisting of four functions:
- The Driver orchestrates the entire flow. It starts by uploading the dataset for the trainers and the metatrainer, collects the final models.
- a set of Trainers that fit a model each (tested with 4 and 16 trainers, sequentially and in parallel)
- The Reducer collects the models and predictions from each trainer.
- The Metatrainer trains together with the trained models' layer, finalizing the 2-layer model.
The driver is the interface function and is invoked with a helloworld grpc call as standard. This benchmark is unique in that it relies on S3 transfer for saving and loading models, so inline transfer will not work.
-
Make sure to set the
BUCKET_NAME,AWS_ACCESS_KEY, andAWS_SECRET_KEYenvironment variables. The kn_deploy script will then substitute these values into the knative manifests. Example:export AWS_ACCESS_KEY=ABCDEFGHIJKLMNOPQRST export AWS_SECRET_KEY=ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMN
-
Deploy the necessary functions using the
kn_deployscript.../../tools/kn_deploy.sh ./knative_yamls/s3/*Only one set of manifests is provided by default for this benchmark. All 4 of the manifests in the
knative_yamls/s3folder must be deployed. These default manifests deploy functions with thes3transfer type enabled, and with tracing turned off. -
Invoke the benchmark. The interface function of this benchmark is named
driver. It can be invoked using the invoker or our test client, as described in the running benchmarks document.
Number of instances per function in a stable flow:
| Function | Instances | Is Configurable |
|---|---|---|
| Driver | 1 | No |
| Trainer | 4 | Yes - Set in trainer knative manifest and must equal TrainersNum driver env var |
| Reducer | 1 | No |
| Metatrainer | 1 | No |
tAddr- The address of the TrainerrAddr- The address of the ReducermAddr- The address of the MetatrainertrainersNum- The number of training modelssp- The port to which the driver will listen (which is used for invokation)zipkin- Address of the zipkin span collector
TRANSFER_TYPE- The transfer type to use. Can beINLINE(default),S3, orXDT. Not all benchmarks support all transfer types.AWS_ACCESS_KEY,AWS_SECRET_KEY,AWS_REGION- Standard s3 keys, only needed if the s3 transfer type is usedBUCKET_NAME- Set custom s3 bucket name, only needed if the s3 transfer type is used, default bucket name is set as 'vhive-stacking'ENABLE_TRACING- Toggles tracing.TrainersNum- The number of trainers to be used.CONCURRENT_TRAINING- Toggles concurrent training. When disabled, training is carried out for one model at a time.
