
Revise the helm chart and update the poetry install lists #24

Open
xianglic wants to merge 5 commits into master from server_ver_update

Conversation

@xianglic
Collaborator

Revise the helm chart to make it compatible with sinphonia

Update the poetry pyproject.toml, bumping ultralytics from ^8.0.1 to ^8.0.145 so that the new yolov5 model can be used.
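A sketch of what the corresponding pyproject.toml dependency entry might look like (the surrounding table layout is assumed, not taken from the diff):

```toml
[tool.poetry.dependencies]
ultralytics = "^8.0.145"  # bumped from ^8.0.1 for the new yolov5 model
```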

@teiszler
Contributor

The changes to the Dockerfile and pyproject.toml look fine. @jaharkes and/or @jblakley, can you review the helm chart changes? It mostly seems like removal of the namespace tag from every chart.

@jblakley
Contributor

jblakley commented Aug 15, 2023 via email

@jaharkes
Member

jaharkes commented Aug 15, 2023 via email

@jblakley
Contributor

jblakley commented Aug 15, 2023 via email

@xianglic
Collaborator Author

xianglic commented Aug 15, 2023

Hi Jan and Jim,

Sorry for the confusion about the hard-coded lines. Those are the lines with vm039.elijah in the init container section of the object engine deployment file. They are for downloading ultralytics model frames and models from the edge cloudlet (vm039.elijah) I am running.

For model frames, in this case the master.zip file under the prepare-model-2 section, I added that just for testing the compatibility of different versions of model frames with different models and ultralytics library versions. This whole section could be deleted entirely later on.

For models, in this case coco.pt under the prepare-model-1 section, I think the openscout object engine expects models like coco.pt and robomaster.pt to be in the /openscout-server/models dir at startup. Therefore, I have the init container download a model that I have exposed from the edge cloudlet. Users can change the model URL specified in the values.yml file if they need other models.

Ant

@jaharkes
Member

What do you mean by "helm can define it"? The pre-PR method is to specify namespace in values.yaml. How is it set post-PR?

`helm install --namespace ...`, and you can even add a `--create-namespace` flag to have helm create the namespace if it does not yet exist.

@jaharkes
Member

It may be useful to have a generic model like coco.pt as a default in the Docker image; basically, anything we download anyway when the container is started should probably be part of the container. If it is part of the image, it is effectively cached between invocations, but if we have to download anything on startup we always have to wait for that download to complete. If the download is small it is cheap, but so is adding it to the container. And if it is large, having it pre-cached is better than having to re-download it every time.

@xianglic xianglic closed this Aug 15, 2023
@xianglic xianglic deleted the server_ver_update branch August 15, 2023 16:29
@xianglic xianglic restored the server_ver_update branch August 15, 2023 16:29
@xianglic
Collaborator Author

I see. I'll try to bundle a default model in the image, and maybe add a condition so that if no specific URL is defined by the user, openscout loads that default model.
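A minimal sketch of that fallback logic (hypothetical function and environment variable names, not the actual openscout code):

```python
import os
from typing import Optional

# Hypothetical default bundled with the image; path per the discussion above.
DEFAULT_MODEL = "/openscout-server/models/coco.pt"

def resolve_model(url: Optional[str] = None) -> str:
    """Return the user-specified model URL if one is configured,
    otherwise fall back to the default model baked into the image."""
    url = url or os.environ.get("MODEL_URL")
    return url if url else DEFAULT_MODEL
```

The same pattern would work whether the override comes from a helm values entry, an environment variable, or a command-line flag.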

@xianglic xianglic reopened this Aug 15, 2023
@jblakley
Contributor

jblakley commented Aug 15, 2023

I tested the basic object detection function of the PR on the ARM JIT Cloudlet. I did not run an ELK breakage test, but those containers deployed OK. Object detection is working.

The deployment, however, is going into default namespace rather than the openscout namespace. I'm assuming that the intent of the change is to force use of the --namespace option on the helm command line. If so, I don't like that since it creates a more complex deployment process and breaks the k8s/helm declarative model -- i.e., the deployment should be defined by the yaml files as much as possible. Or, am I missing something and there is a declarative (non-command line) way of specifying namespace?

Regardless, the PR broke my deployment by putting it in the wrong namespace.

A few other things I noticed:

  • The GPU resource usage declaration was removed from server/charts/openscout/templates/openscout-object-engine.yaml. While it was commented out, I have used this as a deployment-time option that gets uncommented on a GPU machine. Admittedly, that is the wrong way to do it, but I've always intended to go back and put in a values.yaml option to select if and how many GPUs.
          #resources:
          #  limits:
          #    nvidia.com/gpu: 1
  • The arm docker build is failing with error: Unable to find installation candidates for nvidia-cuda-nvrtc-cu11. I tried backing the base image out from nvidia/cuda:11.5.2-cudnn8-devel-ubuntu18.04 to nvidia/cuda:11.5.1-cudnn8-devel-ubuntu18.04, but that image seems to be gone now! This is also a problem for the current master branch.
  • The default values.yaml file is pointing to xianglic/openscout image. Why can't this be cmusatyalab/openscout since the arm images are there?
  • The comment on the vm039 hardcoding has already been noted. As a longer-term thing, it seems like the models and the packages needed for models (e.g., ultralytics) need a better solution.
  • On the positive side, I like how you cleaned up the volume mounts and the entrypoint arguments.

@xianglic
Collaborator Author

xianglic commented Aug 15, 2023

Hi Jim,

I'm not sure about the namespace either. I'll take a look at it and keep you updated.

For the things that you noted:

  • I'll put the GPU specification back using an if-defined condition.

  • Yes, nvidia/cuda:11.5.1-cudnn8-devel-ubuntu18.04 no longer exists on Docker Hub (but I believe Tom has that version cached in his local Docker registry). I don't know exactly how the ARM build fails; I'll look into it. But based on my build experience with 11.5.2, could you try using sudo docker build?

  • I have rebuilt the docker image for openscout just to update the version of the ultralytics lib, as the older version is no longer compatible with the updated yolov5 model frame. It is currently in my repo just for testing purposes. The image will be pushed to cmusatyalab/openscout once it tests out okay, and I'll change the address that values.yaml points to.

  • I'll revise the hard-coded lines following the advice from Jan.
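The if-defined GPU condition in the first item above might be sketched as follows in the object engine template (the gpu.enabled / gpu.count values keys are hypothetical, not from this PR):

```yaml
# openscout-object-engine.yaml (sketch; hypothetical values.yaml keys)
{{- if .Values.gpu.enabled }}
          resources:
            limits:
              nvidia.com/gpu: {{ .Values.gpu.count | default 1 }}
{{- end }}
```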

Ant

@jblakley
Contributor

Anthony, that sounds good. I will look at creating a helmfile to allow namespace declaration. Once @jaharkes has rebuilt the ARM container, I can test it out in the JIT Cloudlet.

Comment thread git_diff.txt Outdated
Comment thread server/charts/openscout/index.yaml Outdated
Comment thread server/charts/openscout/openscout-1.0.0.tgz Outdated
name: openscout
repository: cmusatyalab/openscout
tag: latest
repository: xianglic/openscout
Member


We should not refer to a non-cmusatyalab docker registry here

Comment thread server/charts/openscout/values.yaml Outdated
Comment thread server/charts/openscout/values.yaml Outdated
Comment thread server/charts/openscout/templates/openscout-object-engine.yaml Outdated
@jblakley
Contributor

I looked at helmfile this morning and it looks like it can easily serve as the mechanism for customizing the namespace outside of the charts themselves. For example, this helmfile.yaml worked for me to deploy openscout and openrtist in their own namespaces:

values:
- openscout_namespace: openscout1
- openrtist_namespace: openrtist1
releases:
- name: openscout
  namespace: {{ .Values.openscout_namespace }}
  chart: ./openscout
  installed: true
  values:
  - openscout/values_arm64.yaml
- name: openrtist
  namespace: {{ .Values.openrtist_namespace }}
  chart: ./openrtist
  installed: true
  values:
  - openrtist/values.yaml

To deploy this, it becomes simply:

helmfile sync

To change the namespace on the command line:

helmfile sync --state-values-set openscout_namespace=my_openscout_namespace

With this capability, I'm comfortable removing the namespace definition from all of the charts. Up to you whether you want to include a default helmfile.yaml in the repository.

Anthony Chen added 3 commits August 16, 2023 17:06
…te the prepare model2 init container; changed the prepare model 1 init container based on the rebuilt docker image; reposition the model url, name specification in the values.yml file
…models dir with models (coco.pt; robomaster.pt, and yolov5m) in the docker image; commented out the volume mount of ./models in the docker-compose file since we have now have one in the image itself
@xianglic
Collaborator Author

xianglic commented Aug 16, 2023

Hi Jan, Jim, Tom,

I have revised several things as follows:

For the docker aspect:

  1. Rebuilt the docker image, bringing the models dir into the container under the openscout-server dir (note: there are currently three models in the ./models dir: coco.pt, robomaster.pt, and yolov5m. I'll trim them down to one default based on what you want to keep in the future)
  2. Deleted ./models from the dockerignore file
  3. Commented out the ./models volume mount for the object engine service

For the helm chart aspect:

  1. Brought back the GPU specification
  2. Deleted the prepare model 2 init container
  3. Deleted the modeldir and cachedir volume mounts in the object engine deployment file, as we already have ./models in the rebuilt image and we no longer need cachedir for the prepare model 2 init container
  4. Changed the nodeport back for the httpd service in values.yml (I have overridden the cluster IP for this in recipes for sinphonia usage)
  5. Commented out the url under model, and repositioned the model section under images.objectengine as I presume it is more logically related there
  6. Deleted the tgz and index files

For other aspect:

  1. Deleted the git diff file

@jblakley
Contributor

My main remaining issue is being able to build a working ARM container image. I pushed the old base image to my dockerhub account so we don't lose the ability to recreate the current version. But we still need a working updated version.

I don't have strong input on the implementation of the model folders. But, here are some thoughts on design requirements:

  1. Should be a default "hello world" model bundled in the image. (e.g., coco.pt)
  2. Should be a mounted folder where additional models can be provisioned from the host at deployment/runtime.
  3. Fetching models at runtime from torch.hub or other central/public repositories should be deprecated. Specific required models should be loaded into the mounted folder by the user.
  4. Documentation to access commonly used models (e.g., yolov5, yolov8) should be included in the README
  5. I'm not sure of the best way to handle packages required to access specific models (e.g., ultralytics). On one hand, pre-installing all of them seems difficult and unbounded. On the other, runtime installation into the container seems bad.

These are certainly debatable requirements. They assume a paradigm of an evolving and growing number of models as the cognitive engines evolve. Another paradigm assumes that we lock things down to a specific set of models and packages and leave extensions to a developer. Given the challenge of #5, that may be the only option anyway. This seems somewhat tied to the openscout approach of modular cognitive engines. If I'm a developer adding a new engine, what's the process?

@jaharkes
Member

My main remaining issue is being able to build a working ARM container image. .... But, need to have a working updated version.

I have a somewhat ugly but working fix for that. When I tried to do it the official way by adding the download.pytorch.org repositories, it was still unable to find the right aarch64 torch wheels, so I just hardcoded the download urls for different python releases.

poetry lock is taking about 10-20 minutes to pin down the right package versions though, so right now I'm trying to see what is causing that and will probably tighten up some of the versions (or add some additional restrictions). It seems like the dependency resolver is cycling on botocore and roboflow, which are indirect dependencies.
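Tightening those indirect dependencies might look like the following pyproject.toml fragment (the version bounds here are purely illustrative assumptions, not the ones actually chosen):

```toml
[tool.poetry.dependencies]
# Hypothetical pins to keep the resolver from backtracking through
# every botocore/roboflow release (both are indirect dependencies)
botocore = ">=1.29,<2.0"
roboflow = ">=1.0"
```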

I don't have strong input on the implementation of the model folders. But, here are some thoughts on design requirements:

Complete agreement on the models folder: just include one example model, hopefully a relatively small one, but generic enough that people can test their openscout client/backend with common objects (person/keyboard/tv), which I guess would be the coco.pt model.

3. Fetching models at runtime from torch.hub or other central/public repositories should be deprecated. Specific required models should be loaded into the mounted folder by the user.

Actually there are 3 possible methods, each with their own advantages and disadvantages.

  • Include the model in the Docker image. People can create their own container with a tiny dockerfile that is just 'FROM cmusatyalab/openscout COPY mymodel.pt /openscout/models', publish that container image, and then deploy with the right helm values override to install their own container.
    Advantage: Automatically benefits from docker caching; it just adds the minimal possible layer and will persist on the host node between executions.
    Disadvantage: Need to build/publish a customized container.
  • Specify one or more urls to fetch from with helm values override.
    Advantage: Very easy to add custom models.
    Disadvantage: Models need to be refetched every single time the container is restarted/redeployed. Models need to be (publicly?) available for download.
  • Expose local directory as a mounted volume.
    Advantage: Easy if you already have models on your host, avoids duplications.
    Disadvantage: Will never work in an actual cloudlet environment, because you simply don't get to choose which cloudlet to deploy to, and even then you don't get access to the cloudlet's host environment to copy the models over.
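The first method above might look like this tiny Dockerfile (the model filename and image tag are hypothetical; the destination path follows the one-line dockerfile quoted above):

```dockerfile
# Layer a custom model on top of the published openscout image (sketch)
FROM cmusatyalab/openscout:latest
COPY mymodel.pt /openscout/models/mymodel.pt
```

Publishing this image and overriding the image repository/tag in the helm values would then deploy the customized container.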
