included llama3.1 8b small llm training scripts by ZixianWangAMD · Pull Request #799 · mlcommons/training

ZixianWangAMD · 2025-06-25T05:21:46Z

No description provided.

github-actions · 2025-06-25T05:21:55Z

MLCommons CLA bot:
Thank you very much for your submission, we really appreciate it. Before we can accept your contribution, we ask that you sign the MLCommons CLA (Apache 2). Please use this [Google form](https://forms.gle/Ew1KkBVpyeJDuRw67) to initiate authorization. If you are from an MLCommons member organization, we will request that you be added to the CLA. If you are not from a member organization, we will email you a CLA to sign. For any questions, please contact support@mlcommons.org.
0 out of 1 committers have signed the MLCommons CLA.
❌ @zixian Wang
Zixian Wang does not seem to be a GitHub user. You need a GitHub account once you become an MLCommons member. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request.

mmarcinkiewicz · 2025-06-25T15:43:03Z

@ZixianWangAMD may I ask for a dockerfile that I can use to test it on H100? Or at least a hint on how to modify the existing dockerfile?

ShriyaRishab · 2025-06-26T20:11:55Z

Based on Training WG feedback, can you please change the folder name from small_language_model_pretraining to small_llm_pretraining, since that is the agreed-upon long name for this benchmark?
The short name (which will be used everywhere in the code and logging) will be llama3.1_8b.

mmarcinkiewicz · 2025-07-25T06:54:36Z

+    # warmup_steps = math.ceil(57600 * 8192 / 8192 / gbs * 0.1)
+    # # 230k samples
+    max_steps = math.ceil(230000 * 8192 / 8192 / gbs)
+    warmup_steps = math.ceil(230000 * 8192 / 8192 / gbs * 0.1)


max_steps should be fixed at 1.2M
warmup_steps should be parametrizable from the config
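
One way the two requests above could be satisfied is sketched below. This is an illustration only, not the code that was merged: it reads "1.2M" as a fixed step count, and the `WARMUP_STEPS` environment variable and the `resolve_schedule` helper are hypothetical names introduced here. The fallback reuses the 10% warmup expression from the diff.

```python
import math
import os

# Assumption: "1.2M" in the review comment is read as a fixed number of steps.
MAX_STEPS = 1_200_000


def resolve_schedule(gbs, env=None):
    """Return (max_steps, warmup_steps) for a given global batch size.

    WARMUP_STEPS is a hypothetical config variable; when it is not set,
    fall back to the expression from the diff (10% of the steps needed
    to consume 230k samples).
    """
    env = os.environ if env is None else env
    raw = env.get("WARMUP_STEPS")
    if raw is not None:
        warmup_steps = int(raw)
    else:
        warmup_steps = math.ceil(230000 * 8192 / 8192 / gbs * 0.1)
    if not 0 <= warmup_steps <= MAX_STEPS:
        raise ValueError(f"warmup_steps={warmup_steps} is outside [0, {MAX_STEPS}]")
    return MAX_STEPS, warmup_steps


if __name__ == "__main__":
    max_steps, warmup_steps = resolve_schedule(gbs=32)
    print(f"max_steps={max_steps} warmup_steps={warmup_steps}")
```

With this shape, a config script would only need to export the warmup value, while `max_steps` stays constant regardless of the global batch size.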

…from scratch using nemo

chenyuxyz · 2025-08-06T18:42:09Z

+Once Rclone is installed, run the following command to authenticate with the bucket:
+
+```
+rclone config create mlc-training s3 provider=Cloudflare access_key_id=76ea42eadb867e854061a1806220ee1e secret_access_key=a53625c4d45e3ca8ac0df8a353ea3a41ffc3292aa25259addd8b7dc5a6ce2936 endpoint=https://c2686074cb2caf5cbaf6d134bdba8b47.r2.cloudflarestorage.com


Will there be a new pre-tokenized dataset to download? This still points to the dataset for 405B.

Yes, it is being uploaded; once that is done, we will update the instructions.

ShriyaRishab · 2025-08-08T15:47:04Z

@ZixianWangAMD - can you please sign the CLA?

ethanself · 2025-08-13T20:31:33Z

@ZixianWangAMD - can you please sign the CLA?

Zixian has signed the CLA with user ZixianWangAMD, but the check is failing due to an incorrect Git configuration. Commits were made locally with the author name "Zixian Wang", and GitHub usernames cannot contain spaces.

I did notice that Zixian appears to have multiple GitHub accounts.

To set correct global configuration:
git config --global user.name "Your Correct Name Here"
git config --global user.email "your.email@example.com"

OR

Set local configuration for a specific repo:
git config user.name "Project Specific Name"
git config user.email "project.email@example.com"

Once the config is fixed, the local commits will need to be rebased so that their recorded author is corrected.

ShriyaRishab · 2025-08-15T18:11:46Z

Closing as a duplicate of #814.

included small llm training scripts

a5538ad

ZixianWangAMD requested a review from a team as a code owner June 25, 2025 05:21

ShriyaRishab reviewed Jun 25, 2025

Comment thread small_language_model_pretraining/nemo/README.md Outdated

ShriyaRishab reviewed Jun 25, 2025

Comment thread small_language_model_pretraining/nemo/README.md Outdated

ShriyaRishab reviewed Jun 25, 2025

Comment thread small_language_model_pretraining/nemo/README.md Outdated

ShriyaRishab reviewed Jun 25, 2025

Comment thread small_language_model_pretraining/nemo/config_MI325X_1x8x1_8b.sh Outdated

ShriyaRishab reviewed Jun 25, 2025

Comment thread small_language_model_pretraining/nemo/pretrain_llama31.py Outdated

ShriyaRishab reviewed Jun 25, 2025

Comment thread small_language_model_pretraining/nemo/pretrain_llama31.py Outdated

ShriyaRishab reviewed Jun 25, 2025

Comment thread small_language_model_pretraining/nemo/pretrain_llama31.py Outdated

ShriyaRishab reviewed Jun 25, 2025

Comment thread small_language_model_pretraining/nemo/utils/download_hf_llama3.sh Outdated

ZixianWangAMD and others added 6 commits June 29, 2025 10:24

Update README.md

333605b

Update README.md

1b32ed1

Update README.md

43d381e

included steps for data processing

161babd

Update README. Included H200 dockerfile, verified running.

cadb53a

deleted if 70b and if 405b

9a4cdd6

ramgandikota reviewed Jul 7, 2025

Comment thread small_language_model_pretraining/nemo/config_H200_1x8x1_8b.sh Outdated

ZixianWangAMD added 2 commits July 7, 2025 14:00

Update config_H200_1x8x1_8b.sh to remove hard-coded seed

f6a5fe2

Update config_MI325X_1x8x1_8b.sh to remove hard-coded seed

1b44b91

This was referenced Jul 11, 2025

When can the reference of llama3.1_8b(small_llm_pretraining) be provided? #802

Closed

BERT input checkpoint gives errors while execution #794

Closed

ZixianWangAMD added 2 commits July 19, 2025 00:53

update with newest pretrained code with static 230k train samples

83d3e1a

resolve conflicts

a7c4dde

mmarcinkiewicz reviewed Jul 25, 2025

Comment thread small_language_model_pretraining/nemo/pretrain_llama31.py Outdated

mmarcinkiewicz reviewed Jul 25, 2025

mmarcinkiewicz and others added 2 commits July 28, 2025 14:17

SLURM, TP and LR fix

c72e54c

fixes

a79a7b3