Inter-SM scripts + L2 cache results #4

Merged: fotstrt merged 12 commits into benchmarking_inter from benchmarking_inter_fix_scripts on Jul 8, 2025
Conversation

@elvingerpaul elvingerpaul commented Jul 6, 2025

Collaborator
  • Separates the L2 cache and memory bandwidth benchmarks into dedicated folders, each with its own scripts. This adds some duplication, but each folder should now contain most of the scripts it needs. I also adjusted the scripts slightly to simplify measuring different batch sizes and prompt sizes; otherwise they remain mostly the same.
  • Removed the profile ranges when profiling with nsys.
  • Slightly modified the script that extracts kernel latencies from the nsys sqlite file (using the offsets as suggested, which is indeed much easier :)), since we now have the entire trace and no longer only the decode part. Since we're mostly interested in the decode stage, we simply keep the last n kernels, where n is the number of kernels in the decode stage (which changes with batch size).
  • Created a folder other containing all the scripts I wasn't sure we still need. Happy to delete them, but I wanted to keep them around for now in case you need them.
  • Added L2 results (txt and csv files) for batch sizes 1, 8, 16 and prompt sizes 100, 1000.
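
As an illustration of the "keep the last n kernels" step, here is a minimal sketch, not the script from this PR. It assumes the nsys sqlite export contains a CUPTI_ACTIVITY_KIND_KERNEL table with start, end and shortName columns and a StringIds lookup table; the function name and its parameters are hypothetical.

```python
import sqlite3


def decode_kernel_latencies(conn, num_decode_kernels):
    """Return (kernel_name, latency_ns) for the last `num_decode_kernels` kernels.

    Assumption: kernels live in CUPTI_ACTIVITY_KIND_KERNEL and their names are
    resolved through StringIds, as in a typical nsys sqlite export.
    """
    rows = conn.execute(
        'SELECT s.value, k."end" - k."start" '
        "FROM CUPTI_ACTIVITY_KIND_KERNEL AS k "
        "JOIN StringIds AS s ON s.id = k.shortName "
        'ORDER BY k."start"'
    ).fetchall()
    if num_decode_kernels <= 0:
        return []
    # The trace covers the whole run; the decode kernels are the trailing ones.
    return rows[-num_decode_kernels:]
```

It could be called as `decode_kernel_latencies(sqlite3.connect("trace.sqlite"), n)`, where the file name is a placeholder and n depends on the batch size.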

@elvingerpaul elvingerpaul requested a review from fotstrt July 6, 2025 13:22
github-actions Bot commented Jul 6, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Comment on lines +36 to +50
    if sort_lats:
        # plot the largest latency first, to ensure all bars are visible
        for i in range(num_kernels):
            l = [(lats[trace][i], trace) for trace in traces if trace in lats]

            sorted_l = sorted(l, key=lambda x: x[0], reverse=True)
            if bar:
                for lat, lo in sorted_l:
                    ax.bar(i, lat, color=colors[lo])

        ax.legend(
            handles=[plt.Rectangle((0,0),1,1, color=colors[trace]) for trace in traces if trace in lats],
            labels=[f"{round(int(trace)*4/(1024*1024), 2)} MB" if trace != "iso" else "iso" for trace in traces if trace in lats],
        )
    else:

Collaborator Author

I added this as a non-default mode, to be discussed. Currently we plot the bars in a fixed order, assuming that the highest bar is always plotted first (in the background) and the lower ones are plotted on top. This is not always the case.

The code above sorts the latencies for each kernel so that the highest bar is always plotted first. To be honest, the result is visually a bit harder to read, because the orange bar is sometimes on top and sometimes in the background.

The current alternative uses slightly transparent bars (by setting alpha), which is the version I like most at this point, though it is not ideal. Let me know if you have a preference.
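
To make the two options concrete, here is a small sketch of both, under stated assumptions; it is not the plotting script from this PR. The helper names, the alpha value of 0.6 and the reading of the factor 4 as bytes per element are all assumptions. `ax` is expected to be a matplotlib Axes, `lats` maps a trace key to a list of per-kernel latencies, and `colors` maps a trace key to a color.

```python
def draw_order(lats, traces, kernel_idx):
    """Return (latency, trace) pairs for one kernel, tallest first."""
    pairs = [(lats[t][kernel_idx], t) for t in traces if t in lats]
    return sorted(pairs, key=lambda p: p[0], reverse=True)


def legend_label(trace, bytes_per_elem=4):
    """Label a trace by its size in MB; 4 bytes per element is an assumption."""
    if trace == "iso":
        return "iso"
    return f"{round(int(trace) * bytes_per_elem / (1024 * 1024), 2)} MB"


def plot_overlaid(ax, lats, traces, colors, num_kernels, sort_lats=False, alpha=0.6):
    """Overlay one bar per trace at each kernel index."""
    for i in range(num_kernels):
        if sort_lats:
            # Option 1: draw the tallest bar first so shorter bars stay visible.
            for lat, t in draw_order(lats, traces, i):
                ax.bar(i, lat, color=colors[t])
        else:
            # Option 2: keep the fixed order and rely on transparency instead.
            for t in traces:
                if t in lats:
                    ax.bar(i, lats[t][i], color=colors[t], alpha=alpha)
```

With sorting, the drawing order (and so which color ends up in front) changes from kernel to kernel, which matches the readability concern above; with alpha, the order is stable but overlapping colors blend.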

@fotstrt fotstrt merged commit 4d0d317 into benchmarking_inter Jul 8, 2025
1 of 5 checks passed