Conversation
@jeremiedbb Before I vote, I have 4 remaining questions:
The maximum number of iterations is known in advance. So what you see is progress toward that maximum number of iterations. When the estimator stops after reaching its convergence criterion, the progress bar is directly marked as 100% finished (jumping from its current completion to 100%). There may be estimators where the maximum number of iterations itself is not known in advance. In that case the progress bar is displayed as an indeterminate progress bar.
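To illustrate the two display modes described above, here is a minimal sketch of a text progress renderer. The function names and format are purely illustrative, not the scikit-learn implementation:

```python
def render_progress(current, total=None, width=20):
    # Hypothetical helper: determinate bar when the max number of
    # iterations is known, indeterminate bar otherwise.
    if total is None:
        return f"[{'?' * width}] iteration {current}"
    filled = int(width * current / total)
    return f"[{'#' * filled}{'.' * (width - filled)}] {current}/{total}"

def finish(width=20):
    # On early stopping, jump straight from the current completion to 100%.
    return f"[{'#' * width}] done"
```

For example, `render_progress(5, 10)` draws a half-filled bar, while `render_progress(5)` draws an indeterminate one; `finish()` is what the bar jumps to when the convergence criterion is reached early.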
Then the early stopping callback has no effect when set on such estimators. The implementation of callback support in estimators is such that some iteration loops are interruptible at the end of their loop steps. When there are no such loops, there's nothing to interrupt.
Callback support in estimators consists of 2 things: creating callback contexts and calling callback hooks.
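The two pieces could be sketched as follows. This is a toy illustration, assuming hypothetical hook names (`on_fit_begin`, `on_fit_iter_end`) that are not the actual scikit-learn API:

```python
class RecordingCallback:
    # Toy callback that just records which hooks were called.
    def __init__(self):
        self.events = []

    def on_fit_begin(self, estimator):
        self.events.append("fit_begin")

    def on_fit_iter_end(self, estimator, iteration):
        self.events.append(f"iter_{iteration}")


class ToyEstimator:
    def __init__(self, max_iter=2, callbacks=()):
        self.max_iter = max_iter
        self._callbacks = list(callbacks)

    def fit(self, X=None, y=None):
        # 1) create the callback context for this fit
        #    (here reduced to notifying registered callbacks)
        for cb in self._callbacks:
            cb.on_fit_begin(self)
        for i in range(self.max_iter):
            # ... one optimization step would go here ...
            # 2) call the hook at the end of each interruptible loop step
            for cb in self._callbacks:
                cb.on_fit_iter_end(self, i)
        return self
```

A callback then observes the fit without the estimator knowing anything about what the callback does.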
Of course that overhead represents a larger proportion of the runtime when fitting on very small datasets, but I believe that 2 µs per iteration is very small, even for small datasets.
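A per-iteration overhead figure like this can be sanity-checked with a micro-benchmark along these lines (a rough sketch with a stand-in workload, not the actual measurement methodology used for the SLEP):

```python
import timeit

def run(n_iter, callbacks=()):
    total = 0
    for i in range(n_iter):
        total += i  # stand-in for the real per-iteration work
        for cb in callbacks:
            cb(i)
    return total

n_iter, repeat = 100_000, 3
base = min(timeit.repeat(lambda: run(n_iter), number=1, repeat=repeat))
with_cb = min(timeit.repeat(lambda: run(n_iter, [lambda i: None]),
                            number=1, repeat=repeat))
overhead_us = (with_cb - base) / n_iter * 1e6
print(f"~{overhead_us:.2f} us per iteration of callback overhead")
```

Timings are noisy at this scale, so taking the minimum over several repeats gives the most stable estimate.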
I expect that we'll implement callback support in some Cython parts, in particular when the full iterative loop is Cython (e.g.
They can add support for callbacks in their estimators or not. If they don't, using their estimators in our meta-estimators won't break. It's just that progress bars for instance will only display progress on the meta-estimator. On the other hand using our estimator in their meta-estimators won't break either. In that case callbacks will work but won't be able to provide their full capabilities.
I don't think that we're making callbacks mandatory to be named "compatible". We're making sure that composing elements that support callbacks and elements that don't doesn't crash. At the same time we are going to include a developer test suite in the callback module for third party devs to test their custom estimators and/or custom callbacks to make sure that they are compatible with the scikit-learn callback api.
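A developer check from such a test suite might look roughly like this. Everything here is hypothetical (the check name, the hook names, the toy estimator); the idea is only that composing with and without callbacks must not crash:

```python
def check_callback_tolerant(estimator_factory):
    # Hypothetical compatibility check, in the spirit of the planned
    # developer test suite: fitting must succeed both with no callback
    # registered and with a do-nothing callback registered.
    est = estimator_factory()
    est.fit()  # no callbacks: must not crash

    class NoOpCallback:
        def on_fit_begin(self, estimator):
            pass

        def on_fit_iter_end(self, estimator, iteration):
            pass

    est = estimator_factory()
    est._callbacks = [NoOpCallback()]
    est.fit()  # with a callback: must not crash either
    return True


class TinyEstimator:
    # Minimal estimator used to exercise the check.
    def __init__(self):
        self._callbacks = []

    def fit(self, X=None, y=None):
        for i in range(3):
            for cb in self._callbacks:
                cb.on_fit_iter_end(self, i)
        return self
```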
Estimators that support the scikit-learn callback API won't automatically support LightGBM (or lightning or tensorflow, ...) callbacks. Whether or not they'll add support for scikit-learn callbacks in their estimators, I can't tell. But if they show interest, we can help them implement that. Hope that I answered your questions. Don't hesitate to ask again if I wasn't clear enough in some of my answers :)
jjerphan left a comment:
Thank you for this SLEP and associated work, @jeremiedbb et al.!
For the record, I think the current design looks good, and I am confident that it strikes a good balance between simplicity and expressive power, but I want to wait a bit to follow and review some of the linked PRs before casting my vote in a week or two.
@jeremiedbb Thanks for your detailed yet concise answers. I don't think there is a show stopper. Still, I'm interested, so I'll ask further, mostly about implementation details.
Would it make sense to apply the logic "don't create a callback context if no callback is registered" to all cases, not just Cython code?

    with nogil:
        for epoch in range(max_iter):
            for i in range(n_samples):
                ...

I guess the callback would be registered inside the epoch/max_iter loop but outside the i/n_samples loop. How do we avoid acquiring the GIL again (which is done for the
It depends:
I was going to still create it for every kind of estimator, to not make the callback support more verbose by adding branches everywhere, but I think that with a little bit of magic we can do it seamlessly without adding verbosity. I will open a dedicated issue to figure out which option we choose. This is kind of an implementation detail though and can be decided later.
    cdef int has_callbacks = len(callback_context._callbacks) > 0

    with nogil:
        for epoch in range(max_iter):
            if has_callbacks:
                with gil:
                    subcontext = callback_context.subcontext(...)
                    subcontext.call_on_fit_task_begin(...)
            for i in range(n_samples):
                # no callback stuff here because we don't go deeper than
                # iterations involving the full dataset
                ...
            if has_callbacks:
                with gil:
                    subcontext.call_on_fit_task_end(...)

(I edited the snippet to make it more realistic and accurate)
I've posted a few questions in scikit-learn/scikit-learn#33322 - it felt like the right place.
One question in addition is about the callback context manager. We have something like the following:

    class MyEstimator:
        ...
        @with_callbacks
        def fit(self, X, y):
            callback_ctx = self._init_callback_context(max_subtasks=self.max_iter)

For builtin estimators the … What I am wondering is why we have this duplication. Naively it seems something like the following should also work:

    class MyEstimator:
        ...
        def fit(self, X, y):
            with CallbackContext(max_subtasks=self.max_iter) as callback_ctx:
                # do stuff with callback_ctx

What am I missing?
1 level of indentation. In scikit-learn it means in particular that PRs implementing callback support would have a huge unrelated diff because of the new indentation of the whole fit method. The long-term preferred solution is the public vs private … Note that we can still offer the context manager option if we get the request.
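The indentation argument can be seen in a minimal sketch of a decorator approach. This is not the actual `with_callbacks` implementation, just an illustration of how a wrapper can handle enter/exit around `fit` so the body keeps its original indentation:

```python
import functools

def with_callbacks(fit):
    # Hypothetical sketch: the wrapper performs the setup/teardown that a
    # context manager would, without indenting the body of fit.
    @functools.wraps(fit)
    def wrapper(self, *args, **kwargs):
        self._events = getattr(self, "_events", [])
        self._events.append("enter")  # stands in for context creation
        try:
            return fit(self, *args, **kwargs)
        finally:
            self._events.append("exit")  # stands in for context teardown
    return wrapper


class MyEstimator:
    @with_callbacks
    def fit(self, X=None, y=None):
        # body unchanged, no extra indentation level
        self._events.append("fit-body")
        return self
```

A PR adding callback support this way only touches the decorator line and the context-creation line, not every line of the `fit` body.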
Hi everyone, I opened a PR (#104) to change a small detail in the SLEP because we found a performance concern that we wanted to tackle right away. The goal is to avoid unnecessary pickling of estimators, which requires adding 1 extra arg to the callback hooks. I know that it's not the best practice to change the SLEP during the voting time, but I believe that it's a very minor change that should not impact the decision. Please tell me if you think otherwise. (Note that PR #104 can be merged after the vote as a small amendment to the SLEP if you think that's more appropriate.)
This PR collects the votes for SLEP023: Callback API.
This SLEP is being implemented in the `callbacks` feature branch. scikit-learn/scikit-learn#33322 keeps an updated diff against `main`. It currently contains the framework and a callback for progress bars.

On top of that, there's ongoing work to add callback support in `BaseSearchCV` (scikit-learn/scikit-learn#33533). We'll also draft PRs to add support in `Pipeline` and a couple of key estimators in the coming weeks.

According to our governance model, the vote will be open for a month (until April 20), and the motion is accepted if 2/3 of the cast votes are in favor.
@scikit-learn/core-devs