Benchmarked Resiliparse & added flag to evaluate parsers individually #25
Open
KhoomeiK wants to merge 2 commits into scrapinghub:master from
lopuhin (Member) reviewed on Oct 11, 2024 and left a comment
Thanks a lot for contributing a new extractor, @KhoomeiK. I left a few small comments - also, if you prefer, I could merge your PR as-is and address them in another PR.
Besides that, do you mind also updating the README with the results of the new parser, adding a line at the end of the "Result of packages added after original evaluation:" table?
try:
    extractor_module = importlib.import_module(f'extractors.run_{name}')
    extractor_module.main()
except:
Member
I'd rather catch Exception here, e.g. see motivation in this (rejected) PEP https://peps.python.org/pep-0760/#motivation
Suggested change
- except:
+ except Exception:
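To make the motivation concrete, here is a minimal sketch (the helper names are hypothetical and not part of this PR). A bare `except:` also catches `BaseException` subclasses such as `KeyboardInterrupt` and `SystemExit`, so a Ctrl-C during a long extractor run would be silently swallowed, whereas `except Exception:` lets it propagate.

```python
def run_with_bare_except(func):
    """Hypothetical helper: a bare except swallows everything, even Ctrl-C."""
    try:
        func()
    except:  # noqa: E722 (also catches KeyboardInterrupt and SystemExit)
        return 'swallowed'
    return 'ok'


def run_with_except_exception(func):
    """Hypothetical helper: only ordinary errors are swallowed."""
    try:
        func()
    except Exception:
        return 'swallowed'
    return 'ok'


def simulate_ctrl_c():
    # Stands in for the user interrupting a long-running extractor.
    raise KeyboardInterrupt


def simulate_failure():
    # Stands in for an extractor raising an ordinary error.
    raise ValueError('extractor failed')
```

With these helpers, `run_with_bare_except(simulate_ctrl_c)` returns normally, while `run_with_except_exception(simulate_ctrl_c)` re-raises the `KeyboardInterrupt`; both handle `simulate_failure` the same way.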
          'accuracy={accuracy:.3f} ± {accuracy_std:.3f} '
          .format(name=name, **metrics))
    metrics_by_name[name] = metrics
else:
Member
I think it would be best to refactor the code in a way that does not lead to having to repeat the reporting. For example, we could pass args.parser to the evaluate function.
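One possible shape for that refactor, as a rough sketch only: the signature of `evaluate`, the `compute_metrics` callable, and the `--parser` flag name are assumptions for illustration and may not match the repository's actual code. The point is that the parser filter only narrows the list of names, so the reporting line exists in one place.

```python
import argparse


def evaluate(available, compute_metrics, parser_name=None):
    """Evaluate all parsers in `available`, or only `parser_name` if given.

    `compute_metrics` is a hypothetical callable returning a dict with at
    least 'accuracy' and 'accuracy_std' for one parser name.
    """
    if parser_name is not None and parser_name not in available:
        raise ValueError(f'unknown parser: {parser_name}')
    names = [parser_name] if parser_name is not None else sorted(available)
    metrics_by_name = {}
    for name in names:
        metrics = compute_metrics(name)
        # Single reporting site shared by the "all" and "one parser" modes.
        print('{name:<20} accuracy={accuracy:.3f} ± {accuracy_std:.3f}'
              .format(name=name, **metrics))
        metrics_by_name[name] = metrics
    return metrics_by_name


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Assumed flag name, inferred from `args.parser` in the comment above.
    parser.add_argument('--parser', default=None,
                        help='evaluate only this parser (default: all)')
    return parser.parse_args(argv)
```

A caller would then do something like `evaluate(available, compute_metrics, parser_name=parse_args().parser)`, with no separate branch for the single-parser case.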
Resiliparse is actively used by some AI labs to extract web data for LLM pre-training, but it has not been publicly benchmarked alongside many other similar web parsing tools. I've added an evaluation script for Resiliparse along with its output data. I also added a flag to evaluate individual parsers separately.