Update dependency datasets to v4#4460
Update dependency datasets to v4#4460renovate-bot wants to merge 1 commit intoGoogleCloudPlatform:mainfrom
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request updates the Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request updates the datasets dependency to version 4.6.1. This is a significant major version bump from 2.18.0, and the release notes for datasets v4.0.0 indicate several breaking changes. It is crucial to verify that the existing code, particularly in skills/vertex-tuning/scripts/prepare_dataset.py, remains fully compatible with the new library version to prevent runtime errors and ensure continued functionality.
Note: Security Review has been skipped due to the limited scope of the PR.
| numpy==2.4.2 | ||
| pandas==3.0.1 | ||
| datasets==2.18.0 | ||
| datasets==4.6.1 |
There was a problem hiding this comment.
The datasets library has been updated to a new major version (v4.x). The release notes for datasets v4.0.0 mention several breaking changes, including the removal of scripts, changes in audio/video decoding, and Sequence being replaced by List. Please verify that the existing code in skills/vertex-tuning/scripts/prepare_dataset.py is fully compatible with these changes. Specifically, ensure that functionalities like load_dataset, map, filter, train_test_split, and to_json work as expected with datasets==4.6.1.
265f805 to
0bead61
Compare
ebf0a09 to
936fcb7
Compare
936fcb7 to
717c324
Compare
This PR contains the following updates:
==2.18.0→==4.8.4Release Notes
huggingface/datasets (datasets)
v4.8.4Compare Source
What's Changed
Full Changelog: huggingface/datasets@4.8.3...4.8.4
v4.8.3Compare Source
What's Changed
Full Changelog: huggingface/datasets@4.8.2...4.8.3
v4.8.2Compare Source
What's Changed
Full Changelog: huggingface/datasets@4.8.1...4.8.2
v4.8.1Compare Source
What's Changed
Full Changelog: huggingface/datasets@4.8.0...4.8.1
v4.8.0Compare Source
Dataset Features
Read (and write) from HF Storage Buckets: load raw data, process and save to Dataset Repos by @lhoestq in #8064
This also fixes multiprocessed push_to_hub on macos that was causing segfault (now it uses spawn instead of fork).
And it bumps
dillandmultiprocessversions to support python 3.14Datasets streaming iterable packaged improvements and fixes by @Michael-RDev in #8068
max_shard_sizeto IterableDataset.push_to_hub (but requires iterating twice to know the full dataset twice - improvements are welcome)zip://*.jsonl::hf://datasets/username/dataset-name/data.zipWhat's Changed
New Contributors
Full Changelog: huggingface/datasets@4.7.0...4.8.0
v4.7.0Compare Source
Datasets Features
Json()type by @lhoestq in #8027Json()type is used to store such data that would normally not be supported in Arrow/ParquetJson()type inFeatures()for any dataset, it is supported in any functions that acceptsfeatures=likeload_dataset(),.map(),.cast(),.from_dict(),.from_list()on_mixed_types="use_json"to automatically set theJson()type on mixed types in.from_dict(),.from_list()and.map()Examples:
You can use
on_mixed_types="use_json"or specifyfeatures=with a [Json] type:This is also useful for lists of dictionaries with arbitrary keys and values, to avoid filling missing fields with None:
Another example with tool calling data and the
on_mixed_types="use_json"argument (useful to not have to specifyfeatures=manually):What's Changed
New Contributors
Full Changelog: huggingface/datasets@4.6.1...4.7.0
v4.6.1Compare Source
Bug fix
Full Changelog: huggingface/datasets@4.6.0...4.6.1
v4.6.0Compare Source
Dataset Features
Support Image, Video and Audio types in Lance datasets
Push to hub now supports Video types
Write image/audio/video blobs as is in parquet (PLAIN) in
push_to_hub()by @lhoestq in #7976Add
IterableDataset.reshard()by @lhoestq in #7992Reshard the dataset if possible, i.e. split the current shards further into more shards.
This increases the number of shards and the resulting dataset has num_shards >= previous_num_shards.
Equality may happen if no shard can be split further.
The resharding mechanism depends on the dataset file format:
What's Changed
transformers v5andhuggingface_hub v1by @hanouticelina in #7989New Contributors
Full Changelog: huggingface/datasets@4.5.0...4.6.0
v4.5.0Compare Source
Dataset Features
Add lance format support by @eddyxu in #7913
What's Changed
revisioninload_datasetby @Scott-Simmons in #7929New Contributors
Full Changelog: huggingface/datasets@4.4.2...4.5.0
v4.4.2Compare Source
Bug fixes
Minor additions
New Contributors
Full Changelog: huggingface/datasets@4.4.1...4.4.2
v4.4.1Compare Source
Bug fixes and improvements
Full Changelog: huggingface/datasets@4.4.0...4.4.1
v4.4.0Compare Source
Dataset Features
Add nifti support by @CloseChoice in #7815
Add num channels to audio by @CloseChoice in #7840
What's Changed
_batch_setitems()by @sghng in #7817New Contributors
Full Changelog: huggingface/datasets@4.3.0...4.4.0
v4.3.0Compare Source
Dataset Features
Enable large scale distributed dataset streaming:
These improvements require
huggingface_hub>=1.1.0to take full effectWhat's Changed
from_generatorby @simonreise in #7533New Contributors
Full Changelog: huggingface/datasets@4.2.0...4.3.0
v4.2.0Compare Source
Dataset Features
Sample without replacement option when interleaving datasets by @radulescupetru in #7786
Parquet: add
on_bad_filesargument to error/warn/skip bad files by @lhoestq in #7806Add parquet scan options and docs by @lhoestq in #7801
What's Changed
New Contributors
Full Changelog: huggingface/datasets@4.1.1...4.2.0
v4.1.1Compare Source
What's Changed
New Contributors
Full Changelog: huggingface/datasets@4.1.0...4.1.1
v4.1.0Compare Source
Dataset Features
feat: use content defined chunking by @kszucs in #7589
use_content_defined_chunking=Truewhen writing Parquet filesConcurrent push_to_hub by @lhoestq in #7708
Concurrent IterableDataset push_to_hub by @lhoestq in #7710
HDF5 support by @klamike in #7690
Other improvements and bug fixes
train_test_splitby @qgallouedec in #7736New Contributors
Full Changelog: huggingface/datasets@4.0.0...4.1.0
v4.0.0Compare Source
New Features
Add
IterableDataset.push_to_hub()by @lhoestq in #7595Add
num_proc=to.push_to_hub()(Dataset and IterableDataset) by @lhoestq in #7606New
ColumnobjectTorchcodec decoding by @TyTodd in #7616
torch>=2.7.0and FFmpeg >= 4datasets<4.0AudioDecoder:VideoDecoder:Breaking changes
Remove scripts altogether by @lhoestq in #7592
trust_remote_codeis no longer supportedTorchcodec decoding by @TyTodd in #7616
Replace Sequence by List by @lhoestq in #7634
ListtypeSequencewas a legacy type from tensorflow datasets which converted list of dicts to dicts of lists. It is no longer a type but it becomes a utility that returns aListor adictdepending on the subfeatureOther improvements and bug fixes
Dataset.mapto reuse cache files mapped with differentnum_procby @ringohoffman in #7434RepeatExamplesIterableby @SilvanCodes in #7581_dill.pyto useco_linetablefor Python 3.10+ in place ofco_lnotabby @qgallouedec in #7609New Contributors
Configuration
📅 Schedule: (UTC)
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.