Conversation
TheCedarPrince
left a comment
There was a problem hiding this comment.
Thank you so much @ParamThakkar123 for your work! I left a bunch of comments based on our call; apologies for the delay in reviewing with being out sick! Very excited here and happy to review multiple times as needed. Just let me know via ping again!
I think we also will need to implement #14 first before continuing with this PR however.
|
Hey @ParamThakkar123 , I think you are unblocked on this PR now! Could you rebase and address the comments I left here? Thanks and keep up the good work! :D |
Yes @TheCedarPrince I will start working on this now!! |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #13 +/- ##
======================================
Coverage ? 7.46%
======================================
Files ? 5
Lines ? 67
Branches ? 0
======================================
Hits ? 5
Misses ? 62
Partials ? 0 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
TheCedarPrince
left a comment
There was a problem hiding this comment.
Hey Param, very very close here! Some final comments; thanks for the hard work!
| using HealthSampleData | ||
| using Test | ||
|
|
||
| @testset "HealthSampleData.jl" begin |
There was a problem hiding this comment.
Add tests for downloading; you could upload a very small file to HF JuliaHealthDatasets and use that for testing.
There was a problem hiding this comment.
Hey @ParamThakkar123 , you didn't add a test for downloading the test file. Could you do that? Additionally, please note that you can set DataDeps to always be downloaded in CI; read here: https://www.oxinabox.net/DataDeps.jl/stable/z10-for-end-users/#Configuration-1
TheCedarPrince
left a comment
There was a problem hiding this comment.
Hey Param, very very close here! Some final comments; thanks for the hard work!
|
@TheCedarPrince I fixed the error which was occurring during our call yesterday. The |
|
Hey @ParamThakkar123 I tried out the code and it still doesn't work. I did this and it hung indefinitely until I pressed Ctrl+C: julia> HealthSampleData.download_hf_dataset("Test")
[ Info: Downloading Test dataset as DataDep...
[ Info: Resolving Hugging Face metadata for https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets
^C┌ Warning: Failed to resolve dataset metadata: HTTP.Exceptions.RequestError(HTTP.Messages.Request:
│ """
│ GET /api/datasets/https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/revision/main HTTP/1.1
│ Host: huggingface.co
│ Accept: */*
│ User-Agent: HTTP.jl/1.11.7
│ Content-Length: 0
│ Accept-Encoding: gzip
│
│ """, InterruptException()). Falling back to direct download.
└ @ HealthSampleData ~/FOSS/HealthSampleData.jl/src/huggingface.jl:23
[ Info: Downloading penguins.csv from https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets via HuggingFaceHub...
[ Info: Downloading https://huggingface.co/datasets/JuliaHealthOrg/JuliaHealthDatasets/resolve/main/penguins.csv -> /tmp/jl_p2qTDI/penguins.csv
^CERROR: InterruptException:
Stacktrace:
[1] open_nolock
@ ~/.julia/juliaup/julia-1.11.7+0.x64.linux.gnu/share/julia/stdlib/v1.11/ArgTools/src/ArgTools.jl:35 [inlined]
[2] arg_write(f::Function, arg::String)
@ ArgTools ~/.julia/juliaup/julia-1.11.7+0.x64.linux.gnu/share/julia/stdlib/v1.11/ArgTools/src/ArgTools.jl:103
[3] #download#2
@ ~/.julia/juliaup/julia-1.11.7+0.x64.linux.gnu/share/julia/stdlib/v1.11/Downloads/src/Downloads.jl:258 [inlined]
[4] download
@ ~/.julia/juliaup/julia-1.11.7+0.x64.linux.gnu/share/julia/stdlib/v1.11/Downloads/src/Downloads.jl:247 [inlined]
[5] _huggingface_dataset_register(name::String, repo::String, filename::String)
@ HealthSampleData ~/FOSS/HealthSampleData.jl/src/huggingface.jl:42
[6] Test
@ ~/FOSS/HealthSampleData.jl/src/HuggingFaceDatasets/data.jl:33 [inlined]
[7] download_hf_dataset(name::String)
@ HealthSampleData ~/FOSS/HealthSampleData.jl/src/HuggingFaceDatasets/data.jl:69
[8] top-level scope
@ REPL[5]:1I was looking at your code for this and it is doing a variety of weird things -- if you are using LLM-generated code here, the LLM is leading you wrong. Let me make some comments. |
|
I tried this out and it worked for me. I am able to download the datasets and store it into the right place |
|
Will take a look into this again |
TheCedarPrince
left a comment
There was a problem hiding this comment.
Hey @ParamThakkar123 , I left multiple comments as we are getting close. As I pointed out, if you were using LLMs, it led you to doing a lot of strange things in the code itself. Please review my comments carefully as this corrects a lot of what the LLM was having you do.
| try | ||
| # Prefer official HuggingFaceHub download if dataset info is available | ||
| if dataset !== nothing | ||
| localpath = HF.file_download(dataset, filename; progress = progress_fn) | ||
| else | ||
| # Direct fallback if HF.info failed | ||
| url = "$repo/resolve/main/$filename" | ||
| tmpdir = mktempdir() | ||
| dest = joinpath(tmpdir, filename) | ||
| @info "Downloading $url -> $dest" | ||
| Downloads.download(url, dest; progress = progress_fn) | ||
| localpath = dest | ||
| end | ||
| @info "Downloaded to $localpath" | ||
| return localpath | ||
|
|
||
| catch e | ||
| msg = string(e) | ||
| if occursin("symlink", msg) || occursin("creating symlinks", msg) || | ||
| occursin("Administrator", msg) || occursin("operation not permitted", msg) | ||
|
|
||
| @warn "Symlink creation failed (likely Windows privilege issue). Falling back to direct HTTP download: $e" | ||
| url = "$repo/resolve/main/$filename" | ||
| tmpdir = mktempdir() | ||
| dest = joinpath(tmpdir, filename) | ||
| @info "Downloading $url -> $dest (no symlink)" | ||
| Downloads.download(url, dest; progress = progress_fn) | ||
| localpath = dest | ||
| @info "Fallback download complete: $localpath" | ||
| return localpath | ||
| else | ||
| rethrow(e) | ||
| end |
There was a problem hiding this comment.
I would instead just turn this into a simple try catch block that does:
try
localpath = HF.file_download(dataset, filename)
catch e
@error "Dataset X had an error; please open an issue at HEALTHSAMPLEDATA-URL"
endWe should be forcing it to use HuggingFaceHub and if it fails, that is an error from HF that we would need to address. The other content I really do not know what it is there for because DataDeps handles all the pathing.
There was a problem hiding this comment.
The other things are for symlink errors that occured frequently with windows
| using HealthSampleData | ||
| using Test | ||
|
|
||
| @testset "HealthSampleData.jl" begin |
There was a problem hiding this comment.
Hey @ParamThakkar123 , you didn't add a test for downloading the test file. Could you do that? Additionally, please note that you can set DataDeps to always be downloaded in CI; read here: https://www.oxinabox.net/DataDeps.jl/stable/z10-for-end-users/#Configuration-1
|
Hello @TheCedarPrince , Here is a complete issue log of the issues I am facing. I have updated the PR, so it will be easy to reproduce : The current issue is : HuggingFaceDatasets - Test dataset download: Error During Test at E:\HealthSampleData.jl\test\runtests.jl:11
Got exception outside of a @test
RequestError: URL rejected: No host part in the URL while requesting /api/resolve-cache/datasets/JuliaHealthOrg/JuliaHealthDatasets/8c44aa0cf9ce035ae820fb9ec2cedfba73101435/penguins.csv?%2Fdatasets%2FJuliaHealthOrg%2FJuliaHealthDatasets%2Fresolve%2F8c44aa0cf9ce035ae820fb9ec2cedfba73101435%2Fpenguins.csv=&etag=%2225b46d384bf81f8399188500ea54917bb49d8890%22
Stacktrace:
[1] (::Downloads.var"#23#24"{IOStream, Base.DevNull, String, Vector{Any}, Float64, Nothing, Bool, Nothing, Bool, Nothing, String, Bool, Bool})(easy::Downloads.Curl.Easy)
@ Downloads E:\Julia-1.12.1\share\julia\stdlib\v1.12\Downloads\src\Downloads.jl:452
[2] with_handle(f::Downloads.var"#23#24"{IOStream, Base.DevNull, String, Vector{Any}, Float64, Nothing, Bool, Nothing, Bool, Nothing, String, Bool, Bool}, handle::Downloads.Curl.Easy)
@ Downloads.Curl E:\Julia-1.12.1\share\julia\stdlib\v1.12\Downloads\src\Curl\Curl.jl:105
[3] #21
@ E:\Julia-1.12.1\share\julia\stdlib\v1.12\Downloads\src\Downloads.jl:363 [inlined]
[4] arg_write(f::Downloads.var"#21#22"{Base.DevNull, String, Vector{Any}, Float64, Nothing, Bool, Nothing, Bool, Nothing, String, Bool, Bool}, arg::IOStream)
@ ArgTools E:\Julia-1.12.1\share\julia\stdlib\v1.12\ArgTools\src\ArgTools.jl:134
[5] #19
@ E:\Julia-1.12.1\share\julia\stdlib\v1.12\Downloads\src\Downloads.jl:362 [inlined]
[6] arg_read
@ E:\Julia-1.12.1\share\julia\stdlib\v1.12\ArgTools\src\ArgTools.jl:76 [inlined]
[7] request(url::String; input::Nothing, output::IOStream, method::String, headers::Vector{Any}, timeout::Float64, progress::Nothing, verbose::Bool, debug::Nothing, throw::Bool, downloader::Nothing, interrupt::Nothing)
@ Downloads E:\Julia-1.12.1\share\julia\stdlib\v1.12\Downloads\src\Downloads.jl:361
[8] request
@ E:\Julia-1.12.1\share\julia\stdlib\v1.12\Downloads\src\Downloads.jl:328 [inlined]
[9] #4
@ E:\Julia-1.12.1\share\julia\stdlib\v1.12\Downloads\src\Downloads.jl:259 [inlined]
[10] open(f::Downloads.var"#4#5"{String, Vector{Any}, Float64, Nothing, Bool, Nothing, Nothing, String}, args::String; kwargs::@Kwargs{write::Bool, lock::Bool})
@ Base .\io.jl:410
[11] open_nolock
@ E:\Julia-1.12.1\share\julia\stdlib\v1.12\ArgTools\src\ArgTools.jl:35 [inlined]
[12] arg_write(f::Function, arg::String)
@ ArgTools E:\Julia-1.12.1\share\julia\stdlib\v1.12\ArgTools\src\ArgTools.jl:103
[13] #download#2
@ E:\Julia-1.12.1\share\julia\stdlib\v1.12\Downloads\src\Downloads.jl:258 [inlined]
[14] download
@ E:\Julia-1.12.1\share\julia\stdlib\v1.12\Downloads\src\Downloads.jl:247 [inlined]
[15] file_download(repo::HuggingFaceHub.Dataset, path::String; client::HuggingFaceHub.Client, progress::Nothing)
@ HuggingFaceHub C:\Users\Hp\.julia\packages\HuggingFaceHub\dgGqM\src\files.jl:153
[16] file_download
@ C:\Users\Hp\.julia\packages\HuggingFaceHub\dgGqM\src\files.jl:53 [inlined]
[17] _huggingface_dataset_register(name::String, repo::String, filename::String)
@ HealthSampleData E:\HealthSampleData.jl\src\huggingface.jl:22
[18] Test()
@ HealthSampleData E:\HealthSampleData.jl\src\HuggingFaceDatasets\data.jl:21
[19] download_hf_dataset(name::String)
@ HealthSampleData E:\HealthSampleData.jl\src\HuggingFaceDatasets\data.jl:57
[20] top-level scope
@ E:\HealthSampleData.jl\test\runtests.jl:5
[21] macro expansion
@ E:\Julia-1.12.1\share\julia\stdlib\v1.12\Test\src\Test.jl:1776 [inlined]
[22] macro expansion
@ E:\HealthSampleData.jl\test\runtests.jl:12 [inlined]
[23] macro expansion
@ E:\Julia-1.12.1\share\julia\stdlib\v1.12\Test\src\Test.jl:1776 [inlined]
[24] macro expansion
@ E:\HealthSampleData.jl\test\runtests.jl:12 [inlined]
[25] include(mapexpr::Function, mod::Module, _path::String)
@ Base .\Base.jl:307
[26] top-level scope
@ none:6
[27] eval(m::Module, e::Any)
@ Core .\boot.jl:489
[28] exec_options(opts::Base.JLOptions)
@ Base .\client.jl:283
[29] _start()
@ Base .\client.jl:550
Test Summary: | Pass Error Total Time
HealthSampleData.jl | 4 1 5 6.1s
HuggingFaceDatasets - Test dataset download | 1 1 4.4s
RNG of the outermost testset: Random.Xoshiro(0xee5a5de7590e9b2d, 0x744e9f964adc9fa0, 0xe20203f4ea99eee6, 0x17fdee9386999f4b, 0x5d41d67b7cd97f98)
ERROR: LoadError: Some tests did not pass: 4 passed, 0 failed, 1 errored, 0 broken.
in expression starting at E:\HealthSampleData.jl\test\runtests.jl:4
ERROR: Package HealthSampleData errored during testing |
@TheCedarPrince
This PR Fixes #12
Added a
_huggingface_dataset_registerto register datasets from huggingface hub using DataDeps.jl and implemented aloadfunction to load the dataset usingHealthSampleData.jlmodule for the users