Obtaining very different results when guessing or passing object directly to $create_dataset() #238

@luigidolcetti

Hi,

As mentioned in the title, I am getting very different results when I guess the data type, space, dims, and chunks myself compared with passing an object directly to the create_dataset method. I tried to find my mistake by reading the method's code, without success.
For example, if I do the following:

x <- data.frame(a=factor(letters[runif(n = 1E5,1,10)]),b=factor(letters[runif(n = 1E5,1,10)]))
dims <- hdf5r::guess_dim(x)
dtype <- hdf5r::guess_dtype(x,scalar=FALSE,string_len = Inf,ds_dim = dims)
nelem <- hdf5r::guess_nelem(x,dtype = dtype)
chunk_dims <- hdf5r::guess_chunks(space_maxdims = dims, dtype_size = nelem)
space <- hdf5r::guess_space(x,dtype = dtype,chunked = TRUE)

I get

> dims
[1] 100000
> dtype
Class: H5T_COMPOUND
Datatype: H5T_COMPOUND {
      H5T_ENUM {
         undefined integer;
         "a"                1;
         "b"                2;
         "c"                3;
         "d"                4;
         "e"                5;
         "f"                6;
         "g"                7;
         "h"                8;
         "i"                9;
      } "a" : 0;
      H5T_ENUM {
         undefined integer;
         "a"                1;
         "b"                2;
         "c"                3;
         "d"                4;
         "e"                5;
         "f"                6;
         "g"                7;
         "h"                8;
         "i"                9;
      } "b" : 1;
   }
> nelem
[1] 100000
> chunk_dims
[1] 0
> space
Class: H5S
Type: Simple
Dims: 100000
Maxdims: Inf

Clearly these results will cause trouble if I pass them to the method. In particular, shouldn't nelem be 2 rather than the number of rows of the data.frame?
In contrast, if I do:

test <- hdf5r.Extra::h5TryOpen(tempfile(),'w')
test$create_dataset('auto',x)
test[['auto']]

I get what looks like the correct result:

Datatype: H5T_COMPOUND {
      H5T_ENUM {
         H5T_STD_U8LE;
         "a"                1;
         "b"                2;
         "c"                3;
         "d"                4;
         "e"                5;
         "f"                6;
         "g"                7;
         "h"                8;
         "i"                9;
      } "a" : 0;
      H5T_ENUM {
         H5T_STD_U8LE;
         "a"                1;
         "b"                2;
         "c"                3;
         "d"                4;
         "e"                5;
         "f"                6;
         "g"                7;
         "h"                8;
         "i"                9;
      } "b" : 1;
   }
Space: Type=Simple     Dims=100000     Maxdims=Inf
Chunk: 4096

I would appreciate any suggestions on how to reproduce the results of the direct call using the guessing path instead.

Thank you in advance,
Luigi
