Lazy Tokenization of Meta Type Features Causes Inconsistent Encoding in Block Dataset Building #164

@koriyyy

Problem Description:
When building datasets in blocks, lazy tokenization of "meta" type features leads to inconsistent global mapping across different blocks, causing data mismatches in downstream tasks.

Context:
In fuxictr/preprocess/build_dataset.py, the build_dataset function calls transform, which processes data in blocks when block_size > 0. Each block is processed in parallel using transform_block.

Issue Details:
The problem occurs in fuxictr/preprocess/feature_processor.py during the transformation of "meta" type features. For each block:

  1. The tokenizer's encode_meta method is called
  2. The tokenizer's vocabulary is built or updated from that block's values only
  3. The block data is encoded using this block-specific vocabulary
  4. The encoded block is saved to disk

This results in:

  • Different blocks having different token-to-ID mappings for the same meta feature
  • No consistent global vocabulary across the entire dataset
  • Potential mismatches when aggregating or analyzing data across blocks
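
The effect can be reproduced with a toy example. This is a minimal sketch rather than FuxiCTR code: `ToyTokenizer` is a hypothetical stand-in for the real tokenizer, and `copy.deepcopy` stands in for the pickled copy of the encoder that each worker would receive, assuming `pool` is a `multiprocessing.Pool`.

```python
import copy
import pandas as pd


class ToyTokenizer:
    """Hypothetical stand-in for the real tokenizer (not FuxiCTR code)."""

    def __init__(self):
        self.vocab = {}

    def encode_meta(self, series):
        # Build or extend the vocab from this block only, as described above.
        if not self.vocab:
            self.vocab = {"__PAD__": 0, "__OOV__": 1}
        for token in series.unique():  # order of first appearance in the block
            if token not in self.vocab:
                self.vocab[token] = len(self.vocab)
        return series.map(self.vocab).to_numpy()


shared_tokenizer = ToyTokenizer()
blocks = [pd.Series(["u1", "u2", "u3"]), pd.Series(["u3", "u1"])]

per_block_ids = []
for block in blocks:
    # Assumption: each worker operates on its own copy of the tokenizer.
    worker_copy = copy.deepcopy(shared_tokenizer)
    ids = worker_copy.encode_meta(block)
    per_block_ids.append(dict(zip(block.tolist(), ids.tolist())))

print(per_block_ids[0])
print(per_block_ids[1])
```

In this sketch the token "u3" is encoded as 4 in the first block and as 2 in the second, and `shared_tokenizer.vocab` stays empty because no worker copy writes back to it.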

Code References:

  1. Block Processing (build_dataset.py):
def transform_block(feature_encoder, df_block, filename):
    df_block = feature_encoder.transform(df_block)  # Problem occurs here
    # ... save to parquet

def transform(feature_encoder, ddf, filename, block_size=0):
    if block_size > 0:
        # Process each block independently
        pool.apply_async(transform_block, ...)  # Each block gets its own vocab

  2. Meta Feature Encoding (feature_processor.py):
# Feature encoder's transform; here ddf is a single block
def transform(self, ddf):
    for feature, feature_spec in self.feature_map.features.items():
        if feature_type == "meta":
            tokenizer = self.processor_dict[feature + "::tokenizer"]
            ddf[feature] = tokenizer.encode_meta(col_series)  # Updates vocab per block

  3. Tokenizer Implementation (tokenizer.py):
def encode_meta(self, series):
    word_counts = dict(series.value_counts())
    if len(self.vocab) == 0:
        self.build_vocab(word_counts)
    else:  # update vocab with block data
        self.update_vocab(word_counts.keys())
    series = series.map(lambda x: self.vocab.get(x, self.vocab["__OOV__"]))
    return series.values

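Suggested Solution:
Pre-build a global vocabulary for each "meta" feature before block processing starts, so that every block is encoded with the same fixed token-to-ID mapping.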
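
One possible shape for this, as a rough sketch only: scan all blocks once to collect tokens, assign IDs in a deterministic order, then encode every block against the frozen mapping without updating it. The helper names `build_global_vocab` and `encode_block` are hypothetical and not part of FuxiCTR; the `__PAD__` entry and the ID layout are assumptions for illustration.

```python
import pandas as pd


def build_global_vocab(blocks):
    """First pass: collect every token across all blocks (hypothetical helper)."""
    tokens = set()
    for block in blocks:
        tokens.update(block.dropna().unique())
    vocab = {"__PAD__": 0, "__OOV__": 1}  # assumed reserved IDs
    for token in sorted(tokens):  # sorted, so IDs do not depend on block order
        vocab[token] = len(vocab)
    return vocab


def encode_block(block, vocab):
    """Second pass: encode with the frozen vocab; never mutate it here."""
    oov_id = vocab["__OOV__"]
    return block.map(lambda x: vocab.get(x, oov_id)).to_numpy()


blocks = [pd.Series(["u1", "u2", "u3"]), pd.Series(["u3", "u1"])]
global_vocab = build_global_vocab(blocks)
encoded_blocks = [encode_block(block, global_vocab) for block in blocks]

print(global_vocab)
print([ids.tolist() for ids in encoded_blocks])
```

Because the mapping is read-only during encoding, it can be handed to parallel workers safely: every copy is identical, so a given token gets the same ID in every block.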
