Problem Description:
When building datasets in blocks, lazy tokenization of "meta" type features leads to inconsistent global mapping across different blocks, causing data mismatches in downstream tasks.
Context:
In fuxictr/preprocess/build_dataset.py, the build_dataset function calls transform, which processes data in blocks when block_size > 0. Each block is processed in parallel using transform_block.
Issue Details:
The problem occurs in fuxictr/preprocess/feature_processor.py during the transformation of "meta" type features. For each block:
- The tokenizer's encode_meta method is called
- The vocabulary is built (first call) or updated (later calls) from that block's values only
- The block data is encoded using this block-specific vocabulary
- The encoded block is saved to disk
This results in:
- Different blocks having different token-to-ID mappings for the same meta feature
- No consistent global vocabulary across the entire dataset
- Potential mismatches when aggregating or analyzing data across blocks
Code References:
- Block Processing (build_dataset.py):
    def transform_block(feature_encoder, df_block, filename):
        df_block = feature_encoder.transform(df_block)  # Problem occurs here
        # ... save to parquet

    def transform(feature_encoder, ddf, filename, block_size=0):
        if block_size > 0:
            # Process each block independently
            pool.apply_async(transform_block, ...)  # Each block gets its own vocab
- Meta Feature Encoding (feature_processor.py):
    # feature encoder's transform, here the ddf is block ddf
    def transform(self, ddf):
        for feature, feature_spec in self.feature_map.features.items():
            if feature_type == "meta":
                tokenizer = self.processor_dict[feature + "::tokenizer"]
                ddf[feature] = tokenizer.encode_meta(col_series)  # Updates vocab per block
- Tokenizer Implementation (tokenizer.py):
    def encode_meta(self, series):
        word_counts = dict(series.value_counts())
        if len(self.vocab) == 0:
            self.build_vocab(word_counts)
        else:  # update vocabs with block data
            self.update_vocab(word_counts.keys())
        series = series.map(lambda x: self.vocab.get(x, self.vocab["__OOV__"]))
        return series.values
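The mismatch can be reproduced with a small stand-in. This is a minimal sketch, not the FuxiCTR tokenizer: ToyTokenizer, its ID layout (__PAD__ = 0, __OOV__ = 1) and the simplified build_vocab/update_vocab are assumptions made for illustration, plain lists stand in for pandas Series, and copy.deepcopy stands in for each worker process receiving its own copy of the encoder.

```python
import copy
from collections import Counter


class ToyTokenizer:
    """Hypothetical stand-in that mimics only the lazy encode_meta logic."""

    def __init__(self):
        self.vocab = {}

    def build_vocab(self, word_counts):
        # Assumed layout: two reserved IDs, then tokens in sorted order.
        self.vocab = {"__PAD__": 0, "__OOV__": 1}
        self.update_vocab(word_counts.keys())

    def update_vocab(self, words):
        for word in sorted(words):
            if word not in self.vocab:
                self.vocab[word] = len(self.vocab)

    def encode_meta(self, values):
        word_counts = dict(Counter(values))
        if len(self.vocab) == 0:
            self.build_vocab(word_counts)
        else:
            self.update_vocab(word_counts.keys())
        return [self.vocab.get(v, self.vocab["__OOV__"]) for v in values]


tokenizer = ToyTokenizer()
block_a = ["u1", "u2"]
block_b = ["u2", "u3"]

# Each block is encoded by an independent copy, as if in a separate worker.
ids_a = copy.deepcopy(tokenizer).encode_meta(block_a)
ids_b = copy.deepcopy(tokenizer).encode_meta(block_b)

# "u2" is the second token in block_a but the first in block_b,
# so it receives a different ID in each block.
id_of_u2_in_a = ids_a[block_a.index("u2")]
id_of_u2_in_b = ids_b[block_b.index("u2")]
print(id_of_u2_in_a, id_of_u2_in_b)
```

In this sketch the same token "u2" is written to disk under two different IDs, and the original tokenizer object never sees either block's vocabulary.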
Suggested Solutions:
- Pre-build a global vocabulary for each "meta" feature before block processing, then encode every block against that fixed mapping
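One way the suggestion could look is sketched below. It is a rough outline under stated assumptions, not a patch against FuxiCTR: build_global_vocab and encode_block are hypothetical helper names, the reserved-ID layout is assumed, and plain lists stand in for the per-block column data. The idea is a first pass that only collects tokens, followed by a second pass in which the lookup is read-only and therefore safe to run in parallel.

```python
from collections import Counter


def build_global_vocab(blocks):
    """First pass: collect tokens from every block, then assign IDs once."""
    counts = Counter()
    for block in blocks:
        counts.update(block)
    vocab = {"__PAD__": 0, "__OOV__": 1}  # assumed reserved IDs
    for token in sorted(counts):
        vocab[token] = len(vocab)
    return vocab


def encode_block(vocab, block):
    """Second pass: read-only lookup that never mutates the vocabulary."""
    oov_id = vocab["__OOV__"]
    return [vocab.get(token, oov_id) for token in block]


blocks = [["u1", "u2"], ["u2", "u3"]]
vocab = build_global_vocab(blocks)
encoded = [encode_block(vocab, block) for block in blocks]
print(encoded)
```

Because the vocabulary is frozen before any block is encoded, a given token maps to the same ID regardless of which block it appears in or which worker processes it. The cost is one extra scan over the meta column.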