Memory Leak in Large Dataset Processing #1306 #1318
|
Sorry it's taken me a while to reply. After reviewing the code, here's what I see:
Taking those points into account, if you could just send the PR with the cursor iteration style change, we can integrate that quickly and then discuss the other ideas, which will probably need more discussion and/or work. |
|
OK @bconstanzo, I will update once it's done. |
|
@bconstanzo I made the changes; you can check now. |
|
@bconstanzo are you able to review the changes? |
|
Thanks for looking at memory usage on large datasets. Unfortunately this can't be merged as-is: it conflicts with |
#1306
Memory Leak in Large Dataset Processing
Problem Statement
Original Issue
cursor.fetchall() loaded entire SQLite result sets into memory at once
Affected Artifacts
Technical Solution
1. Replaced fetchall() with Cursor Iteration
Before (Memory-Intensive):
Problem: If you have 100,000 photos, this loads all 100,000 rows into memory simultaneously.
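The memory-intensive pattern described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the in-memory database, the photos table, and its columns are invented for the example, and only the fetchall() and len(all_rows) usage mirrors what the description mentions.

```python
import sqlite3

# Hypothetical stand-in for a photos database; table and column names are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, filename TEXT)")
conn.executemany(
    "INSERT INTO photos (filename) VALUES (?)",
    ((f"IMG_{i:05d}.JPG",) for i in range(1000)),
)

cursor = conn.cursor()
cursor.execute("SELECT id, filename FROM photos")

# fetchall() builds one list holding a tuple for every row in the result set.
all_rows = cursor.fetchall()
# The count is cheap to get here, but only because every row is already held in memory.
usageentries = len(all_rows)

data_list = []
for row in all_rows:
    data_list.append((row[0], row[1]))

conn.close()
```

With this shape, peak memory grows with the size of the result set, because all_rows stays alive for the whole loop.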
After (Memory-Efficient):
Benefit: Only one row is in memory at a time, drastically reducing peak memory usage.
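A minimal sketch of the cursor iteration style, under the same assumptions as before (invented photos table, in-memory database). It is meant to show the shape of the loop, not the actual artifact code.

```python
import sqlite3

# Hypothetical stand-in for a photos database; table and column names are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, filename TEXT)")
conn.executemany(
    "INSERT INTO photos (filename) VALUES (?)",
    ((f"IMG_{i:05d}.JPG",) for i in range(1000)),
)

cursor = conn.cursor()
cursor.execute("SELECT id, filename FROM photos")

row_count = 0          # replaces len(all_rows), which needs the full list
total_name_length = 0  # example of per-row work that keeps no reference to the row

# Iterating the cursor fetches rows lazily instead of materializing a list first.
for row in cursor:
    row_count += 1
    total_name_length += len(row[1])

conn.close()
```

The saving applies to the intermediate result list. If the loop appends every row to a list such as data_list, that list still grows with the dataset.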
2. Implementation Details
photosMetadata.py Changes
File: scripts/artifacts/photosMetadata.py
Changes Made:
iOS 12-13 Branch (lines ~277-364):
Replaced all_rows = cursor.fetchall() with for row in cursor:
Removed the usageentries = len(all_rows) calculation
Added row_count tracking for reporting
Temporary .bplist files are removed after processing
iOS 13-14 Branch (lines ~619-708):
iOS 14+ Branch (lines ~976-1042):
Memory Impact:
BeReal.py Changes
File: scripts/artifacts/BeReal.py
Functions Modified:
bereal_messages() (lines ~1910-1966):
Replaced get_sqlite_db_records() with direct cursor iteration
bereal_chat_list() (lines ~1989-2049):
Implementation Pattern:
3. New Helper Functions
File: scripts/ilapfuncs.py
Added two new helper functions for memory-efficient SQLite querying:
get_sqlite_db_records_iter()
Use Case: When you need batches of records but don't want all at once.
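The signature of the helper is not shown here, so the following is only a sketch of batch-wise reading using the standard cursor.fetchmany() call. The name iter_record_batches and its parameters are hypothetical.

```python
import sqlite3


def iter_record_batches(conn, query, batch_size=1000):
    """Hypothetical batch reader: yield lists of at most batch_size rows."""
    cursor = conn.cursor()
    cursor.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:  # an empty list means the result set is exhausted
            break
        yield batch


# Demonstration with an in-memory database and an invented table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, value INTEGER)")
conn.executemany("INSERT INTO records (value) VALUES (?)", ((i,) for i in range(2500)))

batch_sizes = [
    len(batch)
    for batch in iter_record_batches(conn, "SELECT id, value FROM records", batch_size=1000)
]
conn.close()
```

At most batch_size rows are held by the reader at a time, which bounds memory while still allowing per-batch work such as bulk writes.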
get_sqlite_db_cursor_iter()
Use Case: Most memory-efficient option when processing records sequentially.
4. Resource Management Improvements
Database Connection Cleanup
Added finally blocks to ensure connections are closed even if errors occur
Temporary File Cleanup
Benefit: Files are deleted immediately after use, preventing disk and memory buildup.
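The cleanup idea can be sketched with a try/finally around the temporary file. The function read_embedded_plist and the way the blob reaches it are assumptions for illustration; only the idea of deleting the .bplist file right after use comes from the description above.

```python
import os
import plistlib
import tempfile


def read_embedded_plist(blob, temp_dir):
    """Hypothetical helper: write a plist blob to a temp file, parse it, then delete the file."""
    fd, path = tempfile.mkstemp(suffix=".bplist", dir=temp_dir)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(blob)
        with open(path, "rb") as handle:
            return plistlib.load(handle)
    finally:
        os.remove(path)  # removed whether parsing succeeded or raised


# Demonstration with a small binary plist created on the fly.
blob = plistlib.dumps({"name": "IMG_0001.JPG"}, fmt=plistlib.FMT_BINARY)
with tempfile.TemporaryDirectory() as tmp:
    parsed = read_embedded_plist(blob, tmp)
    leftover_files = os.listdir(tmp)
```

When the blob is already in memory, plistlib.loads(blob) would avoid the temporary file entirely; whether that fits the artifact code is not something this description confirms.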
Memory Usage Comparison
photosMetadata.py
BeReal.py
Code Changes Summary
Files Modified
scripts/ilapfuncs.py
Added get_sqlite_db_records_iter() function
Added get_sqlite_db_cursor_iter() function
Updated get_sqlite_db_records() to properly close connections
scripts/artifacts/photosMetadata.py
Replaced fetchall() with cursor iteration
scripts/artifacts/BeReal.py
Replaced get_sqlite_db_records() with direct cursor iteration in bereal_messages() and bereal_chat_list()
Lines Changed
Testing & Verification
Automated Tests
Run the verification script:
Expected Output:
Manual Testing
Monitor Memory Usage:
Test with Large Dataset:
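If no large extraction is at hand, a synthetic database can stand in for one. The sketch below is an assumption about how such a test file might be produced; build_test_db, the photos table, and its columns are invented and do not match the real Photos.sqlite schema.

```python
import os
import sqlite3
import tempfile


def build_test_db(path, n_rows, batch_size=10_000):
    """Hypothetical generator of a large test database, inserted in batches."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(
            "CREATE TABLE photos (id INTEGER PRIMARY KEY, filename TEXT, note TEXT)"
        )
        for start in range(0, n_rows, batch_size):
            stop = min(start + batch_size, n_rows)
            conn.executemany(
                "INSERT INTO photos (filename, note) VALUES (?, ?)",
                ((f"IMG_{i:06d}.JPG", "x" * 64) for i in range(start, stop)),
            )
            conn.commit()  # commit per batch so the pending transaction stays small
    finally:
        conn.close()


# Demonstration: build a modest file and confirm the row count.
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "large_test.db")
    build_test_db(db_path, n_rows=25_000)

    check = sqlite3.connect(db_path)
    (stored_rows,) = check.execute("SELECT COUNT(*) FROM photos").fetchone()
    check.close()
```

Raising n_rows to 100,000 or more gives a dataset in the size range the problem statement mentions.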
Compare Before/After:
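One possible way to compare the two styles is the standard tracemalloc module, sketched below on an invented in-memory table. Note the limitation: tracemalloc only tracks memory allocated by Python objects, not SQLite's internal buffers, so the numbers are indicative rather than a full process measurement.

```python
import sqlite3
import tracemalloc


def make_db(n_rows):
    """Build an in-memory database with an invented items table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO items (payload) VALUES (?)",
        (("x" * 100,) for _ in range(n_rows)),
    )
    return conn


def with_fetchall(conn):
    rows = conn.execute("SELECT id, payload FROM items").fetchall()
    return sum(len(payload) for _id, payload in rows)


def with_iteration(conn):
    return sum(len(payload) for _id, payload in conn.execute("SELECT id, payload FROM items"))


def measure_peak(func, conn):
    """Return (result, peak bytes of Python allocations) for one call."""
    tracemalloc.start()
    try:
        result = func(conn)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


conn = make_db(20_000)
total_fetchall, peak_fetchall = measure_peak(with_fetchall, conn)
total_iteration, peak_iteration = measure_peak(with_iteration, conn)
conn.close()

print(f"fetchall peak:  {peak_fetchall / 1024:.0f} KiB")
print(f"iteration peak: {peak_iteration / 1024:.0f} KiB")
```

Both functions compute the same total, so any difference in the reported peaks comes from how the rows are held, not from the work done on them.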
Performance Metrics
Processing Speed
Memory Stability
Scalability
Best Practices Implemented
Cursor iteration instead of fetchall()
try/finally blocks for guaranteed cleanup
Future Recommendations
Additional Optimizations (Not Yet Implemented)
Other Artifacts to Consider
These artifacts also use fetchall() and may benefit from similar optimization:
mediaLibrary.py
geodMapTiles.py
chrome.py (multiple instances)
slack.py (multiple instances)
Troubleshooting
If Memory is Still High
Check for other fetchall() calls:
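The original command for this step is not shown. As a stand-in, a small script can list remaining calls; find_fetchall_calls is a hypothetical helper, and the directory passed to it is whatever part of the repository you want to scan.

```python
import os
import re
import tempfile

# Matches ".fetchall(" with optional whitespace before the parenthesis.
FETCHALL_CALL = re.compile(r"\.fetchall\s*\(")


def find_fetchall_calls(root):
    """Hypothetical scanner: return (path, line number, line text) for each match under root."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if FETCHALL_CALL.search(line):
                        hits.append((path, lineno, line.strip()))
    return hits


# Demonstration on a throwaway directory containing one matching file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample_artifact.py")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("cursor.execute(query)\nrows = cursor.fetchall()\nprint(len(rows))\n")
    demo_hits = [(os.path.basename(p), n, text) for p, n, text in find_fetchall_calls(tmp)]
```

In a checkout, calling find_fetchall_calls("scripts/artifacts") would report candidates for the same conversion.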
Monitor specific functions:
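A decorator is one lightweight way to watch a single function. The sketch below uses tracemalloc; report_peak_memory and the last_peak_bytes attribute are hypothetical names, and the decorator assumes tracing is not already active elsewhere, since it stops tracing when the call ends.

```python
import functools
import tracemalloc


def report_peak_memory(func):
    """Hypothetical decorator: print the peak Python allocation seen during each call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        try:
            return func(*args, **kwargs)
        finally:
            _current, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            wrapper.last_peak_bytes = peak  # kept for programmatic checks
            print(f"{func.__name__}: peak {peak / 1024:.1f} KiB")

    wrapper.last_peak_bytes = None
    return wrapper


@report_peak_memory
def build_list(n):
    # Stand-in for an artifact function that accumulates rows.
    return [str(i) for i in range(n)]


result = build_list(10_000)
```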
Check data_list accumulation:
If data_list grows very large, consider streaming output instead
Common Issues
"Database is locked" errors:
Temporary files not cleaned up:
Processing still slow:
Technical Deep Dive
Why fetchall() is Memory-Intensive
Why Cursor Iteration is Efficient
Memory Footprint Breakdown
Before (10,000 photos):
After (10,000 photos):
References
Changelog
2024 - Memory Optimization Implementation