Hi,
I'm a regulatory policy researcher working on an academic project linking FJC IDB appellate cases to CourtListener opinions and Federal Register rules. I've been working through the bulk data approach following #3173: I installed Postgres on an EC2 instance, loaded the schema, and began loading the dockets CSV using \COPY. Before completing that process, however, I ran into the opinions file size problem. I have ~35,000–40,000 target cluster_ids and only need plain_text for those rows. The file downloaded successfully from S3, but decompressing it with bzip2 -d on my EC2 instance (t3.large, ~200GB EBS) grew the output file to ~78GB before the disk filled up and killed the process (larger than I expected).

Two questions:
Streaming filter: Is it feasible to filter without fully decompressing (e.g., bzcat opinions.csv.bz2 | python filter_by_cluster_id.py > filtered_opinions.csv)? If so, has anyone done this successfully with the CourtListener opinions file?
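For reference, here's the rough filter script I have in mind (a sketch: the cluster_id column name, the target_cluster_ids.txt list, and standard CSV quoting are my assumptions; I'd adjust the dialect if the dump uses nonstandard quoting):

```python
import csv
import sys


def filter_rows(reader, writer, wanted_ids, id_column="cluster_id"):
    """Stream CSV rows, writing the header plus only those rows whose
    id_column value is in wanted_ids. Returns the number of rows kept."""
    header = next(reader)
    idx = header.index(id_column)  # assumes the column exists in the header
    writer.writerow(header)
    kept = 0
    for row in reader:
        if row[idx] in wanted_ids:
            writer.writerow(row)
            kept += 1
    return kept


# Intended invocation, so nothing is ever fully decompressed to disk:
#   bzcat opinions-2025-12-02.csv.bz2 | python filter_by_cluster_id.py > filtered_opinions.csv
#
# where the script body would be roughly (filenames are hypothetical):
#   csv.field_size_limit(sys.maxsize)  # plain_text cells can be very large
#   with open("target_cluster_ids.txt") as f:  # one cluster_id per line
#       wanted = {line.strip() for line in f}
#   filter_rows(csv.reader(sys.stdin), csv.writer(sys.stdout), wanted)
```

Since only the matched rows are buffered to output, peak disk usage would be just the filtered file, not the full ~78GB decompressed CSV.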
Expected size: What is the current expected decompressed size of the opinions CSV? I was working with opinions-2025-12-02.csv.bz2. Is ~78GB accurate for that snapshot, or does that suggest something went wrong during decompression?

Thanks for maintaining this resource! It's been invaluable for my research.