Hi,
I'm a regulatory policy researcher working on an academic project linking FJC IDB appellate cases to CourtListener opinions and Federal Register rules. I've been working through the bulk data approach following #3173: I installed Postgres on an EC2 instance, loaded the schema, and began loading the dockets CSV using \COPY. Before completing that process, however, I ran into the opinions file size problem. I have ~35,000–40,000 target cluster_ids and only need plain_text for those rows. The file downloaded successfully from S3, but decompressing it with bzip2 -d on my EC2 instance (t3.large, ~200GB EBS) grew the output file to ~78GB before the disk filled up and killed the process (larger than I expected).

Two questions:
Streaming filter: Is it feasible to filter without fully decompressing (e.g., bzcat opinions.csv.bz2 | python filter_by_cluster_id.py > filtered_opinions.csv)? If so, has anyone done this successfully with the CourtListener opinions file?
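For reference, here's the rough filter script I have in mind (a sketch: the cluster_id column name, the target_cluster_ids.txt list, and standard CSV quoting are my assumptions; I'd adjust the dialect if the dump uses nonstandard quoting):

```python
import csv
import sys


def filter_rows(reader, writer, wanted_ids, id_column="cluster_id"):
    """Stream CSV rows, writing the header plus only those rows whose
    id_column value is in wanted_ids. Returns the number of rows kept."""
    header = next(reader)
    idx = header.index(id_column)  # assumes the column exists in the header
    writer.writerow(header)
    kept = 0
    for row in reader:
        if row[idx] in wanted_ids:
            writer.writerow(row)
            kept += 1
    return kept


# Intended invocation, so nothing is ever fully decompressed to disk:
#   bzcat opinions-2025-12-02.csv.bz2 | python filter_by_cluster_id.py > filtered_opinions.csv
#
# where the script body would be roughly (filenames are hypothetical):
#   csv.field_size_limit(sys.maxsize)  # plain_text cells can be very large
#   with open("target_cluster_ids.txt") as f:  # one cluster_id per line
#       wanted = {line.strip() for line in f}
#   filter_rows(csv.reader(sys.stdin), csv.writer(sys.stdout), wanted)
```

Since only the matched rows are buffered to output, peak disk usage would be just the filtered file, not the full ~78GB decompressed CSV.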
Expected size: What is the current expected decompressed size of the opinions CSV? I was working with opinions-2025-12-02.csv.bz2. Is ~78GB accurate for that snapshot, or does that suggest something went wrong during decompression?

Thanks for maintaining this resource! It's been invaluable for my research.