A Google Cloud Function that processes podcast RSS feeds, downloads audio files to Google Cloud Storage, and manages podcast metadata in BigQuery.
- Cloud Functions (2nd Gen): HTTP-triggered function for podcast processing
- Google Cloud Storage: Audio file storage
- BigQuery: Podcast metadata and episode data storage
- Firebase Authentication: Request authentication and authorization
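
As a rough illustration of the RSS step, the sketch below pulls episode titles and audio URLs out of a feed using only the standard library. It assumes an RSS 2.0 feed that lists the newest items first and carries audio in `<enclosure>` tags. The function name `latest_episodes` and the sample feed are hypothetical, and the project's own `rss_parser.py` may work differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only to exercise the sketch.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Show</title>
    <item>
      <title>Episode 2</title>
      <enclosure url="https://example.com/ep2.mp3" type="audio/mpeg" length="0"/>
    </item>
    <item>
      <title>Text-only announcement</title>
    </item>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://example.com/ep1.mp3" type="audio/mpeg" length="0"/>
    </item>
  </channel>
</rss>"""


def latest_episodes(rss_xml: str, num_episodes: int) -> list[dict]:
    """Return up to num_episodes items that have an audio enclosure."""
    root = ET.fromstring(rss_xml)
    episodes = []
    for item in root.iter("item"):
        if len(episodes) >= num_episodes:
            break
        enclosure = item.find("enclosure")
        # Skip items without downloadable audio.
        if enclosure is None or not enclosure.get("url"):
            continue
        episodes.append({
            "title": (item.findtext("title") or "").strip(),
            "audio_url": enclosure.get("url"),
        })
    return episodes


if __name__ == "__main__":
    for episode in latest_episodes(SAMPLE_FEED, 2):
        print(episode["title"], episode["audio_url"])
```

The returned URLs are what a download step would then copy into Cloud Storage.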
- Google Cloud Project with the following APIs enabled:
  - Cloud Functions API
  - Cloud Build API
  - Cloud Run API
  - Cloud Storage API
  - BigQuery API
  - Firebase Authentication API
- Required IAM Roles for deployment:
  - Cloud Functions Admin
  - Cloud Run Admin
  - Cloud Build Service Account
  - Storage Admin
  - BigQuery Admin
- Local Development Tools:
  - Google Cloud SDK (gcloud CLI)
  - Python 3.11+
  - Git
git clone <repository-url>
cd seeker

Create or update the following configuration files:

Deployment configuration (.env.deploy):
PROJECT_ID=your-project-id
SERVICE_NAME=seeker-podcast-processor
REGION=us-central1
MEMORY=2Gi
CPU=2
TIMEOUT=3600
CONCURRENCY=1000
MAX_INSTANCES=5
PLATFORM=linux/amd64

Runtime environment variables (.env):
GCP_PROJECT_ID="your-project-id"
GCS_BUCKET_NAME="your-bucket-name"
BIGQUERY_DATASET_ID="your-dataset-id"

Podcast RSS configuration (config/podcasts.json):
{
"All In": {
"rss": "https://allinchamathjason.libsyn.com/rss"
},
"Lex Fridman": {
"rss": "https://lexfridman.com/feed/podcast/"
}
}

- Create a service account for the Cloud Function:
gcloud iam service-accounts create seeker-podcast-processor \
--display-name="Seeker Podcast Processor"

- Grant necessary permissions:
# BigQuery permissions
gcloud projects add-iam-policy-binding your-project-id \
--member="serviceAccount:seeker-podcast-processor@your-project-id.iam.gserviceaccount.com" \
--role="roles/bigquery.admin"
# Cloud Storage permissions
gcloud projects add-iam-policy-binding your-project-id \
--member="serviceAccount:seeker-podcast-processor@your-project-id.iam.gserviceaccount.com" \
--role="roles/storage.admin"
# Firebase Admin permissions
gcloud projects add-iam-policy-binding your-project-id \
--member="serviceAccount:seeker-podcast-processor@your-project-id.iam.gserviceaccount.com" \
--role="roles/firebase.sdkAdminServiceAgent"

Grant the Cloud Build service account the permissions it needs to deploy Cloud Functions:
# Get your project number
PROJECT_NUMBER=$(gcloud projects describe your-project-id --format="value(projectNumber)")
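# Grant the Cloud Run Admin role to the service account that Cloud Build runs as.
# The member below is the Compute Engine default service account; if your project
# uses the legacy Cloud Build service account, use
# ${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com instead.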
gcloud projects add-iam-policy-binding your-project-id \
--member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
--role="roles/run.admin"

Create the required BigQuery dataset and tables:
# Create dataset
bq mk --dataset your-project-id:your-dataset-id
# Create tables (schema definitions needed - see BigQuery Schema section)

- Authenticate with Google Cloud:
gcloud auth login
gcloud config set project your-project-id

- Deploy using Cloud Build:
gcloud builds submit --config=cloudbuild.yaml

Alternatively, deploy directly with gcloud:
gcloud functions deploy seeker-podcast-processor \
--gen2 \
--runtime=python311 \
--region=us-central1 \
--source=. \
--entry-point=cloud_function_entrypoint \
--trigger-http \
--allow-unauthenticated \
--memory=2Gi \
--timeout=3600s \
--max-instances=5 \
--concurrency=1000

gcloud functions describe seeker-podcast-processor \
--region=us-central1 \
--format="value(state,url)"

# Test authentication (should return 401)
curl -X POST https://us-central1-your-project-id.cloudfunctions.net/seeker-podcast-processor \
-H "Content-Type: application/json" \
-d '{"podcast_name": "All In", "num_episodes": 1}'
# Expected response: {"message":"Authorization header missing."}

gcloud functions logs read seeker-podcast-processor \
--region=us-central1 \
--limit=10

Endpoint: https://us-central1-your-project-id.cloudfunctions.net/seeker-podcast-processor
Method: POST
Headers:
Content-Type: application/json
Authorization: Bearer <firebase-id-token>
Request Body:
{
"podcast_name": "All In",
"num_episodes": 2
}

Response (Success):
{
"message": "Successfully processed 2 episodes for 'All In'."
}

The function requires Firebase Authentication in production. For local testing, authentication is bypassed automatically:
- The function detects whether it is running locally or deployed
- Authentication is skipped when the K_SERVICE environment variable is not present
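
The check described above might look roughly like the following sketch. It assumes that the presence of `K_SERVICE` is the only signal used and that token verification is delegated to a callable passed in by the caller (for example `firebase_admin.auth.verify_id_token`). The function names and every message except "Authorization header missing." are hypothetical, and the project's `auth_handler.py` may differ.

```python
import os


def is_deployed(environ=None) -> bool:
    """True when K_SERVICE is set, as it is for deployed Cloud Run services."""
    env = os.environ if environ is None else environ
    return "K_SERVICE" in env


def check_authorization(headers, verify_token, environ=None):
    """Return (status_code, body) for a request.

    verify_token is a caller-supplied callable (an assumption of this sketch)
    that raises an exception when the token is invalid.
    """
    if not is_deployed(environ):
        # Local run: authentication is skipped.
        return 200, {"message": "Authentication skipped (local run)."}

    auth_header = headers.get("Authorization", "")
    if not auth_header:
        return 401, {"message": "Authorization header missing."}

    scheme, _, token = auth_header.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        # Placeholder message, not taken from the project.
        return 401, {"message": "Malformed Authorization header."}

    try:
        verify_token(token)
    except Exception:
        # Placeholder message, not taken from the project.
        return 401, {"message": "Invalid token."}
    return 200, {"message": "Authorized."}
```

Passing `environ` explicitly keeps the check easy to exercise without changing the real process environment.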
pip install -r requirements.txt

# Set up Google Cloud credentials
export GOOGLE_APPLICATION_CREDENTIALS="path/to/credentials.json"
# Set environment variables
export GCP_PROJECT_ID="your-project-id"
export GCS_BUCKET_NAME="your-bucket-name"
export BIGQUERY_DATASET_ID="your-dataset-id"

cd src
python main.py

- Import Errors:
  - Ensure all imports use absolute paths (from src.module import ...)
  - Check that all required dependencies are in requirements.txt
- Permission Errors:
  - Verify the Cloud Build service account has roles/run.admin
  - Check that the function service account has BigQuery and Storage permissions
- Container Startup Failures:
  - Check that --entry-point=cloud_function_entrypoint is specified
  - Verify the function listens on PORT=8080
- Authentication Issues:
  - Ensure the Firebase project is properly configured
  - Verify the service account has Firebase admin permissions

View Function Logs:
gcloud functions logs read seeker-podcast-processor --region=us-central1

View Build Logs:
gcloud builds list --limit=5
gcloud builds log <build-id>

Monitor Cloud Run Service:
gcloud run services describe seeker-podcast-processor --region=us-central1

The function expects the following BigQuery tables in your dataset:
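
Table 1, one row per show:
- id (STRING)
- title (STRING)
- sanitizedTitle (STRING)
- description (STRING)
- imageUrl (STRING)
- rssUrl (STRING)
- websiteUrl (STRING)
- language (STRING)
- tags (STRING, REPEATED)
- lastUpdated (TIMESTAMP)

Table 2, one row per episode (showId references a show):
- id (STRING)
- showId (STRING)
- title (STRING)
- sanitizedTitle (STRING)
- description (STRING)
- publishedDate (TIMESTAMP)
- durationSeconds (INTEGER)
- originalAudioUrl (STRING)
- audioId (STRING)

Table 3, one row per stored audio file:
- id (STRING)
- gcsBucket (STRING)
- gcsObjectPath (STRING)
- fileSize (INTEGER)

Table 4, one row per named person (audioId references an audio file):
- id (STRING)
- full_name (STRING)
- aliases (STRING)
- audioId (STRING)
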
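
Since the tables have to be created by hand, one option is to turn a field list into a BigQuery JSON schema file. The sketch below does this for the first table listed above. The helper name `to_bq_schema` is hypothetical, and because this README states no mode other than REPEATED, NULLABLE is assumed for every other field.

```python
import json

# Fields of the first table above. Entries are (name, type) or
# (name, type, mode); NULLABLE as the default mode is an assumption.
SHOW_FIELDS = [
    ("id", "STRING"),
    ("title", "STRING"),
    ("sanitizedTitle", "STRING"),
    ("description", "STRING"),
    ("imageUrl", "STRING"),
    ("rssUrl", "STRING"),
    ("websiteUrl", "STRING"),
    ("language", "STRING"),
    ("tags", "STRING", "REPEATED"),
    ("lastUpdated", "TIMESTAMP"),
]


def to_bq_schema(fields):
    """Convert field tuples into BigQuery's JSON schema structure."""
    schema = []
    for field in fields:
        name, field_type, *rest = field
        mode = rest[0] if rest else "NULLABLE"
        schema.append({"name": name, "type": field_type, "mode": mode})
    return schema


if __name__ == "__main__":
    # Redirect this output to a file such as schema.json.
    print(json.dumps(to_bq_schema(SHOW_FIELDS), indent=2))
```

The resulting file can be passed to `bq mk --table your-project-id:your-dataset-id.your-table-name schema.json`, where `your-table-name` is a placeholder for the table name your deployment uses.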
seeker/
├── README.md # This file
├── requirements.txt # Python dependencies
├── cloudbuild.yaml # Cloud Build configuration
├── main.py # Root entry point
├── .env # Runtime environment variables
├── .env.deploy # Deployment configuration
├── config/
│ └── podcasts.json # Podcast RSS configurations
└── src/
    ├── main.py # Main function implementation
    ├── auth_handler.py # Firebase authentication
    ├── rss_parser.py # RSS feed parsing
    ├── gcs_handler.py # Google Cloud Storage operations
    ├── bq_handler.py # BigQuery operations
    ├── logger.py # Logging configuration
    └── utils.py # Utility functions

- Follow the existing code structure and import patterns
- Update requirements.txt for new dependencies
- Test locally before deploying
- Update this README for any configuration changes
[Your License Here]