QA Portal Scripts Documentation

⚠️ PRODUCTION ONLY - DO NOT RUN LOCALLY

These scripts run ONLY on the production server. They perform database writes that must be executed by a single instance to prevent:

Race conditions
Duplicate data entries
Database constraint violations
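
One common way to enforce a single-instance rule like this is an exclusive lock file. The sketch below is illustrative only: the lock path and the `acquireLock` helper are hypothetical and are not taken from the scripts in this directory, which may guard against concurrent runs differently (or rely on there being only one production host).

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical lock location, chosen for illustration only.
const DEFAULT_LOCK = path.join(os.tmpdir(), 'qa-portal-scraper.lock');

// Try to take an exclusive lock.
// Returns a release function, or null if another process already holds it.
function acquireLock(lockPath = DEFAULT_LOCK) {
  let fd;
  try {
    // The 'wx' flag fails with EEXIST if the file exists, so creation is atomic.
    fd = fs.openSync(lockPath, 'wx');
  } catch (err) {
    if (err.code === 'EEXIST') return null;
    throw err;
  }
  // Record the owner PID so a stale lock can be investigated by hand.
  fs.writeSync(fd, String(process.pid));
  fs.closeSync(fd);
  return () => fs.unlinkSync(lockPath);
}
```

A script would call `acquireLock()` before its first database write, exit early on `null`, and call the returned function when it finishes. A crashed process leaves the file behind, so a real implementation would also need stale-lock handling.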

For local development:

❌ Do NOT run any scraper scripts
❌ Do NOT set up cron jobs for these scripts
✅ Use the web application only (read-only operations)

Overview

This directory contains scripts for managing the CloudStack QA Portal's GitHub PR scraper and monitoring systems.

Core Scraper Scripts

scrape-github-prs.js

Purpose: Main GitHub PR scraper - collects PR data including approvals, smoke tests, code coverage, and labels.

Usage:

# Scrape all open PRs (default)
node scripts/scrape-github-prs.js

# Scrape specific PR
node scripts/scrape-github-prs.js --pr-number=12345

# Scrape all PRs (including those without health checks)
node scripts/scrape-github-prs.js --all
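
For reference, the flags above could be parsed along the following lines. This is a minimal sketch, not the parsing code in scrape-github-prs.js; the `parseScraperArgs` name and the returned object shape are assumptions made for illustration.

```javascript
// Parse the documented flags (--pr-number=N, --all) from an argv-style array.
// With no flags, the result describes the default run (all open PRs).
function parseScraperArgs(argv) {
  const opts = { prNumber: null, all: false };
  for (const arg of argv) {
    if (arg === '--all') {
      opts.all = true;
    } else if (arg.startsWith('--pr-number=')) {
      const raw = arg.slice('--pr-number='.length);
      // Accept only a positive decimal integer.
      if (!/^[1-9]\d*$/.test(raw)) {
        throw new Error(`Invalid PR number in "${arg}"`);
      }
      opts.prNumber = Number(raw);
    } else {
      throw new Error(`Unknown argument: ${arg}`);
    }
  }
  return opts;
}
```

In a script this would typically be called as `parseScraperArgs(process.argv.slice(2))`.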

Features:

Scrapes LGTM approvals/reviews/rejections
Collects codecov coverage data
Extracts Trillian smoketest results per hypervisor
Updates PR states (open/closed/merged)
Rate limit aware with delays

Requirements:

GitHub token in server/.env
Database connection
Runs automatically via cron every 30 minutes

update-pr-states.js

Purpose: Batch-update PR states (open/closed/merged) for all PRs in the database.

Usage:

# Update oldest 50 PRs (default)
node scripts/update-pr-states.js

# Update specific number of PRs
node scripts/update-pr-states.js --batch=100

# Update all PRs (slow!)
node scripts/update-pr-states.js --all

Features:

Rate limit aware (200ms delay between requests)
Shows remaining API quota
Updates state and last_checked timestamp
Processes oldest PRs first
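
The behaviour listed above (oldest first, fixed batch size, a pause between requests) can be sketched as follows. The `updateOldestStates` function, the shape of the PR records, and the injected `fetchState` callback are assumptions for illustration; the real script talks to the GitHub API and the database directly.

```javascript
// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// prs: array of { number, last_checked } records (hypothetical shape).
// fetchState: async (prNumber) => 'open' | 'closed' | 'merged' (injected).
async function updateOldestStates(prs, fetchState, { batch = 50, delayMs = 200 } = {}) {
  // Sort a copy so the least recently checked PRs come first.
  const oldestFirst = [...prs].sort(
    (a, b) => new Date(a.last_checked).getTime() - new Date(b.last_checked).getTime()
  );
  const updated = [];
  for (const pr of oldestFirst.slice(0, batch)) {
    const state = await fetchState(pr.number);
    updated.push({ number: pr.number, state, last_checked: new Date().toISOString() });
    // Pause between requests to stay well inside the API rate limit.
    await sleep(delayMs);
  }
  return updated;
}
```

Injecting `fetchState` keeps the ordering and throttling logic testable without network access.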

Management Scripts

manage-scraper.sh

Purpose: Manage the GitHub scraper lifecycle.

Usage:

# Show status
./scripts/manage-scraper.sh status

# Start scraper cron job
./scripts/manage-scraper.sh start

# Stop scraper cron job
./scripts/manage-scraper.sh stop

# View logs
./scripts/manage-scraper.sh logs

# Run once manually
./scripts/manage-scraper.sh run-once

monitor-scraper.sh

Purpose: Monitor scraper health and restart the scraper if it has crashed. Runs via cron every 30 minutes.

Features:

Checks if scraper is running
Verifies it's not hung/frozen
Restarts if issues detected
Logs to /tmp/scraper-monitor.log

Manual run:

./scripts/monitor-scraper.sh

scraper-cron.sh

Purpose: Cron wrapper for the main scraper. Handles environment setup and logging.

Features:

Sets up PATH and environment
Logs to /var/log/cloudstack-pr-scraper.log
Keeps log file manageable (last 1000 lines)
Called by cron every 30 minutes
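
The "last 1000 lines" rule amounts to truncating the log from the front. The wrapper itself is a shell script and most likely does this with standard tools; the JavaScript below only illustrates the logic, and `keepLastLines` is a hypothetical helper name.

```javascript
// Return only the last maxLines lines of a log, preserving a trailing newline.
function keepLastLines(text, maxLines = 1000) {
  if (!Number.isInteger(maxLines) || maxLines <= 0) {
    throw new RangeError('maxLines must be a positive integer');
  }
  if (text === '') return '';
  const lines = text.split('\n');
  // A trailing newline produces one empty final element; set it aside.
  const hadTrailingNewline = lines[lines.length - 1] === '';
  if (hadTrailingNewline) lines.pop();
  const kept = lines.slice(-maxLines);
  return kept.join('\n') + (hadTrailingNewline ? '\n' : '');
}
```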

Setup Scripts

setup-cron.sh

Purpose: Install cron jobs for scraper and monitoring.

Usage:

./scripts/setup-cron.sh

Installs:

Scraper: Every 30 minutes
Monitor: Every 30 minutes
Portal health check: Every 5 minutes

setup-monitoring.sh

Purpose: Set up monitoring and health check systems.

Usage:

./scripts/setup-monitoring.sh

Utility Scripts

scraper_health_check.sh (in /tmp)

Purpose: Comprehensive health check for the scraper system.

Usage:

/tmp/scraper_health_check.sh

Checks:

Environment files (.env)
Cron job configuration
Last scraper run status
GitHub API rate limits
Authentication status

Directory Structure

/root/QA-Portal/scripts/
├── README.md                    (this file)
│
├── scrape-github-prs.js        (main scraper)
├── update-pr-states.js         (state updater)
│
├── manage-scraper.sh           (lifecycle management)
├── monitor-scraper.sh          (health monitor)
├── scraper-cron.sh            (cron wrapper)
│
├── setup-cron.sh              (cron installer)
└── setup-monitoring.sh        (monitoring setup)

Environment Variables

All scripts require environment variables in server/.env:

# GitHub API
GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxxxxxxx

# Database
DB_HOST=10.0.113.145
DB_PORT=3306
DB_NAME=cloudstack_tests
DB_USER=results
DB_PASSWORD=P@ssword123
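
To check that a server/.env file defines everything listed above, the file can be parsed as plain KEY=VALUE lines. This is a minimal sketch: `parseEnv` and `missingKeys` are hypothetical helpers, and the scripts themselves may load the file through a library instead.

```javascript
// Keys this README lists as required in server/.env.
const REQUIRED_KEYS = ['GITHUB_TOKEN', 'DB_HOST', 'DB_PORT', 'DB_NAME', 'DB_USER', 'DB_PASSWORD'];

// Parse KEY=VALUE lines, skipping blank lines and # comments.
function parseEnv(text) {
  const vars = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue;
    // Split on the first '=' only, so values may themselves contain '='.
    const eq = line.indexOf('=');
    if (eq <= 0) continue; // not a KEY=VALUE line
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    // Strip one pair of matching surrounding quotes, if present.
    const first = value[0];
    if (value.length >= 2 && (first === '"' || first === "'") && value[value.length - 1] === first) {
      value = value.slice(1, -1);
    }
    vars[key] = value;
  }
  return vars;
}

// List required keys that are absent or empty.
function missingKeys(vars) {
  return REQUIRED_KEYS.filter((key) => !vars[key]);
}
```

For example, `missingKeys(parseEnv(fs.readFileSync('server/.env', 'utf8')))` returns an empty array when all six variables are set.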

Cron Schedule

Current cron jobs (via crontab -l):

# Scraper - every 30 minutes
*/30 * * * * /root/QA-Portal/scripts/scraper-cron.sh

# Scraper Monitor - every 30 minutes
*/30 * * * * /root/QA-Portal/scripts/monitor-scraper.sh

# Portal Health Check - every 5 minutes
*/5 * * * * /usr/local/bin/qa-portal-monitor.sh

Logs

Scraper Logs

# Main scraper log
tail -f /var/log/cloudstack-pr-scraper.log

# Monitor log
tail -f /tmp/scraper-monitor.log

# Portal health check
tail -f /var/log/qa-portal-monitor.log

Troubleshooting

Rate Limit Errors (403)

Symptoms: "API rate limit exceeded for [IP]"

Solutions:

Check token is set: grep GITHUB_TOKEN /root/QA-Portal/server/.env

Verify authentication:

TOKEN=$(grep '^GITHUB_TOKEN=' /root/QA-Portal/server/.env | cut -d'=' -f2-)
curl -H "Authorization: token $TOKEN" https://api.github.com/rate_limit

The response should show a limit of 5000 requests per hour (authenticated), not 60 (unauthenticated).
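
The JSON returned by the rate_limit endpoint can also be interpreted programmatically. The sketch below reads the `resources.core` object of the response; `summarizeRateLimit` is a hypothetical helper, and treating any limit above 60 as "authenticated" is a heuristic based on the numbers quoted above.

```javascript
// Summarize the body of a GET /rate_limit response (already parsed from JSON).
function summarizeRateLimit(body) {
  const core = body && body.resources && body.resources.core;
  if (!core) {
    throw new Error('Unexpected rate_limit response: missing resources.core');
  }
  return {
    // Unauthenticated requests are capped at 60 per hour.
    authenticated: core.limit > 60,
    limit: core.limit,
    remaining: core.remaining,
    // The reset field is a Unix timestamp in seconds.
    resetsAt: new Date(core.reset * 1000).toISOString(),
  };
}
```

If `authenticated` comes back `false`, the token is probably missing from the request or invalid.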

Missing PRs in Portal

Solution: The scraper runs every 30 minutes and picks up new PRs automatically, so a newly opened PR can take up to 30 minutes to appear.

Manual sync of specific PR:

cd /root/QA-Portal
node scripts/scrape-github-prs.js --pr-number=XXXXX

Scraper Not Running

Check:

./scripts/manage-scraper.sh status

Restart:

./scripts/manage-scraper.sh start

Database Connection Issues

Check:

mysql -h 10.0.113.145 -u results -p'P@ssword123' cloudstack_tests -e "SHOW TABLES;"

Testing

Run a comprehensive test:

cd /root/QA-Portal

# Test individual PR
node scripts/scrape-github-prs.js --pr-number=12300

# Test state update
node scripts/update-pr-states.js --batch=10

# Test management
./scripts/manage-scraper.sh status

# Health check
/tmp/scraper_health_check.sh

Maintenance Tasks

Daily

Check logs for errors
Verify scraper is running

Weekly

Check rate limit usage
Review logs for any issues

Monthly

Rotate logs
Review cron job performance
Update documentation

Development

Adding New Features

Edit scrape-github-prs.js
Test with single PR: node scripts/scrape-github-prs.js --pr-number=XXXX
Check database for changes
Test with cron wrapper: ./scripts/scraper-cron.sh
Monitor logs: tail -f /var/log/cloudstack-pr-scraper.log

Database Schema

Inspect the database tables:

-- PR states
SELECT * FROM pr_states LIMIT 5;

-- PR approvals
SELECT * FROM pr_approvals LIMIT 5;

-- Smoke tests
SELECT * FROM pr_smoketests LIMIT 5;

-- Health labels
SELECT * FROM pr_health_labels LIMIT 5;

Quick Reference

| Task | Command |
| --- | --- |
| Manual scrape | `node scripts/scrape-github-prs.js` |
| Scrape specific PR | `node scripts/scrape-github-prs.js --pr-number=XXXXX` |
| Update states | `node scripts/update-pr-states.js` |
| Check status | `./scripts/manage-scraper.sh status` |
| View logs | `tail -f /var/log/cloudstack-pr-scraper.log` |
| Health check | `/tmp/scraper_health_check.sh` |
| Restart scraper | `./scripts/manage-scraper.sh start` |

Support

For issues or questions:

Check logs first
Run health check: /tmp/scraper_health_check.sh
Review this documentation
Check TROUBLESHOOTING.md (if it exists)

Last Updated: 2025-12-19
Version: 1.0
Maintainer: CloudStack QA Team
