These scripts run ONLY on the production server. They perform database writes that must be executed by a single instance to prevent:
- Race conditions
- Duplicate data entries
- Database constraint violations
For local development:
- ❌ Do NOT run any scraper scripts
- ❌ Do NOT set up cron jobs for these scripts
- ✅ Use the web application only (read-only operations)
This directory contains scripts for managing the CloudStack QA Portal's GitHub PR scraper and monitoring systems.
Purpose: Main GitHub PR scraper - collects PR data including approvals, smoke tests, code coverage, and labels.
Usage:
# Scrape all open PRs (default)
node scripts/scrape-github-prs.js
# Scrape specific PR
node scripts/scrape-github-prs.js --pr-number=12345
# Scrape all PRs (including those without health checks)
node scripts/scrape-github-prs.js --all
Features:
- Scrapes LGTM approvals/reviews/rejections
- Collects codecov coverage data
- Extracts Trillian smoketest results per hypervisor
- Updates PR states (open/closed/merged)
- Rate limit aware with delays
Requirements:
- GitHub token in server/.env
- Database connection
- Runs automatically via cron every 30 minutes
Purpose: Batch update PR states (open/closed/merged) for all PRs in database.
Usage:
# Update oldest 50 PRs (default)
node scripts/update-pr-states.js
# Update specific number of PRs
node scripts/update-pr-states.js --batch=100
# Update all PRs (slow!)
node scripts/update-pr-states.js --all
Features:
- Rate limit aware (200ms delay between requests)
- Shows remaining API quota
- Updates state and last_checked timestamp
- Processes oldest PRs first
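To get a feel for how long a batch takes, the 200ms delay can be multiplied out. The sketch below is a rough lower bound only: it assumes one delay per PR and ignores network and database time, and `batch_delay_ms` is a hypothetical helper, not part of update-pr-states.js.

```shell
#!/bin/sh
# batch_delay_ms BATCH [DELAY_MS]
# Print the total inter-request delay, in milliseconds, for BATCH requests.
# DELAY_MS defaults to 200, the delay quoted in the feature list above.
# Assumption: exactly one delay per PR; the real script may differ.
batch_delay_ms() {
    batch=$1
    delay_ms=${2:-200}
    echo $(( batch * delay_ms ))
}
```

For the default batch of 50 this gives 10000 ms, so about ten seconds of deliberate waiting per run before any request time is counted.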
Purpose: Manage the GitHub scraper lifecycle.
Usage:
# Show status
./scripts/manage-scraper.sh status
# Start scraper cron job
./scripts/manage-scraper.sh start
# Stop scraper cron job
./scripts/manage-scraper.sh stop
# View logs
./scripts/manage-scraper.sh logs
# Run once manually
./scripts/manage-scraper.sh run-once
Purpose: Monitor scraper health and restart if crashed. Runs via cron every 30 minutes.
Features:
- Checks if scraper is running
- Verifies it's not hung/frozen
- Restarts if issues detected
- Logs to /tmp/scraper-monitor.log
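How monitor-scraper.sh decides that the scraper is hung is not described here. One common approach, sketched below under that caveat, is to treat the scraper as stalled when its log file has not been written to for longer than some threshold. The function name `log_is_stale` and the 60-minute threshold in the usage line are assumptions for illustration.

```shell
#!/bin/sh
# log_is_stale FILE MINUTES
# Succeed (exit 0) if FILE is missing or was last modified more than
# MINUTES minutes ago; fail (exit 1) if it was written to recently.
log_is_stale() {
    file=$1
    minutes=$2
    # A missing log is treated as stale: nothing has ever been written.
    [ -e "$file" ] || return 0
    # find prints the path only when its mtime is older than the threshold.
    [ -n "$(find "$file" -mmin +"$minutes" 2>/dev/null)" ]
}

# Usage sketch (not executed here; the threshold is a guess):
#   if log_is_stale /var/log/cloudstack-pr-scraper.log 60; then
#       echo "scraper log is stale" >> /tmp/scraper-monitor.log
#   fi
```

With a 30-minute cron schedule, a threshold of roughly two intervals avoids flagging a run that is merely slow.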
Manual run:
./scripts/monitor-scraper.sh
Purpose: Cron wrapper for the main scraper. Handles environment setup and logging.
Features:
- Sets up PATH and environment
- Logs to /var/log/cloudstack-pr-scraper.log
- Keeps log file manageable (last 1000 lines)
- Called by cron every 30 minutes
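Keeping only the last 1000 lines of a log can be done with `tail` and a temporary file. This is a minimal sketch of that idea, not the wrapper's actual code; `trim_log` is a hypothetical name.

```shell
#!/bin/sh
# trim_log FILE [MAX_LINES]
# Keep only the last MAX_LINES lines of FILE (default 1000).
# Does nothing if FILE does not exist.
trim_log() {
    log_file=$1
    max_lines=${2:-1000}
    [ -f "$log_file" ] || return 0
    tmp_file=$(mktemp) || return 1
    # Copy the tail to a temp file first: redirecting tail straight back
    # into the same file would truncate it before it is read.
    if tail -n "$max_lines" "$log_file" > "$tmp_file"; then
        # Overwrite in place so the file keeps its owner and permissions.
        cat "$tmp_file" > "$log_file"
    fi
    rm -f "$tmp_file"
}

# Usage sketch (not executed here):
#   trim_log /var/log/cloudstack-pr-scraper.log 1000
```

Writing back with `cat` rather than `mv` keeps the original inode, so a running `tail -f` on the log keeps working.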
Purpose: Install cron jobs for scraper and monitoring.
Usage:
./scripts/setup-cron.sh
Installs:
- Scraper: Every 30 minutes
- Monitor: Every 30 minutes
- Portal health check: Every 5 minutes
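Whether setup-cron.sh is safe to run twice is not stated. If you need to add an entry by hand without creating duplicates, a filter along these lines may help. It is a sketch under that assumption, and `add_cron_entry` is a hypothetical helper.

```shell
#!/bin/sh
# add_cron_entry ENTRY
# Read an existing crontab on stdin and print it with ENTRY appended,
# unless an identical line is already present.
add_cron_entry() {
    entry=$1
    current=$(cat)
    # -F: fixed string (cron's * is not a pattern), -x: whole-line match.
    if printf '%s\n' "$current" | grep -Fxq -- "$entry"; then
        printf '%s\n' "$current"
    elif [ -n "$current" ]; then
        printf '%s\n%s\n' "$current" "$entry"
    else
        printf '%s\n' "$entry"
    fi
}

# Usage sketch (not executed here; production server only):
#   crontab -l 2>/dev/null \
#     | add_cron_entry '*/30 * * * * /root/QA-Portal/scripts/scraper-cron.sh' \
#     | crontab -
```

Because the function only transforms text, it can be checked against a saved copy of the crontab before anything is installed.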
Purpose: Setup monitoring and health check systems.
Usage:
./scripts/setup-monitoring.sh
Purpose: Comprehensive health check for scraper system.
Usage:
/tmp/scraper_health_check.sh
Checks:
- Environment files (.env)
- Cron job configuration
- Last scraper run status
- GitHub API rate limits
- Authentication status
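The environment-file check can be approximated by confirming that each required variable has a non-empty value in server/.env. The sketch below is a guess at what such a check might look like, not the health check's actual code; `missing_env_vars` is a hypothetical helper, and the variable names in the usage line come from the environment section of this README.

```shell
#!/bin/sh
# missing_env_vars FILE VAR...
# Print, one per line, each VAR that has no non-empty NAME=value line in FILE.
# Prints nothing when every variable is set.
missing_env_vars() {
    env_file=$1
    shift
    for name in "$@"; do
        # Require at least one character after "=" so "DB_HOST=" counts as missing.
        grep -Eq "^${name}=.+" "$env_file" 2>/dev/null || printf '%s\n' "$name"
    done
}

# Usage sketch (not executed here):
#   missing_env_vars /root/QA-Portal/server/.env \
#       GITHUB_TOKEN DB_HOST DB_PORT DB_NAME DB_USER DB_PASSWORD
```

The function reports names only and never prints values, so its output is safe to write to a log.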
/root/QA-Portal/scripts/
├── README.md (this file)
│
├── scrape-github-prs.js (main scraper)
├── update-pr-states.js (state updater)
│
├── manage-scraper.sh (lifecycle management)
├── monitor-scraper.sh (health monitor)
├── scraper-cron.sh (cron wrapper)
│
├── setup-cron.sh (cron installer)
└── setup-monitoring.sh (monitoring setup)
All scripts require environment variables in server/.env:
# GitHub API
GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxxxxxxx
# Database
DB_HOST=10.0.113.145
DB_PORT=3306
DB_NAME=cloudstack_tests
DB_USER=results
DB_PASSWORD=P@ssword123
Current cron jobs (via crontab -l):
# Scraper - every 30 minutes
*/30 * * * * /root/QA-Portal/scripts/scraper-cron.sh
# Scraper Monitor - every 30 minutes
*/30 * * * * /root/QA-Portal/scripts/monitor-scraper.sh
# Portal Health Check - every 5 minutes
*/5 * * * * /usr/local/bin/qa-portal-monitor.sh
# Main scraper log
tail -f /var/log/cloudstack-pr-scraper.log
# Monitor log
tail -f /tmp/scraper-monitor.log
# Portal health check
tail -f /var/log/qa-portal-monitor.log
Symptoms: "API rate limit exceeded for [IP]"
Solutions:
- Check token is set:
grep GITHUB_TOKEN /root/QA-Portal/server/.env
- Verify authentication:
TOKEN=$(grep GITHUB_TOKEN /root/QA-Portal/server/.env | cut -d'=' -f2)
curl -H "Authorization: token $TOKEN" https://api.github.com/rate_limit
- Should show 5000 limit (authenticated), not 60 (unauthenticated)
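GitHub also reports the limit in the `x-ratelimit-limit` response header, which is easier to check from a script than the JSON body. The following is a small sketch for pulling that value out of a header dump; `rate_limit_from_headers` is a hypothetical helper name.

```shell
#!/bin/sh
# rate_limit_from_headers
# Read HTTP response headers on stdin and print the value of the
# x-ratelimit-limit header (header names are matched case-insensitively).
rate_limit_from_headers() {
    # Strip carriage returns, then split each line on ": ".
    tr -d '\r' | awk -F': *' 'tolower($1) == "x-ratelimit-limit" { print $2; exit }'
}

# Usage sketch (not executed here; TOKEN is read as in the steps above):
#   curl -s -D - -o /dev/null -H "Authorization: token $TOKEN" \
#       https://api.github.com/rate_limit | rate_limit_from_headers
```

A printed value of 60 means the request went out unauthenticated, which usually points to an empty or misread token.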
Solution: Scraper runs every 30 minutes and will pick up new PRs automatically.
Manual sync of specific PR:
cd /root/QA-Portal
node scripts/scrape-github-prs.js --pr-number=XXXXX
Check:
./scripts/manage-scraper.sh status
Restart:
./scripts/manage-scraper.sh start
Check:
mysql -h 10.0.113.145 -u results -p'P@ssword123' cloudstack_tests -e "SHOW TABLES;"
Run comprehensive test:
cd /root/QA-Portal
# Test individual PR
node scripts/scrape-github-prs.js --pr-number=12300
# Test state update
node scripts/update-pr-states.js --batch=10
# Test management
./scripts/manage-scraper.sh status
# Health check
/tmp/scraper_health_check.sh
- Check logs for errors
- Verify scraper is running
- Check rate limit usage
- Review logs for any issues
- Rotate logs
- Review cron job performance
- Update documentation
- Edit scrape-github-prs.js
- Test with single PR:
node scripts/scrape-github-prs.js --pr-number=XXXX
- Check database for changes
- Test with cron wrapper:
./scripts/scraper-cron.sh
- Monitor logs:
tail -f /var/log/cloudstack-pr-scraper.log
See database tables:
-- PR states
SELECT * FROM pr_states LIMIT 5;
-- PR approvals
SELECT * FROM pr_approvals LIMIT 5;
-- Smoke tests
SELECT * FROM pr_smoketests LIMIT 5;
-- Health labels
SELECT * FROM pr_health_labels LIMIT 5;

| Task | Command |
|---|---|
| Manual scrape | node scripts/scrape-github-prs.js |
| Scrape specific PR | node scripts/scrape-github-prs.js --pr-number=XXXXX |
| Update states | node scripts/update-pr-states.js |
| Check status | ./scripts/manage-scraper.sh status |
| View logs | tail -f /var/log/cloudstack-pr-scraper.log |
| Health check | /tmp/scraper_health_check.sh |
| Restart scraper | ./scripts/manage-scraper.sh start |
For issues or questions:
- Check logs first
- Run health check:
/tmp/scraper_health_check.sh
- Review this documentation
- Check TROUBLESHOOTING.md (if exists)
Last Updated: 2025-12-19
Version: 1.0
Maintainer: CloudStack QA Team