GoogleCloudPlatform · vipnydav · Jun 18, 2025 · Jun 10, 2025 · Jun 17, 2025
diff --git a/read_stall_retry/analysis/README.md b/read_stall_retry/analysis/README.md
@@ -0,0 +1,111 @@
+GCSFuse Stalled Read Retry Analysis Toolkit
+============================================
+
+This directory contains scripts that fetch and analyze Google Cloud Storage FUSE (GCSFuse) logs from Google Cloud Logging. The tools help identify and visualize how often, and when, GCSFuse retries read requests that stalled.
+
+The workflow is flexible: use `fetch_logs.sh` to download logs from Google Cloud Logging to a location of your choice. Then, use the Python analysis scripts (`retries_per_interval.py` and `requests_per_retry_count.py`) to process the downloaded log file by providing its path.
+
+
+Scripts Overview
+----------------
+
+1. `fetch_logs.sh`: A bash script that queries Google Cloud Logging for "stalled read-req" log entries from the `gke-gcsfuse-sidecar` container of pods whose names match a given regex in a GKE cluster. It saves the entries as CSV to the specified output path, or to a default location in `/tmp/` if no path is provided.
+2. `retries_per_interval.py`: A Python script that takes a path to a CSV log file as input and aggregates the "stalled read-req" retry counts into user-defined time intervals. It generates a new CSV file and a bar chart visualization of the results.
+3. `requests_per_retry_count.py`: A Python script that processes a given CSV log file to analyze "stalled read-req" retries. It identifies unique requests by their UUID and provides a summary table showing how many requests were retried a specific number of times.
+
+
+Prerequisites
+-------------
+
+Before you begin, ensure you have the following installed and configured:
+
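+* Google Cloud SDK (`gcloud`): Required by `fetch_logs.sh` to query logs from Google Cloud Logging. You must be authenticated (`gcloud auth login`) and have permission to read log entries in the project that `gcloud` is configured to use. The script also relies on GNU `date` (`date -d`), so run it on Linux or with GNU coreutils available.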
+  - Installation Guide (https://cloud.google.com/sdk/docs/install)
+
+* Python 3: Required to run the analysis scripts.
+
+* Python Libraries: The analysis scripts depend on `pandas` and `matplotlib`. You can install them using the provided `requirements.txt` file:
+
+    `pip install -r requirements.txt`
+
+Usage Workflow
+--------------
+
+The intended workflow is a two-step process:
+
+**Step 1: Fetch Logs**
+
+First, use the `fetch_logs.sh` script to download the GCSFuse logs for the pods whose names match a regular expression.
+
+**Note:** Before running, make sure the script is executable: `chmod +x fetch_logs.sh`
+
+Syntax:
+
+    ./fetch_logs.sh <cluster_name> <pod_name_regex> <start_time> <end_time> [output_file_path]
+
+Examples:
+
+- Fetch logs to a default path:
+
+        ./fetch_logs.sh xpk-large-scale-usc1f-a sample "2025-02-04T18:00:00+05:30" "2025-02-05T10:00:00+05:30"
+    This will save the logs to the file named `/tmp/sample-logs.csv`.
+
+- Fetch logs to a custom path:
+
+        ./fetch_logs.sh xpk-large-scale-usc1f-a "sample-job-.*" "2025-02-04T18:00:00+05:30" "2025-02-05T10:00:00+05:30" "/var/log/gcsfuse/my_analysis.csv"
+    This will save the logs to the file named `/var/log/gcsfuse/my_analysis.csv`.
+
+**Step 2: Analyze the logs**
+
+Once you have the log file, you can use the Python scripts to analyze it.
+
+- **Analyze and Visualize Retries Over Time**
+
+    Use `retries_per_interval.py` to see how the frequency of retries changes over time. Provide the path to the log file created in Step 1.
+
+    Syntax:
+
+        python retries_per_interval.py <log_file_path> [--interval <interval>]
+
+    Example:
+
+        python retries_per_interval.py /tmp/sample-logs.csv --interval 5m
+
+    Output:
+
+    - `<output_prefix>-retries.csv`: A CSV file in your current directory with the following format:
+        ```
+        Interval Start (UTC),Retries
+        2025-05-14 16:31:00,4
+        2025-05-14 16:32:00,14
+        2025-05-14 16:33:00,12
+        2025-05-14 16:34:00,2299
+        2025-05-14 16:35:00,606
+        ```
+    - `<output_prefix>-retries.png`: A bar chart of the retries per interval, saved in your current directory. For example:
+    ![Retries at an interval of 7 minutes](sample-retries_7m.png)
+
+- **Analyze Unique Requests per Retry Count**
+
+    Use `requests_per_retry_count.py` to determine how many unique requests were retried a specific number of times. Provide the path to the log file from Step 1.
+
+    Syntax:
+
+        python requests_per_retry_count.py <log_file_path>
+
+    Example:
+
+        python requests_per_retry_count.py /tmp/sample-logs.csv
+
+    Output:
+
+    - A summary table printed to the console:
+        ```
+        Processing file: /tmp/sample-logs.csv
+
+        Retries    | Requests
+        -----------+----------
+        1          | 248
+        2          | 32
+        3          | 10
+        ```
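Note on `retries_per_interval.py`: the README describes this script, but its source is not part of this diff. The following is only a minimal sketch of how retries could be bucketed into fixed intervals, using the standard library instead of `pandas`. The names `parse_interval`, `parse_timestamp`, and `retries_per_interval`, the `5m`-style interval syntax, and the sample rows are all assumptions made for illustration, not the script's actual implementation.

```python
import csv
import io
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_interval(spec):
    """Convert a string such as '30s', '5m', or '1h' to a timedelta (assumed syntax)."""
    match = re.fullmatch(r"(\d+)([smh])", spec)
    if match is None or int(match.group(1)) == 0:
        raise ValueError(f"unsupported interval: {spec!r}")
    return timedelta(seconds=int(match.group(1)) * _UNIT_SECONDS[match.group(2)])


def parse_timestamp(value):
    """Parse an RFC 3339 UTC timestamp, dropping fractional seconds."""
    trimmed = re.sub(r"\.\d+", "", value.strip())
    return datetime.fromisoformat(trimmed.replace("Z", "+00:00"))


def retries_per_interval(csv_text, interval):
    """Count log rows per interval; keys are interval start times in UTC."""
    step = int(interval.total_seconds())
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        epoch = int(parse_timestamp(row["timestamp"]).timestamp())
        # Floor each timestamp to the start of its interval.
        counts[epoch - epoch % step] += 1
    return {
        datetime.fromtimestamp(start, tz=timezone.utc): count
        for start, count in sorted(counts.items())
    }


# Hypothetical rows in the "timestamp,textPayload" layout written by fetch_logs.sh.
SAMPLE = """timestamp,textPayload
2025-05-14T16:31:02.123456789Z,[req-a] stalled read-req cancelled after 20s
2025-05-14T16:33:59Z,[req-a] stalled read-req cancelled after 20s
2025-05-14T16:35:00Z,[req-b] stalled read-req cancelled after 20s
"""

for start, count in retries_per_interval(SAMPLE, parse_interval("5m")).items():
    print(start.strftime("%Y-%m-%d %H:%M:%S"), count)
```

Flooring each epoch second to a multiple of the interval length aligns the buckets to UTC clock boundaries, which is one way to produce a column like the `Interval Start (UTC)` shown in the README.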
diff --git a/read_stall_retry/analysis/fetch_logs.sh b/read_stall_retry/analysis/fetch_logs.sh
@@ -0,0 +1,120 @@
+#!/bin/bash
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ------------------------------------------------------------------------------
+# Description:
+#   This script queries Google Cloud Logging to extract GCSFuse related
+#   "stalled read-req" log entries for pods matching a regex in a GKE cluster.
+#
+#   It fetches logs in 30-minute intervals over a given time range and writes
+#   the combined result to a CSV file.
+#
+#   It uses a default output path unless an optional path is provided.
+#
+# Usage:
+#   ./fetch_logs.sh <cluster_name> <pod_name_regex> <start_time> <end_time> [output_file_path]
+#
+# Example (Default Path):
+#   ./fetch_logs.sh xpk-large-scale-usc1f-a "sample-job-.*" \
+#       "2025-02-04T18:00:00+05:30" "2025-02-05T10:00:00+05:30"
+#
+# Example (Custom Path):
+#   ./fetch_logs.sh xpk-large-scale-usc1f-a "sample-job-.*" \
+#       "2025-02-04T18:00:00+05:30" "2025-02-05T10:00:00+05:30" "/var/log/gcsfuse/sample_job_stalled_reads.csv"
+#
+# Output:
+#   - CSV log file at the specified or default location.
+#   - Prints total number of retry events due to stalled read requests.
+# ------------------------------------------------------------------------------
+
+# Check for correct number of arguments
+if [ "$#" -lt 4 ] || [ "$#" -gt 5 ]; then
+    echo "Usage: $0 <cluster_name> <pod_name_regex> <start_time> <end_time> [output_file_path]" >&2
+    echo "Example: $0 xpk-large-scale-usc1f-a \"sample-job-.*\" \"2025-06-16T02:05:00-07:00\" \"2025-06-16T03:05:00-07:00\"" >&2
+    exit 1
+fi
+
+# Input parameters
+cluster_name="$1"
+pod_name_regex="$2"
+starttime="$3"
+endtime="$4"
+
+# Set output file path
+if [ "$#" -eq 5 ]; then
+    log_filename="$5"
+else
+    # Sanitize the regex to create a valid filename prefix
+    output_prefix=$(echo "$pod_name_regex" | tr -dc '[:alnum:]_-')
+    if [ -z "$output_prefix" ]; then
+        output_prefix="gcsfuse-sidecar"
+    fi
+    log_filename="/tmp/${output_prefix}-logs.csv"
+fi
+
+# Get the directory part of the log filename
+dir_path=$(dirname "$log_filename")
+
+# Create the directory if it doesn't exist
+if [ ! -d "$dir_path" ]; then
+    echo "Output directory '$dir_path' not found. Creating it."
+    if ! mkdir -p "$dir_path"; then
+        echo "Failed to create directory '$dir_path'. Exiting."
+        exit 1
+    fi
+fi
+
+# Write CSV header once (overwrite any existing file)
+echo "timestamp,textPayload" > "$log_filename"
+
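+# Convert input times to Unix timestamps (relies on GNU date's -d option)
+start_timestamp=$(date -d "$starttime" +%s) || { echo "Invalid start_time: '$starttime'" >&2; exit 1; }
+end_timestamp=$(date -d "$endtime" +%s) || { echo "Invalid end_time: '$endtime'" >&2; exit 1; }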
+
+# Iterate through time range in 30-minute intervals
+current_start_time=$start_timestamp
+
+while [ "$current_start_time" -lt "$end_timestamp" ]; do
+    current_end_time=$((current_start_time + 1800 > end_timestamp ? end_timestamp : current_start_time + 1800)) # 30 minutes
+
+    # Format timestamps for gcloud logging query (ISO 8601 / RFC 3339)
+    start_time_formatted=$(date -d "@$current_start_time" --utc +%FT%T%:z)
+    end_time_formatted=$(date -d "@$current_end_time" --utc +%FT%T%:z)
+
+    # Temporary file to hold the output of one gcloud call
+    temp_output=$(mktemp)
+
+    # Query one half-open window [start, end) so an entry on a window boundary is not fetched twice.
+    if ! gcloud logging read \
+        "resource.labels.cluster_name=\"$cluster_name\" AND resource.labels.container_name=\"gke-gcsfuse-sidecar\" AND resource.labels.pod_name=~\"$pod_name_regex\" AND timestamp>=\"$start_time_formatted\" AND timestamp<\"$end_time_formatted\" AND \"stalled read-req\"" \
+        --order=ASC \
+        --format='csv(timestamp,textPayload)' > "$temp_output"; then
+        echo "gcloud command failed for the interval starting at $start_time_formatted. Exiting." >&2
+        rm -f "$temp_output"
+        exit 1
+    fi
+
+    # Append logs but skip the header line from gcloud output
+    tail -n +2 "$temp_output" >> "$log_filename"
+    rm "$temp_output"
+
+    current_start_time=$current_end_time
+done
+
+echo "Logs have been saved at: $log_filename"
+
+# Calculate total number of retries (lines - 1 for header)
+total_retries=$(( $(wc -l < "$log_filename") - 1 ))
+echo "Total number of retries due to stalled read request = $total_retries"
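Note on `fetch_logs.sh`: two pieces of the script's logic are easy to misread in shell, namely the 30-minute windowing loop and the default output filename. The sketch below restates both in Python for illustration only. The function names `split_windows` and `default_log_path` are hypothetical, and the character class assumes an ASCII locale for `tr -dc '[:alnum:]_-'`.

```python
import re
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)


def split_windows(start, end, step=WINDOW):
    """Split [start, end) into consecutive windows of at most `step`."""
    windows = []
    cursor = start
    while cursor < end:
        # The last window is clamped to `end`, as in the shell loop.
        window_end = min(cursor + step, end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows


def default_log_path(pod_name_regex):
    """Mimic `tr -dc '[:alnum:]_-'` (ASCII only) and the /tmp fallback name."""
    prefix = re.sub(r"[^A-Za-z0-9_-]", "", pod_name_regex) or "gcsfuse-sidecar"
    return f"/tmp/{prefix}-logs.csv"


start = datetime.fromisoformat("2025-02-04T18:00:00+05:30")
end = datetime.fromisoformat("2025-02-04T19:10:00+05:30")
for window_start, window_end in split_windows(start, end):
    print(window_start.isoformat(), "->", window_end.isoformat())

print(default_log_path("sample-job-.*"))
```

One consequence worth knowing: a regex such as `sample-job-.*` sanitizes to `sample-job-`, so the default file is `/tmp/sample-job--logs.csv`, with a double hyphen.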
diff --git a/read_stall_retry/analysis/requests_per_retry_count.py b/read_stall_retry/analysis/requests_per_retry_count.py
@@ -0,0 +1,123 @@
+#!/usr/bin/env python3
+
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+r"""
+This script processes a CSV log file generated by `fetch_logs.sh` and analyzes
+"stalled read-req" events in GCSFuse. It identifies unique read requests by
+UUID and counts how many times each request was retried.
+
+It then summarizes the number of requests that were retried 1, 2, 3, ... times.
+
+The input log file path is provided as a command-line argument.
+
+Usage:
+    python requests_per_retry_count.py <path_to_log_file>
+
+Example:
+    python requests_per_retry_count.py /tmp/sample-logs.csv
+
+Output:
+    Processing file: /tmp/sample-logs.csv
+
+    Retries    | Requests
+    -----------+----------
+    1          | 248
+    2          | 32
+    3          | 10
+
+    Each row in the table shows how many requests had a specific number of retries.
+"""
+
+import pandas as pd
+import re
+import sys
+import argparse
+from collections import defaultdict
+
+LOG_PATTERN = re.compile(r'\[([^\[\]]*)\] stalled read-req cancelled after')
+
+def main():
+    parser = argparse.ArgumentParser(
+        description=(
+            "Analyze 'stalled read-req' log messages in GCSFuse logs.\n"
+            "Counts retries per unique request (UUID) and summarizes the request distribution over retry frequency."
+        ),
+        epilog=(
+            "Example usage:\n"
+            "  python requests_per_retry_count.py /tmp/sample-logs.csv\n\n"
+            "This will read the specified CSV file and output a summary of retry counts."
+        ),
+        formatter_class=argparse.RawTextHelpFormatter
+    )
+    parser.add_argument(
+        "log_file_path",
+        help="Path to the CSV log file to be processed."
+    )
+
+    args = parser.parse_args()
+
+    log_file = args.log_file_path
+
+    print(f"Processing file: {log_file}")
+
+    try:
+        df = pd.read_csv(log_file)
+    except FileNotFoundError:
+        print(f"Error: Log file '{log_file}' not found.", file=sys.stderr)
+        sys.exit(1)
+    except Exception as e:
+        print(f"Error reading log file '{log_file}': {e}", file=sys.stderr)
+        sys.exit(1)
+
+    # Sanitize header for case-insensitive and whitespace-proof comparison.
+    sanitized_header = [h.strip().lower() for h in df.columns]
+
+    # Validate that the header has the expected columns.
+    if len(sanitized_header) < 2 or sanitized_header[0] != 'timestamp' or sanitized_header[1] != 'textpayload':
+        print(f"Error: Invalid CSV header in '{log_file}'.", file=sys.stderr)
+        print("Expected header to start with 'timestamp,textPayload'.", file=sys.stderr)
+        print(f"Actual header: {','.join(df.columns)}", file=sys.stderr)
+        sys.exit(1)
+
+    retry_counts = defaultdict(int)
+
+    for text in df.iloc[:, 1]:  # second column; the validated header may differ in case or spacing from 'textPayload'
+        # Ensure text is a string before searching
+        if not isinstance(text, str):
+            continue
+        match = LOG_PATTERN.search(text)
+        if match:
+            uuid = match.group(1)
+            retry_counts[uuid] += 1
+
+    if not retry_counts:
+        print("No retries found in the log file.")
+        sys.exit(0)
+
+    frequency_counts = defaultdict(int)
+    for count in retry_counts.values():
+        frequency_counts[count] += 1
+
+    # Print formatted table
+    print(f"\n{'Retries':<10} | {'Requests':<10}")
+    print(f"{'-'*10}-+-{'-'*10}")
+
+    for retries_count in sorted(frequency_counts.keys()):
+        requests_count = frequency_counts[retries_count]
+        print(f"{retries_count:<10} | {requests_count:<10}")
+
+if __name__ == "__main__":
+    main()
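Note on `requests_per_retry_count.py`: the core of the script is a two-level count, first retries per request ID, then requests per retry count. The sketch below shows the same idea with `collections.Counter` and no `pandas` dependency. The function name `requests_per_retry_count` and the sample payloads are hypothetical, and real GCSFuse messages may carry more detail than shown here.

```python
import re
from collections import Counter

# Same pattern as LOG_PATTERN in requests_per_retry_count.py.
LOG_PATTERN = re.compile(r"\[(.*?)\] stalled read-req cancelled after")


def requests_per_retry_count(payloads):
    """Map retry count -> number of distinct request IDs seen that many times."""
    retries_by_request = Counter()
    for text in payloads:
        if not isinstance(text, str):
            continue  # e.g. an empty CSV cell
        match = LOG_PATTERN.search(text)
        if match:
            retries_by_request[match.group(1)] += 1
    # Second pass: how many requests share each retry count.
    return dict(sorted(Counter(retries_by_request.values()).items()))


# Hypothetical payloads for illustration.
payloads = [
    "[req-a] stalled read-req cancelled after 20s",
    "[req-a] stalled read-req cancelled after 20s",
    "[req-a] stalled read-req cancelled after 20s",
    "[req-b] stalled read-req cancelled after 20s",
    "[req-c] stalled read-req cancelled after 20s",
    "unrelated log line",
    None,
]

for retries, requests in requests_per_retry_count(payloads).items():
    print(f"{retries:<10} | {requests:<10}")
```

Here `req-a` is retried three times while `req-b` and `req-c` are retried once each, so the summary has two requests under one retry and one request under three.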
diff --git a/read_stall_retry/analysis/requirements.txt b/read_stall_retry/analysis/requirements.txt
@@ -0,0 +1,19 @@
+# Copyright 2025 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Used for data analysis and reading CSV files in requests_per_retry_count.py
+pandas
+
+# Used for creating the bar chart visualization in retries_per_interval.py
+matplotlib