Merged
111 changes: 111 additions & 0 deletions read_stall_retry/analysis/README.md
@@ -0,0 +1,111 @@
GCSFuse Stalled Read Retry Analysis Toolkit
============================================

This directory contains scripts that fetch Google Cloud Storage FUSE (GCSFuse) logs from Google Cloud Logging and analyze them. The tools help identify and visualize the frequency and distribution of retries caused by stalled read requests.

The workflow is flexible: use `fetch_logs.sh` to download logs from Google Cloud Logging to a location of your choice. Then, use the Python analysis scripts (`retries_per_interval.py` and `requests_per_retry_count.py`) to process the downloaded log file by providing its path.


Scripts Overview
----------------

1. `fetch_logs.sh`: A bash script that queries Google Cloud Logging for "stalled read-req" log entries from pods matching a given regex. It saves these logs to a specified output path or to a default location in `/tmp/` if no path is provided.
2. `retries_per_interval.py`: A Python script that takes a path to a CSV log file as input and aggregates the "stalled read-req" retry counts into user-defined time intervals. It generates a new CSV file and a bar chart visualization of the results.
3. `requests_per_retry_count.py`: A Python script that processes a given CSV log file to analyze "stalled read-req" retries. It identifies unique requests by their UUID and provides a summary table showing how many requests were retried a specific number of times.
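
The interval aggregation described for `retries_per_interval.py` can be sketched with the standard library alone. This is a minimal illustration, not the script's actual implementation (which is not shown here). The helper names `parse_interval` and `retries_per_interval`, the `s`/`m`/`h` suffix format, and the assumption that timestamps are UTC strings beginning with `YYYY-MM-DDTHH:MM:SS` are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def parse_interval(spec):
    """Convert a spec such as '30s', '5m' or '1h' to seconds (assumed format)."""
    units = {"s": 1, "m": 60, "h": 3600}
    value, unit = spec[:-1], spec[-1:]
    if unit not in units or not value.isdigit() or int(value) == 0:
        raise ValueError(f"unsupported interval: {spec!r}")
    return int(value) * units[unit]


def retries_per_interval(timestamps, interval_seconds):
    """Count log timestamps per bucket, keyed by the bucket's UTC start time."""
    counts = Counter()
    for ts in timestamps:
        # Assumption: timestamps are UTC; only the first 19 characters
        # (whole seconds) are parsed, so fractional seconds are ignored.
        moment = datetime.strptime(ts[:19], "%Y-%m-%dT%H:%M:%S")
        epoch = int(moment.replace(tzinfo=timezone.utc).timestamp())
        # Floor the timestamp to the start of its interval.
        bucket_start = epoch - epoch % interval_seconds
        counts[datetime.fromtimestamp(bucket_start, tz=timezone.utc)] += 1
    return sorted(counts.items())


if __name__ == "__main__":
    sample = [
        "2025-05-14T16:31:05.123456789Z",
        "2025-05-14T16:33:59Z",
        "2025-05-14T16:36:00Z",
    ]
    for start, retries in retries_per_interval(sample, parse_interval("5m")):
        print(f"{start:%Y-%m-%d %H:%M:%S},{retries}")
```

Flooring the epoch value aligns buckets to the Unix epoch, so 5-minute buckets start at :00, :05, :10 and so on, regardless of when the first log entry arrived.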


Prerequisites
-------------

Before you begin, ensure you have the following installed and configured:

* Google Cloud SDK (`gcloud`): Required by `fetch_logs.sh` to query logs from Google Cloud Logging. You must be authenticated (`gcloud auth login`) and have the necessary permissions.
- Installation Guide (https://cloud.google.com/sdk/docs/install)

* Python 3: Required to run the analysis scripts.

* Python Libraries: The analysis scripts depend on `pandas` and `matplotlib`. You can install them using the provided `requirements.txt` file:

`pip install -r requirements.txt`

Usage Workflow
--------------

The intended workflow is a two-step process:

**Step 1: Fetch Logs**

First, use the `fetch_logs.sh` script to download the GCSFuse logs for a specific set of pods using a regular expression.

**Note:** Before running, make sure the script is executable: `chmod +x fetch_logs.sh`

Syntax:

./fetch_logs.sh <cluster_name> <pod_name_regex> <start_time> <end_time> [output_file_path]

Examples:

- Fetch logs to a default path:

./fetch_logs.sh xpk-large-scale-usc1f-a sample "2025-02-04T18:00:00+05:30" "2025-02-05T10:00:00+05:30"
This will save the logs to the file named `/tmp/sample-logs.csv`.

- Fetch logs to a custom path:

./fetch_logs.sh xpk-large-scale-usc1f-a "sample-job-.*" "2025-02-04T18:00:00+05:30" "2025-02-05T10:00:00+05:30" "/var/log/gcsfuse/my_analysis.csv"
This will save the logs to the file named `/var/log/gcsfuse/my_analysis.csv`.
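
According to its header comment, `fetch_logs.sh` walks the requested time range in 30-minute windows and issues one query per window. The windowing logic can be sketched in Python as follows. The function name `split_into_windows` is hypothetical and the sketch only illustrates the iteration; it does not call `gcloud`.

```python
from datetime import datetime, timedelta


def split_into_windows(start, end, step=timedelta(minutes=30)):
    """Yield (window_start, window_end) pairs that cover [start, end)."""
    current = start
    while current < end:
        # The last window is clipped so it never runs past the end time.
        window_end = min(current + step, end)
        yield current, window_end
        current = window_end


if __name__ == "__main__":
    start = datetime.fromisoformat("2025-02-04T18:00:00+05:30")
    end = datetime.fromisoformat("2025-02-04T19:15:00+05:30")
    for window_start, window_end in split_into_windows(start, end):
        print(window_start.isoformat(), "->", window_end.isoformat())
```

Fetching in small windows keeps each query's result set bounded, at the cost of more calls for long time ranges.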

**Step 2: Analyze the logs**

Once you have the log file, you can use the Python scripts to analyze it.

- **Analyze and Visualize Retries Over Time**

Use `retries_per_interval.py` to see how the frequency of retries changes over time. Provide the path to the log file created in Step 1.

Syntax:

python retries_per_interval.py <log_file_path> [--interval <interval>]

Example:

python retries_per_interval.py /tmp/sample-logs.csv --interval 5m

Output:

- `<output_prefix>-retries.csv`: A CSV file in your current directory with the following format:
```
Interval Start (UTC),Retries
2025-05-14 16:31:00,4
2025-05-14 16:32:00,14
2025-05-14 16:33:00,12
2025-05-14 16:34:00,2299
2025-05-14 16:35:00,606
```
- `<output_prefix>-retries.png`: A bar chart of the retries per interval, saved in your current directory. For example:
![Retries at an interval of 7 minutes](sample-retries_7m.png)

- **Analyze Unique Requests per Retry Count**

Use `requests_per_retry_count.py` to determine how many unique requests were retried a specific number of times. Provide the path to the log file from Step 1.

Syntax:

python requests_per_retry_count.py <log_file_path>

Example:

python requests_per_retry_count.py /tmp/sample-logs.csv

Output:

- A summary table printed to the console:

```
Processing file: /tmp/sample-logs.csv
Retries | Requests
-----------+----------
1 | 248
2 | 32
3 | 10
```
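
The summary table above is a count of counts: first the number of retries per request UUID, then the number of requests per retry total. A minimal sketch using `collections.Counter` follows. The regular expression is the one used in `requests_per_retry_count.py`; the function name and the sample payload lines are invented for illustration and may not match real GCSFuse log text beyond the matched fragment.

```python
import re
from collections import Counter

# Same pattern as in requests_per_retry_count.py.
LOG_PATTERN = re.compile(r"\[(.*?)\] stalled read-req cancelled after")


def requests_per_retry_count(payloads):
    """Map each retry count to the number of requests with that many retries."""
    retries_per_request = Counter()
    for text in payloads:
        match = LOG_PATTERN.search(text)
        if match:
            retries_per_request[match.group(1)] += 1
    # Second pass: count how many requests share each retry total.
    return Counter(retries_per_request.values())


if __name__ == "__main__":
    # Hypothetical payloads; only the matched fragment follows the source pattern.
    sample = [
        "[req-a] stalled read-req cancelled after 1s",
        "[req-a] stalled read-req cancelled after 2s",
        "[req-b] stalled read-req cancelled after 1s",
        "unrelated log line",
    ]
    for retries, requests in sorted(requests_per_retry_count(sample).items()):
        print(f"{retries:<10} | {requests:<10}")
```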
120 changes: 120 additions & 0 deletions read_stall_retry/analysis/fetch_logs.sh
@@ -0,0 +1,120 @@
#!/bin/bash
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# ------------------------------------------------------------------------------
# Description:
# This script queries Google Cloud Logging to extract GCSFuse related
# "stalled read-req" log entries for pods matching a regex in a GKE cluster.
#
# It fetches logs in 30-minute intervals over a given time range and writes
# the combined result to a CSV file.
#
# It uses a default output path unless an optional path is provided.
#
# Usage:
# ./fetch_logs.sh <cluster_name> <pod_name_regex> <start_time> <end_time> [output_file_path]
#
# Example (Default Path):
# ./fetch_logs.sh xpk-large-scale-usc1f-a "sample-job-.*" \
# "2025-02-04T18:00:00+05:30" "2025-02-05T10:00:00+05:30"
#
# Example (Custom Path):
# ./fetch_logs.sh xpk-large-scale-usc1f-a "sample-job-.*" \
# "2025-02-04T18:00:00+05:30" "2025-02-05T10:00:00+05:30" "/var/log/gcsfuse/sample_job_stalled_reads.csv"
#
# Output:
# - CSV log file at the specified or default location.
# - Prints total number of retry events due to stalled read requests.
# ------------------------------------------------------------------------------

# Check for correct number of arguments
if [ "$#" -lt 4 ] || [ "$#" -gt 5 ]; then
  echo "Usage: $0 <cluster_name> <pod_name_regex> <start_time> <end_time> [output_file_path]"
  echo "Example: $0 xpk-large-scale-usc1a-a \"sample-job-.*\" \"2025-06-16T02:05:00-07:00\" \"2025-06-16T03:05:00-07:00\""
  exit 1
fi

# Input parameters
cluster_name="$1"
pod_name_regex="$2"
starttime="$3"
endtime="$4"

# Set output file path
if [ "$#" -eq 5 ]; then
  log_filename="$5"
else
  # Sanitize the regex to create a valid filename prefix
  output_prefix=$(echo "$pod_name_regex" | tr -dc '[:alnum:]_-')
  if [ -z "$output_prefix" ]; then
    output_prefix="gcsfuse-sidecar"
  fi
  log_filename="/tmp/${output_prefix}-logs.csv"
fi

# Get the directory part of the log filename
dir_path=$(dirname "$log_filename")

# Create the directory if it doesn't exist
if [ ! -d "$dir_path" ]; then
  echo "Output directory '$dir_path' not found. Creating it."
  if ! mkdir -p "$dir_path"; then
    echo "Failed to create directory '$dir_path'. Exiting."
    exit 1
  fi
fi

# Write CSV header once (overwrite any existing file)
echo "timestamp,textPayload" > "$log_filename"

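# Convert input times to Unix timestamps (uses GNU date's -d option).
# Exit early on unparsable input so the loop below never runs with empty values.
if ! start_timestamp=$(date -d "$starttime" +%s) || ! end_timestamp=$(date -d "$endtime" +%s); then
  echo "Failed to parse start or end time. Exiting."
  exit 1
fi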

# Iterate through time range in 30-minute intervals
current_start_time=$start_timestamp

while [ "$current_start_time" -lt "$end_timestamp" ]; do
  current_end_time=$((current_start_time + 1800 > end_timestamp ? end_timestamp : current_start_time + 1800)) # 30 minutes

  # Format timestamps for gcloud logging query (ISO 8601 / RFC 3339)
  start_time_formatted=$(date -d "@$current_start_time" --utc +%FT%T%:z)
  end_time_formatted=$(date -d "@$current_end_time" --utc +%FT%T%:z)

  # Temporary file to hold the output of one gcloud call
  temp_output=$(mktemp)

  # Read one window of logs. The upper bound is exclusive so that an entry
  # stamped exactly on a window boundary is not fetched by two consecutive windows.
  if ! gcloud logging read \
    "resource.labels.cluster_name=\"$cluster_name\" AND resource.labels.container_name=\"gke-gcsfuse-sidecar\" AND resource.labels.pod_name=~\"$pod_name_regex\" AND timestamp>=\"$start_time_formatted\" AND timestamp<\"$end_time_formatted\" AND \"stalled read-req\"" \
    --order=ASC \
    --format='csv(timestamp,textPayload)' > "$temp_output"; then
    echo "gcloud command failed for the interval starting at $start_time_formatted. Exiting."
    rm "$temp_output"
    exit 1
  fi

  # Append logs but skip the header line from gcloud output
  tail -n +2 "$temp_output" >> "$log_filename"
  rm "$temp_output"

  current_start_time=$current_end_time
done

echo "Logs have been saved at: $log_filename"

# Calculate total number of retries (lines - 1 for header)
total_retries=$(( $(wc -l < "$log_filename") - 1 ))
echo "Total number of retries due to stalled read request = $total_retries"
123 changes: 123 additions & 0 deletions read_stall_retry/analysis/requests_per_retry_count.py
@@ -0,0 +1,123 @@
#!/usr/bin/env python3

# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

r"""
This script processes a CSV log file generated by `fetch_logs.sh` and analyzes
"stalled read-req" events in GCSFuse. It identifies unique read requests by
UUID and counts how many times each request was retried.

It then summarizes the number of requests that were retried 1, 2, 3, ... times.

The input log file path is provided as a command-line argument.

Usage:
python requests_per_retry_count.py <path_to_log_file>

Example:
python requests_per_retry_count.py /tmp/sample-logs.csv

Output:
Processing file: /tmp/sample-logs.csv

Retries | Requests
-----------+----------
1 | 248
2 | 32
3 | 10

Each row in the table shows how many requests had a specific number of retries.
"""

import pandas as pd
import re
import sys
import argparse
from collections import defaultdict

LOG_PATTERN = re.compile(r'\[(.*?)\] stalled read-req cancelled after')

def main():
    parser = argparse.ArgumentParser(
        description=(
            "Analyze 'stalled read-req' log messages in GCSFuse logs.\n"
            "Counts retries per unique request (UUID) and summarizes the request distribution over retry frequency."
        ),
        epilog=(
            "Example usage:\n"
            "  python requests_per_retry_count.py /tmp/sample-logs.csv\n\n"
            "This will read the specified CSV file and output a summary of retry counts."
        ),
        formatter_class=argparse.RawTextHelpFormatter
    )
    parser.add_argument(
        "log_file_path",
        help="Path to the CSV log file to be processed."
    )

    args = parser.parse_args()

    log_file = args.log_file_path

    print(f"Processing file: {log_file}")

    try:
        df = pd.read_csv(log_file)
    except FileNotFoundError:
        print(f"Error: Log file '{log_file}' not found.", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error reading log file '{log_file}': {e}", file=sys.stderr)
        sys.exit(1)

    # Sanitize header for case-insensitive and whitespace-proof comparison.
    sanitized_header = [h.strip().lower() for h in df.columns]

    # Validate that the header has the expected columns.
    if len(sanitized_header) < 2 or sanitized_header[0] != 'timestamp' or sanitized_header[1] != 'textpayload':
        print(f"Error: Invalid CSV header in '{log_file}'.", file=sys.stderr)
        print("Expected header to start with 'timestamp,textPayload'.", file=sys.stderr)
        print(f"Actual header: {','.join(df.columns)}", file=sys.stderr)
        sys.exit(1)

    retry_counts = defaultdict(int)

    # Select the payload column by position: the header check above is
    # case-insensitive, so the literal name 'textPayload' may not match.
    for text in df.iloc[:, 1]:
        # Ensure text is a string before searching
        if not isinstance(text, str):
            continue
        match = LOG_PATTERN.search(text)
        if match:
            uuid = match.group(1)
            retry_counts[uuid] += 1

    if not retry_counts:
        print("No retries found in the log file.")
        sys.exit(0)

    # Count how many requests share each retry total.
    frequency_counts = defaultdict(int)
    for count in retry_counts.values():
        frequency_counts[count] += 1

    # Print formatted table
    print(f"\n{'Retries':<10} | {'Requests':<10}")
    print(f"{'-'*10}-+-{'-'*10}")

    for retries_count in sorted(frequency_counts.keys()):
        requests_count = frequency_counts[retries_count]
        print(f"{retries_count:<10} | {requests_count:<10}")


if __name__ == "__main__":
    main()
19 changes: 19 additions & 0 deletions read_stall_retry/analysis/requirements.txt
@@ -0,0 +1,19 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Used for data analysis and reading CSV files in requests_per_retry_count.py
pandas

# Used for creating the bar chart visualization in retries_per_interval.py
matplotlib