I'm using Perfetto for the first time, so apologies if I'm misunderstanding the intended usage, but I've been digging into this for two days and can't work out why my traces are getting corrupted. I'm using the C++ SDK and doing nothing beyond the default `TRACE_EVENT[_BEGIN/END]` data-source instrumentation. So far I haven't seen a problem running single-threaded, but when I run tasks in parallel on a large enough workload, I almost always end up with dropped packets that render the result useless.
Here is the relevant data from my investigation, as far as I understand it:
- I see various mixtures of the following data-loss warnings: `traced_buf_trace_writer_packet_loss`, `traced_buf_sequence_packet_loss`, and `traced_buf_incremental_sequences_dropped`. Sometimes just one, sometimes two or all three. Whenever any such loss events occur, the timeline shows up malformed, with some events marked as incomplete.
- Sometimes the timeline length is as expected, with the incomplete slices simply extending to the end and lasting far longer than they should. Other times the incomplete slices are cut short and the trace timeline is much shorter than it should be.
- Often I see `traced_flushes_succeeded` being one less than `traced_flushes_requested`, but this appears to have no relation to whether or not the results are malformed.
- When the trace is malformed, `traced_buf_bytes_read` is slightly smaller than `traced_buf_bytes_written`.
- To give an idea of my data volume: the traces run around 2-3 minutes and generate trace files of roughly half a gigabyte.
- Slice nesting depth reaches around 100.
- I ran `traceconv text`, as suggested in another issue, and found some packets like the following:
```
packet {
  previous_packet_dropped: true
  timestamp: 85501
  sequence_flags: 2
  track_event {
    type: TYPE_SLICE_END
  }
  trusted_uid: 0
  trusted_packet_sequence_id: 3
  trusted_pid: 0
}
```
To give a bit more info on my setup: I trigger a single tracing session, which
- calls `perfetto::Tracing::NewTrace(perfetto::kInProcessBackend)`
- calls `Setup()` (passing a file) and then `StartBlocking()`
- the main thread then spawns N workers (`std::jthread`), all of which, along with the main thread, run the same code, which can emit events
- the main thread joins all the worker threads, which are then destroyed
- only then does the main thread call `FlushBlocking()` (unsure if needed) before `StopBlocking()` and closing the file
Results are unpredictable, but so far I've seen packet loss for N >= 1, but not for N == 0.
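In code, the lifecycle above looks roughly like this (a simplified sketch, not my exact code; the category name `"app"`, `do_work`, and the worker count are placeholders):

```cpp
#include <perfetto.h>

#include <fcntl.h>
#include <unistd.h>
#include <thread>
#include <vector>

// Track-event categories used by the TRACE_EVENT macros (simplified).
PERFETTO_DEFINE_CATEGORIES(perfetto::Category("app"));
PERFETTO_TRACK_EVENT_STATIC_STORAGE();

void do_work() {  // stand-in for the real workload
  TRACE_EVENT("app", "Work");
  // ... emits nested TRACE_EVENT_BEGIN/END pairs, ~100 levels deep ...
}

int main() {
  perfetto::TracingInitArgs args;
  args.backends = perfetto::kInProcessBackend;
  perfetto::Tracing::Initialize(args);
  perfetto::TrackEvent::Register();

  perfetto::TraceConfig cfg;
  cfg.add_buffers()->set_size_kb(1024 * 1024);  // 1 GB central buffer
  cfg.add_data_sources()->mutable_config()->set_name("track_event");

  int fd = open("trace.perfetto-trace", O_RDWR | O_CREAT | O_TRUNC, 0600);
  auto session = perfetto::Tracing::NewTrace(perfetto::kInProcessBackend);
  session->Setup(cfg, fd);
  session->StartBlocking();

  const int N = 8;  // loss appears for any N >= 1
  std::vector<std::jthread> workers;
  for (int i = 0; i < N; ++i) workers.emplace_back(do_work);
  do_work();        // main thread runs the same code
  workers.clear();  // jthread destructors join all workers

  session->FlushBlocking();  // unsure if needed
  session->StopBlocking();
  close(fd);
}
```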
I've experimented with various config values based on suggestions in the docs and elsewhere:
- I've bumped the buffer size (`add_buffers()->set_size_kb()`) to 1 GB, which according to the stats is more than is written over the whole trace, so there is no overflow happening there.
- Tried increasing `shmem_size_hint_kb` and `shmem_page_size_hint_kb` to various degrees.
- Experimented with `file_write_period_ms` and `flush_period_ms`.
- Also tried `BufferExhaustedPolicy::kStall`, as noted somewhere.
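For reference, the resulting config in textproto form looks something like this (the specific values are just ones I tried, not recommendations; the shmem hints go via `TracingInitArgs` in the SDK rather than the trace config):

```textproto
# Sketch of the TraceConfig used for the runs above
buffers {
  size_kb: 1048576  # 1 GB, more than the ~360 MB actually written
}
data_sources {
  config {
    name: "track_event"
  }
}
file_write_period_ms: 2500
flush_period_ms: 5000
```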
Am I right in thinking that under these conditions, there's really no reason I should be seeing packet loss like this? Any suggestions as to what the cause could be, or ways to debug, would be much appreciated.
Stats from one malformed trace:
```
clock_sync_cache_miss | 77 | info (analysis)
guess_trace_type_duration_ns | 110163100 | info (analysis)
packet_skipped_seq_needs_incremental_state_invalid | 9715362 | info (analysis)
parse_trace_duration_ns | 7912450300 | info (analysis)
traced_buf_buffer_size[0] | 1073741824 | info (trace)
traced_buf_bytes_read[0] | 363528192 | info (trace)
traced_buf_bytes_written[0] | 363560960 | info (trace)
traced_buf_chunks_read[0] | 11094 | info (trace)
traced_buf_chunks_rewritten[0] | 35 | info (trace)
traced_buf_chunks_written[0] | 11095 | info (trace)
traced_buf_patches_succeeded[0] | 354 | info (trace)
traced_buf_readaheads_failed[0] | 25 | info (trace)
traced_buf_readaheads_succeeded[0] | 234 | info (trace)
traced_buf_sequence_packet_loss[0] | 14 | data_loss (analysis)
traced_chunks_discarded | 19 | info (trace)
traced_data_sources_registered | 1 | info (trace)
traced_data_sources_seen | 20 | info (trace)
traced_flushes_requested | 15 | info (trace)
traced_flushes_succeeded | 15 | info (trace)
traced_producers_connected | 1 | info (trace)
traced_producers_seen | 1 | info (trace)
traced_total_buffers | 1 | info (trace)
traced_tracing_sessions | 1 | info (trace)
```