-
Notifications
You must be signed in to change notification settings - Fork 50
Add info for debugging high CPU utilization #191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
gtjoseph
wants to merge
1
commit into
asterisk:main
Choose a base branch
from
gtjoseph:main-cpu-troubleshooting
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
173 changes: 173 additions & 0 deletions
173
docs/Deployment/Troubleshooting/Troubleshooting-High-CPU-Utilization.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,173 @@ | ||
| # Troubleshooting High CPU Utilization | ||
|
|
||
| /// note | ||
| We'll cover high memory utilization in a separate document. | ||
| /// | ||
|
|
||
| ## What's "High"??? | ||
|
|
||
| It sounds silly but if you can't define what "high" means to you then | ||
| how will you know when you've "fixed" it? | ||
|
|
||
| High CPU utilization isn't usually an issue by itself. The real issue would be | ||
| the side effects like poor call quality or long call setup times, etc. | ||
| Understanding the issue you're tring to solve and what your goals are should | ||
| be the first step in any investigation. | ||
|
|
||
| ## Understand the environment | ||
|
|
||
| The next step in the investigation should be understanding your environment thoroughly. | ||
| These are just a few things to keep in your mind: | ||
|
|
||
| * Is Asterisk running in a virtual machine or a container? | ||
| * What else is competing for resources? | ||
| * Is the filesystem directly attached or remote via a SAN, etc.? | ||
| * What else is sharing the same CPU? | ||
| * Databases | ||
| * ARI/AMI/AGI applications | ||
| * Unrelated applications or servers | ||
| * Are you using a realtime database? | ||
| * Where is it hosted? | ||
| * What else is using the same database instance? | ||
| * What's the network configuration? | ||
| * Network interfaces. | ||
| * Firewalls. | ||
| * DNS infrastructure. | ||
| * Proxies and session border controllers. | ||
|
|
||
| There's more but you get the idea. | ||
|
|
||
| ## Identify the actors | ||
|
|
||
| Identifying _why_ you're seeing high CPU utilization can be complicated but let's start | ||
| with the _what_ first because it might not even be Asterisk itself. The tool of choice | ||
| here is the venerable `top` or it's relatives `htop` and `btop`. | ||
|
|
||
| So who are the culprits? | ||
|
|
||
| ### Database | ||
|
|
||
| If the database has nothing to do with Asterisk itself then it probably shouln't be running | ||
| in the same OS instance. If the database _is_ hosting your Asterisk configuration check where | ||
| you're storing your incoming PJSIP registrations (non permanent contacts). The default is to | ||
| store them in the "astdb" (Sqlite3 database at `/var/lib/asterisk/astdb.sqlite3`) but some people | ||
| use sorcery.conf and extconfig.conf to move them to an external database believing it's "better". | ||
| Unless you need to expose the active registrations to another entity like Kamailio, leaving them | ||
| in the astdb is better for performance because the database is all in-memory. | ||
|
|
||
| Another thing to investigate is | ||
| [Sorcery Caching](/Fundamentals/Asterisk-Configuration/Sorcery/Sorcery-Caching/). | ||
| If your expiration times are too low, you may be running more transactions to the database | ||
| than necessary. | ||
|
|
||
| Finally, consider moving the database to another OS instance, even if that other instance is | ||
| another container or virtual machine on the same physical host. Remember, database access is | ||
| usually limited to call setup but if Asterisk is processing the call media (which it usually is) | ||
| it's more important for it to have a higher CPU resource allocation priority. Moving the database | ||
| someplace else gives you an opportunity to control that allocation. Keep it local however and ideally | ||
| use a dedicated LAN segment with a 10G or greater rate. | ||
|
|
||
| ### ARI/AMI/AGI Applications | ||
|
|
||
| Take a close look at what the applications are doing and try to streamline. If you're using "classic" | ||
| AGI (`AGI(/path/to/script)`) switch to using [FastAGI](/Latest_API/API_Documentation/Dialplan_Applications/AGI/) | ||
| to avoid the overhead of spawning a new process for every call. As with a database, moving the applications to another OS instance can give you more control over resource allocation. | ||
|
|
||
| ### Asterisk | ||
|
|
||
| #### Transcoding | ||
|
|
||
| The number one use of CPU resources by Asterisk is usually handling media. | ||
| Actually, simply forwarding media between two channels that use the same codec is pretty cheap. | ||
| Once you start transcoding however, the number of CPU cycles needed can increase drastically. | ||
|
|
||
| ``` | ||
| *CLI> core show translation | ||
| Translation times between formats (in microseconds) for one second of data | ||
| Source Format (Rows) Destination Format (Columns) | ||
|
|
||
| ulaw alaw gsm slin8 slin16 slin32 slin48 g729 g722 opus | ||
| ulaw - 9150 15000 9000 17000 17000 17000 15000 17250 23000 | ||
| alaw 9150 - 15000 9000 17000 17000 17000 15000 17250 23000 | ||
| gsm 15000 15000 - 9000 17000 17000 17000 15000 17250 23000 | ||
| slin8 6000 6000 6000 - 8000 8000 8000 6000 8250 14000 | ||
| slin16 14500 14500 14500 8500 - 8000 8000 14500 6000 14000 | ||
| slin32 14500 14500 14500 8500 8500 - 8000 14500 14500 14000 | ||
| slin48 14500 14500 14500 8500 8500 8500 - 14500 14500 6000 | ||
| g729 15000 15000 15000 9000 17000 17000 17000 - 17250 23000 | ||
| g722 15600 15600 15600 9600 9000 17000 17000 15600 - 23000 | ||
| opus 23500 23500 23500 17500 17500 17500 9000 23500 23500 - | ||
| ``` | ||
|
|
||
| The absolute numbers are specific to the machine I ran the command on but look at the relative | ||
| differences, especially when opus is used on one channel and the other uses something | ||
| else. | ||
|
|
||
| Here's a different look at opus and what the path is between it and other codecs: | ||
|
|
||
| ``` | ||
| *CLI> core show translation paths opus | ||
| --- Translation paths SRC Codec "opus" sample rate 48000 --- | ||
| opus:48000 To ulaw:8000 : (opus@48000)->(slin@48000)->(slin@8000)->(ulaw@8000) | ||
| opus:48000 To alaw:8000 : (opus@48000)->(slin@48000)->(slin@8000)->(alaw@8000) | ||
| opus:48000 To gsm:8000 : (opus@48000)->(slin@48000)->(slin@8000)->(gsm@8000) | ||
| opus:48000 To slin:8000 : (opus@48000)->(slin@48000)->(slin@8000) | ||
| opus:48000 To slin:16000 : (opus@48000)->(slin@48000)->(slin@16000) | ||
| opus:48000 To slin:32000 : (opus@48000)->(slin@48000)->(slin@32000) | ||
| opus:48000 To slin:48000 : (opus@48000)->(slin@48000) | ||
| opus:48000 To g729:8000 : (opus@48000)->(slin@48000)->(slin@8000)->(g729@8000) | ||
| opus:48000 To g722:16000 : (opus@48000)->(slin@48000)->(slin@16000)->(g722@16000) | ||
| *CLI> | ||
| ``` | ||
|
|
||
| The simple answer is "don't use opus" but of course that's not realistic especially | ||
| if you have WebRTC endpoints. The more general answer is "don't use expensive codecs | ||
| if you don't need to". A good example of this is when using chan_websocket on one leg | ||
| of a call to interact with an AI agent. We've seen some folks using opus on the websocket | ||
| leg to the agent even when the incoming callers are all using ulaw or alaw and the agent | ||
| platform will happily accept 8K signed-linear. | ||
|
|
||
| #### Sounds, Music on Hold, Recording, Voicemail | ||
|
|
||
| The term "Transcoding" (and all that it implies) applies whenever you have to | ||
| convert between two codecs and that includes playing sounds, playing music-on-hold, | ||
| recording calls, saving and playing voicemails, etc. | ||
|
|
||
| For sounds and MOH, ensure you have files in the formats you most expect your caller | ||
| channels to be using. When you have lots of different codecs in use and you're short | ||
| on disk space, having at least slin8 and slin16 versions of the files gives you the | ||
| least transcoding cost for everything except for opus. | ||
|
|
||
| For recording and voicemail, use the translation cost matrix to determine the most | ||
| efficient formats to save files in. | ||
|
|
||
| #### Logging | ||
|
|
||
| Logging is especially costly because it takes CPU cycles to create and manage the | ||
| messages plus I/O operations to write them out. | ||
|
|
||
| A performance analysis of Asterisk versions prior to 23.3.0, 22.9.0 and 20.19.0, showed | ||
| that almost 40% of the CPU instructions executed by the Asterisk process were attributed | ||
| to Channel Event Logging. That was more than the instructions used to actually process | ||
| calls. We made significant CEL performance improvements in versions 23.3.0, 22.9.0 and 20.19.0 | ||
| which brought that percentage to below 10% but if you don't need Channel Event logging | ||
| _turn it off_!. | ||
|
|
||
| Another culprit to watch out for is VERBOSE logging. When VERBOSE logging is enabled, | ||
| a log message is generated for every line traversed in the dialplan. On a busy system | ||
| this can result in hundreds of messages per second. Unless you have a good reason for | ||
| seeing all those messages, you should limit both console and file logging to NOTICE, | ||
| WARNING and ERROR. As for DEBUG logging, well enabling debug messages on production | ||
| system is just silly unless you're actively trying to diagnose a specific issue. | ||
|
|
||
| #### Still can't figure it out? | ||
|
|
||
| If Asterisk is still using more CPU than you think is necessary, it's time to ask for help. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Before doing this I'd add a paragraph about reducing the environment (their specific usage) to specific parts which can be disabled and using that to identify the aspects that are contributing to high CPU usage, and consequences, in their environment. |
||
| Start with the [Community Forums](https://community.asterisk.org) and provide as much | ||
| detail as you can about your symptomns, environment and expectations. | ||
|
|
||
| If you want to dive deeper into the internals of Asterisk, take a look at the | ||
| [Function Tracing](/Development/Debugging/Function-Tracing) | ||
| page in the [Development/Debugging](/Development/Debugging) section. | ||
|
|
||
|
|
||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,148 @@ | ||
|
|
||
| # Function Tracing | ||
|
|
||
| Function tracing can sometimes be a useful tool when CPU utilization is an issue. | ||
| For that, the `perf` suite of Linux performance tools is your best bet. A full discussion | ||
| of using perf is beyond the scope of this document but here are some basic instructions | ||
| for using `perf record` and `perf report` which can give you a good idea of CPU instruction | ||
| counts by Asterisk function. In most cases, `perf` doesn't require special compilation | ||
| settings but you may get more accurate results by enabling the recently added | ||
| `KEEP_FRAME_POINTERS` menuselect option under `Compiler Flags`. | ||
|
|
||
| ```shell | ||
| # Start perf-record on a running Asterisk process to capture instruction counts and | ||
| # latency by function. | ||
| # Let it run for a few minutes then use ctrl-c to stop it. | ||
| $ sudo perf record -p `pidof asterisk` -q -e instructions --latency --call-graph fp -o perf.out | ||
|
|
||
| # Run perf-report to generate the results report. | ||
| $ sudo perf report --force -i perf.out --call-graph=none -c asterisk --percentage relative > report.txt | ||
|
|
||
| # Filter out kernel and C library calls. | ||
| $ sed -i -r -e "/kernel|libc.so.6/d" report.txt | ||
| ``` | ||
|
|
||
| The output file will be a few thousand lines long but here's an example: | ||
|
|
||
| ``` | ||
| # | ||
| # comm: asterisk | ||
| # | ||
| # Total Lost Samples: 0 | ||
| # | ||
| # Samples: 33K of event 'instructions' | ||
| # Event count (approx.): 6561076235 | ||
| # | ||
| # Children Latency Self Latency Shared Object Symbol | ||
| # ........ ....... ........ ........ ...................... ................. | ||
| # | ||
| 99.46% 99.56% 0.00% 0.00% asterisk [.] dummy_start | ||
| 70.01% 68.68% 0.02% 0.01% asterisk [.] default_tps_processing_function | ||
| 62.30% 62.25% 0.19% 0.15% asterisk [.] ast_taskprocessor_execute | ||
| 38.04% 42.77% 0.01% 0.01% asterisk [.] execute_tasks | ||
| 22.81% 21.61% 0.75% 0.71% asterisk [.] internal_ao2_traverse | ||
| 22.73% 21.57% 0.18% 0.14% asterisk [.] __ao2_callback | ||
| 21.51% 16.72% 0.04% 0.03% asterisk [.] dispatch_exec_async | ||
| 19.33% 14.91% 0.11% 0.08% asterisk [.] subscription_invoke | ||
| 18.73% 14.40% 0.04% 0.03% asterisk [.] router_dispatch | ||
| 17.85% 21.51% 0.00% 0.00% res_pjsip.so [.] distribute | ||
| 17.77% 21.43% 0.03% 0.04% libasteriskpj.so.2 [.] pjsip_endpt_process_rx_data | ||
| 15.40% 14.69% 0.00% 0.00% asterisk [.] __ast_pbx_run | ||
| 15.40% 14.69% 0.00% 0.00% asterisk [.] pbx_thread | ||
| 14.82% 14.27% 0.00% 0.00% asterisk [.] pbx_extension_helper | ||
| 14.77% 14.23% 0.00% 0.00% asterisk [.] ast_spawn_extension | ||
| 14.71% 14.15% 0.00% 0.00% asterisk [.] pbx_exec | ||
| 14.61% 14.05% 0.00% 0.00% app_dial.so [.] dial_exec | ||
| 14.60% 14.04% 0.01% 0.01% app_dial.so [.] dial_exec_full | ||
| 12.51% 15.40% 0.00% 0.00% res_pjsip_session.so [.] session_on_rx_request | ||
| 12.51% 15.40% 0.00% 0.00% res_pjsip_session.so [.] handle_new_invite_request | ||
| 11.82% 9.36% 0.00% 0.00% asterisk [.] cel_report_event | ||
| 11.35% 13.95% 0.00% 0.00% res_pjsip_session.so [.] new_invite | ||
| 10.82% 9.97% 4.81% 4.70% asterisk [.] __ao2_ref | ||
| 10.79% 13.57% 0.00% 0.00% res_pjsip_sdp_rtp.so [.] create_rtp | ||
| 10.54% 12.81% 0.00% 0.01% asterisk [.] taskpool_sync_task | ||
| 10.42% 13.15% 0.00% 0.00% asterisk [.] ast_rtp_instance_new | ||
| 10.39% 13.12% 0.00% 0.00% res_rtp_asterisk.so [.] ast_rtp_new | ||
| 9.86% 12.54% 0.00% 0.00% res_rtp_asterisk.so [.] rtp_allocate_transport | ||
| 9.24% 7.31% 0.01% 0.01% res_cdrel_custom.so [.] cdrel_logger | ||
| 9.18% 7.26% 0.29% 0.22% res_cdrel_custom.so [.] log_advanced_record | ||
| 8.92% 11.37% 0.01% 0.01% res_rtp_asterisk.so [.] ice_create | ||
| 8.85% 7.15% 0.16% 0.14% asterisk [.] __ao2_find | ||
| 8.64% 6.90% 0.00% 0.00% asterisk [.] cel_backend_send_cb | ||
| 8.63% 6.89% 0.00% 0.00% cel_custom.so [.] custom_log | ||
| 8.20% 6.73% 0.01% 0.01% asterisk [.] cel_snapshot_update_cb | ||
| 7.98% 6.54% 0.02% 0.01% asterisk [.] cel_channel_state_change | ||
| 7.61% 6.34% 0.02% 0.01% asterisk [.] ast_sem_wait | ||
| 7.36% 6.66% 0.00% 0.00% asterisk [.] ast_bridge_call | ||
| 7.36% 6.66% 0.00% 0.00% asterisk [.] ast_bridge_call_with_flags | ||
| 7.06% 6.21% 0.37% 0.34% asterisk [.] __ao2_cleanup_debug | ||
| 6.88% 8.69% 0.00% 0.00% chan_pjsip.so [.] call | ||
| 6.77% 8.54% 0.01% 0.01% res_pjsip_session.so [.] handle_incoming_sdp | ||
| 6.71% 8.46% 0.00% 0.00% res_pjsip_sdp_rtp.so [.] negotiate_incoming_sdp_stream | ||
| 6.62% 6.06% 0.07% 0.05% asterisk [.] pbx_builtin_setvar_helper | ||
| 6.49% 4.81% 0.00% 0.00% asterisk [.] do_devstate_changes | ||
| 6.22% 4.59% 0.00% 0.00% asterisk [.] do_state_change | ||
| 6.20% 7.94% 0.01% 0.02% res_rtp_asterisk.so [.] rtp_add_candidates_to_ice | ||
| 6.13% 4.52% 0.01% 0.01% asterisk [.] _ast_device_state | ||
| 6.06% 4.46% 0.06% 0.04% chan_pjsip.so [.] chan_pjsip_devicestate | ||
| 5.94% 4.94% 0.00% 0.00% chan_pjsip.so [.] hangup | ||
|
|
||
| ``` | ||
|
|
||
| What you're seeing is a list of asterisk functions and what percentage of the total | ||
| CPU instructions and wall clock time they're responsible for. The percentages won't | ||
| add up to 100 because of the recursive nature of function calls. The top dozen or | ||
| so are wrapper functions which basically run everything but things get interesting | ||
| after that. | ||
|
|
||
| If you want to see the results without recussion, add the `--no-children` option to | ||
| `perf report`: | ||
|
|
||
| ```shell | ||
| $ sudo perf report -i perf.out --call-graph=none -c asterisk --percentage relative --no-children > report-no-children.txt | ||
| ``` | ||
|
|
||
| ``` | ||
| # To display the perf.data header info, please use --header/--header-only options. | ||
| # | ||
| # comm: asterisk | ||
| # | ||
| # Total Lost Samples: 0 | ||
| # | ||
| # Samples: 33K of event 'instructions' | ||
| # Event count (approx.): 6561076235 | ||
| # | ||
| # Overhead Latency Shared Object Symbol | ||
| # ........ ........ ..................................... .......................................... | ||
| # | ||
| 4.81% 4.70% asterisk [.] __ao2_ref | ||
| 2.57% 4.30% asterisk [.] hash_ao2_find_next | ||
| 1.96% 1.65% libc.so.6 [.] __printf_buffer | ||
| 1.77% 1.48% libc.so.6 [.] _int_free_chunk | ||
| 1.63% 1.62% libc.so.6 [.] _int_malloc | ||
| 1.54% 1.40% libc.so.6 [.] pthread_mutex_lock@@GLIBC_2.2.5 | ||
| 1.54% 1.49% libc.so.6 [.] __GI___pthread_mutex_unlock_usercnt | ||
| 1.53% 1.59% libc.so.6 [.] __strcmp_avx2_rtm | ||
| 1.32% 1.56% [kernel.kallsyms] [k] perf_iterate_ctx | ||
| 1.30% 1.14% asterisk [.] __ao2_lock | ||
| 1.21% 1.48% libc.so.6 [.] __random_r | ||
| 1.13% 1.45% [kernel.kallsyms] [k] __snmp6_fill_stats64.constprop.0 | ||
| 1.12% 1.09% asterisk [.] hash_ao2_find_first | ||
| 1.05% 0.98% libc.so.6 [.] __memmove_avx_unaligned_erms_rtm | ||
| 0.91% 0.80% libc.so.6 [.] tolower | ||
| 0.91% 0.89% libc.so.6 [.] __libc_calloc | ||
| 0.91% 0.97% [kernel.kallsyms] [k] memset_orig | ||
| 0.88% 0.74% libc.so.6 [.] __strlen_avx2_rtm | ||
| 0.85% 0.72% libc.so.6 [.] __printf_buffer_write | ||
| 0.83% 0.74% asterisk [.] __ao2_unlock | ||
| 0.79% 0.63% asterisk [.] ast_str_case_hash | ||
| 0.79% 0.68% libc.so.6 [.] cfree@GLIBC_2.2.5 | ||
| 0.75% 0.71% asterisk [.] internal_ao2_traverse | ||
| 0.70% 0.54% asterisk [.] ast_event_iterator_next | ||
| 0.70% 0.76% [kernel.kallsyms] [k] psi_group_change | ||
| 0.68% 0.59% asterisk [.] utf8_check_first | ||
| 0.60% 0.72% libasteriskpj.so.2 [.] pj_pool_alloc_from_block | ||
|
|
||
| ``` | ||
|
|
||
| Now the report shows only the instructions directly executed by the function. |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So who are the most common culprits?