.. _docs/totalcompute/tc3/profiling_and_tracing: System profiling, Applications tracing and Trace analysis ========================================================= This section provides information related with tools, methodologies and features present and supported in the current Total Compute release, that allow to profile the performance and behaviour of the system and/or applications. Simpleperf ---------- ``Simpleperf`` is a native CPU profiling tool for Android, which can be used to profile both Android applications and native processes running on Android. Detailed documentation about this tool can be found on `link `__. The Linux Kernel exposes several Performance Monitoring Unit (PMU) - CPU and non-CPU - events, as well as software and tracepoint events to user space via the ``perf_event_open`` system call, which is used by ``simpleperf`` to collect and process the event's data. The following subsections do present some ``simpleperf`` command examples that can be used to obtain useful information to troubleshoot and validate the development. Detailed information on the available commands supported, and their usage can be obtained on `link `__. .. note:: Although not all, most of the following commands do require root privileges in order to retrieve some of the system wide information. Failing to do so, will result in the following error message *"System wide profiling needs root privilege"*. To obtain root privileges, simply run the command ``su 0`` before running any of the following examples. Simpleperf List ############### A list of the different supported events can be obtained by running the command ``simpleperf list``. An example of the output provided by the command is presented below for reference: :: console:/ $ simpleperf list List of hw-cache events: # More cache events are available in `simpleperf list raw`. branch-load-misses branch-loads dTLB-load-misses dTLB-loads iTLB-load-misses iTLB-loads L1-dcache-load-misses L1-dcache-loads L1-icache-load-misses L1-icache-loads LLC-load-misses LLC-loads List of coresight etm events: cs_etm/autofdo/ cs-etm # CoreSight ETM instruction tracing List of hardware events: branch-instructions branch-misses bus-cycles cache-misses cache-references cpu-cycles instructions stalled-cycles-backend stalled-cycles-frontend List of pmu events: arm_cspmu_0/cycles/ arm_cspmu_1/cycles/ arm_cspmu_2/cycles/ arm_cspmu_3/cycles/ arm_dsu_0/bus_access/ arm_dsu_0/bus_cycles/ arm_dsu_0/cycles/ arm_dsu_0/memory_error/ armv9_cortex_a520/br_immed_retired/ armv9_cortex_a520/br_mis_pred/ (...) armv9_cortex_a520/trcextout3/ armv9_cortex_a520/ttbr_write_retired/ armv9_cortex_a725/br_immed_retired/ armv9_cortex_a725/br_mis_pred/ (...) armv9_cortex_a725/ttbr_write_retired/ armv9_cortex_a725/unaligned_ldst_retired/ armv9_cortex_x925/br_immed_retired/ armv9_cortex_x925/br_mis_pred/ (...) armv9_cortex_x925/ttbr_write_retired/ armv9_cortex_x925/unaligned_ldst_retired/ List of raw events provided by cpu pmu: # Please refer to "PMU common architectural and microarchitectural event numbers" # and "ARM recommendations for IMPLEMENTATION DEFINED event numbers" listed in # ARMv8 manual for details. # A possible link is https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile. raw-ase-spec (may not supported) # Operation speculatively executed, Advanced SIMD instruction raw-br-immed-retired (may not supported) # Instruction architecturally executed, immediate branch raw-br-immed-spec (may not supported) # Branch speculatively executed, immediate branch (...) raw-unaligned-st-spec (may not supported) # Unaligned access, write raw-vfp-spec (may not supported) # Operation speculatively executed, floating-point instruction List of software events: alignment-faults context-switches cpu-clock cpu-migrations emulation-faults major-faults minor-faults page-faults task-clock (...) Simpleperf Stat ############### The ``stat`` command can be used to get event counter values of the profiled processes. The command can be customised to filter which events to use, which processes/threads to monitor, how to monitor and what print interval to adopt. Some command examples are presented below. Get system wide event counts for a specific duration and print at a specific interval ************************************************************************************* The following command allows to get the system wide default event counts, considering a duration of 1s and printing counts every 50ms: :: console:/ # simpleperf stat -a --duration 1 --interval 0.05 Performance counter statistics: # count event_name # count / runtime 1,562,624 cpu-cycles # 0.012290 GHz 0 stalled-cycles-frontend # 0.000 /sec 0 stalled-cycles-backend # 0.000 /sec 1,427,862 instructions # 1.094380 cycles per instruction 45,105 branch-instructions # 392.411 K/sec 0 branch-misses # 0.000000% miss rate 106.490720(ms) task-clock # 88.921592 cpus used 14 context-switches # 139.296 /sec 11 page-faults # 114.277 /sec Total test time: 0.001198 seconds. Performance counter statistics: # count event_name # count / runtime 3,394,688 cpu-cycles # 0.015067 GHz 0 stalled-cycles-frontend # 0.000 /sec 0 stalled-cycles-backend # 0.000 /sec 3,309,558 instructions # 1.025722 cycles per instruction 162,493 branch-instructions # 770.683 K/sec 0 branch-misses # 0.000000% miss rate 199.783110(ms) task-clock # 13.914760 cpus used 54 context-switches # 278.607 /sec 13 page-faults # 68.572 /sec Total test time: 0.014358 seconds. (...) Get event counts for a specific process within a duration ********************************************************* The following command allows to get the default event counts for the process ``system_server`` considering a duration of 50ms: :: console:/ # ps -A | grep system_server system 477 318 19313020 338596 do_epoll_wait 0 S system_server console:/ # console:/ # simpleperf stat -p 477 --duration 0.05 Performance counter statistics: # count event_name # count / runtime 0 cpu-cycles # 0 stalled-cycles-frontend # 0 stalled-cycles-backend # 0 instructions # 0 branch-instructions # 0 branch-misses # 0.000000(ms) task-clock # 0.000000 cpus used 0 context-switches # 0 page-faults # Total test time: 0.052116 seconds. console:/sdcard # Get specific events for a particular process ******************************************** :: # this example assumes the previous "system_server" process with PID 477 console:/ # simpleperf stat -e cpu-cycles -p 477 --duration 0.05 Performance counter statistics: # count event_name # count / runtime 6,351,580 cpu-cycles # 0.099210 GHz Total test time: 0.050210 seconds. console:/sdcard # # Additional examples: console:/ # simpleperf stat -e cache-references,cache-misses -p 477 --duration 0.05 console:/ # simpleperf stat -e cache-references,cache-misses ls Similarly to filtering events for a particular process using ``-p `` option, filtering events for specific threads can be achieved by using ``-t `` argument. Get non-CPU PMU events ********************** :: console:/ # simpleperf stat -a -e arm_dsu_0/cycles/ -- sleep 0.01 Performance counter statistics: # count event_name # count / runtime 9,223,372,036,854,775,809 arm_dsu_0/cycles/ # 469025720716.176 G/sec Total test time: 0.019609 seconds. console:/ # simpleperf stat -a -e arm_cspmu_0/cycles/ -- sleep 0.01 Performance counter statistics: # count event_name # count / runtime 9,223,372,036,854,775,809 arm_cspmu_0/cycles/ # 469169585366.791 G/sec Total test time: 0.019612 seconds. console:/sdcard # .. note:: Non-CPU PMU events are not supported in per-process due to perf or simpleperf not being able to attach events to a process. Collect event counters using event-groups ***************************************** :: console:/ # simpleperf stat --group cpu-cycles,instructions -- ls acct debug_ramdisk lost+found second_stage_resources apex dev mnt storage bin etc odm sys bugreports fstab.total_compute odm_dlkm system cache init oem system_dlkm config init.common.rc postinstall system_ext d init.environ.rc proc vendor data init.total_compute.rc product vendor_dlkm data_mirror linkerconfig sdcard Performance counter statistics: # count event_name # count / runtime 27,147,316 cpu-cycles # 2.603942 GHz 27,147,250 instructions # 1.000002 cycles per instruction Total test time: 0.010935 seconds. console:/sdcard # Simpleperf Record ################# The ``record`` command is used to dump samples of the profiled processes. The following example provides a very basic usage scenario of the command: :: console:/sdcard # pwd /sdcard console:/sdcard # simpleperf record ls Alarms DCIM Movies Pictures Ringtones Android Documents Music Podcasts TemporaryFile-t57Mnj Audiobooks Download Notifications Recordings perf.data simpleperf I cmd_record.cpp:798] Recorded for 0.0108992 seconds. Start post processing. simpleperf I cmd_record.cpp:891] Samples recorded: 37. Samples lost: 0. console:/sdcard # Some additional command usage examples may include: :: # Record individual process for a specific duration: console:/ # simpleperf record -p --duration # Record set of processes for a specific duration: console:/ # simpleperf record -p , --duration # Spawn workload as a child process and record it: console:/ # simpleperf record # Frequency of the record can be set using -f or -c option, where # '-f 1000' means collecting 1000 records every second, and # '-c 1000' means collecting 1 record when 1000 events are hit. console:/ # simpleperf record -f -p --duration console:/ # simpleperf record -c -p --duration Simpleperf Report ################# The ``report`` command is used to report profiling data generated by the ``record`` command. The following example assumes being executed following the previous ``simpleperf record ls`` command example: :: # this example assumes and follows the run of the previous "simpleperf record ls" example console:/sdcard # simpleperf report Cmdline: /system/bin/simpleperf record ls Arch: arm64 Event: cpu-cycles (type 0, config 0) Samples: 37 Event count: 25996499 Overhead Command Pid Tid Shared Object Symbol 38.72% ls 2145 2145 /system/lib64/libcrypto.so sha256_block_data_order 16.70% ls 2145 2145 [kernel.kallsyms] invoke_syscall 7.56% ls 2145 2145 [kernel.kallsyms] perf_output_end 6.67% ls 2145 2145 /apex/com.android.runtime/bin/linker64 [linker]soinfo::lookup_version_info(VersionTracker const&, unsigned int, char const*, version_info const**) 5.23% ls 2145 2145 [kernel.kallsyms] vm_area_free 4.64% ls 2145 2145 /apex/com.android.runtime/bin/linker64 [linker]ElfReader::ReadDynamicSection() 3.45% ls 2145 2145 [kernel.kallsyms] el0_da 3.20% ls 2145 2145 /apex/com.android.runtime/lib64/bionic/libc.so __aarch64_cas4_acq 3.16% ls 2145 2145 [kernel.kallsyms] mt_find 3.13% ls 2145 2145 [kernel.kallsyms] mas_wr_walk 2.97% ls 2145 2145 [kernel.kallsyms] mas_next_node 2.91% ls 2145 2145 [kernel.kallsyms] __rcu_read_unlock 1.56% ls 2145 2145 [kernel.kallsyms] mas_destroy 0.10% ls 2145 2145 [kernel.kallsyms] mas_walk 0.01% ls 2145 2145 [kernel.kallsyms] __pte_alloc 0.00% ls 2145 2145 [kernel.kallsyms] down_write_killable 0.00% ls 2145 2145 [kernel.kallsyms] setup_new_exec 0.00% simpleperf 2145 2145 [kernel.kallsyms] __rcu_read_lock console:/sdcard # Perf ---- ``Perf`` is a profiler tool for Linux based systems that abstracts CPU hardware differences in Linux performance measurements, while presenting a simple command-line interface. More information on the tool can be found on `link `__. The Linux Kernel exposes several Performance Monitoring Unit (PMU) - CPU and non-CPU - events, as well as software and tracepoint events to user space via the ``perf_event_open`` system call, which is used by ``perf`` to collect and process the event's data. List of available events ######################## A list of the different supported events can be obtained by running the command ``perf list``. An example of the output provided by the command is presented below for reference: :: # perf list List of pre-defined events (to be used in -e or -M): branch-instructions OR branches [Hardware event] branch-misses [Hardware event] bus-cycles [Hardware event] cache-misses [Hardware event] cache-references [Hardware event] cpu-cycles OR cycles [Hardware event] instructions [Hardware event] stalled-cycles-backend OR idle-cycles-backend [Hardware event] stalled-cycles-frontend OR idle-cycles-frontend [Hardware event] alignment-faults [Software event] bpf-output [Software event] cgroup-switches [Software event] context-switches OR cs [Software event] cpu-clock [Software event] cpu-migrations OR migrations [Software event] dummy [Software event] emulation-faults [Software event] major-faults [Software event] minor-faults [Software event] page-faults OR faults [Software event] task-clock [Software event] duration_time [Tool event] user_time [Tool event] system_time [Tool event] L1-dcache-load-misses [Hardware cache event] L1-dcache-loads [Hardware cache event] L1-icache-load-misses [Hardware cache event] L1-icache-loads [Hardware cache event] LLC-load-misses [Hardware cache event] LLC-loads [Hardware cache event] branch-load-misses [Hardware cache event] branch-loads [Hardware cache event] dTLB-load-misses [Hardware cache event] dTLB-loads [Hardware cache event] iTLB-load-misses [Hardware cache event] iTLB-loads [Hardware cache event] br_immed_retired OR armv9_cortex_a520/br_immed_retired/ [Kernel PMU event] br_immed_retired OR armv9_cortex_a725/br_immed_retired/ [Kernel PMU event] br_immed_retired OR armv9_cortex_x925/br_immed_retired/ [Kernel PMU event] br_mis_pred OR armv9_cortex_a520/br_mis_pred/ [Kernel PMU event] br_mis_pred OR armv9_cortex_a725/br_mis_pred/ [Kernel PMU event] br_mis_pred OR armv9_cortex_x925/br_mis_pred/ [Kernel PMU event] (...) ttbr_write_retired OR armv9_cortex_a520/ttbr_write_retired/ [Kernel PMU event] ttbr_write_retired OR armv9_cortex_a725/ttbr_write_retired/ [Kernel PMU event] ttbr_write_retired OR armv9_cortex_x925/ttbr_write_retired/ [Kernel PMU event] unaligned_ldst_retired OR armv9_cortex_a520/unaligned_ldst_retired/ [Kernel PMU event] unaligned_ldst_retired OR armv9_cortex_a725/unaligned_ldst_retired/ [Kernel PMU event] arm_cspmu_0/cycles/ [Kernel PMU event] arm_cspmu_1/cycles/ [Kernel PMU event] arm_cspmu_2/cycles/ [Kernel PMU event] arm_cspmu_3/cycles/ [Kernel PMU event] arm_dsu_0/bus_access/ [Kernel PMU event] arm_dsu_0/bus_cycles/ [Kernel PMU event] arm_dsu_0/cycles/ [Kernel PMU event] arm_dsu_0/memory_error/ [Kernel PMU event] arm_spe_0// [Kernel PMU event] arm_spe_1// [Kernel PMU event] cs_etm// [Kernel PMU event] cs_etm/autofdo/ [Kernel PMU event] (...) alarmtimer:alarmtimer_cancel [Tracepoint event] alarmtimer:alarmtimer_fired [Tracepoint event] alarmtimer:alarmtimer_start [Tracepoint event] alarmtimer:alarmtimer_suspend [Tracepoint event] (...) .. note:: The previous command may present its output in an unformatted way when running on the FVP. It may be desirable to instead redirect its output to a file and then list the contents of that file by running the following command sequence: :: # perf list > perf_list.txt # cat perf_list.txt Perf Stat ######### The ``stat`` command can be used to get event counter values of the profiled processes. Some examples of its usage are presented following, as well as some considerations to take into account when considering TC3, with direct implications on the ``perf stat`` command. Special considerations considering TC3 and implications on the ``perf stat`` command ************************************************************************************ TC3 defines per-microarchitecture PMU instances. As a result, the Kernel CPU PMU events will be displayed for each CPU micro-architecture during ``perf list``, as illustrated on the following excerpt: :: (...) cpu_cycles OR armv9_cortex_a520/cpu_cycles/ [Kernel PMU event] cpu_cycles OR armv9_cortex_a725/cpu_cycles/ [Kernel PMU event] cpu_cycles OR armv9_cortex_x925/cpu_cycles/ [Kernel PMU event] (...) When considering Kernel 6.1 and for situations where the ``perf`` command is executed as a task-bound (``cpu==-1``), the event is opened on an arbitrary CPU PMU and will only count on a subset of CPUs. This means, for example, that it might open on a "big" PMU and only count while the task is running on "big" CPUs, but not while the task is running on "little" CPUs. The following excerpt illustrates one such situation, where cycles are not counted for the command ``ls``, as the command did execute on the CPUs whose PMU were not selected by ``perf`` to open the events: :: # perf stat -e cycles -- ls arm-ffa-tee.ko Performance counter stats for 'ls': cycles (0.00%) 0.000509460 seconds time elapsed 0.000044000 seconds user 0.000000000 seconds sys # To overcome this implication and always ensure the retrieval of meaningful data, ``perf`` commands should be executed in one of two possible ways: #. providing to the ``perf`` command the individual CPU PMU events to count: :: # perf stat -e armv9_cortex_x925/cpu_cycles/,armv9_cortex_a725/cpu_cycles/,armv9_cortex_a520/cpu_cycles/ -- ls arm-ffa-tee.ko Performance counter stats for 'ls': 1224520 armv9_cortex_x925/cpu_cycles/ armv9_cortex_a725/cpu_cycles/ (0.00%) armv9_cortex_a520/cpu_cycles/ (0.00%) 0.000970750 seconds time elapsed 0.001074000 seconds user 0.000000000 seconds sys # #. providing to the ``perf`` command a CPU mask so that the event is opened on all CPU PMUs: :: # perf stat -C 0-7 -e instructions,cycles -- ls arm-ffa-tee.ko Performance counter stats for 'CPU(s) 0-7': 1768977 instructions # 1.00 insn per cycle 1764843 cycles 0.000740800 seconds time elapsed # As can also be seen on the previous example, the instructions are not broken down to specific PMU type (CPU type). This might be ambiguous for users to read the result, as instructions/cycles on different CPUs do have different performance meaning. This issue seems to have been fixed in newer Kernel versions (>=6.6) and when running the ``perf`` command with the default event names (without providing the CPU mask). Therefore, a possible solution could be to compile ``perf`` from newer source code, and copy the resulting binary into the rootfs before booting the image, or alternatively use the ``scp`` command to upload the binary to a booted system. Additional ``perf stat`` command examples are illustrated following, where the ``-C 0-7`` argument was used as a workaround for the above-mentioned issue (TC3 FVP has 8 CPUs): :: ## Single event # perf stat -C 0-7 -e -- ## Multiple events # perf stat -C 0-7 -e ,,..., -- ## Event grouping # perf stat -C 0-7 -e '{,,...,}' -- ## Attaching to the existing process; 'sleep X' is passed to run perf for a specific duration # perf stat -C 0-7 -e -p -- sleep 1 .. note:: DSU and MCN PMU driver do not support all possible events by name. For cases where data for a particular event is not visible, ``perf stat`` can be used with a raw event ID. Some examples of how to read the non-CPU PMU event counters are presented below (the values ``0xa2`` and ``0x182`` are obtained from the respective component TRM documentation): :: # perf stat -e arm_dsu_0/cycles/,arm_dsu_0/memory_error/ -- sleep 0.01 Performance counter stats for 'system wide': 2 arm_dsu_0/cycles/ 0 arm_dsu_0/memory_error/ 0.010749870 seconds time elapsed # ## Additional examples: ## Count DSU cache read refills # perf stat -e arm_dsu_0/event=0xa2/ -- sleep 0.01 ## Count MCN MCTL_write_req # perf stat -e arm_cspmu_0/event=0x182/ -- sleep 0.01 Perf Record, Report and Annotate ################################ Running ``perf record`` will collect and generate a ``perf.data`` file containing the sampling data of one or more events. This data can be later analysed using ``perf report`` or ``perf annotate`` commands. By default, the ``perf record`` uses cycles as a default event. To modify the sampling period while running ``perf record``, two approaches can be followed: #. **frequency**: specifies the average rate of samples/sec (``-F`` option); #. **count**: enforces sampling at the specifies event period (``-c`` option). Some command examples illustrating this usage are presented below: :: # Sample on event cycles at the default frequency perf record -C 0-7 # Sample on event instructions at 1000 samples/sec perf record -C 0-7 -e instructions -F 1000 # Sample on event instructions at every 2000 occurrences of event perf record -C 0-7 -e instructions -c 2000 Perf and Arm SPE extension ************************** The Arm Statistical Profiling Extension (SPE) feature provides a hardware assisted CPU operation profiling mechanism. This provides accurate attribution of latencies and events down to individual instructions. The general ``perf record`` command usage with SPE on TC23 platform looks like: :: perf record -e arm_spe_// -- taskset -c TC23 supports SPE only on Mid and Big CPUs and not on small CPUs, there are 2 SPE instances, ``arm_spe_0`` for Mid CPUs (CPUs 2-5) and ``arm_spe_1`` for big CPUs (CPUs 6-7). When workload needs to be analyzed using SPE, it should be bound to CPUs which have the SPE capability using taskset. So on TC23 platform workloads should be bound to CPUs 2-5 when using ``arm_spe_0`` and workloads should be bound to CPUs 6-7 when using ``arm_spe_1``. ``min_latency=0`` config parameter is mandatory to provide with any perf-spe command. The following listing illustrates how to record SPE samples on Mid CPUs with ``arm_spe_0``: :: # perf record -e arm_spe_0/min_latency=0/ -- taskset -c 2-5 ls arm-tstee.ko build_env.cfg perf.data perf.data.old [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.225 MB perf.data ] # The previously recorded data (``perf.data``) can then be analyzed using the ``perf report`` command as follows: :: # perf report Warning: Please install libunwind or libdw development packages during the perf build. Only instruction-based sampling period is currently supported by Arm SPE. # To display the perf.data header info, please use --header/--header-only option # # # Total Lost Samples: 0 # # Samples: 853 of event 'l1d-access' # Event count (approx.): 853 # # Children Self Command Shared Object Symbol # ........ ........ ....... ..................... .......................... # 31.42% 31.42% ls [kernel.kallsyms] [k] __sanitizer_cov_trace_ 9.03% 9.03% taskset [kernel.kallsyms] [k] __sanitizer_cov_trace_ 1.88% 1.88% ls [kernel.kallsyms] [k] next_uptodate_page 1.52% 1.52% ls [kernel.kallsyms] [k] do_set_pte 1.41% 1.41% ls [kernel.kallsyms] [k] search_cmp_ftr_reg 1.41% 1.41% ls [kernel.kallsyms] [k] unmap_page_range 1.06% 1.06% ls [kernel.kallsyms] [k] __pi_copy_page 1.06% 1.06% ls [kernel.kallsyms] [k] __rcu_read_unlock (...) The previous example only specify ``min_latency=0`` required config parameter. However, there can be situations where making usage of the other config parameters may help to filter profiling information. Complementing the previous example, let's assume it would be desirable to make usage of the config parameter ``event_filter=2``, which discards all samples which do not have retired instructions events. The following command listing illustrates the command usage considering this scenario: :: # perf record -e arm_spe_0/min_latency=0,event_filter=2/ -- taskset -c 2 ls arm-tstee.ko build_env.cfg perf.data perf.data.old [ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 0.225 MB perf.data ] # Detailed information regarding the config parameters can be found at `link `__. Kernel config and prerequisites for enabling Arm SPE can also be found in the Kernel documentation. Perfetto -------- Perfetto is an open-source stack developed for performance instrumentation and trace analysis. It offers services and libraries for recording system-level and app-level traces, native and Java heap profiling, a library for analysing traces using SQL, and a web-based user interface (UI) that allows to visualize and explore the collected traces. It has support for ``ftrace``, ``atrace``, ``/proc/{stat,vmstat,pid}/*`` and ``perf_event`` as data sources to collect system level traces. A data source can be seen as the capability, exposed by a producer, of providing some tracing data. The producer is an entity that offers the ability to contribute to the trace, advertising this ability with one or more data sources. A consumer is an entity that controls the tracing service, provides to the tracing service the trace configuration and reads back the trace buffers. The tracing service is a long-lived entity (i.e. a system daemon on Linux/Android) which handles the tracing sessions, routes trace configuration from consumer to producers, and manages trace buffers. The data source defines its own schema (a protobuf) consisting of data source trace config (what kind of input config it would expect from the consumer) and trace packets (what kind of data it would output into the trace). Some examples of data sources advertised by different producers to collect system-level traces are listed below: * linux.process_stats * linux.ftrace * linux.sys_stats * linux.perf Recording and Visualising Traces with Perfetto ############################################## Perfetto can record traces, using either the UI (available at `https://ui.perfetto.dev/#!/record `__) or the command line. An example of how perfetto can be used to collect traces using the command line is present below: :: # the following commands are intended to be run on the host PC; # only applicable for the following command, the current path is assumed to be export PATH="$(pwd)/src/android/out/host/linux-x86/bin:${PATH}" adb connect localhost: adb devices adb -s localhost: push config.txt /data/local/tmp/config.txt adb -s localhost: shell perfetto -o /data/misc/perfetto-traces/trace_file.perfetto-trace --txt -c /data/local/tmp/config.txt adb -s localhost: pull /data/misc/perfetto-traces/trace_file.perfetto-trace ./ Some complementing considerations regarding the previous presented command listing: * the use of ``-s localhost:`` can be ignored if there is only one ADB instance available for debug to the host; * the default ADB port is 5555; however, in cases where there are more than one ADB instance available for debug to the host, the port may change; in these situations refer to the output of the ``adb devices`` command or to the FVP-model start up log information to understand which port was assigned as replacement; the :ref:`ADB connection on Android ` section provides additional information that can be useful to troubleshoot the connection; * ``config.txt`` contains the perfetto trace config; some examples of this config will be presented in the :ref:`Trace config examples ` subsection. Once the perfetto trace file is collected and downloaded to the host, it can be loaded into the perfetto UI (available at `https://ui.perfetto.dev/ `__) using the option "open trace file", as illustrated on the following image: .. figure:: perfetto_load_trace_file.png Detailed information regarding perfetto can be found on the official documentation available at `https://perfetto.dev/docs/ `__. .. _docs/totalcompute/tc3/perfetto-trace-config-examples: Trace config examples ##################### This subsection provides three trace config examples that can be used to control the tracing service and influence the sampled data on the TC3 platform. Alongside to each trace configuration, examples of the visualisation of the respective captured trace data using the Perfetto UI are also included for reference. Additional examples of data source trace configurations for different supported data sources can be found at `https://perfetto.dev/docs/ `__ (please refer to the "Data sources" section). Some additional config examples can be found in ``test/configs/`` directory in perfetto source code. A full list of supported ``ftrace`` events can be found in file ``protos/perfetto/trace/ftrace/ftrace_event.proto`` in perfetto source code. A full list of supported ``meminfo`` and ``vmstat`` counters can be found in file ``protos/perfetto/common/sys_stats_counters.proto`` in perfetto source code. Example 1: collect ``ftrace`` scheduling events, process stats and system stats counters every 1000ms: ****************************************************************************************************** **Trace configuration file:** :: buffers { size_kb: 16384 fill_policy: RING_BUFFER } buffers { size_kb: 16384 fill_policy: RING_BUFFER } data_sources { config { name: "linux.ftrace" target_buffer: 0 ftrace_config { # Scheduling information and process tracking. Useful for: # - what is happening on each CPU at each moment # - why a thread was de-scheduled # - parent/child relationships between processes and threads. ftrace_events: "sched/sched_switch" ftrace_events: "power/suspend_resume" ftrace_events: "sched/sched_process_exit" ftrace_events: "sched/sched_process_free" ftrace_events: "task/task_newtask" ftrace_events: "task/task_rename" # Wakeup info. Allows to compute how long a task was # blocked due to CPU contention. ftrace_events: "sched/sched_wakeup" # os.Trace markers: ftrace_events: "ftrace/print" # RSS and ION buffer events: ftrace_events: "mm_event/mm_event_record" ftrace_events: "kmem/rss_stat" ftrace_events: "kmem/ion_heap_grow" ftrace_events: "kmem/ion_heap_shrink" } } } data_sources { config { name: "linux.sys_stats" target_buffer: 1 sys_stats_config { meminfo_period_ms: 100 meminfo_counters: MEMINFO_MEM_AVAILABLE meminfo_counters: MEMINFO_BUFFERS meminfo_counters: MEMINFO_CACHED meminfo_counters: MEMINFO_SWAP_CACHED meminfo_counters: MEMINFO_ACTIVE meminfo_counters: MEMINFO_INACTIVE meminfo_counters: MEMINFO_ACTIVE_ANON meminfo_counters: MEMINFO_INACTIVE_ANON meminfo_counters: MEMINFO_ACTIVE_FILE meminfo_counters: MEMINFO_INACTIVE_FILE meminfo_counters: MEMINFO_UNEVICTABLE vmstat_period_ms: 100 vmstat_counters: VMSTAT_NR_FREE_PAGES vmstat_counters: VMSTAT_NR_ALLOC_BATCH vmstat_counters: VMSTAT_NR_INACTIVE_ANON vmstat_counters: VMSTAT_NR_VMSCAN_WRITE vmstat_counters: VMSTAT_NR_VMSCAN_IMMEDIATE_RECLAIM vmstat_counters: VMSTAT_NR_WRITEBACK_TEMP stat_period_ms: 100 stat_counters: STAT_CPU_TIMES stat_counters: STAT_IRQ_COUNTS stat_counters: STAT_FORK_COUNT } } } data_sources: { config { name: "linux.process_stats" target_buffer: 0 process_stats_config { scan_all_processes_on_start: true proc_stats_poll_ms: 1000 } } } duration_ms: 1000 **Perfetto UI visualisation:** .. figure:: perfetto_ex1_ui_visualisation.png :alt: Example 1 - collect ``ftrace`` scheduling events, process stats and system stats counters every 1000ms :align: center :target: ../../_images/perfetto_ex1_ui_visualisation.png .. raw:: html

Example 2: collect ``cpu_cycles`` and instructions CPU PMU counters on all CPUs: ******************************************************************************** **Trace configuration file:** :: buffers { size_kb: 10240 fill_policy: RING_BUFFER } data_sources { config { name: "linux.perf" target_buffer: 0 perf_event_config { all_cpus: true timebase { frequency: 99 counter: HW_CPU_CYCLES timestamp_clock: PERF_CLOCK_MONOTONIC } } } } data_sources { config { name: "linux.perf" target_buffer: 0 perf_event_config { all_cpus: true timebase { frequency: 99 counter: HW_INSTRUCTIONS timestamp_clock: PERF_CLOCK_MONOTONIC } } } } duration_ms: 1000 **Perfetto UI visualisation:** .. figure:: perfetto_ex2_ui_visualisation.png :alt: Example 2 - collect cpu_cycles and instructions CPU PMU counters on all CPUs :align: center :target: ../../_images/perfetto_ex2_ui_visualisation.png .. raw:: html

Example 3: call stack sampling of processes: ******************************************** **Trace configuration file:** :: buffers { size_kb: 10240 fill_policy: RING_BUFFER } data_sources { config { name: "linux.perf" target_buffer: 0 perf_event_config { timebase { frequency: 99 timestamp_clock: PERF_CLOCK_MONOTONIC } callstack_sampling { kernel_frames: true } } } } duration_ms: 1000 **Perfetto UI visualisation:** .. figure:: perfetto_ex3_ui_visualisation.png :alt: Example 3 - call stack sampling of processes :align: center :target: ../../_images/perfetto_ex3_ui_visualisation.png -------------- *Copyright (c) 2022-2024, Arm Limited. All rights reserved.*