System profiling, Applications tracing and Trace analysis
This section provides information related with tools, methodologies and features present and supported in the current Total Compute release, that allow to profile the performance and behaviour of the system and/or applications.
Simpleperf
Simpleperf
is a native CPU profiling tool for Android, which can be used to profile both Android applications and native processes running on Android.
Detailed documentation about this tool can be found on link.
The Linux Kernel exposes several Performance Monitoring Unit (PMU) - CPU and non-CPU - events, as well as software and tracepoint events to user space via the perf_event_open
system call, which is used by simpleperf
to collect and process the event’s data.
The following subsections do present some simpleperf
command examples that can be used to obtain useful information to troubleshoot and validate the development.
Detailed information on the available commands supported, and their usage can be obtained on link.
Note
Although not all, most of the following commands do require root privileges in order to retrieve some of the system wide information.
Failing to do so, will result in the following error message “System wide profiling needs root privilege”.
To obtain root privileges, simply run the command su 0
before running any of the following examples.
Simpleperf List
A list of the different supported events can be obtained by running the command simpleperf list
.
An example of the output provided by the command is presented below for reference:
console:/ $ simpleperf list
List of hw-cache events:
# More cache events are available in `simpleperf list raw`.
branch-load-misses
branch-loads
dTLB-load-misses
dTLB-loads
iTLB-load-misses
iTLB-loads
L1-dcache-load-misses
L1-dcache-loads
L1-icache-load-misses
L1-icache-loads
LLC-load-misses
LLC-loads
List of coresight etm events:
cs_etm/autofdo/
cs-etm # CoreSight ETM instruction tracing
List of hardware events:
branch-instructions
branch-misses
bus-cycles
cache-misses
cache-references
cpu-cycles
instructions
stalled-cycles-backend
stalled-cycles-frontend
List of pmu events:
arm_cspmu_0/cycles/
arm_cspmu_1/cycles/
arm_cspmu_2/cycles/
arm_cspmu_3/cycles/
arm_dsu_0/bus_access/
arm_dsu_0/bus_cycles/
arm_dsu_0/cycles/
arm_dsu_0/memory_error/
armv9_cortex_a520/br_immed_retired/
armv9_cortex_a520/br_mis_pred/
(...)
armv9_cortex_a520/trcextout3/
armv9_cortex_a520/ttbr_write_retired/
armv9_cortex_a725/br_immed_retired/
armv9_cortex_a725/br_mis_pred/
(...)
armv9_cortex_a725/ttbr_write_retired/
armv9_cortex_a725/unaligned_ldst_retired/
armv9_cortex_x925/br_immed_retired/
armv9_cortex_x925/br_mis_pred/
(...)
armv9_cortex_x925/ttbr_write_retired/
armv9_cortex_x925/unaligned_ldst_retired/
List of raw events provided by cpu pmu:
# Please refer to "PMU common architectural and microarchitectural event numbers"
# and "ARM recommendations for IMPLEMENTATION DEFINED event numbers" listed in
# ARMv8 manual for details.
# A possible link is https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile.
raw-ase-spec (may not supported) # Operation speculatively executed, Advanced SIMD instruction
raw-br-immed-retired (may not supported) # Instruction architecturally executed, immediate branch
raw-br-immed-spec (may not supported) # Branch speculatively executed, immediate branch
(...)
raw-unaligned-st-spec (may not supported) # Unaligned access, write
raw-vfp-spec (may not supported) # Operation speculatively executed, floating-point instruction
List of software events:
alignment-faults
context-switches
cpu-clock
cpu-migrations
emulation-faults
major-faults
minor-faults
page-faults
task-clock
(...)
Simpleperf Stat
The stat
command can be used to get event counter values of the profiled processes.
The command can be customised to filter which events to use, which processes/threads to monitor, how to monitor and what print interval to adopt.
Some command examples are presented below.
Get system wide event counts for a specific duration and print at a specific interval
The following command allows to get the system wide default event counts, considering a duration of 1s and printing counts every 50ms:
console:/ # simpleperf stat -a --duration 1 --interval 0.05
Performance counter statistics:
# count event_name # count / runtime
1,562,624 cpu-cycles # 0.012290 GHz
0 stalled-cycles-frontend # 0.000 /sec
0 stalled-cycles-backend # 0.000 /sec
1,427,862 instructions # 1.094380 cycles per instruction
45,105 branch-instructions # 392.411 K/sec
0 branch-misses # 0.000000% miss rate
106.490720(ms) task-clock # 88.921592 cpus used
14 context-switches # 139.296 /sec
11 page-faults # 114.277 /sec
Total test time: 0.001198 seconds.
Performance counter statistics:
# count event_name # count / runtime
3,394,688 cpu-cycles # 0.015067 GHz
0 stalled-cycles-frontend # 0.000 /sec
0 stalled-cycles-backend # 0.000 /sec
3,309,558 instructions # 1.025722 cycles per instruction
162,493 branch-instructions # 770.683 K/sec
0 branch-misses # 0.000000% miss rate
199.783110(ms) task-clock # 13.914760 cpus used
54 context-switches # 278.607 /sec
13 page-faults # 68.572 /sec
Total test time: 0.014358 seconds.
(...)
Get event counts for a specific process within a duration
The following command allows to get the default event counts for the process system_server
considering a duration of 50ms:
console:/ # ps -A | grep system_server
system 477 318 19313020 338596 do_epoll_wait 0 S system_server
console:/ #
console:/ # simpleperf stat -p 477 --duration 0.05
Performance counter statistics:
# count event_name # count / runtime
0 cpu-cycles #
0 stalled-cycles-frontend #
0 stalled-cycles-backend #
0 instructions #
0 branch-instructions #
0 branch-misses #
0.000000(ms) task-clock # 0.000000 cpus used
0 context-switches #
0 page-faults #
Total test time: 0.052116 seconds.
console:/sdcard #
Get specific events for a particular process
# this example assumes the previous "system_server" process with PID 477
console:/ # simpleperf stat -e cpu-cycles -p 477 --duration 0.05
Performance counter statistics:
# count event_name # count / runtime
6,351,580 cpu-cycles # 0.099210 GHz
Total test time: 0.050210 seconds.
console:/sdcard #
# Additional examples:
console:/ # simpleperf stat -e cache-references,cache-misses -p 477 --duration 0.05
console:/ # simpleperf stat -e cache-references,cache-misses ls
Similarly to filtering events for a particular process using -p <PID>
option, filtering events for specific threads can be achieved by using -t <TID>
argument.
Get non-CPU PMU events
console:/ # simpleperf stat -a -e arm_dsu_0/cycles/ -- sleep 0.01
Performance counter statistics:
# count event_name # count / runtime
9,223,372,036,854,775,809 arm_dsu_0/cycles/ # 469025720716.176 G/sec
Total test time: 0.019609 seconds.
console:/ # simpleperf stat -a -e arm_cspmu_0/cycles/ -- sleep 0.01
Performance counter statistics:
# count event_name # count / runtime
9,223,372,036,854,775,809 arm_cspmu_0/cycles/ # 469169585366.791 G/sec
Total test time: 0.019612 seconds.
console:/sdcard #
Note
Non-CPU PMU events are not supported in per-process due to perf or simpleperf not being able to attach events to a process.
Collect event counters using event-groups
console:/ # simpleperf stat --group cpu-cycles,instructions -- ls
acct debug_ramdisk lost+found second_stage_resources
apex dev mnt storage
bin etc odm sys
bugreports fstab.total_compute odm_dlkm system
cache init oem system_dlkm
config init.common.rc postinstall system_ext
d init.environ.rc proc vendor
data init.total_compute.rc product vendor_dlkm
data_mirror linkerconfig sdcard
Performance counter statistics:
# count event_name # count / runtime
27,147,316 cpu-cycles # 2.603942 GHz
27,147,250 instructions # 1.000002 cycles per instruction
Total test time: 0.010935 seconds.
console:/sdcard #
Simpleperf Record
The record
command is used to dump samples of the profiled processes.
The following example provides a very basic usage scenario of the command:
console:/sdcard # pwd
/sdcard
console:/sdcard # simpleperf record ls
Alarms DCIM Movies Pictures Ringtones
Android Documents Music Podcasts TemporaryFile-t57Mnj
Audiobooks Download Notifications Recordings perf.data
simpleperf I cmd_record.cpp:798] Recorded for 0.0108992 seconds. Start post processing.
simpleperf I cmd_record.cpp:891] Samples recorded: 37. Samples lost: 0.
console:/sdcard #
Some additional command usage examples may include:
# Record individual process for a specific duration:
console:/ # simpleperf record -p <PID> --duration <DURATION IN SECONDS>
# Record set of processes for a specific duration:
console:/ # simpleperf record -p <PID1>,<PID2> --duration <DURATION IN SECONDS>
# Spawn workload as a child process and record it:
console:/ # simpleperf record <WORKLOAD APPLICATION>
# Frequency of the record can be set using -f or -c option, where
# '-f 1000' means collecting 1000 records every second, and
# '-c 1000' means collecting 1 record when 1000 events are hit.
console:/ # simpleperf record -f <FREQUENCY> -p <PID> --duration <DURATION IN SECONDS>
console:/ # simpleperf record -c <COUNT> -p <PID> --duration <DURATION IN SECONDS>
Simpleperf Report
The report
command is used to report profiling data generated by the record
command.
The following example assumes being executed following the previous simpleperf record ls
command example:
# this example assumes and follows the run of the previous "simpleperf record ls" example
console:/sdcard # simpleperf report
Cmdline: /system/bin/simpleperf record ls
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 37
Event count: 25996499
Overhead Command Pid Tid Shared Object Symbol
38.72% ls 2145 2145 /system/lib64/libcrypto.so sha256_block_data_order
16.70% ls 2145 2145 [kernel.kallsyms] invoke_syscall
7.56% ls 2145 2145 [kernel.kallsyms] perf_output_end
6.67% ls 2145 2145 /apex/com.android.runtime/bin/linker64 [linker]soinfo::lookup_version_info(VersionTracker const&, unsigned int, char const*, version_info const**)
5.23% ls 2145 2145 [kernel.kallsyms] vm_area_free
4.64% ls 2145 2145 /apex/com.android.runtime/bin/linker64 [linker]ElfReader::ReadDynamicSection()
3.45% ls 2145 2145 [kernel.kallsyms] el0_da
3.20% ls 2145 2145 /apex/com.android.runtime/lib64/bionic/libc.so __aarch64_cas4_acq
3.16% ls 2145 2145 [kernel.kallsyms] mt_find
3.13% ls 2145 2145 [kernel.kallsyms] mas_wr_walk
2.97% ls 2145 2145 [kernel.kallsyms] mas_next_node
2.91% ls 2145 2145 [kernel.kallsyms] __rcu_read_unlock
1.56% ls 2145 2145 [kernel.kallsyms] mas_destroy
0.10% ls 2145 2145 [kernel.kallsyms] mas_walk
0.01% ls 2145 2145 [kernel.kallsyms] __pte_alloc
0.00% ls 2145 2145 [kernel.kallsyms] down_write_killable
0.00% ls 2145 2145 [kernel.kallsyms] setup_new_exec
0.00% simpleperf 2145 2145 [kernel.kallsyms] __rcu_read_lock
console:/sdcard #
Perf
Perf
is a profiler tool for Linux based systems that abstracts CPU hardware differences in Linux performance measurements, while presenting a simple command-line interface.
More information on the tool can be found on link.
The Linux Kernel exposes several Performance Monitoring Unit (PMU) - CPU and non-CPU - events, as well as software and tracepoint events to user space via the perf_event_open
system call, which is used by perf
to collect and process the event’s data.
List of available events
A list of the different supported events can be obtained by running the command perf list
.
An example of the output provided by the command is presented below for reference:
# perf list
List of pre-defined events (to be used in -e or -M):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
stalled-cycles-backend OR idle-cycles-backend [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
alignment-faults [Software event]
bpf-output [Software event]
cgroup-switches [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
page-faults OR faults [Software event]
task-clock [Software event]
duration_time [Tool event]
user_time [Tool event]
system_time [Tool event]
L1-dcache-load-misses [Hardware cache event]
L1-dcache-loads [Hardware cache event]
L1-icache-load-misses [Hardware cache event]
L1-icache-loads [Hardware cache event]
LLC-load-misses [Hardware cache event]
LLC-loads [Hardware cache event]
branch-load-misses [Hardware cache event]
branch-loads [Hardware cache event]
dTLB-load-misses [Hardware cache event]
dTLB-loads [Hardware cache event]
iTLB-load-misses [Hardware cache event]
iTLB-loads [Hardware cache event]
br_immed_retired OR armv9_cortex_a520/br_immed_retired/ [Kernel PMU event]
br_immed_retired OR armv9_cortex_a725/br_immed_retired/ [Kernel PMU event]
br_immed_retired OR armv9_cortex_x925/br_immed_retired/ [Kernel PMU event]
br_mis_pred OR armv9_cortex_a520/br_mis_pred/ [Kernel PMU event]
br_mis_pred OR armv9_cortex_a725/br_mis_pred/ [Kernel PMU event]
br_mis_pred OR armv9_cortex_x925/br_mis_pred/ [Kernel PMU event]
(...)
ttbr_write_retired OR armv9_cortex_a520/ttbr_write_retired/ [Kernel PMU event]
ttbr_write_retired OR armv9_cortex_a725/ttbr_write_retired/ [Kernel PMU event]
ttbr_write_retired OR armv9_cortex_x925/ttbr_write_retired/ [Kernel PMU event]
unaligned_ldst_retired OR armv9_cortex_a520/unaligned_ldst_retired/ [Kernel PMU event]
unaligned_ldst_retired OR armv9_cortex_a725/unaligned_ldst_retired/ [Kernel PMU event]
arm_cspmu_0/cycles/ [Kernel PMU event]
arm_cspmu_1/cycles/ [Kernel PMU event]
arm_cspmu_2/cycles/ [Kernel PMU event]
arm_cspmu_3/cycles/ [Kernel PMU event]
arm_dsu_0/bus_access/ [Kernel PMU event]
arm_dsu_0/bus_cycles/ [Kernel PMU event]
arm_dsu_0/cycles/ [Kernel PMU event]
arm_dsu_0/memory_error/ [Kernel PMU event]
arm_spe_0// [Kernel PMU event]
arm_spe_1// [Kernel PMU event]
cs_etm// [Kernel PMU event]
cs_etm/autofdo/ [Kernel PMU event]
(...)
alarmtimer:alarmtimer_cancel [Tracepoint event]
alarmtimer:alarmtimer_fired [Tracepoint event]
alarmtimer:alarmtimer_start [Tracepoint event]
alarmtimer:alarmtimer_suspend [Tracepoint event]
(...)
Note
The previous command may present its output in an unformatted way when running on the FVP. It may be desirable to instead redirect its output to a file and then list the contents of that file by running the following command sequence:
# perf list > perf_list.txt
# cat perf_list.txt
Perf Stat
The stat
command can be used to get event counter values of the profiled processes. Some examples of its usage are presented following, as well as some considerations to take into account when considering TC3, with direct implications on the perf stat
command.
Special considerations considering TC3 and implications on the perf stat
command
TC3 defines per-microarchitecture PMU instances. As a result, the Kernel CPU PMU events will be displayed for each CPU micro-architecture during perf list
, as illustrated on the following excerpt:
(...)
cpu_cycles OR armv9_cortex_a520/cpu_cycles/ [Kernel PMU event]
cpu_cycles OR armv9_cortex_a725/cpu_cycles/ [Kernel PMU event]
cpu_cycles OR armv9_cortex_x925/cpu_cycles/ [Kernel PMU event]
(...)
When considering Kernel 6.1 and for situations where the perf
command is executed as a task-bound (cpu==-1
), the event is opened on an arbitrary CPU PMU and will only count on a subset of CPUs. This means, for example, that it might open on a “big” PMU and only count while the task is running on “big” CPUs, but not while the task is running on “little” CPUs.
The following excerpt illustrates one such situation, where cycles are not counted for the command ls
, as the command did execute on the CPUs whose PMU were not selected by perf
to open the events:
# perf stat -e cycles -- ls
arm-ffa-tee.ko
Performance counter stats for 'ls':
<not counted> cycles (0.00%)
0.000509460 seconds time elapsed
0.000044000 seconds user
0.000000000 seconds sys
#
To overcome this implication and always ensure the retrieval of meaningful data, perf
commands should be executed in one of two possible ways:
providing to the
perf
command the individual CPU PMU events to count:# perf stat -e armv9_cortex_x925/cpu_cycles/,armv9_cortex_a725/cpu_cycles/,armv9_cortex_a520/cpu_cycles/ -- ls arm-ffa-tee.ko Performance counter stats for 'ls': 1224520 armv9_cortex_x925/cpu_cycles/ <not counted> armv9_cortex_a725/cpu_cycles/ (0.00%) <not counted> armv9_cortex_a520/cpu_cycles/ (0.00%) 0.000970750 seconds time elapsed 0.001074000 seconds user 0.000000000 seconds sys #providing to the
perf
command a CPU mask so that the event is opened on all CPU PMUs:# perf stat -C 0-7 -e instructions,cycles -- ls arm-ffa-tee.ko Performance counter stats for 'CPU(s) 0-7': 1768977 instructions # 1.00 insn per cycle 1764843 cycles 0.000740800 seconds time elapsed #
As can also be seen on the previous example, the instructions are not broken down to specific PMU type (CPU type). This might be ambiguous for users to read the result, as instructions/cycles on different CPUs do have different performance meaning.
This issue seems to have been fixed in newer Kernel versions (>=6.6) and when running the perf
command with the default event names (without providing the CPU mask). Therefore, a possible solution could be to compile perf
from newer source code, and copy the resulting binary into the rootfs before booting the image, or alternatively use the scp
command to upload the binary to a booted system.
Additional perf stat
command examples are illustrated following, where the -C 0-7
argument was used as a workaround for the above-mentioned issue (TC3 FVP has 8 CPUs):
## Single event
# perf stat -C 0-7 -e <EVENT> -- <WORKLOAD>
## Multiple events
# perf stat -C 0-7 -e <EVENT1>,<EVENT2>,...,<EVENT-N> -- <WORKLOAD>
## Event grouping
# perf stat -C 0-7 -e '{<EVENT1>,<EVENT2>,...,<EVENT-N>}' -- <WORKLOAD>
## Attaching to the existing process; 'sleep X' is passed to run perf for a specific duration
# perf stat -C 0-7 -e <EVENT> -p <PID> -- sleep 1
Note
DSU and MCN PMU driver do not support all possible events by name. For cases where data for a particular event is not visible, perf stat
can be used with a raw event ID.
Some examples of how to read the non-CPU PMU event counters are presented below (the values 0xa2
and 0x182
are obtained from the respective component TRM documentation):
# perf stat -e arm_dsu_0/cycles/,arm_dsu_0/memory_error/ -- sleep 0.01
Performance counter stats for 'system wide':
2 arm_dsu_0/cycles/
0 arm_dsu_0/memory_error/
0.010749870 seconds time elapsed
#
## Additional examples:
## Count DSU cache read refills
# perf stat -e arm_dsu_0/event=0xa2/ -- sleep 0.01
## Count MCN MCTL_write_req
# perf stat -e arm_cspmu_0/event=0x182/ -- sleep 0.01
Perf Record, Report and Annotate
Running perf record
will collect and generate a perf.data
file containing the sampling data of one or more events.
This data can be later analysed using perf report
or perf annotate
commands.
By default, the perf record
uses cycles as a default event.
To modify the sampling period while running perf record
, two approaches can be followed:
frequency: specifies the average rate of samples/sec (
-F
option);count: enforces sampling at the specifies event period (
-c
option).
Some command examples illustrating this usage are presented below:
# Sample on event cycles at the default frequency
perf record -C 0-7 <WORKLOAD>
# Sample on event instructions at 1000 samples/sec
perf record -C 0-7 -e instructions -F 1000 <WORKLOAD>
# Sample on event instructions at every 2000 occurrences of event
perf record -C 0-7 -e instructions -c 2000 <WORKLOAD>
Perf and Arm SPE extension
The Arm Statistical Profiling Extension (SPE) feature provides a hardware assisted CPU operation profiling mechanism. This provides accurate attribution of latencies and events down to individual instructions.
The general perf record
command usage with SPE on TC23 platform looks like:
perf record -e arm_spe_<spe_instance>/<CONFIG PARAMETERS>/ -- taskset -c <cpu_list> <WORKLOAD>
TC23 supports SPE only on Mid and Big CPUs and not on small CPUs, there are 2 SPE instances, arm_spe_0
for Mid CPUs (CPUs 2-5) and arm_spe_1
for big CPUs (CPUs 6-7).
When workload needs to be analyzed using SPE, it should be bound to CPUs which have the SPE capability using taskset.
So on TC23 platform workloads should be bound to CPUs 2-5 when using arm_spe_0
and workloads should be bound to CPUs 6-7 when using arm_spe_1
.
min_latency=0
config parameter is mandatory to provide with any perf-spe command.
The following listing illustrates how to record SPE samples on Mid CPUs with arm_spe_0
:
# perf record -e arm_spe_0/min_latency=0/ -- taskset -c 2-5 ls
arm-tstee.ko build_env.cfg perf.data perf.data.old
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.225 MB perf.data ]
#
The previously recorded data (perf.data
) can then be analyzed using the perf report
command as follows:
# perf report
Warning:
Please install libunwind or libdw development packages during the perf build.
Only instruction-based sampling period is currently supported by Arm SPE.
# To display the perf.data header info, please use --header/--header-only option
#
#
# Total Lost Samples: 0
#
# Samples: 853 of event 'l1d-access'
# Event count (approx.): 853
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... ..................... ..........................
#
31.42% 31.42% ls [kernel.kallsyms] [k] __sanitizer_cov_trace_
9.03% 9.03% taskset [kernel.kallsyms] [k] __sanitizer_cov_trace_
1.88% 1.88% ls [kernel.kallsyms] [k] next_uptodate_page
1.52% 1.52% ls [kernel.kallsyms] [k] do_set_pte
1.41% 1.41% ls [kernel.kallsyms] [k] search_cmp_ftr_reg
1.41% 1.41% ls [kernel.kallsyms] [k] unmap_page_range
1.06% 1.06% ls [kernel.kallsyms] [k] __pi_copy_page
1.06% 1.06% ls [kernel.kallsyms] [k] __rcu_read_unlock
(...)
The previous example only specify min_latency=0
required config parameter. However, there can be situations where making usage of the other config parameters may help to filter profiling information.
Complementing the previous example, let’s assume it would be desirable to make usage of the config parameter event_filter=2
, which discards all samples which do not have retired instructions events.
The following command listing illustrates the command usage considering this scenario:
# perf record -e arm_spe_0/min_latency=0,event_filter=2/ -- taskset -c 2 ls
arm-tstee.ko build_env.cfg perf.data perf.data.old
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.225 MB perf.data ]
#
Detailed information regarding the config parameters can be found at link.
Kernel config and prerequisites for enabling Arm SPE can also be found in the Kernel documentation.
Perfetto
Perfetto is an open-source stack developed for performance instrumentation and trace analysis. It offers services and libraries for recording system-level and app-level traces, native and Java heap profiling, a library for analysing traces using SQL, and a web-based user interface (UI) that allows to visualize and explore the collected traces.
It has support for ftrace
, atrace
, /proc/{stat,vmstat,pid}/*
and perf_event
as data sources to collect system level traces. A data source can be seen as the capability, exposed by a producer, of providing some tracing data. The producer is an entity that offers the ability to contribute to the trace, advertising this ability with one or more data sources. A consumer is an entity that controls the tracing service, provides to the tracing service the trace configuration and reads back the trace buffers. The tracing service is a long-lived entity (i.e. a system daemon on Linux/Android) which handles the tracing sessions, routes trace configuration from consumer to producers, and manages trace buffers.
The data source defines its own schema (a protobuf) consisting of data source trace config (what kind of input config it would expect from the consumer) and trace packets (what kind of data it would output into the trace).
- Some examples of data sources advertised by different producers to collect system-level traces are listed below:
linux.process_stats
linux.ftrace
linux.sys_stats
linux.perf
Recording and Visualising Traces with Perfetto
Perfetto can record traces, using either the UI (available at https://ui.perfetto.dev/#!/record) or the command line. An example of how perfetto can be used to collect traces using the command line is present below:
# the following commands are intended to be run on the host PC;
# only applicable for the following command, the current path is assumed to be <TC_WORKSPACE>
export PATH="$(pwd)/src/android/out/host/linux-x86/bin:${PATH}"
adb connect localhost:<PORT>
adb devices
adb -s localhost:<PORT> push config.txt /data/local/tmp/config.txt
adb -s localhost:<PORT> shell perfetto -o /data/misc/perfetto-traces/trace_file.perfetto-trace --txt -c /data/local/tmp/config.txt
adb -s localhost:<PORT> pull /data/misc/perfetto-traces/trace_file.perfetto-trace ./
- Some complementing considerations regarding the previous presented command listing:
the use of
-s localhost:<PORT>
can be ignored if there is only one ADB instance available for debug to the host;the default ADB port is 5555; however, in cases where there are more than one ADB instance available for debug to the host, the port may change; in these situations refer to the output of the
adb devices
command or to the FVP-model start up log information to understand which port was assigned as replacement; the ADB connection on Android section provides additional information that can be useful to troubleshoot the connection;config.txt
contains the perfetto trace config; some examples of this config will be presented in the Trace config examples subsection.
Once the perfetto trace file is collected and downloaded to the host, it can be loaded into the perfetto UI (available at https://ui.perfetto.dev/) using the option “open trace file”, as illustrated on the following image:
Detailed information regarding perfetto can be found on the official documentation available at https://perfetto.dev/docs/.
Trace config examples
This subsection provides three trace config examples that can be used to control the tracing service and influence the sampled data on the TC3 platform. Alongside to each trace configuration, examples of the visualisation of the respective captured trace data using the Perfetto UI are also included for reference.
Additional examples of data source trace configurations for different supported data sources can be found at https://perfetto.dev/docs/ (please refer to the “Data sources” section). Some additional config examples can be found in test/configs/
directory in perfetto source code.
A full list of supported ftrace
events can be found in file protos/perfetto/trace/ftrace/ftrace_event.proto
in perfetto source code.
A full list of supported meminfo
and vmstat
counters can be found in file protos/perfetto/common/sys_stats_counters.proto
in perfetto source code.
Example 1: collect ftrace
scheduling events, process stats and system stats counters every 1000ms:
Trace configuration file:
buffers {
size_kb: 16384
fill_policy: RING_BUFFER
}
buffers {
size_kb: 16384
fill_policy: RING_BUFFER
}
data_sources {
config {
name: "linux.ftrace"
target_buffer: 0
ftrace_config {
# Scheduling information and process tracking. Useful for:
# - what is happening on each CPU at each moment
# - why a thread was de-scheduled
# - parent/child relationships between processes and threads.
ftrace_events: "sched/sched_switch"
ftrace_events: "power/suspend_resume"
ftrace_events: "sched/sched_process_exit"
ftrace_events: "sched/sched_process_free"
ftrace_events: "task/task_newtask"
ftrace_events: "task/task_rename"
# Wakeup info. Allows to compute how long a task was
# blocked due to CPU contention.
ftrace_events: "sched/sched_wakeup"
# os.Trace markers:
ftrace_events: "ftrace/print"
# RSS and ION buffer events:
ftrace_events: "mm_event/mm_event_record"
ftrace_events: "kmem/rss_stat"
ftrace_events: "kmem/ion_heap_grow"
ftrace_events: "kmem/ion_heap_shrink"
}
}
}
data_sources {
config {
name: "linux.sys_stats"
target_buffer: 1
sys_stats_config {
meminfo_period_ms: 100
meminfo_counters: MEMINFO_MEM_AVAILABLE
meminfo_counters: MEMINFO_BUFFERS
meminfo_counters: MEMINFO_CACHED
meminfo_counters: MEMINFO_SWAP_CACHED
meminfo_counters: MEMINFO_ACTIVE
meminfo_counters: MEMINFO_INACTIVE
meminfo_counters: MEMINFO_ACTIVE_ANON
meminfo_counters: MEMINFO_INACTIVE_ANON
meminfo_counters: MEMINFO_ACTIVE_FILE
meminfo_counters: MEMINFO_INACTIVE_FILE
meminfo_counters: MEMINFO_UNEVICTABLE
vmstat_period_ms: 100
vmstat_counters: VMSTAT_NR_FREE_PAGES
vmstat_counters: VMSTAT_NR_ALLOC_BATCH
vmstat_counters: VMSTAT_NR_INACTIVE_ANON
vmstat_counters: VMSTAT_NR_VMSCAN_WRITE
vmstat_counters: VMSTAT_NR_VMSCAN_IMMEDIATE_RECLAIM
vmstat_counters: VMSTAT_NR_WRITEBACK_TEMP
stat_period_ms: 100
stat_counters: STAT_CPU_TIMES
stat_counters: STAT_IRQ_COUNTS
stat_counters: STAT_FORK_COUNT
}
}
}
data_sources: {
config {
name: "linux.process_stats"
target_buffer: 0
process_stats_config {
scan_all_processes_on_start: true
proc_stats_poll_ms: 1000
}
}
}
duration_ms: 1000
Perfetto UI visualisation:
Example 2: collect cpu_cycles
and instructions CPU PMU counters on all CPUs:
Trace configuration file:
buffers {
size_kb: 10240
fill_policy: RING_BUFFER
}
data_sources {
config {
name: "linux.perf"
target_buffer: 0
perf_event_config {
all_cpus: true
timebase {
frequency: 99
counter: HW_CPU_CYCLES
timestamp_clock: PERF_CLOCK_MONOTONIC
}
}
}
}
data_sources {
config {
name: "linux.perf"
target_buffer: 0
perf_event_config {
all_cpus: true
timebase {
frequency: 99
counter: HW_INSTRUCTIONS
timestamp_clock: PERF_CLOCK_MONOTONIC
}
}
}
}
duration_ms: 1000
Perfetto UI visualisation:
Example 3: call stack sampling of processes:
Trace configuration file:
buffers {
size_kb: 10240
fill_policy: RING_BUFFER
}
data_sources {
config {
name: "linux.perf"
target_buffer: 0
perf_event_config {
timebase {
frequency: 99
timestamp_clock: PERF_CLOCK_MONOTONIC
}
callstack_sampling {
kernel_frames: true
}
}
}
}
duration_ms: 1000
Perfetto UI visualisation:
Copyright (c) 2022-2024, Arm Limited. All rights reserved.