System profiling, Applications tracing and Trace analysis

This section provides information related with tools, methodologies and features present and supported in the current Total Compute release, that allow to profile the performance and behaviour of the system and/or applications.

Simpleperf

Simpleperf is a native CPU profiling tool for Android, which can be used to profile both Android applications and native processes running on Android. Detailed documentation about this tool can be found on link.

The Linux Kernel exposes several Performance Monitoring Unit (PMU) - CPU and non-CPU - events, as well as software and tracepoint events to user space via the perf_event_open system call, which is used by simpleperf to collect and process the event’s data.

The following subsections do present some simpleperf command examples that can be used to obtain useful information to troubleshoot and validate the development. Detailed information on the available commands supported, and their usage can be obtained on link.

Note

Although not all, most of the following commands do require root privileges in order to retrieve some of the system wide information. Failing to do so, will result in the following error message “System wide profiling needs root privilege”. To obtain root privileges, simply run the command su 0 before running any of the following examples.

Simpleperf List

A list of the different supported events can be obtained by running the command simpleperf list. An example of the output provided by the command is presented below for reference:

console:/ $ simpleperf list

List of hw-cache events:
  # More cache events are available in `simpleperf list raw`.
  branch-load-misses
  branch-loads
  dTLB-load-misses
  dTLB-loads
  iTLB-load-misses
  iTLB-loads
  L1-dcache-load-misses
  L1-dcache-loads
  L1-icache-load-misses
  L1-icache-loads
  LLC-load-misses
  LLC-loads

List of coresight etm events:
  cs_etm/autofdo/
  cs-etm                # CoreSight ETM instruction tracing

List of hardware events:
  branch-instructions
  branch-misses
  bus-cycles
  cache-misses
  cache-references
  cpu-cycles
  instructions
  stalled-cycles-backend
  stalled-cycles-frontend

List of pmu events:
  arm_cspmu_0/cycles/
  arm_cspmu_1/cycles/
  arm_cspmu_2/cycles/
  arm_cspmu_3/cycles/
  arm_dsu_0/bus_access/
  arm_dsu_0/bus_cycles/
  arm_dsu_0/cycles/
  arm_dsu_0/memory_error/
  armv9_cortex_a520/br_immed_retired/
  armv9_cortex_a520/br_mis_pred/
  (...)
  armv9_cortex_a520/trcextout3/
  armv9_cortex_a520/ttbr_write_retired/
  armv9_cortex_a725/br_immed_retired/
  armv9_cortex_a725/br_mis_pred/
  (...)
  armv9_cortex_a725/ttbr_write_retired/
  armv9_cortex_a725/unaligned_ldst_retired/
  armv9_cortex_x925/br_immed_retired/
  armv9_cortex_x925/br_mis_pred/
  (...)
  armv9_cortex_x925/ttbr_write_retired/
  armv9_cortex_x925/unaligned_ldst_retired/
List of raw events provided by cpu pmu:
  # Please refer to "PMU common architectural and microarchitectural event numbers"
  # and "ARM recommendations for IMPLEMENTATION DEFINED event numbers" listed in
  # ARMv8 manual for details.
  # A possible link is https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile.
  raw-ase-spec (may not supported)              # Operation speculatively executed, Advanced SIMD instruction
  raw-br-immed-retired (may not supported)      # Instruction architecturally executed, immediate branch
  raw-br-immed-spec (may not supported)         # Branch speculatively executed, immediate branch
  (...)
  raw-unaligned-st-spec (may not supported)     # Unaligned access, write
  raw-vfp-spec (may not supported)              # Operation speculatively executed, floating-point instruction

List of software events:
  alignment-faults
  context-switches
  cpu-clock
  cpu-migrations
  emulation-faults
  major-faults
  minor-faults
  page-faults
  task-clock
  (...)

Simpleperf Stat

The stat command can be used to get event counter values of the profiled processes. The command can be customised to filter which events to use, which processes/threads to monitor, how to monitor and what print interval to adopt.

Some command examples are presented below.

Get system wide event counts for a specific duration and print at a specific interval

The following command allows to get the system wide default event counts, considering a duration of 1s and printing counts every 50ms:

console:/ # simpleperf stat -a --duration 1 --interval 0.05

Performance counter statistics:

#          count  event_name                # count / runtime
       1,562,624  cpu-cycles                # 0.012290 GHz
               0  stalled-cycles-frontend   # 0.000 /sec
               0  stalled-cycles-backend    # 0.000 /sec
       1,427,862  instructions              # 1.094380 cycles per instruction
          45,105  branch-instructions       # 392.411 K/sec
               0  branch-misses             # 0.000000% miss rate
  106.490720(ms)  task-clock                # 88.921592 cpus used
              14  context-switches          # 139.296 /sec
              11  page-faults               # 114.277 /sec

Total test time: 0.001198 seconds.
Performance counter statistics:

#          count  event_name                # count / runtime
       3,394,688  cpu-cycles                # 0.015067 GHz
               0  stalled-cycles-frontend   # 0.000 /sec
               0  stalled-cycles-backend    # 0.000 /sec
       3,309,558  instructions              # 1.025722 cycles per instruction
         162,493  branch-instructions       # 770.683 K/sec
               0  branch-misses             # 0.000000% miss rate
  199.783110(ms)  task-clock                # 13.914760 cpus used
              54  context-switches          # 278.607 /sec
              13  page-faults               # 68.572 /sec

Total test time: 0.014358 seconds.
(...)

Get event counts for a specific process within a duration

The following command allows to get the default event counts for the process system_server considering a duration of 50ms:

console:/ # ps -A | grep system_server
system         477   318   19313020 338596 do_epoll_wait       0 S system_server
console:/ #
console:/ # simpleperf stat -p 477 --duration 0.05

Performance counter statistics:

#        count  event_name                # count / runtime
             0  cpu-cycles                #
             0  stalled-cycles-frontend   #
             0  stalled-cycles-backend    #
             0  instructions              #
             0  branch-instructions       #
             0  branch-misses             #
  0.000000(ms)  task-clock                # 0.000000 cpus used
             0  context-switches          #
             0  page-faults               #

Total test time: 0.052116 seconds.
console:/sdcard #

Get specific events for a particular process

# this example assumes the previous "system_server" process with PID 477
console:/ # simpleperf stat -e cpu-cycles -p 477 --duration 0.05
Performance counter statistics:

#     count  event_name   # count / runtime
  6,351,580  cpu-cycles   # 0.099210 GHz

Total test time: 0.050210 seconds.
console:/sdcard #

# Additional examples:
console:/ # simpleperf stat -e cache-references,cache-misses -p 477 --duration 0.05
console:/ # simpleperf stat -e cache-references,cache-misses ls

Similarly to filtering events for a particular process using -p <PID> option, filtering events for specific threads can be achieved by using -t <TID> argument.

Get non-CPU PMU events

console:/ # simpleperf stat -a -e arm_dsu_0/cycles/ -- sleep 0.01
Performance counter statistics:

#                     count  event_name          # count / runtime
  9,223,372,036,854,775,809  arm_dsu_0/cycles/   # 469025720716.176 G/sec

Total test time: 0.019609 seconds.

console:/ # simpleperf stat -a -e arm_cspmu_0/cycles/ -- sleep 0.01
Performance counter statistics:

#                     count  event_name            # count / runtime
  9,223,372,036,854,775,809  arm_cspmu_0/cycles/   # 469169585366.791 G/sec

Total test time: 0.019612 seconds.
console:/sdcard #

Note

Non-CPU PMU events are not supported in per-process due to perf or simpleperf not being able to attach events to a process.

Collect event counters using event-groups

console:/ # simpleperf stat --group cpu-cycles,instructions -- ls

acct         debug_ramdisk          lost+found   second_stage_resources
apex         dev                    mnt          storage
bin          etc                    odm          sys
bugreports   fstab.total_compute    odm_dlkm     system
cache        init                   oem          system_dlkm
config       init.common.rc         postinstall  system_ext
d            init.environ.rc        proc         vendor
data         init.total_compute.rc  product      vendor_dlkm
data_mirror  linkerconfig           sdcard
Performance counter statistics:

#      count  event_name     # count / runtime
  27,147,316  cpu-cycles     # 2.603942 GHz
  27,147,250  instructions   # 1.000002 cycles per instruction

Total test time: 0.010935 seconds.
console:/sdcard #

Simpleperf Record

The record command is used to dump samples of the profiled processes. The following example provides a very basic usage scenario of the command:

console:/sdcard # pwd
/sdcard
console:/sdcard # simpleperf record ls
Alarms      DCIM       Movies         Pictures    Ringtones
Android     Documents  Music          Podcasts    TemporaryFile-t57Mnj
Audiobooks  Download   Notifications  Recordings  perf.data
simpleperf I cmd_record.cpp:798] Recorded for 0.0108992 seconds. Start post processing.
simpleperf I cmd_record.cpp:891] Samples recorded: 37. Samples lost: 0.
console:/sdcard #

Some additional command usage examples may include:

# Record individual process for a specific duration:
console:/ # simpleperf record -p <PID> --duration <DURATION IN SECONDS>

# Record set of processes for a specific duration:
console:/ # simpleperf record -p <PID1>,<PID2> --duration <DURATION IN SECONDS>

# Spawn workload as a child process and record it:
console:/ # simpleperf record <WORKLOAD APPLICATION>

# Frequency of the record can be set using -f or -c option, where
# '-f 1000' means collecting 1000 records every second, and
# '-c 1000' means collecting 1 record when 1000 events are hit.
console:/ # simpleperf record -f <FREQUENCY> -p <PID> --duration <DURATION IN SECONDS>
console:/ # simpleperf record -c <COUNT> -p <PID> --duration <DURATION IN SECONDS>

Simpleperf Report

The report command is used to report profiling data generated by the record command. The following example assumes being executed following the previous simpleperf record ls command example:

# this example assumes and follows the run of the previous "simpleperf record ls" example
console:/sdcard # simpleperf report
Cmdline: /system/bin/simpleperf record ls
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 37
Event count: 25996499

Overhead  Command     Pid   Tid   Shared Object                                   Symbol
38.72%    ls          2145  2145  /system/lib64/libcrypto.so                      sha256_block_data_order
16.70%    ls          2145  2145  [kernel.kallsyms]                               invoke_syscall
7.56%     ls          2145  2145  [kernel.kallsyms]                               perf_output_end
6.67%     ls          2145  2145  /apex/com.android.runtime/bin/linker64          [linker]soinfo::lookup_version_info(VersionTracker const&, unsigned int, char const*, version_info const**)
5.23%     ls          2145  2145  [kernel.kallsyms]                               vm_area_free
4.64%     ls          2145  2145  /apex/com.android.runtime/bin/linker64          [linker]ElfReader::ReadDynamicSection()
3.45%     ls          2145  2145  [kernel.kallsyms]                               el0_da
3.20%     ls          2145  2145  /apex/com.android.runtime/lib64/bionic/libc.so  __aarch64_cas4_acq
3.16%     ls          2145  2145  [kernel.kallsyms]                               mt_find
3.13%     ls          2145  2145  [kernel.kallsyms]                               mas_wr_walk
2.97%     ls          2145  2145  [kernel.kallsyms]                               mas_next_node
2.91%     ls          2145  2145  [kernel.kallsyms]                               __rcu_read_unlock
1.56%     ls          2145  2145  [kernel.kallsyms]                               mas_destroy
0.10%     ls          2145  2145  [kernel.kallsyms]                               mas_walk
0.01%     ls          2145  2145  [kernel.kallsyms]                               __pte_alloc
0.00%     ls          2145  2145  [kernel.kallsyms]                               down_write_killable
0.00%     ls          2145  2145  [kernel.kallsyms]                               setup_new_exec
0.00%     simpleperf  2145  2145  [kernel.kallsyms]                               __rcu_read_lock
console:/sdcard #

Perf

Perf is a profiler tool for Linux based systems that abstracts CPU hardware differences in Linux performance measurements, while presenting a simple command-line interface.

More information on the tool can be found on link.

The Linux Kernel exposes several Performance Monitoring Unit (PMU) - CPU and non-CPU - events, as well as software and tracepoint events to user space via the perf_event_open system call, which is used by perf to collect and process the event’s data.

List of available events

A list of the different supported events can be obtained by running the command perf list. An example of the output provided by the command is presented below for reference:

# perf list
List of pre-defined events (to be used in -e or -M):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]

  alignment-faults                                   [Software event]
  bpf-output                                         [Software event]
  cgroup-switches                                    [Software event]
  context-switches OR cs                             [Software event]
  cpu-clock                                          [Software event]
  cpu-migrations OR migrations                       [Software event]
  dummy                                              [Software event]
  emulation-faults                                   [Software event]
  major-faults                                       [Software event]
  minor-faults                                       [Software event]
  page-faults OR faults                              [Software event]
  task-clock                                         [Software event]

  duration_time                                      [Tool event]
  user_time                                          [Tool event]
  system_time                                        [Tool event]

  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-loads                                    [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  branch-load-misses                                 [Hardware cache event]
  branch-loads                                       [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  iTLB-load-misses                                   [Hardware cache event]
  iTLB-loads                                         [Hardware cache event]
  br_immed_retired OR armv9_cortex_a520/br_immed_retired/ [Kernel PMU event]
  br_immed_retired OR armv9_cortex_a725/br_immed_retired/ [Kernel PMU event]
  br_immed_retired OR armv9_cortex_x925/br_immed_retired/ [Kernel PMU event]
  br_mis_pred OR armv9_cortex_a520/br_mis_pred/      [Kernel PMU event]
  br_mis_pred OR armv9_cortex_a725/br_mis_pred/      [Kernel PMU event]
  br_mis_pred OR armv9_cortex_x925/br_mis_pred/      [Kernel PMU event]
  (...)
  ttbr_write_retired OR armv9_cortex_a520/ttbr_write_retired/ [Kernel PMU event]
  ttbr_write_retired OR armv9_cortex_a725/ttbr_write_retired/ [Kernel PMU event]
  ttbr_write_retired OR armv9_cortex_x925/ttbr_write_retired/ [Kernel PMU event]
  unaligned_ldst_retired OR armv9_cortex_a520/unaligned_ldst_retired/ [Kernel PMU event]
  unaligned_ldst_retired OR armv9_cortex_a725/unaligned_ldst_retired/ [Kernel PMU event]
  arm_cspmu_0/cycles/                                [Kernel PMU event]
  arm_cspmu_1/cycles/                                [Kernel PMU event]
  arm_cspmu_2/cycles/                                [Kernel PMU event]
  arm_cspmu_3/cycles/                                [Kernel PMU event]
  arm_dsu_0/bus_access/                              [Kernel PMU event]
  arm_dsu_0/bus_cycles/                              [Kernel PMU event]
  arm_dsu_0/cycles/                                  [Kernel PMU event]
  arm_dsu_0/memory_error/                            [Kernel PMU event]
  arm_spe_0//                                        [Kernel PMU event]
  arm_spe_1//                                        [Kernel PMU event]
  cs_etm//                                           [Kernel PMU event]
  cs_etm/autofdo/                                    [Kernel PMU event]
  (...)
  alarmtimer:alarmtimer_cancel                       [Tracepoint event]
  alarmtimer:alarmtimer_fired                        [Tracepoint event]
  alarmtimer:alarmtimer_start                        [Tracepoint event]
  alarmtimer:alarmtimer_suspend                      [Tracepoint event]
  (...)

Note

The previous command may present its output in an unformatted way when running on the FVP. It may be desirable to instead redirect its output to a file and then list the contents of that file by running the following command sequence:

# perf list > perf_list.txt
# cat perf_list.txt

Perf Stat

The stat command can be used to get event counter values of the profiled processes. Some examples of its usage are presented following, as well as some considerations to take into account when considering TC3, with direct implications on the perf stat command.

Special considerations considering TC3 and implications on the perf stat command

TC3 defines per-microarchitecture PMU instances. As a result, the Kernel CPU PMU events will be displayed for each CPU micro-architecture during perf list, as illustrated on the following excerpt:

(...)
  cpu_cycles OR armv9_cortex_a520/cpu_cycles/          [Kernel PMU event]
  cpu_cycles OR armv9_cortex_a725/cpu_cycles/          [Kernel PMU event]
  cpu_cycles OR armv9_cortex_x925/cpu_cycles/          [Kernel PMU event]
(...)

When considering Kernel 6.1 and for situations where the perf command is executed as a task-bound (cpu==-1), the event is opened on an arbitrary CPU PMU and will only count on a subset of CPUs. This means, for example, that it might open on a “big” PMU and only count while the task is running on “big” CPUs, but not while the task is running on “little” CPUs. The following excerpt illustrates one such situation, where cycles are not counted for the command ls, as the command did execute on the CPUs whose PMU were not selected by perf to open the events:

# perf stat -e cycles -- ls
arm-ffa-tee.ko

 Performance counter stats for 'ls':

     <not counted>      cycles                                                        (0.00%)

       0.000509460 seconds time elapsed

       0.000044000 seconds user
       0.000000000 seconds sys
#

To overcome this implication and always ensure the retrieval of meaningful data, perf commands should be executed in one of two possible ways:

  1. providing to the perf command the individual CPU PMU events to count:

    # perf stat -e armv9_cortex_x925/cpu_cycles/,armv9_cortex_a725/cpu_cycles/,armv9_cortex_a520/cpu_cycles/ -- ls
    arm-ffa-tee.ko
    
     Performance counter stats for 'ls':
    
               1224520      armv9_cortex_x925/cpu_cycles/
         <not counted>      armv9_cortex_a725/cpu_cycles/                                     (0.00%)
         <not counted>      armv9_cortex_a520/cpu_cycles/                                     (0.00%)
    
           0.000970750 seconds time elapsed
    
           0.001074000 seconds user
           0.000000000 seconds sys
    #
    
  2. providing to the perf command a CPU mask so that the event is opened on all CPU PMUs:

    # perf stat -C 0-7 -e instructions,cycles -- ls
    arm-ffa-tee.ko
    
     Performance counter stats for 'CPU(s) 0-7':
    
               1768977      instructions                     #    1.00  insn per cycle
               1764843      cycles
    
           0.000740800 seconds time elapsed
    #
    

As can also be seen on the previous example, the instructions are not broken down to specific PMU type (CPU type). This might be ambiguous for users to read the result, as instructions/cycles on different CPUs do have different performance meaning.

This issue seems to have been fixed in newer Kernel versions (>=6.6) and when running the perf command with the default event names (without providing the CPU mask). Therefore, a possible solution could be to compile perf from newer source code, and copy the resulting binary into the rootfs before booting the image, or alternatively use the scp command to upload the binary to a booted system.

Additional perf stat command examples are illustrated following, where the -C 0-7 argument was used as a workaround for the above-mentioned issue (TC3 FVP has 8 CPUs):

## Single event
# perf stat -C 0-7 -e <EVENT> -- <WORKLOAD>

## Multiple events
# perf stat -C 0-7 -e <EVENT1>,<EVENT2>,...,<EVENT-N> -- <WORKLOAD>

## Event grouping
# perf stat -C 0-7 -e '{<EVENT1>,<EVENT2>,...,<EVENT-N>}' -- <WORKLOAD>

## Attaching to the existing process; 'sleep X' is passed to run perf for a specific duration
# perf stat -C 0-7 -e <EVENT> -p <PID> -- sleep 1

Note

DSU and MCN PMU driver do not support all possible events by name. For cases where data for a particular event is not visible, perf stat can be used with a raw event ID. Some examples of how to read the non-CPU PMU event counters are presented below (the values 0xa2 and 0x182 are obtained from the respective component TRM documentation):

# perf stat -e arm_dsu_0/cycles/,arm_dsu_0/memory_error/ -- sleep 0.01

 Performance counter stats for 'system wide':

                 2      arm_dsu_0/cycles/
                 0      arm_dsu_0/memory_error/

       0.010749870 seconds time elapsed
#

## Additional examples:
## Count DSU cache read refills
# perf stat -e arm_dsu_0/event=0xa2/ -- sleep 0.01
## Count MCN MCTL_write_req
# perf stat -e arm_cspmu_0/event=0x182/ -- sleep 0.01

Perf Record, Report and Annotate

Running perf record will collect and generate a perf.data file containing the sampling data of one or more events. This data can be later analysed using perf report or perf annotate commands. By default, the perf record uses cycles as a default event.

To modify the sampling period while running perf record, two approaches can be followed:

  1. frequency: specifies the average rate of samples/sec (-F option);

  2. count: enforces sampling at the specifies event period (-c option).

Some command examples illustrating this usage are presented below:

# Sample on event cycles at the default frequency
perf record -C 0-7 <WORKLOAD>

# Sample on event instructions at 1000 samples/sec
perf record -C 0-7 -e instructions -F 1000 <WORKLOAD>

# Sample on event instructions at every 2000 occurrences of event
perf record -C 0-7 -e instructions -c 2000 <WORKLOAD>

Perf and Arm SPE extension

The Arm Statistical Profiling Extension (SPE) feature provides a hardware assisted CPU operation profiling mechanism. This provides accurate attribution of latencies and events down to individual instructions.

The general perf record command usage with SPE on TC23 platform looks like:

perf record -e arm_spe_<spe_instance>/<CONFIG PARAMETERS>/ -- taskset -c <cpu_list> <WORKLOAD>

TC23 supports SPE only on Mid and Big CPUs and not on small CPUs, there are 2 SPE instances, arm_spe_0 for Mid CPUs (CPUs 2-5) and arm_spe_1 for big CPUs (CPUs 6-7). When workload needs to be analyzed using SPE, it should be bound to CPUs which have the SPE capability using taskset. So on TC23 platform workloads should be bound to CPUs 2-5 when using arm_spe_0 and workloads should be bound to CPUs 6-7 when using arm_spe_1. min_latency=0 config parameter is mandatory to provide with any perf-spe command.

The following listing illustrates how to record SPE samples on Mid CPUs with arm_spe_0:

# perf record -e arm_spe_0/min_latency=0/ -- taskset -c 2-5 ls
arm-tstee.ko   build_env.cfg  perf.data      perf.data.old
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.225 MB perf.data ]
#

The previously recorded data (perf.data) can then be analyzed using the perf report command as follows:

# perf report
Warning:
Please install libunwind or libdw development packages during the perf build.
Only instruction-based sampling period is currently supported by Arm SPE.
# To display the perf.data header info, please use --header/--header-only option
#
#
# Total Lost Samples: 0
#
# Samples: 853  of event 'l1d-access'
# Event count (approx.): 853
#
# Children      Self  Command  Shared Object          Symbol
# ........  ........  .......  .....................  ..........................
#
    31.42%    31.42%  ls       [kernel.kallsyms]      [k] __sanitizer_cov_trace_
    9.03%     9.03%  taskset  [kernel.kallsyms]      [k] __sanitizer_cov_trace_
    1.88%     1.88%  ls       [kernel.kallsyms]      [k] next_uptodate_page
    1.52%     1.52%  ls       [kernel.kallsyms]      [k] do_set_pte
    1.41%     1.41%  ls       [kernel.kallsyms]      [k] search_cmp_ftr_reg
    1.41%     1.41%  ls       [kernel.kallsyms]      [k] unmap_page_range
    1.06%     1.06%  ls       [kernel.kallsyms]      [k] __pi_copy_page
    1.06%     1.06%  ls       [kernel.kallsyms]      [k] __rcu_read_unlock
(...)

The previous example only specify min_latency=0 required config parameter. However, there can be situations where making usage of the other config parameters may help to filter profiling information. Complementing the previous example, let’s assume it would be desirable to make usage of the config parameter event_filter=2, which discards all samples which do not have retired instructions events. The following command listing illustrates the command usage considering this scenario:

# perf record -e arm_spe_0/min_latency=0,event_filter=2/ -- taskset -c 2 ls
arm-tstee.ko   build_env.cfg  perf.data      perf.data.old
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.225 MB perf.data ]
#

Detailed information regarding the config parameters can be found at link.

Kernel config and prerequisites for enabling Arm SPE can also be found in the Kernel documentation.

Perfetto

Perfetto is an open-source stack developed for performance instrumentation and trace analysis. It offers services and libraries for recording system-level and app-level traces, native and Java heap profiling, a library for analysing traces using SQL, and a web-based user interface (UI) that allows to visualize and explore the collected traces.

It has support for ftrace, atrace, /proc/{stat,vmstat,pid}/* and perf_event as data sources to collect system level traces. A data source can be seen as the capability, exposed by a producer, of providing some tracing data. The producer is an entity that offers the ability to contribute to the trace, advertising this ability with one or more data sources. A consumer is an entity that controls the tracing service, provides to the tracing service the trace configuration and reads back the trace buffers. The tracing service is a long-lived entity (i.e. a system daemon on Linux/Android) which handles the tracing sessions, routes trace configuration from consumer to producers, and manages trace buffers.

The data source defines its own schema (a protobuf) consisting of data source trace config (what kind of input config it would expect from the consumer) and trace packets (what kind of data it would output into the trace).

Some examples of data sources advertised by different producers to collect system-level traces are listed below:
  • linux.process_stats

  • linux.ftrace

  • linux.sys_stats

  • linux.perf

Recording and Visualising Traces with Perfetto

Perfetto can record traces, using either the UI (available at https://ui.perfetto.dev/#!/record) or the command line. An example of how perfetto can be used to collect traces using the command line is present below:

# the following commands are intended to be run on the host PC;
# only applicable for the following command, the current path is assumed to be <TC_WORKSPACE>
export PATH="$(pwd)/src/android/out/host/linux-x86/bin:${PATH}"
adb connect localhost:<PORT>
adb devices
adb -s localhost:<PORT> push config.txt /data/local/tmp/config.txt
adb -s localhost:<PORT> shell perfetto -o /data/misc/perfetto-traces/trace_file.perfetto-trace --txt -c /data/local/tmp/config.txt
adb -s localhost:<PORT> pull /data/misc/perfetto-traces/trace_file.perfetto-trace ./
Some complementing considerations regarding the previous presented command listing:
  • the use of -s localhost:<PORT> can be ignored if there is only one ADB instance available for debug to the host;

  • the default ADB port is 5555; however, in cases where there are more than one ADB instance available for debug to the host, the port may change; in these situations refer to the output of the adb devices command or to the FVP-model start up log information to understand which port was assigned as replacement; the ADB connection on Android section provides additional information that can be useful to troubleshoot the connection;

  • config.txt contains the perfetto trace config; some examples of this config will be presented in the Trace config examples subsection.

Once the perfetto trace file is collected and downloaded to the host, it can be loaded into the perfetto UI (available at https://ui.perfetto.dev/) using the option “open trace file”, as illustrated on the following image:

../../_images/perfetto_load_trace_file.png

Detailed information regarding perfetto can be found on the official documentation available at https://perfetto.dev/docs/.

Trace config examples

This subsection provides three trace config examples that can be used to control the tracing service and influence the sampled data on the TC3 platform. Alongside to each trace configuration, examples of the visualisation of the respective captured trace data using the Perfetto UI are also included for reference.

Additional examples of data source trace configurations for different supported data sources can be found at https://perfetto.dev/docs/ (please refer to the “Data sources” section). Some additional config examples can be found in test/configs/ directory in perfetto source code.

A full list of supported ftrace events can be found in file protos/perfetto/trace/ftrace/ftrace_event.proto in perfetto source code.

A full list of supported meminfo and vmstat counters can be found in file protos/perfetto/common/sys_stats_counters.proto in perfetto source code.

Example 1: collect ftrace scheduling events, process stats and system stats counters every 1000ms:

Trace configuration file:

buffers {
  size_kb: 16384
  fill_policy: RING_BUFFER
}

buffers {
  size_kb: 16384
  fill_policy: RING_BUFFER
}

data_sources {
  config {
    name: "linux.ftrace"
    target_buffer: 0
    ftrace_config {
      # Scheduling information and process tracking. Useful for:
      # - what is happening on each CPU at each moment
      # - why a thread was de-scheduled
      # - parent/child relationships between processes and threads.
      ftrace_events: "sched/sched_switch"
      ftrace_events: "power/suspend_resume"
      ftrace_events: "sched/sched_process_exit"
      ftrace_events: "sched/sched_process_free"
      ftrace_events: "task/task_newtask"
      ftrace_events: "task/task_rename"

      # Wakeup info. Allows to compute how long a task was
      # blocked due to CPU contention.
      ftrace_events: "sched/sched_wakeup"

      # os.Trace markers:
      ftrace_events: "ftrace/print"
      # RSS and ION buffer events:
      ftrace_events: "mm_event/mm_event_record"
      ftrace_events: "kmem/rss_stat"
      ftrace_events: "kmem/ion_heap_grow"
      ftrace_events: "kmem/ion_heap_shrink"
    }
  }
}

data_sources {
  config {
    name: "linux.sys_stats"
    target_buffer: 1
    sys_stats_config {
      meminfo_period_ms: 100
      meminfo_counters: MEMINFO_MEM_AVAILABLE
      meminfo_counters: MEMINFO_BUFFERS
      meminfo_counters: MEMINFO_CACHED
      meminfo_counters: MEMINFO_SWAP_CACHED
      meminfo_counters: MEMINFO_ACTIVE
      meminfo_counters: MEMINFO_INACTIVE
      meminfo_counters: MEMINFO_ACTIVE_ANON
      meminfo_counters: MEMINFO_INACTIVE_ANON
      meminfo_counters: MEMINFO_ACTIVE_FILE
      meminfo_counters: MEMINFO_INACTIVE_FILE
      meminfo_counters: MEMINFO_UNEVICTABLE

      vmstat_period_ms: 100
      vmstat_counters: VMSTAT_NR_FREE_PAGES
      vmstat_counters: VMSTAT_NR_ALLOC_BATCH
      vmstat_counters: VMSTAT_NR_INACTIVE_ANON
      vmstat_counters: VMSTAT_NR_VMSCAN_WRITE
      vmstat_counters: VMSTAT_NR_VMSCAN_IMMEDIATE_RECLAIM
      vmstat_counters: VMSTAT_NR_WRITEBACK_TEMP

      stat_period_ms: 100
      stat_counters: STAT_CPU_TIMES
      stat_counters: STAT_IRQ_COUNTS
      stat_counters: STAT_FORK_COUNT
    }
  }
}

data_sources: {
    config {
        name: "linux.process_stats"
        target_buffer: 0
        process_stats_config {
            scan_all_processes_on_start: true
            proc_stats_poll_ms: 1000
        }
    }
}

duration_ms: 1000

Perfetto UI visualisation:

Example 1 - collect ``ftrace`` scheduling events, process stats and system stats counters every 1000ms


Example 2: collect cpu_cycles and instructions CPU PMU counters on all CPUs:

Trace configuration file:

buffers {
  size_kb: 10240
  fill_policy: RING_BUFFER
}

data_sources {
  config {
    name: "linux.perf"
    target_buffer: 0
    perf_event_config {
      all_cpus: true
      timebase {
        frequency: 99
        counter: HW_CPU_CYCLES
        timestamp_clock: PERF_CLOCK_MONOTONIC
      }
    }
  }
}

data_sources {
  config {
    name: "linux.perf"
    target_buffer: 0
    perf_event_config {
      all_cpus: true
      timebase {
        frequency: 99
        counter: HW_INSTRUCTIONS
        timestamp_clock: PERF_CLOCK_MONOTONIC
      }
    }
  }
}

duration_ms: 1000

Perfetto UI visualisation:

Example 2 - collect cpu_cycles and instructions CPU PMU counters on all CPUs


Example 3: call stack sampling of processes:

Trace configuration file:

buffers {
  size_kb: 10240
  fill_policy: RING_BUFFER
}

data_sources {
  config {
    name: "linux.perf"
    target_buffer: 0
    perf_event_config {
      timebase {
        frequency: 99
        timestamp_clock: PERF_CLOCK_MONOTONIC
      }
      callstack_sampling {
        kernel_frames: true
      }
    }
  }
}

duration_ms: 1000

Perfetto UI visualisation:

Example 3 - call stack sampling of processes

Copyright (c) 2022-2024, Arm Limited. All rights reserved.