summaryrefslogtreecommitdiff
path: root/docs/perf
diff options
context:
space:
mode:
authorPaul Beesley <paul.beesley@arm.com>2019-02-11 17:54:45 +0000
committerPaul Beesley <paul.beesley@arm.com>2019-05-21 15:05:56 +0100
commit40d553cfde38d4f68449c62967cd1ce0d6478750 (patch)
tree1fafd4701066cdf0e5fb15aee2d842279a67b611 /docs/perf
parent12b67439e93a78a4b756e987e1bd1b6e22cc4bf8 (diff)
doc: Move documents into subdirectories
This change creates the following directories under docs/ in order to provide a grouping for the content: - components - design - getting_started - perf - process In each of these directories an index.rst file is created and this serves as an index / landing page for each of the groups when the pages are compiled. Proper layout of the top-level table of contents relies on this directory/index structure. Without this patch it is possible to build the documents correctly with Sphinx but the output looks messy because there is no overall hierarchy. Change-Id: I3c9f4443ec98571a56a6edf775f2c8d74d7f429f Signed-off-by: Paul Beesley <paul.beesley@arm.com>
Diffstat (limited to 'docs/perf')
-rw-r--r--docs/perf/index.rst9
-rw-r--r--docs/perf/psci-performance-juno.rst288
2 files changed, 297 insertions, 0 deletions
diff --git a/docs/perf/index.rst b/docs/perf/index.rst
new file mode 100644
index 00000000..50833b8c
--- /dev/null
+++ b/docs/perf/index.rst
@@ -0,0 +1,9 @@
+Performance & Testing
+=====================
+
+.. toctree::
+ :maxdepth: 1
+ :caption: Contents
+ :numbered:
+
+ psci-performance-juno
diff --git a/docs/perf/psci-performance-juno.rst b/docs/perf/psci-performance-juno.rst
new file mode 100644
index 00000000..caed8bf9
--- /dev/null
+++ b/docs/perf/psci-performance-juno.rst
@@ -0,0 +1,288 @@
+PSCI Performance Measurements on Arm Juno Development Platform
+==============================================================
+
+This document summarises the findings of performance measurements of key
+operations in the ARM Trusted Firmware (TF) Power State Coordination Interface
+(PSCI) implementation, using the in-built Performance Measurement Framework
+(PMF) and runtime instrumentation timestamps.
+
+Method
+------
+
+We used the `Juno R1 platform`_ for these tests, which has 4 x Cortex-A53 and 2
+x Cortex-A57 clusters running at the following frequencies:
+
++-----------------+--------------------+
+| Domain | Frequency (MHz) |
++=================+====================+
+| Cortex-A57 | 900 (nominal) |
++-----------------+--------------------+
+| Cortex-A53 | 650 (underdrive) |
++-----------------+--------------------+
+| AXI subsystem | 533 |
++-----------------+--------------------+
+
+Juno supports CPU, cluster and system power down states, corresponding to power
+levels 0, 1 and 2 respectively. It does not support any retention states.
+
+We used the upstream `TF master as of 31/01/2017`_, building the platform using
+the ``ENABLE_RUNTIME_INSTRUMENTATION`` option:
+
+::
+
+ make PLAT=juno ENABLE_RUNTIME_INSTRUMENTATION=1 \
+ SCP_BL2=<path/to/scp-fw.bin> \
+ BL33=<path/to/test-fw.bin> \
+ all fip
+
+When using the debug build of TF, there was no noticeable difference in the
+results.
+
+The tests are based on an ARM-internal test framework. The release build of this
+framework was used because the results in the debug build became skewed; the
+console output prevented some of the tests from executing in parallel.
+
+The tests consist of both parallel and sequential tests, which are broadly
+described as follows:
+
+- **Parallel Tests** This type of test powers on all the non-lead CPUs and
+ brings them and the lead CPU to a common synchronization point. The lead CPU
+ then initiates the test on all CPUs in parallel.
+
+- **Sequential Tests** This type of test powers on each non-lead CPU in
+ sequence. The lead CPU initiates the test on a non-lead CPU then waits for the
+ test to complete before proceeding to the next non-lead CPU. The lead CPU then
+ executes the test on itself.
+
+In the results below, CPUs 0-3 refer to CPUs in the little cluster (A53) and
+CPUs 4-5 refer to CPUs in the big cluster (A57). In all cases CPU 4 is the lead
+CPU.
+
+``PSCI_ENTRY`` refers to the time taken from entering the TF PSCI implementation
+to the point the hardware enters the low power state (WFI). Referring to the TF
+runtime instrumentation points, this corresponds to:
+``(RT_INSTR_ENTER_HW_LOW_PWR - RT_INSTR_ENTER_PSCI)``.
+
+``PSCI_EXIT`` refers to the time taken from the point the hardware exits the low
+power state to exiting the TF PSCI implementation. This corresponds to:
+``(RT_INSTR_EXIT_PSCI - RT_INSTR_EXIT_HW_LOW_PWR)``.
+
+``CFLUSH_OVERHEAD`` refers to the part of ``PSCI_ENTRY`` taken to flush the
+caches. This corresponds to: ``(RT_INSTR_EXIT_CFLUSH - RT_INSTR_ENTER_CFLUSH)``.
+
+Note there is very little variance observed in the values given (~1us), although
+the values for each CPU are sometimes interchanged, depending on the order in
+which locks are acquired. Also, there is very little variance observed between
+executing the tests sequentially in a single boot or rebooting between tests.
+
+Given that runtime instrumentation using PMF is invasive, there is a small
+(unquantified) overhead on the results. PMF uses the generic counter for
+timestamps, which runs at 50MHz on Juno.
+
+Results and Commentary
+----------------------
+
+``CPU_SUSPEND`` to deepest power level on all CPUs in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------+---------------------+--------------------+--------------------------+
+| CPU | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0 | 27 | 20 | 5 |
++-------+---------------------+--------------------+--------------------------+
+| 1 | 114 | 86 | 5 |
++-------+---------------------+--------------------+--------------------------+
+| 2 | 202 | 58 | 5 |
++-------+---------------------+--------------------+--------------------------+
+| 3 | 375 | 29 | 94 |
++-------+---------------------+--------------------+--------------------------+
+| 4 | 20 | 22 | 6 |
++-------+---------------------+--------------------+--------------------------+
+| 5 | 290 | 18 | 206 |
++-------+---------------------+--------------------+--------------------------+
+
+A large variance in ``PSCI_ENTRY`` and ``PSCI_EXIT`` times across CPUs is
+observed due to TF PSCI lock contention. In the worst case, CPU 3 has to wait
+for the 3 other CPUs in the cluster (0-2) to complete ``PSCI_ENTRY`` and release
+the lock before proceeding.
+
+The ``CFLUSH_OVERHEAD`` times for CPUs 3 and 5 are higher because they are the
+last CPUs in their respective clusters to power down, therefore both the L1 and
+L2 caches are flushed.
+
+The ``CFLUSH_OVERHEAD`` time for CPU 5 is a lot larger than that for CPU 3
+because the L2 cache size for the big cluster is lot larger (2MB) compared to
+the little cluster (1MB).
+
+``CPU_SUSPEND`` to power level 0 on all CPUs in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------+---------------------+--------------------+--------------------------+
+| CPU | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0 | 116 | 14 | 8 |
++-------+---------------------+--------------------+--------------------------+
+| 1 | 204 | 14 | 8 |
++-------+---------------------+--------------------+--------------------------+
+| 2 | 287 | 13 | 8 |
++-------+---------------------+--------------------+--------------------------+
+| 3 | 376 | 13 | 9 |
++-------+---------------------+--------------------+--------------------------+
+| 4 | 29 | 15 | 7 |
++-------+---------------------+--------------------+--------------------------+
+| 5 | 21 | 15 | 8 |
++-------+---------------------+--------------------+--------------------------+
+
+There is no lock contention in TF generic code at power level 0 but the large
+variance in ``PSCI_ENTRY`` times across CPUs is due to lock contention in Juno
+platform code. The platform lock is used to mediate access to a single SCP
+communication channel. This is compounded by the SCP firmware waiting for each
+AP CPU to enter WFI before making the channel available to other CPUs, which
+effectively serializes the SCP power down commands from all CPUs.
+
+On platforms with a more efficient CPU power down mechanism, it should be
+possible to make the ``PSCI_ENTRY`` times smaller and consistent.
+
+The ``PSCI_EXIT`` times are consistent across all CPUs because TF does not
+require locks at power level 0.
+
+The ``CFLUSH_OVERHEAD`` times for all CPUs are small and consistent since only
+the cache associated with power level 0 is flushed (L1).
+
+``CPU_SUSPEND`` to deepest power level on all CPUs in sequence
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------+---------------------+--------------------+--------------------------+
+| CPU | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0 | 114 | 20 | 94 |
++-------+---------------------+--------------------+--------------------------+
+| 1 | 114 | 20 | 94 |
++-------+---------------------+--------------------+--------------------------+
+| 2 | 114 | 20 | 94 |
++-------+---------------------+--------------------+--------------------------+
+| 3 | 114 | 20 | 94 |
++-------+---------------------+--------------------+--------------------------+
+| 4 | 195 | 22 | 180 |
++-------+---------------------+--------------------+--------------------------+
+| 5 | 21 | 17 | 6 |
++-------+---------------------+--------------------+--------------------------+
+
+The ``CLUSH_OVERHEAD`` times for lead CPU 4 and all CPUs in the non-lead cluster
+are large because all other CPUs in the cluster are powered down during the
+test. The ``CPU_SUSPEND`` call powers down to the cluster level, requiring a
+flush of both L1 and L2 caches.
+
+The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
+CPUs because the L2 cache size for the big cluster is lot larger (2MB) compared
+to the little cluster (1MB).
+
+The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are low because lead
+CPU 4 continues to run while CPU 5 is suspended. Hence CPU 5 only powers down to
+level 0, which only requires L1 cache flush.
+
+``CPU_SUSPEND`` to power level 0 on all CPUs in sequence
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
++-------+---------------------+--------------------+--------------------------+
+| CPU | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0 | 22 | 14 | 5 |
++-------+---------------------+--------------------+--------------------------+
+| 1 | 22 | 14 | 5 |
++-------+---------------------+--------------------+--------------------------+
+| 2 | 21 | 14 | 5 |
++-------+---------------------+--------------------+--------------------------+
+| 3 | 22 | 14 | 5 |
++-------+---------------------+--------------------+--------------------------+
+| 4 | 17 | 14 | 6 |
++-------+---------------------+--------------------+--------------------------+
+| 5 | 18 | 15 | 6 |
++-------+---------------------+--------------------+--------------------------+
+
+Here the times are small and consistent since there is no contention and it is
+only necessary to flush the cache to power level 0 (L1). This is the best case
+scenario.
+
+The ``PSCI_ENTRY`` times for CPUs in the big cluster are slightly smaller than
+for the CPUs in little cluster due to greater CPU performance.
+
+The ``PSCI_EXIT`` times are generally lower than in the last test because the
+cluster remains powered on throughout the test and there is less code to execute
+on power on (for example, no need to enter CCI coherency)
+
+``CPU_OFF`` on all non-lead CPUs in sequence then ``CPU_SUSPEND`` on lead CPU to deepest power level
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The test sequence here is as follows:
+
+1. Call ``CPU_ON`` and ``CPU_OFF`` on each non-lead CPU in sequence.
+
+2. Program wake up timer and suspend the lead CPU to the deepest power level.
+
+3. Call ``CPU_ON`` on non-lead CPU to get the timestamps from each CPU.
+
++-------+---------------------+--------------------+--------------------------+
+| CPU | ``PSCI_ENTRY`` (us) | ``PSCI_EXIT`` (us) | ``CFLUSH_OVERHEAD`` (us) |
++=======+=====================+====================+==========================+
+| 0 | 110 | 28 | 93 |
++-------+---------------------+--------------------+--------------------------+
+| 1 | 110 | 28 | 93 |
++-------+---------------------+--------------------+--------------------------+
+| 2 | 110 | 28 | 93 |
++-------+---------------------+--------------------+--------------------------+
+| 3 | 111 | 28 | 93 |
++-------+---------------------+--------------------+--------------------------+
+| 4 | 195 | 22 | 181 |
++-------+---------------------+--------------------+--------------------------+
+| 5 | 20 | 23 | 6 |
++-------+---------------------+--------------------+--------------------------+
+
+The ``CFLUSH_OVERHEAD`` times for all little CPUs are large because all other
+CPUs in that cluster are powerered down during the test. The ``CPU_OFF`` call
+powers down to the cluster level, requiring a flush of both L1 and L2 caches.
+
+The ``PSCI_ENTRY`` and ``CFLUSH_OVERHEAD`` times for CPU 5 are small because
+lead CPU 4 is running and CPU 5 only powers down to level 0, which only requires
+an L1 cache flush.
+
+The ``CFLUSH_OVERHEAD`` time for CPU 4 is a lot larger than those for the little
+CPUs because the L2 cache size for the big cluster is lot larger (2MB) compared
+to the little cluster (1MB).
+
+The ``PSCI_EXIT`` times for CPUs in the big cluster are slightly smaller than
+for CPUs in the little cluster due to greater CPU performance. These times
+generally are greater than the ``PSCI_EXIT`` times in the ``CPU_SUSPEND`` tests
+because there is more code to execute in the "on finisher" compared to the
+"suspend finisher" (for example, GIC redistributor register programming).
+
+``PSCI_VERSION`` on all CPUs in parallel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since very little code is associated with ``PSCI_VERSION``, this test
+approximates the round trip latency for handling a fast SMC at EL3 in TF.
+
++-------+-------------------+
+| CPU | TOTAL TIME (ns) |
++=======+===================+
+| 0 | 3020 |
++-------+-------------------+
+| 1 | 2940 |
++-------+-------------------+
+| 2 | 2980 |
++-------+-------------------+
+| 3 | 3060 |
++-------+-------------------+
+| 4 | 520 |
++-------+-------------------+
+| 5 | 720 |
++-------+-------------------+
+
+The times for the big CPUs are less than the little CPUs due to greater CPU
+performance.
+
+We suspect the time for lead CPU 4 is shorter than CPU 5 due to subtle cache
+effects, given that these measurements are at the nano-second level.
+
+.. _Juno R1 platform: https://www.arm.com/files/pdf/Juno_r1_ARM_Dev_datasheet.pdf
+.. _TF master as of 31/01/2017: https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/?id=c38b36d