In-Depth: Understanding the Cortex A53 on Mobile SoCs

To start off, ARM Holdings is a chip design company owned by SoftBank Group. ARM originally stood for Acorn RISC Machine and later for Advanced RISC Machines, where RISC means Reduced Instruction Set Computer. RISC is something that people in the computer science domain should definitely be aware of, but for the uninitiated, a RISC processor is designed around a smaller set of simpler instruction types so that each instruction can execute quickly. Processor throughput is sometimes quoted in MIPS (million instructions per second). Today, we take a look at the Cortex-A53.

ARM creates processor designs and licenses them to other manufacturers, who then use these designs to make the CPUs that power our smartphones. In fact, ARM designs are used by almost all the mobile chip makers in the world. Some of these include:

  • Qualcomm (Snapdragon 4xx, 6xx series)
  • Samsung Exynos (Exynos 75xx and 78xx series)
  • Broadcom (BCM2837, used in the Raspberry Pi 3)
  • MediaTek (MT673x, MT675x, MT6795, MT873x, MT8752, MT8163)
  • HiSilicon Kirin series

The focus of our discussion today is the Cortex-A53, which has already been succeeded by the Cortex-A55. More powerful cores such as the Cortex-A72, or custom cores in the same class, power high-end devices such as the Samsung Galaxy S8. One notable exception here is the iPhone: Apple licenses the ARM instruction set but designs its own cores rather than using ARM's reference designs.

A NanoPi, which uses an Allwinner processor based on the Cortex-A53

Design

In a typical configuration, the ARM Cortex-A53 runs at 1.5 GHz with an eight-stage pipeline and executes the ARMv8 instruction set. It uses dynamic multiple issue, with up to two instructions issued per clock cycle. It is nonetheless a static in-order pipeline, in that instructions issue, execute, and commit in order. The pipeline consists of three sections: instruction fetch, instruction decode, and execute.

Cortex A53 block design (Photo: ARM website)

As far as headline specifications go, a typical configuration has a clock rate of 1.5 GHz and four cores per chip, with the core count being configurable. Each core supports floating point, has eight pipeline stages, and uses a hybrid branch prediction technique.

Pipelining: How it handles many, many calculations

The advantage of having multiple cores in a processing unit is that multiple tasks can run in parallel on the processor. Within each core, instruction throughput is improved by pipelining. Pipelining is a form of computer organization in which successive steps of an instruction sequence are executed in turn by a sequence of modules able to operate concurrently, so that another instruction can begin before the previous one has finished. This, in turn, gives rise to data hazards. Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined machine.
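To make the idea of overlapping instructions and the resulting data hazards a little more concrete, here is a minimal Python sketch. It is a toy model, not a description of the A53's actual hardware: the `Instr` type, the register names, and the `window` parameter are all hypothetical, and the cycle counts assume an ideal single-issue pipeline with no stalls. It only looks for read-after-write dependencies, which are the kind that matter most for an in-order pipeline.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dest: str     # register written by the instruction (hypothetical names)
    srcs: tuple   # registers read by the instruction


def unpipelined_cycles(n_instr, n_stages):
    # Each instruction passes through every stage before the next one starts.
    return n_instr * n_stages


def ideal_pipelined_cycles(n_instr, n_stages):
    # The first instruction needs n_stages cycles to drain; after that, one
    # instruction completes per cycle (assuming no stalls at all).
    if n_instr == 0:
        return 0
    return n_stages + n_instr - 1


def raw_hazards(program, window=1):
    """Find read-after-write dependencies between nearby instructions.

    Returns (producer_index, consumer_index, register) triples where an
    instruction reads a register written by one of the previous `window`
    instructions. `window` is an assumed stand-in for how close two
    instructions must be before the dependency matters.
    """
    hazards = []
    for i, instr in enumerate(program):
        for j in range(max(0, i - window), i):
            if program[j].dest in instr.srcs:
                hazards.append((j, i, program[j].dest))
    return hazards


program = [
    Instr("x1", ("x2", "x3")),  # x1 = x2 + x3
    Instr("x4", ("x1", "x5")),  # x4 = x1 + x5, needs x1 from the line above
    Instr("x6", ("x7", "x8")),  # independent of both
]
print(raw_hazards(program))
print(unpipelined_cycles(100, 8), ideal_pipelined_cycles(100, 8))
```

In this toy program the second instruction depends on the first, so a pipeline (or a compiler scheduling for it) would ideally place the independent third instruction between them.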

Pipelining in the A53 (ARM, 2012)

The Cortex-A53 processor supports the PLD and PRFM prefetch hint instructions. PLD and PRFM instructions look up the cache and start a linefill if they miss and target a cacheable address. However, the PLD or PRFM instruction retires as soon as its linefill has started, rather than waiting for the data to be returned.

The Cortex-A53 brings to entry-level devices the kind of performance that was previously enjoyed only by high-end flagship mobile devices, within a lower power budget and at a lower cost.

A comparison between performances in big.LITTLE implementation

Three structures control what is loaded into the pipeline: a hybrid predictor, an indirect predictor, and a return stack. The first three stages fetch two instructions at a time and try to keep a 13-entry instruction queue full.

To help keep the queue full, there is a 6K-bit conditional branch predictor along with an eight-entry return address stack, which predicts the target addresses of upcoming function returns. If a branch prediction turns out to be wrong, the pipeline is emptied. This causes an eight-clock-cycle penalty, since the instructions fetched along the wrong path have to be discarded and the correct ones fetched instead.
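As a rough illustration of how a small return address stack behaves, and of what an eight-cycle penalty adds up to, here is a sketch in Python. The class and function names are hypothetical, and the overflow policy (silently dropping the oldest entry when the stack is full) is an assumption for illustration; it is not taken from ARM documentation.

```python
from collections import deque


class ReturnAddressStack:
    """Toy return address stack; drops the oldest entry on overflow (assumed)."""

    def __init__(self, entries=8):
        self._stack = deque(maxlen=entries)

    def on_call(self, return_addr):
        # A call pushes the address the matching return should jump back to.
        self._stack.append(return_addr)

    def predict_return(self):
        # A return pops the most recent entry; None means "no prediction".
        return self._stack.pop() if self._stack else None


def correct_return_predictions(call_depth, entries=8):
    """Nest `call_depth` calls, then unwind, counting correct predictions."""
    ras = ReturnAddressStack(entries)
    return_addrs = [0x1000 + 4 * i for i in range(call_depth)]
    for addr in return_addrs:
        ras.on_call(addr)
    return sum(1 for addr in reversed(return_addrs)
               if ras.predict_return() == addr)


def misprediction_cycles(n_branches, mispredict_rate, penalty=8):
    # Extra cycles lost to flushes: every mispredicted branch costs `penalty`.
    return n_branches * mispredict_rate * penalty


print(correct_return_predictions(8), correct_return_predictions(10))
print(misprediction_cycles(1000, 0.05))
```

With eight entries, call chains up to eight deep unwind with every return predicted; in this model, deeper chains lose the outermost return addresses, and each of those lost predictions would cost the eight-cycle flush described above.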

Given the static pipeline of the Cortex-A53, it is up to the compiler to try to avoid structural hazards and data dependencies.

Dealing with memory and data

The biggest problem that occurs when trying to execute multiple instructions per clock cycle is supplying data from the cache fast enough. To deal with this problem, the cache is broken into smaller banks that allow multiple, independent, parallel accesses, provided the accesses are to different banks. Another technique is to use a nonblocking cache, which comes in two flavors: “hit under miss” and “miss under miss”. Hit under miss allows additional cache hits during a miss, while miss under miss allows multiple outstanding cache misses.
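The difference between these schemes is easier to see with numbers, so here is a small Python sketch. Everything in it is an assumption made for illustration: the 64-byte line size, the four banks, the 10-cycle miss latency, and the simplification that no access depends on data from an earlier miss. It does not model the A53's real cache.

```python
def bank_of(addr, n_banks=4, line_bytes=64):
    # Consecutive cache lines are interleaved across the banks (assumed layout).
    return (addr // line_bytes) % n_banks


def can_access_in_parallel(addr_a, addr_b, n_banks=4, line_bytes=64):
    # Two accesses can proceed together only if they target different banks.
    return bank_of(addr_a, n_banks, line_bytes) != bank_of(addr_b, n_banks, line_bytes)


def total_cycles(accesses, miss_latency, max_outstanding):
    """Cycles needed to finish a sequence of independent cache accesses.

    accesses:        list of booleans, True for a hit and False for a miss.
    max_outstanding: 0 models a blocking cache, 1 models hit under miss,
                     and larger values model miss under miss.
    """
    cycle = 0
    fills = []  # completion times of the misses still being serviced
    for is_hit in accesses:
        fills = [t for t in fills if t > cycle]
        if is_hit:
            cycle += 1
        elif max_outstanding == 0:
            cycle += miss_latency          # everything stalls until the fill ends
        else:
            if len(fills) >= max_outstanding:
                cycle = min(fills)         # wait for a miss slot to free up
                fills = [t for t in fills if t > cycle]
            fills.append(cycle + miss_latency)
            cycle += 1                     # the pipeline moves on immediately
    return max([cycle] + fills)


HIT, MISS = True, False
pattern = [MISS, HIT, HIT, MISS, HIT]
for limit in (0, 1, 2):
    print(limit, total_cycles(pattern, miss_latency=10, max_outstanding=limit))
print(can_access_in_parallel(0x000, 0x040), can_access_in_parallel(0x000, 0x100))
```

In this model the same access pattern gets progressively cheaper as more misses are allowed to overlap, which is the whole point of a nonblocking cache.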

A slide from a course on Computer Architecture – Lihu Rappoport and Adi Yoaz

Implementing this is not simple: you need a high-bandwidth memory system that is capable of handling multiple misses in parallel. Applications with a larger memory footprint tend to have higher miss rates in the L1 and L2 caches of the A53.

How is it sold?

ARM delivers its reference designs as intellectual property (IP) cores. Unlike Intel’s Core series of processors or Apple’s A series, these designs can be licensed by other chip makers. These cores have become dominant in the industry: billions of chips inside personal devices, embedded systems, and other computing devices are based on ARM reference designs.

These cores can actually be incorporated into other logic (hence it is the “core” of a chip), including application-specific processors (such as an encoder or decoder for video), I/O interfaces, and memory interfaces, and then fabricated to yield a processor optimized for a particular application. Although the processor core is almost identical logically, the resultant chips have many differences. One parameter is the size of the L2 cache, which can vary by a factor of 16.

ARM Design Review Process

Program design and memory

One of the things that I sometimes hear is that programmers can ignore memory hierarchies when writing code. This is a misconception: programmers can easily double the performance of their code, or more, if they take into account the memory behavior of the systems that run their algorithms. Similarly, a lot of people think that the OS is the best place to schedule disk accesses. However, the best that an OS can do is sort the logical addresses in increasing order, whereas the disk itself knows the actual mapping of the logical addresses and can reduce seek latencies by rescheduling.
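Here is a small Python sketch of why access order matters. It counts misses in a toy direct-mapped cache for two ways of walking the same matrix. The cache geometry (64 lines of 64 bytes), the 8-byte element size, and the 128 x 128 matrix are assumptions chosen for illustration, not the A53's real parameters, and the function names are hypothetical.

```python
def count_misses(addresses, n_lines=64, line_bytes=64):
    """Count misses in a toy direct-mapped cache for a sequence of byte addresses."""
    tags = [None] * n_lines
    misses = 0
    for addr in addresses:
        block = addr // line_bytes   # which memory line the address falls in
        index = block % n_lines      # the single cache slot that line can use
        if tags[index] != block:
            tags[index] = block      # evict whatever was there and fill
            misses += 1
    return misses


def row_major_walk(rows, cols, elem_bytes=8):
    # Visit elements in the order they are laid out in memory.
    return [(r * cols + c) * elem_bytes for r in range(rows) for c in range(cols)]


def column_major_walk(rows, cols, elem_bytes=8):
    # Visit the same row-major array one column at a time (large stride).
    return [(r * cols + c) * elem_bytes for c in range(cols) for r in range(rows)]


rows = cols = 128
print("row-major misses:   ", count_misses(row_major_walk(rows, cols)))
print("column-major misses:", count_misses(column_major_walk(rows, cols)))
```

Both loops touch exactly the same elements, but in this model the column-by-column walk misses on every access while the row-by-row walk misses only once per cache line. That is the kind of gap a programmer can close just by changing loop order.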

Conclusion

The Cortex-A53 is ideal for standalone use, delivering excellent performance at very low power and area, which enables new features to be supported in the low-cost smartphone segment. In fact, Brian Jeff of ARM wrote about this extensively in his 2013 post, “Who you callin’ LITTLE”. It’s hardly surprising to see the popularity of this architecture. In 2012, we were giving props to ARM for innovation and talking about how x86 was stagnating.

Here we are, today. Innovation always pays off in the long run. When this architecture was unveiled, we were building off a 32 nm process, and only a few years later, we now work on a 14 nm process in our fabs.

References:

  1. Brian Jeff, “Who you callin’ LITTLE,” ARM Connected Community, Oct 28, 2013
  2. David Patterson, John Hennessy – Computer Organization and Design, ARM Edition [ISBN 9780128017333]
  3. ARM Developers
