Accelerators For Large-Scale Data Centre Services
Introduction
Data centres need ever more storage devices to hold the data that is constantly being generated. With high computational capability, a data centre can extract useful information from this unsorted data, but relying on CPUs alone for massive data analysis falls far short of the demand, so researchers and designers have to find ways to accelerate large-scale data centre services. With increasingly diverse and polymorphic applications, and with data volumes continuing to grow, data centres face plenty of challenges such as security, reliability, flexibility, power efficiency, and increasing costs. In this paper, we present a survey on how to accelerate data centre services.
After introducing different solutions that are currently used in industry, we analyse the pros and cons of each solution in light of technological trends, flexibility, and energy efficiency. Finally, we give a conclusion about the discussed solutions.
Architectures
In this section we describe possible architectures that could be used to accelerate your data centre. The first two are both developed by Microsoft, and the second is an upgraded version of the first; we have chosen to include both since the older version might be enough for your purposes. Three of the four architectures we present in this survey utilise FPGAs, and we therefore recommend the use of an FPGA-targeted C-to-gates tool to facilitate programming the FPGAs.
Catapult
Field Programmable Gate Arrays (FPGAs) are reconfigurable chips that can accelerate workloads. Multiple FPGAs per server offer a scalable amount of reconfigurable area but cost more and consume more power, while a single FPGA per server may limit the acceleration. A reconfigurable fabric, called Catapult, is designed to balance these concerns. It can be built either by placing multiple FPGAs on a daughter-card and installing that card in a subset of the servers, or by giving every server a small daughter-card with a single high-end FPGA that communicates directly over a secondary network. The authors chose the second approach because it achieves sufficient utilisation of the reconfigurable logic. Connecting the board to the CPU via PCIe, rather than placing the FPGA in the servers' network path, minimises disruption to the servers. The secondary network topology is a two-dimensional 6x8 torus, which provides low latency and high bandwidth.
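To make the torus topology concrete, the sketch below computes the four direct neighbours of an FPGA in a 6x8 torus (a simplified illustration in Python, not Putnam et al.'s routing logic); the wrap-around links are what keep hop counts, and hence latency, low.

    # Minimal sketch of the 6x8 secondary-network topology: each FPGA has four
    # neighbours, and wrap-around links bound the worst-case hop count.
    ROWS, COLS = 6, 8

    def torus_neighbours(row, col):
        """Return the (row, col) coordinates of the four direct neighbours."""
        return [
            ((row - 1) % ROWS, col),  # up (wraps around the torus)
            ((row + 1) % ROWS, col),  # down
            (row, (col - 1) % COLS),  # left
            (row, (col + 1) % COLS),  # right
        ]

    # Even a corner FPGA has four neighbours thanks to the wrap-around links.
    print(torus_neighbours(0, 0))  # [(5, 0), (1, 0), (0, 7), (0, 1)]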
Performance
Putnam et al. tested their reconfigurable fabric by letting it accelerate the production ranker of the search engine Bing. Average and tail latency were measured both with and without the fabric in order to have a baseline to compare against. At the 95th percentile, the worst-case latency was reduced by 29% when using the reconfigurable fabric. Putnam et al. also found that the fabric could not only reduce the latency of the ranker but also handle increasing input better than the pure software solution Bing was using; as a result, the Catapult-accelerated ranker achieved 95% higher throughput than the software ranker. The fabric can therefore be used either to reduce the number of servers needed to achieve the current performance or to increase the performance of the existing servers.
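As a rough illustration of the first option, the sketch below shows how a 95% per-server throughput gain translates into fewer servers for a fixed aggregate load; the request rates are assumed for illustration only, and just the 95% figure comes from the measurements above.

    import math

    # Assumed figures for illustration; only the 95% gain comes from the survey.
    baseline_throughput = 1_000                            # requests/s per software-only server
    accelerated_throughput = baseline_throughput * 1.95    # 95% higher with the fabric
    total_load = 100_000                                   # requests/s the service must sustain

    servers_software = math.ceil(total_load / baseline_throughput)
    servers_catapult = math.ceil(total_load / accelerated_throughput)

    print(servers_software, servers_catapult)  # 100 servers vs 52 servers for the same load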
Configurable Cloud Architecture
This architecture is a continuation of the work done on Catapult and is more scalable and flexible. In the Configurable Cloud architecture, multiple FPGAs are connected through a high-performance network switch: a layer of FPGAs is placed between the Ethernet network switches and the servers' NICs. The Configurable Cloud accelerates the datapath of cloud communication (networking flows, storage flows, security operations, and distributed applications) with programmable hardware. A host can use remote FPGAs for acceleration, since the FPGAs generate network packets independently of their hosts, and each host can donate its own local FPGA to the global pool. The Configurable Cloud is a flexible architecture, since the FPGAs can be used not only as local compute or network accelerators but also as a large-scale resource pool. The Lightweight Transport Layer (LTL) protocol enables this remote acceleration service, making access to remote FPGA resources faster than both a local SSD access and a trip through the host's network stack.
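The sketch below is a toy model of the pooling idea only (hypothetical names, not Microsoft's LTL or its API): hosts donate their locally attached FPGAs to a global pool, and any host can later borrow a free FPGA, local or remote, for acceleration.

    # Toy model of the global FPGA resource pool; names are hypothetical.
    class FpgaPool:
        def __init__(self):
            self._free = []  # FPGAs currently available to any host

        def donate(self, host, fpga_id):
            """A host contributes its locally attached FPGA to the global pool."""
            self._free.append((host, fpga_id))

        def acquire(self):
            """Borrow any free FPGA, local or remote; returns (owner_host, fpga_id)."""
            if not self._free:
                raise RuntimeError("no free FPGAs in the pool")
            return self._free.pop()

    pool = FpgaPool()
    pool.donate("server-17", "fpga-0")
    pool.donate("server-42", "fpga-0")
    owner, fpga = pool.acquire()  # a different host can now use this FPGA remotely
    print(f"accelerating on {fpga} hosted by {owner}")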
Achronix PCIe Accelerator-6D
In networking, CPUs execute the bit-intensive tasks of the lower OSI layers far less efficiently than those of the upper layers. Although the CPU provides different kinds of services for networking control-plane applications, it lacks support for the packet-based services of the network, data link, and physical layers. The Achronix PCIe Accelerator-6D board is a programmable NIC. With a configurable hardware acceleration engine, the Accelerator-6D board supports customised bit-intensive tasks and can implement a range of accelerators for data shaping, header analysis, encapsulation, security, network function virtualisation (NFV), and test and measurement. Data centres apply the Accelerator-6D to a variety of Remote Direct Memory Access (RDMA) applications through the RDMA over Converged Ethernet (RoCE) or iWARP protocols. By implementing RDMA on the Accelerator-6D, East-West (server-to-server) communication in the data centre can bypass the hosts' software stacks, while conventional North-South transactions are still supported by standard networking and tunnelling protocols to access the Internet or other remote resources. Offloading this local communication reduces the pressure on system resources; without laborious memory accesses and pipeline executions, the Accelerator-6D board improves data centre performance and efficiency.
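The sketch below is an illustrative model of the one-sided RDMA idea only (hypothetical names, not the Accelerator-6D or RoCE/iWARP APIs): a remote read is answered by the NIC directly from a registered memory region, so the remote CPU never runs any handler code.

    # Toy model of a one-sided RDMA read; class and method names are hypothetical.
    class RegisteredRegion:
        """Memory a server has pinned and exposed for remote access."""
        def __init__(self, data: bytes):
            self.data = data

    class RdmaNic:
        """Stands in for the NIC/accelerator that answers remote reads in hardware."""
        def __init__(self):
            self.regions = {}

        def register(self, key: str, region: RegisteredRegion):
            self.regions[key] = region

        def remote_read(self, key: str, offset: int, length: int) -> bytes:
            # Served entirely by the NIC: no interrupt, no syscall on the remote host.
            return self.regions[key].data[offset:offset + length]

    remote_nic = RdmaNic()
    remote_nic.register("results", RegisteredRegion(b"ranked documents for query 42"))
    print(remote_nic.remote_read("results", 0, 6))  # b'ranked'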
Tensor Processing Unit
The development of the Tensor Processing Unit (TPU) was driven by the increasing use of deep neural networks facilitating speech recognition in search engines. The TPU is therefore a neural network accelerator and is designed to speed up inference computations. It is designed as a co-processor to be connected to the PCIe I/O bus in the same manner as a GPU would be. This design choice expedites installation of TPUs in already existing systems, since the TPU board plugs into a standard SATA disk slot.
The main idea of installing a TPU co-processor is to offload the CPU: the TPU performs the computations related to the neural network and thereby frees up the CPU for other tasks. The TPU is built for the purpose of doing these kinds of calculations and will therefore perform them much faster than an ordinary CPU.
The TPU consists of three parts: a Unified Buffer, a Matrix Multiply Unit, and the data path. The Unified Buffer holds input from the CPU, intermediate results from the Matrix Multiply Unit, as well as the final output from the Matrix Multiply Unit. Data transfers between the TPU and the CPU are handled by DMA. The Matrix Multiply Unit can, when running at peak performance, perform 256 reads and writes per clock cycle, or perform a matrix multiplication or a convolution each clock cycle. The TPU uses CISC instructions extended with a repeat field; there are about a dozen different instructions, and their CPI is typically 10-20.
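The sketch below illustrates the kind of work the Matrix Multiply Unit offloads from the CPU: a matrix multiplication tiled into 256-wide blocks, mirroring the 256 values the unit reads and writes per cycle. It runs on the host with NumPy purely to show the tiling; it is not TPU code, and the 512x512 matrix size is an arbitrary example.

    import numpy as np

    TILE = 256  # block width, matching the 256 reads/writes per cycle mentioned above

    def tiled_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        """Multiply a and b block by block, TILE rows/columns at a time."""
        n, k = a.shape
        k2, m = b.shape
        assert k == k2
        out = np.zeros((n, m), dtype=np.float32)
        for i in range(0, n, TILE):
            for j in range(0, m, TILE):
                for p in range(0, k, TILE):
                    # One tile-sized multiply-accumulate, analogous to one block
                    # of work handed to the Matrix Multiply Unit.
                    out[i:i+TILE, j:j+TILE] += a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
        return out

    a = np.random.rand(512, 512).astype(np.float32)
    b = np.random.rand(512, 512).astype(np.float32)
    assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-2)  # same result, computed in blocks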
As mentioned above, it is during neural network operations that the TPU shines. Jouppi et al. measured the performance of six different inference apps on a TPU, a K80 GPU, and a Haswell CPU [3]. In their experiments, the TPU provided a speedup of 15x-30x compared to the K80 GPU and the Haswell CPU when only considering the raw performance of the processing unit. When designing a data centre it is more important to consider cost in relation to performance, commonly called Total Cost of Ownership (TCO). The authors therefore calculated performance/Watt in order to factor in the energy consumption of the three different processing units. The TPU proved to provide the highest performance per Watt of the units in the experiments. Since both the GPU and the TPU are co-processors, two different performance/Watt values were calculated: one where the energy consumed by the host CPU is included (total-performance/Watt) and one where this energy is omitted (incremental-performance/Watt). Compared with the CPU, the TPU provided a speedup of 17x-34x in total-performance/Watt and 41x-83x in incremental-performance/Watt. Compared with the GPU, the numbers were 14x-16x and 25x-29x respectively.
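The worked example below shows how the two metrics are computed; the inference rate and power figures are assumed for illustration and are not Jouppi et al.'s measurements.

    # Assumed figures for illustration only (not measurements from the TPU paper).
    perf = 90_000            # inferences/s delivered by the accelerated server
    host_cpu_watts = 290     # power drawn by the host CPU
    accelerator_watts = 75   # power added by the co-processor card

    # "Total" charges the accelerator for the host CPU's power as well;
    # "incremental" counts only the power the accelerator itself adds.
    total_perf_per_watt = perf / (host_cpu_watts + accelerator_watts)
    incremental_perf_per_watt = perf / accelerator_watts

    print(round(total_perf_per_watt), round(incremental_perf_per_watt))  # 247 vs 1200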
Discussion
A drawback of Catapult is the need for complex and expensive cabling, along with the dependence on awareness of the physical location of the hardware. This, among other things, was eliminated during the work on the Configurable Cloud fabric. Catapult also uses one PCIe-attached FPGA per CPU and addresses the elasticity problem by deploying a secondary inter-FPGA network, which brings additional cost and increases cabling and management complexity. Even though the researchers report that Catapult achieves a 95% improvement in ranking throughput at fixed latency compared with a software-only approach, a dedicated serial network connecting the PCIe-attached FPGAs breaks the homogeneity of the data centre network and increases system complexity. The later version of Microsoft's accelerator work states that the first-generation Catapult has several limitations: it needs more expensive and complex cabling, only a limited number of FPGAs can communicate directly, it is hard to extend across the data centre infrastructure, and so on.
The new cloud architecture achieves low latency and high bandwidth by placing a layer of FPGAs between the switches and the servers, where the FPGAs communicate with each other through a low-overhead transport layer rather than going through the CPU and the software network stack. Large scale is achieved by letting the FPGAs communicate directly over the data centre's Ethernet infrastructure.
Unlike Catapult, the Cloud Scale Architecture has only a subset of the feature calculations implemented. Its reliable communication protocol, however, extends to hundreds of thousands of nodes compared with the previous version.
The Accelerator-6D is the only FPGA-based PCIe accelerator board with six independent DRAM memory ports connected to the same FPGA device. Each individual port can be configured with up to 32 GB of DDR3 memory. These memories are complemented by four QSFP+ modules supporting 40G Ethernet, providing an ideal platform for data centre architects to develop intelligent network cards (NICs) for network function virtualisation (NFV), network acceleration, and network security. It provides low cost, high performance, and scalability, and it is used to accelerate the data centre.
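For a sense of scale, the quick arithmetic below derives the board's aggregate capacity and bandwidth from the per-port figures above; only the totals are computed here.

    # Aggregate figures derived from the per-port specs quoted above.
    dram_ports, gb_per_port = 6, 32
    qsfp_modules, gbps_per_module = 4, 40

    print(dram_ports * gb_per_port, "GB of DDR3 in total")              # 192 GB
    print(qsfp_modules * gbps_per_module, "Gb/s of Ethernet in total")  # 160 Gb/s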
The TPU is probably the board that is easiest to plug into an existing system, since it fits a standard SATA disk slot. That said, the speedup gained from installing it depends on the use of neural networks, since this is a neural network accelerator. But it is probably safe to say that we will see more deep neural networks in data centres, since they are useful for speech recognition and increases in the use of speech instead of text in search engines have been observed. Together with user applications placing a heavier focus on responsiveness instead of throughput, it will be beneficial to have a system that provides quick computations on spoken input.
Conclusion
The Cloud Scale Architecture is the developed version of Catapult. It not only overcomes several limitations of Catapult, such as requiring awareness of the physical location of machines, allowing at most 48 nodes to communicate directly, and being hard to extend across the data centre infrastructure, but also enables direct communication between FPGAs and is more scalable. Accelerator-6D offers high performance by focusing on accelerating the bit-intensive transmission of the lower OSI layers. On the current market, the first-generation Catapult architecture has already been replaced by the Cloud Scale Architecture in new Microsoft products. We would therefore recommend the Cloud Scale Architecture for data centre design, since it provides lower latency and higher bandwidth and suits large, scalable networks. But if your data centre relies heavily on neural networks and machine learning, we would instead recommend installing TPUs, since they provide 41x-83x the incremental-performance/Watt compared to a Haswell CPU. For the coming future, we expect an evolution in network protocols, storage stacks, and the physical organisation of components.