Exploiting Heterogeneous Mobile Architectures Through a Unified Runtime Framework

Modern mobile SoCs typically integrate multiple heterogeneous hardware accelerators such as GPUs and DSPs. Resource-heavy applications such as object detection and image recognition based on convolutional neural networks meet their stringent performance constraints by offloading these computation-intensive algorithms to the accelerators. Conventionally, each accelerator is programmed through a device-specific runtime and programming language, and these offloaded tasks are typically pre-mapped to a specific compute unit at compile time, missing the opportunity to exploit other underutilized compute resources for better performance. To address this shortcoming, we present SURF: a Self-aware Unified Runtime Framework for Parallel Programs on Heterogeneous Mobile Architectures. SURF supports several heterogeneous parallel programming languages (including OpenMP and OpenCL), and enables dynamic task mapping to heterogeneous resources based on runtime measurement and prediction. The measurement and monitoring loop enables self-aware adaptation of runtime mapping to dynamically exploit the best available resource. Our SURF framework has been implemented on a Qualcomm Snapdragon 835 development board and evaluated on a mix of image recognition (CNN), image filtering applications, and synthetic benchmarks to demonstrate the versatility and efficacy of our unified runtime framework.


Introduction
Mobile computing has benefited from a virtuous cycle of powerful computational platforms enabling new mobile applications, which in turn create the demand for ever more powerful computational platforms. In particular, contemporary mobile platforms increasingly integrate a diverse set of heterogeneous computing units that can be used to accelerate newer mobile applications (e.g., augmented reality, image recognition, inferencing, 3-D gaming) that are computationally demanding. The privacy and security needs of these mobile applications (i.e., safely computing on the mobile platform rather than suffering the vulnerability of sending data to the cloud for processing) place further computational stress on emerging mobile platforms. Consequently, as shown in Table 1, contemporary mobile platforms typically include a diverse set of compute units such as heterogeneous multi-processors (HMPs) and programmable accelerators such as GPUs, DSPs, and NPUs, as well as other custom application-specific hardware accelerators. However, current mobile platforms and their supporting software infrastructures are unable to fully exploit these heterogeneous compute units for two reasons: 1) existing runtime systems are typically designed for one or a few compute units, and are thus unable to exploit other heterogeneous compute units that are left idle; and 2) conventional wisdom dictates that certain application codes are best accelerated by specific compute units (e.g., embarrassingly parallel codes by GPUs, and filtering/signal processing by DSPs). Consequently, some compute units (e.g., GPUs) can get heavily overloaded with high resource contention, resulting in overall poor performance. Indeed, in our recent study [10], we made the case for exploiting underutilized resources in heterogeneous mobile architectures to gain better performance and power, even counterintuitively using a slower, less efficient but underused compute unit to gain overall performance and power benefits when the platform is saturated. To fully exploit such situations, we believe there is a need for a unified runtime framework for parallel programs that can accept applications and dynamically map them to fully utilize the available heterogeneous architecture.
Towards that end, this article motivates the need for, and presents the software architecture and a preliminary evaluation of, SURF, our Self-aware Unified Runtime Framework for parallel programs that exploits the range of mobile heterogeneous compute units. SURF is a unified framework built on top of existing parallel programming interfaces to provide resource management and task schedulability for heterogeneous mobile platforms. Using the SURF application interfaces, application designers can accelerate application blocks by creating schedulable SURF tasks. The SURF runtime system includes a self-aware task-mapping module that considers resource contention, the platform's native scheduling scheme, and the hardware architecture to perform performance-centric task mapping. We have implemented SURF in Android on a Qualcomm Snapdragon 835 development board, supporting OpenMP, OpenCL, and the Hexagon SDK as the programming interfaces for the CPU, GPU, and DSP respectively. Our initial experimental results, using a naive but self-aware scheduling scheme, show that SURF achieves average performance improvements of 24% over contemporary runtime systems when the system is saturated with multiple applications. We believe this demonstrates the potential for even larger performance improvements when more sophisticated scheduling algorithms are deployed within SURF.
The rest of this article is organized as follows. Section 2 presents background on existing mobile programming frameworks. Section 3 outlines opportunities to exploit heterogeneous compute units for mobile parallel workloads and motivates the need for the SURF framework through a case study. Section 4 presents SURF's software architecture. Section 5 presents early experimental results using SURF to execute sample mobile workloads. Section 6 discusses related work, and Section 7 concludes the article.

Background
Modern mobile heterogeneous system-on-chip (SoC) platforms are typically shipped with supporting software packages for programming the integrated heterogeneous hardware accelerators. However, there is no unified programming framework. The Open Computing Language (OpenCL) was designed to serve this purpose, but on mobile platforms it ends up being mostly limited to the GPU. Other compute units such as DSPs or FPGAs need their own supporting software packages instead of relying on OpenCL. As a consequence, existing infrastructures require a static mapping of the workload to compute units at compile time, which can cause severe resource contention for one unit (e.g., the GPU) while underutilizing other units (e.g., the DSP). Besides, there is no information sharing between individual device runtimes, which makes it difficult to make intelligent task-mapping decisions even if schedulability is provided. Hence, existing software infrastructures are unable to exploit the full heterogeneity of compute units. In our previous case study [10], we showed how underutilized heterogeneous resources can be exploited to boost performance and save power when the platform is saturated with workloads, an increasingly common scenario for mobile platforms where users multi-task between mobile games, image/photo manipulation, video streaming, AR, etc. Our study highlighted the need for a new runtime that can dynamically manage and map applications to heterogeneous resources at runtime. To address these challenges, we have built SURF, a unified framework that sits on top of existing parallel programming interfaces to provide resource management and task schedulability for mobile heterogeneous platforms. Using the SURF application interfaces, application designers can accelerate application blocks by creating schedulable SURF tasks. Next, we analyze the performance of several popular mobile data-parallel workloads on heterogeneous compute units to illustrate the potential for SURF to map these computations across these units.
Data-Parallel Workload Characterization

Data-parallel computations are common in several mobile application domains, such as image recognition (using CNNs) and image/video processing and manipulation, where the same function is applied to a large amount of data. Due to the simplicity of this programming pattern, such computations can easily be offloaded to hardware accelerators such as GPUs without substantial programming effort. To highlight the opportunity for performance improvement through task mapping/schedulability across heterogeneous compute units, we measured the execution time of two benchmark suites (the Polybench benchmark suite [7] and the Hexagon SDK benchmark suite [16]), as well as of the critical layers of a CNN (cuda-convnet), which together contain several common data-parallel kernels across different domains. In addition to their original implementations, we added OpenMP (CPU), OpenCL (GPU), or C (DSP) implementations to execute them on the different compute units.
Figure 1 shows the measurement results of running each benchmark on the CPU, GPU, and DSP respectively. As expected, we typically see one "dominant" version with the best performance on a specific compute unit; e.g., syrk and convnet pool1 have the lowest execution time on the GPU, whereas bilateral and convnet conv2 run best on the DSP. However, note that the non-dominant (slower) versions (e.g., syrk and convnet pool1 on the CPU or DSP, and bilateral and convnet conv2 on the CPU or GPU), while seemingly inferior in performance, can be opportunistically exploited by our SURF runtime to improve overall system performance, especially as the mobile platform suffers from high contention when popular apps (e.g., image recognition, photo manipulation/filtering) compete for a specific compute unit (e.g., the GPU for data-parallel computations).

Motivational Case Study
With abundant compute resources on a mobile chip, a developer typically partitions an application into task kernels to be executed on the compute units and accelerators (e.g., CPU, GPU, DSP) that promise the greatest boost in performance. For instance, a convolutional neural network (CNN) application with multiple layers can be partitioned into data-parallel tasks per layer and mapped onto the GPU to boost performance. Intuitively, this strict partitioning of tasks onto the highest-performing compute units should result in overall better performance. However, mobile platforms often face resource contention when executing multiple applications, saturating these high-performing compute units. In such scenarios, contrary to intuition, offloading computational pressure to other underutilized and seemingly under-performing compute units (e.g., DSPs) can actually yield overall improvements in performance and energy. Indeed, in an earlier experimental case study [10], we observed an average improvement of 15-46% in performance and 18-80% in energy when executing multiple CNN, computer vision, and graphics applications on a mobile Snapdragon 835 platform by utilizing idle resources such as DSPs and considering all available resources holistically.
In this section, we present this motivational study, executing a mix of popular data-parallel workloads and showing that both the performance and the energy consumption of mobile platforms can be improved by synergistically deploying these underutilized compute resources. We select and run three classes of applications: image recognition, image processing, and graphics rendering workloads, to emulate a system heavily exercised by computationally demanding applications such as augmented reality and virtual reality.

Table 2: Keywords used in Experiments [10]

  CPU-float, CPU-8bit: Run the original or quantized version on the CPU
  GPU-float, GPU-8bit: Run the original or quantized version on the GPU
  DSP-float: Run the original version on the DSP
  DSP-8bit: Run the quantized version on the DSP with batch processing
  DSP-8bit-nob: DSP-8bit without batch processing
  Hetero: Layers or stages are statically configured to run on the highest-performing compute unit
  Hetero-noGPU: Like Hetero, but avoids using the GPU

Platform: We use a Snapdragon 835 development board with the Android 6 operating system (which uses the Linux 4.4.63 kernel). The board's SoC integrates custom CPUs in a big.LITTLE configuration that conform to ARM's ISA. It also integrates a GPU with unified shaders, all capable of running compute and graphics workloads. The 835 board has two Hexagon DSPs: a cellular modem DSP dedicated to signal processing, and a compute DSP for audio, sensor, and general-purpose processing. We target the compute DSP since it is typically idle.
Applications: For the CNN applications, we select two Caffe CNNs, lenet-5 and cuda-convnet, using the MNIST and CIFAR10 datasets respectively. MNIST represents a lightweight network with a few layers and a low memory footprint, whereas CIFAR10 has more layers and a high memory footprint. We also implemented a quantized version of Caffe, which supports quantized matrix multiplication using 8-bit fixed-point arithmetic for the convolutional and fully-connected layers. The other layers still perform floating-point computation. The experiments include floating-point and fixed-point versions of the CNN models running on the CPU, GPU, and DSP. For the Canny Edge Detection (CED) application, we modified Chai CED [9] to support all heterogeneous compute resources for each stage.
Table 2 summarizes the different experiments executing the above applications on various compute units (CPU, GPU, DSP, and heterogeneous, i.e., including all compute units). In addition to the original floating-point versions of the CNNs, we also deploy 8-bit quantized versions to exploit the DSP effectively. The row DSP-8bit represents a single function call for batch processing of 100 images to amortize the communication overhead, whereas the row DSP-8bit-nob represents no batch processing, i.e., separate function calls for each image.
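To make the quantization concrete, the following is a minimal sketch of 8-bit fixed-point matrix multiplication with 32-bit accumulation. The scale/zero-point scheme shown here is a common convention and an assumption on our part, not necessarily the exact implementation in the quantized Caffe described above.

    #include <cstdint>
    #include <vector>

    // Minimal 8-bit quantized matrix multiply: C = A * B, where each value is
    // quantized as q = round(x / scale) + zero_point. Products are accumulated
    // in 32-bit integers; the result is dequantized back to float for layers
    // that still run in floating point (e.g., pooling, ReLU).
    void quantized_matmul(const std::vector<std::uint8_t>& A, float a_scale, std::int32_t a_zp,
                          const std::vector<std::uint8_t>& B, float b_scale, std::int32_t b_zp,
                          std::vector<float>& C, int M, int N, int K) {
        for (int m = 0; m < M; ++m) {
            for (int n = 0; n < N; ++n) {
                std::int32_t acc = 0;
                for (int k = 0; k < K; ++k) {
                    acc += (static_cast<std::int32_t>(A[m * K + k]) - a_zp) *
                           (static_cast<std::int32_t>(B[k * N + n]) - b_zp);
                }
                // The product of two quantized values scales by a_scale * b_scale.
                C[m * N + n] = acc * a_scale * b_scale;
            }
        }
    }

Fixed-point accumulation of this kind is what lets the fixed-point-optimized Hexagon DSP outperform its floating-point execution, at the cost of the small accuracy drop reported later.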

Opportunities for Exploiting Underutilized Resources
Figure 2 presents the performance of the convolutional layers of MNIST and CIFAR10. Since the Hexagon DSP is fixed-point optimized, the quantized version (DSP-8bit) of the convolutional layers is able to outperform some of the other versions. Therefore, following intuition, the performance of a single application can be boosted by allocating the workload to the corresponding highest-performing compute unit. However, counterintuitively, we may be able to exploit seemingly slower compute units to gain overall performance and energy improvements. Figure 3 illustrates this scenario, showing the execution time of running one to three instances of CIFAR10 in parallel. When executing only one CIFAR10 instance, the GPU-only version yields the best result compared to the GPU-CPU and GPU-DSP versions (as expected). However, when we execute multiple instances of CIFAR10 (i.e., the panels showing CIFAR10*2 and CIFAR10*3), we observe that offloading to the other, seemingly inferior compute units (e.g., CPU and DSP) yields overall better performance. Indeed, when executing 3 instances of CIFAR10 (CIFAR10*3), the performance of GPU-CPU and GPU-DSP significantly outperforms the GPU-only version, since the GPU is saturated. This simple example motivates the opportunity to exploit underutilized resources such as DSPs, as outlined in Sections 3.3 and 3.4.

Optimization for a Single Application Class
Intuitively, the performance and energy consumption of an application (e.g., a CNN) can be improved by partitioning it and executing it on specific accelerators (e.g., GPUs). But frameworks such as TensorFlow and Caffe run the whole CNN model on the same GPU, saturating that compute unit while missing the opportunity to improve performance and energy consumption by exploiting other underutilized compute units (e.g., the CPU and DSP). Therefore, we partition the neural network at the layer level so that each layer can be executed as a task running on a different compute unit to exploit heterogeneity. Figures 4a and 4b show the execution time, average power, and energy consumption of running different versions of MNIST and CIFAR10. For MNIST, conv2 runs on the DSP and the other layers run on the CPU. For CIFAR10, conv2 and conv3 run on the DSP, and the other layers run on the GPU. Although DSP-8bit has better performance on convolution layers in general, as shown in Figure 2, it performs worse overall due to the floating-point computation in other layers such as the pooling and ReLU layers. For all quantized models, the accuracy drops 1.4% on average. Hetero represents the results of utilizing diverse compute units to gain performance and energy improvements. Indeed, the Hetero results show a 15.6% performance boost and a 25.4% energy saving on average compared to CPU-float and GPU-float (which respectively perform best for MNIST and CIFAR10).
Figure 4c shows the results of running multiple CIFAR10 instances. The results are grouped by CPU, GPU, and heterogeneous resources, and the values are normalized to CPU-8bit. For CPU-8bit, the performance is scalable, but the power and energy consumption increase drastically with more instances because more cores are exercised. The performance of GPU-8bit degrades as the number of instances increases because the instances contend for the GPU. Hetero shows more stability than the others due to the distribution of the workload over all compute resources. We also emulate the scenario in which the GPU is saturated by rendering high-quality graphics, using the GPU Performance Analyzer benchmark to produce a high-quality graphics workload. As Figure 4d shows, the performance of GPU-float and Hetero decreases significantly because the GPU is fully saturated by this graphics workload. Hetero-noGPU is statically configured to offload the conv2, conv3, and relu layers to the DSP while the other layers run on the CPU. As Hetero-noGPU specifically avoids using the GPU, its performance and energy consumption outperform the others.

Optimization for Multiple Application Classes
When executing multiple application classes on a system, both task partitioning and the exploitation of heterogeneous resources help to better distribute the workload, which in turn leads to better performance and energy consumption. Figure 5a presents the results of running different combinations of CED and CIFAR10. CPU/CPU denotes the static task-mapping policy in which CED runs only on the CPU and CIFAR10 also runs only on the CPU; the other labels in the figure follow the same convention. Makespan is measured from when we launch all the applications in parallel to when the last application terminates. By exploiting all heterogeneous (including underutilized) resources efficiently, we can achieve better results: the fully heterogeneous mapping (Hetero/Hetero) outperforms the CPU-only and GPU-only policies by up to 51% in performance and 55% in energy consumption.
Figure 5b presents the results of running all three workloads: CED, CIFAR10, and the graphics benchmark. The Hetero/Hetero/Graphics mapping policy contends for the GPU and therefore fails to achieve a better outcome. However, the Hetero-noGPU/Hetero-noGPU/Graphics policy, in which we adjust CED and CIFAR10 to map only onto the CPU and DSP, outperforms the previous policy, since the GPU becomes the bottleneck under severe contention. This scenario highlights the need for runtime decision-making that pairs workloads from different applications with compute units according to the system status, something not possible in existing runtimes. Hence, we propose our runtime framework, SURF, to address this problem; it is detailed in the next section.
SURF: Self-aware Unified Runtime Framework

SURF [11] is a unified runtime framework built on top of existing programming interfaces and device runtimes to provide adaptive, opportunistic resource management and task schedulability that exploits underutilized compute resources. Figure 6 shows the architectural overview of SURF. In a nutshell, mobile applications create SURF tasks through the SURF APIs. When a SURF task is submitted, a self-aware task-mapping algorithm is invoked, referencing runtime information about the compute units provided by the SURF service. After the task-mapping decision is made, the corresponding parallel runtime stub executes the task.

Application and Task Model
Figure 7 shows the hierarchy of SURF's application model. At the highest level, the mobile platform admits new applications at any time. A newly entering application (e.g., the CNN in Figure 7) can create and submit tasks to SURF dynamically. A task (e.g., conv1, pool, and relu1 in Figure 7's CNN application) represents a computational chunk (a parallel algorithm or application block) that could be a candidate for acceleration. A kernel residing in a task is the programming-interface-specific implementation artifact that programs one compute unit (e.g., the OpenMP, OpenCL, and Hexagon DSP kernels shown on the right side of Figure 7). SURF opportunistically maps each task (encapsulating multiple kernels) for scheduled execution on a specific compute unit. All kernels in a task share a set of common inputs and outputs. The code block in Figure 8 demonstrates how to use the application interfaces to create and execute a two-dimensional convolution task with three kernels: an OpenMP, an OpenCL, and a Hexagon DSP kernel.
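Since Figure 8 is not reproduced in this text, the following is a hedged sketch of what such code could look like; every surf_-prefixed name and constant is an assumption about the API surface, not SURF's actual interface.

    #include "surf.h"  // hypothetical SURF API header

    // Hypothetical kernel entry points, one per programming interface.
    extern "C" void conv2d_omp();  // OpenMP implementation (CPU clusters)
    extern "C" void conv2d_cl();   // OpenCL implementation (GPU)
    extern "C" void conv2d_dsp();  // Hexagon SDK implementation (DSP)

    void run_conv2d(int width, int height) {
        // Buffers are shared by all kernels of the task.
        surf_buffer_t* in  = surf_buffer_create(width * height * sizeof(float));
        surf_buffer_t* out = surf_buffer_create(width * height * sizeof(float));

        // One schedulable task bundling one kernel per compute unit.
        surf_task_t* task = surf_task_create("conv2d", in, out);
        surf_task_add_kernel(task, SURF_KERNEL_OPENMP,  conv2d_omp);
        surf_task_add_kernel(task, SURF_KERNEL_OPENCL,  conv2d_cl);
        surf_task_add_kernel(task, SURF_KERNEL_HEXAGON, conv2d_dsp);

        // Submission triggers the self-aware mapping decision; the selected
        // kernel runs on the chosen compute unit.
        surf_task_execute(task);
        surf_task_release(task);
        surf_buffer_release(in);
        surf_buffer_release(out);
    }

The key design point is that the application registers all available implementations once, and the choice among them is deferred to the runtime at each submission.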

Memory Management and Synchronization
SURF assumes that compute units share system memory, which is the dominant architecture in mobile SoCs. Hence, expensive data movement between device memories can be avoided if the memory is correctly mapped into every device's address space. A SURF buffer object is a memory region mapped into all device address spaces through device-specific programming interfaces, e.g., the OpenCL Qualcomm extension and the Hexagon SDK APIs for Qualcomm SoCs. Memory synchronization is still necessary when a buffer is used by different devices, to ensure that the running device sees the most recent update of the data. SURF automatically synchronizes memory objects when a memory object is about to be used by a different device; this memory overhead is included in SURF's task-mapping decision.
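As an illustration of this bookkeeping, the sketch below tracks the last device that wrote a shared buffer and synchronizes caches only when ownership changes. The class and method names are illustrative assumptions, not SURF's actual implementation.

    #include <cstddef>

    enum class Device { CPU, GPU, DSP, None };

    // Sketch of a shared-memory buffer object. The underlying allocation is
    // system memory mapped into every device's address space, so only cache
    // synchronization is needed when use moves between devices.
    class SurfBuffer {
    public:
        explicit SurfBuffer(std::size_t bytes)
            : bytes_(bytes), last_writer_(Device::None) {}

        // Called by the runtime right before a kernel on `dev` reads the buffer.
        void acquire(Device dev) {
            if (last_writer_ != Device::None && last_writer_ != dev) {
                // Flush/invalidate caches via the device-specific interface,
                // e.g., OpenCL map/unmap or Hexagon SDK cache operations.
                synchronize(last_writer_, dev);
            }
        }

        // Called after a kernel on `dev` has written the buffer.
        void release(Device dev) { last_writer_ = dev; }

    private:
        void synchronize(Device from, Device to);  // device-specific, not shown
        std::size_t bytes_;
        Device last_writer_;
    };

Deferring synchronization until a different device actually touches the buffer is what keeps the common same-device-reuse case free of overhead.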

Self-aware Adaptive Task Mapping
SURF employs a self-aware adaptive task-mapping strategy. SURF exhibits self-awareness [5] by creating a model of the underlying heterogeneous resources, assessing the current system state via the SURF monitor, and using predictive models to guide mapping decisions. This enables SURF to act in a self-aware manner, combining reactive strategies (e.g., responding as new applications arrive or active applications exit) with proactive ones (e.g., using predictive models to evaluate opportunistic mapping to underutilized compute units) for efficient, adaptive runtime mapping. SURF's current implementation deploys a variant of the heterogeneous earliest finish time (HEFT) [17] task-mapping algorithm, enhanced to incorporate the cost of runtime resource contention. We consider two types of contention:

Intra-compute-unit contention occurs when multiple tasks are submitted to the same compute unit. The cost of this contention depends on the device runtime and the hardware architecture. For accelerators such as the GPU and DSP, task execution is usually exclusive due to costly context-switch overheads. A FIFO task queue is implemented for each compute unit, so we include the wait time in the queue when calculating the finish time of a task. We also consider device concurrency (i.e., how many tasks can run concurrently on a device) in the analysis. Contemporary mobile GPUs can only accommodate one task execution at a time. Other devices such as DSPs may support more than one concurrent task (e.g., the Qualcomm Hexagon DSP supports up to 2 when set to 128-byte vector context mode [16]). And of course the CPU cluster can execute multiple concurrent tasks across the big.LITTLE cores, which typically employ an existing sophisticated scheduler such as the Linux Completely Fair Scheduler (CFS) [14].

Inter-compute-unit contention arises mainly from memory: when memory-intensive tasks execute concurrently on different compute units, memory contention becomes the major bottleneck, and the execution makespan of a task can increase significantly.
Figure 9 shows SURF's dynamic task-mapping scheme. SURF uses a heuristic-based scheme to estimate the finish time of a task on each compute unit, considering both intra- and inter-compute-unit contention. First, to determine which compute unit has the fastest execution time, a new task starts in a profile phase that measures the execution time of all kernels in the task. The mapping phase follows the profile phase and finds the earliest finish time based on the runtime information. The policy determines how to perform the task mapping according to the task profile and device load. Equation 1 shows how we estimate the finish time:

    T^{cu}_{task} = T_{intra} + T_{inter} + T_{o}    (1)

where T^{cu}_{task} is the finish time when executing the task on compute unit cu. T_{inter} is the execution time considering inter-compute-unit contention. The influence of memory contention on execution time is difficult to estimate at runtime because micro-architectural metrics are usually not accessible for hardware accelerators; hence we use a history-based method to model this effect. A history buffer tracks the execution times of the latest n runs, and T_{inter} is the average of the history buffer. T_{intra} is the additional delay due to intra-compute-unit contention. For the GPU/DSP, T_{intra} is the sum of the execution times of earlier submitted tasks waiting in the queue. For the CPU, T_{intra} is complicated to estimate if left unbounded, so we estimate the worst-case execution time based on the OpenMP programming model, assuming the active CPU threads have the same priority under the CFS policy (each thread is allocated the same time slice). SURF configures an OpenMP kernel to execute on a CPU cluster with one thread per core; hence we approximate the worst-case delay by Equation 2:

    T_{intra} ≈ T_{inter} × (TPC − 1)    (2)

where TPC is the number of concurrent OpenMP tasks in the CPU cluster, so the task's total CPU execution time is approximately T_{inter} × TPC. T_{o} represents the overhead of deploying the task to the compute unit plus memory synchronization when necessary (e.g., when a memory buffer was written by the GPU and the CPU is going to use the results). SURF selects the kernel with the minimum T^{cu}_{task} and submits it to the SURF device queue for execution.
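A minimal sketch of this estimate follows, under the reconstruction of Equations 1 and 2 given above (the exact formulas in the original paper may differ, so treat the decomposition as an assumption):

    #include <algorithm>
    #include <deque>
    #include <numeric>
    #include <vector>

    // History buffer of the latest n measured runs of a kernel; its average
    // serves as T_inter, implicitly capturing memory contention effects.
    struct KernelHistory {
        std::deque<double> runs;
        double t_inter() const {
            if (runs.empty()) return 0.0;
            return std::accumulate(runs.begin(), runs.end(), 0.0) / runs.size();
        }
    };

    // Equation 1: T_task^cu = T_intra + T_inter + T_o.
    double finish_time(double t_intra, double t_inter, double t_overhead) {
        return t_intra + t_inter + t_overhead;
    }

    // GPU/DSP intra-unit delay: sum of execution times of earlier tasks
    // still waiting in the unit's FIFO queue.
    double intra_accelerator(const std::vector<double>& queued) {
        return std::accumulate(queued.begin(), queued.end(), 0.0);
    }

    // Equation 2 (CPU): with TPC concurrent OpenMP tasks sharing the cluster
    // under CFS with equal time slices, the extra delay is (TPC - 1) times
    // the task's own execution time.
    double intra_cpu(double t_inter, int tpc) {
        return t_inter * std::max(0, tpc - 1);
    }

The mapper would evaluate finish_time once per compute unit (adding the dispatch and memory-synchronization overhead T_o for that unit) and pick the minimum.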

Parallel Runtime Stub
The parallel runtime stub is an abstraction layer on top of the existing programming interfaces. This layer uses their interfaces to communicate with the corresponding runtime. Each stub provides the following features: a) initialization of the programming resources for its programming interface; b) memory management and synchronization: while the shared-system-memory model between heterogeneous compute units is dominant in mobile SoCs and saves expensive data movement, memory must still be synchronized between the cache and system memory before another compute unit accesses it; and c) computation kernel execution. SURF currently supports three programming interfaces, OpenMP, OpenCL, and the Hexagon SDK, to program the CPU, GPU, and DSP respectively.
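A hedged sketch of what such a stub interface might look like, directly mirroring features a) through c) above (the names are illustrative; SURF's actual interface is not shown in this article):

    class SurfBuffer;   // shared-memory buffer, as sketched earlier
    struct SurfTask;    // opaque task handle (illustrative)

    // One concrete stub per programming interface: an OpenMP stub for the
    // CPU clusters, an OpenCL stub for the GPU, a Hexagon SDK stub for the DSP.
    class RuntimeStub {
    public:
        virtual ~RuntimeStub() = default;
        // a) One-time setup of the underlying runtime (contexts, queues, ...).
        virtual void initialize() = 0;
        // b) Cache/system-memory synchronization before another unit reads.
        virtual void sync_buffer(SurfBuffer& buf) = 0;
        // c) Launch the task's kernel on this compute unit and wait for it.
        virtual void execute(SurfTask& task) = 0;
    };

Keeping the mapper behind this interface means adding a new programming interface only requires a new stub, not changes to the mapping logic.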

SURF Service and Monitor
The SURF service is a background process that synchronizes system information with application processes. The SURF monitor collects system status and profiling results. For example, we collect the execution time of OpenMP threads from the sum_exec_runtime entry through sysfs to estimate how long an OpenMP kernel has been running.
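As a concrete illustration, the sketch below reads a thread's cumulative execution time. On the kernels we are aware of, this counter is exposed as se.sum_exec_runtime in /proc/<tid>/sched rather than under sysfs proper, so treat the exact path as an assumption that may differ on the Snapdragon kernel.

    #include <fstream>
    #include <sstream>
    #include <string>

    // Returns the thread's cumulative CPU time in milliseconds, or -1.0 if
    // the entry is not found. Assumed line format:
    //   "se.sum_exec_runtime : 1234.567890"
    double read_sum_exec_runtime_ms(int tid) {
        std::ifstream sched("/proc/" + std::to_string(tid) + "/sched");
        std::string line;
        while (std::getline(sched, line)) {
            if (line.rfind("se.sum_exec_runtime", 0) == 0) {
                std::istringstream iss(line.substr(line.find(':') + 1));
                double ms = 0.0;
                iss >> ms;
                return ms;
            }
        }
        return -1.0;
    }

Sampling this counter before and after a kernel run gives the CPU time actually consumed, independent of how CFS interleaved the thread with others.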

Experimental Setup
Figure 10 shows our experimental setup. We have implemented the SURF framework in C/C++ on Android 7 running on a Qualcomm Snapdragon 835 development board, which has two CPU clusters (big.LITTLE configuration) and an integrated GPU and DSP. SURF treats the little CPU cluster, big CPU cluster, GPU, and DSP as four compute units when making task-mapping decisions, where the GPU and DSP can execute 1 and 2 concurrent tasks respectively. SURF kernels can be created through the OpenMP, OpenCL, and Hexagon SDK programming interfaces to program the CPU, GPU, and DSP respectively. We deploy the Caffe convolutional neural network framework [12], the Canny Edge Detector (CED), the Polybench benchmark suite, and the Hexagon SDK benchmarks to run on SURF. We also use the Snapdragon Profiler [15] to measure the utilization of each compute unit. Power consumption is measured by averaging the product of the voltage and current read from the power supply module through the Linux sysfs interface (e.g., /sys/class/power_supply). Energy consumption is the product of the makespan and the average power consumption. We also access the Android Debug Bridge (adb) over WiFi instead of a USB connection so that USB charging does not compromise the power measurements.

In our experimental sets, we run two applications representing foreground processes: image recognition (cuda-convnet within Caffe with the CIFAR10 dataset) and an image filter (CED), which have 9 and 4 SURF tasks respectively. We also run two GPU-dominant benchmarks (syrk and gemm from Polybench) and two DSP-dominant benchmarks (bilateral and epsilon) representing background processes; each benchmark runs one SURF task. We characterize application workloads as heavy or light by changing the batch-processing size (how many images are processed per iteration), and benchmark workloads as heavy, medium, or light by changing their input size. A light workload is a real-time workload that completes within 20 ms; medium and heavy workloads complete within 20-100 ms and above 100 ms respectively. Table 3 summarizes the configurations of the applications and benchmarks used in our experimental sets.
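For illustration, a minimal power-sampling sketch along the lines described above; the voltage_now and current_now node names (in microvolts and microamps) are common on Android kernels but board-specific, so treat them as assumptions.

    #include <fstream>
    #include <string>

    // Samples instantaneous power in watts from the power-supply sysfs nodes.
    // Average power over a run is the mean of such samples; energy is then
    // makespan times average power.
    double sample_power_watts(const std::string& supply = "battery") {
        const std::string base = "/sys/class/power_supply/" + supply + "/";
        double uv = 0.0, ua = 0.0;
        std::ifstream(base + "voltage_now") >> uv;  // microvolts
        std::ifstream(base + "current_now") >> ua;  // microamps
        return (uv * 1e-6) * (ua * 1e-6);           // volts * amps = watts
    }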

Experimental Results
As Table 4 shows, we run six test sets composed of combinations of heavy/light applications and heavy/medium/light benchmarks. Figure 11 shows the execution makespan of running our six test sets with static best-performing task mapping and with SURF's dynamic task mapping. The static best-performing mapping configures each task to run on its best-performing compute unit according to the profiling results, without SURF. SURF's dynamic task mapping outperforms static mapping by 24% on average. Table 4 also shows that the speedup increases with the level of the background benchmark workload: for heavy background benchmarks, a single run occupies the GPU and DSP compute resources for a long time, which creates opportunities to map alternative kernels onto other underutilized compute units. The light applications achieve better speedup than the heavy applications because the light application setup experiences more contention with background processes over the entire makespan, and it is easier to find alternative kernels since the kernels in one task tend to have similar performance in the light workload configuration.

Figure 12 shows the sum of all device utilizations over the makespan, including little/big CPU, GPU, and DSP utilization, when running each test set (max 400% across the 4 classes of units). GPU and DSP utilization are similar across all test sets (increased by 4.51% and 3.38% respectively), since the GPU and DSP are already heavily exercised. The big CPU is better utilized (increased by 30.6%) by our dynamic scheme, and is the major contributor to the speedup.

Further experiments were conducted in which the DSP-dominant background processes (bilateral and epsilon) are not executed, leaving only the foreground applications and the GPU-dominant background processes. Figures 13 and 14 show the resulting makespan and utilization respectively. SURF's dynamic scheme outperforms static mapping by 27% in performance, slightly better than the previous experiments, because more resources are available while the GPU is saturated. The utilization of the big CPU and DSP increased by 43.15% and 8.13% respectively, showing that part of the computation is offloaded to them.

While these preliminary experimental results demonstrate SURF's efficacy in exploiting underutilized compute units to improve performance, the current policy, which applies the HEFT algorithm introduced in Section 4.3, is not power- and energy-aware. As a result, power and energy consumption increase by 62.8% and 31.6% on average, as shown in Figure 15. We speculate that the current implementations of the computational kernels make it infeasible for SURF to deploy an energy-aware policy because they are not optimized for the hardware architecture. Hence, the trade-off between performance and energy becomes trivial: either high performance with high energy consumption, or low performance with low energy consumption. We expect to see reductions in energy consumption once the kernels are optimized under an energy-aware policy. This development is currently ongoing.

Related Work

Heterogeneous resource management has been widely studied, with a large body of existing work on task scheduling/mapping algorithms [17] [8] [4] [13]. For instance, Topcuoglu et al. [17] propose the heterogeneous earliest finish time (HEFT) algorithm, which schedules tasks in a directed acyclic graph (DAG) onto devices to minimize execution time. Choi et al. [4] estimate the remaining execution time of tasks on the CPU and GPU using a history buffer and select the most suitable device. The StarPU [2] framework targets high-performance computing and enables dynamic scheduling between CPU and GPU based on static knowledge of the tasks. Zhou et al. [19] perform task mapping onto heterogeneous platforms for fast completion time. Some recent efforts also address domain-specific platforms: Wen et al. [18] and Bolchini et al. [3] propose dynamic task-mapping schemes specific to OpenCL; Georgiev et al. [6] propose a memetic-algorithm-based task scheduler for mobile sensing workloads; and Aldegheri et al. [1] present a framework that allows multiple programming languages and exploits their different levels of parallelism for computer vision applications, achieving better performance and energy consumption. SURF is distinguished from these works in two ways. First, the SURF framework comprises a runtime system for task mapping together with APIs for mobile systems. SURF is built on top of existing programming interfaces, dynamically profiles task execution, and performs task mapping without user-provided static knowledge. Second, SURF is self-aware: it is aware of the heterogeneous hardware architecture, the existing scheduling schemes, and the runtime system status. It accounts for resource contention within individual compute units, whereas other works assume every compute unit is exclusive to a single task (an assumption that does not hold for the CPU). The device concurrency of hardware accelerators is also ignored in these previous works.

Conclusion
In this article, we presented the architecture of SURF, a self-aware unified runtime framework built on top of existing programming interfaces (OpenMP, OpenCL, and the Hexagon DSP SDK) for mapping tasks onto the CPU, GPU, and DSP respectively in mobile SoCs. We illustrated how to use SURF's application interfaces to create and execute a SURF task. SURF performs task mapping while being aware of existing scheduling schemes, intra- and inter-compute-unit contention, and the heterogeneous hardware architecture, selecting the compute unit with the earliest finish time for a given task without user-provided static information about the tasks. Our early experimental results show an average speedup of 24% when running mixed mobile workloads comprising two applications (image recognition using convolutional neural networks, and an image filter) together with a couple of background processes sharing the compute units. Our ongoing work is incorporating more sophisticated mapping and prediction algorithms, and analyzing the performance and energy benefits of deploying SURF on emerging heterogeneous mobile platforms.

Fig. 8: Sample code of SURF application interfaces, including SURF buffer, task, and kernel creation, as well as SURF task execution and termination.

Fig. 15: Power and energy consumption for the adaptive HEFT policy in SURF.

Table 3: Details of applications and benchmarks used in our experimental sets.

Table 4: Speedup for different test sets.