1. Introduction

The last decade has seen dramatic new developments in Big Data systems after Hadoop/MapReduce became the mainstream software framework for distributed processing of Big Data. Several open-source software systems, such as Cloudera Impala [1], Apache Spark [2], AsterixDB [3] and Stratosphere [4], in addition to commercial products, have been developed to improve on Hadoop with respect to functionality, efficiency and usability [5] [6]. However, most existing mainstream Big Data systems are designed for relational data and graph data [5] [7]. While these represent two very different data types, one commonality of such Big Data systems is the adoption of a simplified data model that supports only element-wise operations in a streaming mode, such as scans and aggregations on tuples [5], vertices or edges [8]. This leaves more complex operations that are crucial for practical applications, such as theta joins, heavily under-supported. Although various research works have aimed at supporting such operations within existing MapReduce-like frameworks, the mismatches between the MapReduce data model abstractions and their runtime implementations have made their realizations inferior to native parallel/distributed implementations with respect to end-to-end performance.

Hardware architectures and platforms have evolved rapidly in the past few years, with significant impacts on processing Big Data. Many Big Data software packages, such as Hadoop, were initially developed for inexpensive commodity workstations with uniprocessors. Their implementations on multi-core CPUs with large memory capacities and deep cache hierarchies can result in significant performance degradation when compared with native parallel/distributed implementations that take these factors into consideration. Furthermore, to the best of our knowledge, very few Big Data systems are capable of utilizing the fast-increasing computing power of hardware accelerators, such as commodity Graphics Processing Units (GPUs) [9] and Intel Xeon Phi devices [10], for much desired high performance. As modern Cloud computing resources are increasingly equipped with new hardware [11], such as multi-core CPUs, many-core GPUs, fast storage (e.g., SSDs) and fast networking (e.g., InfiniBand), it is desirable to develop new data parallel processing techniques that take this new hardware into consideration to fully utilize the new generation of Cloud computing platforms.

Spatiotemporal data [12] [13] [14] is one of the fastest growing types of data due to advances in sensing, modeling and navigation technologies and newly emerging applications, such as satellite imagery for Earth observation [15], environmental modeling for climate change studies [16] and GPS data for location-dependent services [17]. While such data is ubiquitous and crucial for both society as a whole and individuals, there is little foundational work on developing efficient and elastic Big Data techniques for spatiotemporal data in parallel and distributed computing environments. Previous works (e.g., [18] [19]) mostly focus on extending Hadoop for spatiotemporal data to achieve scalability for batch processing while leaving more fundamental spatiotemporal data management issues untouched. We envision that the desired features of the next generation of high performance data management systems for spatiotemporal Big Data include: 1) natively supporting spatiotemporal data models through domain-specific abstractions, 2) allowing fine-grained data parallelisms to be exploited on parallel hardware (for efficiency) within bounded memory budgets (for elasticity), and 3) facilitating principled random accesses to both data and indexes in distributed file systems (for efficiency and programmability).

In this proposal, we aim at laying the foundations for the next generation of Big Data systems for spatiotemporal data that are not only scalable but also efficient and elastic. While our primary focus is to effectively support spatiotemporal Big Data management and analytics, which has not yet received much research attention, we believe many of our proposed ideas are applicable to relational data and certain graph data to enhance the functionality and performance of existing Big Data systems. Our specific aims are: 1) to design a unified framework for indexing and processing spatiotemporal Big Data in the Cloud, 2) to develop a framework and techniques to support data parallel operations within bounded memory capacities, and 3) to design and implement data parallel techniques for popular spatiotemporal operators on both multi-core CPUs and accelerators by harnessing the Single-Instruction Multiple-Data (SIMD) computing power of modern parallel hardware [20] and demonstrate practically achievable efficiency.

2. Background, Motivation, Related Work and Proposal Roadmap

Big Data systems and techniques are playing an increasingly important role in our everyday life. Insights and knowledge derived from linked Big Data have been demonstrated to be effective in various application domains, ranging from understanding human mobility from GPS traces and taxi trip records [17] to global climate change and biodiversity research [21]. Fig. 1 illustrates a climate application scenario of managing and processing large-scale spatiotemporal environmental datasets.

Fig. 1 An illustrative application scenario of large-scale climate data processing

In a way similar to the evolution of databases, which started with handling simple data and gradually added support for complex data due to application needs, there are increasing practical demands for developing Big Data systems that support spatiotemporal data in the Cloud. Given the complexity of developing full-fledged Big Data systems, which typically consist of hundreds of thousands of lines of code (LOC), a natural way to support new operations on spatiotemporal data is to extend existing Big Data systems instead of developing new systems from scratch. However, as argued in Section 2.1, extension-based approaches are inherently ineffective for spatiotemporal data and a more fundamental approach is needed (research Aim 1). We discuss the importance of disciplined utilization of limited memory in Section 2.2, which motivates our research Aim 2. We further argue (Section 2.3) that it is very difficult to fully utilize the SIMD computing power and high intra-node memory bandwidth of modern parallel hardware by adopting existing Big Data platforms, which motivates our research Aim 3. We subsequently outline our proposal roadmap (Section 2.4) for developing frameworks and techniques that have the potential to significantly improve end-to-end system performance in processing spatiotemporal Big Data in the Cloud.

2.1. Data Abstraction beyond Relational Extensions

Most existing Big Data systems have adopted a partition-based data abstraction approach in favor of shared-nothing distributed computing frameworks with node-level parallelisms. While the MapReduce model assumes no communication among partitions, software systems such as Hadoop typically implement bulk data transfer through a separate shuffle phase that is both disk I/O and network bandwidth intensive. Although traditional sequential computing techniques (including indexing) can easily be integrated into both Map and Reduce tasks within partitions, almost all operations that require relating multiple partitions, e.g., data/index accesses and colocating multiple datasets, mandate significant changes to runtime libraries [22] [23] [24] [25] [26]. Different from table-structured relational data, flat partitions in MapReduce systems (typically based on hash partitioning) may not be suitable for spatiotemporal data, which can have complex structures (e.g., the polygon-ring-vertex hierarchy of complex polygons [27]). Coarse-grained partitions may also limit the opportunity to expose fine-grained parallelisms in applications that can readily be exploited by modern hardware, e.g., the SIMD computing power discussed in Section 2.3.
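
To make this concrete, the following toy sketch (our own illustration, not code from any of the cited systems) contrasts a hash-based partition key, which scatters spatially nearby features across partitions, with a simple grid-based key that preserves the locality that spatial indexing and joins rely on:

```scala
// Toy partition keys for a feature identified by `id` located at (x, y).

// Hash partitioning (typical in MapReduce systems) ignores geometry,
// so nearby features usually land in different partitions.
def hashKey(id: Long, numPartitions: Int): Int =
  (id.hashCode & Int.MaxValue) % numPartitions

// A spatially aware grid key: features falling in the same cell of size
// `cellSize` share a key, preserving spatial locality within partitions.
def gridKey(x: Double, y: Double, cellSize: Double): (Int, Int) =
  (math.floor(x / cellSize).toInt, math.floor(y / cellSize).toInt)
```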

Different from relational and graph data, whose queries are mostly based on equality tests (e.g., equi-joins), spatiotemporal data typically involves complex theta joins (e.g., spatial joins based on point-in-polygon tests), where hierarchical tree-based indexing is crucial for performance. Current Big Data systems that adopt a relational model not only make it difficult to incorporate existing or new spatial indexing techniques for efficient query processing, but also force their extensions to store non-relational data as strings [18] [28]. This is not only cumbersome but also significantly increases data volumes (and hence disk I/O), which is a major source of their low efficiency based on our evaluations on real-world datasets [29]. Three research prototype systems, HadoopGIS [18], SpatialHadoop [19] and SpatialSpark [28], extend existing Big Data systems to manage spatial data and process spatial joins; we discuss them in detail below. Compared with HadoopGIS and SpatialHadoop, which are based on Hadoop, our SpatialSpark is based on Spark's Resilient Distributed Datasets (RDDs), exploits the efficiency of in-memory processing and is capable of achieving higher performance, as discussed in [29]. Several similar works, such as GeoSpark [ ], SparkGIS [ ] and Simba [ ], have subsequently been proposed for geospatial data processing. We note that, different from the other prototype systems, which use geometry libraries (JTS in particular) and thus support spatial operations on all major geometry types, Simba implements several operations on point data from scratch and achieves higher performance. Unfortunately, Simba does not yet support operations on non-point data types, which are more complex. Despite the significant performance improvements of Spark-based systems over Hadoop-based ones, we argue that extension-oriented approaches, while feasible and practically useful to a certain extent, lack essential support for processing spatial join queries efficiently in distributed computing environments. These issues are likely to escalate when attempting to further extend Big Data systems to process spatiotemporal data, where spatiotemporal data types and their operators, the complexities of indexing and query processing, and the diverse semantics of application domains may grow significantly and are well beyond the simplified abstraction capabilities of MapReduce and RDDs for processing spatiotemporal data efficiently.
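
To make the filter-and-refine nature of such theta joins concrete, below is a minimal sketch (our own illustration; the prototype systems discussed next rely on geometry libraries such as JTS and on tree indexes rather than this brute-force MBR filter) of a spatial join based on point-in-polygon tests:

```scala
case class Point(x: Double, y: Double)

case class Polygon(id: Long, ring: Array[Point]) {
  // Minimum bounding rectangle: a cheap filter before the exact test;
  // hierarchical tree indexes (e.g., R-trees) accelerate exactly this step.
  val xMin = ring.map(_.x).min; val xMax = ring.map(_.x).max
  val yMin = ring.map(_.y).min; val yMax = ring.map(_.y).max

  def mbrContains(p: Point): Boolean =
    p.x >= xMin && p.x <= xMax && p.y >= yMin && p.y <= yMax

  // Classic ray-casting test: cast a horizontal ray from p and count edge
  // crossings; an odd count means the point is inside the polygon.
  def contains(p: Point): Boolean = {
    var inside = false
    var j = ring.length - 1
    for (i <- ring.indices) {
      val (a, b) = (ring(i), ring(j))
      if ((a.y > p.y) != (b.y > p.y) &&
          p.x < (b.x - a.x) * (p.y - a.y) / (b.y - a.y) + a.x)
        inside = !inside
      j = i
    }
    inside
  }
}

// Filter-and-refine join: the MBR test prunes most candidate pairs and
// the exact point-in-polygon test refines the survivors.
def pointInPolygonJoin(points: Seq[Point],
                       polygons: Seq[Polygon]): Seq[(Point, Long)] =
  for {
    p    <- points
    poly <- polygons
    if poly.mbrContains(p) && poly.contains(p)
  } yield (p, poly.id)
```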

Array databases, such as SciDB, that are capable of distributed execution are also closely related to our proposed research. As time-evolving rasters, such as satellite images and environmental modeling outputs, can naturally be represented as time-serial multi-dimensional arrays, array databases can be exploited to answer certain spatial and spatiotemporal queries [ ]. Different from Hadoop-based systems, SciDB is designed for in-memory processing and provides an efficient distributed execution engine based on Message Passing Interface (MPI) communication, and it is likely to achieve higher performance in processing raster-based spatiotemporal data. However, in a way similar to the long-standing debates on raster- and vector-based geospatial models, the data models of array databases suffer from accuracy-related problems when handling vector data. For non-point spatiotemporal data with spatial and/or temporal extents that are likely to overlap, including polylines, polygons, trajectories and moving regions, array databases are much less applicable in supporting the relevant spatiotemporal queries. Furthermore, many spatiotemporal queries involve both vector and raster data, where one or both sides of the queries can be time-evolving, which may also be cumbersome for array databases. Our experiences with SciDB also suggest that, while SciDB is well suited for the application scenarios it was designed for, it lacks many features that are essential in processing spatiotemporal data, such as theta joins based on testing spatial and/or spatiotemporal relationships and efficient aggregations on static or moving regions. Furthermore, as the only currently active array database implementation, SciDB supports only kd-tree indexing and its extensibility is unclear. Finally, while there are both advantages and disadvantages to not using a distributed file system (e.g., HDFS) when compared with Hadoop-based systems, SciDB is more difficult to extend in general due to the tight coupling of distributed computing and data communication. The lessons learned, which are also applicable to other types of data, have motivated the research Aim 1 detailed in Section 3.

HadoopGIS [18] and SpatialHadoop [19] are among the leading works on supporting spatial data management by extending Hadoop. We have also extended Apache Spark for spatial joins and developed SpatialSpark [28], which has achieved significantly higher performance than both HadoopGIS and SpatialHadoop [29]. As spatial join is among the most complex and computationally intensive spatiotemporal operations, we analyze its implementations in the three research prototype systems to illustrate the technical challenges of extending existing Big Data systems, originally developed for relational data, to support spatial operations.

HadoopGIS adopts the Hadoop Streaming framework [30] and uses additional MapReduce jobs to shuffle data items that are spatially close to each other into the same partitions before a final MapReduce job is launched to process the re-organized data items in the partitions. SpatialHadoop extends Hadoop at a lower level and supports random access to both raw and derived data stored in the Hadoop Distributed File System (HDFS) [31]. By extending the FileInputFormat defined by the Hadoop runtime library, SpatialHadoop is able to spatially index input datasets, explicitly access the resulting index structures stored in HDFS and query the indexes to pair up partitions before a Map-only job is launched to process the pairs of partitions on distributed computing nodes. SpatialSpark is based on Apache Spark [32]. Spark provides an excellent development platform by automatically distributing tasks to computing nodes as long as developers can express their applications as data parallel operations on collection/vector data structures, i.e., Resilient Distributed Datasets (RDDs) [2]. The automatic distribution is based on the key-value pairs of RDDs, which largely separates domain logic from parallelization/distribution.

HadoopGIS and SpatialSpark are forced to access data sequentially within data partitions due to restrictions of the underlying platforms (Hadoop Streaming for HadoopGIS and Spark for SpatialSpark). These restrictions, due to the streaming data model (for HadoopGIS) and the Scala functional programming language (for SpatialSpark), significantly lower the two systems' capabilities in efficiently supporting spatial indexing and indexed query processing. Indeed, spatial indexing in the two systems is limited to intra-partition use and requires on-the-fly reconstruction from raw data. The implementations of spatial joins on two datasets are conceptually cumbersome, as partition boundaries are invisible and cross-partition data references are supported by neither Hadoop Streaming nor Spark RDDs. To solve the problem, both HadoopGIS and SpatialSpark require additional steps to globally pair up partitions based on spatial relationships (spatial intersection in this case) before parallel/distributed local joins on individual partitions. While the additional steps in SpatialSpark are implemented as two GroupBy primitives in Spark, which are efficient for in-memory processing, they have to be implemented as multiple MapReduce jobs in HadoopGIS, and significant data movements across distributed computing nodes (including the Map, Reduce and Shuffle phases) are unavoidable. The excessive disk I/O is very expensive and largely contributes to HadoopGIS having the lowest performance among the three systems [29]. On the other hand, while SpatialHadoop supports storing, accessing and indexing geometric data in binary formats with random access capabilities by significantly extending the Hadoop/MapReduce runtime libraries, its performance is significantly limited by Hadoop and is inferior to SpatialSpark for data-intensive applications [29], largely due to the performance gap between disk and memory.
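
The global-pairing-then-local-join structure described above can be sketched schematically on Spark RDDs as follows (our simplified illustration, not SpatialSpark's actual code; `localJoin` is an assumed per-pair join function, and the cartesian product plus MBR filter stands in for index-based partition pairing):

```scala
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Axis-aligned minimum bounding rectangle of a spatial partition.
case class MBR(xMin: Double, yMin: Double, xMax: Double, yMax: Double) {
  def intersects(o: MBR): Boolean =
    xMin <= o.xMax && o.xMin <= xMax && yMin <= o.yMax && o.yMin <= yMax
}

def pairedSpatialJoin[A, B, R: ClassTag](
    left:  RDD[(MBR, Iterable[A])],   // spatially partitioned dataset 1
    right: RDD[(MBR, Iterable[B])],   // spatially partitioned dataset 2
    localJoin: (Iterable[A], Iterable[B]) => Iterator[R]): RDD[R] =
  left.cartesian(right)
    // Global step: keep only partition pairs whose MBRs intersect.
    .filter { case ((m1, _), (m2, _)) => m1.intersects(m2) }
    // Local step: join each surviving pair independently in parallel.
    .flatMap { case ((_, as), (_, bs)) => localJoin(as, bs) }
```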

Learning from these past experiences, we envision that an abstraction that is similar to Spark's RDD (in-memory processing with a disk persistence option) but supports principled random accesses (to both inter- and intra-partition data items and indexes), essentially combining the advantages of SpatialHadoop and SpatialSpark, is highly desirable. However, as a partition is the basic unit of parallel and distributed processing in Hadoop, it can be difficult to dynamically adjust the granularities of parallelism in spatiotemporal data processing (e.g., for load balancing) to achieve the desired high efficiency on top of Hadoop. It is thus also desirable to extend the two-level data partition scheme to multiple levels to explore higher degrees of parallelism, better utilize parallel hardware and facilitate more flexible and efficient scheduling. In contrast to SciDB, which requires a homogeneous array representation, our abstraction is capable of handling both vector and raster data, which mandates a heterogeneous data representation. Given that both vector and raster data can be indexed at the partition (or chunk/segment/tile) level, the heterogeneous representation will only impact local spatiotemporal query processing of individual paired partitions on a single computing node and will not significantly increase implementation complexity when compared with extending/adopting array databases. Our proposal on developing a unified framework for indexing and processing spatiotemporal data is detailed in research Aim 1 in Section 3.
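
A minimal sketch of the envisioned multi-level, heterogeneous abstraction follows (the names and structure are our own illustration rather than an existing API, and the MBR type is reused from the sketch above): every node carries an MBR so that both inter- and intra-partition accesses can be index-driven at any granularity, while leaves may hold either vector features or raster tiles:

```scala
// Multi-level partition tree: inner nodes refine the granularity of
// parallelism; leaves hold heterogeneous vector or raster payloads.
sealed trait STPartition { def mbr: MBR }
case class Inner(mbr: MBR, children: Seq[STPartition]) extends STPartition
case class VectorLeaf(mbr: MBR, featureIds: Array[Long]) extends STPartition
case class RasterLeaf(mbr: MBR, tile: Array[Float], width: Int) extends STPartition

// Principled random access: descend only into subtrees whose MBR
// intersects the query window, stopping at whatever level is scheduled.
def query(node: STPartition, window: MBR): Iterator[STPartition] =
  if (!node.mbr.intersects(window)) Iterator.empty
  else node match {
    case Inner(_, children) => children.iterator.flatMap(query(_, window))
    case leaf               => Iterator.single(leaf)
  }
```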

2.2. Towards More Elastic Big Data Systems

Assuming that the capacity of disk storage is unlimited, Hadoop-based systems are robust and fault-tolerant in the sense that all intermediate results are written to external storage and become persistent. On the other hand, in-memory systems such as Spark and Impala require large memory capacities to avoid frequent failures. However, memory is used largely in an unprincipled way in applications based on Spark (e.g., SpatialSpark [28]) and Impala (e.g., our ISP-based prototype systems [33] [34]): memory is allocated for operations as they are executed, and programs are aborted when the memory limit is reached. There is neither prediction/estimation of the required memory capacity nor a guarantee that a program can run within a certain memory budget. It is thus desirable to combine the advantages of Hadoop-like systems and in-memory processing systems by developing operators that require bounded memory capacities. For a memory budget that is above the aggregated minimum memory requirement of the relevant operators in an application, the application is guaranteed to succeed while enjoying the efficiency of in-memory processing. As proposed in research Aim 2 in Section 4, our idea is to associate a limited memory budget with operators that are designed to support parallel and distributed computing.
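
As a conceptual sketch of this idea (our own illustration, not an existing API), an operator could declare a memory floor for a given input size so that a scheduler can verify admissibility before execution instead of aborting mid-run:

```scala
// An operator that declares the minimum memory it needs for a given input
// size (e.g., for its index and buffers) and promises to run within any
// budget at or above that floor, chunking/spilling internally as needed.
trait BoundedMemoryOp[I, O] {
  def minBytes(inputBytes: Long): Long
  def run(input: Iterator[I], budgetBytes: Long): Iterator[O]
}

// Admission check: an application whose operators' aggregated memory
// floors fit the budget is guaranteed to complete in memory.
def admissible(ops: Seq[BoundedMemoryOp[_, _]],
               inputBytes: Long, budgetBytes: Long): Boolean =
  ops.map(_.minBytes(inputBytes)).sum <= budgetBytes
```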