Big Data Analytics Using SQL: Quo Vadis?

. Big Data processing and analytics are dominated by tools other than SQL based relational databases, which have lost their numero uno status. In a world deluged by data, the general perception is that SQL databases play a marginal role even for analyzing structured Big Data despite their inherent strengths in processing such data. Focusing on the most important aspect of Big Data processing, namely analytics for data mining, we examine the validity of this perception through a study of competing technologies, published results on SQL implementations of data mining algorithms, the impact of cloud platforms and the raging debate on SQL vs NoSQL vs NewSQL. Contrary to the general belief, it appears that SQL databases in their parallel, columnar deployments on cloud with UDF support do solve some, if not all, Big Data problems and are not likely to become dinosaurs in Big Data era.


Introduction
There is a churn in the data processing world leading to a metamorphosis of database technology as understood in latter years of 20th century and early 21st century.The deluge of data in a variety of forms, from a connected world driven by internet and mobile technologies, has ushered in Big Data [1], [2] and new paradigms of query processing that appear to be beyond the realm of relational model and SQL, the lingua franca of database systems.If the data wave from an internet used by humans dislodged DBMSs from numero uno status, how would they fare with data tsunamis likely to arise from sensor driven Internet of Things and its use in oncoming Industry 4.0 [3], the 4th industrial revolution?Big Data, with its goal of deriving value through analytics for informed decision making and its envisaged role in Industry 4.0, brings to fore the question of quo vadis (whither going)? on relational, SQL databases.Are they relevant anymore for processing Big Data using techniques [4], [5]  To understand and answer these questions, this paper presents a state-ofthe-art survey of SQL databases and contending technologies for Big Data processing, focusing more closely on Big Data analytics and its solutions realized within a relational database emphasizing algorithms, SQL techniques, platforms and products.Unlike other Big Data analytics studies [6], [7], which adopt only a NoSQL perspective, we examine the role of relational SQL for such analytics.
This paper is organized as follows.Section 2 summarizes evolution and status of contending technologies: SQL, NoSQL and NewSQL.Section 3 addresses issues in Big Data analytics implementation within a SQL relational database, presenting results obtained until now.Section 4 touches upon related topic of platforms and products, in context of cloud.Section 5 summarizes criticisms and limitations of competing Big Data technologies; section 6 concludes the paper.

Data Processing Systems
Since the late seventies of 20th century, relational model and products based on it using row stores have dominated data processing applications.The rise of the internet, Big Data, IoT and novel applications have questioned this vicelike grip and given birth to a host of newer data processing technologies.The contending technologies, SQL databases, NoSQL systems and the most recent entrant NewSQL systems are discussed in subsections 2.1 to 2.3.

SQL Databases
SQL DBMSs targeted business data processing for enterprise systems and were very successful in providing OLTP solutions for ERP, SCM, CRM, banking, etc. ERP, the back bone of EIS has been somewhat immune [8] to Big Data, and advances in it, but is likely to be shaken up by Industry 4.0.
Applications used a row store with SQL and relational model supported by indexes, ACID (atomicity, consistency, isolation and durability) transactions and cost-based query planners.Based on business intelligence requirements, SQL DBMSs evolved to tackle analytic query processing of data warehouses optimized for dimensional modeling.For performance gains, they adopted parallel programming techniques to run on a shared nothing cluster of nodes as MPP SQL systems partitioning data across nodes.SQL appliances were next, led by Netezza: MPP row store with custom FPGA hardware for query processing.
Despite significant progress in SQL row store DBMSs, over a decade ago, 2014 Turing Award winner, Stonebraker argued [9] that requirements and characteristics of data centric systems vary widely and the then prevalent architecture of databases as all-encompassing, monolithic, "one size fits all " systems was no longer relevant or applicable.They categorize extant DBMSs as "outbound " systems that must write before processing, and illustrate their unsuitability for -low latency systems for algorithmic trading, essentially "inbound " systems -data warehousing and OLAP, better served by column stores -scientific databases that require native support for arrays -text engines: custom solutions for web (inbound), medical/legal/library data -semi-structured data such as XML, JSON, etc., common in Web 2.0 -IoT sensor network processing systems, akin to low latency systems Advocating use of domain specific DB engines, they show performance advantage for such engines over row databases for the first four applications in stream processing, data warehousing, scientific databases and text management.Authors of [9] conclude with the prescient observation "'one size fits all' theme is unlikely to successfully continue under these circumstances".
Today, we have a variety of special purpose data processing systems that neither use relational model nor SQL but adopt newer programming paradigms.In data warehousing market, SQL databases changed their underlying store model to columnar and continue to retain their OLAP market share.
Based on an eighties proposal for database page storage by vertical partitioning [10] of columns, research prototypes MonetDB [11], [12] and C-Store [13] pioneered column oriented databases that were followed up by successful commercial products.Today, almost all industrial DBMSs support some form of column store, simulated or modern, [14] with performance gains over only-row counterparts due to IO reduction, compression, late materialization, etc.

NoSQL Systems
The difficulties in scaling up SQL database systems for on-line, web scale processing, and in tune with the thinking against one-size-fits-all, NoSQL [1], [15] systems were born.They were built on non-relational models abandoning ACID conformance of databases.The underlying models of NoSQL (Not only SQL; not No to SQL) systems include key/value, documents, columnar, graphs and streams.For the type of applications targeted, NoSQL proponents believe that -the rigid relational schema was inflexible, particularly for unstructured data -guaranteeing ACID properties of transactions reduced performance -emphasis on high availability is paramount -horizontal scalability is preferable to expensive vertical scalability of DBMS -non-procedural SQL was not the most suitable programming language The most significant impetus for NoSQL systems came from CAP theorem [16] that states an impossibility result for trade-offs in implementing distributed systems: a network shared-data system can have only two of three desirable properties consistency (C), high availability (A) and partition (P) tolerance.The CAP theorem led to an alternative view of transactions favoring "availability, graceful degradation and performance" [16] over consistency: BASE standing for basically available (BA), soft-state (S) and eventual consistency (E).
Several NoSQL systems choosing availability and partition tolerance over consistency were built for a diverse set of data processing applications.Table 1 summarizes features of some major NoSQL products.Two other contemporaneous developments contributed to growth and popularization of NoSQL systems: -MapReduce [17], Google's divide-and-conquer software framework for distributed systems: Inspired by LISP like functional programming style, it advocates programming of distributed applications using two functions Map and Reduce that work on key/value pairs with their execution, including failures, handled by the framework.-Hadoop a distributed file system from Apache: An open source system with the goals of performance, availability and scalability, modeled on the proprietary Google File System underlying MapReduce, it made MapReduce style application development popular in the community.
Relieving the programmer of managing a parallel application running in a distributed environment of commodity clusters, with fault tolerance support, was a major step.Hadoop contributed to the meteoric rise and adoption of MapReduce for solving web scale data processing problems, with several NoSQL products incorporating the MapReduce model as shown in Table 1.

NewSQL Products
Disputing the importance of BASE over ACID, Stonebraker et al [18] investigate reasons for under performance of SQL databases in OLTP applications of Big Data era and cite locking, latching, recovery and buffer pool management as the reasons.Eliminating these bottlenecks in their prototype H-Store database they claim 82x performance gain in TPC-C benchmark compared to row DBMSs.
The next few years heralded the term NewSQL [19] coined to refer to a class of products that preserve relational model, ACID transactions and SQL but offer NoSQL like performance and scalability for OLTP read-write workloads.Elaborating on applications of NewSQL systems [19] characterizes them as "executing read-write transactions that (1) are short-lived, (2) touch a small subset of data using index lookups and (3) are repetitive".They also observe that their characterization of NewSQL is in consonance with the more narrow definition of [18]: being lock-free and using a shared nothing distributed architecture.
As of 2016, [19] lists seventeen products as NewSQL systems including SAP HANA, Amazon's Aurora and both H-Store and VoltDB from Stonebraker et al.It is interesting to note that 15 of the 17 products listed in Table 1 of [19] use MVCC [20] for concurrency control found in several row store SQL DBs including open source PostgreSQL, and SQL column stores for analytic workloads.

Big Data Analytics
The term Big Data is all around us and became part of 21st century English when Oxford dictionary defined it in 2013.Most authors characterize Big Data [1], [2], [6], [7] through 3Vs (volume, velocity and variety) or more (veracity, variability, value).In the user community, as well as lexicon definition, the stress for Big Data has been on data mining, generally understood as discovery of models for data using statistics, machine learning and computer science [4], [5].
Though both data mining and analysis techniques predate Big Data, they are current hot topics due to technology challenges and commercial value gained by enterprises through insights gleaned from data.Technology challenges arise from the general perception of SQL databases being inadequate [1], [2], [7], [15] for Big Data processing, due to their difficulty in dealing with the 3Vs: volume: horizontal scalability, elastic or not, is an issue for DBMSs; the advocated vertical scalability is too expensive velocity: consequent to being outbound [9] systems, SQL databases do not perform well on streaming data for real-time analytics variety: heterogeneous data are anathema to SQL databases that deal well with structured data (e.g.number, boolean, varchar), partially well with semistructured data (e.g.XML, JSON) but are inadequate with unstructured data (e.g.tweets, text, video, audio) Does this imply the death knell of relational, SQL databases for Big Data analytics?No; there is a contrarian view that we examine in subsections of this section.Our focus is on high volume structured data and analytics on such data; we do not touch upon other aspects of Big Data processing: data collection, cleansing, loading or privacy.Section 3.1 discusses data mining algorithms, 3.2 addresses the question of mining using SQL and section 3.3 surveys published work on implementing data mining algorithms in relational DBMS using SQL.

Data Mining Algorithms
Data mining algorithms build models to classify data and the models may be used with unlabeled data for prediction or scoring.At a broad level, the learning techniques used by these algorithms may be classified [4], [5] as (a) supervised that uses a training set for correct classification and (b) unsupervised that discovers a model without any training set or a priori knowledge.
A large number of data mining algorithms addressing a variety of topics, clustering, classification, statistical learning, association mining, link mining, bagging/boosting, dimensionality reduction and regression have been published.The often quoted survey [21] discusses about 30 algorithms for the important unsupervised learning technique of clustering that partitions data into similar groups.Wu et al [22] conducted a survey to identify the top 10 data mining algorithms ranking them based on votes polled and citations.Table 2 summarizes details of top 10 data mining algorithms of Wu et al (three with same rank 7).Table 2 includes year of publication of algorithm, and a very brief note on nature of algorithm; more details of the algorithms may be found in [22], the original publication references cited therein, and in data mining books [4], [5].

Some observations on Table 2 algorithms rated highly by mining community:
-formulated on or before 2001, most in 20th century when Big Data was unknown; nothing intrinsic to NoSQL or Big Data in analytics techniques.-based on mathematics or statistics dealing with numbers or categorical data, both of which are essentially structured data.-mostly iterative in nature, a programming style that is not supported by a declarative language like SQL.
Though structured data processed by mining algorithms is well handled by SQL databases, iterative nature of algorithms has been a stumbling roadblock.DBMS vendors responded by including imperative style programming with SQL; external to DBMS in C/C++, Java, Python, etc. through ODBC/JDBC interfaces; and as internal database objects: stored procedures in PL/SQL type imperative SQL, or user defined functions (UDF) in C/C++, Java, Python, etc. External programs incur data transfer cost, while code in UDF or stored procedures runs in DBMS environment close to data with performance gains.
Adopting parallel techniques through MPP shared nothing systems, for performance gains with high volume data, originated in DBMS world: first such commercial MPP system from TeraData was in 1986 [1], [23].Evolution of SQL row DBs into column stores [14], targeting OLAP with better performance, has enhanced their suitability for structured data mining applications.
Ordonez investigates suitability of SQL databases [24] for implementing data mining algorithms and concludes that parallel columnar databases with UDFs can solve important Big Data problems.Row database vendors, Oracle, IBM, Teradata and Microsoft, offer data mining packages tightly coupled to their products modifying internal DBMS code with SQL extensions.As data mining involves development of newer, or modifications to existing, algorithms both needing source code access, such packages are not very popular.Implementations of some of the top 10 algorithms and others exist in user developed SQL.

Data Mining Algorithms in SQL
k-Means Clustering: The importance of sufficient statistics, smaller in size than data, for decoupling mining algorithms from data was highlighted [25], and used in [26] to scale k-Means for large databases.Size in thousands of [26] was scaled to millions [27] on a 4-nodes, parallel row DBMS of TeraData with standard and optimized versions, evaluating performance varying dimensions, clusters and data size.Standard version runs a bunch of SQL statements computing Euclidean distance between points and cluster centroids iteratively until termination.Optimized version improves performance with SQL tricks to reduce joins/groupings, UDFs and uses sufficient statistics of [26], which is defined as a triplet for data of d dimensions and size n to be partitioned into k clusters; sufficient statistics does not eliminate multiple scans due to iterations.
x i x T i cluster j quadratic sum; matrix (d x k) Using equations (1) to (3), cluster weight W j , cluster centroid C j and cluster variance R j for iteration termination are computed [27] as below: Apriori Association Mining: Association mining algorithm Apriori for market basket problems was programmed in SQL with UDFs [28] on DB2 system.Several alternatives for implementing [29] Apriori in DB2 SQL have also been explored: plain SQL using joins and subqueries, cache-mine, stored procedures and UDFs.Both papers use a non-parallel row DBMS.
Expectation Maximization: EM maximizes loglikelihood and for each point x i finds its probability for cluster j; version given in SQL [30] uses Mahalanobis distance on d dimensions and k clusters with C j being the mean vector of size d and R j the covariance matrix (d x d) with zeros for off-diagonal elements.
Sufficient statistics to improve performance of EM in SQL is suggested in [31].
PageRank: The algorithm that made Google the leader in web search was implemented [32] in SQL with query optimization on 4 parallel nodes of column store Vertica on publicly available real-life data sets (Twitter, Livejournal and YouTube) of large sizes: graphs varying from 81k to 41.6 million nodes, 1.7 million to 1.4 billion edges.They report competitive performance for SQL with NoSQL products GraphLab and Giraph with less system resource utilization for memory and read I/O; extending the comparison of PageRank to mixed graph and relational analysis problems SQL Vertica outperforms Giraph by 17x.
Naive Bayes: Assumes Gaussian classes and independence across dimensions to compute [33], [34] 7) Final scoring is to class c with maximum probability p(x i ) = max(p(x i |g)).SQL and MapReduce versions are compared in [34] with better performance for SQL.
kNN: Points in a multidimensional space are mapped [35] to one dimension defining z-value of a point by interleaving binary representation of its coordinates from MSB to LSB.For a point p i = (x i , y i ) in 2-d space, its z-value z p (x i , y i ) is: where bit k (v) is the kth bit of value v.The Z-order of points on z p is a SQL range query, generally preserving spatial locality.But, for a theoretical guarantee they use random shifts to define a γ-neighborhood and propose algorithms for approximate/exact kNN, distance based θ-join and kNN-joins; analyze complexity, implement in SQL to compare with others (iDistance & Medrank in SQL).z p is extended to real values, higher dimensions and queries with ad-hoc conditions.
Decision Trees: Two of Top-10, C4.5 and CART, are decision trees, which are greedy, recursive, memory/time intensive algorithms, but intuitive and used widely.Using sufficient statistics (counts: CC tables) for splitting, SQL and C++ middleware [36] shows scalable full tree construction with C4.5/CART like entropy measure for selection.Primitives in SQL based on CC tables for C4.5, CART, etc., are given in [37]; C4.5 is implemented [38] as Oracle PL/SQL stored procedure, and decision tree constructed [39] from SQL data cubes.
Graphs Mining: Graph analytics applications, like PageRank, use mining techniques to process graphs.Study in [32] also includes two other graph algorithms in SQL: single source shortest path (SSSP) and HCC to find connected components of a graph.In performance comparison of mixed graph and relational analysis for SSSP on Twitter data, Vertica SQL outperforms Giraph by 4x.
Others: Sufficient statistics is used to build other statistical models in SQL [31], [40]: linear regression/correlation for n variables; dimensionality reduction for preprocessing mining data by principal component analysis (PCA).Regression over 2 variables, multidimensional analysis (cube, rollup, grouping set) and windowing analysis (partitions, order, frames) are part of standard SQL.

Platforms and Products
The advent of cloud computing through pay-by-use public services democratizes grid/cluster computing: facilitates scale out, parallel applications and data storage for Big Data processing.Through a browser based GUI any data scientist with web access may harness the power of parallel computing and large stores without recourse to high capital investment or a team of system specialists.Multiple vendors offer managed IaaS (Infrastructure as a Service) environments with choice of configurations to suit budget and application requirements: Amazon AWS, Microsoft Azure, CenturyLink, Samsung, INAP, Alibaba, etc. Several MPP SQL analytics products are available on public clouds along with competing NoSQL products.Table 3 lists some leading relational SaaS products for Big Data analytics on cloud; all products listed support horizontal scaling.Table 3 categorizes the listed cloud products on high level criteria for Big Data analytics rather than a detailed evaluation of SQL features support: 1. DB Store: Column store has been shown to have better performance for analytics than row store [14]; five of the six products support a native column store with compression; Greenplum is a native row store with restricted, append-only support for column store; dbX is a hybrid store with no serious use-case restrictions: modern column store with compression and row store.Some products also offer API interfaces to external open source data mining packages such as R and MADLIB, an approach similar to vendor specific mining products.Both Vertica and Greenplum offer customized version of R compatible to their products; additionally, Greenplum SQL may also be used with MADLIB.

Discussion
MapReduce, key-enabler of most NoSQL systems, is compared [41] with two SQL parallel databases (Vertica and row DBMS) on clusters of 100 nodes with large, synthetic web crawler type data.The benchmark used 5 tasks; grep task as in [17] and 4 DBMS analysis tasks: selection, aggregation, join and UDF aggregation.Both DBMSs outperformed MapReduce on all 5 tasks; average values: row store (3.2x); column store (7.4x).Data load was easier and faster on MapReduce.
Based on results of [41], criticisms by Stonebraker et al [23], [41] of MapReduce include (1) repetitive record parsing as data is stored in text form (2) lack of compression advantage: slowing down with block/record level compression (3) pull model of Reduce for data exchange with Map (4) absence of plan optimization (5) lack of high level interfaces and developer eco-system (6) schema less world.They surmise that MapReduce "is more like an extract-transformload (ETL) system" [23] and hence complementary, rather than competitive, to DBMSs.Basing their comparison on a technical evaluation, they perhaps understate the importance of MapReduce framework, Hadoop or GFS, which simplifies distributed application development, by managing everything including failures.
Mohan, inventor of ARIES recovery fundamental to ACID transactions, criticizes [42] NoSQL for oversimplifying complex issues with ad-hoc solutions, being expedient not rigorous, missing interactive query support and ignoring history.
Challenges for MapReduce in Big Data include [43]: (1) data storage issues without schemas (2) coding iterative analytics algorithms in MapReduce (3) performance overheads with correlated data for predictive modeling (4) harder interactive data exploration without high level interfaces like SQL (5) same issues as SQL DBs in low latency applications (6) lack of security and privacy features and its legal impact for proposed privacy regulations.
Examples of on-going work to mitigate the challenges: integration with SQL DBMSs such as Oracle and Greenplum, Spark and HaLoop to deal with iterative algorithms, Storm for low latency applications, SQL like Hive for interactive analytics, Mahout for data mining, etc.It appears that NoSQL systems are evolving like DBMSs into vertically segmented engines to address Big Data.
Revisiting CAP theorem, raison d'êtere of NoSQL systems, its proposer Brewer considers "'2 of 3' formulation was misleading because it tended to oversimplify the tension among properties" [44], and proposes alternatives to deal with partition tolerance.Stressing consistency-latency trade-off [45] suggests that CAP's 2-of-3 limitation, applicable only in the context of failures, has been misinterpreted to build limited systems.NewSQL prefers ACID to BASE.
Parallel database theory questions characterization of Big Data through 3Vs and suggests alternative dimensions: communication, iteration and failure [46].Despite high scale-out of a few NoSQL products, the comprehensive survey [15] discusses issues and limitations in horizontal scaling of other NoSQL products.
Social factors too contribute to perception of SQL not being suitable for Big Data analytics: (1) unlike parallel DBMSs, almost all NoSQL products are open source with no cost (2) limited mathematical exposure of programmers hampers translation of complex and iterative algorithms into declarative SQL.

Conclusion
It is an axiomatic fact that Big Data and its analysis are decision making drivers in a 21st century world driven by web, mobile and IoT technologies.To understand why 20th century numero uno tool for data processing, SQL rdbms, has lost its primacy we have briefly summarized its shortcomings that led to birth of competing technologies, NoSQL and NewSQL, and traced their evolution.
Focusing on the important aspect of analytics in Big Data processing, we examined suitability of SQL relational databases for data mining, and presented published work on data mining algorithms in SQL; majority of the top ten data mining algorithms and a few others in SQL solve Big Data analytics problems on parallel, columnar DBMS with UDFs.Cloud deployments of such products makes them more accessible, at lower cost and easy scale-out, even elastic.
Comparative discussion in section 5 shows that no technology is fully ready for all challenges of Big Data, more so for IoT and Industry 4.0 that could possibly use blockchain based P2P networks for devices [3], [47] along with multiple options on cloud: NewSQL stream DBMSs, NoSQL Storm, or even parallel, columnar MPP DBMSs fed by distributed streaming platform Apache Kafka.
We observe a convergence of technologies, and note that relational model and SQL are unlikely to disappear, a view endorsed by [19]: "all of the key systems in these groups will support some form of relational model and SQL".
of data mining?If yes, on what type of data and at what scale?If no, what are the alternatives?

Table 2 .
IEEE KDD Top-10 Data Mining Algorithms sufficient statistics, N g , L gh and Q gh , for g classes of training set across d dimensions (h ∈ 1..d) like in (1) to (3), and finds class prior π g , class means C gh and class variance R gh to classify new data by probability p(x i ).

2 .
Cloud: A product available on multiple cloud platforms offers mobility across IaaS platforms and imposes no restriction on the application.Only dbX is cloud agnostic: available on AWS, Azure and other smaller public clouds; Amazon and Microsoft products are tied down to respective vendor clouds; Snowflake is only on AWS; other two on both AWS and Azure.3. Store Type: Type of cloud storage impacts cost: irrespective of usage, products that run only on attached store must be 24x7 as data is lost on shutdown of compute instances.With network storage (AWS EBS or Azure premium IO), compute and storage are decoupled: shutting down compute instances preserves store for later use.dbX offers services on both store types; Redshift is preconfigured on attached; Vertica/Greenplum use or recommend attached.SQL Datawarehouse and Snowflake, targeting elastic scale-out, use a low cost store with poorer performance (Azure blob or AWS S3) as primary store caching retrieved data on attached store with attendant performance overheads; no shutdown data loss.Greenplum also uses external files on S3. 4. On Premise: On premise deployments are sought by users who may not want to store their data on public clouds for security/privacy reasons, or enterprise users who wish to build AaaS (Analytics as a Service) private clouds.Vertica and Greenplum are available as SQL appliances bundled on vendor hardware; dbX may be deployed on commodity clusters and even on other virtualized environments such as VMware; other DBs only on cloud. 5. UDF: Table 3 lists UDF support as sections 3.2 and 3.3 highlight importance of UDFs for data mining algorithms in SQL.Snowflake is the only product without UDF support; others offer it in different languages: PL/SQL type stored procedures (SQL DW, dbX, Greenplum), C/C++ (Vertica, dbX, Greenplum), Python (Redshift, dbX, Greenplum).