Recommending Database Architectures for Social Queries: A Twitter Case Study

Abstract. Database deployment is a complex task depending on a multitude of operational parameters such as anticipated data scaling trends, expected type and volume of queries, uptime requirements, replication policies, available budget, and personnel training and experience. Thus, enterprise database administrators eventually rely on various performance metrics in conjunction with existing company policies in order to determine the best possible solution under these constraints. The recent advent of NoSQL databases, including graph databases such as Neo4j and document stores like MongoDB, added another degree of freedom in database selection, since for a number of years relational databases such as PostgreSQL were the only available technology. In this work the scaling characteristics of a representative set of social queries executed on virtual machine installations of PostgreSQL and MongoDB are evaluated on a large volume of political tweets regarding Brexit. Moreover, Wiener filters for predicting the execution time of social query windows of fixed length over both databases are designed.


Introduction
Database deployment is by no means a trivial engineering task since it entails the fine tuning of a plethora of high- and low-level operational parameters including hardware management, data transformations and cleansing mechanisms, integrated access policy, communication with other production-grade systems, and query optimization techniques. Additionally, the complexity as well as the performance of each database installation should be assessed with respect to at least a representative set of anticipated queries [41]. With the advent of social media, databases have been widely adopted as a reliable means for storing large volumes of (semi)structured social data in enterprise environments.
Twitter is perhaps the most popular microblogging platform, suitable for online interplay mainly in the form of conversations [21]. The topics of the latter may well range from smart cities [34] and public health and the ongoing COVID-19 pandemic [22] to education [23] and sustainable development [31]. Nevertheless, it is political Twitter which has gained considerable popularity during recent years. In part this can be attributed to the fact that Twitter was the primary media vehicle of the 45th US President [35][40], with the 2016 US presidential elections being an important milestone [10][45]. As a result, Twitter is now widely considered a central political stage.
The primary research contribution of this conference paper is the assessment of two standalone instances of PostgreSQL and MongoDB handling queries closely related to social analytics. First, the scaling dynamics of four representative queries, forming the basis for three Twitter analytics, are evaluated on a large collection of English tweets regarding Brexit. Second, Wiener filters of various lengths are designed for the prediction of the execution time of fixed-length, non-overlapping query windows over these two installations. This point differentiates this work from previous approaches.
The remainder of this work is structured as follows. Section 2 summarizes the recent scientific literature regarding Twitter analytics, relational database management systems (RDBMSs), and document databases. The experimental setup and the proposed methodology are described in section 3. In section 4 possible future research directions are given. Random variables (r.vs) are represented with capital calligraphic letters. Technical acronyms are explained the first time they appear in the text. Table 1 summarizes the notation, including the mean value of a random variable X (equation (11)), the j-th autocorrelation coefficient ρj of a random variable (equation (10)), and the sets of followers and followees of the k-th account (equation (1)).

Previous Work
Relational databases have been a productivity mainstay almost since their introduction [27]. They rely on a tabular data format as well as on a number of integrity constraints to ensure both high performance and data consistency [16].
Their performance has been evaluated in multiple ways, including query scalability [42], indexing [30], and fault tolerance [17]. From an operational complexity viewpoint, normalized forms [29] and user rights assignment [24] play a central role. Document databases typically operate on structured documents, usually in extensible markup language (XML) or JavaScript object notation (JSON) format [4]. The prospect of developing normalization and embedding operations in MongoDB is explored in [20], while the expressivity of MongoDB queries is the subject of [8]. Neural networks in keras for predicting mentions of Twitter verified accounts, driven by data stored in MongoDB, are described in [26]. Finally, a persistent data structure with rollback capabilities which can represent graphs but also structured documents is presented in [25].
There is a plethora of social analytics for Twitter. Digital influence as inferred through numerous attributes is paramount [36], with affective influence also being taken into consideration [7]. Community discovery can also take many forms, such as spatio-linguistic [14], semantic based on hashtags [12], or hashtag similarity [44]. One-dimensional topological correlation for Twitter graphs based on structural and functional attributes is proposed in [13]. Recently, data-driven approaches have been developed for examining Twitter graphs [6]. In [18] a socio-technical analysis of tweets is presented. Twitter as a vehicle for political campaigns is examined in [3]. Surveys covering the topic are [33] and [2].

ACID vs BASE
During the 2010s the advent of NoSQL databases added new variations to the fundamental database models. Alongside the existing tabular data format of relational databases, dedicated infrastructure has been introduced for graphs, documents, associative arrays, and column families, each with a corresponding database type. Table 2 summarizes the main architectural differences between an RDBMS such as PostgreSQL and a document database like MongoDB.

Dataset
The two databases were installed on two separate virtual machines (VMs) running over a public cloud. Table 3 lists the technical specifications. In either case no indexing was activated. In the case of MongoDB the Python driver was used to write the code for the Twitter analytics examined here.
In order to test the query dynamics of each database, a Twitter dataset containing a large number of tweets pertaining to Brexit, a term encompassing major political events since 2018, was collected by a social crawler utilizing a topic sampling approach. Given that the actual Brexit took effect on February 1st, 2020, there was considerable interest in it during the few preceding months. Because of the paramount geopolitical importance of the event, a considerable number of English tweets was eventually collected over the four months preceding it. The main properties of the dataset are shown in table 3.

Twitter analytics
As stated earlier, social analytics abound both for the general case and for Twitter alone. In this work the following analytics will be used as a benchmark; more details about their respective implementations are given later in the text.

One way to assess the impact of the k-th Twitter account is to compute the respective followers-to-followees logarithmic ratio defined in equation (1). From another perspective, the digital influence of the l-th tweet, regardless of the influence of the respective posting account, can be measured by the number of hashtags Hl it contains, as shown in equation (2). An alternative for accounts, requiring combined information, is to count the total number of tweets Wk each account has posted multiplied by the number of the respective followers as a visibility metric, as shown in equation (3).

Observe that each of the above metrics captures a different aspect of Twitter activity, whether account- or tweet-oriented. In general, and to the best of the knowledge of the authors, there is so far no single metric describing Twitter influence or account online behavior. From a database standpoint this translates to the need for quantifying the performance of individual queries as well as of query sequences. This can be done in terms of total execution time, of system resources such as memory utilization or disk usage, or of operational complexity and personnel training costs. For the purposes of this work the total execution time has been selected as the single database performance criterion.
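Since equations (1)-(3) are only described verbally above (the original display equations did not survive extraction), the following LaTeX restatement is a hedged reconstruction consistent with those descriptions; the symbols F_k and G_k for the follower and followee sets of the k-th account, as well as the use of the natural logarithm, are assumptions rather than the authors' notation.

% hedged reconstruction of equations (1)-(3); F_k and G_k are assumed symbols
\[ J_a(k) = \ln\frac{|F_k|}{|G_k|} \qquad \text{(cf. equation (1))} \]
\[ J_h(l) = H_l \qquad \text{(cf. equation (2))} \]
\[ J_t(k) = W_k \, |F_k| \qquad \text{(cf. equation (3))} \]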

Response to individual queries
In order to execute the four queries described below, in the PostgreSQL case two tables were created, one containing information about accounts and one about tweets. After that, a set Q of reference queries, called social queries, was created. The wallclock time is a reliable indicator of the system execution time since these queries are strongly CPU bound [38]. Specifically, Q consists of the following queries, which are cast in SQL in addition to being verbally explained:

q1: This query selects rows based on the values of a given column. In social media it is frequently used to find accounts for which a given attribute satisfies certain numerical constraints. For instance, q1 can be a filter for accounts having at least one follower.
SELECT account FROM accounts WHERE followers >= 1;

q2: This query is similar to q1 but the results are additionally sorted in descending order based on the values of a given column. Continuing the previous example, q2 returns the same accounts as q1 in descending order of followers. This is a popularity metric as well as a crude measure of digital influence.

SELECT account FROM accounts WHERE followers >= 1 ORDER BY followers DESC;

q3: The next step in query complexity is aggregation, in this case by the field named in the GROUP BY clause. The following query returns the total number of hashtags for each Twitter account.

SELECT SUM(hashtags), account FROM tweets GROUP BY account;

q4: Finally, the most time-consuming query is the join. There are no such operations in the NoSQL world, where each of the four primary technologies deals with it differently. The following query returns the number of tweets and followers for each account, provided the latter has at least one follower.

SELECT accounts.account, accounts.followers FROM accounts INNER JOIN tweets ON accounts.account = tweets.account WHERE accounts.followers >= 1;

There is a correspondence between the above queries and the Twitter analytics presented earlier. Specifically, Ja can be implemented with two q2 (to find followers and followees respectively) and one q1, Jh with a q3 and a q1, and Jt with a q4 and a q1. In each case the last q1 is used to select the top accounts or tweets.
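Since the paper states that the MongoDB analytics were driven through the Python driver, the following sketch shows how counterparts of q1-q3 could be expressed with PyMongo. The connection string, the database name brexit, and the assumption that the collections accounts and tweets mirror the SQL tables above are illustrative guesses, not details given in the text.

from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")   # hypothetical connection string
db = client["brexit"]                               # hypothetical database name

# q1: accounts whose follower count satisfies a numerical constraint
q1 = db.accounts.find({"followers": {"$gte": 1}}, {"account": 1})

# q2: the same filter, with results sorted by followers in descending order
q2 = db.accounts.find({"followers": {"$gte": 1}}).sort("followers", DESCENDING)

# q3: total hashtags per account via the aggregation pipeline,
# assuming each tweet document stores a numeric "hashtags" count
q3 = db.tweets.aggregate([
    {"$group": {"_id": "$account", "total_hashtags": {"$sum": "$hashtags"}}}
])

# q4 (the join) has no direct counterpart in the document model; it would be
# handled at the application level or, alternatively, with a $lookup stage.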
In order to accurately determine total response times, each query for each database was run r0 times, each time on a randomly selected subset of n tweets. The sample mean T(n) was recorded as in equation (4), where tj is the j-th measurement:

T(n) = (1/r0) Σ_{j=1}^{r0} tj   (4)

Equation (4) can be thought of as the sample mean approximation of the true stochastic mean of the random variable counting the total execution time. Under mild conditions of ergodicity, the former converges to the latter as r0 grows. A metric of how concentrated the samples are around the respective mean value is the variance. Along a similar line of reasoning, the sample variance σ²(n) approximates the true one and is computed as in equation (5):

σ²(n) = (1/r0) Σ_{j=1}^{r0} (tj − T(n))²   (5)

In figure 1 the mean times for q1 and q2 for both databases are displayed. Likewise, in figure 2 the response times for queries q3 and q4 are shown. The values of the sample variance are shown in table 4. Observe that the sample variance grows very slowly compared with the growth of the respective query size. This is an indicator that the estimates obtained by the sample mean are reliable, which can be partly attributed to the selection of r0 (for its actual value see table 3). Note that the PostgreSQL installation tends to yield more consistent estimators, as denoted by the systematically lower sample variance.

The semilogarithmic scale can reveal certain scaling patterns. In particular, the almost straight time curves for both databases for q4 suggest a power law scaling of the total response time T(n) with n of the form [32]:

T(n; α0, γ0, n0) = α0 (n + n0)^γ0   (6)

In models like that of equation (6) the most important parameter is the scaling exponent γ0, which determines the growth rate, whereas the parameters α0 and n0 represent initial conditions and have a minor effect on growth.
Taking the natural logarithm of equation (6) yields the equivalent (7). Notice there are N0 such equations, one for each of the possible values of n as shown in table 4. The same value is repeated in table 3 for convenience.

ln T(n; α0, γ0, n0) = ln α0 + γ0 ln(n + n0)   (7)

Stacking the N0 equations (7) for the different values of n, in increasing order of the latter, yields the non-linear system of equation (8). Observe that (8) is non-linear and, moreover, the parameter n0 is difficult to separate from the data values n. To this end, instead of the standard least squares (LS) estimator, the following iterative scheme was used:

- The initial estimate for n0 is used to obtain an LS solution for α0 and γ0.
- Using a line search with step μ0 = ln n0, a new estimate for n0 is obtained.
- A new LS solution for α0 and γ0 is computed.
- If this solution increases the residual error, then the search reverses direction.
- If two successive reversals occur, then the search terminates.
- The values of α0, γ0, n0 with the least residual error are returned.

From the semilogarithmic plot of (7) the following initial estimates can be deduced:

- The average arctangent of the plot of T(·) approximates the exponent γ0.
- The rise of the plot of T(·) at its beginning is ln α0.
- The shift of the plot of T(·) at its beginning is n0.

From the definition of model (6) it is clear that it is the exponent γ0 which dominates the time growth. The values obtained by the process described above for γ0 for each query are given in table 5. Notice that the values for q1 are very close, but for the remaining queries MongoDB scales at a much lower rate.
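A compact Python sketch of the fitting scheme outlined above follows. The initial shift n0, the clamping of n0 to non-negative values, and the exact stopping bookkeeping are assumptions where the text leaves the details open; the query sizes and timing values in the usage line are purely illustrative.

import numpy as np

def ls_alpha_gamma(n, t, n0):
    """Least-squares fit of ln(alpha0) and gamma0 for a fixed shift n0, as in equation (7)."""
    X = np.column_stack([np.ones(len(n)), np.log(n + n0)])
    y = np.log(t)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual = float(np.sum((X @ coef - y) ** 2))
    return coef[0], coef[1], residual          # ln(alpha0), gamma0, residual error

def fit_power_law(n, t, n0_init=10.0, max_iter=50):
    """Alternate LS solutions with a line search on n0 (step mu0 = ln n0_init)."""
    n, t = np.asarray(n, dtype=float), np.asarray(t, dtype=float)
    n0, step = float(n0_init), float(np.log(n0_init))
    best_res, best_params = np.inf, None
    direction, reversals = 1.0, 0
    for _ in range(max_iter):
        ln_a, g, res = ls_alpha_gamma(n, t, n0)
        if res < best_res:
            best_res, best_params = res, (float(np.exp(ln_a)), float(g), n0)
        else:
            direction = -direction             # residual grew: reverse the search
            reversals += 1
            if reversals == 2:                 # two reversals: terminate
                break
        n0 = max(n0 + direction * step, 0.0)   # keep the shift non-negative
    return best_params                         # (alpha0, gamma0, n0) with least residual

# Illustrative usage; the query sizes and timings below are invented, not measured.
alpha0, gamma0, n0 = fit_power_law([1000, 5000, 10000, 50000], [0.8, 2.9, 5.1, 19.7])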

Response to query sequences
In a typical enterprise environment even a standalone database installation is bound to serve query sequences with varying characteristics and of different sizes. The prediction of the execution time of each window allows the preemptive allocation of resources, at least under a certain set of normal operating conditions. In this case the following scenario was simulated. The same batch of L1 tweets, in the order they were created, is serially inserted into the two databases. This simulates the constant flow of tweets under normal operational conditions, with no special events sparking intense activity. At the end of each such window the top ten accounts and tweets, as determined by the three metrics of equations (1), (2), and (3), are computed. The sum of the individual times of these analytics is defined as the response time for the respective window.
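A minimal sketch of the simulated scenario just described is given below. The helper names insert_window, top10_by_ratio, top10_by_hashtags, and top10_by_visibility are hypothetical placeholders standing in for the three analytics of equations (1)-(3); the actual ingestion and analytics code is not reproduced from the paper.

import time

def run_windows(tweet_stream, window_size, db):
    """Insert tweets window by window and time the three analytics per window."""
    window_times, buffer = [], []
    for tweet in tweet_stream:
        buffer.append(tweet)
        if len(buffer) == window_size:
            db.insert_window(buffer)           # serial insertion of L1 tweets
            start = time.perf_counter()
            db.top10_by_ratio()                # top ten accounts by equation (1)
            db.top10_by_hashtags()             # top ten tweets by equation (2)
            db.top10_by_visibility()           # top ten accounts by equation (3)
            window_times.append(time.perf_counter() - start)
            buffer = []
    return window_times                        # one response time per window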
Notice that even in a simple scenario like this there is no direct correspondence between the execution time of each window and the queries it contains. This can be attributed to the following reasons:

- Although a connection between analytics and queries has been derived, the relationship between queries and response time is complicated.
- There is a non-linear connection between the time response of a query and its respective input size. Therefore, the response of an entire query sequence is more difficult to model and analyze.
- The query input size keeps growing, which translates not only to non-constant total response times but also to increased variation thereof. Therefore, the prediction complexity is non-trivial.
- The response time depends not only on the query but also on factors such as the available memory, disk and processor utilization, disk controller performance, and the load factor, which have not been modeled.
One way of predicting the execution time of a query window is a linear filter which takes as input the execution times of the L0 past windows and yields as output an estimate of the execution time of the (L0 + 1)-st window. Depending on the filter as well as on the stochastic properties of the windows, the coefficients of a filter may be updated periodically, updated after certain events, or remain fixed.
In this work the Wiener filter of length L0 will be used, which has numerous applications in signal processing. It relies on approximating the input autocorrelation coefficients by their respective sample counterparts. For instance, consider a tuple X of past window execution times, as in equation (9). Observe that X essentially contains L0 samples of the r.v. Xk which models the response time of the k-th execution window. Since these times were generated by the same database installation under the same conditions, they can be assumed to be modeled by identically distributed r.vs. As the queries of each window do not interact with those of other windows, the Xk can also be assumed to be independent. This probabilistic assumption greatly facilitates the approximate computation of the autocorrelation sequence, which is necessary in order to compute the filter coefficients. Assuming a stationary distribution for the execution times, the j-th autocorrelation coefficient ρj is defined as in equation (10). For example, the first two coefficients of the autocorrelation sequence, E[Xk²] and E[Xk Xk−1], are approximated as in equation (11). In general, the coefficients wk of a Wiener filter of length L0 are the solution of the linear system (12). Observe that the coefficient matrix is diagonally dominant, as determined by the properties of the autocorrelation sequence, symmetric, and Toeplitz. These properties lead to efficient solutions for long filters.
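Because the display equations (9)-(12) did not survive extraction, the sketch below restates the standard construction they describe: sample autocorrelation estimates feed the symmetric Toeplitz normal equations, whose solution gives the predictor weights. This is a minimal Python sketch of that textbook procedure, not the authors' implementation, and the timing values in the usage lines are invented.

import numpy as np

def sample_autocorrelation(x, max_lag):
    """Sample estimates of rho_0, ..., rho_max_lag (no mean subtraction, as in the text)."""
    n = len(x)
    return np.array([np.dot(x[:n - j], x[j:]) / n for j in range(max_lag + 1)])

def wiener_coefficients(x, order):
    """Solve the symmetric Toeplitz normal equations R w = p for the filter weights."""
    rho = sample_autocorrelation(np.asarray(x, dtype=float), order)
    R = np.array([[rho[abs(i - j)] for j in range(order)] for i in range(order)])
    p = rho[1:order + 1]
    return np.linalg.solve(R, p)

def predict_next(past_times, w):
    """One-step prediction; the most recent window time receives the first weight."""
    recent = np.asarray(past_times[-len(w):], dtype=float)[::-1]
    return float(np.dot(w, recent))

# Illustrative usage with invented window times: fit on the earlier windows,
# predict the last one, and record the squared prediction error.
times = [3.1, 3.4, 3.2, 3.6, 3.5, 3.8, 3.7, 4.0]
w = wiener_coefficients(times[:-1], order=3)
squared_error = (times[-1] - predict_next(times[:-1], w)) ** 2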
Once the Wiener filter has been computed for both database architectures and for various values of the filter length L0, the mean square error between the actual and the predicted execution times is computed. The normalized results are shown in table 6; the normalization was selected so that the differences in error across the various combinations are easier to compare. From the values of table 6 it follows that MongoDB achieves a lower mean square error and, thus, the time responses of its query sequences are easier to model and predict. This in turn translates to database administrators having a better view of the scaling dynamics of incoming tweet flows. The ability to respond more quickly may be vital during extraordinary events.

Remarks
The findings of the previous sections are consistent, to the extent that a comparison can be made, with those reported in the recent literature examining the performance of MongoDB against that of PostgreSQL in various engineering cases. Specifically, the scaling dynamics were similar to those in [39] and in [28], although the latter refers to spatial data. Other tests, again on spatial data, in [5] showed the strengths and weaknesses of both databases. In [19] big data considerations were examined and constraints were derived, again for both databases. The perspective of queries about line intersection and spatial containment problems was given in [1], where MongoDB performed better. Queries on unstructured data were run on the same pair of databases in [9], with MongoDB being the winner because of its enhanced flexibility. When it comes to queries for semistructured, high-velocity sensor measurements, as explored in [43], MongoDB again seems to be the choice, at the cost of extensive memory use. In [11] the performance of k-nearest neighbors queries is explored, with the relational solution offering more advantages. MongoDB was also the better option for handling a stock market dataset over Hadoop [37]. Finally, the potential of MongoDB for mobile applications was explored with encouraging results [15].
Based on the experiments, the following remarks can be made:

- Concerning isolated queries, MongoDB achieves lower execution times but PostgreSQL has lower variance. The latter implies that PostgreSQL performance is easier to predict, which can be attributed to the structured data format and the greater degree of automation it offers.
- Concerning query windows, the total execution time is easier to predict in the case of MongoDB with Wiener filters of lower order. Since Wiener filters are efficiently implemented in most if not all scientific computation tools and are intuitively easy to understand, database administrators can readily obtain an understanding of MongoDB scaling dynamics and, based on their experience, of the major drivers behind them.

Future Research Directions
This conference paper focuses on the performance comparison between MongoDB and PostgreSQL, a special case of the general BASE vs ACID question, when handling Twitter analytics. As is the case in most realistic engineering scenarios, the answer is not simple. Rather, it depends on the evaluation of the current state as well as the near-term trend of a plethora of operational variables including memory and disk utilization, system load, and input query size. In this conference paper the scaling dynamics of both individual social queries, namely queries forming the basis for Twitter analytics, and query sequences were evaluated on a large number of English tweets about Brexit. Although PostgreSQL achieved consistently lower variance of query execution times, MongoDB had lower mean times. Moreover, the execution times of query sequences were easier to predict in the MongoDB case, meaning that database administrators have a much better view of the query scaling trends.
Concerning possible future research directions, the use of larger datasets with a higher variability of attributes as performance benchmarks should be examined. Moreover, the scaling dynamics of database server clusters should be investigated. Additionally, other query sequence execution time predictors, whether fixed or adaptive, can be the focus of future research.

Table 1. Notation of this conference paper.

Table 4. Sample variance vs query size and database technology (msec).

Table 5. Values for the exponent γ0.

Table 6. Normalized mean square error for database architectures.