Computing Long Sequences of Consecutive Fibonacci Integers with TensorFlow

Fibonacci numbers appear in numerous engineering and computing applications including population growth models, software engineering, task management, and data structure analysis. This mandates a computationally efficient way of generating a long sequence of successive Fibonacci integers. With the advent of GPU computing and the associated specialized tools, this task is greatly facilitated by harnessing the potential of parallel computing. This work presents two alternative parallel Fibonacci generators implemented in TensorFlow, one based on the well-known recurrence equation generating the Fibonacci sequence and one based on inherent linear algebraic properties of Fibonacci numbers. Additionally, the question of whether lookup tables combined with spline interpolation should replace the direct computation of the powers of known quantities within a parallel context is explored. Although both parallel generators outperform the baseline serial implementation in terms of wallclock time and FLOPS, there is no clear winner between them, as the results depend on the number of integers generated. Moreover, replacing computations with a lookup table degrades performance, which can be attributed to the frequent access to the shared memory.


Introduction
The sequence of Fibonacci integers $f_k$ appears often in a broad spectrum of engineering applications including coding theory, cryptography, simulation, and software management. Additionally, Fibonacci numbers are very closely tied to the golden ratio ϕ, which is frequently encountered in nature, such as in population growth models and in botany. Moreover, in architecture ϕ is considered almost synonymous with harmony. Thus, Fibonacci integers are arguably among the most significant sequences. Although their defining linear recurrence equation is simple, serially generating a long sequence of consecutive Fibonacci integers is by no means a trivial task.
However, with the advent of GPU computing, the efficient parallel generation of $f_k$ has been rendered feasible. Indeed, by exploiting known properties of the Fibonacci integers it is possible to build parallel generators which exploit the underlying hardware potential to a great extent, achieving low response times. This requires specialized scientific software such as TensorFlow, which not only contains very efficient libraries but also facilitates the development of high quality custom source code.
The primary research contribution of this work is twofold. First, it lays the groundwork for two parallel Fibonacci integer generators, one based on a closed form for each number in the sequence and one based on certain linear algebraic properties of Fibonacci integer pairs. The performance of the two proposed generators, developed in TensorFlow for Python, is evaluated in terms of both total turnaround time and FLOPS against a serial implementation built with the same software tools. Second, it explores the question of whether it is worth substituting the computation of known quantities with a lookup table.
This conference paper is structured as follows. Section 2 reviews current scientific literature regarding the computational aspects of Fibonacci numbers. Their fundamental properties are described in Section 3. The TensorFlow implementation is presented in Section 4, whereas future research directions are outlined in Section 5. Table 1 summarizes the notation of this work. Concerning notation, matrices and vectors are depicted with boldface uppercase and boldface lowercase respectively, whereas roman lowercase is reserved for scalars.

Previous Work
The Fibonacci sequence of integers, examined among others in [6] and [30], has perhaps the most applications not only in computer science and in engineering but in science as a whole. Closed forms for the spectral norms of circulant matrices whose entries are either Fibonacci or Lucas integers are derived in [21]. In data structure analysis the Fibonacci heap [19] and the associated pairing heap [18] have efficient search and insertion operations with numerous applications such as network optimization. A pair of successive Fibonacci numbers is known to be the worst case for the Euclidean integer division algorithm, as shown in [9] as well as in [29]. Fibonacci numbers play a central role in estimating task duration and, consequently, task difficulty in scrum based software engineering methodologies [28][27], including inaccurate estimation discovery [25] and using agile methodologies to predict student progress [26]. Moreover, Fibonacci numbers are very closely linked to the golden ratio ϕ as well as to the symmetry of many geometric shapes, the latter having important implications in group theory [16].
Many identities in combinatorics regarding Fibonacci numbers can be found in [31] as well as in the more recent works [5] and [23]. Finally, the Lucas sequence is closely associated with the Fibonacci sequence $f_k$, since the two integer sequences constitute a Lucas complementary pair and share similar properties such as growth rate and generation mechanism [3][4].
TensorFlow, originally developed by Google for massive brain circuit simulation, is an open source computational framework whose algorithmic cornerstone is the dataflow paradigm as described in [2], [1], or [20]. A library for generating Gaussian processes in TensorFlow is presented in [24]. For a genetic algorithm implementation in TensorFlow for heuristically discovering community structure, a problem examined in [10], in large multilayer graphs, such as those presented in [12] and in [11], see [15]. In [14] the ways insurance and digital health markets can benefit from blockchain and GPU computing are explored. For a path and triangle based graph resilience metric in TensorFlow see [13]. A very popular front end for the low level TensorFlow is Keras, which allows the easy manipulation of neural network layers, including connectivity patterns and activation functions [8]. Model training and prediction generation are also done easily in Keras in four stages [22]. Convolutional kernels whose lengths depend on their relative location inside the neural network architecture, implemented in Keras for computer vision purposes, are introduced in [7].

Fibonacci Numbers
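The n-th integer in the Fibonacci sequence $f_n$ is defined as:

$$f_n = f_{n-1} + f_{n-2}, \quad n \geq 2, \qquad f_0 = 0, \quad f_1 = 1 \tag{1}$$

Theorem 1. The n-th Fibonacci number $f_n$ has the closed form:

$$f_n = \frac{\phi^n - (1-\phi)^n}{\sqrt{5}}, \qquad \phi = \frac{1+\sqrt{5}}{2} \tag{2}$$

Proof. The characteristic polynomial of (1) is:

$$p(x) = x^2 - x - 1 \tag{3}$$

And its two real and distinct roots are:

$$x_1 = \frac{1+\sqrt{5}}{2} = \phi, \qquad x_2 = \frac{1-\sqrt{5}}{2} = 1-\phi \tag{4}$$

Therefore, it follows that:

$$f_n = \alpha_0 \phi^n + \alpha_1 (1-\phi)^n \tag{5}$$

The constants $\alpha_0$ and $\alpha_1$ are computed using the initial conditions derived from the first two Fibonacci numbers as follows:

$$f_0 = \alpha_0 + \alpha_1 = 0, \qquad f_1 = \alpha_0 \phi + \alpha_1 (1-\phi) = 1 \tag{6}$$

The above conditions yield:

$$\alpha_0 = \frac{1}{\sqrt{5}}, \qquad \alpha_1 = -\frac{1}{\sqrt{5}} \tag{7}$$

Another way to prove theorem 1 is the following:

Proof. Another way to directly compute the n-th Fibonacci number $f_n$ is to rewrite the Fibonacci definition of equation (1) and the identity $f_n = f_n$ combined in matrix-vector format as follows:

$$\mathbf{f}_n = \begin{bmatrix} f_{n+1} \\ f_n \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} f_n \\ f_{n-1} \end{bmatrix} = \mathbf{A} \mathbf{f}_{n-1} \tag{8}$$

The eigenvalues $\lambda_1$ and $\lambda_2$ (recall that $\det(\mathbf{A}) = \lambda_1 \lambda_2$ and that $\mathrm{tr}(\mathbf{A}) = \lambda_1 + \lambda_2$) and the corresponding eigenvectors $\mathbf{e}_1$ and $\mathbf{e}_2$ are:

$$\lambda_1 = \phi, \quad \lambda_2 = 1-\phi, \qquad \mathbf{e}_1 = \frac{1}{\sqrt{1+\phi^2}} \begin{bmatrix} \phi \\ 1 \end{bmatrix}, \quad \mathbf{e}_2 = \frac{1}{\sqrt{1+(1-\phi)^2}} \begin{bmatrix} 1-\phi \\ 1 \end{bmatrix} \tag{9}$$

Notice that $\|\mathbf{e}_1\|_2 = 1$ and $\|\mathbf{e}_2\|_2 = 1$ and, additionally, $\mathbf{e}_1^T \mathbf{e}_2 = 0$. From the spectral decomposition of $\mathbf{A}$ it follows that:

$$\mathbf{f}_{n-1} = \mathbf{A}^{n-1} \mathbf{f}_0 = \left( \lambda_1^{n-1} \mathbf{e}_1 \mathbf{e}_1^T + \lambda_2^{n-1} \mathbf{e}_2 \mathbf{e}_2^T \right) \mathbf{f}_0, \qquad \mathbf{f}_0 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \tag{10}$$

Observe that the first element of vector $\mathbf{f}_{n-1}$ is $f_n$ and it is equal to:

$$f_n = \frac{\phi^{n+1}}{1+\phi^2} + \frac{(1-\phi)^{n+1}}{1+(1-\phi)^2} \tag{11}$$

Since $\phi / (1+\phi^2) = 1/\sqrt{5}$ and $(1-\phi) / (1+(1-\phi)^2) = -1/\sqrt{5}$, this expression coincides with (2). Finally, the structure of $\mathbf{A}^n$ can be shown by induction to be:

$$\mathbf{A}^n = \begin{bmatrix} f_{n+1} & f_n \\ f_n & f_{n-1} \end{bmatrix}, \quad n \geq 1 \tag{12}$$

Despite form (5), Fibonacci numbers are integers, as is evident from the initial conditions and their generation mechanism. Another way to see this is the following theorem:

Theorem 2. Fibonacci numbers are integers.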
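The matrix formulation can be checked numerically with exact integer arithmetic. The following is a minimal sketch, not the implementation used in this work: it assumes the convention f_0 = 0, f_1 = 1 and the 2x2 matrix A = [[1, 1], [1, 0]], and the helper names `fib_recurrence`, `mat_mul`, `mat_pow`, and `has_fibonacci_structure` are hypothetical.

```python
def fib_recurrence(n):
    """Return f_n from the defining recurrence, assuming f_0 = 0 and f_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def mat_mul(x, y):
    """Multiply two 2x2 integer matrices stored as nested tuples."""
    return (
        (x[0][0] * y[0][0] + x[0][1] * y[1][0],
         x[0][0] * y[0][1] + x[0][1] * y[1][1]),
        (x[1][0] * y[0][0] + x[1][1] * y[1][0],
         x[1][0] * y[0][1] + x[1][1] * y[1][1]),
    )


def mat_pow(a, n):
    """Raise a 2x2 integer matrix to the n-th power by repeated squaring."""
    result = ((1, 0), (0, 1))  # identity matrix
    while n > 0:
        if n & 1:
            result = mat_mul(result, a)
        a = mat_mul(a, a)
        n >>= 1
    return result


# Assumed matrix of the matrix-vector form of the recurrence.
A = ((1, 1), (1, 0))


def has_fibonacci_structure(n):
    """Check that A^n equals [[f_{n+1}, f_n], [f_n, f_{n-1}]] for n >= 1."""
    expected = (
        (fib_recurrence(n + 1), fib_recurrence(n)),
        (fib_recurrence(n), fib_recurrence(n - 1)),
    )
    return mat_pow(A, n) == expected
```

Because Python integers have arbitrary precision, the check is exact for any n, which also illustrates that every entry of a power of A is an integer.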

TensorFlow Implementation
As stated earlier, TensorFlow relies on the dataflow algorithmic paradigm, which essentially utilizes a potentially large operations graph in order to break down the desired computational task into smaller manageable components. Dataflow graphs have the following properties:
- Vertices represent a wide array of mathematical operations including advanced ones such as eigenvector computation, singular value decomposition, and least squares fitting.
- Edges describe the directed data flow between operation results.
- The operands between the various graph operations are tensors of arbitrary dimensions as long as they are compatible.
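The dataflow idea can be illustrated without TensorFlow itself. The sketch below is a toy model in plain Python, not the TensorFlow API: the `Node` class and the `constant` helper are hypothetical names, operands are plain integers rather than tensors, and the graph simply chains three additions of the Fibonacci recurrence.

```python
import operator


class Node:
    """A vertex of a toy dataflow graph: an operation plus its input vertices."""

    def __init__(self, op, *inputs):
        self.op = op          # callable applied to the input values
        self.inputs = inputs  # incoming edges, i.e. the vertices feeding this one

    def evaluate(self, cache=None):
        """Evaluate this vertex, computing every upstream vertex at most once."""
        cache = {} if cache is None else cache
        if id(self) not in cache:
            args = [node.evaluate(cache) for node in self.inputs]
            cache[id(self)] = self.op(*args)
        return cache[id(self)]


def constant(value):
    """Create a source vertex that has no inputs and always yields `value`."""
    return Node(lambda: value)


# Build the graph first; nothing is computed until evaluate() is called.
f0 = constant(0)
f1 = constant(1)
f2 = Node(operator.add, f1, f0)
f3 = Node(operator.add, f2, f1)
f4 = Node(operator.add, f3, f2)

result = f4.evaluate()  # walks the edges back to the constants
```

Separating graph construction from evaluation is the point of the sketch: the graph describes which results feed which operations, and a runtime is then free to schedule independent vertices in parallel.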
TensorFlow r1.12 was installed on Ubuntu 18.04 LTS for Python 3.6 using the pip package installer. An NVIDIA Titan Xp GPU based on the Pascal architecture was available in the system and was successfully discovered by TensorFlow as gpu0.
Three Fibonacci generators were implemented in total. Each such generator yields a batch consisting of the first n consecutive Fibonacci integers. In the experiments, n ranged from 2 to 1024 with an exponentially increasing distance between two successive batch sizes. Since the particular GPU has 3840 CUDA cores, the parallelism potential is high. The generators are:
- A serial implementation which consists of a single loop which adds one new Fibonacci number with each pass.
- A parallel implementation which directly computes the k-th element of the sequence $f_k$ based on equation (11).
- A second parallel implementation which relies on the slightly simpler closed expression of equation (2).
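The three generators can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the TensorFlow code that was benchmarked: the function names are hypothetical, double precision floats stand in for GPU tensors, the convention f_0 = 0, f_1 = 1 is assumed, and the expression used in `spectral_generator` is one possible unsimplified form obtained from the normalized eigenvectors of the matrix [[1, 1], [1, 0]], which may differ from the exact form of equation (11).

```python
import math

SQRT5 = math.sqrt(5.0)
PHI = (1.0 + SQRT5) / 2.0  # golden ratio
PSI = 1.0 - PHI            # second root of x^2 - x - 1


def serial_generator(n):
    """Serial baseline: a single loop appending one new integer per pass."""
    batch = [0, 1][:n]
    while len(batch) < n:
        batch.append(batch[-1] + batch[-2])
    return batch


def closed_form_generator(n):
    """Closed-form generator: every element depends only on its own index k."""
    return [round((PHI ** k - PSI ** k) / SQRT5) for k in range(n)]


def spectral_generator(n):
    """Generator based on an assumed eigenvector form of the closed expression."""
    c1 = 1.0 + PHI ** 2  # squared norm of the unnormalized eigenvector (PHI, 1)
    c2 = 1.0 + PSI ** 2  # squared norm of the unnormalized eigenvector (PSI, 1)
    return [round(PHI ** (k + 1) / c1 + PSI ** (k + 1) / c2) for k in range(n)]
```

In both closed-form variants no element depends on any other, so the list comprehension could be replaced by a single elementwise operation over a tensor of indices, which is what makes the approach attractive on a GPU. With double precision floats the rounded values are exact only for small indices (roughly the first seventy terms), so larger batches would need wider arithmetic.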
Figure 1 shows the total number of floating point operations which were required for each batch size. The values are the arithmetic mean of ten executions for each batch size. Each such execution was preceded by a trial run, not taken into consideration in the final results, which served the single purpose of loading the data into the system cache. Since the serial implementation is a loop, the number of additions is linear in terms of n. On the contrary, the parallel implementations require a number of auxiliary floating point operations, most notably the exponentiation of certain parameters. Thus, they require more operations, whose number is a polynomial function, approximately quadratic, of batch size, with the second generator clearly always being more expensive in terms of operations.
However, the fact that a generator requires more floating point operations does not necessarily make it slower in terms of total execution time. Instead, the results shown in figure 2 indicate that both parallel generators achieve considerably lower wallclock execution time in milliseconds. Since the computation of $f_k$ is a GPU-bound process and the design of both parallel generators entails very low communication across the memory hierarchy, ordinary wallclock time in this case consists almost entirely of time spent on actual computations.
Notice that, unlike in the previous figure, no parallel implementation appears to be ideal for every batch size. Specifically, the second implementation is better for smaller batches, whereas the first one becomes preferable as batch size grows despite its more complex formula. This can be attributed to the fact that the second generator achieves more locality, as certain parameters are common across each batch size.
This difference in performance can also be seen by dividing the number of floating point operations by the wallclock time, yielding an approximation of the FLOPS for each generator as seen in figure 3. Although by no means a single absolute benchmark, especially within a parallel computation context, FLOPS are in this case indicative since the algorithmic core is purely computational. The difference from the serial implementation is now obvious, as is the fact that the second generator performs better for large batches.
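The measurement protocol described above (one discarded trial run, the mean of ten timed executions, and FLOPS approximated as operations divided by wallclock time) can be sketched as follows. This is a hedged illustration only: `mean_wallclock_seconds` and `estimated_flops` are hypothetical helpers, the timed function is a plain Python serial loop rather than a TensorFlow graph, and counting one addition per loop pass is an assumption made for the example (the additions here are integer, not floating point).

```python
import time


def serial_generator(n):
    """Serial baseline: a single loop appending one new integer per pass."""
    batch = [0, 1][:n]
    while len(batch) < n:
        batch.append(batch[-1] + batch[-2])
    return batch


def mean_wallclock_seconds(func, arg, repeats=10):
    """Return the mean wallclock time of `repeats` runs after one warm-up run."""
    func(arg)  # trial run, excluded from the reported mean
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        total += time.perf_counter() - start
    return total / repeats


def estimated_flops(operation_count, seconds):
    """Approximate operations per second as operation count over elapsed time."""
    return operation_count / seconds


n = 1024
mean_seconds = mean_wallclock_seconds(serial_generator, n)
# Assumed operation count: one addition for each element after the first two.
operation_count = n - 2
```

Calling `estimated_flops(operation_count, mean_seconds)` then gives the throughput figure for one batch size; repeating the procedure over all batch sizes would produce one curve of the kind plotted per generator.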
Notice that many consecutive powers of known quantities appear in both parallel implementations. In order to save floating point operations, a lookup table could have been used according to the design principles found for instance in [17]. In order to evaluate the impact of relying on a lookup table on the total FLOPS for the two parallel generators, two variants of each were also implemented. The first version stored locally half the known quantities required, whereas the second stored only one quarter of them. In both cases, and in order to achieve comparable result accuracy for fairness reasons, spline interpolation was used. As can be seen from figure 4, the introduction of a lookup table lowered the FLOPS of both generators. This can be explained by the facts that TensorFlow has an efficient multiplication algorithm, especially for large numbers, and that frequently accessing the shared GPU memory eventually slowed computations down.

Conclusions And Future Work
This conference paper presented two parallel Fibonacci integer generators for TensorFlow running over Python 3.6 and an NVIDIA Titan Xp GPU and discussed certain implementation aspects. Both generators yield batches of n integers, with n ranging from 2 to 1024. The maximum batch size is smaller than the number of cores in the GPU, thus increasing parallel potential. The baseline was a serial implementation for TensorFlow consisting of a single loop generating one new Fibonacci integer in each pass. The primary finding of the experiments is the superior performance of the parallel generators. Although requiring more floating point operations in total, both parallel implementations outperform the baseline in terms of wallclock time and of FLOPS. This is attributed to the efficient use of parallelism. A secondary finding was that replacing actual computations with frequent accesses to a shared lookup table led to lower FLOPS values. This can be explained by the latency caused by a large number of threads asking for the same information.
Concerning future research directions, there are a number of options which can be followed. Experiments with larger batch sizes should be conducted, especially with sizes which exceed the number of GPU cores. Additionally, more algorithmic schemes should be tested, such as those constructing Fibonacci integers bitwise, as they may lead to generators with higher parallelism. Finally, the performance of the proposed algorithms on other parallel architectures can be evaluated, in order to understand whether a given hardware architecture is more appropriate for particular algorithmic principles. Additionally, this conference paper is part of Project 451, a long term research initiative whose primary objective is the development of novel, scalable, numerically stable, and interpretable tensor analytics.
Moreover, the authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Fig. 1. Number of operations vs batch size of Fibonacci numbers.

Table 1. Notation of this conference paper.