TaskLocalRandom: a statistically sound substitute for pseudorandom number generation in parallel Java tasks frameworks

Several software efforts have been produced over the past few years, in various programming languages, to help developers handle the partitioning of pseudorandom streams. Parallel and distributed stochastic simulations can obviously benefit from such high-level tools. The latest release of the Java Development Kit (JDK 7) tries to tackle this problem by providing facilities to partition a pseudorandom stream across various threads thanks to the new class ThreadLocalRandom. Meanwhile, Java 7 offers a framework to split a problem in a divide and conquer way through the new class called ForkJoinPool. Like any other Java thread pool, ForkJoin exploits threads as workers and manipulates the tasks that will be run on the workers. In ThreadLocalRandom, pseudorandom number generation is handled at the thread level. As a consequence, a scientific application taking advantage of a Java thread pool to parallelize its computation may suffer from a bad pseudorandom stream partitioning due to the behavior of ThreadLocalRandom. The present work introduces TaskLocalRandom, a task-level alternative to ThreadLocalRandom that solves this partitioning problem and assigns an independent pseudorandom stream to each task run in the thread pool. TaskLocalRandom is compatible with existing Java thread pools such as Executors or ForkJoin. Copyright © 2014 John Wiley & Sons, Ltd.


INTRODUCTION
In the many-core era, simulation practitioners must take advantage of these powerful architectures. The new release of the Java Development Kit 7 (JDK 7) addresses this need by focusing on the concurrency features of Java. Java 7 offers a couple of new tools that enhance the already existing concurrent package. Namely, a new framework called Fork/Join appears [1]. It provides an easy-to-use implementation of the divide and conquer paradigm through lightweight tasks. The divide and conquer paradigm suggests splitting an important workload among several processing elements (PEs). First, each PE computes a subset of the whole workload. Then, the results are gathered once all the subtasks have returned their results.
This new task framework, added to the already present tools allowing the distribution of the computing load across several threads, should attract more and more simulation practitioners to Java development. These users will also bring their own concerns bound to parallelization in their domain of expertise. Thus, simulationists working on stochastic simulations will ask for a tool to help them partition a random source in a parallel Java environment.
In fact, the correct partitioning of random streams is the main concern of several studies [2, 3], and neglecting this part of a simulation could lead to biased results [4]. To avoid such issues, one needs to ensure that the four main guidelines exposed in [5, 6] are followed:
(i) each computing element should dispose of its own random sequence;
(ii) the parallelization technique must be usable for any number of computing elements;
(iii) the parallel random streams produced should be uncorrelated;
(iv) when the status of the pseudorandom number generator (PRNG) is not modified, the sequence of random numbers generated for a given computing element must be the same no matter the number of computing elements and regardless of the way computing elements are scheduled.
Java 7 introduces the ThreadLocalRandom class, a tool that intends to enable developers to deal with pseudorandom numbers in parallel on a single shared memory computer, without having to figure out how to distribute numbers among the available PEs. The question for scientific applications is as follows: can ThreadLocalRandom serve as a random source with the statistical quality required by scientific applications?
Although this development is a good initiative that is worth being integrated in Java, we will see that the current implementation still has some major drawbacks for scientific purposes. In fact, the underlying PRNG used by ThreadLocalRandom can hardly be considered for any scientific application, as explained in Section 2.2 and in [2, 7, 8]. Moreover, as its name suggests, ThreadLocalRandom is designed to perform at the thread level. However, threads are now mostly used as worker threads in Java thread pools, following the introduction of tasks frameworks since JDK 5. Worker threads were designed to get rid of the overhead bound to thread creation. An application creating lots of threads will indeed be slowed down by frequent thread spawns. To overcome this issue, threads are created once and for all and are then assigned tasks to achieve. Such permanent threads are gathered in thread pools over the lifetime of the application.
Because of architectural considerations, we usually use as many worker threads as there are available PEs in the system. PEs can take the shape of physical or logical cores, depending on the underlying architecture hosting the JVM. For instance, modern Intel CPUs integrate a feature called Hyper-Threading that enables a physical core to refine its parallelism capabilities by running two different threads in parallel, provided they perform operations involving different hardware resources (e.g., one thread can compute a floating point instruction, while the other one treats an operation on integers).
The JVM considers these two execution paths as two logical cores. In such a case, the number of PEs will denote the number of logical cores. In order not to limit the parallelism granularity to this logical-core boundary, the notion of tasks has been introduced. Tasks are purely equivalent to threads in terms of development, because they implement the same Java Runnable interface. They only differ from threads in that they are queued within worker threads and are thus scheduled when their number is greater than the number of workers. As a consequence, a single task will run in a worker thread at a given point, but the worker thread can preempt it in case it is stalled, waiting for data for instance.
These hardware considerations are not taken into account by the JVM, but task frameworks integrate a task stealing mechanism that enables an idle worker thread to steal tasks from the queue of a busy worker. As a result, Java tasks frameworks benefit from a scheduling mechanism that spreads the workload among the available hardware resources. This behavior is depicted in Figure 1. Task scheduling allows designing a finely grained parallel algorithm that will scale up smoothly with the number of worker threads available on the platform.
Figure 1. One worker thread per processing element is created. It is assigned a queue of tasks to process.

Tasks make the pseudorandom stream distribution of ThreadLocalRandom inefficient, because tasks are not taken into account by the class. As a consequence, a novelty brought by the latest release of the JDK 7 cannot handle one of the most common ways to leverage threads in Java concurrent applications! Although ThreadLocalRandom is stated as 'particularly appropriate when multiple tasks (for example, each a ForkJoinTask) use random numbers in parallel in thread pools', ‡ it cannot be considered as safe in terms of pseudorandom stream distribution because all the tasks run by the same worker thread will share the same pseudorandom stream.
The present work will consequently tackle a correct way to distribute pseudorandom streams in parallel Java applications harnessing the power of tasks frameworks. To do so, we will:
- study ThreadLocalRandom's intrinsics to figure out whether its output is satisfying regarding the needs of stochastic simulations;
- discuss ThreadLocalRandom's capabilities when used in a task framework context;
- present already existing libraries that could serve as alternatives to ThreadLocalRandom;
- introduce TaskLocalRandom, our proposal based upon the MRG32k3a PRNG algorithm from Pierre L'Ecuyer [9];
- compare TaskLocalRandom with ThreadLocalRandom and consider its potential evolutions.

Implementation concerns
Officially released with JDK 7, the ThreadLocalRandom facility was developed within the jsr166y initiative by Doug Lea. ThreadLocalRandom tries to hide the complexity of using random sources correctly in parallel applications. Each thread owns a ThreadLocalRandom instance, allowing each thread to pick up random numbers independently from the others. Still, the most important point behind this technique is that it is supposed to distribute pseudorandom streams safely among threads.
ThreadLocalRandom inherits from java.util.Random, thus sharing its interface. Every thread must call a method named current() before calling the classical nextXXX methods to pick up a random number whose type is indicated by the XXX suffix.
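The usage pattern is as follows (standard JDK API; only the wrapper class name TlrDemo is ours):

```java
import java.util.concurrent.ThreadLocalRandom;

public class TlrDemo {
    public static void main(String[] args) {
        // current() returns the ThreadLocalRandom instance bound to the calling thread
        ThreadLocalRandom rng = ThreadLocalRandom.current();
        int die = rng.nextInt(1, 7);  // uniform integer in [1, 7)
        double u = rng.nextDouble();  // uniform double in [0.0, 1.0)
        System.out.println(die + " " + u);
        // sanity checks on the documented ranges
        assert die >= 1 && die <= 6;
        assert u >= 0.0 && u < 1.0;
    }
}
```

Note that, unlike java.util.Random, instances are never constructed explicitly; current() is the only entry point.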
ThreadLocalRandom makes use of the random spacing technique [3] to distribute pseudorandom streams across threads. This technique consists in initializing an identical PRNG instance in each thread with a different seed-status [10], the latter being randomly chosen by another algorithm.
By doing so, each thread owns a pseudorandom sequence considered as highly independent (no mathematical proof allows asserting that two sequences are truly independent from a probabilistic point of view), provided the PRNG algorithm has a long enough period and is not subject to long-range correlations [11].
Random spacing is implemented in ThreadLocalRandom through its constructor. Indeed, this method calls the Random constructor before setting a Boolean to true, thus recording that initialization has been carried out and cannot be performed again. ThreadLocalRandom must then rely on the constructor of the Random class to set its initial seed. Until JDK 6, the Random constructor automatically performed a call to setSeed, the method in charge of the random spacing initialization of the seed. However, this is not true anymore with JDK 7, where the constructor of Random no longer invokes setSeed. Consequently, any PRNG class that extends Random and relies on it to call setSeed will see its seed status remain uninitialized. According to Gosling et al. [12], the seed of each thread is thus set to zero, as any class member of the long type would be. Hence, every thread will pick up the same pseudorandom sequence in such a case.
We have already spotted a similar problem in a Java Mersenne Twister implementation and proposed a corrective patch that solves it. § Calling setSeed in every thread could easily solve this problem. Unfortunately, the public setSeed method, which would normally allow setting the seed of the PRNG of a thread to a new value, is locked by the previously mentioned Boolean. Such a feature is important to prevent any user from harming the independence of pseudorandom streams between threads by setting several seeds to the same value. However, this also prevents us from adapting the class behavior and forcing a call to setSeed directly in the constructor of the subclass. Moreover, this solution relies on user awareness of the problem, which goes against the initial purpose of ThreadLocalRandom to hide random stream distribution from the user.
The problem was finally solved in the second update of JDK 7 by changing the constructor of Random in order to take into account a potential use of setSeed by subclasses. This change is confusing in two ways. Not only does it break encapsulation, one of the elementary concepts of the object-oriented paradigm, but it also appears as a lack of good software engineering practice. It is indeed not recommended to adapt an implementation according to already existing source code. Instead, it is safer to rely on the specification only. In our case, the Random class documentation issued by Oracle makes no mention of a potential call to the setSeed method by the Random constructor. As a consequence, we cannot blame Oracle for this bug but rather advise developers to rely on the official documentation only, especially when they are working on such sensitive aspects of the implementation.
Another weakness in the implementation of ThreadLocalRandom lies in the impossibility, by default, to reproduce the same pseudorandom sequences throughout several runs of the application. Scientific applications such as stochastic simulations need to ensure reproducibility between executions for their results to be checked or for debugging purposes. When ThreadLocalRandom is used by default, it does not satisfy this need because it relies on the Random constructor to set its internal seed, and the latter uses the current system time as seed. This could be acceptable for games but not for scientific software. This problem can be fixed by basing the initial seed on a unique identifier for each thread so that, for a given identifier, a thread will always be assigned the same stochastic stream. Still, although Java provides a thread identifier at runtime through the Thread.currentThread().getId() call, this identifier is not reliable because it is global to the whole JVM and thus depends on how many threads were created before the one considered. Therefore, ThreadLocalRandom must be extended with its own thread identifier to make it safe in terms of pseudorandom stream distribution and reproducibility. We have proposed such an extension in [13].

Statistical quality discussion
Nowadays, several renowned tools exist to check the statistical quality of pseudorandom streams. Knuth's tests [14] alone cannot characterize the statistical quality of a pseudorandom sequence. They have been integrated in wider test batteries, which give a more thorough judgment on the statistical quality of the pseudorandom sequence. A testing suite named DieHard, highly regarded for many years, was proposed by Marsaglia [15] and was improved by Brown [16], who proposed the DieHarder testing suite. The Scalable Parallel Pseudo Random Number Generators (SPRNG) library (Free and Open Source Software) [17] also provides a thorough set of statistical tests. For 6 years now, the scientific community has widely agreed that the current reference test battery is TestU01 from [18]. TestU01 currently offers the most complete collection of utilities for the empirical statistical testing of uniform random number generators. Please note that this enumeration does not take into account testing suites focusing on cryptographic applications. In this category, the leading tool is instead the National Institute of Standards and Technology Statistical Test Suite proposal [19], although TestU01 also owns tests targeting cryptographic generators.
In addition to the classical statistical tests for PRNGs and the other tests previously cited and proposed in the literature, TestU01 proposes new original tests and predefined test suites (SmallCrush, Crush, and BigCrush, with more than a hundred tests). Many of the most widespread PRNGs fail significantly when faced with this software. The underlying PRNG of ThreadLocalRandom, a well-known and widely studied linear congruential generator (LCG) from Knuth [14] (it also rules the output of the POSIX drand48 C function, for example), is among the algorithms at fault. LCGs should be discarded from scientific applications because their structure is not adapted to many modern applications [20]. The problem is even bigger when parallel and distributed computing is considered. In addition, the period of ThreadLocalRandom is relatively small for modern scientific applications: it is 2^48 numbers long, whereas Pierre L'Ecuyer suggests that for modern applications periods should be at least 2^100 numbers long [20].
In regard to the parallel utilization of ThreadLocalRandom, we can barely imagine that such a bad generator [2, 7, 8] could behave better in a parallel environment. Thanks to TestU01 parallel filters [18], we can easily create a random sequence formed by the combination of any number of input sequences from different ThreadLocalRandom initializations. However, as stated in [21], it is impossible to perform a complete coverage of all possible logical sequences because many strategies can be set up to distribute both tasks and random streams across parallel computational units. Consequently, testing campaigns often focus on samples that are particularly representative of the distribution technique used.

ThreadLocalRandom plunged into tasks frameworks
Java tasks frameworks are now widely spread across Java applications exploiting concurrency. Introduced in JDK 5 through the ExecutorService class, these tools are thread pools that create threads once for the whole lifetime of an application. These threads are then used as workers that pick up tasks from queues created by the task framework and execute their content, instead of new threads being created for each unit of work. By doing so, the application no longer suffers from the overhead induced by frequent thread creations. The power of this approach is that it relieves developers from low-level thread management without impacting the application or requiring new knowledge.
The tasks queued to be executed by worker threads are nothing more than instances implementing the Runnable interface. This latter interface is already used to implement handcrafted concurrent Java applications: they contain the workload to be performed by threads when they are used without any task framework. This simplicity explains the wide adoption of tasks frameworks amongst the Java community.
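The pattern described above can be sketched in a few lines (standard java.util.concurrent API; the class name PoolDemo is ours). Anonymous Callable instances are used here to stay compatible with Java 7:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolDemo {
    public static void main(String[] args) throws Exception {
        // One worker thread per logical core, as discussed in the text
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        // Queue more tasks than workers: the pool schedules them on its workers
        List<Future<Integer>> results = new ArrayList<Future<Integer>>();
        for (int i = 0; i < 4 * workers; i++) {
            final int n = i;
            results.add(pool.submit(new Callable<Integer>() {
                public Integer call() { return n * n; }
            }));
        }
        long sum = 0;
        for (Future<Integer> f : results) sum += f.get();
        pool.shutdown();
        System.out.println(sum); // sum of squares 0..(4*workers - 1)
    }
}
```

The workload is expressed entirely through tasks; the worker threads themselves never appear in user code.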
However, ThreadLocalRandom's internal mechanisms make it unable to handle tasks. Most of the features provided by ThreadLocalRandom to distribute pseudorandom streams among threads lie behind the current() method. The latter is a static method that every thread must call in order to retrieve its own ThreadLocalRandom instance. The method basically acts like a singleton that builds a ThreadLocal instance parameterized with the PRNG class, ThreadLocalRandom in our case. ThreadLocal is a generic Java class that appeared with JDK 1.2 and provides easy copy-on-access facilities to concurrent threads. When a thread first accesses a ThreadLocal object, it gets its own copy of the object, which no longer requires synchronized accesses with other threads. Typical applications of this mechanism are thread-based counters such as thread identifiers. The ThreadLocal mechanism only operates at the thread level and is not aware of any task concept introduced by higher-level frameworks such as Fork/Join. Thus, reproducibility cannot be expected when ThreadLocalRandom is used by tasks from these frameworks.
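The copy-per-thread behavior of ThreadLocal can be illustrated as follows (the class name ThreadLocalDemo is ours): each thread that touches the ThreadLocal lazily receives its own independent copy of the initial value, so the two threads below never see each other's counter.

```java
public class ThreadLocalDemo {
    // Each thread gets its own counter, initialized lazily on first access
    private static final ThreadLocal<int[]> counter = new ThreadLocal<int[]>() {
        @Override protected int[] initialValue() { return new int[]{0}; }
    };

    public static void main(String[] args) throws InterruptedException {
        Runnable bump = new Runnable() {
            public void run() {
                for (int i = 0; i < 5; i++) counter.get()[0]++;
                // Always prints 5: the copies are independent per thread
                System.out.println(Thread.currentThread().getName()
                        + " -> " + counter.get()[0]);
            }
        };
        Thread a = new Thread(bump), b = new Thread(bump);
        a.start(); b.start(); a.join(); b.join();
    }
}
```

Two tasks executed successively by the same worker thread would, by contrast, observe the same copy, which is exactly the problem described above.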

RELATED WORKS
Several attempts to provide a user-friendly interface to generate random numbers in parallel environments can be found in the literature. Here, we recall the major proposals that can compete with and replace ThreadLocalRandom in scientific applications. We only consider frameworks that provide ways to automatically distribute pseudorandom streams through threads without the user's help.
As we have seen previously, the standard Java library only ensures thread safety through synchronized methods when accessing the random number generation features of the java.util.Random class. This approach is not satisfying in the world of high performance computing: in addition to not ensuring reproducibility of simulations, because of thread scheduling, and to scaling problems, it impacts the performance of parallel stochastic applications because of the sequential bottleneck implied by the synchronization guarding the random facilities. This method to partition pseudorandom sequences is known as central server in the literature [3].
JAPARA (Free and Open Source Software) [22] was proposed by Coddington and Newell in 2004 to tackle this lack in Java libraries. They bring up a Java API to support parallel generation of random streams. JAPARA proposes that every PE (Java threads in that case) handles its own pseudorandom stream. In doing so, only the initialization phase is synchronized, and a referenced partitioning technique is then used to distribute the underlying pseudorandom streams. JAPARA comes with three PRNGs implemented, each coupled with a distribution technique that matches its intrinsic characteristics. The user only has to select the PRNG he wants to employ and then rely on the framework to ensure independence between the different streams assigned to the threads. Furthermore, JAPARA allows the user to save and restore the current state of a PRNG, thus permitting to checkpoint a simulation.
After having first proposed a random number package with splitting facilities [23], L'Ecuyer's team proposed an object-oriented pseudorandom generation package in 2002 [24]. It provides a C++ implementation of the MRG32k3a PRNG, whose independent streams are partitioned from an original stream thanks to the sequence splitting technique [3]. A Java version comes with the Stochastic Simulation in Java [25] framework and its RandomStream interface, the pseudorandom stream parallelization utility of the library. It provides a greater set of PRNGs (including the famous Mersenne Twister [26], for instance) and a compliant set of distribution techniques.
The latest Java random number generation framework that has retained our attention is DistRNG (Free and Open Source Software) [4, 27]. Although its API does not diverge from the two other proposals described in this section, DistRNG focuses on the correct partition of random streams. To do so, this framework handles generic XML statuses that can model any PRNG state. Every computational element is initialized with a different XML status that needs to be built upstream. DistRNG displays a fine choice of statistically sound PRNGs according to the TestU01 reference testing library. Now considering other languages, SPRNG [17] is one of the most widely used libraries in C++ to automatically distribute pseudorandom streams across message passing interface (MPI) [28] processes. Although this framework achieves the same result as ThreadLocalRandom for MPI processes, it takes advantage of the MPI rank (the MPI term for process identifier) assigned to processes by MPI, whereas we cannot rely on such a mechanism in Java.
Random123 [21] looks like an interesting development, especially for memory-constrained environments such as graphics processing units. This set of PRNGs has shown a good statistical quality when faced with empirical testing batteries such as TestU01 [9]. Still, beyond the fact that the period length of these PRNGs is long, we cannot assess uniformity theoretically via a spectral test but only empirically by typical tests. Consequently, we are still waiting for more thorough studies regarding these algorithms from domain experts before using them in our own developments.
In conclusion, this section has shown that several satisfying proposals of APIs for parallel pseudorandom number generation can be found in the literature. Consequently, users have many reliable solutions at their disposal if they want to take advantage of statistically sound pseudorandom sequences in their Java applications. Moreover, most of these solutions can replace ThreadLocalRandom features but require modifications on the application source code to meet their functioning requirements. Still, none of these frameworks integrates the task notion, thus calling for further developments if they are to be used within a task execution framework.

TASKLOCALRANDOM IMPLEMENTATION
In this section, we present our Java solution enabling pseudorandom stream distribution across Java tasks. Apart from our software engineering inputs, the PRNG algorithm we use is a wrapper of the RNGStream class from Pierre L'Ecuyer [29]. It implements the MRG32k3a PRNG algorithm described in [9]. Recall that our software engineering inputs aim at providing each task of a Java application with a different pseudorandom sequence. Several features of MRG32k3a retained our attention, from its internal data structure to the results it displays when faced with today's most stringent testing batteries.

The choice of MRG32k3a
Regarding its internal properties, MRG32k3a is really suited to parallelization among small computational elements such as threads and tasks, because its lightweight data structure only stores six integers to handle its state. The algorithm itself is relatively short, relying on simple operations only to issue new random numbers. The parameters chosen for MRG32k3a are such that it has a full period of approximately 2^191 numbers [9]. This period is long enough with regard to what Pierre L'Ecuyer suggests: periods between 2^100 and 2^200 are highly sufficient [20]. Even with our modern large-scale simulations, with computing power going to Exascale, we will not reach 2^200. MRG32k3a has been designed to produce independent streams and substreams from its original random sequence thanks to parameters that enable safe sequence splitting. Thus, the internal parameters split the initial sequence into 2^64 adjacent streams of 2^127 random numbers, themselves divided into substreams containing 2^76 elements each. The ability to issue streams as independent as possible is very important when tackling the safe distribution of random numbers across parallel computational elements. The sequence splitting approach of MRG32k3a suggests an obvious partition of the original sequence by assigning each computational element a stream or a substream, depending on the application's eagerness for random numbers. As long as we are focusing on parallel applications that are Java task-based, the parallel grain is limited to how many tasks a single many-core machine can handle. This figure depends on the underlying architecture hosting the Java platform, but we really do not expect to deal with more than 2^64 parallel tasks, the total number of independent streams, bearing 2^127 random numbers each, that MRG32k3a can provide.
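The stream arithmetic above is easy to verify: the 2^64 streams of 2^127 numbers exactly tile the 2^191-number period, and each stream contains 2^51 substreams of 2^76 numbers. A quick check with java.math.BigInteger (the class name Mrg32k3aArithmetic is ours):

```java
import java.math.BigInteger;

public class Mrg32k3aArithmetic {
    public static void main(String[] args) {
        BigInteger two = BigInteger.valueOf(2);
        BigInteger streams      = two.pow(64);   // independent streams
        BigInteger streamLen    = two.pow(127);  // numbers per stream
        BigInteger substreamLen = two.pow(76);   // numbers per substream
        // streams x stream length covers the full 2^191 period
        System.out.println(streams.multiply(streamLen).equals(two.pow(191)));   // true
        // each stream holds 2^(127-76) = 2^51 substreams
        System.out.println(streamLen.divide(substreamLen).equals(two.pow(51))); // true
    }
}
```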
In addition, and most importantly, this generator displays a great statistical quality according to its TestU01 results related in [18]. MRG32k3a passes all the tests of BigCrush, the most stringent and complete testing battery that comes with TestU01, and is thus referred to as a 'Crush-resistant' PRNG in [21]. Although being Crush-resistant cannot ensure a perfect randomness of the considered pseudorandom stream, it is a satisfying property of which few PRNGs can be proud; even the current Mersenne Twister family of generators fails some (very limited) tests. PRNGs stated as bad according to TestU01 criteria have led to incorrect simulation results in the past [7, 11, 30], and even good PRNGs can miss some tests [21, 27]. As we did not want to take any risks with our PRNG choice as a replacement for the LCG of ThreadLocalRandom, we focused on Crush-resistant PRNGs such as MRG32k3a.

Assigning independent pseudorandom streams to different tasks
Provided that we are able to uniquely identify tasks (this aspect will be tackled in Section 4.3), an independent pseudorandom sequence can be assigned to each of them. This section determines how these sequences are actually handled within our MRG32k3a implementation. We have seen previously that this PRNG has been designed to partition its original sequence into streams and substreams. We have chosen to give an independent stream to each task, so that they can all benefit from their own independent pseudorandom sequence of 2^127 numbers. As streams are contiguous in the original sequence, the beginning state of each independent stream is located every 2^127 elements in the original sequence. Fortunately, [9] details a jump-ahead algorithm that enables us to advance the state of the original sequence at almost no extra cost, no matter how many elements we skip. Thus, if a task has been assigned an identifier k, the seed-status of its TaskLocalRandom instance is initialized by the constructor to X_n with n = k × 2^127. The latter situation is summed up in Figure 2.
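TaskLocalRandom itself wraps MRG32k3a, but the assignment principle can be illustrated with a deliberately simplified, hypothetical counter-based generator (all names here, ToyStream, forTask, STREAM_LEN, are ours and not part of TaskLocalRandom): jumping ahead lets task k start exactly k × STREAM_LEN positions into the global sequence, at constant cost.

```java
// Illustrative sketch only: a toy counter-based generator stands in for
// MRG32k3a, whose real jump-ahead works on its recurrence matrices.
public class ToyStream {
    static final long STREAM_LEN = 1L << 20; // toy stream length, not 2^127

    private long state;

    private ToyStream(long start) { this.state = start; }

    // Jump ahead to the beginning of stream k in O(1)
    static ToyStream forTask(long taskId) {
        return new ToyStream(taskId * STREAM_LEN);
    }

    // SplitMix64-style output function applied to a counter
    long next() {
        long z = (state++) + 0x9E3779B97F4A7C15L;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    public static void main(String[] args) {
        // Task 1's first draw equals the global sequence's draw at index STREAM_LEN
        ToyStream global = ToyStream.forTask(0);
        for (long i = 0; i < STREAM_LEN; i++) global.next();
        System.out.println(global.next() == ToyStream.forTask(1).next()); // true
    }
}
```

For a counter-based toy the jump is a mere addition; for MRG32k3a, the jump-ahead of [9] achieves the same constant cost with precomputed matrix powers.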

The challenge of uniquely identifying tasks
The main struggle at the heart of TaskLocalRandom is to provide a unique task identifier, which the Java language does not support at the time of writing. As explained previously, ThreadLocalRandom benefits from the ThreadLocal mechanism of the Java SDK. ThreadLocal relies on Java Native Interface calls, which means that its implementation is directly tied to the underlying operating system that supports the JVM. Each operating system deals with threads in its own way, using native APIs. However, these native APIs do not provide a common concept of threads. It seems consequently difficult to implement a mechanism equivalent to ThreadLocal at the task level. Two approaches can be considered to work around this lack: either each task can autonomously distinguish itself from the others using a particular algorithm, or a central element in the system needs to uniquely identify each task.
In the first case, the most widespread algorithm used to provide unique identifiers without the help of a central element is called universally unique identifier (UUID) [31]. It was designed for the purpose of online Internet services and is now frequently exploited in programming techniques to distinguish elements. For instance, Java uses it to allot a unique version number to classes that support serialization. Several algorithms are referenced by the Request for Comments standard to produce UUIDs. UUIDs issued by the UUID class of the Java SDK are of version 4, meaning that the underlying algorithm is powered by a PRNG. The actual PRNG algorithm is not explicitly mentioned, but it is stated as cryptographically secure by the documentation. More pragmatically, a 128-bit identifier issued by UUID would have a 50% chance of overlapping with another one only if one billion UUIDs had been picked up every second for 100 years. Consequently, this approach is reliable enough when it comes to generating unique random identifiers.
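A minimal sketch of the UUID facility discussed here (standard java.util.UUID API; the class name UuidDemo is ours):

```java
import java.util.UUID;

public class UuidDemo {
    public static void main(String[] args) {
        // randomUUID() produces a version-4 (PRNG-based) 128-bit identifier
        UUID id = UUID.randomUUID();
        System.out.println(id + " version=" + id.version());
        assert id.version() == 4;
        // 128 bits do not fit the 32-bit input of MRG32k3a's jump-ahead:
        // truncating to an int would reintroduce a collision risk
        int truncated = (int) id.getLeastSignificantBits();
        System.out.println("truncated (unsafe) id: " + truncated);
    }
}
```

The truncation on the last line is exactly what the next paragraph rules out.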
Still, UUIDs would directly represent the identifier of the task in our case. Let us recall that this identifier is also at the heart of the jump-ahead algorithm of the MRG32k3a PRNG, which allows it to assign independent random streams to each task. Unfortunately, this jump-ahead algorithm only accepts 32-bit integers as input to determine the number of streams to jump over. In order to preserve the uniqueness characteristics of UUIDs, we cannot shrink them from 128 bits to 32 without introducing a risk of collision between two UUIDs. As a result, UUIDs are not a satisfying approach to uniquely identify tasks in our case. The other option to achieve unique task identification is to request the identifiers atomically from a central element. This way of getting identifiers has the drawback of creating a bottleneck at task creation, when each new task claims its own identifier. Although this assertion is technically true, it is important to consider its impact in a more pragmatic way. To do so, let us figure out the typical number of tasks that might be created at some point in an application.
Tasks are typically created prior to any execution launched in workers. Still, we can imagine that tasks are spawned in parallel in order to speed up this initialization stage. Then, the maximum number of tasks created at a given time cannot exceed the number of worker threads leveraged by the application. We know that the number of workers created is bound to the number of logical cores available in the machine. Any greater number of worker threads would quickly make the performance of the application drop. Thus, the number of worker threads, and consequently the maximum number of tasks potentially created at a given time, will remain in the range of the number of logical cores. At the time of writing, this number can grow up to hundreds of logical cores in cutting-edge high performance computing hosts. In Java, the number of logical cores available is obtained through a call to Runtime.getRuntime().availableProcessors().
With this in mind, we studied the execution time of various numbers of sequential calls to the getTaskId() method of our own Runnable implementation. The numbers of calls were chosen not only to match the typical number of logical cores found in today's systems, but also to extrapolate any leap this figure could see in the future. Table I sums up the results of this small experiment, executed on an aging Intel Core 2 Duo (Intel Corporation, Santa Clara, CA, USA) running at 2.8 GHz.
As the results in Table I show, even a large number of calls to getTaskId() does not introduce a noticeable overhead in applications using TaskLocalRandom. Moreover, any contention that could appear when the first tasks are started will quickly vanish for scheduling reasons: the worker threads will rarely all request a new task identifier at exactly the same time. In conclusion, the central-element approach is satisfying because it fulfills our needs without impacting the computation time of the application.
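The experiment behind Table I can be reproduced in a few lines; this sketch times a burst of sequential identifier requests against a shared atomic counter (the call count of 2^20 is an arbitrary choice of ours, far beyond current core counts):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class TaskIdTiming {
    public static void main(String[] args) {
        AtomicInteger counter = new AtomicInteger(0);

        // Emulate the worst case: a burst of sequential identifier
        // requests, as if many tasks were spawned at once.
        int calls = 1 << 20;
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            counter.getAndIncrement();
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(calls + " identifiers issued in " + elapsedMs + " ms");
    }
}
```

Absolute timings are of course machine dependent; the point is that even millions of requests complete in a negligible fraction of any simulation's run time.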

Implementation details
We designed TaskLocalRandom to be used as a drop-in alternative to ThreadLocalRandom, so it exposes the very same interface as its counterpart. The methods of our class produce two kinds of random output: double-precision floating-point values and integers. These are the two data types handled by the original MRG32k3a implementation described in [24]. Double-precision numbers are natively produced by the algorithm, which manipulates 64-bit floating-point values at its heart in order to take advantage of hardware-implemented operations on modern CPUs.
In contrast with ThreadLocalRandom, TaskLocalRandom does not inherit from java.util.Random, which contains superfluous methods directly bound to the underlying LCG of that class. Still, the methods in TaskLocalRandom respect the same interface as java.util.Random, so that a minimal compatibility is maintained without harming our design.
Although TaskLocalRandom may sound similar to the implementation from [24], only its rather classical interface is mimicked. Let us recall that TaskLocalRandom is task-aware: it can therefore be employed safely by users to produce highly independent pseudorandom sequences within the tasks of their parallel Java applications.
The Java implementation of the central element described in Section 4.3 is achieved through a new abstract class called RandomSafeRunnable. This class implements the Runnable interface traditionally used to describe the behavior of tasks and threads in concurrent Java applications. RandomSafeRunnable stores the identifier of each new task using a single instance of the class AtomicInteger, available in the java.util.concurrent.atomic package. The counter is initialized to 0 before any task creation; the constructor of RandomSafeRunnable then calls the thread-safe getAndIncrement method of the AtomicInteger object. The result of this call acts as the unique task identifier for the lifetime of the task represented by an instance of RandomSafeRunnable.
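A minimal sketch of the class just described (the field and counter names are ours, not necessarily those of the actual implementation):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of RandomSafeRunnable: a single shared AtomicInteger hands out
// a unique identifier to every task instance at construction time.
abstract class RandomSafeRunnable implements Runnable {

    // Central counter, initialized to 0 before any task is created.
    private static final AtomicInteger NEXT_ID = new AtomicInteger(0);

    // Identifier fixed for the lifetime of this task.
    private final int taskId;

    protected RandomSafeRunnable() {
        // Thread-safe: two tasks can never obtain the same identifier,
        // even when they are constructed concurrently.
        this.taskId = NEXT_ID.getAndIncrement();
    }

    public int getTaskId() {
        return taskId;
    }
}
```

A concrete task simply extends this class and implements run(); its pseudorandom stream is then derived from getTaskId() rather than from the worker thread it happens to execute on.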
Please note that the way TaskLocalRandom is implemented also ensures that a new task will not inherit the identifier of a formerly completed task. In the same situation, tasks making use of ThreadLocalRandom would be assigned the same identifier by the JVM, leading different tasks to exploit the same pseudorandom stream.
To concretely assign a unique pseudorandom sequence to each task, the jump-ahead algorithm evoked in Section 4.2 is called with the identifier of the task as a parameter. Each uniquely identified task is thus assigned the stream corresponding to its identifier, streams being labeled from the starting point of the original MRG32k3a pseudorandom stream. Please note that this is a design choice we made to assign a different pseudorandom stream to each task. There are situations where each task may require several distinct streams. In that case, TaskLocalRandom could be extended to enable the use of the intrinsic substreams of MRG32k3a, which split each stream into 2^76-element-long substreams.
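The jump-ahead computation of MRG32k3a itself is beyond the scope of a short listing, but the underlying principle, advancing a recurrence by n steps in O(log n) time, can be illustrated with a toy LCG. The constants below are Knuth's MMIX parameters, not those of MRG32k3a; the jump works by binary exponentiation of the affine map x -> a*x + c (mod 2^64):

```java
// Toy LCG with O(log n) jump-ahead, illustrating how a task identifier
// can select a stream. Not MRG32k3a: a didactic stand-in only.
public class JumpableLcg {
    private static final long A = 6364136223846793005L; // Knuth MMIX
    private static final long C = 1442695040888963407L;

    private long state;

    public JumpableLcg(long seed) { this.state = seed; }

    // One sequential step (mod 2^64 via natural long overflow).
    public long next() {
        state = A * state + C;
        return state;
    }

    // Advance n steps by composing the affine map with itself through
    // repeated squaring: (a1,c1) o (a2,c2) = (a1*a2, a1*c2 + c1).
    public void jump(long n) {
        long accA = 1L, accC = 0L;      // identity map
        long curA = A, curC = C;        // map applied 2^i times
        while (n > 0) {
            if ((n & 1L) == 1L) {
                accC = curA * accC + curC;
                accA = curA * accA;
            }
            curC = curA * curC + curC;  // square the current map
            curA = curA * curA;
            n >>>= 1;
        }
        state = accA * state + accC;
    }

    public static void main(String[] args) {
        // Hypothetical stream assignment: task i jumps over i blocks.
        long streamLength = 1L << 32;
        int taskId = 3;
        JumpableLcg g = new JumpableLcg(12345L);
        g.jump(taskId * streamLength);  // start of this task's stream
        System.out.println(g.next());
    }
}
```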

The implementation detailed in this section makes TaskLocalRandom the equivalent of ThreadLocalRandom with regard to its API and features. However, our proposal is better suited to parallelizing scientific applications, where statistically sound random sources are necessary, and it also fulfills the requirements of Java tasks frameworks. That being said, let us compare the performance of ThreadLocalRandom and TaskLocalRandom.

RESULTS
In this part, we compare three aspects of the original ThreadLocalRandom with our proposal TaskLocalRandom: their memory footprint, their number throughput, and their statistical quality. Then, TaskLocalRandom is compared with the other software tools from the literature introduced in Section 3.

Memory footprint and speed
ThreadLocalRandom wraps an LCG that uses only one integer to store its internal state, whereas MRG32k3a needs at least six integers. TaskLocalRandom also relies on an extra task identifier to provide reproducibility as required by stochastic simulations. Thus, ThreadLocalRandom is more efficient in terms of memory footprint.
Considering speed, it is hard to accurately isolate the methods involved in random number generation across several threads. That is why we based our comparison on data produced by the VisualVM ¶ profiler to figure out which algorithm was the most efficient. These results show that TaskLocalRandom is about twice as fast as ThreadLocalRandom, requiring about 0.5 ms to draw a random number whereas ThreadLocalRandom requires about 0.8 ms. Our Java wrapper therefore does not impact the original speed of the MRG32k3a algorithm. MRG32k3a is actually reported to be faster than the LCG used by ThreadLocalRandom in [9].

Statistical quality
We have already discussed the statistical quality of LCGs, but in our case the LCG at the heart of ThreadLocalRandom is used in parallel through the random spacing distribution technique. When parallelizing an application, data processing is spread among the available computational elements following a particular pattern: the whole range of input data is regularly sliced to feed each computational element. The same configuration applies to pseudorandom numbers: each thread or task receives its own pseudorandom stream and uses it to process its part of the data. The data of the corresponding sequential process would be equivalent not only to a concatenation of all the data chunks, but also of the pseudorandom streams used. As a result, knowing the parallelization techniques used for both random numbers and input data, we can recreate the computation scenario that would have taken place in a sequential environment. This allows us to check the random sequence resulting from the concatenation of the subsequences. Although two or more pseudorandom sequences considered independently can produce good statistical results, their combination can behave differently when faced with the same statistical tests [32].
Since it is practically impossible to examine every possible combination, we decided to focus on the most obvious technique to process input data: assigning an equally sized subset of the original data to each task. This situation is sketched in Figure 3. Please note that for the purpose of this test, we fall back to standard Java threads so that ThreadLocalRandom can compete fairly with TaskLocalRandom. TaskLocalRandom can actually handle pseudorandom stream distribution across both threads and tasks, the latter being impossible for ThreadLocalRandom. Still, this parameter does not impact the results of our experiment.
To simulate this situation, we submitted the two PRNGs to TestU01. The random stream studied by the testing battery was obtained by concatenating the substreams of a given number of threads. In Table II, each PRNG is tested using combined streams corresponding to the concatenated random sequence of 16 to 64 threads. Table II shows that using MRG32k3a instead of the LCG implemented in ThreadLocalRandom is particularly relevant when considering the statistical output of both generators. None of the 180 configurations of ThreadLocalRandom tested passes the TestU01 BigCrush testing battery, whereas TaskLocalRandom does not produce any failed output. This figure backs our choice of underlying algorithm for TaskLocalRandom. A more thorough examination of the multiple streams of MRG32k3a can be found in [24], whose authors have tested their algorithm both theoretically via the spectral test and empirically through statistical tests.
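The construction of the stream submitted to the battery can be sketched as follows. Here, java.util.Random seeded per thread is only a stand-in for the actual per-thread substreams, and the thread and block counts are illustrative:

```java
import java.util.Random;

public class ConcatenatedStreams {
    // Rebuild the sequence the equivalent sequential run would have
    // consumed: the per-thread substreams laid end to end. The per-thread
    // Random instances stand in for properly partitioned streams.
    public static double[] concatenated(int threads, int perThread) {
        double[] out = new double[threads * perThread];
        for (int t = 0; t < threads; t++) {
            Random stream = new Random(t);        // stand-in substream
            for (int i = 0; i < perThread; i++) {
                out[t * perThread + i] = stream.nextDouble();
            }
        }
        return out;                               // fed to the battery
    }

    public static void main(String[] args) {
        double[] seq = concatenated(16, 4);
        System.out.println(seq.length + " draws in sequential order");
    }
}
```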

Table III compares these libraries along three criteria: how many PRNG algorithms are embedded in the library, whether the library automatically handles pseudorandom stream distribution across threads, and whether it does so across tasks. As Table III shows, TaskLocalRandom is the only library that automatically takes pseudorandom stream distribution into account at the task level, whereas the others force developers to devise a distribution scheme across the tasks of their concurrent applications.

DISCUSSION
In this paper, we propose a Java implementation of L'Ecuyer's MRG32k3a that behaves correctly when used with Java tasks frameworks. However, simulation practitioners often expect to challenge their stochastic models with different random sources, so providing a wider set of PRNGs is relevant for the simulation community. Such a complete framework would obviously expose an API identical to TaskLocalRandom's. In this section, we review the algorithms that we plan to include in future versions of this work.
Having already considered a sequence splitting partitioning technique with MRG32k3a, we chose to focus on another highly reliable distribution technique: parameterization [3]. Whereas sequence splitting slices an original random sequence into several independent random streams, parameterization tackles the problem differently: PRNGs employing parameterization own a parameter that distinguishes one instance of a given PRNG from another. This unique parameter then contributes to issuing highly independent random streams that can be assigned to different PEs, such as tasks.
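The idea can be illustrated with a toy keyed, counter-based generator. This is a didactic sketch of ours, not any of the PRNGs discussed here: the mixing function is the well-known SplitMix64 finalizer, and the key plays the role of the per-PE parameter:

```java
// Toy illustration of parameterization: every PE owns a generator
// distinguished by a unique key, and stream independence comes from the
// key rather than from slicing one master sequence.
public class KeyedGenerator {
    private final long key;   // unique per PE, e.g. the task identifier
    private long counter;     // position within this PE's stream

    public KeyedGenerator(long key) { this.key = key; }

    public long next() {
        // Advance the counter, perturb it with the key, then scramble
        // with the SplitMix64 finalizer (a bijection on 64-bit values).
        long z = (counter++) * 0x9E3779B97F4A7C15L + key;
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    public static void main(String[] args) {
        // Two PEs, two keys, two distinct reproducible streams.
        System.out.println(new KeyedGenerator(1L).next());
        System.out.println(new KeyedGenerator(2L).next());
    }
}
```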

TinyMT
TinyMT (Free and Open Source Software) is the latest offspring of the Mersenne Twister family. TinyMT is not yet described in any scientific article, but information about it can be found on its dedicated webpage [33]. This PRNG matches the requirements we have formulated for a PRNG to be integrated into TaskLocalRandom: it is stated to produce a good quality output, according to TestU01 statistical tests, and displays a long-enough period of 2^127 numbers. As explained in the introduction of this section, this PRNG champions parameterization to provide highly independent streams. It is shipped with an adapted version of the dynamic creator (DC) software tool [34] that can create more than 2^32 × 2^16 highly independent statuses. As always with Mersenne Twister-like PRNGs, this algorithm is based upon linear recurrences. Matsumoto and Nishimura assume that 'a set of PRNGs based on linear recurrences is "mutually" independent if the characteristic polynomials are relatively prime to each other' [34].
We are now considering the implementation of this PRNG as another alternative to ThreadLocalRandom. Part of this development work will be close to what has already been achieved with the implementation of MRG32k3a. However, this PRNG may prove less flexible than MRG32k3a because its parameterized statuses need to be precomputed by the DC algorithm. DC relies on several C++ libraries and would thus be difficult to reimplement in Java in a portable way. Consequently, to provide a fully concurrent Java implementation, we not only need to implement the algorithm, but also to ship precomputed statuses with it. The point is to find a trade-off between a sufficient number of parameterized statuses and a reasonable memory footprint, so that the PRNG alone does not bloat the whole application. Each task will then receive an instance of TaskLocalTinyMT initialized with a different status. Because the data structure representing a status weighs no more than about a hundred bytes, delivering many ready-to-use parameterized statuses should be possible.
Counter-based PRNGs such as Threefry and Philox appear better suited (than the Mersenne Twister family) to a smooth integration in Java tasks frameworks, because their parameters are formed by a single key that can be set at runtime according to each task's unique identifier. Indeed, Mersenne Twister-like PRNGs might not fit applications that cannot afford to waste any memory space storing the state and initialization parameters of each task's PRNG.

CONCLUSION
This work has studied the recent ThreadLocalRandom proposal shipped with JDK 7, which intends to provide independent random streams to parallel Java applications. Having stressed the importance of using statistically sound PRNGs and partitioning techniques, we have asserted that Crush-resistant generators are, in our opinion, the only category of generators that should be trusted for scientific application development. Considering this criterion, we have evaluated ThreadLocalRandom as having a satisfying design but a poor implementation. Furthermore, ThreadLocalRandom is intrinsically unable to deal with tasks executed within Java thread pools: it assigns the same pseudorandom stream to all the tasks handled by the same worker thread. We have detailed why this behavior is clearly unsuitable for scientific applications.
Additionally, this study surveys the most widespread libraries targeting the same goal as ThreadLocalRandom but displaying improved quality. We strongly recommend some of them, such as SSJ (Stochastic Simulation in Java) or DistRNG, to replace ThreadLocalRandom wherever possible.
In addition, this work proposes TaskLocalRandom as another alternative to ThreadLocalRandom. Our proposal respects the same API as ThreadLocalRandom, but it relies on MRG32k3a, a well-known Crush-resistant PRNG. TaskLocalRandom not only displays a far better statistical quality than its JDK counterpart, but it is also much better suited to scientific applications, given that it produces a reproducible output by default. TaskLocalRandom is slightly greedier than ThreadLocalRandom in terms of memory consumption, but it outperforms its counterpart in both speed and statistical quality. According to our measures, TaskLocalRandom is about twice as fast as ThreadLocalRandom and passes all the tests of BigCrush, the most stringent testing battery from TestU01.
The major contribution of TaskLocalRandom lies in its cooperation with the RandomSafeRunnable abstract class, also introduced in this study. This pair of classes enables a correct distribution of pseudorandom streams among tasks. It is, to our knowledge, the sole PRNG facility that can be used safely within a Java task framework such as Executors or Fork/Join.
Among the simulation community, it is considered safe practice to check the results of a stochastic simulation against several PRNGs relying on different internal mechanisms. This is why we now plan to implement other Crush-resistant PRNG algorithms, such as TinyMT, Threefry, or Philox, that display statistical properties equivalent to MRG32k3a. This effort would allow simulation practitioners to compare the results of their simulations when fed with different random sources. Simulationists could then change the PRNG they use in an instant and still benefit from a correct pseudorandom stream distribution across their Java tasks.