Équipe-projet Symbiose, IRISA/CNRS, Campus de Beaulieu, Rennes, France

Curso Experimental de Ciências Moleculares da Universidade de São Paulo, Brazil

Instituto de Matemática e Estatística da Universidade de São Paulo, Brazil

Dipartimento di Informatica, Università di Pisa, Italy

Équipe BAOBAB, Laboratoire de Biométrie et Biologie Evolutive (UMR 5558); CNRS; Univ. Lyon 1, Villeurbanne Cedex, France and Équipe-Projet BAMBOO, INRIA Rhône-Alpes, France

King's College, London, UK

Abstract

Background

Identifying local similarity between two or more sequences, or identifying repeats occurring at least twice in a sequence, is an essential part of the analysis of biological sequences and of their phylogenetic relationship. Finding such fragments while allowing for a certain number of insertions, deletions, and substitutions is, however, known to be a computationally expensive task, and consequently exact methods usually cannot be applied in practice.

Results

The filter

Conclusion

To the best of our knowledge,

Background

Repeats in genomes come in many forms: satellites, which are approximate repeats of a pattern of up to a few hundred base pairs appearing in tandem (consecutively) along a genome; segmental duplications, defined as duplications of a DNA segment longer than 1 kb; and transposable elements, which are sequences of DNA that can move to different positions within a genome in a process known as transposition, or retrotransposition if the element is first copied and the copy then moved. The last two are repeats dispersed along a genome. Most such repeats appear in intergenic regions and were long believed to be "junk" DNA, that is, DNA that has no specific function, even though the proportion of repeated segments in a genome can be huge. Transposable elements alone cover, for example, up to 45% of the human genome and 80% of the maize genome. This view of repeats as "junk" is changing though.

It is believed, for instance, that transposable elements may have been co-opted by the vertebrate immune system as a means of producing antibody diversity. Transposable elements are also thought to participate in gene regulation. This role had been suggested in the early 1950s by the discoverer of transposable elements herself, Barbara McClintock (she called such elements "mobile"), but she gave up publishing data supporting this idea in view of the strong opposition she met from the academic world. The idea however stubbornly resisted denial or indifference and was resurrected much later. The paper of Lowe

The relation between satellites and recombination, and therefore between satellites and certain types of rearrangements, also seems clear. Less clear is the relation, direct or indirect, that satellites may have with gene regulation, although it is increasingly suspected that such a relation exists, for instance mediated by the chromatin

The quantity of DNA in repeated sequences, the frequency of the repeat (that is, the number of times a given sequence is present per genome), and its conservation show great variability across species. Frequencies from 100 to 1,000,000 have been observed, and the quantities of DNA involved range from 15 to 80 percent of a whole genome. Families of repeated sequences exhibit a degree of similarity among their members varying from perfect matching to matching of only two-thirds of the nucleotides. All these characteristics, plus the fact that identifying such repeats requires working with whole genomes, that is, with very long "texts", make the identification of repeated elements a very hard computational problem.

In this paper, we focus on the problem of finding long multiple repeats that may appear dispersed along one whole genome or chromosome, or be common to different genomes/chromosomes. More precisely, since we are working with very long texts, we focus on the problem of filtering one or more sequences prior to a full identification of the multiple repeats they may contain. Informally put, the idea is to eliminate from the input sequence(s) as many regions as possible that are guaranteed not to contain any repeats of the type and characteristics specified. In some cases, the filter may be efficient enough that it eliminates all regions except those precisely corresponding to the repeats.

In the last few years, there has been an increasing number of papers on the topic of filtering sequences prior to further processing them. The motivations are varied, and include pattern matching

This trend has been motivated by the fact that the problem of aligning sequences has scaled up considerably with the increasing number of genomes, notably of eukaryotes, that are being entirely sequenced and annotated. We say that a filter is

To the best of our knowledge, filters for multiple repeats that take a multiple alignment condition into consideration have been addressed only in

Since we are not aware of any other work providing a filter for multiple repeats, in particular with the same type of output, we do not consider other methods to compare directly with

The weakest of the filtering conditions we use corresponds to the filter used by

We tested

Our method may also be used to find anchors for global multiple aligners. We thus expect that our filter could serve as a preprocessing step to a local multiple alignment tool. To this purpose,

The rest of the paper is organised as follows. In the next section, we introduce formal definitions and the filtering conditions used in

Methods

Preliminary definitions

A _{1}, _{2},...,_{m }and in this case the term word is applied to a contiguous segment of one of the sequences _{1}, _{2},...,_{m}. The sequence

We define a

**Definition 1 **((**L, d, r**)**-repeat**)

Figure

A (

**A (L, d, 2)-repeat and a parallelogram**. An example of a (

Searching for multiple repeats means inferring all (L, d, r)-repeats.

**Definition 2 **

**Definition 3 **

Figure

Let us consider a word

Moreover, in order to make the filtering condition more stringent, we also require that the set above is such that for any pair of

In Figure

**Theorem 1 **

• there is no

•

• the sequence obtained from

•

Let now

• |

for all

We now prove the following lemma.

**Lemma 1 **

runs of

By the way

Observe that the above proof follows a reasoning somewhat similar to the one in

For any word _{k}, for _{k }have edit distance no more than _{k}, the computation of the edit distance would take as much as _{k }a set of
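A word pair within edit distance d has an alignment confined to a narrow band of diagonals, but even a banded dynamic-programming check costs time proportional to the band width times the word length, which is precisely the cost the filter avoids paying everywhere. A minimal sketch of such a banded computation (an illustration only, not part of the filter; the function name is ours):

```python
def banded_edit_distance(u, v, d):
    """Edit distance between u and v restricted to cells within d
    diagonals of the main one; returns d + 1 as soon as the true
    distance exceeds d (a capped, Ukkonen-style banded DP)."""
    n, m = len(u), len(v)
    if abs(n - m) > d:
        return d + 1
    INF = d + 1
    prev = [j if j <= d else INF for j in range(m + 1)]  # DP row 0
    for i in range(1, n + 1):
        curr = [INF] * (m + 1)
        if i <= d:
            curr[0] = i
        for j in range(max(1, i - d), min(m, i + d) + 1):
            cost = 0 if u[i - 1] == v[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,            # (mis)match
                          prev[j] + 1, curr[j - 1] + 1)  # indels
        prev = curr
    return min(prev[m], INF)
```

Note that a pair of words at edit distance at most d uses at most d indels, so its alignment path visits at most d + 1 consecutive diagonals; this is the geometric fact that makes parallelograms of d + 1 diagonals a natural surrogate for an edit-distance neighbourhood.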

**Definition 4 (parallelogram) **

In Figure

A few observations can be made. The first is that

Given a word _{k }that is candidate to be one of the _{k }= _{k }to within the parallelogram.

Consider now the word _{k}, which in turn is contained in the word

Detection of (

**Detection of (L, d, r)-repeats and two overlapping parallelograms**. Two parallelograms that overlap. The dark grey parallelogram in the figure detects

Notice that, since _{k }and

We say that two parallelograms Parall(

In Figure

In general, if Parall(_{k }= _{k }=

Since _{k }=

We say that a parallelogram is

_{i}, _{i }∈ {_{1},...,_{r}}.

It is worth noticing that

We are now going to see two more stringent filtering conditions leading to the additional conditions actually applied by

First, we require that the set of

_{i}, _{i }∈ {_{1},...,_{r}}.

Second, we can further require that the set of

_{i}, _{i }∈ {_{1},...,_{r}}.

Necessary condition applied by

Given a sequence

_{i}, _{i }∈ {_{1},...,_{r}}.

Description of the algorithm

We now give an overview of the algorithm applied by

For any possible q-gram, we store the list of its occurrences; the set of |Σ|^{q} occurrence lists is accessed through |Σ|^{q} pointers, one for each possible q-gram.
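As a concrete illustration of such a structure (a sketch, assuming the DNA alphabet and a base-|Σ| integer encoding of q-grams; not the tool's actual implementation):

```python
def build_qgram_index(s, q, alphabet="ACGT"):
    """Occurrence lists for every possible q-gram: index[c] lists, in
    increasing order, the starting positions of the q-gram whose
    integer (base-|alphabet|) code is c.  The array has |alphabet|**q
    entries, one per possible q-gram."""
    rank = {a: i for i, a in enumerate(alphabet)}
    sigma = len(alphabet)
    index = [[] for _ in range(sigma ** q)]
    code, msd = 0, sigma ** (q - 1)
    for i, ch in enumerate(s):
        code = (code % msd) * sigma + rank[ch]  # rolling q-gram code
        if i >= q - 1:
            index[code].append(i - q + 1)       # q-gram start position
    return index
```

With this layout, retrieving all occurrences of a q-gram is a single array access, and construction is a counting-sort-like linear pass over the sequence.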

We move a sliding window

Thus, in order to quickly verify which parallelograms are fine, we associate a

Observe that each

In this way, the parallelograms Parall(
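The counter maintenance just described can be sketched as follows. This is a simplified illustration under assumptions of ours, not the tool's implementation: `codes` holds the integer code of each q-gram of the text, `index` maps a code to its occurrence positions, and a parallelogram is approximated by a fixed bucket of `width` consecutive diagonals (the real algorithm anchors parallelograms more carefully).

```python
from collections import defaultdict

def good_windows(codes, index, w, q, width, threshold):
    """For each sliding window [t, t+w), maintain per diagonal bucket
    the number of q-hits it contains, updating counters only for the
    q-grams entering and leaving the window; report windows where some
    bucket reaches `threshold` hits."""
    counters = defaultdict(int)
    result = []

    def bucket(i, j):                 # diagonal j - i, grouped by width
        return (j - i) // width

    for i in range(0, w - q + 1):     # initialise with the first window
        for j in index.get(codes[i], ()):
            counters[bucket(i, j)] += 1
    for t in range(len(codes) - w + 1):
        if t > 0:
            i_in = t + w - q          # q-gram entering on the right
            for j in index.get(codes[i_in], ()):
                counters[bucket(i_in, j)] += 1
            i_out = t - 1             # q-gram leaving on the left
            for j in index.get(codes[i_out], ()):
                counters[bucket(i_out, j)] -= 1
        if any(c >= threshold for c in counters.values()):
            result.append(t)
    return result
```

Each q-hit contributes to a single, fixed bucket (its diagonal never changes), so the increments and decrements performed when the window slides are exactly symmetric, which is what makes the update incremental.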

Enlarging parallelograms from

**Enlarging parallelograms from d + 1 to d + b diagonals**. Four parallelograms with

Back to the pseudocode shown in Algorithm 1, for each value of

For a given sliding window

Consider two words

In practice, we define a new alphabet Σ^{q}. Given a sequence

**Definition 5 (Parallelogram q-hits Chaining Problem) **

Hence, in order to check at line 20 of Algorithm 1 whether a good parallelogram is excellent, we solve the Parallelogram
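The precise formulas of Definition 5 were lost in typesetting, but the flavour of such a chaining computation can be conveyed by a small quadratic dynamic program over the q-hits of one parallelogram (an illustration of chaining in general, not necessarily the exact strategy implemented by the tool):

```python
def longest_qhit_chain(hits, q):
    """hits: list of (i, j) start positions of q-hits inside one
    parallelogram.  Returns the maximum number of q-hits that can be
    chained so that successive hits do not overlap on either sequence
    (both coordinates must advance by at least q)."""
    hits = sorted(hits)
    best = [1] * len(hits)           # best[a] = longest chain ending at hit a
    for a in range(len(hits)):
        ia, ja = hits[a]
        for b in range(a):
            ib, jb = hits[b]
            if ib + q <= ia and jb + q <= ja:
                best[a] = max(best[a], best[b] + 1)
    return max(best, default=0)
```

A parallelogram whose longest chain covers enough of the window then witnesses the stronger (excellent) filtering condition.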

Moreover, we have designed an optimisation that uses some simple incremental information from the test for the sliding window

In order to check for non-overlapping repeats only, given a set of good/excellent parallelograms, at lines 18 and 22 of Algorithm 1 we look for a subset of non-overlapping parallelograms with maximal cardinality. This can easily be done by applying the following greedy strategy to the sequence of parallelograms ordered by increasing starting diagonals. Let Parall(
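The greedy strategy is correct here because all parallelograms in the candidate set span the same number of diagonals: for equal-length intervals, scanning in order of increasing starting diagonal coincides with the classic sort-by-endpoint strategy and therefore yields a maximum-cardinality non-overlapping subset. A sketch (names are ours):

```python
def max_nonoverlapping(starts, width):
    """starts: starting diagonals of the candidate parallelograms, all
    of which span `width` diagonals.  Greedily keep a parallelogram
    whenever it does not overlap the previously kept one."""
    chosen = []
    last_end = None
    for s in sorted(starts):
        if last_end is None or s >= last_end:
            chosen.append(s)
            last_end = s + width
    return chosen
```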

Looking for repeats across multiple sequences with

In this section, we describe what is done in order to look for repeats across multiple sequences s_{1}, s_{2},...,s_{m}. In this setting, the occurrences of an (L, d, r)-repeat may be spread over the sequences s_{1}, s_{2},...,s_{m}.

Simple modifications of the algorithm allow it to handle the sequences s_{1}, s_{2},...,s_{m}. While sliding the window on a word of s_{i}, we look for fine/good/excellent parallelograms in all other sequences, as shown in the figure; if enough are found, then we keep the word of s_{i}. We test parallelograms from sequence s_{j}, for j ≠ i, as well as from s_{i} against itself. All counter updates are done as for a single sequence. As soon as a suitable parallelogram is found in s_{j}, we skip the remainder of s_{j} and go to the next sequence. Finally, if already

Application to multiple sequences

**Application to multiple sequences**. While the window slides along sequence s_{i}, parallelograms are tested on all other sequences. In this example, we assume we found three fine/good/excellent parallelograms among four sequences.

Complexity analysis

In the

We present a complexity analysis for

In order to have better parameters for the complexity analysis, besides the length

Concerning space usage, as described in Section "Description of the algorithm", the main data structure consists of |Σ|^{q} pointers/integers. Its construction is done by applying a simple counting sort on the sequence of q-grams, which takes O(n + |Σ|^{q}) time, that is, linear in n as long as q ≤ log_{|Σ|} n.

Concerning time, a critical parameter is the number

Let us now estimate the values for ^{n}), ^{2}, but on average, ^{2}|Σ|^{-q}. As concerns ^{-q}

Therefore, the expected number of fine (good) parallelograms is

and

The worst case time complexity is then

and the average complexity is

Notice that the second part in the sum decreases as

Finally, the complexity of

Results and discussion

We now report a battery of experimental tests that were applied to

We start by giving a few definitions of the values that we used to evaluate the results obtained. The quality of the filtered output is measured by the ratio between the total length of the non-filtered sequence and its original length. We call this the
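The measure can be stated in two lines (a sketch with hypothetical names; the conserved regions are represented as half-open intervals):

```python
def selectiveness(kept_intervals, total_length):
    """Fraction of the input that survives filtering: total length of
    the conserved (non-filtered) intervals divided by the original
    sequence length.  Lower values mean a more selective filter."""
    kept = sum(end - start for start, end in kept_intervals)
    return kept / total_length
```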

Time and selectiveness on randomly generated sequences

We first present some tests performed on short randomly generated sequences. Each dataset is composed of five sequences of length 300 kb each, generated using a Bernoulli model (each nucleotide occurring with frequency
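Such sequences can be generated as follows (the exact nucleotide frequency is truncated above; uniform frequencies of 1/4 are assumed in this sketch):

```python
import random

def bernoulli_sequence(length, freqs=None):
    """Random DNA sequence under a Bernoulli (i.i.d.) model: each
    position is drawn independently with the given nucleotide
    frequencies (uniform over A, C, G, T by default)."""
    if freqs is None:
        freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    letters, weights = zip(*freqs.items())
    return "".join(random.choices(letters, weights=weights, k=length))
```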

Tests on randomly generated sequences

**Tests on randomly generated sequences**. Application of the three versions of

Extensive tests with Neisseria meningitidis strain MC58

In order to compare the different variants of

The best strategy for solving the Parallelogram

In order to solve the Parallelogram ^{-q }of a ^{-q}

As concerns the PQCP strategy, it is very difficult to predict how many computations of the Parallelogram

In practice, these expectations were confirmed in the 72 tests we made with different sets of parameters on MC58. In all cases except one, the running time of the PHS strategy was no worse than that of the simple PDP. In fact, the overall observed running time improvement from PDP to PHS was 1.57. In all tests, PQCP performed faster than PHS, and the overall observed running time improvement from PHS to PQCP was 1.88. Hence, in all tests PQCP performed faster than the simple PDP, and the overall observed running time improvement from PDP to PQCP was 3.22. Since all three strategies provide the same selectiveness, but in some cases PDP was 18 times slower than PQCP, we discarded strategies PDP and PHS from the subsequent systematic comparisons when we have to solve the Parallelogram

The

In the Section "Methods", we saw three possible filtering conditions, depending on what kind of non-overlapping parallelograms we would like to find: fine, good, or excellent. All three filters ensure that all (

As concerns the parameter sets we used for the three algorithms when applied to MC58, we selected all combinations such that:

and such that the restriction

for

| | # tests | | SI | SD | SI | SD | SI | SD |
|---|---|---|---|---|---|---|---|---|
| overall | 198 | 11.19 | 1.685 | 1.032 | 1.307 | 1.752 | 2.309 | 1.811 |
| | 99 | 18.45 | 2.302 | 1.043 | 1.508 | 2.062 | 3.428 | 2.166 |
| | 99 | 3.93 | 1.067 | 1.021 | 1.107 | 1.441 | 1.191 | 1.467 |
| | 105 | 14.41 | 2.274 | 1.043 | 1.506 | 1.533 | 3.377 | 1.236 |
| | 93 | 7.56 | 1.019 | 1.019 | 1.082 | 1.998 | 1.104 | 2.039 |
| | 159 | 5.90 | 1.820 | 1.033 | 1.347 | 1.321 | 2.546 | 1.236 |
| | 39 | 32.77 | 1.135 | 1.027 | 1.146 | 3.506 | 1.347 | 3.614 |
| | 183 | 8.99 | 1.720 | 1.031 | 1.312 | 1.415 | 2.363 | 1.457 |
| | 15 | 38.07 | 1.246 | 1.038 | 1.244 | 5.854 | 1.653 | 6.067 |
| | 69 | 6.42 | 1.958 | 1.044 | 1.456 | 1.744 | 2.829 | 1.835 |
| | 69 | 12.35 | 1.611 | 1.032 | 1.172 | 2.018 | 2.113 | 2.077 |
| | 60 | 15.35 | 1.454 | 1.019 | 1.292 | 1.454 | 1.938 | 1.479 |
| | 66 | 8.44 | 1.745 | 1.033 | 1.323 | 1.729 | 2.45 | 1.791 |
| | 66 | 11.11 | 1.677 | 1.032 | 1.302 | 1.756 | 2.280 | 1.817 |
| | 66 | 14.03 | 1.632 | 1.030 | 1.297 | 1.770 | 2.196 | 1.825 |
| | 66 | 8.79 | 2.874 | 1.049 | 1.689 | 1.205 | 4.451 | 1.236 |
| | 78 | 18.42 | 2.631 | 1.047 | 1.626 | 1.710 | 4.038 | 1.800 |
| | 90 | 6.99 | 2.446 | 1.045 | 1.554 | 1.138 | 3.670 | 1.181 |
| | 63 | 8.78 | 2.962 | 1.050 | 1.722 | 1.188 | 4.614 | 1.237 |

Systematic comparison of

Variants

We start with the comparison between

Variants

We now compare

On one hand, ^{3 }> 1.75. In any case, any slowdown above 4 means that we should also consider decreasing

Variants

In order to complete these comparisons based on MC58, we proceed with the comparison between

Extra tests on Human Chromosome 22

Unfortunately, thresholds such as those present in expressions like

| selectiveness | running time | q | b | selectiveness | running time | SU | SI | SD | SI | SD | SI | selectiveness (excel.) | running time (excel.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6.92% | 501.28 | 14 | 65 | 1.63% | 375.59 | 1.335 | 4.253 | 3.198 | 1.056 | 2.396 | 4.490 | 1.54% | 1201.00+ |
| 6.71% | 598.08 | 13 | 79 | 0.88% | 451.20 | 1.326 | 7.637 | 1.795 | 1.062 | 1.354 | 8.114 | 0.83% | 810.02 |
| 6.93% | 761.29 | 12 | 93 | 0.51% | 596.23 | 1.277 | 13.653 | 1.289 | 1.075 | 1.009 | 14.684 | 0.47% | 768.50 |
| 7.24% | 1067.52 | 11 | 107 | 0.27% | 862.29 | 1.238 | 26.321 | 1.099 | 1.136 | 0.887 | 33.684 | 0.24% | 947.25 |
| 7.88% | 1647.85 | 10 | 121 | 0.14% | 1417.72 | 1.162 | 57.805 | 1.027 | 1.093 | 0.883 | 71.543 | 0.12% | 1455.64 |
| 8.47% | 3047.66 | 9 | 135 | 0.07% | 2769.05 | 1.101 | 120.000 | 1.002 | 1.124 | 0.910 | 148.254 | 0.06% | 2773.61 |
| mean | | | | 2.83% | 1225.10 | 1.240 | 38.278 | 1.568 | 1.091 | 1.240 | 42.550 | | |

Some tests for Human Chromosome 22 with parameters (L, d, r) = (260, 13, 280) and

Selectiveness was even better, with an average improvement of 38.3%. In these cases, we can also observe that the selectiveness improvement from

Notice also that

Influence of

**Influence of q-gram size q over selectiveness and running time**. Influence of

In order to show the behaviour of

Influence of number of repeats

**Influence of number of repeats r over selectiveness and running time**. Influence of number of repeats

Looking for multiple repeats across different species

In the tests described from now on, we look for multiple repeats across different species. We apply for this

Like with MC58, we chose the same set of parameters, up to the fact that now we fix

| d | q | b | selectiveness (%) | | | running time (s) | | | SU | SI | SD | SI | SD | SI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 8 | 13 | 13.84 | 5.10 | 1.36 | 123 | 116 | 131 | 1.055 | 2.71 | 1.129 | 3.76 | 1.070 | 10.21 |
| 10 | 7 | 24 | 12.85 | 1.91 | 0.05 | 371 | 370 | 385 | 1.003 | 6.71 | 1.041 | 41.90 | 1.037 | 281.28 |
| 10 | 6 | 35 | 14.28 | 0.89 | 0.01 | 1286 | 1292 | 1326 | 0.995 | 16.05 | 1.026 | 81.23 | 1.031 | 1304.19 |
| 10 | 5 | 46 | 21.92 | 0.65 | 0.00 | 5080 | 5138 | 5183 | 0.989 | 33.95 | 1.009 | 235.94 | 1.020 | 8011.07 |
| 10 | 4 | 57 | 57.96 | 1.02 | 0.00 | 13441 | 13362 | 13564 | 1.006 | 56.77 | 1.015 | 391.15 | 1.009 | 22208.66 |
| 12 | 7 | 10 | 50.85 | 24.51 | 13.50 | 405 | 382 | 468 | 1.059 | 2.07 | 1.223 | 1.81 | 1.155 | 3.76 |
| 12 | 6 | 23 | 28.09 | 3.99 | 0.13 | 1274 | 1262 | 1338 | 1.009 | 7.04 | 1.060 | 30.37 | 1.051 | 214.04 |
| 12 | 5 | 36 | 36.84 | 2.30 | 0.04 | 4972 | 4952 | 5055 | 1.004 | 16.02 | 1.021 | 53.37 | 1.017 | 855.33 |
| 12 | 4 | 49 | 85.10 | 3.53 | 0.02 | 13834 | 13612 | 13676 | 1.016 | 24.08 | 1.004 | 162.60 | 0.988 | 3916.57 |
| 14 | 6 | 11 | 99.63 | 96.79 | 85.71 | 1609 | 1405 | 2159 | 1.145 | 1.02 | 1.536 | 1.12 | 1.342 | 1.16 |
| 14 | 5 | 26 | 75.19 | 12.46 | 0.35 | 5017 | 4794 | 5080 | 1.047 | 6.03 | 1.060 | 35.70 | 1.013 | 215.46 |
| 14 | 4 | 41 | 99.68 | 25.08 | 0.08 | 14290 | 13969 | 14112 | 1.023 | 3.97 | 1.010 | 299.49 | 0.988 | 1190.06 |
| mean | | | | | | | | | 1.029 | 14.70 | 1.094 | 111.54 | 1.060 | 3184.32 |

Measures for

Influence of q-gram size

**Influence of q-gram size q over selectiveness and running time**. Influence of q-gram size

Influence of maximal error

**Influence of maximal error d over selectiveness and running time**. Influence of maximal error

Figure

We thus focus our attention on large maximal error rates

Applying the filtered sequences to a local multiple aligner

Finally, we discuss the application of ^{k-1 }^{k }using dynamic programming. For this reason, existing multiple aligners provide only a suboptimal solution. The algorithms will still provide a suboptimal solution even when a filter is applied upstream. This is important to observe for what will follow. It means that although

For our purposes, a local, as opposed to a global, multiple aligner was also a preferable choice to illustrate the use of ^{k},

We first applied

On the same unfiltered CFTR dataset, we then applied

Improvements on

| filter | time (s) | length | time (s) | scrbits | total time (s) |
|---|---|---|---|---|---|
| | 9.26 | 264368 | 4732 | 244.794 | 4741 |
| 11, 7 | 7.27 | 109244 | 2364 | 303.747 | 2371 |
| | **9.76** | **10256** | **228** | **283.022** | **238** |
| | 456.55 | 1556839 | 36357 | 287.102 | 36814 |
| 6, 12 | 422.53 | 221127 | 5705 | 356.874 | 6128 |
| | **439.48** | **7289** | **135** | **469.664** | **575** |
| | 1439.12 | 4159686 | 83055 | 262.977 | 84494 |
| 5, 14 | 1387.27 | 691442 | 14545 | 262.977 | 15932 |
| | **1499.74** | **19393** | **395** | **406.321** | **1895** |
| | 1640.49 | 5437974 | 107908 | 287.295 | 109548 |
| 5, 15 | 1446.02 | 3267656 | 71303 | 256.076 | 72749 |
| | **1814.78** | **375805** | **7242** | **268.878** | **9057** |

Improvements on

In Table

• For

• For

• The last column shows the sum of the time taken by the filter and that taken by the alignment on the filtered data.

It turns out that the best performances are obtained by

For instance, applying

In two cases ((

As shown in Table

Conclusion

To the best of our knowledge,

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MFS, NP and PP initiated the work, while all authors contributed to the conceptual and algorithmic choices. GS developed the prototype. Tests were mainly designed and performed by AP. All authors participated in the analysis of the results, and in editing the manuscript that was first drafted by PP.

Appendix

Appendix 1 – Algorithm: overview of

**Require: **sequence

**Ensure: **set of positions of

1:

2: Create

3: Initialise with 0 all counters associated with the parallelograms

4: Initialise counter with respect to

5: **for **every sliding window [**do**

6: **for **every occurrence **do**

7: Update the counters whose parallelograms the

8: **for **the updated counters that become **do**

9: Insert the parallelogram into the set of good parallelograms

10: **end for**

11: **end for**

12: **for **every occurrence **do**

13: Unset the counters whose parallelograms the

14: **for **the updated counters that become **do**

15: Remove the parallelogram from the set of good parallelograms

16: **end for**

17: **end for**

18: **if **number of good non-overlapping parallelograms ≥

19: **for **all good parallelograms **do**

20: Test whether

21: **end for**

22: **if **number of excellent non-overlapping parallelograms ≥

23: Conserve positions [

24: **end if**

25: **end if**

26: **end for**