A first experiment on the STANISLAS cohort using closed frequent pattern search

Sandy Maumus (1), Amedeo Napoli (2), Sophie Visvikis (1)
(2) ORPAILLEUR - Knowledge representation, reasoning
INRIA Lorraine, LORIA - Laboratoire Lorrain de Recherche en Informatique et ses Applications
Abstract: Knowledge discovery in databases (KDD) can be seen as the analysis of large sets of observational data to find unsuspected relationships and to summarize the data in novel ways that may be both understandable and useful for the analyst. The relationships and summaries derived through the KDD process are referred to as models or patterns [1]. Data mining is the central step of the KDD process, where algorithms are run to extract models or patterns of interest.
Methods and Data
In the present view, the analyst is an expert in the data domain who is in charge of controlling the whole KDD process, mainly by interacting with the system and by iterating over some important steps of the process, such as data selection, parameter adjustment, result interpretation and validation. A preliminary study conducted on a test biological database in the domain of mushrooms [2] gave satisfactory results. This led us to start the analysis of data from a real-world database, namely the STANISLAS cohort. In both experiments, we used the Close algorithm to extract closed frequent patterns and association rules [3]. Patterns are sets of items extracted from a formal database, i.e. a boolean array of the form individuals × items, where an individual may or may not own a given item. A pattern is said to be frequent if the number of individuals owning the pattern is greater than a given frequency threshold. Association rules can then be extracted from a pattern, each with a confidence measuring the proportion of individuals verifying the rule within the formal database.
The STANISLAS cohort is a longitudinal study started in 1993, made up of 1006 Caucasian families presumed to be healthy and of homogeneous origin, recruited for medical examination at the Centre for Preventive Medicine of Vandoeuvre-Lès-Nancy [4]. These families are studied to explore genotypes and intermediate phenotypes of cardiovascular diseases (CVD).
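The definitions of frequent pattern, closed pattern and rule confidence given in the Methods can be illustrated with a minimal Python sketch. It is a brute-force enumeration on a tiny table, not the Close algorithm of [3]. The individuals (i1 to i5), the items (a, b, c) and the threshold are hypothetical and are not taken from the cohort. A pattern is treated as closed when no proper superset has the same support.

```python
from itertools import combinations

# Hypothetical toy formal database: individual -> set of items owned.
# Names and values are illustrative only, not cohort variables.
DATABASE = {
    "i1": {"a", "b", "c"},
    "i2": {"a", "b"},
    "i3": {"a", "b", "c"},
    "i4": {"c"},
    "i5": {"a", "b"},
}


def support(pattern, database):
    """Number of individuals owning every item of the pattern."""
    return sum(1 for items in database.values() if pattern <= items)


def frequent_patterns(database, threshold):
    """Brute-force search for patterns whose support is > threshold."""
    all_items = sorted(set().union(*database.values()))
    result = {}
    for size in range(1, len(all_items) + 1):
        for combo in combinations(all_items, size):
            pattern = frozenset(combo)
            count = support(pattern, database)
            if count > threshold:
                result[pattern] = count
    return result


def closed_patterns(frequent):
    """Keep the patterns that have no proper superset of equal support."""
    return {
        p: s
        for p, s in frequent.items()
        if not any(p < q and s == t for q, t in frequent.items())
    }


def confidence(premise, conclusion, database):
    """Proportion of individuals owning the premise that also own the conclusion."""
    return support(premise | conclusion, database) / support(premise, database)


if __name__ == "__main__":
    closed = closed_patterns(frequent_patterns(DATABASE, threshold=1))
    for pattern, count in sorted(closed.items(), key=lambda kv: -kv[1]):
        print(sorted(pattern), count)
    print(confidence(frozenset("ab"), frozenset("c"), DATABASE))
```

On this toy table the closed frequent patterns are {a, b}, {c} and {a, b, c}, and the rule {a, b} -> {c} has confidence 2/4. Exhaustive enumeration is only practical for a handful of items; it is used here to keep the definitions visible.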
CVD are multifactorial pathologies resulting from gene-gene and gene-environment interactions. A growing number of studies are conducted in the field of CVD, and many of their results highlight new potential risk factors. In parallel, the volume of data generated is growing, due to the development of cutting-edge technologies (such as multiplex technologies or microarrays) coupled with studies involving large populations. Faced with this huge volume of data, new kinds of data analysis methods are required, such as symbolic data mining methods, in order to determine disease susceptibility profiles. The collected information at our disposal can be either qualitative or quantitative, and it is of several types:
- environmental data: personal past history, life habits...
- clinical data: body mass index, height, weight, blood pressure...
- biological data related to risk factors: lipids and apolipoproteins (apo) such as concentration of total cholesterol, triglycerides, HDL and LDL cholesterol, apoB, apoE..., concentration of ACE, cellular adhesion molecules, and inflammation molecules
- genetic data corresponding to gene polymorphisms related to cardiovascular diseases, dealing for instance with lipid metabolism, blood pressure, or inflammation.
Results
In our first experiments, we chose to work on a subset of data from the STANISLAS cohort that has already been studied and has yielded results with classical statistical analysis, previously published in [5]. Briefly, the subset concerns 772 men and 780 premenopausal women, genetically unrelated and without any treatment that could interfere with cardiovascular physiopathology. For practical reasons, we chose to work on the concentration of LDL-cholesterol (LDL-C) in mmol/L, the genetic polymorphisms of apolipoprotein (apo) E and apo B that were shown to be associated with LDL-C in the study by Pallaud et al. [5], and common risk factors used in this kind of study, e.g. age, sex, alcohol consumption (g/day), smoking, body mass index (kg/m²), and use of oral contraceptives. The extracted rules and patterns are in agreement with the knowledge of the analyst and with the literature. In this way, we have obtained two types of results:
1) Already known results. For example, we performed a projection on people with genotype 4/4 for the apo E polymorphism. An extracted rule states that 19 of the 25 individuals who are apo E 4/4 have an LDL concentration that exceeds the norm established by the National Cholesterol Education Program (LDL concentration must be less than or equal to 1.60 g/l (below 3.44 mmol/l)). This rule is in accordance with the knowledge of the analyst and with published results [6].
2) New results. For instance, the interpretation of an interesting rule led us to investigate the genotype distributions in a subset of our population. This study gave us statistically significant results, never published in the literature to our knowledge. Other results of the same kind have been found, and further investigations are currently under way.
These first results of the application of symbolic data mining methods to the STANISLAS cohort are very encouraging. In the near future, we plan to conduct investigations in both the applied and the theoretical fields, e.g. studying sequences of data taking time into account. The parallel validation of our results by statistical tests has given some guarantees with respect to the extracted knowledge units. These units can be considered as hypotheses, and in turn, data mining methods may provide new hypotheses to be tested using statistics. This is an original and promising way of combining data mining methods, i.e. association rule extraction, and statistics. We are currently working on this combination, and we aim to obtain more interesting results and a general methodology for mining biological data.
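As a small worked example, the confidence of the apo E 4/4 rule reported in the Results can be recomputed from the two counts given in the text. The support figure relies on an assumption that the text does not state: that the 25 apo E 4/4 individuals belong to the subset of 772 men and 780 women. The function and constant names are illustrative.

```python
# Counts reported in the text for the rule "apo E 4/4 -> LDL above the norm".
N_PREMISE = 25  # individuals with genotype apo E 4/4
N_RULE = 19     # of those, individuals whose LDL exceeds the norm

# Assumption (not stated in the text): the projection was taken
# within the subset of 772 men and 780 premenopausal women.
N_POPULATION = 772 + 780


def rule_confidence(n_rule, n_premise):
    """Proportion of individuals satisfying the premise that verify the rule."""
    return n_rule / n_premise


def rule_support(n_rule, n_population):
    """Proportion of the whole population that verifies the rule."""
    return n_rule / n_population


confidence = rule_confidence(N_RULE, N_PREMISE)
support = rule_support(N_RULE, N_POPULATION)
print(f"confidence = {confidence:.2f}")  # 19 / 25
print(f"support    = {support:.4f}")     # 19 / 1552, under the assumption above
```

The rule thus holds for about three quarters of the apo E 4/4 individuals while covering only a small share of the population, which is why a projection on the genotype is useful for bringing it out.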
Acknowledgments
This work is supported by INSERM and the Région Lorraine.
References
[1] D. Hand, H. Mannila and P. Smyth, Principles of Data Mining, The MIT Press, Cambridge (MA), 2001.
[2] S. Maumus, A. Napoli, R. Taouil and S. Visvikis, A first study of the central role of the analyst in the knowledge discovery process in biology. Poster presentation, ISMB 2002, August 3-7, Edmonton, Canada.
[3] N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal, Pruning Closed Itemset Lattices for Association Rules. International Journal of Information Systems, 24:25-46, 1999.
[4] G. Siest, S. Visvikis, B. Herbeth, R. Gueguen, M. Vincent-Viry, C. Sass, B. Beaud, E. Lecomte, J. Steinmetz, J. Locuty and P. Chevrier, Objectives, design and recruitment of a familial and longitudinal cohort for studying gene-environment interactions in the field of cardiovascular risk: the Stanislas cohort. Clinical Chemistry and Laboratory Medicine, 36:35-42, 1998.
[5] C. Pallaud, R. Gueguen, C. Sass, M. Grow, S. Cheng, G. Siest, and S. Visvikis, Genetic influences on lipid metabolism trait variability within the STANISLAS Cohort. Journal of Lipid Research, 42:1879-1890, 2001.
[6] G. Siest, T. Pillot, A. Regis-Bailly, B. Leininger-Muller, J. Steinmetz, M.M. Galteau and S. Visvikis, Apolipoprotein E: an important gene and protein to follow in laboratory medicine. Clinical Chemistry, 41:1068-86, 1995.
Document type:
Conference paper
European Conference on Computational Biology - ECCB'2003, Sep 2003, Paris, France, 2 p, 2003

Contributor: Publications Loria
Submitted on: Tuesday, 26 September 2006 - 09:40:24
Last modified on: Thursday, 11 January 2018 - 06:19:55


  • HAL Id : inria-00099693, version 1



Sandy Maumus, Amedeo Napoli, Sophie Visvikis. A first experiment on the STANISLAS cohort using closed frequent pattern search. European Conference on Computational Biology - ECCB'2003, Sep 2003, Paris, France, 2 p, 2003. 〈inria-00099693〉


