BIOINFORMATICS 2012 Abstracts


Full Papers
Paper Nr: 17
Title:

FORESTS OF LATENT TREE MODELS FOR THE DETECTION OF GENETIC ASSOCIATIONS

Authors:

Christine Sinoquet, Raphaël Mourad and Philippe Leray

Abstract: Increasing health care costs, together with concerns about population aging, make it essential to understand the causal basis of common genetic diseases. The high dimensionality and complexity of genetic data hamper the detection of genetic associations. To alleviate the core risks (missing the causal factor, spurious discoveries), machine learning offers an appealing alternative framework to standard statistical approaches. A novel class of probabilistic graphical models, the forest of latent tree models, has recently been proposed to obtain a trade-off between faithful modeling of data dependences and tractability. In this paper, we evaluate the soundness of this modeling approach in an association genetics context. We have performed intensive tests, under various controlled conditions, on realistic simulated data. We have also tested the model on real data. Besides guaranteeing data dimension reduction through latent variables, the model is empirically shown to capture indirect genetic associations with the disease, on both simulated and real data. Strong associations are evidenced between the disease and the ancestor nodes of the causal genetic marker node in the forest, whereas very weak associations are obtained for other nodes.

Paper Nr: 20
Title:

A DETERMINISTIC MODEL OF BONE MARROW WITH HOMEOSTATIC PROPERTIES AND WITH STEADY PRODUCTION OF DIFFERENTIATED CELLS

Authors:

Manish P. Kurhekar and Umesh A. Deshpande

Abstract: There is significant interest in studying stem cells, both to learn about their biological functions during development and adulthood and to learn how to utilize them as new sources of specialized cells for tissue repair. Modeling of stem cells not only describes, but also predicts, how a stem cell's environment can control its fate. The first stem cell populations discovered were Hematopoietic Stem Cells (HSCs). In this paper, we present a biologically feasible deterministic model of bone marrow that hosts HSCs. Our model demonstrates that a single HSC can populate the entire bone marrow and almost always produces a sufficient number of differentiated cells (RBCs, WBCs, etc.). It also overcomes the biological feasibility limitations of previously reported models. We have performed an agent-based simulation of the proposed bone marrow model and include the details and results of this simulation-based validation in the Appendix. The simulation also demonstrates that a large fraction of stem cells remains in the quiescent state. The program for the agent-based simulation of the proposed model is available on a public website.

Paper Nr: 25
Title:

LASER DOPPLER FLOWMETERS PROTOTYPES VALIDATION USING MONTE CARLO SIMULATIONS

Authors:

Edite Figueiras, Rita Campos, Ricardo Oliveira, Luís F. Requicha Ferreira, Frits de Mul and Anne Humeau-Heurtier

Abstract: Two new laser Doppler flowmeter (LDF) prototypes are herein validated with Monte Carlo simulations. The first prototype is a multi-wavelength laser Doppler flowmeter with differently spaced detection fibres that will add depth discrimination capabilities to LDF skin monitoring. The other prototype is a self-mixing based laser Doppler flowmeter for brain perfusion estimation. For the first prototype, Monte Carlo simulations are performed both in a phantom consisting of moving fluid (pumped milk) at six different depths and in a skin model. The results show that the first order moment of the photocurrent power spectrum (M1) and the mean measurement depth both increase with the tested fibre distances. Moreover, in the phantom, M1 increases with the concentration of milk, whereas the mean measurement depth decreases with it. Furthermore, we show that, in the skin model, increasing the wavelength of the incoming light increases the mean depth probed. For the second prototype, Monte Carlo simulations are carried out on a rat brain model; we show that the mean measurement depth in the rat brain with our probe is 0.15 mm.

Paper Nr: 42
Title:

A GENERALIZED HIDDEN MARKOV MODEL FOR PREDICTION OF CIS-REGULATORY MODULES IN EUKARYOTE GENOMES AND DESCRIPTION OF THEIR INTERNAL STRUCTURE

Authors:

Anna A. Nikulova, Alexander V. Favorov, Vsevolod Yu. Makeev and Andrey A. Mironov

Abstract: Eukaryotic regulatory regions have been studied extensively due to their importance for gene regulation in higher eukaryotes. However, the understanding of their organization is clearly incomplete; in particular, we lack accurate in silico methods for their prediction. Here we present a new HMM-based method for the prediction of regulatory regions in eukaryotic genomes using position weight matrices of the relevant transcription factors. The method reveals and then utilizes the regulatory region structure (preferred binding site arrangements) to increase the quality of the prediction, as well as to provide new knowledge of regulatory region organization. We show that our method can successfully identify regulatory regions in eukaryotic genomes with higher quality than other methods. We also demonstrate the ability of our algorithm to reveal structural features of the regulatory regions, which could be helpful for deciphering the mechanisms of transcriptional regulation in higher eukaryotes.

Paper Nr: 44
Title:

INTRODUCING DATA PROVENANCE AND ERROR HANDLING FOR NGS WORKFLOWS WITHIN THE MOLGENIS COMPUTATIONAL FRAMEWORK

Authors:

H. V. Byelas, M. Dijkstra and M. A. Swertz

Abstract: Running bioinformatics analyses in a distributed computational environment and monitoring their execution has become a huge challenge due to the size of the data and the complexity of analysis workflows. Some attempts have been made to combine computation and data management in a single solution using the MOLGENIS software generator. However, it was not clear how to explicitly specify the output data of a particular analysis, evaluate its quality, or repeat the analysis depending on the results. We present here a new version of the MOLGENIS computational framework for bioinformatics, which reflects lessons learnt and new requirements from end users. We have improved our initial solution in two ways. First, we propose a new data model which describes a workflow as a graph in a relational database, where nodes are analysis operations and edges are transactions between them; the inputs and outputs of the workflow nodes are explicitly specified. Second, we have extended the execution logic to trace data, show how final results were created, and handle errors in the distributed environment. We illustrate the system's application on several analysis workflows for next generation sequencing.

Paper Nr: 47
Title:

SIMULATION OF BACTERIAL GENOME EVOLUTION UNDER REPLICATIONAL MUTATIONAL PRESSURES

Authors:

Paweł Błażej, Paweł Mackiewicz and Stanisław Cebrat

Abstract: Directional mutational pressure associated with DNA replication is one of the most significant forces shaping the nucleotide composition and structure of bacterial chromosomes, as well as influencing the evolution of their genes. Here we introduce a model of bacterial genome evolution that includes two mutational pressures acting on the differently replicated DNA strands (called leading and lagging). The simulations were performed on the population of protein coding genes from the Borrelia burgdorferi genome, which shows a very strong compositional bias between the DNA strands. Simulated genomes were eliminated by selection because of: (i) the occurrence of translation stop codons in their gene sequences, and (ii) the loss of their coding signal, calculated according to an algorithm for the recognition of protein coding sequences. This algorithm uses three independent homogeneous Markov chains to describe transitions between nucleotides, separately for each of the three codon positions in a given DNA sequence. The negative selection for stop codons appeared much stronger than the selection based on the coding signal and led to the elimination of more genomes from the population. The genes were subjected both to the direct mutational pressure, characteristic of the strand on which they are located, and to the reverse pressure, characteristic of the opposite strand. Generally, elimination of genomes because of stop codon occurrence was most frequent under the reverse pressure, whereas coding signal selection eliminated genomes most often under the direct pressure. The leading strand mutational pressure was more destructive for the coding signal, whereas the lagging strand pressure generated more stop codons in the gene sequences.

Paper Nr: 48
Title:

GENOME HALVING BY BLOCK INTERCHANGE

Authors:

Antoine Thomas, Aïda Ouangraoua and Jean-Stéphane Varré

Abstract: We address the problem of finding the minimal number of block interchanges required to transform a duplicated linear genome into a tandem duplicated linear genome. We provide a formula for the distance as well as a polynomial time algorithm for the sorting problem.

Paper Nr: 50
Title:

A n² RNA SECONDARY STRUCTURE PREDICTION ALGORITHM

Authors:

Markus E. Nebel and Anika Scheid

Abstract: Several state-of-the-art tools for predicting RNA secondary structures have worst-case time and space requirements of O(n³) and O(n²) for sequence length n, limiting their applicability for practical purposes. Accordingly, biologists are interested in getting results faster and would willingly tolerate a moderate loss of accuracy. For this reason, we propose a novel algorithm for structure prediction that reduces the time complexity by a linear factor to O(n²), while still being able to produce high quality results. Basically, our method relies on a probabilistic sampling approach based on an appropriate stochastic context-free grammar (SCFG): using a well-known or a newly introduced sampling strategy, it generates a random set of candidate structures (from the ensemble of all feasible foldings) according to a "noisy" distribution (obtained by heuristically approximating the inside-outside values) for a given sequence, such that a corresponding prediction can finally be derived efficiently. Sampling can easily be parallelized. Furthermore, it can be done in-place, i.e. only the best (most probable) candidate structure generated so far needs to be stored and finally communicated. Together, this makes it possible to efficiently handle the increased sample sizes necessary to achieve competitive prediction accuracy despite the noisy distribution.
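
A minimal sketch of the in-place sampling idea described above; `sample_structure` is a placeholder standing in for the SCFG-based sampler (the real one draws foldings according to the noisy inside-outside distribution):

```python
# Sketch only: keep just the best (most probable) candidate seen so far,
# so memory stays constant regardless of sample size.
import math
import random

def sample_structure(rna):
    """Placeholder sampler returning (structure, log_probability).

    A real implementation would sample a folding from an SCFG; here we
    fake both outputs purely for illustration.
    """
    structure = "." * len(rna)           # dummy flat structure
    log_prob = -random.expovariate(1.0)  # dummy log-probability
    return structure, log_prob

def predict(rna, sample_size=1000):
    best_structure, best_log_prob = None, -math.inf
    for _ in range(sample_size):         # trivially parallelizable loop
        structure, log_prob = sample_structure(rna)
        if log_prob > best_log_prob:     # in-place: store only the current best
            best_structure, best_log_prob = structure, log_prob
    return best_structure
```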

Paper Nr: 51
Title:

A VACCINATION CONTROL LAW BASED ON FEEDBACK LINEARIZATION TECHNIQUES FOR SEIR EPIDEMIC MODELS

Authors:

S. Alonso-Quesada, M. De la Sen and A. Ibeas

Abstract: This paper presents a vaccination strategy for fighting the propagation of epidemic diseases. The disease propagation is described by a SEIR (susceptible plus infected plus infectious plus removed-by-immunity populations) epidemic model. The model takes the total population into account as a restraint on illness transmission, since its increase makes contacts between susceptible and infected individuals more difficult. The vaccination strategy is based on a continuous-time nonlinear control law synthesized via an exact feedback input-output linearization approach. The control objective is to asymptotically eradicate the infection. Moreover, the positivity and stability properties of the controlled system are investigated.
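
For orientation, a minimal sketch of a generic SEIR model with a simple proportional vaccination term; the parameter names (beta, sigma, gamma, v) and the vaccination law are illustrative assumptions, not the paper's feedback-linearizing control law:

```python
# Generic SEIR dynamics with an illustrative vaccination input v * S that
# moves susceptibles directly to the removed compartment.
import numpy as np
from scipy.integrate import odeint

def seir(y, t, beta, sigma, gamma, v):
    S, E, I, R = y
    N = S + E + I + R
    vacc = v * S                      # illustrative vaccination law
    dS = -beta * S * I / N - vacc
    dE = beta * S * I / N - sigma * E
    dI = sigma * E - gamma * I
    dR = gamma * I + vacc
    return [dS, dE, dI, dR]

t = np.linspace(0, 200, 2001)
y0 = [990.0, 5.0, 5.0, 0.0]           # initial S, E, I, R
trajectory = odeint(seir, y0, t, args=(0.4, 0.2, 0.1, 0.05))
```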

Paper Nr: 53
Title:

TRADING RUNNING TIME FOR MEMORY IN PHYLOGENETIC LIKELIHOOD COMPUTATIONS

Authors:

Fernando Izquierdo-Carrasco, Julien Gagneur and Alexandros Stamatakis

Abstract: The revolution in wet-lab sequencing techniques that has given rise to a plethora of whole-genome and whole-transcriptome sequencing projects, often targeting 50 up to 1000 species, poses new challenges for efficiently computing the phylogenetic likelihood function, both for phylogenetic inference and for statistical post-analysis purposes. The phylogenetic likelihood function as deployed in maximum likelihood and Bayesian inference programs consumes the vast majority of computational resources, that is, memory and CPU time. Here, we introduce and implement a novel, general, and versatile concept for trading additional computations for memory consumption in the likelihood function, which exhibits a surprisingly small impact on overall execution times. When trading 50% of the required RAM for additional computations, the average execution time increase amounts to only 15%. We demonstrate that, for a phylogeny with n species, only log(n)+2 memory space is required for computing the likelihood. This is a promising result given the exponential growth of molecular datasets.
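
A toy illustration (not the paper's actual implementation) of why a post-order likelihood computation over a balanced binary tree needs only logarithmically many conditional vectors at once: a child's vector can be freed as soon as its parent's vector is computed.

```python
# Counts how many "conditional vectors" are live at once during a
# post-order traversal; the per-branch likelihood update is faked by an
# elementwise product.
def conditional_vector(tree, stats):
    """tree: a leaf vector (list of floats) or a (left, right) pair."""
    def alloc():
        stats["live"] += 1
        stats["peak"] = max(stats["peak"], stats["live"])
    if isinstance(tree, list):                 # leaf vector is given
        alloc()
        return list(tree)
    left = conditional_vector(tree[0], stats)
    right = conditional_vector(tree[1], stats)
    alloc()                                    # allocate the parent's vector
    parent = [l * r for l, r in zip(left, right)]  # stand-in for the real update
    stats["live"] -= 2                         # both children's slots reusable
    return parent

def balanced(depth):
    return [1.0, 1.0] if depth == 0 else (balanced(depth - 1), balanced(depth - 1))

stats = {"live": 0, "peak": 0}
conditional_vector(balanced(10), stats)        # a tree with n = 1024 leaves
print(stats["peak"])                           # 12 = log2(1024) + 2
```

For a complete tree with n leaves the peak comes out to log2(n) + 2 here, matching the flavor of the bound quoted in the abstract.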

Paper Nr: 60
Title:

DUAL-ENERGY X-RAY ABSORPTIOMETRY AS AN INDICATOR FOR FRAGILITY FRACTURE RISKS OF THE FEMORAL NECK

Authors:

Alexander Tsouknidas, Nikolaos Michailidis, Kleovoulos Anagnostidis and Antonios Lontos

Abstract: Osteoporosis is a clinically silent bone pathology usually manifesting in the form of fragility bone fractures. Due to the high morbidity of the disease, associating noninvasive imaging techniques with the implicated risk factors could serve as a valuable indicator for surgeons. In the present investigation, the bone mineral density of 30 patients' femurs was evaluated in vivo by Dual-energy X-ray absorptiometry (DXA), while the strength characteristics of the examined specimens were determined ex vivo using uniaxial compression experiments. The obtained stress-strain curves reflect the mechanical properties of the femur and facilitate their correlation to the DXA measurements. FEM simulations revealed critical stress values within the femoral neck, indicating which DXA values represent abnormally high fragility fracture risks and should thus be considered for surgical intervention.

Paper Nr: 72
Title:

AN EFFICIENT PARALLEL GPU EVALUATION OF SMALL ANGLE X-RAY SCATTERING PROFILES

Authors:

Lubomir D. Antonov, Christian Andreetta and Thomas Hamelryck

Abstract: The inference of protein structure from experimental data is of crucial interest in science, medicine and biotechnology. Unfortunately, high-resolution experimental methods cannot yet provide a detailed analysis of the ensemble of conformations adopted under physiological conditions, and low resolution techniques are often better suited for this task. Small angle X-ray scattering (SAXS) plays a major role in investigating important biological questions regarding the structure of multidomain proteins connected by flexible linkers or the aggregation processes that underlie several major diseases in humans. In silico simulations can bridge the gap between low resolution information and models derived from high-resolution techniques. For that, it is necessary to calculate the low resolution information from a given detailed model using a so-called forward model. These calculations need to be performed many times during a conformational search, and therefore need to be computationally efficient. We present an efficient implementation of the forward model for SAXS experiments with full hardware utilization of general-purpose graphics processing units (GPGPUs). The proposed algorithm is orders of magnitude faster than an efficient CPU implementation, and implements a caching procedure ready to be employed in the partial SAXS evaluations required by in silico simulations.
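
The O(n²) core of a typical SAXS forward model is the Debye formula, I(q) = sum_ij f_i f_j sin(q r_ij) / (q r_ij), which is what makes GPU parallelization attractive. A plain numpy sketch follows; the paper's GPU kernels and form-factor handling are more elaborate:

```python
# Debye formula over all atom pairs; the i == j terms contribute f_i**2
# via the sinc limit sin(x)/x -> 1 as x -> 0.
import numpy as np

def debye_intensity(coords, form_factors, q_values):
    """coords: (n, 3) atom positions; form_factors: (n,); q_values: (m,)."""
    diff = coords[:, None, :] - coords[None, :, :]
    r = np.linalg.norm(diff, axis=-1)                  # pairwise distances
    ff = form_factors[:, None] * form_factors[None, :]
    intensities = []
    for q in q_values:
        qr = q * r
        sinc = np.where(qr > 0, np.sin(qr) / np.where(qr > 0, qr, 1.0), 1.0)
        intensities.append(np.sum(ff * sinc))
    return np.array(intensities)
```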

Short Papers
Paper Nr: 15
Title:

MATCHING TWO-DIMENSIONAL GEL ELECTROPHORESIS’ SPOTS

Authors:

António dos Anjos, Bjarne Kjær Ersbøll, Faroq AL-Tam and Hamid Reza Shahbazkia

Abstract: This paper describes an approach for matching spots across Two-Dimensional Electrophoresis (2-DE) gels, involving the use of image registration. The number of false positive matches produced by the proposed approach is small compared to academic and commercial state-of-the-art approaches. This article contributes to solving one of the greatest bottlenecks in the 2-DE analysis pipeline.

Paper Nr: 18
Title:

APPLICATION OF GENOME LINGUISTIC APPROACHES FOR IDENTIFICATION OF GENOMIC ISLAND IN BACTERIAL GENOMES AND TRACKING DOWN THEIR ORIGINS - Genome Linguistics to Visualize Horizontal Gene Exchange

Authors:

Oliver Bezuidt, Kingdom Mncube and Oleg N. Reva

Abstract: With more complete bacterial genome sequences becoming publicly available, approaches that compare genomes by oligonucleotide (k-mer) frequencies, also known as genome linguistics, are becoming popular and practical for resolving problems that cannot be tackled by traditional sequence comparison tools. In this work we present several innovative approaches based on k-mer statistics for detecting genomic island inserts and tracing the ontological links and origins of mobile genetic elements. 637 bacterial genomes were analyzed with the SeqWord Sniffer program, which detected 2,622 putative genomic islands. These genomic islands were clustered by DNA compositional similarity. A stratigraphic analysis was introduced that distinguishes between new and old genomic inserts, and a method for reconstructing donor-recipient relations between micro-organisms was proposed. The strain E. coli TY-2482, isolated from the deadly outbreak of a haemorrhagic infection in Europe in 2011, was used as a case study. It was shown that this strain appeared at an intersection of two independent fluxes of horizontal gene exchange: one a stream of vectors, conventional for Enterobacteria, generated in marine gamma-Proteobacteria; the other a new channel of antibiotic resistance genomic islands originating from environmental beta-Proteobacteria.

Paper Nr: 19
Title:

PREDICTION OF SIGNIFICANT CRUCIFORM STRUCTURES FROM SEQUENCE IN TOPOLOGICALLY CONSTRAINED DNA - A Probabilistic Modelling Approach

Authors:

Matej Lexa, Lucie Navrátilová, Karel Nejedlý and Marie Brázdová

Abstract: Sequence-dependent secondary DNA structures, such as cruciform or triplex DNA, are implicated in the regulation of gene transcription and other important biological processes at the molecular level. Sequences capable of forming these structures can readily be identified in entire genomes by appropriate searching techniques. However, not every DNA segment containing the proper sequence has equal probability of forming an alternative structure. Calculating the free energy of the potential structures provides an estimate of their stability in vivo, but there are other structural factors, both local and non-local, not taken into account by such a simplistic approach. In this paper we present the procedure we currently use to identify potential cruciform structures in DNA sequences. The procedure relies on the identification of palindromes (or inverted repeats) and their evaluation by a nucleic acid folding program (UNAFold). We further extended the procedure by adding a modelling step to filter the predicted cruciforms. The model takes into account the superhelical density of the analyzed segments of DNA and calculates the probability of cruciforms forming at several locations of the analyzed DNA, based on the sequences in the stem and loop areas of the structures and the competition among them.

Paper Nr: 28
Title:

A MEMETIC ALGORITHM FOR PROTEIN STRUCTURE PREDICTION BASED ON THE 2D TRIANGULAR LATTICE MODEL

Authors:

Jyh-Jong Tsay and Shih-Chieh Su

Abstract: Proteins play fundamental and crucial roles in nearly all biological processes, such as enzymatic catalysis, signaling transduction, embryonic development, and DNA and RNA synthesis. A protein's main function is determined by its structure; therefore, many researchers are interested in protein structure prediction. The HP model is one of the commonly used models, but most research on the HP lattice model focuses on solving the optimization problem and ignores the purpose of protein structure prediction, namely the prediction of structure similarity between proteins. The 2D triangular lattice model used in this study can predict protein structure closer to its native topology than the 2D square model commonly used in the past. Besides proposing an effective memetic algorithm (MA), this study also investigates the structure similarity of natural proteins.

Paper Nr: 41
Title:

PREDICTING NEW HUMAN DRUG TARGETS BY USING FEATURE SELECTION TECHNIQUES

Authors:

Eduardo Campos dos Santos, Braulio Roberto Gonçalves Marinho Couto, Marcos A. dos Santos and Julio Cesar Dias Lopes

Abstract: Drug target identification and validation are critical steps in the drug discovery pipeline. Hence, predicting potential “druggable targets”, or targets that can be modulated by some drug, is very relevant to drug discovery. Approaches using structural bioinformatics to predict “druggable domains” have been proposed, but they have only been applied to proteins that have solved structures or that have a reliable model predicted by homology. We show that available protein annotation terms may be used to explore semantic-based measures to provide target similarity searching and develop a tool for potential drug target prediction. We analysed 1,541 human protein drug targets and 29,580 human proteins not validated as drug targets but which share some InterPro annotations with a known drug target. We developed a semantic-based similarity measure by using singular value decomposition over InterPro terms associated with drug targets, performed statistical analyses and built logistic regression models. We present a probabilistic model summarised in a closed mathematical formula that allows human protein drug targets to be predicted with a sensitivity of 89% and a specificity of 67%.

Paper Nr: 43
Title:

PYCOEVOL - A Python Workflow to Study Protein-protein Coevolution

Authors:

Fábio Madeira and Ludwig Krippahl

Abstract: Protein coevolution has emerged as an important research topic. Several methods and scoring systems have been developed to quantify coevolution, though the quality of the results usually depends on the completeness of the biological data. To simplify the computation of coevolution indicators from the data, we have implemented a fully integrated and automated workflow, written in the Python scripting language, which enables efficient analysis of protein coevolution. Pycoevol automates access to remote or local databases and third-party applications, and also includes data processing functions. For a given protein complex under study, Pycoevol retrieves and processes all the information needed for the analysis, namely homologous sequence search, multiple sequence alignment computation and coevolution analysis using a Mutual Information indicator. In addition, user-friendly outputs are created, namely histograms and heatmaps of inter-protein mutual information scores, as well as lists of significantly coevolving residue pairs. An illustrative example is presented. Pycoevol is platform independent and is available under the general public license from http://code.google.com/p/pycoevol.
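
As a point of reference, a minimal sketch of the column-pair Mutual Information indicator mentioned above; Pycoevol's actual scoring may include additional normalizations:

```python
# MI(i, j) = sum over residue pairs (a, b) of p(a,b) * log2(p(a,b) / (p(a) p(b))),
# computed from two alignment columns of equal length.
from collections import Counter
from math import log2

def mutual_information(col_i, col_j):
    """col_i, col_j: equal-length strings of aligned residues (one per sequence)."""
    n = len(col_i)
    p_i = Counter(col_i)
    p_j = Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in p_ij.items():
        mi += (c / n) * log2(c * n / (p_i[a] * p_j[b]))
    return mi

print(mutual_information("AAAC", "TTTG"))  # covarying columns give positive MI
```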

Paper Nr: 46
Title:

INFERENCE OF GENE REGULATORY NETWORKS BY EXTENDED KALMAN FILTERING USING GENE EXPRESSION TIME SERIES DATA

Authors:

Ramouna Fouladi, Emad Fatemizadeh and S. Shahriar Arab

Abstract: In this paper, the Extended Kalman Filtering (EKF) approach is used to infer gene regulatory networks from time-series gene expression data. Gene expression values are considered stochastic processes, and the gene regulatory network a dynamical nonlinear stochastic model. Using these values and a modified Kalman filtering approach, the model's parameters, and consequently the interactions among genes, are predicted. Each gene-gene interaction is modeled using a linear term, a nonlinear term, and a constant term. The linear and nonlinear coefficients are included in the state vector together with the true values of the gene expressions. Through the extended Kalman filtering process, these coefficients are updated in such a way that the predicted gene expressions follow the observed ones. Finally, connections between each pair of genes are inferred from these coefficients.
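
For readers unfamiliar with the technique, a generic EKF predict/update step in standard textbook form; the paper's state vector is augmented with the interaction coefficients, which is not shown here:

```python
# One EKF iteration: nonlinear predict, then linearized Kalman update.
import numpy as np

def ekf_step(x, P, z, f, h, F, H, Q, R):
    """x: state estimate; P: covariance; z: new measurement;
    f, h: transition and observation functions; F, H: their Jacobians
    (evaluated at the current estimates); Q, R: noise covariances."""
    # Predict
    x_pred = f(x)
    P_pred = F @ P @ F.T + Q
    # Update
    y = z - h(x_pred)                        # innovation
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```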

Paper Nr: 55
Title:

INTEGRATING PATHWAY ENRICHMENT AND GENE NETWORK ANALYSIS PROVIDES ACCURATE DISEASE CLASSIFICATION

Authors:

Maysson Al-Haj Ibrahim, Sabah Jassim, Michael A. Cawthorne and Kenneth Langlands

Abstract: At present, a range of clinical indicators is used to gain insight into the course a newly-presented individual's disease may take, and so inform treatment regimes. However, such indicators are not absolutely predictive, and patients with apparently low-risk disease may follow a more aggressive course. Advances in molecular medicine offer the hope of improved disease stratification and personalised treatment. For example, the identification of "genetic signatures" characteristic of disease subtypes is facilitated by high-throughput transcriptional profiling techniques (microarrays), in which gene expression levels for thousands of genes are measured across a range of biopsy samples. However, the selection of a compact gene set conferring the most clinically-relevant information from complex and high-dimensional microarray datasets is a challenging task. We reduced this complexity using a Pathway Enrichment and Gene Network Analysis (PEGNA) method, which integrates gene expression data with prior biological knowledge to select a group of strongly-correlated genes providing accurate discrimination of complex disease subtypes. In our method, pathway enrichment analysis is first applied to a microarray dataset in order to identify the most impacted biological processes. Second, gene network analysis is used to find a group of strongly-correlated genes, from which subsets of genes are selected for disease classification with a support vector machine classifier. In this way, we were able to classify disease states more accurately, using smaller numbers of genes, than other methods across a range of biological datasets.
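
A hedged sketch of a standard pathway over-representation test (hypergeometric tail probability); PEGNA's exact enrichment statistic is not specified in the abstract and may differ:

```python
# P(overlap >= k) when drawing n_selected genes from a background of
# n_genome genes, n_pathway of which belong to the pathway of interest.
from scipy.stats import hypergeom

def enrichment_pvalue(n_genome, n_pathway, n_selected, n_overlap):
    return hypergeom.sf(n_overlap - 1, n_genome, n_pathway, n_selected)

# Example: 40 of 300 selected genes fall in a 500-gene pathway drawn
# from a 20,000-gene background.
print(enrichment_pvalue(20000, 500, 300, 40))
```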

Paper Nr: 57
Title:

PREDICTION OF PROTEIN INTERACTIONS ON HIV-1–HUMAN PPI DATA USING A NOVEL CLOSURE-BASED INTEGRATED APPROACH

Authors:

Kartick C. Mondal, Nicolas Pasquier, Anirban Mukhopadhyay, Célia da Costa Pereira, Ujjwal Maulik and Andrea G. B. Tettamanzi

Abstract: Discovering Protein-Protein Interactions (PPI) is an interesting new challenge in computational biology. Identifying interactions among proteins has been shown to be useful for finding new drugs and preventing several kinds of diseases. The identification of interactions between HIV-1 proteins and Human proteins is a particular PPI problem whose study might lead to the discovery of drugs and of important interactions responsible for AIDS. We present the FIST algorithm for extracting hierarchical bi-clusters and minimal covers of association rules in one process. This algorithm is based on the frequent closed itemsets framework to efficiently generate a hierarchy of conceptual clusters and non-redundant sets of association rules with supporting object lists. Experiments conducted on an HIV-1 and Human protein interaction dataset show that the approach efficiently identifies interactions previously predicted in the literature and can be used to predict new interactions based on previous biological knowledge.

Paper Nr: 58
Title:

STUDY OF PROTEIN STRUCTURE ALIGNMENT PROBLEM IN PARAMETERIZED COMPUTATION

Authors:

Cody Ashby, Kun Wang, Carole L. Cramer and Xiuzhen Huang

Abstract: Motivated by the practical application of protein structure-structure alignment, we have studied the problem of maximum common subgraph within the framework of parameterized complexity. We investigated the lower bound for exact algorithms for the problem. We proved that it is unlikely that there is an algorithm of time p(n,m) * k^o(m) for the problem, where p is a polynomial function, k is a parameter of map width, and m and n are the numbers of vertices of the two graphs, respectively. In view of the upper bound of p(n,m) * k^m based on the brute-force approach, our lower bound result is asymptotically tight. Although the running time p(n,m) * k^m cannot be significantly improved given our lower bound result, it is still possible to develop efficient algorithms for the practical application of protein structure-structure alignment. We developed an efficient algorithm integrating the color coding method and parameterized computation for identifying the maximum common subgraph of two protein structure graphs. We have applied the algorithm to protein structure-structure alignment and conducted experimental testing on more than 600 protein pairs. Our parameterized approach improves structure alignment efficiency and will be very useful for structure comparisons of proteins with large sizes.

Paper Nr: 59
Title:

PERFORMANCE STUDY OF PARALLEL HYBRID MULTIPLE PATTERN MATCHING ALGORITHMS FOR BIOLOGICAL SEQUENCES

Authors:

Charalampos S. Kouzinopoulos, Panagiotis D. Michailidis and Konstantinos G. Margaritis

Abstract: Multiple pattern matching is widely used in computational biology to locate any number of nucleotide patterns in genome databases. Processing data of this size often requires more computing power than a sequential computer can provide. A viable and cost-effective solution that can offer the power required by computationally intensive applications at low cost is to share computational tasks among the processing nodes of a high performance hybrid distributed and shared memory platform consisting of cluster workstations and multi-core processors. This paper presents experimental results and a theoretical performance model for hybrid implementations of the Commentz-Walter, Wu-Manber, Set Backward Oracle Matching and Salmela-Tarhio-Kytöjoki multiple pattern matching algorithms when executed in parallel on biological sequence databases.

Paper Nr: 61
Title:

BENEFITS OF GENETIC ALGORITHM FEATURE-BASED RESAMPLING FOR PROTEIN STRUCTURE PREDICTION

Authors:

Trent Higgs, Bela Stantic, Tamjidul Hoque and Abdul Sattar

Abstract: Protein structure prediction (PSP) is an important task, as the three-dimensional structure of a protein dictates what function it performs. PSP can be modelled on computers by searching for the global free energy minimum based on Anfinsen's 'Thermodynamic Hypothesis'. To explore this free energy landscape, Monte Carlo (MC) based search algorithms have been heavily utilised in the literature. However, evolutionary search approaches, like Genetic Algorithms (GA), have shown a lot of potential in low-resolution models to produce more accurate predictions. In this paper we have evaluated a GA feature-based resampling approach, which uses a heavy-atom based model, by selecting 17 random CASP 8 sequences and evaluating it against two different MC approaches. Our results indicate that our GA improves both root mean square deviation (RMSD) and template modelling score (TM-score). From our analysis we conclude that combining feature-based resampling with Genetic Algorithms creates structures with more native-like features due to the use of crossover and mutation operators, which is supported by the low RMSD values we obtained.
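
For reference, the RMSD measure used above, in a minimal sketch that assumes the two coordinate sets are already optimally superposed (a full evaluation would first apply, e.g., a Kabsch superposition):

```python
# RMSD = sqrt(mean squared distance between matched atom pairs).
import numpy as np

def rmsd(coords_a, coords_b):
    """coords_a, coords_b: (n, 3) arrays of matched atom coordinates."""
    diff = coords_a - coords_b
    return np.sqrt((diff * diff).sum() / len(coords_a))
```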

Paper Nr: 66
Title:

MASon: MILLION ALIGNMENTS IN SECONDS - A Platform Independent Pairwise Sequence Alignment Library for Next Generation Sequencing Data

Authors:

Philipp Rescheneder, Arndt von Haeseler and Fritz J. Sedlazeck

Abstract: The advent of Next Generation Sequencing (NGS) technologies and the increase in read length and number of reads per run pose a computational challenge to bioinformatics. The demand for sensitive, inexpensive, and fast methods to align reads to a reference genome is constantly increasing. Due to its high sensitivity, the Smith-Waterman (SW) algorithm is best suited for this task; however, its high demand for computational resources makes it impractical. Here we present an optimized SW implementation for NGS data and demonstrate the advantages of using common and inexpensive high performance architectures to improve the computing time of NGS applications. We implemented a C++ library (MASon) that exploits graphics cards (CUDA, OpenCL) and CPU vector instructions (SSE, OpenCL) to efficiently handle millions of short local pairwise sequence alignments (36 bp - 1,000 bp). The library can be easily integrated into existing and upcoming NGS applications and allows programmers to optimally utilize modern hardware, ranging from desktop computers to high-end clusters.
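
A minimal reference sketch of the Smith-Waterman recurrence with linear gap penalties; MASon's GPU/SSE kernels compute the same recurrence, organized for data parallelism:

```python
# Classic O(len(a) * len(b)) local alignment: cell (i, j) holds the best
# score of any local alignment ending at a[i-1], b[j-1], floored at zero.
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if seq_a[i-1] == seq_b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best  # optimal local alignment score

print(smith_waterman("ACACACTA", "AGCACACA"))  # best local score
```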

Paper Nr: 68
Title:

EFFICIENT PATH KERNELS FOR REACTION FUNCTION PREDICTION

Authors:

Markus Heinonen, Niko Välimäki, Veli Mäkinen and Juho Rousu

Abstract: Kernels for structured data are rapidly becoming an essential part of the machine learning toolbox. Graph kernels provide similarity measures for complex relational objects, such as molecules and enzymes. Graph kernels based on walks are popular due to their fast computation, but their predictive performance is often not satisfactory, while kernels based on subgraphs suffer from high computational cost and are limited to small substructures. Kernels based on paths offer a promising middle ground between these two extremes; however, the computation of path kernels has so far been considered computationally too challenging. In this paper we introduce an effective method for computing path based kernels: we employ a Burrows-Wheeler transform based compressed path index for fast and space-efficient enumeration of paths. Unlike many kernel algorithms, the index representation retains fast access to individual features. In our experiments with chemical reaction graphs, path based kernels surpass state-of-the-art graph kernels in prediction accuracy.
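
To make the feature map concrete, a brute-force toy path kernel that counts the label sequences of simple paths up to a fixed length; the paper's contribution is doing this enumeration efficiently with a Burrows-Wheeler based path index, which this sketch does not reproduce:

```python
# Feature vector: counts of path label strings; kernel: dot product of the
# two graphs' count vectors.
from collections import Counter

def path_features(adj, labels, max_len):
    """adj: {node: set(neighbors)}; labels: {node: str}."""
    feats = Counter()
    def extend(path):
        feats["".join(labels[v] for v in path)] += 1
        if len(path) < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:          # simple paths only
                    extend(path + [nxt])
    for v in adj:
        extend([v])
    return feats

def path_kernel(g1, g2, max_len=4):
    f1, f2 = path_features(*g1, max_len), path_features(*g2, max_len)
    return sum(f1[p] * f2[p] for p in f1)

g = ({0: {1}, 1: {0, 2}, 2: {1}}, {0: "C", 1: "O", 2: "C"})
print(path_kernel(g, g))                     # self-similarity of a tiny graph
```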

Paper Nr: 70
Title:

cswHMM: A NOVEL CONTEXT SWITCHING HIDDEN MARKOV MODEL FOR BIOLOGICAL SEQUENCE ANALYSIS

Authors:

Vojtěch Bystrý and Matej Lexa

Abstract: In this work we created a sequence model that goes beyond simple linear patterns to model a specific type of higher-order relationship possible in biological sequences. In particular, we seek models that can account for partially overlaid and interleaved patterns in biological sequences. Our proposed context-switching model (cswHMM) is designed as a variable-order hidden Markov model (HMM) with a specific structure that allows switching control between two or more sub-models. An important feature of our model is the ability of its sub-models to store their last active state, so that when a sub-model resumes control it can continue uninterrupted. This is a fundamental variation on the closely related jumping HMMs. A combination of as few as two simple linear HMMs can describe sequences with complicated mixed dependencies. Tests of this approach suggest that combining HMMs for protein sequence analysis, such as pattern mining based HMMs or profile HMMs, with the context-switching approach can improve the descriptive ability and performance of the models.

Paper Nr: 74
Title:

WRAPPER AND FILTER METRICS FOR PSO-BASED CLASS BALANCE APPLIED TO PROTEIN SUBCELLULAR LOCALIZATION

Authors:

S. Garcia López, J. A. Jaramillo-Garzón, J. C. Higuita-Vásquez and C. G. Castellanos-Domínguez

Abstract: Recent advances in proteomic research have generated an unprecedented amount of stored data. Given the size of current databases, manual annotation has become an almost intractable process, paving the way for computational methods. In this context, since a single protein can belong to several functional classes, a multi-label classification problem arises. The most common way to cope with such problems is to train a number of classifiers equal to the number of classes, allowing independent decisions on the membership of proteins. Nevertheless, this methodology leads to a high degree of imbalance between classes, magnifying the disparity already present in their sizes. Current balancing techniques are based on optimizing criteria that lead to a better representative subset of the data, and most sample selection criteria are based on wrapper-type metrics. However, wrapper metrics are computationally quite expensive. This work presents a comparative analysis between wrapper and filter metrics as sample selection criteria in balancing techniques. To accomplish this task, a subsampling technique based on the Particle Swarm Optimization method is used to obtain the optimal balanced subset. The results show that filter metrics notably reduce the computational cost while obtaining performance similar to that of wrapper-type metrics.

Paper Nr: 77
Title:

STATISTICAL MECHANICS OF PROTEINS IN THE RANDOM COIL STATE

Authors:

Cigdem Sevim Bayrak and Burak Erman

Abstract: Denatured proteins are mostly partially folded and compact. A statistical analysis of thermodynamic properties is presented to describe and characterize denatured proteins. Conformational free energy, energy, entropy and heat capacity expressions for the denatured state are derived using the Rotational Isomeric States model of polymer theory. The state space and the probabilities of each state are obtained from a coil database. Properties of the denatured state are computed for a sample set of proteins taken from the Protein Data Bank.

Paper Nr: 83
Title:

IMPROVEMENTS TO A MULTIPLE PROTEIN SEQUENCE ALIGNMENT TOOL

Authors:

André Atanasio M. Almeida and Zanoni Dias

Abstract: Sequence alignment is the most common task in the bioinformatics field. It is a required step in a wide range of procedures, such as the search for homologous sequences in a database or protein structure prediction. The main goal of the experiments in this work was to improve the accuracy of multiple sequence alignments. Our experiments concentrated on the MUMMALS multiple aligner, testing three distinct modifications to the algorithm. The first experiment modified the substring length of the k-mer count method. The second substituted the commonly used Dayhoff(6) with alternative compressed alphabets. The third modified the distance matrix computation and the guide tree construction. Each of the experiments showed a gain in accuracy.

Paper Nr: 85
Title:

PREDICTION OF CHIMERIC PROTEIN FOLD

Authors:

Ruben Acuña, Zoé Lacroix, Fayez Hadji, Jacques Chomilier and Nikolaos Papandreou

Abstract: We propose two computational methods for predicting whether a protein produced by gene fusion will conserve the structures of the fused proteins. We use two complementary paths for prediction: the former is a simulation from the sequence, while the latter exploits the expected structure. Early stages of protein folding are simulated from the amino acid sequence by capturing the most interacting residues (MIR). Individual domain structures (or models) are superposed onto the predicted complex structure (or model). When no structure exists, a model is calculated using a set of ab initio and fold recognition tools. These results are used to predict the validity of the chimeric protein. We test the two methods against a dataset of 10 proteins.

Paper Nr: 87
Title:

SEMI-SUPERVISED LEARNING OF ALTERNATIVELY SPLICED EXONS USING EXPECTATION MAXIMIZATION TYPE APPROACHES

Authors:

Ana Stanescu and Doina Caragea

Abstract: Successful advances in DNA sequencing technologies have made it possible to obtain tremendous amounts of data fast and inexpensively. As a consequence, the associated genome annotation has become the bottleneck in our understanding of genes and their functions. Traditionally, data from biological domains have been analyzed using supervised learning techniques. However, given the large amounts of unlabeled genomic data available, together with small amounts of labeled data, the use of semi-supervised learning algorithms is desirable. Our purpose is to study the applicability of semi-supervised learning frameworks to DNA prediction problems, with a focus on alternative splicing, a natural biological process that contributes to protein diversity. More specifically, we address the problem of predicting alternatively spliced exons. To utilize the unlabeled data, we train classifiers via the Expectation Maximization method and variants of this method. The experiments conducted show an increase in the quality of the prediction models when unlabeled data is used in the training phase, compared to supervised prediction models which do not make use of the unlabeled data.
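
One EM-style variant in sketch form: self-training with posterior-weighted unlabeled examples, using a Naive Bayes base classifier as an illustrative assumption (the paper evaluates EM and its variants more rigorously):

```python
# E-step: posterior class probabilities on the unlabeled set;
# M-step: refit on labeled data plus confidence-weighted unlabeled data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def em_semi_supervised(X_lab, y_lab, X_unlab, n_iter=10):
    clf = GaussianNB().fit(X_lab, y_lab)
    for _ in range(n_iter):
        post = clf.predict_proba(X_unlab)
        hard = post.argmax(axis=1)
        weight = post.max(axis=1)             # confidence as sample weight
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, clf.classes_[hard]])
        w_all = np.concatenate([np.ones(len(y_lab)), weight])
        clf = GaussianNB().fit(X_all, y_all, sample_weight=w_all)
    return clf
```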

Paper Nr: 97
Title:

HOW TO DEAL WITH SMALL OPEN READING FRAMES?

Authors:

Małgorzata Wańczyk, Paweł Błażej, Paweł Mackiewicz and Stanisław Cebrat

Abstract: Current 'classical' algorithms for recognizing protein coding sequences do not work effectively on sequences of small length. To deal with this problem, we propose improvements to existing gene finders that avoid any arbitrarily assumed threshold. The introduced parameters describe the position of a tested sequence in the ranking of all small Open Reading Frames and short protein coding genes found in the analyzed genome. The sequences can be ranked according to the coding potential calculated by 'standard' gene prediction algorithms. As an example, we used two gene recognition algorithms and tested a set of small ORFs selected from prokaryotic genomes using sequence similarity methods. The applied approach enabled the identification of promising sequences that may code for small proteins.

Paper Nr: 98
Title:

ANALYSIS OF CORRELATION STRUCTURES IN RENAL CELL CARCINOMA PATIENT DATA

Authors:

Italo Zoppis, Massimiliano Borsani, Erica Gianazza, Clizia Chinello, Francesco Rocco, Giancarlo Albo, André M. Deelder, Yuri E. M. van der Burgt, Fulvio Magni, Giancarlo Mauri and Marco Antoniotti

Abstract: Mass Spectrometry (MS)-based technologies represent a promising area of research in clinical analysis. They are primarily concerned with measuring the relative intensity (abundance) of many protein/peptide molecules associated with their mass-to-charge ratios over a particular range of molecular masses. These measurements (generally referred to as proteomic signals or spectra) constitute a huge amount of information which requires adequate tools to be investigated and interpreted. Following the methodology for testing hypotheses, we investigate the proteomic signals of the most common type of Renal Cell Carcinoma, the Clear Cell variant (ccRCC). Specifically, the aim of our investigation is to detect changes in the signal correlations from the control group to the case group (ccRCC or non-ccRCC). To this end, we sample and represent each population group through a graph providing the observed signal correlation structure. In this way, graphs establish abstract frames of reference in our analysis, giving the opportunity to test hypotheses over their properties; in other terms, changes are detected by testing graph property modifications from group to group. We report the mass-to-charge values identifying bounded regions where changes have been detected. The main interest in handling these regions is to perceive which signal ranges are associated with some specific factors of interest (e.g., studying differentially expressed peaks between case and control groups) and thus to suggest potential biomarkers for future analysis or for clinical monitoring. Data were collected from patients and healthy volunteers at the Ospedale Maggiore Policlinico Foundation (Milano, Italy).
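
A minimal sketch of the correlation-graph construction described above; the threshold and the compared property (mean degree) are illustrative assumptions, not the paper's tested graph properties:

```python
# Build one graph per group by thresholding pairwise signal correlations,
# then compare a simple property between groups.
import numpy as np

def correlation_graph(spectra, threshold=0.7):
    """spectra: (n_samples, n_mz_bins) intensity matrix for one group."""
    corr = np.corrcoef(spectra, rowvar=False)   # bin-by-bin correlations
    np.fill_diagonal(corr, 0.0)
    return np.abs(corr) >= threshold            # boolean adjacency matrix

def mean_degree(adj):
    return adj.sum(axis=1).mean()

# m/z regions where mean_degree(correlation_graph(cases)) differs markedly
# from mean_degree(correlation_graph(controls)) are candidate regions.
```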

Paper Nr: 100
Title:

NUMERICAL ANALYSIS OF IMAGE BASED HIGH THROUGHPUT ZEBRAFISH INFECTION SCREENS - Matching Meaning with Data

Authors:

Alexander E. Nezhinsky, Esther Stoop, Astrid van der Sar and Fons J. Verbeek

Abstract: Tuberculosis is an ancient disease; however, the molecular mechanism of tuberculosis pathology is not completely elucidated yet. In our research we aim to contribute to the understanding of the genes/proteins that are involved in the infection. As a model for the infection study we use the bacterium Mycobacterium marinum, which is closely related to Mycobacterium tuberculosis, the causative agent of tuberculosis in humans. M. marinum causes a tuberculosis-like disease and is applied to the zebrafish larva as a model (host) organism. We use a novel pattern recognition framework which allows for in-depth analysis of the spread of infection within the zebrafish organism. The amount of infection has been analyzed before; however, an in-depth analysis of its spatial distribution was not yet accomplished. Therefore, as a proof of concept, we investigate the presence of specific spatial and quantitative infection patterns.

Paper Nr: 106
Title:

BRIDGING THE GAP BETWEEN DESIGN AND REALITY - A Dual Evolutionary Strategy for the Design of Synthetic Genetic Circuits

Authors:

J. S. Hallinan, S. Park and A. Wipat

Abstract: Computational design is essential to the field of synthetic biology, particularly as its practitioners become more ambitious and system designs become larger and more complex. However, computational models derived from abstract designs are unlikely to behave in the same way as organisms engineered from those same designs. We propose an automated, iterative strategy involving evolution both in silico and in vivo, with feedback between the two strands as necessary, combined with automated reasoning. This system can help bridge the gap between the behaviour of computational models and that of engineered organisms in as rapid and cost-effective a manner as possible.

Paper Nr: 14
Title:

A TASTE OF YEAST MOBILOMICS

Authors:

Giulia Menconi, Giovanni Battaglia, Roberto Grossi, Nadia Pisanti and Roberto Marangoni

Abstract: Mobilomics calls for detecting all the mobile elements in a genome so as to understand their dynamic behavior. We devise and apply a method that extends a pairwise strain comparison tool to mobile genetic element (MGE) inference, and perform experiments on a dataset of 39 complete genomes of as many yeast (S. cerevisiae) strains. We locate a priori all the MGE regions annotated in the reference sequence at hand, and map all the putative MGEs in all the other (non-annotated) strains. Interestingly, the evolutionary relation among the strains based on the presence/absence of candidate MGEs turns out to be quite close to that inferred by classic phylogenetic methods based on SNP analysis.

Paper Nr: 23
Title:

3D VISUALIZATION OF HAPLOTYPE RISK MAPS

Authors:

Sergio Torres-Sánchez, Manuel García-Sánchez, Germán Arroyo, Nuria Medina-Medina, Rosana Montes-Soldado, Francisco Soler-Martínez and María M. Abad-Grau

Abstract: Traditionally, genetic risk maps consider genotypic differences in a small number of single markers. A more recent approach instead considers a very large set of input variables, some of them with very little effect, and haplotypes with several consecutive markers instead of genotypes. While a bidimensional map can only show the first of the two approaches, a 3D map, together with a powerful virtual reality visualization tool, can combine both, so that the molecular biologist can become immersed and explore every genetic risk factor represented in the map. Maps enriched with information from different annotation sources fully benefit from this 3D immersive feature.

Paper Nr: 40
Title:

APPROACH TO ENABLE AN AUTOMATIC PRE-PROCESSING OF qRT-PCR - Analysis

Authors:

Marco Franke, Klaus-Dieter Thoben and Rainer Söller

Abstract: A fully automated analysis method for real-time PCR samples can be achieved through automated noise removal. This provides robustness against the factors that repeatedly impact the analysis of real-time PCR data and usually necessitate manual analysis by skilled experts. The presented approach attains a degree of robustness which allows an automated analysis of real-time PCR. This new automated pre-processing method improves the real-time PCR data in such a way that accepted qualitative analysis methods can be applied in an automated fashion. Furthermore, the article presents an evaluation of an implemented demonstrator for automated analysis which combines the developed pre-processing method with the second derivative method for the qualitative analysis.
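
For context, a minimal sketch of the second derivative method on an idealized amplification curve; the paper's contribution is the noise-removal pre-processing that makes such a method reliable on real data, which is not reproduced here:

```python
# Locate the quantification cycle (Cq) as the cycle of maximum curvature
# (second derivative maximum) of the fluorescence curve.
import numpy as np

def cq_second_derivative(fluorescence):
    """fluorescence: 1-D array of background-corrected signal per cycle."""
    d2 = np.gradient(np.gradient(fluorescence))
    return int(np.argmax(d2))

cycles = np.arange(40)
curve = 1.0 / (1.0 + np.exp(-(cycles - 25) / 2.0))   # idealized sigmoid
print(cq_second_derivative(curve))                    # near the curve's onset
```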

Paper Nr: 49
Title:

IDENTIFICATION OF HIV-1 DYNAMICS - Estimating the Noise Model, Constant and Time-varying Parameters of Long-term Clinical Data

Authors:

András Hartmann, Susana Vinga and Joao M. Lemos

Abstract: The importance of a system theory based approach in understanding immunological diseases, in particular the HIV-1 infection, is being increasingly recognized. This is because the dynamics of virus infection may be effectively represented by relatively compact state space models in the form of nonlinear ordinary differential equations. This work focuses on the identification of constant and time-varying parameters in long-term dynamic HIV-1 data. We introduce a novel strategy for parameter identification: constant parameters were estimated using Particle Swarm Optimization (PSO), and time-varying parameters were captured with an Extended Kalman Filter (EKF). As the EKF depends strongly on the noise model, the measurement noise was also inferred. The results on clinical data are convincing: similar noise parameters were detected for two different subjects, a good overall fit to the data was reached, and the EKF proved efficient in estimating the time-varying parameters, overcoming drawbacks and limitations of existing methods.
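
A minimal global-best PSO sketch for constant-parameter estimation; the inertia and acceleration constants are illustrative, and the paper's exact PSO settings are not given in the abstract:

```python
# Global-best PSO: each particle is pulled toward its personal best and the
# swarm's best position found so far.
import numpy as np

def pso(cost, dim, n_particles=30, n_iter=200, lo=-5.0, hi=5.0):
    rng = np.random.default_rng(0)
    x = rng.uniform(lo, hi, (n_particles, dim))   # positions
    v = np.zeros_like(x)                          # velocities
    p_best, p_cost = x.copy(), np.array([cost(p) for p in x])
    g_best = p_best[p_cost.argmin()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = 0.7 * v + 1.5 * r1 * (p_best - x) + 1.5 * r2 * (g_best - x)
        x = np.clip(x + v, lo, hi)
        c = np.array([cost(p) for p in x])
        improved = c < p_cost
        p_best[improved], p_cost[improved] = x[improved], c[improved]
        g_best = p_best[p_cost.argmin()].copy()
    return g_best

# Example: recover the minimizer of a simple quadratic cost.
print(pso(lambda p: ((p - 1.0) ** 2).sum(), dim=3))
```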

Paper Nr: 52
Title:

ABSTRACTIONS FOR SCALING eSCIENCE APPLICATIONS TO DISTRIBUTED COMPUTING ENVIRONMENTS - A StratUm Integration Case Study in Molecular Systems Biology

Authors:

Per-Olov Östberg, Andreas Hellander, Brian Drawert, Erik Elmroth, Sverker Holmgren and Linda Petzold

Abstract: Management of eScience computations and the resulting data in distributed computing environments is complicated and often introduces considerable overhead. In this work we address the lack of integration tools that provide the abstraction levels, performance, and usability required to facilitate the migration of eScience applications to distributed computing environments. In particular, we explore an approach to raising abstraction levels based on separating computation design from computation management, and present StratUm, a computation enactment tool for distributed computing environments. The results are illustrated in a case study integrating software from the systems biology community with a grid computation management system.

Paper Nr: 54
Title:

DIFFERENTIAL EVOLUTION TO MULTI-OBJECTIVE PROTEIN STRUCTURE PREDICTION

Authors:

Sandra M. Venske, Richard A. Gonçalves and Myriam R. Delgado

Abstract: Protein structure prediction (PSP) is one of the most challenging problems today and an important Bioinformatics research topic. In this paper we propose an optimization method based on differential evolution for the PSP problem. We model PSP as an optimization problem in order to minimize the potential energy using an ab initio approach. The problem is handled here as multi-objective optimization and is solved by the evolutionary method of Differential Evolution (DE). An innovative way of choosing the best individual of the population is proposed in this work: the minimum distance to the empirical ideal point. The idea is to guide the individuals of the population toward areas of the Pareto front that correspond to a good compromise between the bonded and non-bonded energies. The proposed approach is validated on several peptides with promising results.

Paper Nr: 62
Title:

BAYESIAN NETWORK ANALYSIS OF RELATIONSHIPS BETWEEN NUCLEOSOME DYNAMICS AND TRANSCRIPTIONAL REGULATORY FACTORS

Authors:

Bich Hai Ho, Ngoc Tu Le and Tu Bao Ho

Abstract: Intergenic regions are unstable, owing to trans-regulatory factors that regulate chromatin structure. Nucleosome organization at promoters has been shown to exhibit distinct patterns corresponding to the level of gene expression. Post-translational modifications (PTMs) of histone proteins and transcriptional regulators, including chromatin remodeling complexes (CRCs), general transcription factors (GTFs), and RNA polymerase II (PolII), are presumably related to the establishment of such nucleosome dynamics. However, their concrete relationships, especially in gene regulation, remain elusive. We therefore sought to understand the functional linkages among these factors and nucleosome dynamics by deriving a Bayesian network (BN)-based model representing their interactions. Based on the recovered network, learnt from 8 PTMs and 15 transcriptional regulators at 4034 S. cerevisiae promoters, we speculate that nucleosome organization at promoters is intentionally volatile in various regulatory pathways. Notably, interactions of CRCs/GTFs and H3 histone methylation were inferred to co-function with nucleosome dynamics in gene repression and pre-initiation complex (PIC) formation. Our results affirm the hypothesis that extrinsic factors take part in regulating nucleosome dynamics. A more thorough investigation can be made by adding more factors and using our proposed method.

Paper Nr: 67
Title:

UNIVERSAL k-NN (UNN) CLASSIFICATION OF CELL IMAGES USING HISTOGRAMS OF DoG COEFFICIENTS

Authors:

Paolo Piro, Wafa Bel Haj Ali, Lydie Crescence, Omelkheir Ferhat, Jacques Darcourt, Thierry Pourcher and Michel Barlaud

Abstract: Cellular imaging is an emerging technology for studying many biological phenomena. Cellular image analysis generally requires identifying and classifying cells according to their morphological aspect, staining intensity, subcellular localization and other parameters; hence, this task may be very time-consuming and poorly reproducible when carried out by experimenters. In order to overcome such limitations, we propose an automatic segmentation and classification software tool that was tested on cellular images acquired for the analysis of NIS phosphorylation and the identification of NIS-interacting proteins. On the algorithmic side, our method is based on a novel texture-based descriptor that is highly discriminative in representing the main visual features at the subcellular level. These descriptors are then used in a supervised learning framework where the most relevant prototypical samples are used to predict the class of unlabeled cells, using a new methodology we have recently proposed, called UNN, which is grounded in the boosting framework. In order to evaluate the automatic classification performance, we tested our algorithm on a significantly large database of cellular images annotated by an expert of our group. Results are very promising, providing a precision of about 84% on average, thus suggesting our method as a valuable decision-support tool in such cellular imaging applications.

Paper Nr: 69
Title:

COMPUTATION OF THE NORMALIZED COMPRESSION DISTANCE OF DNA SEQUENCES USING A MIXTURE OF FINITE-CONTEXT MODELS

Authors:

Diogo Pratas, Armando J. Pinho and Sara P. Garcia

Abstract: A compression-based similarity measure assesses the similarity between two objects using the number of bits needed to describe one of them when a description of the other is available. To be effective, these measures have to rely on "normal" compression algorithms, roughly meaning that the compressors have to be able to build an internal model of the data being compressed. Often, good "normal" compression methods are slow, and those that are fast do not provide acceptable results. In this paper, we propose a method for measuring the similarity of DNA sequences that balances these two goals. The method relies on a mixture of finite-context models and is compared with other methods, including XM, the state-of-the-art DNA compression technique. Moreover, we present a comprehensive study of the inter-chromosomal similarity of the human genome.
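
For orientation, the Normalized Compression Distance in a minimal sketch, using zlib as a stand-in compressor; the paper's point is precisely that a mixture of finite-context models is a better C(.) for DNA than generic compressors:

```python
# NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
# where C(.) is the compressed size in bytes.
import zlib

def c(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"ACGTACGTACGT" * 50, b"ACGTACGAACGT" * 50))  # similar: small distance
```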

Paper Nr: 73
Title:

COMPUTATIONAL PREDICTIONS FOR THE NUCLEATION MASS AND LAG TIMES INVOLVED IN Aβ42 PEPTIDE AGGREGATION

Authors:

Preetam Ghosh, Bhaswati Datta and Vijayaraghavan Rangachari

Abstract: The aggregates of amyloid-β (Aβ) peptide are the primary neurotoxic species in the brains of Alzheimer’s patients. We study the molecular-level dynamics of this process with chemical kinetic simulations, dissecting the aggregation pathway into pre-nucleation, post-nucleation and protofibril elongation stages. Here, we discuss how our previously identified rate constants for protofibril elongation were incorporated into a simplified simulation of the complete aggregation process, in order to understand the lag times in the sigmoidal growth curves of fibril formation. We also present initial findings on the rate constants, and possible hypotheses on the nucleation mass, involved in the pre-nucleation stage.
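
As a toy illustration of how a nucleation-elongation scheme produces a sigmoidal growth curve with a lag phase, the Python sketch below integrates a minimal Oosawa-type model; the rate constants and nucleus size are invented, not the values estimated in the paper.

    import numpy as np
    from scipy.integrate import odeint

    # toy Oosawa-type model: monomer m, fibril number p, fibril mass f;
    # rate constants and nucleus size n_c are invented for illustration
    kn, ke, n_c = 2e-7, 5e2, 4

    def rhs(state, t):
        m, p, f = state
        nucl = kn * m ** n_c          # nucleation flux
        grow = ke * m * p             # protofibril elongation flux
        return [-n_c * nucl - grow, nucl, n_c * nucl + grow]

    t = np.linspace(0.0, 50.0, 2000)
    sol = odeint(rhs, [10.0, 0.0, 0.0], t)
    frac = sol[:, 2] / 10.0           # fraction of monomer converted to fibril

    # crude lag-time estimate: first time the sigmoid passes 10% conversion
    lag = t[np.argmax(frac > 0.1)]
    print("lag time (arbitrary units): %.2f" % lag)
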
Download

Paper Nr: 78
Title:

MULTI-LEVEL DYNAMIC MODELING IN BIOLOGICAL SYSTEMS - Application of Hybrid Petri Nets to Network Simulation

Authors:

Rafael S. Costa, Daniel Machado, A. R. Neves and Susana Vinga

Abstract: Recent progress in high-throughput experimental technologies allows the reconstruction of many biological networks and the evaluation of changes in protein, gene and metabolite levels under different conditions. On the other hand, computational models, when complemented with regulatory information, can be used to predict the phenotype of an organism under different genetic and environmental conditions. These computational methods can be used, for example, to identify molecular targets capable of inactivating a bacterium and to understand its virulence factors. This work proposes a hybrid metabolic-regulatory Petri net approach based on the combination of approximate enzyme-kinetic rate laws and Petri nets. A prototypic network model is used as a test case to illustrate the application of these concepts in Systems Biology.
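
A minimal Python sketch of the hybrid idea, assuming a single Michaelis-Menten transition and one discrete regulatory switch (all species names and parameters are invented):

    # minimal hybrid Petri net sketch: two continuous places (substrate S,
    # product P), one continuous transition with an approximate
    # Michaelis-Menten rate law, and one discrete regulatory transition that
    # switches the enzyme off when product accumulates
    S, P, enzyme_on = 10.0, 0.0, True
    vmax, km, dt = 1.0, 0.5, 0.01

    for _ in range(5000):
        rate = (vmax * S / (km + S)) if enzyme_on else 0.0
        S -= rate * dt                # continuous firing: token flow S -> P
        P += rate * dt
        if enzyme_on and P > 8.0:     # discrete firing: repression event
            enzyme_on = False

    print("S=%.3f  P=%.3f  enzyme_on=%s" % (S, P, enzyme_on))
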
Download

Paper Nr: 79
Title:

AUTOMATED REGULON CONTENT PREDICTION AND ESTIMATION OF PWM QUALITY

Authors:

Elena Stavrovskaya, Andrey Mironov, Dmitry Rodionov, Inna Dubchak and Pavel Novichkov

Abstract: Identification of genes regulated by the same transcription factor (TF) is a major problem in the analysis of regulation. The key step in detecting a group of co-regulated genes (a regulon) is the prediction of TF binding sites (TFBSs), for which the positional weight matrix (PWM) is the standard tool. The matrix is applied to the upstream region of a gene, and high-scoring sites are considered putative TFBSs. Choosing a threshold for the scoring function is a separate, complicated problem. Usually, the threshold is chosen manually; some methods for automated threshold detection exist, but they are based on selecting thresholds for different objective functions. In this paper, we present an approach for regulon prediction based on a probabilistic method of threshold detection. The optimal probability computed by this method can also be used to estimate the quality of the PWM itself, which is useful when the matrix is the output of a regulatory motif prediction program.
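
A minimal Python sketch of the scoring step: a log-odds PWM scanned along an upstream region, with an empirical threshold taken from the score distribution of a shuffled sequence. The matrix and the percentile cutoff are illustrative; the paper's probabilistic threshold method is more principled.

    import numpy as np

    ALPHABET = "ACGT"

    # toy 4-position log-odds PWM against a uniform background (illustrative)
    pwm = np.log2(np.array([[0.7, 0.1, 0.1, 0.1],   # position 1 prefers A
                            [0.1, 0.1, 0.7, 0.1],   # position 2 prefers G
                            [0.1, 0.7, 0.1, 0.1],   # position 3 prefers C
                            [0.1, 0.1, 0.1, 0.7]])  # position 4 prefers T
                  / 0.25)

    def site_score(site):
        return sum(pwm[i, ALPHABET.index(b)] for i, b in enumerate(site))

    def scan(seq):
        w = pwm.shape[0]
        return [(i, site_score(seq[i:i + w])) for i in range(len(seq) - w + 1)]

    rng = np.random.default_rng(1)
    upstream = "".join(rng.choice(list(ALPHABET), size=300)) + "AGCT"

    # empirical threshold from the score distribution of a shuffled sequence
    background = [s for _, s in scan("".join(rng.permutation(list(upstream))))]
    threshold = np.percentile(background, 99)
    hits = [(i, round(s, 2)) for i, s in scan(upstream) if s >= threshold]
    print("threshold %.2f, putative TFBSs: %s" % (threshold, hits))
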
Download

Paper Nr: 80
Title:

FUZZY CONCEPT LATTICE-BASED APPROACH FOR REACTIVE MOTIFS DISCOVERY

Authors:

Thanapat Kangkachit and Kitsana Waiyamai

Abstract: Reactive motifs are short conserved regions discovered from the binding and catalytic sites of enzyme sequences. Reactive motifs thus carry more biological meaning than statistics-based motifs, because they are extracted directly from where the chemical reaction mechanism occurs. The main difficulty in discovering reactive motifs is that only 4.94% of enzyme sequences contain site information. To overcome this problem, we present a fuzzy concept lattice-based (FCL-based) method for discovering more general reactive motifs by incorporating biochemical knowledge. Fuzzy concept lattices are used to represent both binary and multi-valued biochemical knowledge. The fuzzy concept lattice Join operator is applied to determine complete substitution groups, which yields more general reactive motifs. Experiments compare different methods of determining complete substitution groups: FCL-based, concept lattice-based (CL-based) and similarity-based. The results show that the FCL-based method significantly outperforms the others in terms of coverage and F-measure with an SVM learning algorithm. Fuzzy concept lattices therefore provide more efficient computational support for the complete substitution group operation than existing methods.
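
As a loose illustration of the Join idea, the Python fragment below represents two substitution groups as fuzzy sets over amino acids and forms their common generalization by pointwise minimum; the membership values are invented, and the paper's operator is defined on a full fuzzy concept lattice rather than on bare fuzzy sets.

    # toy sketch: substitution groups as fuzzy sets over amino acids
    group1 = {"L": 1.0, "I": 0.8, "V": 0.6}
    group2 = {"L": 0.9, "I": 0.5, "M": 0.7}

    def fuzzy_join(a, b):
        # shared residues keep the weaker (min) membership degree
        return {aa: min(a[aa], b[aa]) for aa in a.keys() & b.keys()}

    print(fuzzy_join(group1, group2))  # {'L': 0.9, 'I': 0.5}
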
Download

Paper Nr: 82
Title:

A NOVEL ANALYSIS FLOW FOR FUSED TRANSCRIPTS DISCOVERY FROM PAIRED-END RNA-SEQ DATA

Authors:

F. Abate, G. Paciello, A. Acquaviva, E. Ficarra, A. Ferrarini, M. Delledonne and E. Macii

Abstract: Chimeric phenomena have recently been recognized to play a significant role in the investigation and understanding of the fundamental mechanisms behind widespread pathologies such as tumors. In this paper we present a new methodology for the detection of fusion transcripts from Next Generation Sequencing (NGS) data. The methodology exploits short paired-end reads from RNA-Seq experiments to determine a list of fused genes and to identify the exact fusion boundaries, so that the precise chimeric sequence can be analysed. Both known and unknown transcripts are considered, enabling the detection of fusions involving unannotated genes. An automated tool flow that reports a set of candidate fused genes and the associated junctions has been implemented and applied to a publicly available melanoma data set.
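
The paired-end evidence at the heart of such methods can be sketched in a few lines of Python: read pairs whose mates align to different genes are discordant, and recurrent discordant pairs nominate candidate fusions. The gene names below are arbitrary examples; real pipelines then refine the exact junction with reads spanning the fusion boundary.

    from collections import Counter

    # each RNA-Seq read pair is reduced to the genes its two mates align to
    pairs = [("BRAF", "BRAF"), ("GNA11", "CDK4"), ("GNA11", "CDK4"),
             ("GNA11", "CDK4"), ("BRAF", "MITF")]

    # tally discordant pairs (mates in different genes) per gene pair
    support = Counter(tuple(sorted(p)) for p in pairs if p[0] != p[1])
    candidates = [(genes, n) for genes, n in support.items() if n >= 2]
    print(candidates)  # [(('CDK4', 'GNA11'), 3)]
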
Download

Paper Nr: 84
Title:

A NOVEL GAUSSIAN FITTING APPROACH FOR 2D GEL ELECTROPHORESIS SATURATED PROTEIN SPOTS

Authors:

Massimo Natale, Alfonso Caiazzo, Enrico M. Bucci and Elisa Ficarra

Abstract: Analysis of 2D-GE images is a hot topic in bioinformatics research, since currently available commercial and academic software has proven neither fully effective nor completely automatic, often requiring manual revision of spot detection and refinement of computer-generated matches. In this work, we present an effective technique for the detection and reconstruction of over-saturated protein spots. First, it reveals overexposed areas, where spots may be truncated, and plateau regions caused by smeared and overlapping spots. Next, the correct distribution of pixel values in the overexposed areas and plateau regions is recovered by a two-dimensional fit of a generalized Gaussian distribution approximating the spot volume. Correcting pixels according to the generalized Gaussian curve in saturated and smeared spots allows more accurate quantification, providing more reliable image analysis results. As validation, we process a highly exposed 2D-GE image containing saturated spots against the corresponding non-saturated image, confirming that the method can effectively fix the saturated spots and enable correct spot quantification.
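
A minimal Python sketch of the reconstruction step, assuming one particular generalized Gaussian form: the surface is fitted only to non-saturated pixels and then evaluated everywhere to recover the spot volume. The functional form and parameters are illustrative, and the paper's fitting procedure may differ in detail.

    import numpy as np
    from scipy.optimize import curve_fit

    def gen_gauss(coords, amp, x0, y0, sx, sy, beta):
        # 2D generalized Gaussian; beta = 1 recovers an ordinary Gaussian
        x, y = coords
        r2 = ((x - x0) / sx) ** 2 + ((y - y0) / sy) ** 2
        return amp * np.exp(-r2 ** beta)

    yy, xx = np.mgrid[0:41, 0:41]
    true_spot = gen_gauss((xx, yy), 300.0, 20.0, 20.0, 5.0, 6.0, 1.2)
    img = np.minimum(true_spot, 255.0)      # simulate an 8-bit saturated spot

    mask = img < 255.0                      # fit only non-saturated pixels
    popt, _ = curve_fit(gen_gauss, (xx[mask], yy[mask]), img[mask],
                        p0=(200.0, 18.0, 18.0, 4.0, 4.0, 1.0))
    volume = gen_gauss((xx, yy), *popt).sum()
    print("fitted amplitude %.1f, recovered spot volume %.0f" % (popt[0], volume))
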
Download

Paper Nr: 86
Title:

CHARACTERIZATION OF COARSE GRAIN MOLECULAR DYNAMIC SIMULATION PERFORMANCE ON GRAPHIC PROCESSING UNIT ARCHITECTURES

Authors:

Ardita Shkurti, Andrea Acquaviva, Elisa Ficarra, Mario Orsi and Enrico Macii

Abstract: Coarse grain (CG) molecular models have been proposed to simulate complex systems with lower computational overhead and longer accessible timescales than atomistic models. However, their acceleration on parallel architectures such as Graphics Processing Units (GPUs) presents specific challenges that must be carefully evaluated. The objective of this work is to characterize the impact of CG model features on parallel simulation performance. To achieve this, we implemented a GPU-accelerated version of a CG biomembrane simulator called BRAHMS, to which we apply optimizations specific to CG models, such as dedicated data structures for handling the interactions of different bead types. Moreover, we explore different GPU architectures to characterize the behavior of the optimized CG model.
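
To make the bead-type data structure concrete, the NumPy sketch below evaluates pairwise Lennard-Jones energies through per-type parameter lookup tables, the kind of structure a GPU kernel would keep in constant or shared memory; the parameters are invented, and this vectorized CPU version only demonstrates the indexing pattern.

    import numpy as np

    # per-bead-type Lennard-Jones lookup tables (illustrative values)
    eps = np.array([[0.5, 0.3],
                    [0.3, 0.8]])          # eps[type_i, type_j]
    sig = np.array([[1.0, 1.1],
                    [1.1, 1.2]])          # sig[type_i, type_j]

    rng = np.random.default_rng(2)
    pos = rng.random((100, 3)) * 10.0     # bead coordinates
    types = rng.integers(0, 2, size=100)  # bead type of each particle

    diff = pos[:, None, :] - pos[None, :, :]
    r2 = (diff ** 2).sum(-1) + np.eye(100)           # dummy self-distance
    e = eps[types[:, None], types[None, :]]          # pairwise parameter lookup
    s2 = sig[types[:, None], types[None, :]] ** 2 / r2
    energy = 4.0 * e * (s2 ** 6 - s2 ** 3)
    energy[np.eye(100, dtype=bool)] = 0.0            # drop self-interactions
    print("total pair energy: %.2f" % (energy.sum() / 2.0))
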
Download

Paper Nr: 91
Title:

PUBMED DATASET: A JAVA LIBRARY FOR AUTOMATIC CONSTRUCTION OF EVALUATION DATASETS

Authors:

Kirill Lassounski, Sahudy Montenegro González, Annabell del Real Tamariz and Gabriel Lima de Oliveira

Abstract: The NCBI (National Center for Biotechnology Information) provides information about genes, proteins, scientific literature and molecular structures, among other resources related to biomedicine. The NCBI maintains a database called PubMed that stores about 21 million scientific articles. Many research efforts in the information retrieval field need to automatically obtain data from PubMed for evaluation and testing. This work describes a Java library for constructing such datasets, so that researchers can evaluate their results easily and quickly. Users set input and output parameters, such as article attributes (title, abstract, keywords, etc.), that shape the dataset, which is produced as a serializable file. PubMed Dataset grew out of the authors’ need to build their own datasets to evaluate their system’s results. We also present the BioSearch Refinement system as a case study; it uses the library to construct the datasets employed to evaluate its algorithm for automatic keyphrase extraction. Finally, we discuss the benefits obtained from using PubMed Dataset.
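
The retrieval flow underneath such a library can be sketched against NCBI's public E-utilities endpoints. The paper's library is Java, so the Python fragment below only illustrates the esearch/efetch steps, not the library's own API.

    import json
    from urllib.parse import urlencode
    from urllib.request import urlopen

    BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

    def search(term, retmax=5):
        # esearch: turn a query into a list of PubMed identifiers
        q = urlencode({"db": "pubmed", "term": term,
                       "retmax": retmax, "retmode": "json"})
        with urlopen(BASE + "esearch.fcgi?" + q) as r:
            return json.load(r)["esearchresult"]["idlist"]

    def fetch_abstracts(pmids):
        # efetch: retrieve title/abstract text for the selected articles
        q = urlencode({"db": "pubmed", "id": ",".join(pmids),
                       "rettype": "abstract", "retmode": "text"})
        with urlopen(BASE + "efetch.fcgi?" + q) as r:
            return r.read().decode()

    pmids = search("automatic keyphrase extraction")
    print(fetch_abstracts(pmids)[:500])
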
Download

Paper Nr: 92
Title:

AUTOMATED DETECTION OF INTERPHASE AND METAPHASE NUCLEI IN THE FISH IMAGES

Authors:

Jan Schier, Bohumil Kovár and Eduard Kocárek

Abstract: Fluorescence in-situ hybridization (FISH) is among the most common cytogenetic methods and is widely applied in routine clinical genetic diagnostics. We focus on FISH analysis of chromosomal aneuploidies, i.e., deviations from the normal chromosome number. Such analysis is based on the evaluation of up to several hundred microscopic images. Computer support for this process draws on methods from image processing and data mining. In this paper, we focus on the image processing part in more detail: first, the properties of FISH images are reviewed; then, the processing flow is outlined. Our aim is to find the interphase and metaphase nuclei and the hybridization signals contained in the image. We propose a simple method that uses the raw and central moments of detected objects as measures to distinguish between the two types of nuclei.
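
A toy Python sketch of the moment-based cue: the area-normalized second central moment is small for a compact interphase nucleus and large for the scattered material of a metaphase figure. The synthetic shapes and this single feature are illustrative, not the paper's full feature set.

    import numpy as np

    def spread(mask):
        # area-normalized second central moment of a binary object
        ys, xs = np.nonzero(mask)
        mu = ((xs - xs.mean()) ** 2 + (ys - ys.mean()) ** 2).sum()
        return mu / len(xs) ** 2

    yy, xx = np.mgrid[0:100, 0:100]
    interphase = (yy - 50) ** 2 + (xx - 50) ** 2 < 15 ** 2   # one filled disc
    metaphase = np.zeros_like(interphase)
    for ang in np.linspace(0, 2 * np.pi, 12, endpoint=False):
        cy, cx = 50 + 30 * np.sin(ang), 50 + 30 * np.cos(ang)
        metaphase |= (yy - cy) ** 2 + (xx - cx) ** 2 < 3 ** 2  # ring of blobs

    print("interphase spread: %.3f" % spread(interphase))  # ~0.16
    print("metaphase spread:  %.3f" % spread(metaphase))   # much larger
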
Download

Paper Nr: 93
Title:

APPEARANCE BASED DISEASE RECOGNITION OF HUMAN BRAINS

Authors:

Gopi Chand Nutakki, Leyla Zhuhadar and Robert Wyatt

Abstract: The functioning of the human brain remains a great wonder. Numerous diseases develop in different regions of the brain, impairing various functions of the human body. Manual detection of brain diseases is becoming a bottleneck, given the high throughput and complexity of brain images. Automatic recognition based on the appearance of cross-sectional brain images is therefore an increasingly desirable scheme. This problem, however, is very challenging due to severe variations in illumination. In this research, we propose an appearance-based recognition method using orientation histograms. Furthermore, we examine the use of Principal Component Analysis to reduce the dimensionality of the low-level features, aiming to accelerate recognition. In experiments on the Harvard Whole Brain Atlas images (The whole brain atlas), we show the promise of the proposed method, observing a high classification accuracy when using the orientation histogram.
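
A minimal Python sketch of the two ingredients, assuming magnitude-weighted gradient-orientation histograms and PCA via SVD; random images stand in for the atlas slices.

    import numpy as np

    def orientation_histogram(img, bins=36):
        # gradient-orientation histogram, weighted by gradient magnitude
        gy, gx = np.gradient(img.astype(float))
        mag = np.hypot(gx, gy)
        ang = np.arctan2(gy, gx)
        h, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
        return h / (h.sum() + 1e-12)

    rng = np.random.default_rng(0)
    X = np.stack([orientation_histogram(rng.random((128, 128)))
                  for _ in range(50)])
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    reduced = Xc @ Vt[:10].T            # project onto 10 leading components
    print("descriptor matrix:", X.shape, "-> reduced:", reduced.shape)
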

Paper Nr: 101
Title:

LANDSCAPING THE FRAMEWORK OF BIO RESEARCH PROJECT - Generation of the 3D Atlas for Drug Target Discovery

Authors:

Byung-Cheol Kim and Sunghoon Kim

Abstract: As pharmacological research organizations grow larger, it becomes ever more critical to grasp the entire R&D scene as a vivid image, because the lack of such an image makes it hard to understand a project as a whole and thus to make high-level decisions. We provide a prototype of this kind, so that all the researchers on a project can intuitively share the current situation, recognize bottlenecks, and collaborate with one another to resolve problems. Our approach is to exploit the faculties of human vision, especially spatial perception and memory, by defining the axes of the information space to be commensurable and interpolatable, shaping the process and data along those axes, and providing referential cues in the space.
Download

Paper Nr: 102
Title:

HAPLOTYPE-BASED CLASSIFIERS TO PREDICT INDIVIDUAL SUSCEPTIBILITY TO COMPLEX DISEASES - An Example for Multiple Sclerosis

Authors:

María M. Abad-Grau, Nuria Medina-Medina, Andrés Masegosa and Serafín Moral

Abstract: The enormous amount of genetic data currently being produced by the explosion of genome-wide association studies has spurred a major effort to construct genetic predictive models of individual susceptibility to complex diseases. However, most of these models show consistently low accuracy. We hypothesize that a main cause of this low accuracy is the strong reduction of the genetic information considered by the classifiers, and propose a three-fold solution: haplotype data instead of individual genotype data, whole-genome markers instead of a more stringent selection, and risk variants spanning several markers instead of only one or two. We compared the performance of our approach with current approaches for predicting individual genetic risk of multiple sclerosis, and found that our method yielded significantly more accurate classifiers.
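
As a loose illustration of haplotype-based features, the toy Python fragment below encodes the presence of a hypothetical several-marker risk haplotype and trains a logistic regression; every value here is synthetic, and the paper's classifiers operate on whole-genome marker panels.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)
    # 4-marker haplotypes as binary allele strings (synthetic)
    haplotypes = ["".join(rng.choice(list("01"), size=4)) for _ in range(400)]
    risk = "1101"                     # hypothetical several-marker risk variant

    # features: full match and partial match against the risk haplotype
    X = np.array([[h == risk, h[:2] == risk[:2]] for h in haplotypes], float)
    y = (X[:, 0] + 0.2 * rng.standard_normal(400) > 0.5).astype(int)

    clf = LogisticRegression().fit(X[:300], y[:300])
    print("toy held-out accuracy:", clf.score(X[300:], y[300:]))
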
Download