Here we present CaosDB, a Research Data Management System (RDMS) designed to ensure seamless integration of inhomogeneous data sources and repositories of legacy data. Its primary purpose is the management of data from biomedical sciences, both from simulations and experiments, during the complete research data lifecycle. An RDMS for this domain faces particular challenges: research data arise in huge amounts, from a wide variety of sources, and traverse a highly branched path of further processing. To be accepted by its users, an RDMS must be built around the workflows and practices of scientists and thus support changes in workflow and data structure. Nevertheless, it should encourage and support the development and observation of standards and furthermore facilitate the automation of data acquisition and processing with specialized software. The storage data model of an RDMS must reflect these complexities with appropriate semantics and ontologies while offering simple methods for finding, retrieving, and understanding relevant data. We show how CaosDB responds to these challenges and give an overview of the CaosDB Server, its data model and its easy-to-learn CaosDB Query Language. We briefly discuss the status of the implementation, how we currently use CaosDB, and how we plan to use and extend it.
Determining whether two given questions are semantically similar is a fairly challenging task given the different structures and forms that the questions can take. In this paper, we use Gated Recurrent Units (GRU) in combination with other widely used machine learning algorithms like Random Forest, AdaBoost and SVM for the similarity prediction task on a dataset released by Quora, consisting of about 400k labeled question pairs. We got the best result by using the Siamese adaptation of a Bidirectional GRU with a Random Forest classifier, which landed us among the top 24% in the competition Quora Question Pairs hosted on Kaggle.
Value aggregation is a general framework for solving imitation learning problems. Based on the idea of data aggregation, it generates a policy sequence by iteratively interleaving policy optimization and evaluation in an online learning setting. While the existence of a good policy in the policy sequence can be guaranteed non-asymptotically, little is known about the convergence of the sequence or the performance of the last policy. In this paper, we debunk the common belief that value aggregation always produces a convergent policy sequence with improving performance. Moreover, we identify a critical stability condition for convergence and provide a tight non-asymptotic bound on the performance of the last policy. These new theoretical insights let us stabilize problems with regularization, which removes the inconvenient process of identifying the best policy in the policy sequence in stochastic problems.
The central aim of this paper is to address variable selection questions in nonlinear and nonparametric regression. Motivated within the context of statistical genetics, where nonlinear interactions are of particular interest, we introduce a novel and interpretable way to summarize the relative importance of predictor variables. Methodologically, we develop the ‘RelATive cEntrality’ (RATE) measure to prioritize candidate predictors that are not just marginally important, but whose associations also stem from significant covarying relationships with other variables in the data. We focus on illustrating RATE through Bayesian Gaussian process regression, although the methodological innovations apply to other, more general methods. It is known that nonlinear models often exhibit greater predictive accuracy than linear models, particularly for outcomes generated by complex architectures. With detailed simulations and a botanical QTL mapping study, we show that applying RATE enables an explanation for this improved performance.
Limitations of the CAP theorem imply that if availability is desired in the presence of network partitions, one must sacrifice sequential consistency, a consistency model that is more natural for system design. We focus on the problem of what a designer should do if she has an algorithm that works correctly with sequential consistency but is faced with an underlying key-value store that provides a weaker (e.g., eventual or causal) consistency. We propose a detect-rollback based approach: the designer identifies a correctness predicate, say P, and continues to run the protocol, as our system monitors P. If P is violated (because the underlying key-value store provides a weaker consistency), the system rolls back and resumes the computation at a state where P holds. We evaluate this approach in the Voldemort key-value store. Our experiments with a deployment of Voldemort on Amazon AWS show that using eventual consistency with monitoring can provide a 20–40% increase in throughput when compared with sequential consistency. We also show that the overhead of the monitor itself is small (typically less than 8%) and the latency of detecting violations is very low. For example, more than 99.9% of violations are detected in less than 1 second.
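The detect-rollback idea can be sketched in a few lines of Python. This is a minimal, synchronous toy under stated assumptions: an in-memory dictionary stands in for the key-value store, and the `MonitoredStore` class and the `at_most_one_holder` predicate are hypothetical names introduced only for illustration. In the approach described above the monitor runs alongside the protocol and detection may lag behind the violation, which this sketch does not model.

```python
import copy


class MonitoredStore:
    """Toy key-value store that checks a predicate P after every write.

    Hypothetical illustration only; this is not Voldemort's API.
    """

    def __init__(self, predicate):
        self.predicate = predicate  # correctness predicate P over the state
        self.data = {}              # current (possibly inconsistent) state
        self.checkpoint = {}        # last state in which P was known to hold
        self.rollbacks = 0          # number of detected violations

    def put(self, key, value):
        """Apply a write; roll back to the checkpoint if P is violated."""
        self.data[key] = value
        if self.predicate(self.data):
            # P holds: remember this state as a safe point to resume from.
            self.checkpoint = copy.deepcopy(self.data)
            return True
        # P is violated: restore the last state where P held.
        self.data = copy.deepcopy(self.checkpoint)
        self.rollbacks += 1
        return False


def at_most_one_holder(state):
    """Example predicate P: at most one key may be marked as 'held'."""
    return sum(1 for value in state.values() if value == "held") <= 1


if __name__ == "__main__":
    store = MonitoredStore(at_most_one_holder)
    print(store.put("lock:a", "held"))  # write accepted
    print(store.put("lock:b", "held"))  # violates P, so it is rolled back
    print(store.data, store.rollbacks)
```

Keeping a full copy of the state as the checkpoint is the simplest possible choice; a real system would more plausibly log changes or checkpoint periodically.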
From longitudinal biomedical studies to social networks, graphs have emerged as a powerful framework for describing evolving interactions between agents in complex systems. In such studies, the data typically consists of a set of graphs representing a system’s state at different points in time or space. The analysis of the system’s dynamics depends on the selection of the appropriate tools. In particular, after specifying properties characterizing similarities between states, a critical step lies in the choice of a distance capable of reflecting such similarities. While the literature offers a number of distances that one could a priori choose from, their properties have been little investigated and no guidelines regarding the choice of such a distance have yet been provided. However, these distances’ sensitivity to perturbations in the network’s structure and their ability to identify important changes are crucial to the analysis, making the selection of an adequate metric a decisive, yet delicate, practical matter. In the spirit of Goldenberg, Zheng and Fienberg’s seminal 2009 review, the purpose of this article is to provide an overview of commonly used graph distances and an explicit characterization of the structural changes that they are best able to capture. To see how this translates to real-life situations, we use as a guiding thread for our discussion the application of these distances to the analysis of a longitudinal microbiome study, as well as to synthetic examples. Having unveiled some of the traditional distances’ shortcomings, we also suggest alternative similarity metrics and highlight their relative advantages in specific analysis scenarios. Above all, we provide some guidance for choosing one distance over another in certain types of applications. Finally, we show an application of these different distances to a network created from worldwide recipes.
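To make the notion of a distance between graph states concrete, the sketch below computes two simple structural distances on edge sets. The abstract does not name specific distances, so the choice of the Hamming and Jaccard distances here is an assumption made for illustration, as are the function names `undirected`, `hamming_distance` and `jaccard_distance`.

```python
def undirected(edges):
    """Normalize an iterable of (u, v) pairs into a set of undirected edges."""
    return {frozenset(edge) for edge in edges}


def hamming_distance(edges_a, edges_b):
    """Number of edges present in exactly one of the two graphs."""
    return len(edges_a ^ edges_b)


def jaccard_distance(edges_a, edges_b):
    """One minus the ratio of shared edges to edges present in either graph."""
    union = edges_a | edges_b
    if not union:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - len(edges_a & edges_b) / len(union)


if __name__ == "__main__":
    # Two snapshots of the same four-node system at different time points.
    g_t0 = undirected([(1, 2), (2, 3), (3, 4)])
    g_t1 = undirected([(2, 1), (2, 3), (4, 1)])
    print(hamming_distance(g_t0, g_t1))  # 2: edge 3-4 removed, edge 1-4 added
    print(jaccard_distance(g_t0, g_t1))  # 0.5: 2 shared edges out of 4 in total
```

Both distances only count edge differences and ignore where in the graph the changes occur, which is the kind of property one would want to know before choosing a distance.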
The recent success of Deep Neural Networks (DNNs) has drastically improved the state of the art for many application domains. While achieving high accuracy, state-of-the-art DNNs are challenging to deploy since they typically require billions of expensive arithmetic computations. In addition, DNNs are typically deployed in ensembles to boost accuracy, which further exacerbates the system requirements. This computational overhead is an issue for many platforms, e.g. data centers and embedded systems, with tight latency and energy budgets. In this article, we introduce a flexible DNN ensemble processing technique, which achieves a large reduction in average inference latency while incurring a small to negligible accuracy drop. Our technique is flexible in that it allows for dynamic adaptation between quality of results (QoR) and execution runtime. We demonstrate the effectiveness of the technique on AlexNet and ResNet-50 using the ImageNet dataset. This technique can also easily handle other types of networks.
Many state-of-the-art computer vision algorithms use large-scale convolutional neural networks (CNNs) as basic building blocks. These CNNs are known for their huge number of parameters, high redundancy in weights, and tremendous consumption of computing resources. This paper presents a learning algorithm to simplify and speed up these CNNs. Specifically, we introduce a ‘try-and-learn’ algorithm to train pruning agents that remove unnecessary CNN filters in a data-driven way. With the help of a novel reward function, our agents remove a significant number of filters in CNNs while maintaining performance at a desired level. Moreover, this method provides easy control of the tradeoff between network performance and its scale. Performance of our algorithm is validated with comprehensive pruning experiments on several popular CNNs for visual recognition and semantic segmentation tasks.
Recent advances show that two-dimensional linear discriminant analysis (2DLDA) is a successful matrix-based dimensionality reduction method. However, 2DLDA may in theory encounter the singularity issue, and it is sensitive to outliers. In this paper, a generalized Lp-norm 2DLDA framework with regularization for an arbitrary $p>0$ is proposed, named G2DLDA. G2DLDA makes two main contributions. First, the G2DLDA model uses an arbitrary Lp-norm to measure the between-class and within-class scatter, and hence a proper $p$ can be selected to achieve robustness. Second, by introducing an extra regularization term, G2DLDA achieves better generalization performance and solves the singularity problem. In addition, G2DLDA can be solved through a series of convex problems with equality constraints, and it has a closed-form solution for each single problem. Its convergence can be guaranteed theoretically when $1\leq p\leq2$. Preliminary experimental results on three contaminated human face databases show the effectiveness of the proposed G2DLDA.
We propose a curiosity reward based on information theory principles and consistent with the animal instinct to maintain certain critical parameters within a bounded range. Our experimental validation shows the added value of the additional homeostatic drive to enhance the overall information gain of a reinforcement learning agent interacting with a complex environment using continuous actions. Our method builds upon two ideas: i) To take advantage of a new Bellman-like equation of information gain and ii) to simplify the computation of the local rewards by avoiding the approximation of complex distributions over continuous states and actions.
We consider a sliding window over a stream of characters from some finite alphabet. The user wants to perform deterministic substring matching on the current sliding window content and obtain positions of the matches. We present an indexed version of the sliding window based on a suffix tree. The data structure has optimal time queries $\Theta(m+occ)$ and amortized constant time updates, where $m$ is the length of the query string and $occ$ the number of occurrences.
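The query interface described above can be illustrated with a deliberately naive baseline. This sketch is not the suffix-tree index from the abstract: it rescans the window on every query, so it does not achieve the stated $\Theta(m+occ)$ query time. The class name `SlidingWindow`, its methods, and the choice to report positions as absolute stream offsets are assumptions made for illustration.

```python
from collections import deque


class SlidingWindow:
    """Fixed-size window over a character stream with naive substring search."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # oldest characters drop off the left
        self.start = 0                    # stream position of the window's first character

    def append(self, ch):
        """Consume one character from the stream in constant time."""
        if len(self.window) == self.window.maxlen:
            self.start += 1  # the oldest character is about to be evicted
        self.window.append(ch)

    def find_all(self, pattern):
        """Return stream positions of all (possibly overlapping) matches."""
        if not pattern:
            return []
        text = "".join(self.window)
        positions = []
        i = text.find(pattern)
        while i != -1:
            positions.append(self.start + i)
            i = text.find(pattern, i + 1)
        return positions


if __name__ == "__main__":
    w = SlidingWindow(5)
    for ch in "abracadabra":
        w.append(ch)
    # The window now holds "dabra", which starts at stream position 6.
    print(w.find_all("a"))     # [7, 10]
    print(w.find_all("abra"))  # [7]
```

A baseline like this is mainly useful as a reference to test an indexed implementation against, since both should return the same positions.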
Second-order pooling, a.k.a. bilinear pooling, has proven effective for visual recognition. The recent progress in this area has focused on either designing normalization techniques for second-order models, or compressing the second-order representations. However, these two directions have typically been followed separately, and without any clear statistical motivation. Here, by contrast, we introduce a statistically-motivated framework that jointly tackles normalization and compression of second-order representations. To this end, we design a parametric vectorization layer, which maps a covariance matrix, known to follow a Wishart distribution, to a vector whose elements can be shown to follow a Chi-square distribution. We then propose to make use of a square-root normalization, which makes the distribution of the resulting representation converge to a Gaussian, thus complying with the standard machine learning assumption. As evidenced by our experiments, this lets us outperform the state-of-the-art second-order models on several benchmark recognition datasets.
Clustering is a fundamental machine learning method. The quality of its results depends on the data distribution. For this reason, deep neural networks can be used for learning better representations of the data. In this paper, we propose a systematic taxonomy for clustering with deep learning, in addition to a review of methods from the field. Based on our taxonomy, creating new methods is more straightforward. We also propose a new approach which is built on the taxonomy and overcomes some limitations of previous work. Our experimental evaluation on image datasets shows that the method approaches state-of-the-art clustering quality, and performs better in some cases.
Deep neural networks (DNNs) are powerful machine learning models and have succeeded in various artificial intelligence tasks. Although various architectures and modules for DNNs have been proposed, selecting and designing the appropriate network structure for a target problem is a challenging task. In this paper, we propose a method to simultaneously optimize the network structure and weight parameters during neural network training. We consider a probability distribution that generates network structures, and optimize the parameters of the distribution instead of directly optimizing the network structure. The proposed method can be applied to various network structure optimization problems under the same framework. We apply the proposed method to several structure optimization problems such as selection of layers, selection of unit types, and selection of connections using the MNIST, CIFAR-10, and CIFAR-100 datasets. The experimental results show that the proposed method can find appropriate and competitive network structures.
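The idea of optimizing a distribution over structures instead of the structure itself can be sketched on a toy problem. Everything specific here is an assumption for illustration: structures are binary masks drawn from independent Bernoulli distributions with parameters `theta`, the loss is a fixed toy function with no network weights, and the gradient of the expected loss is computed exactly by enumerating all masks using the log-derivative identity $\nabla_\theta \mathbb{E}[L(M)] = \mathbb{E}[L(M)\,\nabla_\theta \ln p_\theta(M)]$. The abstract does not state which distribution or update rule the authors use, and a practical method would estimate this gradient from samples while also training the weights.

```python
from itertools import product


def expected_loss_gradient(theta, loss):
    """Exact gradient of E[loss(M)] with respect to theta.

    M is a binary mask with independent bits, P(M_i = 1) = theta_i.
    Uses d/dtheta_i ln p(M) = (M_i - theta_i) / (theta_i * (1 - theta_i)).
    """
    grad = [0.0] * len(theta)
    for mask in product((0, 1), repeat=len(theta)):
        prob = 1.0
        for bit, t in zip(mask, theta):
            prob *= t if bit else 1.0 - t
        value = loss(mask)
        for i, (bit, t) in enumerate(zip(mask, theta)):
            grad[i] += prob * value * (bit - t) / (t * (1.0 - t))
    return grad


def optimize_structure(loss, dim, steps=20, lr=0.1, eps=0.05):
    """Gradient descent on theta, clipped to [eps, 1 - eps] to stay well defined."""
    theta = [0.5] * dim
    for _ in range(steps):
        grad = expected_loss_gradient(theta, loss)
        theta = [min(1.0 - eps, max(eps, t - lr * g)) for t, g in zip(theta, grad)]
    return theta


# Hypothetical toy objective: the best structure is the mask TARGET.
TARGET = (1, 0, 1, 1)


def toy_loss(mask):
    """Number of positions where the sampled mask differs from TARGET."""
    return sum(a != b for a, b in zip(mask, TARGET))


if __name__ == "__main__":
    theta = optimize_structure(toy_loss, len(TARGET))
    print([round(t, 2) for t in theta])
```

After optimization, the entries of `theta` are pushed toward the clipping bounds, so rounding them recovers the target mask; the exact enumeration is only feasible because the toy problem has four bits.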
This paper develops a data-driven inverse reinforcement learning technique for a class of linear systems to estimate the cost function of an agent online, using input-output measurements. A simultaneous state and parameter estimator is utilized to facilitate output-feedback inverse reinforcement learning, and cost function estimation is achieved up to multiplication by a constant.
A wide variety of optimization techniques, both exact and heuristic, tend to be biased samplers. This means that when attempting to find multiple uncorrelated solutions of a degenerate Boolean optimization problem, a subset of the solution space tends to be favored while, in the worst case, some solutions can never be accessed by the algorithm used. Here we present a simple post-processing technique that improves sampling for any optimization approach, either quantum or classical. More precisely, starting from a pool of a few optimal configurations, the algorithm generates potentially new solutions via rejection-free cluster updates at zero temperature. Although the method is not ergodic and there is no guarantee that all the solutions can be found, fair sampling is typically improved. We illustrate the effectiveness of our method by improving the exponentially biased data produced by the D-Wave 2X quantum annealer [Phys. Rev. Lett. 118, 07052 (2017)], as well as data from three-dimensional Ising spin glasses. As part of the study, we also show that sampling is improved when sub-optimal states are included and discuss sampling at a finite fixed temperature.
Probabilistic Boolean Networks (PBNs) were previously proposed to gain insight into complex dynamical systems. However, identifying large networks and the underlying discrete Markov chain that describes their temporal evolution remains a challenge. In this paper, we introduce an equivalent representation for the PBN, the Stochastic Conjunctive Normal Form (SCNF), which paves the way to a scalable learning algorithm and helps predict the long-run dynamic behavior of large-scale systems. Moreover, SCNF allows efficient sampling to statistically infer multi-step transition probabilities, which can provide knowledge of the activity levels of individual nodes in the long run.
