A new feature selection method based on kernelized fuzzy rough sets (KFRS) and the memetic algorithm (MA) is proposed for transient stability assessment of power systems. Considering the possible real-time information provided by wide-area measurement systems, a group of system-level classification features are extracted from the power system operation parameters to build the original feature set. By defining a KFRS-based generalized classification function as the separability criterion, the memetic algorithm based on binary differential evolution (BDE) and Tabu search (TS) is employed to obtain the optimal feature subsets with the maximized classification capability. The proposed method may avoid the information loss caused by the feature discretization process of the rough-set based attribute selection, and comprehensively utilize the advantages of BDE and TS to improve the solution quality and search efficiency. The effectiveness of the proposed method is validated by the application results on the New England 39-bus power system and the southern power system of Hebei province.
We propose a generic and interpretable learning framework for building robust text classification models that achieve accuracy comparable to full models under test-time budget constraints. Our approach learns a selector that identifies words relevant to the prediction task and passes them to the classifier for processing. The selector is trained jointly with the classifier and thus learns directly to cooperate with it. We further propose a data aggregation scheme to improve the robustness of the classifier. Our learning framework is general and can be combined with any type of text classification model. On real-world data, we show that the proposed approach improves the performance of a given classifier and speeds up the model with only a minor loss in accuracy.
To address the limitations of Hadoop in scalability, resource sharing, and application support, the open-source community proposed the next generation of Hadoop’s compute platform, called Yet Another Resource Negotiator (YARN), which separates resource management functions from the programming model. This separation enables various application types to run on YARN in parallel. To achieve fair resource sharing and high resource utilization, YARN provides the capacity scheduler and the fair scheduler. However, the performance impact of the two schedulers is not clear when mixed applications run on a YARN cluster. Therefore, in this paper, we study four scheduling-policy combinations (SPCs for short) derived from the two schedulers and evaluate them in extensive scenarios that consider not only four application types but also three different queue structures for organizing applications. The experimental results enable YARN managers to understand how different SPCs and queue structures affect mixed applications. The results also help them select a proper SPC and an appropriate queue structure to achieve better application execution performance.
Over the past decades, researchers and ML practitioners have come up with better and better ways to build, understand and improve the quality of ML models, but mostly under the key assumption that the training data is distributed identically to the testing data. In many real-world applications, however, some potential training examples are unknown to the modeler, due to sample selection bias or, more generally, covariate shift, i.e., a distribution shift between the training and deployment stage. The resulting discrepancy between training and testing distributions leads to poor generalization performance of the ML model and hence biased predictions. We provide novel algorithms that estimate the number and properties of these unknown training examples—unknown unknowns. This information can then be used to correct the training set, prior to seeing any test data. The key idea is to combine species-estimation techniques with data-driven methods for estimating the feature values for the unknown unknowns. Experiments on a variety of ML models and datasets indicate that taking the unknown examples into account can yield a more robust ML model that generalizes better.
DenseNets have been shown to be a competitive model among recent convolutional network architectures. These networks utilize Dense Blocks, which are groups of densely connected layers where the output of a hidden layer is fed in as the input of every other layer following it. In this paper, we aim to improve certain aspects of DenseNet, especially when it comes to practicality. We introduce ParaNet, a new architecture that constructs three pipelines which allow for early inference. We additionally introduce a cascading mechanism such that different pipelines are able to share parameters, as well as logit matching between the outputs of the pipelines. We separately evaluate each of the newly introduced mechanisms of ParaNet, then evaluate our proposed architecture on CIFAR-100.
Clustering is an essential data mining tool that aims to discover inherent cluster structure in data. For most applications, applying clustering is only appropriate when cluster structure is present. As such, the study of clusterability, which evaluates whether data possesses such structure, is an integral part of cluster analysis. However, methods for evaluating clusterability vary radically, making it challenging to select a suitable measure. In this paper, we perform an extensive comparison of measures of clusterability and provide guidelines that clustering users can reference to select suitable measures for their applications.
Various forums and question answering (Q&A) sites are available online that allow Ubuntu users to find results similar to their queries. However, searching for a result is often time-consuming, as it requires the user to find a specific problem instance relevant to their query in a large set of questions. In this paper, we present an automated question answering system for Ubuntu users called Dr. Tux that is designed to answer users’ queries by selecting the most similar question from an online database. The prototype was implemented in Python and uses the NLTK and CoreNLP tools for natural language processing. The data for the prototype was taken from the AskUbuntu website, which contains about 150k questions. The results obtained from the manual evaluation of the prototype were promising while also presenting some interesting opportunities for improvement.
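The retrieval step described in this abstract, selecting the stored question most similar to a query, can be illustrated with a minimal sketch. The abstract does not state which similarity measure Dr. Tux uses, so the Jaccard overlap of lowercased tokens below is purely an assumption, as are the function names and the toy questions.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_question(query, questions):
    """Return the stored question with the highest token overlap.

    Hypothetical helper: the similarity measure is an assumption,
    not the one used by the system in the abstract.
    """
    query_tokens = tokenize(query)
    return max(questions, key=lambda q: jaccard(query_tokens, tokenize(q)))


# Toy stand-in for a question database.
questions = [
    "How do I install a package with apt",
    "How to change the desktop wallpaper",
    "Wifi not working after upgrade",
]
best = most_similar_question("wifi stopped working after the upgrade", questions)
```

A real system would likely add stemming, stop-word removal, or learned embeddings on top of such a baseline.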
Embedding-based methods for knowledge base completion (KBC) learn representations of entities and relations in a vector space, along with a scoring function that estimates the likelihood of relations between entities. The learnable class of scoring functions is designed to be expressive enough to cover a variety of real-world relations, but this expressiveness comes at the cost of an increased number of parameters. In particular, parameters in these methods are superfluous for relations that are either symmetric or antisymmetric. To mitigate this problem, we propose a new L1 regularizer for Complex Embeddings, which is one of the state-of-the-art embedding-based methods for KBC. This regularizer promotes symmetry or antisymmetry of the scoring function on a relation-by-relation basis, in accordance with the observed data. Our empirical evaluation shows that the proposed method outperforms the original Complex Embeddings and other baseline methods on the FB15k dataset.
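To see why some parameters are superfluous for symmetric or antisymmetric relations, recall that Complex Embeddings score a triple as the real part of a trilinear product of complex vectors. The sketch below uses plain Python complex numbers with arbitrary toy values (all names are illustrative) and shows only the property that a purely real relation vector gives a symmetric score while a purely imaginary one gives an antisymmetric score. It does not reproduce the regularizer proposed in the abstract.

```python
def complex_score(w, s, o):
    """ComplEx-style score: Re(sum_k w_k * s_k * conj(o_k))."""
    return sum(wk * sk * ok.conjugate() for wk, sk, ok in zip(w, s, o)).real


# Toy subject and object embeddings (arbitrary values).
s = [1 + 2j, -0.5 + 1j]
o = [0.3 - 1j, 2 + 0.5j]

w_sym = [1.5 + 0j, -0.7 + 0j]   # purely real relation vector
w_anti = [0 + 1.5j, 0 - 0.7j]   # purely imaginary relation vector

# Swapping subject and object leaves the score unchanged for w_sym ...
sym_gap = complex_score(w_sym, s, o) - complex_score(w_sym, o, s)
# ... and flips its sign for w_anti.
anti_sum = complex_score(w_anti, s, o) + complex_score(w_anti, o, s)
```

Under this view, a penalty that pushes either the real or the imaginary part of each relation coordinate toward zero would steer a relation toward one of the two regimes.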
Training a good deep learning model often requires a lot of annotated data. As a large amount of labeled data is typically difficult to collect and even more difficult to annotate, data augmentation and data generation are widely used in the process of training deep neural networks. However, there is no clear common understanding on how much labeled data is needed to get satisfactory performance. In this paper, we try to address such a question using vehicle license plate character recognition as an example application. We apply computer graphic scripts and Generative Adversarial Networks to generate and augment a large number of annotated, synthesized license plate images with realistic colors, fonts, and character composition from a small number of real, manually labeled license plate images. Generated and augmented data are mixed and used as training data for the license plate recognition network modified from DenseNet. The experimental results show that the model trained from the generated mixed training data has good generalization ability, and the proposed approach achieves a new state-of-the-art accuracy on Dataset-1 and AOLP, even with a very limited number of original real license plates. In addition, the accuracy improvement caused by data generation becomes more significant when the number of labeled images is reduced. Data augmentation also plays a more significant role when the number of labeled images is increased.
Feature selection methods are widely used to address the ‘curse of dimensionality’ problem. Many proposed feature selection frameworks treat all data points equally, neglecting their different representation power and importance. In this paper, we propose an unsupervised hypergraph feature selection method that uses a novel point-weighting framework and low-rank representation to capture the importance of different data points. We introduce a novel soft hypergraph with low complexity to model the data. We then formulate feature selection as an optimization problem that preserves both the local relationships and the global structure of the data. Our approach to global structure preservation helps the framework overcome the unavailability of data labels in unsupervised learning. The proposed method treats data points differently according to their importance in defining the data structure and their representation power. Moreover, since the robustness of feature selection methods against noise and outliers is of great importance, we adopt low-rank representation in our model. We also provide an efficient algorithm to solve the proposed optimization problem. The computational cost of the proposed algorithm is lower than that of many state-of-the-art methods, which is of high importance in feature selection tasks. We conducted comprehensive experiments with various evaluation methods on different benchmark data sets. These experiments indicate significant improvement compared with state-of-the-art feature selection methods.
In this paper, we investigate how the Spark platform can be leveraged to efficiently process provenance queries on large volumes of workflow provenance data. We focus on processing provenance queries at the attribute-value level, which is the finest granularity available. We propose a novel weakly connected component based framework that is carefully engineered to quickly determine a minimal volume of data containing the entire lineage of the queried attribute-value. This minimal volume of data is then processed to determine the provenance of the queried attribute-value. The proposed framework computes weakly connected components on the workflow provenance graph and, exploiting the workflow dependency graph, further partitions the large components into a collection of weakly connected sets. We study the effectiveness of the proposed framework through experiments on a provenance trace obtained from a real-life unstructured text curation workflow. On provenance graphs containing up to 500M nodes and edges, we show that the proposed framework answers provenance queries in real time and easily outperforms the naive approaches.
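The core idea, that the lineage of a queried value must lie inside its weakly connected component, can be sketched on a single machine. The code below is a plain Python union-find illustration, not the Spark-based framework of the abstract. The function names and the toy edge list are hypothetical, and the further partitioning of large components is not shown.

```python
def weakly_connected_components(edges):
    """Group the nodes of a directed graph into weakly connected components.

    Edge direction is ignored, which is what "weakly" connected means.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst

    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return list(components.values())


def lineage_candidates(edges, queried):
    """Return the component containing `queried`.

    This is a superset of the lineage, not the lineage itself.
    """
    for component in weakly_connected_components(edges):
        if queried in component:
            return component
    return {queried}


# Toy provenance edges (derived_from -> derived_value), purely illustrative.
edges = [("a", "b"), ("c", "b"), ("d", "e")]
candidates = lineage_candidates(edges, "a")
```

Only the nodes in `candidates` need to be scanned to answer a query about `"a"`; the unrelated component `{"d", "e"}` is never touched.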
We propose a new method to detect when users express the intent to leave a service, also known as churn. While previous work focuses solely on social media, we show that this intent can be detected in chatbot conversations. As companies increasingly rely on chatbots they need an overview of potentially churny users. To this end, we crowdsource and publish a dataset of churn intent expressions in chatbot interactions in German and English. We show that classifiers trained on social media data can detect the same intent in the context of chatbots. We introduce a classification architecture that outperforms existing work on churn intent detection in social media. Moreover, we show that, using bilingual word embeddings, a system trained on combined English and German data outperforms monolingual approaches. As the only existing dataset is in English, we crowdsource and publish a novel dataset of German tweets. We thus underline the universal aspect of the problem, as examples of churn intent in English help us identify churn in German tweets and chatbot conversations.
We provide a detailed example for modular ontology modeling based on ontology design patterns.
In causal inference, and specifically in the \textit{Causes of Effects} problem, one is interested in how to use statistical evidence to understand causation in an individual case, and so how to assess the so-called {\em probability of causation} (PC). The answer relies on the potential responses, which can incorporate information about what would have happened to the outcome if we had observed a different value of the exposure. However, even given the best possible statistical evidence for the association between exposure and outcome, we can typically only provide bounds for the PC. Dawid et al. (2016) highlighted some fundamental conditions, namely, exogeneity, comparability, and sufficiency, required to obtain such bounds, based on experimental data. The aim of the present paper is to provide methods to find, in specific cases, the best subsample of the reference dataset to satisfy such requirements. To this end, we introduce a new variable, expressing the desire to be exposed or not, and we set the question up as a model selection problem. The best model will be selected using the marginal probability of the responses and a suitable prior proposal over the model space. An application in the educational field is presented.
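As an illustration of why only bounds are available, the following sketch computes an interval for the PC of an exposed individual who experienced the outcome. It assumes the familiar form of the bounds, which follows from Fréchet inequalities on the joint distribution of the two potential responses, and it assumes the conditions named in the abstract hold. The function and parameter names and the example probabilities are hypothetical.

```python
def pc_bounds(p_outcome_exposed, p_outcome_unexposed):
    """Bounds on the probability of causation for an exposed, affected case.

    p_outcome_exposed   : P(Y = 1 | X = 1), must be positive
    p_outcome_unexposed : P(Y = 1 | X = 0)
    Assumes exogeneity, so these conditionals equal the potential-response
    probabilities P(Y1 = 1) and P(Y0 = 1).
    """
    if not 0.0 < p_outcome_exposed <= 1.0:
        raise ValueError("p_outcome_exposed must be in (0, 1]")
    if not 0.0 <= p_outcome_unexposed <= 1.0:
        raise ValueError("p_outcome_unexposed must be in [0, 1]")
    # Lower bound: max{0, 1 - 1/RR}, with RR the risk ratio.
    lower = max(0.0, 1.0 - p_outcome_unexposed / p_outcome_exposed)
    # Upper bound: min{1, P(Y0 = 0) / P(Y1 = 1)}.
    upper = min(1.0, (1.0 - p_outcome_unexposed) / p_outcome_exposed)
    return lower, upper


# Made-up probabilities for illustration only.
lower, upper = pc_bounds(0.30, 0.12)
```

With these made-up inputs the risk ratio is 2.5, so the lower bound is about 0.6 while the upper bound stays at 1: the data alone cannot narrow the PC to a single value.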
Big Data is the most popular paradigm today and has left almost no area untouched, including science, engineering, economics, business, social science, and government. Big Data is used to boost organizational performance by exploiting massive amounts of data. Data are assets of an organization and generate revenue for it. Big Data is therefore spreading everywhere to enhance organizations’ revenue, and many new technologies are emerging based on it. In this paper, we present a taxonomy of Big Data. In addition, we provide in-depth insight into the Big Data paradigm.
Bayesian hypothesis testing is re-examined from the perspective of an a priori assessment of the test statistic distribution under the alternative. By assessing the distribution of an observable test statistic, rather than prior parameter values, we provide a practical default Bayes factor which is straightforward to interpret. To illustrate our methodology, we provide examples where evidence for a Bayesian strikingly supports the null, but leads to rejection under a classical test. Finally, we conclude with directions for future research.
Artificial intelligence (AI) is the core technology of the technological revolution and industrial transformation. As one of the new intelligent needs in the AI 2.0 era, financial intelligence has attracted much attention from academia and industry. In the current dynamic capital market, financial intelligence demonstrates a fast and accurate machine learning capability to handle complex data and has gradually acquired the potential to become a ‘financial brain’. In this work, we survey existing studies on financial intelligence. First, we describe the concept of financial intelligence and elaborate on its position in the financial technology field. Second, we introduce the development of financial intelligence and review state-of-the-art techniques in wealth management, risk management, financial security, financial consulting, and blockchain. Finally, we propose a research framework called FinBrain and summarize four open issues, namely, explainable financial agents and causality, perception and prediction under uncertainty, risk-sensitive and robust decision making, and multi-agent game and mechanism design. We believe that these research directions can lay the foundation for the development of AI 2.0 in the finance field.
When two graphs have a correlated Bernoulli distribution, we prove that the alignment strength of their natural bijection strongly converges to a novel measure of graph correlation $\rho_T$ that neatly combines intergraph with intragraph distribution parameters. Within broad families of the random graph parameter settings, we illustrate that exact graph matching runtime and also matchability are both functions of $\rho_T$, with thresholding behavior starkly illustrated in matchability.
Detecting events and classifying them into predefined types is an important step in knowledge extraction from natural language texts. While neural network models have generally led the state of the art, the differences in performance between architectures have not been rigorously studied. In this paper, we present a novel GRU-based model that combines syntactic information with temporal structure through an attention mechanism. We show that it is competitive with other neural network architectures through empirical evaluations under different random initializations and training-validation-test splits of the ACE2005 dataset.
The complexity and diversity of big data and AI workloads make them difficult to understand. This paper proposes a new approach to modelling and characterizing big data and AI workloads. We consider each big data and AI workload as a pipeline of one or more classes of units of computation performed on different initial or intermediate data inputs. Each class of unit of computation captures the common requirements while being reasonably divorced from individual implementations, and hence we call it a data motif. For the first time, among a wide variety of big data and AI workloads, we identify eight data motifs that take up most of the run time of those workloads: Matrix, Sampling, Logic, Transform, Set, Graph, Sort and Statistic. We implement the eight data motifs on different software stacks as the micro benchmarks of an open-source big data and AI benchmark suite, BigDataBench 4.0 (publicly available from http://…/BigDataBench), and perform a comprehensive characterization of those data motifs from the perspectives of data sizes, types, sources, and patterns as a lens towards fully understanding big data and AI workloads. We believe the eight data motifs are promising abstractions and tools not only for big data and AI benchmarking, but also for domain-specific hardware and software co-design.
Existing fuzzy neural networks (FNNs) are mostly developed under a shallow network configuration, which has lower generalization power than deep structures. This paper proposes a novel self-organizing deep fuzzy neural network, namely the deep evolving fuzzy neural network (DEVFNN). Fuzzy rules can be automatically extracted from data streams or removed if they play little role during their lifespan. The structure of the network can be deepened on demand by stacking additional layers using a drift detection method, which not only detects covariate drift (variations of the input space) but also accurately identifies real drift (dynamic changes of both the feature space and the target space). DEVFNN is developed under the stacked generalization principle via the feature augmentation concept, where a recently developed algorithm, namely Generic Classifier (gClass), drives the hidden layer. It is equipped with an automatic feature selection method that controls the activation and deactivation of input attributes to induce varying subsets of input features. A deep network simplification procedure based on hidden layer merging is put forward to prevent uncontrollable growth of the input space dimension, which would otherwise result from the feature augmentation approach to building a deep network structure. DEVFNN works in a sample-wise fashion and is compatible with data stream applications. The efficacy of DEVFNN has been thoroughly evaluated using six datasets with non-stationary properties under the prequential test-then-train protocol. It has been compared with four state-of-the-art data stream methods and its shallow counterpart, and DEVFNN demonstrates improved classification accuracy.
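The prequential test-then-train protocol mentioned above can be sketched as follows: each incoming sample is first used to test the current model and only afterwards to update it. This is a minimal illustration with a hypothetical `predict`/`learn` interface and a trivial majority-class learner standing in for a real stream classifier such as DEVFNN.

```python
class MajorityClassLearner:
    """Toy incremental learner: always predicts the most frequent label seen."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        if not self.counts:
            return None  # no label seen yet
        return max(self.counts, key=self.counts.get)

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def prequential_accuracy(stream, model):
    """Test on each sample first, then train on it (test-then-train)."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:  # test with the model as it is now
            correct += 1
        total += 1
        model.learn(x, y)          # only then update the model
    return correct / total if total else 0.0


# Toy labelled stream of (features, label) pairs.
stream = [(0, "a"), (1, "a"), (2, "b"), (3, "a")]
accuracy = prequential_accuracy(stream, MajorityClassLearner())
```

Because every sample is unseen at the moment it is tested, no separate hold-out set is needed, which is what makes the protocol suitable for non-stationary streams.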
In this paper, we propose a distributed tool for feature extraction from high spatial resolution remote sensing images. The tool is based on the Apache Hadoop framework and the Hadoop Image Processing Interface. Two corner detection algorithms (Harris and Shi-Tomasi) and five feature descriptors (SIFT, SURF, FAST, BRIEF, and ORB) are considered. The robustness of the tool in the task of feature extraction from LandSat-8 imagery is evaluated in terms of horizontal scalability.