# Whats new on arXiv

The goal of system identification is to learn about underlying physics dynamics behind the observed time-series data. To model the nonparametric and probabilistic dynamics model, Gaussian process state-space models (GPSSMs) have been widely studied; GPs are not only capable to represent nonlinear dynamics, but estimate the uncertainty of prediction and avoid over-fitting. Traditional GPSSMs, however, are based on Gaussian transition model, thus often have difficulty in describing multi-modal motions. To resolve the challenge, this thesis proposes a model using multiple GPs and extends the GPSSM to information-theoretic framework by introducing a mutual information regularizer helping the model to learn interpretable and disentangled representation of multi-modal transition dynamics model. Experiment results show that the proposed model not only successfully represents the observed system but distinguishes the dynamics mode that governs the given observation sequence.
Classical epidemiology has focused on the control of confounding but it is only recently that epidemiologists have started to focus on the bias produced by colliders. A collider for a certain pair of variables (e.g., an outcome Y and an exposure A) is a third variable (C) that is caused by both. In DAGs terminology, a collider is the variable in the middle of an inverted fork (i.e., the variable C in A -> C <- Y). Controlling for, or conditioning an analysis on a collider (i.e., through stratification or regression) can introduce a spurious association between its causes. This potentially explains many paradoxical findings in the medical literature, where established risk factors for a particular outcome appear protective. We used an example from non-communicable disease epidemiology to contextualize and explain the effect of conditioning on a collider. We generated a dataset with 1,000 observations and ran Monte-Carlo simulations to estimate the effect of 24-hour dietary sodium intake on systolic blood pressure, controlling for age, which acts as a confounder, and 24-hour urinary protein excretion, which acts as a collider. We illustrate how adding a collider to a regression model introduces bias. Thus, to prevent paradoxical associations, epidemiologists estimating causal effects should be wary of conditioning on colliders. We provide R-code in easy-to-read boxes throughout the manuscript and a GitHub repository (https://…/ColliderApp ) for the reader to reproduce our example. We also provide an educational web application allowing real-time interaction to visualize the paradoxical effect of conditioning on a collider http://…/.
The autoregressive (AR) model is a widely used model to understand time series data. Traditionally, the innovation noise of the AR is modeled as Gaussian. However, many time series applications, for example, financial time series data are non-Gaussian, therefore, the AR model with more general heavy-tailed innovations are preferred. Another issue that frequently occurs in time series is missing values, due to the system data record failure or unexpected data loss. Although there are numerous works about Gaussian AR time series with missing values, as far as we know, there does not exist any work addressing the issue of missing data for the heavy-tailed AR model. In this paper, we consider this issue for the first time, and propose an efficient framework for the parameter estimation from incomplete heavy-tailed time series based on the stochastic approximation expectation maximization (SAEM) coupled with a Markov Chain Monte Carlo (MCMC) procedure. The proposed algorithm is computationally cheap and easy to implement. The convergence of the proposed algorithm to a stationary point of the observed data likelihood is rigorously proved. Extensive simulations on synthetic and real datasets demonstrate the efficacy of the proposed framework.
The semantic information regulates the expressiveness of a web service. State-of-the-art approaches in web services research have used the semantics of a web service for different purposes, mainly for service discovery, composition, execution etc. In this paper, our main focus is on semantic driven Quality of Service (QoS) aware service composition. Most of the contemporary approaches on service composition have used the semantic information to combine the services appropriately to generate the composition solution. However, in this paper, our intention is to use the semantic information to expedite the service composition algorithm. Here, we present a service composition framework that uses semantic information of a web service to generate different clusters, where the services are semantically related within a cluster. Our final aim is to construct a composition solution using these clusters that can efficiently scale to large service spaces, while ensuring solution quality. Experimental results show the efficiency of our proposed method.
Deep neural networks show great potential as solutions to many sensing application problems, but their excessive resource demand slows down execution time, pausing a serious impediment to deployment on low-end devices. To address this challenge, recent literature focused on compressing neural network size to improve performance. We show that changing neural network size does not proportionally affect performance attributes of interest, such as execution time. Rather, extreme run-time nonlinearities exist over the network configuration space. Hence, we propose a novel framework, called FastDeepIoT, that uncovers the non-linear relation between neural network structure and execution time, then exploits that understanding to find network configurations that significantly improve the trade-off between execution time and accuracy on mobile and embedded devices. FastDeepIoT makes two key contributions. First, FastDeepIoT automatically learns an accurate and highly interpretable execution time model for deep neural networks on the target device. This is done without prior knowledge of either the hardware specifications or the detailed implementation of the used deep learning library. Second, FastDeepIoT informs a compression algorithm how to minimize execution time on the profiled device without impacting accuracy. We evaluate FastDeepIoT using three different sensing-related tasks on two mobile devices: Nexus 5 and Galaxy Nexus. FastDeepIoT further reduces the neural network execution time by $48\%$ to $78\%$ and energy consumption by $37\%$ to $69\%$ compared with the state-of-the-art compression algorithms.
Algorithms for clustering points in metric spaces is a long-studied area of research. Clustering has seen a multitude of work both theoretically, in understanding the approximation guarantees possible for many objective functions such as k-median and k-means clustering, and experimentally, in finding the fastest algorithms and seeding procedures for Lloyd’s algorithm. The performance of a given clustering algorithm depends on the specific application at hand, and this may not be known up front. For example, a ‘typical instance’ may vary depending on the application, and different clustering heuristics perform differently depending on the instance. In this paper, we define an infinite family of algorithms generalizing Lloyd’s algorithm, with one parameter controlling the the initialization procedure, and another parameter controlling the local search procedure. This family of algorithms includes the celebrated k-means++ algorithm, as well as the classic farthest-first traversal algorithm. We design efficient learning algorithms which receive samples from an application-specific distribution over clustering instances and learn a near-optimal clustering algorithm from the class. We show the best parameters vary significantly across datasets such as MNIST, CIFAR, and mixtures of Gaussians. Our learned algorithms never perform worse than k-means++, and on some datasets we see significant improvements.
The growing influence and decision-making capacities of Autonomous systems and Artificial Intelligence in our lives force us to consider the values embedded in these systems. But how ethics should be implemented into these systems? In this study, the solution is seen on philosophical conceptualization as a framework to form practical implementation model for ethics of AI. To take the first steps on conceptualization main concepts used on the field needs to be identified. A keyword based Systematic Mapping Study (SMS) on the keywords used in AI and ethics was conducted to help in identifying, defying and comparing main concepts used in current AI ethics discourse. Out of 1062 papers retrieved SMS discovered 37 re-occurring keywords in 83 academic papers. We suggest that the focus on finding keywords is the first step in guiding and providing direction for future research in the AI ethics field.
Convolutional Neural Networks (CNNs) are extremely computationally demanding, presenting a large barrier to their deployment on resource-constrained devices. Since such systems are where some of their most useful applications lie (e.g. obstacle detection for mobile robots, vision-based medical assistive technology), significant bodies of work from both machine learning and systems communities have attempted to provide optimisations that will make CNNs available to edge devices. In this paper we unify the two viewpoints in a Deep Learning Inference Stack and take an across-stack approach by implementing and evaluating the most common neural network compression techniques (weight pruning, channel pruning, and quantisation) and optimising their parallel execution with a range of programming approaches (OpenMP, OpenCL) and hardware architectures (CPU, GPU). We provide comprehensive Pareto curves to instruct trade-offs under constraints of accuracy, execution time, and memory space.
Continuous word representation (aka word embedding) is a basic building block in many neural network-based models used in natural language processing tasks. Although it is widely accepted that words with similar semantics should be close to each other in the embedding space, we find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space, and the embedding of a rare word and a popular word can be far from each other even if they are semantically similar. This makes learned word embeddings ineffective, especially for rare words, and consequently limits the performance of these neural network models. In this paper, we develop a neat, simple yet effective way to learn \emph{FRequency-AGnostic word Embedding} (FRAGE) using adversarial training. We conducted comprehensive studies on ten datasets across four natural language processing tasks, including word similarity, language modeling, machine translation and text classification. Results show that with FRAGE, we achieve higher performance than the baselines in all tasks.
HDT (Header, Dictionary, Triples) is a serialization for RDF. HDT has become very popular in the last years because it allows to store RDF data with a small disk footprint, while remaining at the same time queriable. For this reason HDT is often used when scalability becomes an issue. Once RDF data is serialized into HDT, the disk footprint to store it and the memory footprint to query it are very low. However, generating HDT files from raw text RDF serializations (like N-Triples) is a time-consuming and (especially) memory-consuming task. In this publication we present HDTCat, an algorithm and command line tool to join two HDT files with low memory footprint. HDTCat can be used in a divide-and-conquer strategy to generate HDT files from huge datasets using a low-memory footprint.
Growing amount of comments make online discussions difficult to moderate by human moderators only. Antisocial behavior is a common occurrence that often discourages other users from participating in discussion. We propose a neural network based method that partially automates the moderation process. It consists of two steps. First, we detect inappropriate comments for moderators to see. Second, we highlight inappropriate parts within these comments to make the moderation faster. We evaluated our method on data from a major Slovak news discussion platform.
An online labor platform faces an online learning problem in matching workers with jobs and using the performance on these jobs to create better future matches. This learning problem is complicated by the rise of complex tasks on these platforms, such as web development and product design, that require a team of workers to complete. The success of a job is now a function of the skills and contributions of all workers involved, which may be unknown to both the platform and the client who posted the job. These team matchings result in a structured correlation between what is known about the individuals and this information can be utilized to create better future matches. We analyze two natural settings where the performance of a team is dictated by its strongest and its weakest member, respectively. We find that both problems pose an exploration-exploitation tradeoff between learning the performance of untested teams and repeating previously tested teams that resulted in a good performance. We establish fundamental regret bounds and design near-optimal algorithms that uncover several insights into these tradeoffs.
The field of Argumentation Mining has arisen from the need of determining the underlying causes from an expressed opinion and the urgency to develop the established fields of Opinion Mining and Sentiment Analysis. The recent progress in the wider field of Artificial Intelligence in combination with the available data through Social Web has create great potential for every sub-field of Natural Language Process including Argumentation Mining.
Compressed sensing proposes to reconstruct more degrees of freedom in a signal than the number of values actually measured. Compressed sensing therefore risks introducing errors — inserting spurious artifacts or masking the abnormalities that medical imaging seeks to discover. The present case study of estimating errors using the standard statistical tools of a jackknife and a bootstrap yields error ‘bars’ in the form of full images that are remarkably representative of the actual errors (at least when evaluated and validated on data sets for which the ground truth and hence the actual error is available). These images show the structure of possible errors — without recourse to measuring the entire ground truth directly — and build confidence in regions of the images where the estimated errors are small.
We propose a multi-task learning framework to jointly train a Machine Reading Comprehension (MRC) model on multiple datasets across different domains. Key to the proposed method is to learn robust and general contextual representations with the help of out-domain data in a multi-task framework. Empirical study shows that the proposed approach is orthogonal to the existing pre-trained representation models, such as word embedding and language models. Experiments on the Stanford Question Answering Dataset (SQuAD), the Microsoft MAchine Reading COmprehension Dataset (MS MARCO), NewsQA and other datasets show that our multi-task learning approach achieves significant improvement over state-of-the-art models in most MRC tasks.
We propose to use boosted regression trees as a way to compute human-interpretable solutions to reinforcement learning problems. Boosting combines several regression trees to improve their accuracy without significantly reducing their inherent interpretability. Prior work has focused independently on reinforcement learning and on interpretable machine learning, but there has been little progress in interpretable reinforcement learning. Our experimental results show that boosted regression trees compute solutions that are both interpretable and match the quality of leading reinforcement learning methods.
Central to many inferential situations is the estimation of rational functions of parameters. The mainstream in statistics and econometrics estimates these quantities based on the plug-in approach without consideration of the main objective of the inferential situation. We propose the Bayesian Minimum Expected Loss (MELO) approach focusing explicitly on the function of interest, and calculating its frequentist variability. Asymptotic properties of the MELO estimator are similar to the plug-in approach. Nevertheless, simulation exercises show that our proposal is better in situations characterized by small sample sizes and noisy models. In addition, we observe in the applications that our approach gives lower standard errors than frequently used alternatives when datasets are not very informative.
The ability to estimate joint, conditional and marginal probability distributions over some set of variables is of great utility for many common machine learning tasks. However, estimating these distributions can be challenging, particularly in the case of data containing a mix of discrete and continuous variables. This paper presents a non-parametric method for estimating these distributions directly from a dataset. The data are first represented as a graph consisting of object nodes and attribute value nodes. Depending on the distribution to be estimated, an appropriate eigenvector equation is then constructed. This equation is then solved to find the corresponding stationary distribution of the graph, from which the required distributions can then be estimated and sampled from. The paper demonstrates how the method can be applied to many common machine learning tasks including classification, regression, missing value imputation, outlier detection, random vector generation, and clustering.
Multiplicative noise, including dropout, is widely used to regularize deep neural networks (DNNs), and is shown to be effective in a wide range of architectures and tasks. From an information perspective, we consider injecting multiplicative noise into a DNN as training the network to solve the task with noisy information pathways, which leads to the observation that multiplicative noise tends to increase the correlation between features, so as to increase the signal-to-noise ratio of information pathways. However, high feature correlation is undesirable, as it increases redundancy in representations. In this work, we propose non-correlating multiplicative noise (NCMN), which exploits batch normalization to remove the correlation effect in a simple yet effective way. We show that NCMN significantly improves the performance of standard multiplicative noise on image classification tasks, providing a better alternative to dropout for batch-normalized networks. Additionally, we present a unified view of NCMN and shake-shake regularization, which explains the performance gain of the latter.
Item-to-item collaborative filtering (aka. item-based CF) has been long used for building recommender systems in industrial settings, owing to its interpretability and efficiency in real-time personalization. It builds a user’s profile as her historically interacted items, recommending new items that are similar to the user’s profile. As such, the key to an item-based CF method is in the estimation of item similarities. Early approaches use statistical measures such as cosine similarity and Pearson coefficient to estimate item similarities, which are less accurate since they lack tailored optimization for the recommendation task. In recent years, several works attempt to learn item similarities from data, by expressing the similarity as an underlying model and estimating model parameters by optimizing a recommendation-aware objective function. While extensive efforts have been made to use shallow linear models for learning item similarities, there has been relatively less work exploring nonlinear neural network models for item-based CF. In this work, we propose a neural network model named Neural Attentive Item Similarity model (NAIS) for item-based CF. The key to our design of NAIS is an attention network, which is capable of distinguishing which historical items in a user profile are more important for a prediction. Compared to the state-of-the-art item-based CF method Factored Item Similarity Model (FISM), our NAIS has stronger representation power with only a few additional parameters brought by the attention network. Extensive experiments on two public benchmarks demonstrate the effectiveness of NAIS. This work is the first attempt that designs neural network models for item-based CF, opening up new research possibilities for future developments of neural recommender systems.
With the prevalence of multimedia content on the Web, developing recommender solutions that can effectively leverage the rich signal in multimedia data is in urgent need. Owing to the success of deep neural networks in representation learning, recent advance on multimedia recommendation has largely focused on exploring deep learning methods to improve the recommendation accuracy. To date, however, there has been little effort to investigate the robustness of multimedia representation and its impact on the performance of multimedia recommendation. In this paper, we shed light on the robustness of multimedia recommender system. Using the state-of-the-art recommendation framework and deep image features, we demonstrate that the overall system is not robust, such that a small (but purposeful) perturbation on the input image will severely decrease the recommendation accuracy. This implies the possible weakness of multimedia recommender system in predicting user preference, and more importantly, the potential of improvement by enhancing its robustness. To this end, we propose a novel solution named Adversarial Multimedia Recommendation (AMR), which can lead to a more robust multimedia recommender model by using adversarial learning. The idea is to train the model to defend an adversary, which adds perturbations to the target image with the purpose of decreasing the model’s accuracy. We conduct experiments on two representative multimedia recommendation tasks, namely, image recommendation and visually-aware product recommendation. Extensive results verify the positive effect of adversarial learning and demonstrate the effectiveness of our AMR method. Source codes are available in https://…/AMR.
We present an effective technique for training deep learning agents capable of negotiating on a set of clauses in a contract agreement using a simple communication protocol. We use Multi Agent Reinforcement Learning to train both agents simultaneously as they negotiate with each other in the training environment. We also model selfish and prosocial behavior to varying degrees in these agents. Empirical evidence is provided showing consistency in agent behaviors. We further train a meta agent with a mixture of behaviors by learning an ensemble of different models using reinforcement learning. Finally, to ascertain the deployability of the negotiating agents, we conducted experiments pitting the trained agents against human players. Results demonstrate that the agents are able to hold their own against human players, often emerging as winners in the negotiation. Our experiments demonstrate that the meta agent is able to reasonably emulate human behavior.
Latent variable models have been a preferred choice in conversational modeling compared to sequence-to-sequence (seq2seq) models which tend to generate generic and repetitive responses. Despite so, training latent variable models remains to be difficult. In this paper, we propose Latent Topic Conversational Model (LTCM) which augments seq2seq with a neural latent topic component to better guide response generation and make training easier. The neural topic component encodes information from the source sentence to build a global ‘topic’ distribution over words, which is then consulted by the seq2seq model at each generation step. We study in details how the latent representation is learnt in both the vanilla model and LTCM. Our extensive experiments contribute to better understanding and training of conditional latent models for languages. Our results show that by sampling from the learnt latent representations, LTCM can generate diverse and interesting responses. In a subjective human evaluation, the judges also confirm that LTCM is the overall preferred option.
In the real world, the environment is constantly changing with the input variables under the effect of noise. However, few algorithms were shown to be able to work under those circumstances. Here, Novelty-Organizing Team of Classifiers (NOTC) is applied to the continuous action mountain car as well as two variations of it: a noisy mountain car and an unstable weather mountain car. These problems take respectively noise and change of problem dynamics into account. Moreover, NOTC is compared with NeuroEvolution of Augmenting Topologies (NEAT) in these problems, revealing a trade-off between the approaches. While NOTC achieves the best performance in all of the problems, NEAT needs less trials to converge. It is demonstrated that NOTC achieves better performance because of its division of the input space (creating easier problems). Unfortunately, this division of input space also requires a bit of time to bootstrap.
We propose a simple procedure to test for changes in correlation matrix at an unknown point in time. This test requires constant expectations and variances, but only mild assumptions on the serial dependence structure. We test for a breakdown in correlation structure using eigenvalue decomposition. We derive the asymptotic distribution under the null hypothesis and apply the test to stock returns. We compute the power of our test and compare it with the power of other known tests.
Recently, path norm was proposed as a new capacity measure for neural networks with Rectified Linear Unit (ReLU) activation function, which takes the rescaling-invariant property of ReLU into account. It has been shown that the generalization error bound in terms of the path norm explains the empirical generalization behaviors of the ReLU neural networks better than that of other capacity measures. Moreover, optimization algorithms which take path norm as the regularization term to the loss function, like Path-SGD, have been shown to achieve better generalization performance. However, the path norm counts the values of all paths, and hence the capacity measure based on path norm could be improperly influenced by the dependency among different paths. It is also known that each path of a ReLU network can be represented by a small group of linearly independent basis paths with multiplication and division operation, which indicates that the generalization behavior of the network only depends on only a few basis paths. Motivated by this, we propose a new norm \emph{Basis-path Norm} based on a group of linearly independent paths to measure the capacity of neural networks more accurately. We establish a generalization error bound based on this basis path norm, and show it explains the generalization behaviors of ReLU networks more accurately than previous capacity measures via extensive experiments. In addition, we develop optimization algorithms which minimize the empirical risk regularized by the basis-path norm. Our experiments on benchmark datasets demonstrate that the proposed regularization method achieves clearly better performance on the test set than the previous regularization approaches.
We consider the task of generating draws from a Markov jump process (MJP) between two time points at which the process is known. Resulting draws are typically termed bridges and the generation of such bridges plays a key role in simulation-based inference algorithms for MJPs. The problem is challenging due to the intractability of the conditioned process, necessitating the use of computationally intensive methods such as weighted resampling or Markov chain Monte Carlo. An efficient implementation of such schemes requires an approximation of the intractable conditioned hazard/propensity function that is both cheap and accurate. In this paper, we review some existing approaches to this problem before outlining our novel contribution. Essentially, we leverage the tractability of a Gaussian approximation of the MJP and suggest a computationally efficient implementation of the resulting conditioned hazard approximation. We compare and contrast our approach with existing methods using three examples.
Real world experiments are expensive, and thus it is important to reach a target in minimum number of experiments. Experimental processes often involve control variables that changes over time. Such problems can be formulated as a functional optimisation problem. We develop a novel Bayesian optimisation framework for such functional optimisation of expensive black-box processes. We represent the control function using Bernstein polynomial basis and optimise in the coefficient space. We derive the theory and practice required to dynamically adjust the order of the polynomial degree, and show how prior information about shape can be integrated. We demonstrate the effectiveness of our approach for short polymer fibre design and optimising learning rate schedules for deep networks.
Input optimization methods, such as Google Deep Dream, create interpretable representations of neurons for computer vision DNNs. We propose and evaluate ways of transferring this technology to NLP. Our results suggest that gradient ascent with a gumbel softmax layer produces n-gram representations that outperform naive corpus search in terms of target neuron activation. The representations highlight differences in syntax awareness between the language and visual models of the Imaginet architecture.
Generative adversarial networks have gained a lot of attention in general computer vision community due to their capability of data generation without explicitly modelling the probability density function and robustness to overfitting. The adversarial loss brought by the discriminator provides a clever way of incorporating unlabeled samples into the training and imposing higher order consistency that is proven to be useful in many cases, such as in domain adaptation, data augmentation, and image-to-image translation. These nice properties have attracted researcher in the medical imaging community and we have seen quick adoptions in many traditional tasks and some novel applications. This trend will continue to grow based on our observation, therefore we conducted a review of the recent advances in medical imaging using the adversarial training scheme in the hope of benefiting researchers that are interested in this technique.
In the Sequential Selection Problem (SSP), immediate and irrevocable decisions need to be made while candidates from a finite set are being examined one-by-one. The goal is to assign a limited number of $b$ available jobs to the best possible candidates. Standard SSP variants begin with an empty selection set (cold-starting) and perform the selection process once (single-round), over a single candidate set. In this paper we introduce the Multi-round Sequential Selection Problem (MSSP) which launches a new round of sequential selection each time a new set of candidates becomes available. Each new round has at hand the output of the previous one, i.e. its $b$ selected employees, and tries to update optimally that selection by reassigning each job at most once. Our setting allows changes to take place between two subsequent selection rounds: resignations of previously selected subjects or/and alterations of the quality score across the population. The challenge for a selection strategy is thus to efficiently adapt to such changes. For this novel problem we adopt a cutoff-based approach, where a precise number of candidates should be rejected first before starting to select. We set a rank-based objective of the process over the final job-to-employee assignment and we investigate analytically the optimal cutoff values with respect to the important parameters of the problem. Finally, we present experimental results that compare the efficiency of different selection strategies, as well as their convergence rates towards the optimal solution in the case of stationary score distributions.

# Document worth reading: “Automatic Language Identification in Texts: A Survey”

Language identification (LI) is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used so far in the LI literature. For describing the features and methods we introduce a unified notation. We discuss evaluation methods, applications of LI, as well as off-the-shelf LI systems that do not require training by the end user. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI. Automatic Language Identification in Texts: A Survey

# Distilled News

Visualizations are a powerful way to simplify and interpret the underlying patterns in data. The first thing I do, whenever I work on a new dataset is to explore it through visualization. And this approach has worked well for me. Sadly, I don´t see many people using visualizations as much. That is why I thought I will share some of my ‘secret sauce’ with the world! Use of graphs is one such visualization technique. It is incredibly useful and helps businesses make better data-driven decisions. But to understand the concepts of graphs in detail, we must first understand it´s base – Graph Theory.
Online Statistics: An Interactive Multimedia Course of Study is a resource for learning and teachingintroductory statistics. It contains material presented in textbook format and as video presentations. Thisresource features interactive demonstrations and simulations, case studies, and an analysis lab.
I have started reading about Deep Learning for over a year now through several articles and research papers that I came across mainly in LinkedIn, Medium and Arxiv. When I virtually attended the MIT 6.S191 Deep Learning courses during the last few weeks (Here is a link to the course site), I decided to begin to put some structure in my understanding of Neural Networks through this series of articles.
Scalable Deep Learning services are contingent on several constraints. Depending on your target application, you may require low latency, enhanced security or long-term cost effectiveness. Hosting your Deep Learning model on the cloud may not be the best solution in such cases.
In a web application, routing is the process of using URLs to drive the user interface. Routing adds more possibilities and flexibility while building a complex and advanced web application, offering dividing app into separated sections.
Deep learning continues to be the hottest thing in data science. Deep learning frameworks are changing rapidly. Just five years ago, none of the leaders other than Theano were even around. I wanted to find evidence for which frameworks merit attention, so I developed this power ranking. The Python language is the clear leader for deep learning, so I focussed on frameworks compatible with it. I used 11 data sources across 7 distinct categories to gauge framework usage, interest, and popularity. Then I weighted and combined the data in this Kaggle Kernel.
In many decisive battles, battles lines are drawn, strategies are made, and a winner emerges. However, sometimes, pitifully, battles are lost because of utter ignorance?-?ignorance of even where the battle front is?-?and more sadly, self-deception. A brilliant piece of analysis from Forrester reports that 40 ‘insight-driven companies’ are going to grab \$1.8 trillion by 2021?-?most likely a part of this is going to be carved out of market cap of your organization. In this list we have young companies that are less than 8 years old. What unifies them? Their obsession with data and AI. Broadly, with respect to AI adoption, organizations fall into one of the two categories: First, we have the ‘talkers’: there are organizations wetting their feet on what they typically call ‘AI initiatives’?-?taking small risk averse steps in organizational silos, getting tangled by bureaucracies and a minority few unfortunately focusing more on press coverage than actual outcome. Then we have ‘Do-ers’: These are the insights driven companies, that have integrated (or on a strong path to integrate) Analytics & AI into their organizational fabric. These organizations have a holistic approach to what I would like to call ‘AI enabled Value Chain’. Which one are you and where do you want to be?
Exploring data using natural language (‘plain English’) query expressions isn’t a new concept, but it has become more relevant and more feasible lately. People are used to search engines and like the metaphor as data querying experience. Products like Thoughtspot and Answer Rocket specialize in this teaming of search and data discovery. And the Q&A feature of Microsoft Power BI enables this, both for ad hoc queries in dashboards and even for use as an authoring tool when designing reports. Many natural language analytics products, however, require data to be moved into their own repositories or index structures. But today, Arcadia Data is announcing a new Search feature, in the latest release of its Arcadia Enterprise product, that adapts the natural language query paradigm to work directly on top of data lakes.
In this course, you will learn the foundations of A/B testing, including hypothesis testing, experimental design, and confounding variables. You will also be exposed to a couple more advanced topics, sequential analysis and multivariate testing. The first dataset will be a generated example of a cat adoption website. You will investigate if changing the homepage image affects conversion rates (the percentage of people who click a specific button). For the remainder of the course you will use another generated dataset of a hypothetical data visualization website.
This is part-2 in the feature encoding tips and tricks series with the latest Spark 2.3.0. Please refer to part-1, before, as a lot of concepts from there will be used here. As mentioned before, I assume that you have a basic understanding of spark and its datatypes. If not, spark has an amazing documentation and it would be great to go through. For background on spark itself, go here for a summary.
In a previous post, we explained how neural networks work to predict a continuous value (like house price) from several features. One of the questions we got is how neural networks can encode concepts, categories or classes. For instance, how can neural networks convert a number of pixels to a true/false answer whether or not the underlying picture contains a cat?

# Magister Dixit

“There is a very important reason we use math. Our intuitions are often trained from a lifetime of experience, throwing that out in the name of ‘objectivity’ is foolishly excluding important information. But intuition also likes to skips steps, making it very easy to make errors in our reasoning if we don’t pull apart our intuition.” Will Kurt ( February 24, 2015 )

# Whats new on arXiv

Increasingly fast development and update cycle of online course contents, and diverse demographics of students in each online classroom, make student performance prediction in real-time (before the course finishes) an interesting topic for both industrial research and practical needs. In that, we tackle the problem of real-time student performance prediction with on-going courses in domain adaptation framework, which is a system trained on students’ labeled outcome from one previous coursework but is meant to be deployed on another. In particular, we first review recently-developed GritNet architecture which is the current state of the art for student performance prediction problem, and introduce a new unsupervised domain adaptation method to transfer a GritNet trained on a past course to a new course without any (students’ outcome) label. Our results for real Udacity students’ graduation predictions show that the GritNet not only generalizes well from one course to another across different Nanodegree programs, but enhances real-time predictions explicitly in the first few weeks when accurate predictions are most challenging.
Modern vision-based reinforcement learning techniques often use convolutional neural networks (CNN) as universal function approximators to choose which action to take for a given visual input. Until recently, CNNs have been treated like black-box functions, but this mindset is especially dangerous when used for control in safety-critical settings. In this paper, we present our extensions of CNN visualization algorithms to the domain of vision-based reinforcement learning. We use a simulated drone environment as an example scenario. These visualization algorithms are an important tool for behavior introspection and provide insight into the qualities and flaws of trained policies when interacting with the physical world. A video may be seen at https://…/drlvisual.
The primary focus of this paper is on designing inexact first-order methods for solving large-scale constrained nonlinear optimization problems. By controlling the inexactness of the subproblem solution, we can significantly reduce the computational cost needed for each iteration. A penalty parameter updating strategy during the subproblem solve enables the algorithm to automatically detect infeasibility. Global convergence for both feasible and infeasible cases are proved. Complexity analysis for the KKT residual is also derived under loose assumptions. Numerical experiments exhibit the ability of the proposed algorithm to rapidly find inexact optimal solution through cheap computational cost.
We address two challenges in topic models: (1) Context information around words helps in determining their actual meaning, e.g., ‘networks’ used in the contexts artificial neural networks vs. biological neuron networks. Generative topic models infer topic-word distributions, taking no or only little context into account. Here, we extend a neural autoregressive topic model to exploit the full context information around words in a document in a language modeling fashion. The proposed model is named as iDocNADE. (2) Due to the small number of word occurrences (i.e., lack of context) in short text and data sparsity in a corpus of few documents, the application of topic models is challenging on such texts. Therefore, we propose a simple and efficient way of incorporating external knowledge into neural autoregressive topic models: we use embeddings as a distributional prior. The proposed variants are named as DocNADE2 and iDocNADE2. We present novel neural autoregressive topic model variants that consistently outperform state-of-the-art generative topic models in terms of generalization, interpretability (topic coherence) and applicability (retrieval and classification) over 6 long-text and 8 short-text datasets from diverse domains.
Sparse reward problems are one of the biggest challenges in Reinforcement Learning. Goal-directed tasks are one such sparse reward problems where a reward signal is received only when the goal is reached. One promising way to train an agent to perform goal-directed tasks is to use Hindsight Learning approaches. In these approaches, even when an agent fails to reach the desired goal, the agent learns to reach the goal it achieved instead. Doing this over multiple trajectories while generalizing the policy learned from the achieved goals, the agent learns a goal conditioned policy to reach any goal. One such approach is Hindsight Experience replay which uses an off-policy Reinforcement Learning algorithm to learn a goal conditioned policy. In this approach, a replay of the past transitions happens in a uniformly random fashion. Another approach is to use a Hindsight version of the policy gradients to directly learn a policy. In this work, we discuss different ways to replay past transitions to improve learning in hindsight experience replay focusing on prioritized variants in particular. Also, we implement the Hindsight Policy gradient methods to robotic tasks.
Open data refers to data that is freely available for reuse. Although there has been rapid increase in availability of open data to public in the last decade, this has not translated into better decision-support tools for them. We propose intelligent conversation generators as a grand challenge that would automatically create data-driven conversation interfaces (CIs), also known as chatbots or dialog systems, from open data and deliver personalized analytical insights to users based on their contextual needs. Such generators will not only help bring Artificial Intelligence (AI)-based solutions for important societal problems to the masses but also advance AI by providing an integrative testbed for human-centric AI and filling gaps in the state-of-art towards this aim.
Convolutional neural networks memorize part of their training data, which is why strategies such as data augmentation and drop-out are employed to mitigate overfitting. This paper considers the related question of ‘membership inference’, where the goal is to determine if an image was used during training. We consider it under three complementary angles. We show how to detect which dataset was used to train a model, and in particular whether some validation images were used at train time. We then analyze explicit memorization and extend classical random label experiments to the problem of learning a model that predicts if an image belongs to an arbitrary set. Finally, we propose a new approach to infer membership when a few of the top layers are not available or have been fine-tuned, and show that lower layers still carry information about the training samples. To support our findings, we conduct large-scale experiments on Imagenet and subsets of YFCC-100M with modern architectures such as VGG and Resnet.
Learning interpretable features from complex multilayer networks is a challenging and important problem. The need for such representations is particularly evident in multilayer networks of the brain, where nodal characteristics may help model and differentiate regions of the brain according to individual, cognitive task, or disease. Motivated by this problem, we introduce the multi-node2vec algorithm, an efficient and scalable feature engineering method that automatically learns continuous node feature representations from multilayer networks. Multi-node2vec relies upon a second-order random walk sampling procedure that efficiently explores the inner- and intra-layer ties of the observed multilayer network is utilized to identify multilayer neighborhoods. Maximum likelihood estimators of the nodal features are identified through the use of the Skip-gram neural network model on the collection of sampled neighborhoods. We investigate the conditions under which multi-node2vec is an approximation of a closed-form matrix factorization problem. We demonstrate the efficacy of multi-node2vec on a multilayer functional brain network from resting state fMRI scans over a group of 74 healthy individuals. We find that multi-node2vec outperforms contemporary methods on complex networks, and that multi-node2vec identifies nodal characteristics that closely associate with the functional organization of the brain.
In this paper we first present a class of algorithms for training multi-level neural networks with a quadratic cost function one layer at a time starting from the input layer. The algorithm is based on the fact that for any layer to be trained, the effect of a direct connection to an optimized linear output layer can be computed without the connection being made. Thus, starting from the input layer, we can train each layer in succession in isolation from the other layers. Once trained, the weights are kept fixed and the outputs of the trained layer then serve as the inputs to the next layer to be trained. The result is a very fast algorithm. The simplicity of this training arrangement allows the activation function and step size in weight adjustment to be adaptive and self-adjusting. Furthermore, the stability of the training process allows relatively large steps to be taken and thereby achieving in even greater speeds. Finally, in our context configuring the network means determining the number of outputs for each layer. By decomposing the overall cost function into separate components related to approximation and estimation, we obtain an optimization formula for determining the number of outputs for each layer. With the ability to self-configure and set parameters, we now have more than a fast training algorithm, but the ability to build automatically a fully trained deep neural network starting with nothing more than data.
In critical applications of anomaly detection including computer security and fraud prevention, the anomaly detector must be configurable by the analyst to minimize the effort on false positives. One important way to configure the anomaly detector is by providing true labels for a few instances. We study the problem of label-efficient active learning to automatically tune anomaly detection ensembles and make four main contributions. First, we present an important insight into how anomaly detector ensembles are naturally suited for active learning. This insight allows us to relate the greedy querying strategy to uncertainty sampling, with implications for label-efficiency. Second, we present a novel formalism called compact description to describe the discovered anomalies and show that it can also be employed to improve the diversity of the instances presented to the analyst without loss in the anomaly discovery rate. Third, we present a novel data drift detection algorithm that not only detects the drift robustly, but also allows us to take corrective actions to adapt the detector in a principled manner. Fourth, we present extensive experiments to evaluate our insights and algorithms in both batch and streaming settings. Our results show that in addition to discovering significantly more anomalies than state-of-the-art unsupervised baselines, our active learning algorithms under the streaming-data setup are competitive with the batch setup.
Emerging applications in autonomy require control techniques that take into account uncertain environments, communication and sensing constraints, while satisfying highlevel mission specifications. Motivated by this need, we consider a class of Markov decision processes (MDPs), along with a transfer entropy cost function. In this context, we study highlevel mission specifications as co-safe linear temporal logic (LTL) formulae. We provide a method to synthesize a policy that minimizes the weighted sum of the transfer entropy and the probability of failure to satisfy the specification. We derive a set of coupled non-linear equations that an optimal policy must satisfy. We then use a modified Arimoto-Blahut algorithm to solve the non-linear equations. Finally, we demonstrated the proposed method on a navigation and path planning scenario of a Mars rover.
In a variety of applications, an agent’s success depends on the knowledge that an adversarial observer has or can gather about the agent’s decisions. It is therefore desirable for the agent to achieve a task while reducing the ability of an observer to infer the agent’s policy. We consider the task of the agent as a reachability problem in a Markov decision process and study the synthesis of policies that minimize the observer’s ability to infer the transition probabilities of the agent between the states of the Markov decision process. We introduce a metric that is based on the Fisher information as a proxy for the information leaked to the observer and using this metric formulate a problem that minimizes expected total information subject to the reachability constraint. We proceed to solve the problem using convex optimization methods. To verify the proposed method, we analyze the relationship between the expected total information and the estimation error of the observer, and show that, for a particular class of Markov decision processes, these two values are inversely proportional.
The widespread online misinformation could cause public panic and serious economic damages. The misinformation containment problem aims at limiting the spread of misinformation in online social networks by launching competing campaigns. Motivated by realistic scenarios, we present the first analysis of the misinformation containment problem for the case when an arbitrary number of cascades are allowed. This paper makes four contributions. First, we provide a formal model for multi-cascade diffusion and introduce an important concept called as cascade priority. Second, we show that the misinformation containment problem cannot be approximated within a factor of $\Omega(2^{\log^{1-\epsilon}n^4})$ in polynomial time unless $NP \subseteq DTIME(n^{\polylog{n}})$. Third, we introduce several types of cascade priority that are frequently seen in real social networks. Finally, we design novel algorithms for solving the misinformation containment problem. The effectiveness of the proposed algorithm is supported by encouraging experimental results.
Adversarial machine learning in the context of image processing and related applications has received a large amount of attention. However, adversarial machine learning, especially adversarial deep learning, in the context of malware detection has received much less attention despite its apparent importance. In this paper, we present a framework for enhancing the robustness of Deep Neural Networks (DNNs) against adversarial malware samples, dubbed Hashing Transformation Deep Neural Networks} (HashTran-DNN). The core idea is to use hash functions with a certain locality-preserving property to transform samples to enhance the robustness of DNNs in malware classification. The framework further uses a Denoising Auto-Encoder (DAE) regularizer to reconstruct the hash representations of samples, making the resulting DNN classifiers capable of attaining the locality information in the latent space. We experiment with two concrete instantiations of the HashTran-DNN framework to classify Android malware. Experimental results show that four known attacks can render standard DNNs useless in classifying Android malware, that known defenses can at most defend three of the four attacks, and that HashTran-DNN can effectively defend against all of the four attacks.
Sample entropy ($SampEn$) has been accepted as an alternate, and sometimes a replacement, measure to approximate entropy ($ApEn$) for characterizing temporal complexity of time series. However, it still suffers from issues such as inconsistency over short-length signals and its tolerance parameter $r$, susceptibility to signal amplitude changes and insensitivity to self-similarity of time series. We propose modifications to the $ApEn$ and $SampEn$ measures which are defined for 0<$r$<1, are more robust to signal amplitude changes and sensitive to self-similarity property of time series. We modified $ApEn$ and $SampEn$ by redefining the distance function used originally in their definitions. We then evaluated the new entropy measures, called range entropies ($RangeEn$) using different random processes and nonlinear deterministic signals. We further applied the proposed entropies to normal and epileptic electroencephalographic (EEG) signals under different states. Our results suggest that, unlike $ApEn$ and $SampEn$, $RangeEn$ measures are robust to stationary and nonstationary signal amplitude variations and that their trajectories in the tolerance r-plane are constrained between 0 (maximum entropy) and 1 (minimum entropy). We also showed that $RangeEn$ have direct relationships with the Hurst exponent; suggesting that the new definitions are sensitive to self-similarity structures of signals. $RangeEn$ analysis of epileptic EEG data showed distinct behaviours in the $r$-domain for extracranial versus intracranial recordings as well as different states of epileptic EEG data. The constrained trajectory of $RangeEn$ in the r-plane makes them a good candidate for studying complex biological signals such as EEG during seizure and non-seizure states. The Python package used to generate the results shown in this paper is publicly available at: https://…/RangeEn.
Despite its simplicity, bag-of-n-grams sentence representation has been found to excel in some NLP tasks. However, it has not received much attention in recent years and further analysis on its properties is necessary. We propose a framework to investigate the amount and type of information captured in a general-purposed bag-of-n-grams sentence representation. We first use sentence reconstruction as a tool to obtain bag-of-n-grams representation that contains general information of the sentence. We then run prediction tasks (sentence length, word content, phrase content and word order) using the obtained representation to look into the specific type of information captured in the representation. Our analysis demonstrates that bag-of-n-grams representation does contain sentence structure level information. However, incorporating n-grams with higher order n empirically helps little with encoding more information in general, except for phrase content information.
Classification models are often used to make decisions that affect humans: whether to approve a loan application, extend a job offer, or provide insurance. In such applications, individuals should have the ability to change the decision of the model. When a person is denied a loan by a credit scoring model, for example, they should be able to change the input variables of the model in a way that will guarantee approval. Otherwise, this person will be denied the loan so long as the model is deployed, and — more importantly — will lack agency over a decision that affects their livelihood. In this paper, we propose to audit a linear classification model in terms of recourse, which we define as the ability of a person to change the decision of the model through actionable input variables (e.g., income vs. gender, age, or marital status). We present an integer programming toolkit to: (i) measure the feasibility and difficulty of recourse in a target population; and (ii) generate a list of actionable changes for an individual to obtain a desired outcome. We demonstrate how our tools can inform practitioners, policymakers, and consumers by auditing credit scoring models built using real-world datasets. Our results illustrate how recourse can be significantly impacted by common modeling practices, and motivate the need to guarantee recourse as a policy objective for regulation in algorithmic decision-making.
Black box discrete optimization (BBDO) appears in wide range of engineering tasks. Evolutionary or other BBDO approaches have been applied, aiming at automating necessary tuning of system parameters, such as hyper parameter tuning of machine learning based systems when being installed for a specific task. However, automation is often jeopardized by the need of strategy parameter tuning for BBDO algorithms. An expert with the domain knowledge must undergo time-consuming strategy parameter tuning. This paper proposes a parameterless BBDO algorithm based on information geometric optimization, a recent framework for black box optimization using stochastic natural gradient. Inspired by some theoretical implications, we develop an adaptation mechanism for strategy parameters of the stochastic natural gradient method for discrete search domains. The proposed algorithm is evaluated on commonly used test problems. It is further extended to two examples of simultaneous optimization of the hyper parameters and the connection weights of deep learning models, leading to a faster optimization than the existing approaches without any effort of parameter tuning.
R (Version 3.5.1 patched) has two issues with its random sampling functionality. First, it uses a version of the Mersenne Twister known to have a seeding problem, which was corrected by the authors of the Mersenne Twister in 2002. Updated C source code is available at http://…/mt19937ar.c. Second, R generates random integers between $1$ and $m$ by multiplying random floats by $m$, taking the floor, and adding $1$ to the result. Well-known quantization effects in this approach result in a non-uniform distribution on $\{ 1, \ldots, m\}$. The difference, which depends on $m$, can be substantial. Because the sample function in R relies on generating random integers, random sampling in R is biased. There is an easy fix: construct random integers directly from random bits, rather than multiplying a random float by $m$. That is the strategy taken in Python’s numpy.random.randint() function, among others. Example source code in Python is available at https://…/cryptorandom.py (see functions getrandbits() and randbelow_from_randbits()).
We estimate the proper channel (width) scaling of Convolution Neural Networks (CNNs) for model reduction. Unlike the traditional scaling method that reduces every CNN channel width by the same scaling factor, we address each CNN macroblock adaptively depending on its information redundancy measured by our proposed effective flops. Our proposed macroblock scaling (MBS) algorithm can be applied to various CNN architectures to reduce their model size. These applicable models range from compact CNN models such as MobileNet (25.53% reduction, ImageNet) and ShuffleNet (20.74% reduction, ImageNet) to ultra-deep ones such as ResNet-101 (51.67% reduction, ImageNet) and ResNet-1202 (72.71% reduction, CIFAR-10) with negligible accuracy degradation. MBS also performs better reduction at a much lower cost than does the state-of-the-art optimization-based method. MBS’s simplicity and efficiency, its flexibility to work with any CNN model, and its scalability to work with models of any depth makes it an attractive choice for CNN model size reduction.
For using neural networks in safety critical domains, it is important to know if a decision made by a neural network is supported by prior similarities in training. We propose runtime neuron activation pattern monitoring – after the standard training process, one creates a monitor by feeding the training data to the network again in order to store the neuron activation patterns in abstract form. In operation, a classification decision over an input is further supplemented by examining if a pattern similar (measured by Hamming distance) to the generated pattern is contained in the monitor. If the monitor does not contain any pattern similar to the generated pattern, it raises a warning that the decision is not based on the training data. Our experiments show that, by adjusting the similarity-threshold for activation patterns, the monitors can report a significant portion of misclassfications to be not supported by training with a small false-positive rate, when evaluated on a test set.
Conversational agents are gaining popularity with the increasing ubiquity of smart devices. However, training agents in a data driven manner is challenging due to a lack of suitable corpora. This paper presents a novel method for gathering topical, unstructured conversational data in an efficient way: self-dialogues through crowd-sourcing. Alongside this paper, we include a corpus of 3.6 million words across 23 topics. We argue the utility of the corpus by comparing self-dialogues with standard two-party conversations as well as data from other corpora.
Rotation forest is a tree based ensemble that performs transforms on subsets of attributes prior to constructing each tree. We present an empirical comparison of classifiers for problems with only real valued features. We evaluate classifiers from three families of algorithms: support vector machines; tree-based ensembles; and neural networks. We compare classifiers on unseen data based on the quality of the decision rule (using classification error) the ability to rank cases (area under the receiver operator curve) and the probability estimates (using negative log likelihood). We conclude that, in answer to the question posed in the title, yes, rotation forest, is significantly more accurate on average than competing techniques when compared on three distinct sets of datasets. The same pattern of results are observed when tuning classifiers on the train data using a grid search. We investigate why rotation forest does so well by testing whether the characteristics of the data can be used to differentiate classifier performance. We assess the impact of the design features of rotation forest through an ablative study that transforms random forest into rotation forest. We identify the major limitation of rotation forest as its scalability, particularly in number of attributes. To overcome this problem we develop a model to predict the train time of the algorithm and hence propose a contract version of rotation forest where a run time cap {\em a priori}. We demonstrate that on large problems rotation forest can be made an order of magnitude faster without significant loss of accuracy and that there is no real benefit (on average) from tuning the ensemble. We conclude that without any domain knowledge to indicate an algorithm preference, rotation forest should be the default algorithm of choice for problems with continuous attributes.
A family of algorithms for time series classification (TSC) involve running a sliding window across each series, discretising the window to form a word, forming a histogram of word counts over the dictionary, then constructing a classifier on the histograms. A recent evaluation of two of this type of algorithm, Bag of Patterns (BOP) and Bag of Symbolic Fourier Approximation Symbols (BOSS) found a significant difference in accuracy between these seemingly similar algorithms. We investigate this phenomenon by deconstructing the classifiers and measuring the relative importance of the four key components between BOP and BOSS. We find that whilst ensembling is a key component for both algorithms, the effect of the other components is mixed and more complex. We conclude that BOSS represents the state of the art for dictionary based TSC. Both BOP and BOSS can be classed as bag of words approaches. These are particularly popular in Computer Vision for tasks such as image classification. Converting approaches from vision requires careful engineering. We adapt three techniques used in Computer Vision for TSC: Scale Invariant Feature Transform; Spatial Pyramids; and Histrogram Intersection. We find that using Spatial Pyramids in conjunction with BOSS (SP) produces a significantly more accurate classifier. SP is significantly more accurate than standard benchmarks and the original BOSS algorithm. It is not significantly worse than the best shapelet based approach, and is only outperformed by HIVE-COTE, an ensemble that includes BOSS as a constituent module.
We propose the genetic algorithm for time window optimization, which is an embedded genetic algorithm (GA), to optimize the time window (TW) of the attributes using feature selection and support vector machine. This GA is evolved using the results of a trading simulation, and it determines the best TW for each technical indicator. An appropriate evaluation was conducted using a walk-forward trading simulation, and the trained model was verified to be generalizable for forecasting other stock data. The results show that using the GA to determine the TW can improve the rate of return, leading to better prediction models than those resulting from using the default TW.
Parallel dataflow systems have become a standard technology for large-scale data analytics. Complex data analysis programs in areas such as machine learning and graph analytics often involve control flow, i.e., iterations and branching. Therefore, systems for advanced analytics should include control flow constructs that are efficient and easy to use. A natural approach is to provide imperative control flow constructs similar to those of mainstream programming languages: while-loops, if-statements, and mutable variables, whose values can change between iteration steps. However, current parallel dataflow systems execute programs written using imperative control flow constructs by launching a separate dataflow job after every control flow decision (e.g., for every step of a loop). The performance of this approach is suboptimal, because (a) launching a dataflow job incurs scheduling overhead; and (b) it prevents certain optimizations across iteration steps. In this paper, we introduce Labyrinth, a method to compile programs written using imperative control flow constructs to a single dataflow job, which executes the whole program, including all iteration steps. This way, we achieve both efficiency and ease of use. We also conduct an experimental evaluation, which shows that Labyrinth has orders of magnitude smaller per-iteration-step overhead than launching new dataflow jobs, and also allows for significant optimizations across iteration steps.
This paper evaluates the performance of the K-nearest neighbor classification algorithm on the MNIST dataset of the handwritten digits. The L2 Euclidean distance metric is compared to a modified distance metric which utilizes the sliding window technique in order to avoid performance degradations due to slight spatial misalignments. Accuracy and confusion matrix are used as the performance indicators to compare the performance of the baseline algorithm versus the enhanced sliding window method and results show significant improvement using this simple method.
While a lot of progress has been made in recent years, the dynamics of learning in deep nonlinear neural networks remain to this day largely misunderstood. In this work, we study the case of binary classification and prove various properties of learning in such networks under strong assumptions such as linear separability of the data. Extending existing results from the linear case, we confirm empirical observations by proving that the classification error also follows a sigmoidal shape in nonlinear architectures. We show that given proper initialization, learning expounds parallel independent modes and that certain regions of parameter space might lead to failed training. We also demonstrate that input norm and features’ frequency in the dataset lead to distinct convergence speeds which might shed some light on the generalization capabilities of deep neural networks. We provide a comparison between the dynamics of learning with cross-entropy and hinge losses, which could prove useful to understand recent progress in the training of generative adversarial networks. Finally, we identify a phenomenon that we baptize gradient starvation where the most frequent features in a dataset prevent the learning of other less frequent but equally informative features.

# If you did not already know

Quasi-KL Divergence (QKL)
Dropout, a stochastic regularisation technique for training of neural networks, has recently been reinterpreted as a specific type of approximate inference algorithm for Bayesian neural networks. The main contribution of the reinterpretation is in providing a theoretical framework useful for analysing and extending the algorithm. We show that the proposed framework suffers from several issues; from undefined or pathological behaviour of the true posterior related to use of improper priors, to an ill-defined variational objective due to singularity of the approximating distribution relative to the true posterior. Our analysis of the improper log uniform prior used in variational Gaussian dropout suggests the pathologies are generally irredeemable, and that the algorithm still works only because the variational formulation annuls some of the pathologies. To address the singularity issue, we proffer Quasi-KL (QKL) divergence, a new approximate inference objective for approximation of high-dimensional distributions. We show that motivations for variational Bernoulli dropout based on discretisation and noise have QKL as a limit. Properties of QKL are studied both theoretically and on a simple practical example which shows that the QKL-optimal approximation of a full rank Gaussian with a degenerate one naturally leads to the Principal Component Analysis solution. …

Geometric Operator Convolutional Neural Network (GO-CNN)
The Convolutional Neural Network (CNN) has been successfully applied in many fields during recent decades; however it lacks the ability to utilize prior domain knowledge when dealing with many realistic problems. We present a framework called Geometric Operator Convolutional Neural Network (GO-CNN) that uses domain knowledge, wherein the kernel of the first convolutional layer is replaced with a kernel generated by a geometric operator function. This framework integrates many conventional geometric operators, which allows it to adapt to a diverse range of problems. Under certain conditions, we theoretically analyze the convergence and the bound of the generalization errors between GO-CNNs and common CNNs. Although the geometric operator convolution kernels have fewer trainable parameters than common convolution kernels, the experimental results indicate that GO-CNN performs more accurately than common CNN on CIFAR-10/100. Furthermore, GO-CNN reduces dependence on the amount of training examples and enhances adversarial stability. In the practical task of medically diagnosing bone fractures, GO-CNN obtains 3% improvement in terms of the recall. …

QMiner
QMiner is a data analytics platform for processing large-scale real-time streams containing structured and unstructured data. …

# Magister Dixit

“Analysts will need a proper understanding of math, statistics, algorithms, and other related sciences in order to deliver meaningful results. They must pair that theoretical knowledge with a firm grasp of the modern-day tools that make the analyses possible. That means having an ability to express queries in terms of MapReduce or some other distributed system, an understanding of how to model data storage across different NoSQL-style systems, and familiarity with libraries that implement common algorithms.” Q Ethan McCallum, Ken Gleason ( 2013 )

# R Packages worth a look

Shiny Applications Internationalization (shiny.i18n)
It provides easy internationalization of Shiny applications. It can be used as standalone translation package to translate reports, interactive visuali …

Optimized Elo Rating Method for Obtaining Dominance Ranks (EloOptimized)
Provides an implementation of the maximum likelihood methods for deriving Elo scores as published in Foerster, Franz et al. (2016) <DOI:10.1038/srep …

Tools for Message Passing Between Processes (ipc)
Provides tools for passing messages between R processes. Shiny Examples are provided showing how to perform useful tasks such as: updating reactive val …

# Distilled News

Have you ever sat down with the code and data for an existing machine learning project, trained the same model, checked your results… and found that they were different from the original results?
Purpose Built Analytic Modules (PBAMs) such as those for Fraud Detection represent a fourth way to practice data science, a new model for the good use of Citizen Data Scientists, and a new market for AI-first companies.
Natural Language Processing (NLP) is the ability of a computer system to understand human language. Natural Langauge Processing is a subset of Artificial Intelligence (AI). There are multiple resources available online which can help you develop expertise in Natural Language Processing. In this blog post, we list resources for the beginners and intermediate level learners.
Last week, I presented a session at the ‘AI With The Best’ conference about one of my favorite topics, decentralized artificial intelligence(AI). The ‘AI With the Best’ conference is notorious for bringing together a rate mix of AI researchers and practitioners as part of the same audience so, as a speaker, you have to have the right balance between deep AI research and practical topics. In the case of my talk, I tried to summarize some of the ideas I’ve been exploring in the decentralized AI space.
Text summarization refers to the technique of shortening long pieces of text. The intention is to create a coherent and fluent summary having only the main points outlined in the document. Automatic text summarization is a common problem in machine learning and natural language processing (NLP).
In a previous post, we covered how we can leverage reCAPTCHA, Mechanical Turk, Figure Eight, or PyBOSSA to reach a large crowd of workers to effectively crowdsource our annotation tasks. But what’s the secret to a successful crowdsource campaign to annotate your business dataset?
In Part I of this III part series I will cover the what Supervised Learning Algorithms are, how they work, and a few examples of where they can be applied.

# Book Memo: “Application of FPGA to Real-Time Machine Learning”

 Hardware Reservoir Computers and Software Image Processing This book lies at the interface of machine learning – a subfield of computer science that develops algorithms for challenging tasks such as shape or image recognition, where traditional algorithms fail – and photonics – the physical science of light, which underlies many of the optical communications technologies used in our information society. It provides a thorough introduction to reservoir computing and field-programmable gate arrays (FPGAs). Recently, photonic implementations of reservoir computing (a machine learning algorithm based on artificial neural networks) have made a breakthrough in optical computing possible. In this book, the author pushes the performance of these systems significantly beyond what was achieved before. By interfacing a photonic reservoir computer with a high-speed electronic device (an FPGA), the author successfully interacts with the reservoir computer in real time, allowing him to considerably expand its capabilities and range of possible applications. Furthermore, the author draws on his expertise in machine learning and FPGA programming to make progress on a very different problem, namely the real-time image analysis of optical coherence tomography for atherosclerotic arteries.