Distilled News

Heatmapping is a simple and efficient way to analyze visitor interaction and user behavior on your website. If you are in a Conversion Rate Optimization (a.k.a. CRO) project with your e-commerce or startup (or any other online) business, it’s indispensable to run some website heatmaps – such as click, mouse movement or scroll heatmaps.
In my last post I did some drawings based on L-Systems. These drawings are done sequentially. At any step, the state of the drawing can be described by the position (coordinates) and the orientation of the pencil. In that case I only used two kinds of operators: drawing a straight line and turning a constant angle.
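To make the pencil state concrete, here is a minimal Python sketch of such a sequential drawing. The symbols `F`, `+` and `-`, the function names and the example rewriting rule are assumptions chosen for illustration; they are not taken from the original post.

```python
import math

def expand(axiom, rules, iterations):
    """Rewrite every symbol in parallel, `iterations` times."""
    s = axiom
    for _ in range(iterations):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s

def trace_path(commands, step=1.0, angle_deg=90.0):
    """Return the points visited by the pencil.

    Assumed convention: 'F' draws a straight line of length `step`,
    '+' turns left and '-' turns right by the constant angle.
    """
    x, y, heading = 0.0, 0.0, 0.0  # state: position and orientation
    points = [(x, y)]
    for c in commands:
        if c == "F":
            x += step * math.cos(math.radians(heading))
            y += step * math.sin(math.radians(heading))
            points.append((x, y))
        elif c == "+":
            heading += angle_deg
        elif c == "-":
            heading -= angle_deg
    return points

# Four lines and four left turns of 90 degrees close a square.
square = trace_path("F+F+F+F")

# Two rewriting passes of a hypothetical rule, then trace the result.
koch = expand("F", {"F": "F+F-F-F+F"}, 2)
koch_points = trace_path(koch)
```

The list of points can be handed to any plotting routine; the only state carried between steps is the position and the heading.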
Crowding is a visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. In this work, we study the effect of crowding in artificial Deep Neural Networks for object recognition. We analyze both standard deep convolutional neural networks (DCNNs) as well as a new version of DCNNs which 1) is multi-scale and 2) has convolution filters whose size changes depending on the eccentricity with respect to the center of fixation. Such networks, which we call eccentricity-dependent, are a computational model of the feedforward path of the primate visual cortex. Our results reveal that the eccentricity-dependent model, trained on target objects in isolation, can recognize such targets in the presence of flankers, if the targets are near the center of the image, whereas DCNNs cannot. Also, for all tested networks, when trained on targets in isolation, we find that recognition accuracy of the networks decreases the closer the flankers are to the target and the more flankers there are. We find that visual similarity between the target and flankers also plays a role and that pooling in early layers of the network leads to more crowding. Additionally, we show that incorporating the flankers into the images of the training set does not improve performance with crowding.
Predictive maintenance is widely considered to be the obvious next step for any business with high-capital assets: harness machine learning to control rising equipment maintenance costs and pave the way for self maintenance through artificial intelligence (AI).
What makes BI tools great? What features are important while selecting a good BI tool? Let’s have a look.
If you use an API key to access a secure service, or need to use a password to access a protected database, you’ll need to provide these ‘secrets’ in your R code somewhere. That’s easy to do if you just include those keys as strings in your code, but it’s not very secure. This means your private keys and passwords are stored in plain text on your hard drive, and if you email your script they’re available to anyone who can intercept that email. It’s also really easy to inadvertently include those keys in a public repo if you use GitHub or similar code-sharing services. To address this problem, Gábor Csárdi and Andrie de Vries created the secret package for R. The secret package integrates with OpenSSH, providing R functions that allow you to create a vault for keys on your local machine, define trusted users who can access those keys, and then include encrypted keys in R scripts or packages that can only be decrypted by you or by people you trust.
How can information from different sources be taken into account and compared? Multiple Factor Analysis is a principal component method that deals with datasets containing quantitative and/or categorical variables structured by groups. Here is a course with videos that present the Multiple Factor Analysis method.
How to analyse categorical data? Here is a course with videos that present Multiple Correspondence Analysis in a French way. The most well-known use of Multiple Correspondence Analysis is surveys. Four videos present a course on MCA, highlighting the way to interpret the data. Then you will find videos presenting the way to implement MCA in FactoMineR, to deal with missing values in MCA thanks to the package missMDA, and lastly a video on drawing interactive graphs with Factoshiny. Finally, you will see that the new package FactoInvestigate automatically produces an interpretation of your MCA results. With this course, you will be able to perform MCA and interpret its results on your own.

Document worth reading: “Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes”

It is widely observed that deep learning models with learned parameters generalize well, even with many more model parameters than the number of training samples. We systematically investigate the underlying reasons why deep neural networks often generalize well, and reveal the difference between the minima (with the same training error) that generalize well and those that don’t. We show that it is the characteristics of the landscape of the loss function that explain the good generalization capability. For the landscape of the loss function for deep networks, the volume of the basin of attraction of good minima dominates over that of poor minima, which guarantees that optimization methods with random initialization converge to good minima. We theoretically justify our findings through analyzing 2-layer neural networks, and show that the low-complexity solutions have a small norm of the Hessian matrix with respect to model parameters. For deeper networks, extensive numerical evidence helps to support our arguments. Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

If you did not already know

We consider the problem of distributed statistical machine learning in adversarial settings, where some unknown and time-varying subset of working machines may be compromised and behave arbitrarily to prevent an accurate model from being learned. This setting captures the potential adversarial attacks faced by Federated Learning, a modern machine learning paradigm that is proposed by Google researchers and has been intensively studied for ensuring user privacy. Formally, we focus on a distributed system consisting of a parameter server and $m$ working machines. Each working machine keeps $N/m$ data samples, where $N$ is the total number of samples. The goal is to collectively learn the underlying true model parameter of dimension $d$. In classical batch gradient descent methods, the gradients reported to the server by the working machines are aggregated via simple averaging, which is vulnerable to a single Byzantine failure. In this paper, we propose a Byzantine gradient descent method based on the geometric median of means of the gradients. We show that our method can tolerate $q \le (m-1)/2$ Byzantine failures, and the parameter estimate converges in $O(\log N)$ rounds with an estimation error of $\sqrt{d(2q+1)/N}$, hence approaching the optimal error rate $\sqrt{d/N}$ in the centralized and failure-free setting. The total computational complexity of our algorithm is $O((Nd/m) \log N)$ at each working machine and $O(md + kd \log^3 N)$ at the central server, and the total communication cost is $O(m d \log N)$. We further provide an application of our general results to the linear regression problem. A key challenge that arises in the above problem is that Byzantine failures create arbitrary and unspecified dependency among the iterations and the aggregated gradients. We prove that the aggregated gradient converges uniformly to the true gradient function. …
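The aggregation idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the batch assignment by stride, the Weiszfeld iteration used to approximate the geometric median, and all function names are choices made here. The toy run also assumes that fewer than half of the batches contain a faulty report.

```python
import math

def mean_vec(vectors):
    """Coordinate-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def geometric_median(points, iters=200, eps=1e-12):
    """Approximate the geometric median with Weiszfeld iterations
    (an assumed solver; the abstract does not name one)."""
    z = mean_vec(points)
    for _ in range(iters):
        weights = []
        for p in points:
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, z)))
            weights.append(1.0 / max(dist, eps))  # guard against dist == 0
        total = sum(weights)
        z = [sum(w * p[j] for w, p in zip(weights, points)) / total
             for j in range(len(z))]
    return z

def median_of_means(gradients, k):
    """Split the reports into k batches, average each batch,
    then take the geometric median of the batch means."""
    batches = [gradients[i::k] for i in range(k)]
    return geometric_median([mean_vec(b) for b in batches])

# Nine honest machines report similar gradients; one reports garbage.
honest = [[1.0 + 0.01 * i, -2.0 - 0.01 * i] for i in range(9)]
byzantine = [[1e6, 1e6]]
reports = honest + byzantine

naive = mean_vec(reports)               # simple averaging is ruined
robust = median_of_means(reports, k=5)  # stays close to the honest reports
```

A single corrupted report drags the plain average arbitrarily far, while it spoils only one batch mean, and the geometric median of the batch means remains near the honest cluster.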

Least-Angle Regression (LARS)
In statistics, least-angle regression (LARS) is a regression algorithm for high-dimensional data, developed by Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani. Suppose we expect a response variable to be determined by a linear combination of a subset of potential covariates. Then the LARS algorithm provides a means of producing an estimate of which variables to include, as well as their coefficients. Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the L1 norm of the parameter vector. The algorithm is similar to forward stepwise regression, but instead of including variables at each step, the estimated parameters are increased in a direction equiangular to each one’s correlations with the residual. …

False Positive Rate
In statistics, when performing multiple comparisons, the term false positive ratio, also known as the false alarm ratio, usually refers to the probability of falsely rejecting the null hypothesis for a particular test. The false positive rate (or “false alarm rate”) usually refers to the expectancy of the false positive ratio.
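As a small worked example, the false positive ratio for a batch of tests can be computed as the number of true null hypotheses that were rejected divided by the number of true null hypotheses. The sketch below assumes this common FP / (FP + TN) convention; the data and names are made up for illustration.

```python
def false_positive_ratio(null_is_true, rejected):
    """FP / (FP + TN): share of true nulls that were wrongly rejected."""
    fp = sum(1 for t, r in zip(null_is_true, rejected) if t and r)
    tn = sum(1 for t, r in zip(null_is_true, rejected) if t and not r)
    return fp / (fp + tn)

# Six hypothetical tests: four nulls are actually true, two are false.
null_is_true = [True, True, True, True, False, False]
rejected = [False, True, False, False, True, False]

ratio = false_positive_ratio(null_is_true, rejected)  # 1 of 4 true nulls rejected
```

Averaging this ratio over many repetitions of the same experiment would estimate the false positive rate described above.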

Magister Dixit

“Some decisions you need to make are big enough to change the course for your business. And your past experiences may not be good predictors of the future. More data are within your reach to understand what was previously unknown. Sophisticated analytical tools are available to you to ‘see’ a wider range of possibilities and evaluate them quickly. Now is a good time for an upgrade in your decision making capabilities.” PWC (2014)

Document worth reading: “Towards Statistical Reasoning in Description Logics over Finite Domains (Full Version)”

We present a probabilistic extension of the description logic $\mathcal{ALC}$ for reasoning about statistical knowledge. We consider conditional statements over proportions of the domain and are interested in the probabilistic-logical consequences of these proportions. After introducing some general reasoning problems and analyzing their properties, we present first algorithms and complexity results for reasoning in some fragments of Statistical $\mathcal{ALC}$. Towards Statistical Reasoning in Description Logics over Finite Domains (Full Version)

What’s new on arXiv

Most state-of-the-art subspace clustering methods only work with linear (or affine) subspaces. In this paper, we present a kernel subspace clustering method that can handle non-linear models. While an arbitrary kernel can non-linearly map data into high-dimensional Hilbert feature space, the data in the resulting feature space are very unlikely to have the desired subspace structures. By contrast, we propose to learn a low-rank kernel mapping, with which the mapped data in feature space are not only low-rank but also self-expressive, such that the low-dimensional subspace structures are present and manifested in the high-dimensional feature space. We have evaluated the proposed method extensively on both motion segmentation and image clustering benchmarks, and obtained superior results, outperforming the kernel subspace clustering method that uses standard kernels and other state-of-the-art linear subspace clustering methods.
We introduce a novel variant of the multi-armed bandit problem, in which bandits are streamed one at a time to the player, and at each point, the player can either choose to pull the current bandit or move on to the next bandit. Once a player has moved on from a bandit, they may never visit it again, which is a crucial difference between our problem and classic multi-armed bandit problems. In this online context, we study Bernoulli bandits (bandits with payout Ber($p_i$) for some underlying mean $p_i$) with underlying means drawn i.i.d. from various distributions, including the uniform distribution, and in general, all distributions that have a CDF satisfying certain differentiability conditions near zero. In all cases, we suggest several strategies and investigate their expected performance. Furthermore, we bound the performance of any optimal strategy and show that the strategies we have suggested are indeed optimal up to a constant factor. We also investigate the case where the distribution from which the underlying means are drawn is not known ahead of time. Again, we are able to suggest algorithms that are optimal up to a constant factor for this case, given certain mild conditions on the universe of distributions.
Recent works on representation learning for graph structured data predominantly focus on learning distributed representations of graph substructures such as nodes and subgraphs. However, many graph analytics tasks such as graph classification and clustering require representing entire graphs as fixed length feature vectors. While the aforementioned approaches are naturally unequipped to learn such representations, graph kernels remain the most effective way of obtaining them. However, these graph kernels use handcrafted features (e.g., shortest paths, graphlets, etc.) and hence are hampered by problems such as poor generalization. To address this limitation, in this work, we propose a neural embedding framework named graph2vec to learn data-driven distributed representations of arbitrary sized graphs. graph2vec’s embeddings are learnt in an unsupervised manner and are task agnostic. Hence, they could be used for any downstream task such as graph classification, clustering and even seeding supervised representation learning approaches. Our experiments on several benchmark and large real-world datasets show that graph2vec achieves significant improvements in classification and clustering accuracies over substructure representation learning approaches and is competitive with state-of-the-art graph kernels.
Modeling physiological time-series in ICU is of high clinical importance. However, data collected within ICU are irregular in time and often contain missing measurements. Since absence of a measure would signify its lack of importance, the missingness is indeed informative and might reflect the decision making by the clinician. Here we propose a deep learning architecture that can effectively handle these challenges for predicting ICU mortality outcomes. The model is based on Long Short-Term Memory, and has layered attention mechanisms. At the sensing layer, the model decides whether to observe and incorporate parts of the current measurements. At the reasoning layer, evidences across time steps are weighted and combined. The model is evaluated on the PhysioNet 2012 dataset showing competitive and interpretable results.
Today’s conversational agents are restricted to simple standalone commands. In this paper, we present Iris, an agent that draws on human conversational strategies to combine commands, allowing it to perform more complex tasks that it has not been explicitly designed to support: for example, composing one command to ‘plot a histogram’ with another to first ‘log-transform the data’. To enable this complexity, we introduce a domain specific language that transforms commands into automata that Iris can compose, sequence, and execute dynamically by interacting with a user through natural language, as well as a conversational type system that manages what kinds of commands can be combined. We have designed Iris to help users with data science tasks, a domain that requires support for command combination. In evaluation, we find that data scientists complete a predictive modeling task significantly faster (2.6 times speedup) with Iris than a modern non-conversational programming environment. Iris supports the same kinds of commands as today’s agents, but empowers users to weave together these commands to accomplish complex goals.
Gradient boosting is a state-of-the-art prediction technique that sequentially produces a model in the form of linear combinations of simple predictors—typically decision trees—by solving an infinite-dimensional convex optimization problem. We provide in the present paper a thorough analysis of two widespread versions of gradient boosting, and introduce a general framework for studying these algorithms from the point of view of functional optimization. We prove their convergence as the number of iterations tends to infinity and highlight the importance of having a strongly convex risk functional to minimize. We also present a reasonable statistical context ensuring consistency properties of the boosting predictors as the sample size grows. In our approach, the optimization procedures are run forever (that is, without resorting to an early stopping strategy), and statistical regularization is basically achieved via an appropriate $L^2$ penalization of the loss and strong convexity arguments.
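For readers who want to see the sequential construction in code, here is a minimal Python sketch of gradient boosting with squared-error loss and one-split regression stumps on one-dimensional inputs. It is a plain textbook variant with a fixed number of rounds and a shrinkage factor, not the penalized procedures analysed in the paper; the function names, the toy data and the learning rate are assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep the right side non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Return a model that is a linear combination of stumps."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - f)^2 with respect to f is y - f.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of `model` on the given points."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]
short_model = gradient_boost(xs, ys, n_rounds=5)
long_model = gradient_boost(xs, ys, n_rounds=50)
```

Comparing `mse(short_model, xs, ys)` with `mse(long_model, xs, ys)` shows the training error shrinking as more stumps are added, which is the behaviour whose limit the paper studies.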
We propose a neural reranking system for named entity recognition (NER). The basic idea is to leverage recurrent neural network models to learn sentence-level patterns that involve named entity mentions. In particular, given an output sentence produced by a baseline NER model, we replace all entity mentions, such as Barack Obama, with their entity types, such as PER. The resulting sentence patterns contain direct output information, yet are less sparse without specific named entities. For example, ‘PER was born in LOC’ can be such a pattern. LSTM and CNN structures are utilised for learning deep representations of such sentences for reranking. Results show that our system can significantly improve the NER accuracies over two different baselines, giving the best reported results on a standard benchmark.
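The pattern-building step can be illustrated with a short Python sketch. It assumes the baseline output is available as well-formed BIO tags, which the abstract does not specify; the function name and the example sentence are illustrative only.

```python
def to_pattern(tokens, tags):
    """Collapse each BIO-tagged entity mention into its entity type."""
    pattern = []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            pattern.append(token)    # ordinary word: keep as is
        elif tag.startswith("B-"):
            pattern.append(tag[2:])  # start of a mention: emit its type once
        # "I-" tags continue the current mention and add nothing.
    return " ".join(pattern)

tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
pattern = to_pattern(tokens, tags)
```

Each candidate tagging from the baseline yields one such pattern, and the reranker scores the patterns rather than the raw sentences.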
AI systems are increasingly applied to complex tasks that involve interaction with humans. During training, such systems are potentially dangerous, as they haven’t yet learned to avoid actions that could cause serious harm. How can an AI system explore and learn without making a single mistake that harms humans or otherwise causes serious damage? For model-free reinforcement learning, having a human ‘in the loop’ and ready to intervene is currently the only way to prevent all catastrophes. We formalize human intervention for RL and show how to reduce the human labor required by training a supervised learner to imitate the human’s intervention decisions. We evaluate this scheme on Atari games, with a Deep RL agent being overseen by a human for four hours. When the class of catastrophes is simple, we are able to prevent all catastrophes without affecting the agent’s learning (whereas an RL baseline fails due to catastrophic forgetting). However, this scheme is less successful when catastrophes are more complex: it reduces but does not eliminate catastrophes and the supervised learner fails on adversarial examples found by the agent. Extrapolating to more challenging environments, we show that our implementation would not scale (due to the infeasible amount of human labor required). We outline extensions of the scheme that are necessary if we are to train model-free agents without a single catastrophe.
Representing relationships as translations in vector space lies at the heart of many neural embedding models such as word embeddings and knowledge graph embeddings. In this work, we study the connections of this translational principle with collaborative filtering algorithms. We propose Translational Recommender Networks (TransRec), a new attentive neural architecture that utilizes the translational principle to model the relationships between user and item pairs. Our model employs a neural attention mechanism over a Latent Relational Attentive Memory (LRAM) module to learn the latent relations between user-item pairs that best explain the interaction. By exploiting adaptive user-item specific translations in vector space, our model also alleviates the geometric inflexibility problem of other metric learning algorithms while enabling greater modeling capability and fine-grained fitting of users and items in vector space. The proposed architecture not only demonstrates state-of-the-art performance across multiple recommendation benchmarks but also boasts improved interpretability. Qualitative studies over the LRAM module show evidence that our proposed model is able to infer and encode explicit sentiment, temporal and attribute information despite being only trained on implicit feedback. As such, this ascertains the ability of TransRec to uncover hidden relational structure within implicit datasets.
A well-known drawback of l1-penalized estimators is the systematic shrinkage of the large coefficients towards zero. A simple remedy is to treat Lasso as a model-selection procedure and to perform a second refitting step on the selected support. In this work we formalize the notion of refitting and provide oracle bounds for arbitrary refitting procedures of the Lasso solution. One of the most widely used refitting techniques, which is based on least squares, may bring a problem of interpretability, since the signs of the refitted estimator might be flipped with respect to the original estimator. This problem arises from the fact that the least-squares refitting considers only the support of the Lasso solution, ignoring any information about signs or amplitudes. To this end we define a sign-consistent refitting as an arbitrary refitting procedure that preserves the signs of the first-step Lasso solution, and we provide oracle inequalities for such estimators. Finally, we consider special refitting strategies: Bregman Lasso and Boosted Lasso. Bregman Lasso has the useful property of converging to the sign-consistent least-squares refitting (least squares with sign constraints), which provides greater interpretability. We additionally study the Bregman Lasso refitting in the case of orthogonal design, providing simple intuition behind the proposed method. Boosted Lasso, in contrast, considers information about magnitudes of the first Lasso step and allows one to develop better oracle rates for prediction. Finally, we conduct an extensive numerical study to show advantages of one approach over others in different synthetic and semi-real scenarios.
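The shrinkage and the effect of a least-squares refit are easiest to see in the orthonormal-design case, where both estimators have closed forms. The Python sketch below assumes the objective 0.5 * ||y - Xb||^2 + lam * ||b||_1 with X^T X = I and works directly with z = X^T y; it illustrates plain least-squares refitting only, not the Bregman or Boosted Lasso procedures, and the numbers are made up.

```python
def soft_threshold(z, lam):
    """Closed-form Lasso coefficient for one orthonormal column."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_orthonormal(z, lam):
    """Lasso solution when X^T X = I and z = X^T y."""
    return [soft_threshold(v, lam) for v in z]

def refit_least_squares(z, lasso_coefs):
    """Least squares restricted to the Lasso support.

    With orthonormal columns this simply restores z on the support.
    """
    return [v if b != 0.0 else 0.0 for v, b in zip(z, lasso_coefs)]

z = [3.0, -2.0, 0.4, 0.0]  # hypothetical values of X^T y
lasso = lasso_orthonormal(z, lam=0.5)
refit = refit_least_squares(z, lasso)
```

Here the Lasso pulls the two large coefficients towards zero by `lam`, and the refit undoes that shrinkage while keeping the selected support. In this orthonormal setting the signs cannot flip; the sign problem discussed above needs correlated columns.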
Domain similarity measures can be used to gauge adaptability and select suitable data for transfer learning, but existing approaches define ad hoc measures that are deemed suitable for respective tasks. Inspired by work on curriculum learning, we propose to learn data selection measures using Bayesian Optimization and evaluate them across models, domains and tasks. Our learned measures outperform existing domain similarity measures significantly on three tasks: sentiment analysis, part-of-speech tagging, and parsing. We show the importance of complementing similarity with diversity, and that learned measures are, to some degree, transferable across models, domains, and even tasks.
Entity linking has recently been the subject of a significant body of research. Currently, the best performing approaches rely on trained mono-lingual models. Porting these approaches to other languages is consequently a difficult endeavor as it requires corresponding training data and retraining of the models. We address this drawback by presenting a novel multilingual, knowledge-base agnostic and deterministic approach to entity linking, dubbed MAG. MAG is based on a combination of context-based retrieval on structured knowledge bases and graph algorithms. We evaluate MAG on 23 data sets and in 7 languages. Our results show that the best approach trained on English datasets (PBOH) achieves a micro F-measure that is up to 4 times worse on datasets in other languages. MAG, on the other hand, achieves state-of-the-art performance on English datasets and reaches a micro F-measure that is up to 0.6 higher than that of PBOH on non-English languages.
A novel class of non-reversible Markov chain Monte Carlo schemes relying on continuous-time piecewise deterministic Markov Processes has recently emerged. In these algorithms, the state of the Markov process evolves according to a deterministic dynamics which is modified using a Markov transition kernel at random event times. These methods enjoy remarkable features including the ability to update only a subset of the state components while other components implicitly keep evolving. However, several important problems remain open. The deterministic dynamics used so far do not exploit the structure of the target. Moreover, exact simulation of the event times is feasible for an important yet restricted class of problems and, even when it is, it is application specific. This limits the applicability of these methods and prevents the development of a generic software implementation. In this paper, we introduce novel MCMC methods addressing these limitations by bringing together piecewise deterministic Markov processes, Hamiltonian dynamics and slice sampling. We propose novel continuous-time algorithms relying on exact Hamiltonian flows and novel discrete-time algorithms which can exploit complex dynamics such as approximate Hamiltonian dynamics arising from symplectic integrators. We demonstrate the performance of these schemes on a variety of applications.

Book Memo: “Mastering Machine Learning with Python in Six Steps”

A Practical Implementation Guide to Predictive Data Analytics Using Python. Master machine learning with Python in six steps and explore fundamental to advanced topics, all designed to make you a worthy practitioner. This book’s approach is based on the “Six degrees of separation” theory, which states that everyone and everything is a maximum of six steps away. Mastering Machine Learning with Python in Six Steps presents each topic in two parts: theoretical concepts and practical implementation using suitable Python packages. You’ll learn the fundamentals of the Python programming language, machine learning history, evolution, and the system development frameworks. Key data mining/analysis concepts, such as feature dimension reduction, regression, time series forecasting and their efficient implementation in Scikit-learn are also covered. Finally, you’ll explore advanced text mining techniques, neural networks and deep learning techniques, and their implementation. All the code presented in the book will be available in the form of iPython notebooks to enable you to try out these examples and extend them to your advantage.

R Packages worth a look

Compare Two Data Frames and Summarise the Difference (dataCompareR)
Easy comparison of two tabular data objects in R. Specifically designed to show differences between two sets of data in a useful way that should make it easier to understand the differences, and if necessary, help you work out how to remedy them. Aims to offer a more useful output than all.equal() when your two data sets do not match, but isn’t intended to replace all.equal() as a way to test for equality.

Fitting Tails by the Empirical Residual Coefficient of Variation (ercv)
Provides a simple and trustworthy methodology for the analysis of extreme values and multiple threshold tests for a generalized Pareto distribution, together with an automatic threshold selection algorithm. See del Castillo, J, Daoudi, J and Lockhart, R (2014) <doi:10.1111/sjos.12037>.

Tools to Transform and Query Data with ‘Apache’ ‘Drill’ (sergeant)
‘Apache Drill’ is a low-latency distributed query engine designed to enable data exploration and ‘analytics’ on both relational and non-relational ‘datastores’, scaling to petabytes of data. Methods are provided that enable working with ‘Apache’ ‘Drill’ instances via the ‘REST’ ‘API’, ‘JDBC’ interface (optional), ‘DBI’ ‘methods’ and using ‘dplyr’/‘dbplyr’ idioms.

Bayesian Nonparametric Spectral Density Estimation Using B-Spline Priors (bsplinePsd)
Implementation of a Metropolis-within-Gibbs MCMC algorithm to flexibly estimate the spectral density of a stationary time series. The algorithm updates a nonparametric B-spline prior using the Whittle likelihood to produce pseudo-posterior samples and is based on the work presented by Edwards, Meyer, and Christensen (2017) <arXiv:1707.04878>.

Find Graph Centrality Indices (centiserve)
Calculates centrality indices additional to the ‘igraph’ package centrality functions.

Distilled News

I started my deep learning journey a few years back. I have learnt a lot in this period. But, even after all these efforts, every neural network I train provides me with a new experience. If you have tried to train a neural network, you must know my plight! But, through all this time, I have now made a workflow, which I will share with you today. I am sharing my learning and experience about building neural networks with all of you. I cannot guarantee it will work all the time, but at least it may guide you as to how you would approach solving the problem. I will also share with you a tool which I find is a useful addition to the deep learning toolbox – TensorBoard.
In time series analysis, structural changes represent shocks impacting the evolution over time of the data generating process. That is relevant because one of the key assumptions of the Box-Jenkins methodology is that the structure of the data generating process does not change over time. How can structural changes be identified? The strucchange package can help with that, and the present tutorial shows how.
Speed and time are key factors for any Data Scientist. In business, you do not usually work with toy datasets having thousands of samples. It is more likely that your datasets will contain millions or hundreds of millions of samples. Customer orders, web logs, billing events, stock prices – datasets now are huge. I assume you do not want to spend hours or days waiting for your data processing to complete. The biggest dataset I have worked with so far contained over 30 million records. When I ran my data processing script the first time for this dataset, the estimated time to complete was around 4 days! I do not have a very powerful machine (MacBook Air with i5 and 4 GB of RAM), but the most I could accept was running the script overnight, not multiple days. Thanks to some clever tricks, I was able to decrease this running time to a few hours. This post will explain the first step to achieve good data processing performance – choosing the right library/framework for your dataset.
This chapter covers the following topics:
• The best variables ranking from conventional machine learning algorithms, either predictive or clustering.
• The nature of selecting variables with and without predictive models.
• The effect of variables working in groups (intuition and information theory).
• Exploring the best variable subset in practice using R.
Selecting the best variables is also known as feature selection, selecting the most important predictors, selecting the best predictors, among others.
To perform the analysis, I needed a large number of tweets and I wanted to use all of the tweets concerning the election. The Twitter search API is limited since you only have access to a sample of tweets. On the other hand, the streaming API allows you to collect the data in real time and to collect almost all tweets. Hence, I used the streamR package. I collected tweets in 60-second batches and saved them to .json files. Using batches instead of one large file reduces RAM consumption (instead of reading and then subsetting one large file, you can subset each batch and then merge the results). Here is the code to collect the data with streamR.
Getting the best results out of a machine learning (ML) model requires that you truly understand your data. However, ML datasets can contain hundreds of millions of data points, each consisting of hundreds (or even thousands) of features, making it nearly impossible to understand an entire dataset in an intuitive fashion. Visualization can help unlock nuances and insights in large datasets. A picture may be worth a thousand words, but an interactive visualization can be worth even more. Working with the PAIR initiative, we’ve released Facets, an open source visualization tool to aid in understanding and analyzing ML datasets. Facets consists of two visualizations that allow users to see a holistic picture of their data at different granularities. Get a sense of the shape of each feature of the data using Facets Overview, or explore a set of individual observations using Facets Dive. These visualizations allow you to debug your data which, in machine learning, is as important as debugging your model. They can easily be used inside of Jupyter notebooks or embedded into webpages. In addition to the open source code, we’ve also created a Facets demo website. This website allows anyone to visualize their own datasets directly in the browser without the need for any software installation or setup, without the data ever leaving your computer.
Yes, most faculty, graduate students, and a lot of engineering teams in industry have already abandoned everything else and shifted to deep learning. Most new graduate students I meet in applied areas such as computer vision know nothing about probabilistic graphical models, for instance, and their proposed solution to any problem is a CNN/LSTM/GAN.
Machine learning with Big Data is, in many ways, different than ‘regular’ machine learning. This informative image is helpful in identifying the steps in machine learning with Big Data, and how they fit together into a process of their own.
In an older post, I discussed a number of functions that are useful for programming in R. I wanted to expand on that topic by covering other functions, packages, and tools that are useful. Over the past year, I have been working as an R programmer and these are some of the new learnings that have become fundamental in my work.
Textual entailment is a simple exercise in logic that attempts to discern whether one sentence can be inferred from another. A computer program that takes on the task of textual entailment attempts to categorize an ordered pair of sentences into one of three categories. The first category, called “positive entailment,” occurs when you can use the first sentence to prove that a second sentence is true. The second category, “negative entailment,” is the inverse of positive entailment. This occurs when the first sentence can be used to disprove the second sentence. Finally, if the two sentences have no correlation, they are considered to have a “neutral entailment.” Textual entailment is useful as a component in much larger applications. For example, question-answering systems may use textual entailment to verify an answer from stored information. Textual entailment may also enhance document summarization by filtering out sentences that don’t include new information. Other natural language processing (NLP) systems find similar uses for entailment. This article will guide you through how to build a simple and fast-to-train neural network to perform textual entailment using TensorFlow.
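To make the three categories and the summarization use case concrete, here is a small Python sketch. It is not the article's TensorFlow model: `toy_classify` is a deliberately crude, hypothetical stand-in (word-subset matching) for a trained entailment classifier, and `drop_entailed` is an illustrative name for the filtering step.

```python
from enum import Enum


class Entailment(Enum):
    POSITIVE = "positive"  # the first sentence proves the second
    NEGATIVE = "negative"  # the first sentence disproves the second
    NEUTRAL = "neutral"    # the first sentence says nothing either way


def toy_classify(premise, hypothesis):
    """Toy stand-in for a trained model (an assumption, not a real method).

    Returns POSITIVE when every word of the hypothesis already appears in
    the premise, otherwise NEUTRAL. It never returns NEGATIVE.
    """
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    if hypothesis_words <= premise_words:
        return Entailment.POSITIVE
    return Entailment.NEUTRAL


def drop_entailed(sentences, classify=toy_classify):
    """Keep a sentence only if no already-kept sentence entails it."""
    kept = []
    for sentence in sentences:
        entailed = any(
            classify(previous, sentence) is Entailment.POSITIVE
            for previous in kept
        )
        if not entailed:
            kept.append(sentence)
    return kept
```

Because the pair is ordered, `toy_classify("a dog runs in the park", "a dog runs")` and `toy_classify("a dog runs", "a dog runs in the park")` give different labels. A real system would swap the toy function for a trained classifier with the same two-argument shape.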
In an earlier post I discussed how to avoid overfitting when using Support Vector Machines. This was achieved using cross validation. In cross validation, prediction accuracy is maximized by varying the cost parameter. Importantly, prediction accuracy is calculated on a different subset of the data from that used for training. In this blog post I take that concept a step further, by automating the manual search for the optimal cost. The data set I’ll be using describes different types of glass based upon physical attributes and chemical composition. You can read more about the data here, but for the purposes of my analysis all you need to know is that the outcome variable is categorical (7 types of glass) and the 4 predictor variables are numeric.
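The automated search described above can be sketched as a grid search over the cost parameter, scored by k-fold cross validation. The Python sketch below is not the author's code, and it makes several simplifying assumptions: a binary linear soft-margin SVM trained by plain subgradient descent stands in for whatever SVM implementation the post uses, the data are synthetic two-class points rather than the glass data set, and the grid values, learning rate, and epoch count are arbitrary.

```python
import random


def train_linear_svm(rows, labels, cost, epochs=50, lr=0.01):
    """Subgradient descent on 0.5*||w||^2 + cost * sum(hinge losses)."""
    n = len(rows)
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Hinge loss is active: regulariser plus loss subgradient.
                w = [wi - lr * (wi / n - cost * y * xi) for wi, xi in zip(w, x)]
                b += lr * cost * y
            else:
                # Only the regulariser contributes.
                w = [wi - lr * wi / n for wi in w]
    return w, b


def accuracy(w, b, rows, labels):
    """Fraction of points whose predicted sign matches the label."""
    correct = 0
    for x, y in zip(rows, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += (1 if score >= 0 else -1) == y
    return correct / len(rows)


def cross_validated_accuracy(rows, labels, cost, k=5):
    """Mean accuracy over k folds; each fold is scored on held-out data."""
    scores = []
    for fold in range(k):
        held_out = set(range(fold, len(rows), k))
        train_x = [x for i, x in enumerate(rows) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        test_x = [rows[i] for i in sorted(held_out)]
        test_y = [labels[i] for i in sorted(held_out)]
        w, b = train_linear_svm(train_x, train_y, cost)
        scores.append(accuracy(w, b, test_x, test_y))
    return sum(scores) / k


def best_cost(rows, labels, grid, k=5):
    """Try every cost in the grid and return the best one with all scores."""
    results = {c: cross_validated_accuracy(rows, labels, c, k) for c in grid}
    return max(results, key=results.get), results


# Synthetic two-class data (an assumption; not the glass data set).
rng = random.Random(0)
rows, labels = [], []
for i in range(100):
    y = 1 if i % 2 == 0 else -1
    rows.append([rng.gauss(y, 1.0), rng.gauss(y, 1.0)])
    labels.append(y)

grid = [0.01, 0.1, 1.0, 10.0]
chosen, results = best_cost(rows, labels, grid)
print("best cost:", chosen)
```

The key design point matches the text: each candidate cost is scored only on data that was held out from training, so the chosen value reflects out-of-sample accuracy rather than fit to the training set.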
In the previous post I explored the use of linear models in the forms most commonly used in agricultural research. Clearly, when we talk about linear models we implicitly assume that all relations between the dependent variable y and the predictors x are linear. In fact, a linear model can accommodate other shapes for the relation between y and x, for example by including polynomial terms (read for example: https://…/fitting-polynomial-regression-r ). However, we can do that only when we can clearly see a particular shape in the relation, for example a quadratic one. The problem is that in many cases a scatterplot shows a non-linear distribution of the points, but its form is difficult to identify. Moreover, in a linear model the interpretation of polynomial coefficients becomes more difficult, which may decrease their usefulness. An alternative approach is provided by Generalized Additive Models, which allow us to fit models with non-linear smoothers without specifying a particular shape a priori.
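The contrast can be written out for a single response y. The notation below (the coefficients β, the smooth functions f_j, and the error term ε) is a standard textbook convention assumed here for illustration; it does not appear in the original post.

```latex
% Quadratic term inside a linear model: the shape is chosen in advance,
% and the model is still linear in the coefficients \beta.
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i

% Additive model: each predictor enters through a smooth function f_j
% whose shape is estimated from the data rather than specified a priori.
y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i
```

In the generalized case, the left-hand side of the second equation becomes g(E[y_i]) for a link function g, just as in a generalized linear model.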
In a previous post, I introduced the Meta Meta-Model of Deep Learning. However, I did not introduce its details. A word of warning for the reader: the concepts in this section are in flux and are undergoing a lot of changes. Therefore, this article is just a reflection of my current understanding of the language of the Deep Learning Meta Meta-Model. That’s definitely a mouthful, so to make life simpler for everyone, I just call this the Deep Learning Canonical Patterns. These patterns are documented in the Deep Learning Design Patterns Wiki. In this post I will explore further the characteristics of Artificial Intuition with the goal of describing a set of patterns that can aid us in formulating novel architectures for Deep Learning. In a previous post, “Deep Learning and Artificial Intuition”, I introduced the idea that there are two distinct cognitive mechanisms, one based on logical inference and another based on intuition. At least 6 decades have been spent exploring cognitive mechanisms based on logical inference without making much progress towards AGI. Deep Learning, a breakthrough discovered in 2012, revealed a promising alternative research approach based on a different cognitive paradigm.