If you did not already know

Data Decorations
Alberto Cairo left a comment about ‘data decorations’. This is a name he’s using to describe something like the windshield-wiper chart I discussed the other day. It seems like the visual elements were purely ornamental and added nothing to the experience – one might argue that the experience was worse than just staring at the data table. …

Bridge Sampling
Bridge sampling (Bennett, 1976; Meng & Wong, 1996) is a reliable and relatively straightforward sampling method that allows researchers to obtain the marginal likelihood for models of varying complexity. …

Network Sketching
Convolutional neural networks (CNNs) with deep architectures have substantially advanced the state-of-the-art in computer vision tasks. However, deep networks are typically resource-intensive and thus difficult to deploy on mobile devices. Recently, CNNs with binary weights have shown compelling efficiency to the community, whereas the accuracy of such models is usually unsatisfactory in practice. In this paper, we introduce network sketching as a novel technique of pursuing binary-weight CNNs, targeting more faithful inference and a better trade-off for practical applications. Our basic idea is to exploit binary structure directly in pre-trained filter banks and produce binary-weight models via tensor expansion. The whole process can be treated as a coarse-to-fine model approximation, akin to the pencil drawing steps of outlining and shading. To further speed up the generated models, namely the sketches, we also propose an associative implementation of binary tensor convolutions. Experimental results demonstrate that a proper sketch of AlexNet (or ResNet) outperforms the existing binary-weight models by large margins on the ImageNet large scale classification task, while the committed memory for network parameters only exceeds a little. …

Book Memo: “Data Mining with R”

Learning with Case Studies
The versatile capabilities and large set of add-on packages make R an excellent alternative to many existing and often expensive data mining tools. Exploring this area from the perspective of a practitioner, Data Mining with R: Learning with Case Studies uses practical examples to illustrate the power of R and data mining. Assuming no prior knowledge of R or data mining/statistical techniques, the book covers a diverse set of problems that pose different challenges in terms of size, type of data, goals of analysis, and analytical tools. To present the main data mining processes and techniques, the author takes a hands-on approach that utilizes a series of detailed, real-world case studies:
1. Predicting algae blooms
2. Predicting stock market returns
3. Detecting fraudulent transactions
4. Classifying microarray samples
With these case studies, the author supplies all necessary steps, code, and data.
Web Resource
A supporting website mirrors the do-it-yourself approach of the text. It offers a collection of freely available R source files that encompass all the code used in the case studies. The site also provides the data sets from the case studies as well as an R package of several functions.

Distilled News

Welcome to the Python Graph Gallery. This website displays hundreds of charts, always providing the reproducible Python code! It aims to showcase the awesome dataviz possibilities of Python and to help you benefit from them. Feel free to propose a chart or report a bug. Any feedback is highly welcome. Get in touch with the gallery by following it on Twitter, Facebook, or by subscribing to the blog.
• Losing sight of the BIG picture
• Lack of engagement with key stakeholders
• Putting the ‘How’ before the ‘Why’
• Not solving the right problem
• Hiring Data Scientists who are Unicorns
If you’re thinking of leaving post-PhD science for data science then doubtless people have told you to learn version control. They’re absolutely right. You should. But learning git is not enough. So, in the spirit of A PhD is Not Enough, a great book about careers in science, here’s some advice about moving from academia into data science after completing a PhD in a natural science. Unlike A PhD is Not Enough, however, this post is not a complete guide to a career. It’s just a collection of (hopefully non-obvious) things that have occurred to me since I made the move myself three years ago. And to be clear: none of what I say here applies to you if you have a PhD in computer science, mathematics, statistics or the humanities.
Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize aspects of their product, processes or customer experience. The cartoon version of machine learning sounds quite easy: you feed in training data made up of examples of good and bad outcomes, and the computer automatically learns from these and spits out a model that can make similar predictions on new data not seen before. What could be easier, right?
• Challenge one: Machine learning systems use advanced analytical techniques in production software
• Challenge two: Integrating model builders and system builders
• Challenge three: The failure of QA and the importance of instrumentation
• Challenge four: Diverse data dependencies
There are many ways to query data with R. This post shows you three of the most common ways:
1. Using DBI
2. Using dplyr syntax
3. Using R Notebooks

If you did not already know

Nesterov’s Accelerated Gradient (NAG)
Nesterov’s Accelerated Gradient Descent performs a simple step of gradient descent to go from x_s to y_{s+1}, and then it ‘slides’ a little bit further than y_{s+1} in the direction given by the previous point y_s. The intuition behind the algorithm is quite difficult to grasp, and unfortunately the analysis will not be very enlightening either. Nonetheless Nesterov’s Accelerated Gradient is an optimal method (in terms of oracle complexity) for smooth convex optimization, …
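The two-step update described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis the excerpt refers to: the momentum weight s / (s + 3) is one common textbook schedule and is an assumption here, as are the function names and the small quadratic objective used to exercise it.

```python
def nesterov_accelerated_gradient(grad, x0, lipschitz, steps):
    """Minimize a smooth convex function given its gradient (illustrative sketch)."""
    x = list(x0)        # x_s: point where the gradient is evaluated
    y_prev = list(x0)   # y_s: result of the previous gradient step
    for s in range(steps):
        g = grad(x)
        # Plain gradient step from x_s to y_{s+1}, with step size 1/L.
        y = [xi - gi / lipschitz for xi, gi in zip(x, g)]
        # "Slide" a little past y_{s+1}, away from the previous point y_s.
        # The weight s / (s + 3) is an assumed schedule, not taken from the source.
        momentum = s / (s + 3)
        x = [yi + momentum * (yi - ypi) for yi, ypi in zip(y, y_prev)]
        y_prev = y
    return y_prev


def quadratic_grad(x):
    # Gradient of the hypothetical objective f(x) = 0.5 * (x[0]**2 + 10 * x[1]**2).
    # Its gradient is Lipschitz with constant 10.
    return [x[0], 10.0 * x[1]]


minimizer = nesterov_accelerated_gradient(
    quadratic_grad, [1.0, 1.0], lipschitz=10.0, steps=200
)
```

The only difference from plain gradient descent is the extrapolation line: setting `momentum` to zero recovers the ordinary method.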

Activation Ensemble
Many activation functions have been proposed in the past, but selecting an adequate one requires trial and error. We propose a new methodology of designing activation functions within a neural network at each layer. We call this technique an ‘activation ensemble’ because it allows the use of multiple activation functions at each layer. This is done by introducing additional variables, $\alpha$, at each activation layer of a network to allow for multiple activation functions to be active at each neuron. By design, activations with larger $\alpha$ values at a neuron are equivalent to those having the largest magnitude. Hence, those higher magnitude activations are ‘chosen’ by the network. We implement the activation ensembles on a variety of datasets using an array of Feed Forward and Convolutional Neural Networks. By using the activation ensemble, we achieve superior results compared to traditional techniques. In addition, because of the flexibility of this methodology, we more deeply explore activation functions and the features that they capture. …
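To make the idea of per-neuron $\alpha$ weights concrete, here is a rough sketch of a single neuron that mixes several activation functions. The abstract does not say how the $\alpha$ values are constrained or normalized, so the softmax normalization, the choice of three activations, and all names below are assumptions for illustration only.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical set of candidate activations for one layer.
ACTIVATIONS = [relu, math.tanh, sigmoid]


def softmax(logits):
    # Numerically stable softmax; used here (as an assumption) to keep the
    # alpha weights non-negative and summing to one.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_activation(z, alpha_logits):
    """Weighted mix of all candidate activations applied to pre-activation z."""
    alphas = softmax(alpha_logits)
    return sum(a * h(z) for a, h in zip(alphas, ACTIVATIONS))


# Equal weights: every activation contributes equally.
mixed = ensemble_activation(0.0, [0.0, 0.0, 0.0])
# A dominant alpha for ReLU: the neuron behaves almost exactly like ReLU.
mostly_relu = ensemble_activation(2.0, [50.0, 0.0, 0.0])
```

In a real network the `alpha_logits` would be trainable parameters updated alongside the weights, so the network itself shifts mass toward the activations that help.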

Apache Flume
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store …

Book Memo: “Bayesian Data Analysis”

 Now in its third edition, this classic book is widely considered the leading text on Bayesian methods, lauded for its accessible, practical approach to analyzing data and solving research problems. Bayesian Data Analysis, Third Edition continues to take an applied approach to analysis using up-to-date Bayesian methods. The authors—all leaders in the statistics community—introduce basic concepts from a data-analytic perspective before presenting advanced methods. Throughout the text, numerous worked examples drawn from real applications and research emphasize the use of Bayesian inference in practice.

What’s new on arXiv

This paper tackles two related questions at the heart of machine learning: how can we predict if a minimum will generalize to the test set, and why does stochastic gradient descent find minima that generalize well? Our work is inspired by Zhang et al. (2017), who showed deep networks can easily memorize randomly labeled training data, despite generalizing well when shown real labels of the same inputs. We show here that the same phenomenon occurs in small linear models. These observations are explained by evaluating the Bayesian evidence in favor of each model, which penalizes sharp minima. Next, we explore the ‘generalization gap’ between small and large batch training, identifying an optimum batch size which maximizes the test set accuracy. Noise in the gradient updates is beneficial, driving the dynamics towards robust minima for which the evidence is large. Interpreting stochastic gradient descent as a stochastic differential equation, we predict the optimum batch size is proportional to both the learning rate and the size of the training set, and verify these predictions empirically.
We consider the problem of computing the Fourier transform of high-dimensional vectors, distributedly over a cluster of machines consisting of a master node and multiple worker nodes, where the worker nodes can only store and process a fraction of the inputs. We show that by exploiting the algebraic structure of the Fourier transform operation and leveraging concepts from coding theory, one can efficiently deal with the straggler effects. In particular, we propose a computation strategy, named as coded FFT, which achieves the optimal recovery threshold, defined as the minimum number of workers that the master node needs to wait for in order to compute the output. This is the first code that achieves the optimum robustness in terms of tolerating stragglers or failures for computing Fourier transforms. Furthermore, the reconstruction process for coded FFT can be mapped to MDS decoding, which can be solved efficiently. Moreover, we extend coded FFT to settings including computing general $n$-dimensional Fourier transforms, and provide the optimal computing strategy for those settings.
Convolutional Neural Networks (CNNs) currently achieve state-of-the-art accuracy in image classification. With a growing number of classes, the accuracy usually drops as the possibilities of confusion increase. Interestingly, the class confusion patterns follow a hierarchical structure over the classes. We present visual-analytics methods to reveal and analyze this hierarchy of similar classes in relation with CNN-internal data. We found that this hierarchy not only dictates the confusion patterns between the classes, it furthermore dictates the learning behavior of CNNs. In particular, the early layers in these networks develop feature detectors that can separate high-level groups of classes quite well, even after a few training epochs. In contrast, the latter layers require substantially more epochs to develop specialized feature detectors that can separate individual classes. We demonstrate how these insights are key to significant improvement in accuracy by designing hierarchy-aware CNNs that accelerate model convergence and alleviate overfitting. We further demonstrate how our methods help in identifying various quality issues in the training data.
In this work we propose Lasagne, a methodology to learn locality and structure aware graph node embeddings in an unsupervised way. In particular, we show that the performance of existing random-walk based approaches depends strongly on the structural properties of the graph, e.g., the size of the graph, whether the graph has a flat or upward-sloping Network Community Profile (NCP), whether the graph is expander-like, whether the classes of interest are more k-core-like or more peripheral, etc. For larger graphs with flat NCPs that are strongly expander-like, existing methods lead to random walks that expand rapidly, touching many dissimilar nodes, thereby leading to lower-quality vector representations that are less useful for downstream tasks. Rather than relying on global random walks or neighbors within fixed hop distances, Lasagne exploits strongly local Approximate Personalized PageRank stationary distributions to more precisely engineer local information into node embeddings. This leads, in particular, to more meaningful and more useful vector representations of nodes in poorly-structured graphs. We show that Lasagne leads to significant improvement in downstream multi-label classification for larger graphs with flat NCPs, that it is comparable for smaller graphs with upward-sloping NCPs, and that it is comparable to existing methods for link prediction tasks.
Sentence representation at the semantic level is a challenging task for Natural Language Processing and Artificial Intelligence. Despite the advances in word embeddings (i.e. word vector representations), capturing sentence meaning is an open question due to complexities of semantic interactions among words. In this paper, we present an embedding method, which is aimed at learning unsupervised sentence representations from unlabeled text. We propose an unsupervised method that models a sentence as a weighted series of word embeddings. The weights of the word embeddings are fitted by using Shannon’s word entropies provided by the Term Frequency–Inverse Document Frequency (TF–IDF) transform. The hyperparameters of the model can be selected according to the properties of data (e.g. sentence length and textual gender). Hyperparameter selection involves word embedding methods and dimensionalities, as well as weighting schemata. Our method offers advantages over existing methods: identifiable modules, short-term training, online inference of (unseen) sentence representations, as well as independence from domain, external knowledge and language resources. Results showed that our model outperformed the state of the art in well-known Semantic Textual Similarity (STS) benchmarks. Moreover, our model reached state-of-the-art performance when compared to supervised and knowledge-based STS systems.
Subjectivity detection is the task of identifying objective and subjective sentences. Objective sentences are those which do not exhibit any sentiment. So, it is desired for a sentiment analysis engine to find and separate the objective sentences for further analysis, e.g., polarity detection. In subjective sentences, opinions can often be expressed on one or multiple topics. Aspect extraction is a subtask of sentiment analysis that consists in identifying opinion targets in opinionated text, i.e., in detecting the specific aspects of a product or service the opinion holder is either praising or complaining about.
We describe Honk, an open-source PyTorch reimplementation of convolutional neural networks for keyword spotting that are included as examples in TensorFlow. These models are useful for recognizing ‘command triggers’ in speech-based interfaces (e.g., ‘Hey Siri’), which serve as explicit cues for audio recordings of utterances that are sent to the cloud for full speech recognition. Evaluation on Google’s recently released Speech Commands Dataset shows that our reimplementation is comparable in accuracy and provides a starting point for future work on the keyword spotting task.
An increasing number of sensors on mobile, Internet of things (IoT), and wearable devices generate time-series measurements of physical activities. Though access to the sensory data is critical to the success of many beneficial applications such as health monitoring or activity recognition, a wide range of potentially sensitive information about the individuals can also be discovered through these datasets and this cannot easily be protected using traditional privacy approaches. In this paper, we propose an integrated sensing framework for managing access to personal time-series data in order to provide utility while protecting individuals’ privacy. We introduce Replacement AutoEncoder, a novel feature-learning algorithm which learns how to transform discriminative features of multidimensional time-series that correspond to sensitive inferences, into some features that have been more observed in non-sensitive inferences, to protect users’ privacy. The main advantage of Replacement AutoEncoder is its ability to keep important features of desired inferences unchanged to preserve the utility of the data. We evaluate the efficacy of the algorithm with an activity recognition task in a multi-sensing environment using extensive experiments on three benchmark datasets. We show that it can retain the recognition accuracy of state-of-the-art techniques while simultaneously preserving the privacy of sensitive information. We use a Generative Adversarial Network to attempt to detect the replacement of sensitive data with fake non-sensitive data. We show that this approach does not detect the replacement unless the network can train using the users’ original unmodified data.
If X and Y are real valued random variables such that the first moments of X, Y, and XY exist and the conditional expectation of Y given X is an affine function of X, then the intercept and slope of the conditional expectation equal the intercept and slope of the least squares linear regression function, even though Y may not have a finite second moment. As a consequence, the affine in X form of the conditional expectation and zero covariance imply mean independence.
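The reasoning behind this statement can be sketched in a few lines. The symbols $a$ and $b$ for the intercept and slope are notation introduced here, and the sketch assumes that $X$ has finite, nonzero variance so that the slope formula is well defined.

$$
\begin{aligned}
E[Y \mid X] &= a + bX \\
E[Y] &= a + b\,E[X] \\
E[XY] = E\big[X\,E[Y \mid X]\big] &= a\,E[X] + b\,E[X^2] \\
\operatorname{Cov}(X, Y) = E[XY] - E[X]\,E[Y] &= b\,\big(E[X^2] - E[X]^2\big) = b\,\operatorname{Var}(X)
\end{aligned}
$$

Hence $b = \operatorname{Cov}(X, Y) / \operatorname{Var}(X)$ and $a = E[Y] - b\,E[X]$, which are the slope and intercept of the least squares line. No second moment of $Y$ is used anywhere. If $\operatorname{Cov}(X, Y) = 0$, then $b = 0$ and $E[Y \mid X] = a = E[Y]$, which is mean independence.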
A number of recent papers have provided evidence that practical design questions about neural networks may be tackled theoretically by studying the behavior of random networks. However, until now the tools available for analyzing random neural networks have been relatively ad-hoc. In this work, we show that the distribution of pre-activations in random neural networks can be exactly mapped onto lattice models in statistical physics. We argue that several previous investigations of stochastic networks actually studied a particular factorial approximation to the full lattice model. For random linear networks and random rectified linear networks we show that the corresponding lattice models in the wide network limit may be systematically approximated by a Gaussian distribution with covariance between the layers of the network. In each case, the approximate distribution can be diagonalized by Fourier transformation. We show that this approximation accurately describes the results of numerical simulations of wide random neural networks. Finally, we demonstrate that in each case the large scale behavior of the random networks can be approximated by an effective field theory.
Learning social media data embedding by deep models has attracted extensive research interest and has spurred many applications, such as link prediction, classification, and cross-modal search. However, for social images which contain both link information and multimodal contents (e.g., text description, and visual content), simply employing the embedding learnt from network structure or data content results in sub-optimal social image representation. In this paper, we propose a novel social image embedding approach called Deep Multimodal Attention Networks (DMAN), which employs a deep model to jointly embed multimodal contents and link information. Specifically, to effectively capture the correlations between multimodal contents, we propose a multimodal attention network to encode the fine-granularity relation between image regions and textual words. To leverage the network structure for embedding learning, a novel Siamese-Triplet neural network is proposed to model the links among images. With the joint deep model, the learnt embedding can capture both the multimodal contents and the nonlinear network information. Extensive experiments are conducted to investigate the effectiveness of our approach in the applications of multi-label classification and cross-modal search. Compared to state-of-the-art image embeddings, our proposed DMAN achieves significant improvement in the tasks of multi-label classification and cross-modal search.
‘Community structure’ is a commonly observed feature of real networks. The term refers to the presence in a network of groups of nodes (communities) that feature high internal connectivity and are poorly connected to each other. Whereas the issue of community detection has been addressed in several works, the problem of validating a partition of nodes as a good community structure for a network has received little attention and remains an open issue. We propose an inferential procedure for community structure validation of network partitions, which relies on concepts from network enrichment analysis. The proposed procedure allows one to compare the adequacy of different partitions of nodes as community structures. Moreover, it can be employed to assess whether two networks share the same community structure, and to compare the performance of different network clustering algorithms.
Lexical ambiguity can impede NLP systems from accurate understanding of semantics. Despite its potential benefits, the integration of sense-level information into NLP systems has remained understudied. By incorporating a novel disambiguation algorithm into a state-of-the-art classification model, we create a pipeline to integrate sense-level information into downstream NLP applications. We show that a simple disambiguation of the input text can lead to consistent performance improvement on multiple topic categorization and polarity detection datasets, particularly when the fine granularity of the underlying sense inventory is reduced and the document is sufficiently large. Our results also point to the need for sense representation research to focus more on in vivo evaluations which target the performance in downstream NLP applications rather than artificial benchmarks.
In this paper, we study the pooled data problem of identifying the labels associated with a large collection of items, based on a sequence of pooled tests revealing the counts of each label within the pool. In the noiseless setting, we identify an exact asymptotic threshold on the required number of tests with optimal decoding, and prove a phase transition between complete success and complete failure. In addition, we present a novel noisy variation of the problem, and provide an information-theoretic framework for characterizing the required number of tests for general random noise models. Our results reveal that noise can make the problem considerably more difficult, with strict increases in the scaling laws even at low noise levels. Finally, we demonstrate similar behavior in an approximate recovery setting, where a given number of errors is allowed in the decoded labels.
Tensor decomposition methods are popular tools for learning latent variables given only lower-order moments of the data. However, the standard assumption is that we have sufficient data to estimate these moments to high accuracy. In this work, we consider the case in which certain dimensions of the data are not always observed—common in applied settings, where not all measurements may be taken for all observations—resulting in moment estimates of varying quality. We derive a weighted tensor decomposition approach that is computationally as efficient as the non-weighted approach, and demonstrate that it outperforms methods that do not appropriately leverage these less-observed dimensions.
The principal goal of computational mechanics is to define pattern and structure so that the organization of complex systems can be detected and quantified. Computational mechanics developed from efforts in the 1970s and early 1980s to identify strange attractors as the mechanism driving weak fluid turbulence via the method of reconstructing attractor geometry from measurement time series and in the mid-1980s to estimate equations of motion directly from complex time series. In providing a mathematical and operational definition of structure it addressed weaknesses of these early approaches to discovering patterns in natural systems. Since then, computational mechanics has led to a range of results from theoretical physics and nonlinear mathematics to diverse applications, from closed-form analysis of Markov and non-Markov stochastic processes that are ergodic or nonergodic and their measures of information and intrinsic computation to complex materials and deterministic chaos and intelligence in Maxwellian demons to quantum compression of classical processes and the evolution of computation and language. This brief review clarifies several misunderstandings and addresses concerns recently raised regarding early works in the field (1980s). We show that misguided evaluations of the contributions of computational mechanics are groundless and stem from a lack of familiarity with its basic goals and from a failure to consider its historical context. For all practical purposes, its modern methods and results largely supersede the early works. This not only renders recent criticism moot and shows the solid ground on which computational mechanics stands but, most importantly, shows the significant progress achieved over three decades and points to the many intriguing and outstanding challenges in understanding the computational nature of complex dynamic systems.

R Packages worth a look

Adjacency-Constrained Clustering of a Block-Diagonal Similarity Matrix (adjclust)
Implements a constrained version of hierarchical agglomerative clustering, in which each observation is associated with a position, and only adjacent clusters can be merged. Typical application fields in bioinformatics include Genome-Wide Association Studies or Hi-C data analysis, where the similarity between items is a decreasing function of their genomic distance. Taking advantage of this feature, the implemented algorithm is time and memory efficient. This algorithm is described in Chapter 4 of Alia Dehman (2015) <https://…/tel-01288568v1>.
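The adjacency constraint is easy to see in a naive sketch. The Python below is not the package’s algorithm (which is written for R and designed to be time and memory efficient); it is a plain greedy loop that assumes average linkage as the merge criterion, and the function name and toy similarity matrix are hypothetical.

```python
def adjacency_constrained_clustering(similarity, n_clusters):
    """Greedy agglomeration in which only neighbouring clusters may merge."""
    # Start with one cluster per observation, kept in positional order.
    clusters = [[i] for i in range(len(similarity))]
    while len(clusters) > n_clusters:
        best_index, best_score = 0, float("-inf")
        # Only consecutive pairs (k, k + 1) are candidates for merging.
        for k in range(len(clusters) - 1):
            left, right = clusters[k], clusters[k + 1]
            # Average pairwise similarity between the two clusters
            # (average linkage is an assumption made for this sketch).
            score = sum(similarity[i][j] for i in left for j in right)
            score /= len(left) * len(right)
            if score > best_score:
                best_index, best_score = k, score
        merged = clusters[best_index] + clusters[best_index + 1]
        clusters[best_index:best_index + 2] = [merged]
    return clusters


# Toy block-diagonal similarity: items 0-2 form one block, items 3-4 another.
blocks = [0, 0, 0, 1, 1]
similarity = [
    [0.9 if blocks[i] == blocks[j] else 0.1 for j in range(5)]
    for i in range(5)
]
result = adjacency_constrained_clustering(similarity, n_clusters=2)
```

Because candidates are restricted to neighbours, each pass inspects only a linear number of pairs rather than all pairs, which is the property the package exploits for efficiency.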

Initialization Algorithms for Partitioning Cluster Analysis (inaparc)
Partitioning clustering algorithms divide data sets into k subsets or partitions, which are called clusters. They require initialization procedures before they can start partitioning the data sets. Initialization of cluster prototypes is one such procedure for most partitioning algorithms. Cluster prototypes are the data elements, i.e. centroids or medoids, representing the clusters in a data set. In order to initialize cluster prototypes, the package ‘inaparc’ contains a set of functions that implement several linear time-complexity and loglinear time-complexity methods in addition to some novel techniques. Initialization of fuzzy membership degrees matrices is another important task for starting the probabilistic and possibilistic partitioning algorithms. In order to initialize membership degrees matrices required by these algorithms, a number of functions based on some traditional and novel initialization techniques are also available in the package ‘inaparc’.

Relative Importance PCA Regression (RelimpPCR)
Performs Principal Components Analysis (also known as PCA) dimensionality reduction in the context of a linear regression. In most cases, PCA dimensionality reduction is performed independent of the response variable for a regression. This captures the majority of the variance of the model’s predictors, but may not actually be the optimal dimensionality reduction solution for a regression against the response variable. An alternative method, optimized for a regression against the response variable, is to use both PCA and a relative importance measure. This package applies PCA to a given data frame of predictors, and then calculates the relative importance of each PCA factor against the response variable. It outputs ordered factors that are optimized for model fit. By performing dimensionality reduction with this method, an individual can achieve the same r-squared value as performing just PCA, but with fewer PCA factors. References: Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani (2013) <http://…/>.

Event History Analysis (eha)
Sampling of risk sets in Cox regression, selections in the Lexis diagram, bootstrapping. Parametric proportional hazards fitting with left truncation and right censoring for common families of distributions, piecewise constant hazards, and discrete models. AFT regression for left truncated and right censored data.

Interactive, Complex Heatmaps (iheatmapr)
Make complex, interactive heatmaps. ‘iheatmapr’ includes a modular system for iteratively building up complex heatmaps, as well as the iheatmap() function for making relatively standard heatmaps.

Bayes Factors for Hierarchical Linear Models with Continuous Predictors (BayesRS)
Runs hierarchical linear Bayesian models. Samples from the posterior distributions of model parameters in JAGS (Just Another Gibbs Sampler; Plummer, 2003, <http://…/> ). Computes Bayes factors for group parameters of interest with the Savage-Dickey density ratio (Wetzels, Raaijmakers, Jakab, Wagenmakers, 2009, <doi:10.3758/PBR.16.4.752>).

Document worth reading: “Auto-scaling Web Applications in Clouds: A Taxonomy and Survey”

Web application providers have been migrating their applications to cloud data centers, attracted by the emerging cloud computing paradigm. One of the appealing features of cloud is elasticity. It allows cloud users to acquire or release computing resources on demand, which enables web application providers to auto-scale the resources provisioned to their applications under dynamic workload in order to minimize resource cost while satisfying Quality of Service (QoS) requirements. In this paper, we comprehensively analyze the challenges that remain in auto-scaling web applications in clouds and review the developments in this field. We present a taxonomy of auto-scaling systems according to the identified challenges and key properties. We analyze the surveyed works and map them to the taxonomy to identify the weaknesses in this field. Moreover, based on the analysis, we propose new future directions.

Magister Dixit

“Feature engineering and feature selection are not mutually exclusive. They are both useful. I’d say feature engineering is more important though, especially because you can’t really automate it.” Robert Neuhaus

Distilled News

The idea of storytelling is fascinating: to take an idea or an incident, and turn it into a story. It brings the idea to life and makes it more interesting. This happens in our day-to-day life. Whether we narrate a funny incident or our findings, stories have always been the “go-to” to draw interest from listeners and readers alike. For instance, when we talk of how one of our friends got scolded by a teacher, we tend to narrate the incident from the beginning so that a flow is maintained. Let’s take an example of the most common driving distractions by gender. There are two ways to tell this.
Emerging pattern mining is a data mining task that aims to discover discriminative patterns, which can describe emerging behavior with respect to a property of interest. In recent years, the description of datasets has become an interesting field due to the easy acquisition of knowledge by the experts. In this review, we will focus on the descriptive point of view of the task. We collect the existing approaches that have been proposed in the literature and group them together in a taxonomy in order to obtain a general vision of the task. A complete empirical study demonstrates the suitability of the approaches presented. This review also presents future trends and emerging prospects within pattern mining and the benefits of knowledge extracted from emerging patterns.
Unless you’re involved in anomaly detection you may never have heard of Unsupervised Decision Trees. It’s a very interesting approach to decision trees that on the surface doesn’t sound possible but in practice is the backbone of modern intrusion detection.
Today, the volume of data is often too big for a single server – node – to process. Therefore, there was a need to develop code that runs on multiple nodes. Writing distributed systems is an endless array of problems, so people developed multiple frameworks to make our lives easier. MapReduce is a framework that allows the user to write code that is executed on multiple nodes without having to worry about fault tolerance, reliability, synchronization or availability.
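To make the programming model concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word counting as the example. It is only an illustration: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, everything runs on one machine, and the distribution and fault tolerance that a real framework provides are left out entirely.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    In a real cluster the framework does this between the map and
    reduce phases; here it is a plain in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values that share one key."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in sequence on a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data on many nodes", "many nodes many problems"]
    print(word_count(docs))
```

The user only writes the map and reduce functions. Because each map call looks at a single document and each reduce call at a single key, a framework is free to run them on different nodes.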
Traditional approaches to string matching such as the Jaro-Winkler or Levenshtein distance measure are too slow for large datasets. Using TF-IDF with N-Grams as terms to find similar strings transforms the problem into a matrix multiplication problem, which is computationally much cheaper. Using this approach made it possible to search for near duplicates in a set of 663,000 company names in 42 minutes using only a dual-core laptop.
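The idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the implementation behind the figures above: it uses character trigrams, a smoothed IDF weighting chosen here for convenience, and dictionaries in place of the sparse matrices that a large-scale version would multiply. The names `ngrams`, `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Split a string into overlapping character n-grams."""
    text = text.lower().strip()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(names, n=3):
    """Return one L2-normalised TF-IDF vector (as a dict) per name."""
    counts = [Counter(ngrams(name, n)) for name in names]
    # Document frequency: in how many names does each n-gram occur?
    doc_freq = Counter(gram for c in counts for gram in c)
    total = len(names)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumption of this sketch).
        vec = {
            gram: tf * (math.log((1 + total) / (1 + doc_freq[gram])) + 1)
            for gram, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two normalised sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(weight * v.get(gram, 0.0) for gram, weight in u.items())


if __name__ == "__main__":
    names = ["Acme Corporation", "ACME Corp", "Globex Ltd"]
    vecs = tfidf_vectors(names)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            print(names[i], "|", names[j], "|", round(cosine(vecs[i], vecs[j]), 3))
```

Because the vectors are normalised, the similarity of every pair is a dot product. Stacking the vectors as rows of a sparse matrix and multiplying it by its transpose computes all pairs at once, which is what turns string matching into a matrix multiplication problem.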
At the Strata Big Data Conference in New York, one of the major themes was the responsibility that data scientists have to do their best to prevent the biases and prejudices that exist in society from creeping into data and the way algorithms are built.
In our recent Planet: Understanding the Amazon from Space competition, Planet challenged the Kaggle community to label satellite images from the Amazon basin, in order to better track and understand causes of deforestation.
Random Forest is one of the most popular and powerful ensemble methods used today in machine learning. This post is an introduction to the algorithm and provides a brief overview of its inner workings.
Do you know what’s more dangerous than artificial intelligence? Natural stupidity. In this article, I will explore natural stupidity in more detail and show how our current technology (driven by narrow artificial intelligence) is making us collectively dumber. We’ve all had this experience of using a GPS to guide us around an unfamiliar place only to realize later that we have no recollection or ability to get to that place again without the aid of a GPS. Not only is our directional instinct diminished through lack of use, but so are our own memories. We’ve all experienced losing our ability to recall due to our overuse of Google. We now recall how we can search for something rather than the details of that something. The framework that I often use to explore intuition is the Cognitive Bias Codex found at Wikipedia. It’s a massive list of biases; to get an overview of it, however, there are four high-level categories that are the drivers of these biases. These are “Too Much Information”, “Not Enough Meaning”, “Need to Act Fast” and “What Should we Remember?”.
As practitioners who build data science tools, we seem to have a rather myopic obsession with the challenges faced by the Googles, Amazons, and Facebooks of the world—companies with massive and mature data analytics ecosystems, supported by experienced systems engineers, and used by data scientists who are capable programmers. However, these companies represent a tiny fraction of the “big data” universe. It’s helpful to think of them as the “1% of big data”: the minority whose struggles are not often what the rest of the “big data” world faces. Yet, they occupy the majority of discourse around how to utilize the latest tools and technologies in the industry.
In Containerizing Continuous Delivery in Java we explored the fundamentals of packaging and deploying Java applications within Docker containers. This was only the first step in creating production-ready, container-based systems. Running containers at any real-world scale requires a container orchestration and scheduling platform, and although many exist (e.g., Docker Swarm, Apache Mesos, and AWS ECS), the most popular is Kubernetes. Kubernetes is used in production at many organizations, and is now hosted by the Cloud Native Computing Foundation (CNCF). In this article, we will take the previous simple Java-based, e-commerce shop that we packaged within Docker containers and run this on Kubernetes.
What you need to know before committing to AI.
The Julia programming language is growing fast and its efficiency and speed are now well-known. Even though I think R is the best language for Data Science, sometimes we just need more. Modelling is an important part of Data Science and sometimes you may need to implement your own algorithms or adapt existing models to your problems. If performance is not essential and the complexity of your problem is small, R alone is enough. However, if you need to run the same model several times on large datasets and available implementations are not suited to your problem, you will need to go beyond R. Fortunately, you can go beyond R in R, which is great because you can do your analysis in R and call complex models from elsewhere. The book “Extending R” from John Chambers presents interfaces in R for C++, Julia and Python. The last two are in the XRJulia and in the XRPython packages, which are very straightforward.
Olga’s talk was entitled ‘How we built a Shiny App for 700 users?’ She went over the main challenges associated with scaling a Shiny application, and the methods we used to resolve them. The talk was partly in the form of a case study based on Appsilon’s experience.
If you hang out on Meta Stack Overflow, you may have noticed news from time to time about A/B tests of various features here at Stack Overflow. We use A/B testing to compare a new version to a baseline for a design, a machine learning model, or practically any feature of what we do here at Stack Overflow; these tests are part of our decision-making process. Which version of a button, predictive model, or ad is better? We don’t have to guess blindly, but instead we can use tests as part of our decision-making toolkit.
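As an illustration of how such a comparison can be evaluated, here is a minimal sketch of a two-sided two-proportion z-test on conversion counts. It is an assumption about one common way to analyze an A/B test, not a description of how Stack Overflow runs theirs. The function name `two_proportion_z_test` and the example numbers are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of baseline A and variant B.

    conv_* are conversion counts and n_* are the numbers of users shown
    each version. Returns (z, two-sided p-value) under the usual
    normal approximation with a pooled standard error.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / std_err
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 200 of 2000 clicks on A, 260 of 2000 on B.
    z, p = two_proportion_z_test(200, 2000, 260, 2000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the two versions is unlikely to be due to chance alone. It is one input to the decision, not the decision itself.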