# Book Memo: “Convex Analysis and Global Optimization”

 This book presents state-of-the-art results and methodologies in modern global optimization, and has been a staple reference for researchers, engineers, advanced students (also in applied mathematics), and practitioners in various fields of engineering. The second edition has been brought up to date and continues to develop a coherent and rigorous theory of deterministic global optimization, highlighting the essential role of convex analysis. The text has been revised and expanded to meet the needs of research, education, and applications for many years to come.

# R Packages worth a look

Plotting for Bayesian Models (bayesplot)
Plotting functions for posterior analysis, model checking, and MCMC diagnostics. The package is designed not only to provide convenient functionality for users, but also a common set of functions that can be easily used by developers working on a variety of R packages for Bayesian modeling, particularly (but not exclusively) packages interfacing with Stan.

Frequency Distribution (freqdist)
Generates a frequency distribution. The frequency distribution includes raw frequencies, percentages in each category, and cumulative frequencies. The frequency distribution can be stored as a data frame.

Generated Effect Modifier (pirate)
An implementation of the generated effect modifier (GEM) method. This method constructs composite variables by linearly combining pre-treatment scalar patient characteristics to create optimal treatment effect modifiers in linear models. The optimal linear combination is called a GEM. Treatment is assumed to have been assigned at random. For reference, see E Petkova, T Tarpey, Z Su, and RT Ogden. Generated effect modifiers (GEMs) in randomized clinical trials. Biostatistics (First published online: July 27, 2016, <doi:10.1093/biostatistics/kxw035>).

Ratio-of-Uniforms Sampling for Bayesian Extreme Value Analysis (revdbayes)
Provides functions for the Bayesian analysis of extreme value models. The ‘rust’ package <https://…/package=rust> is used to simulate a random sample from the required posterior distribution. The functionality of ‘revdbayes’ is similar to the ‘evdbayes’ package <https://…/package=evdbayes>, which uses Markov Chain Monte Carlo (MCMC) methods for posterior simulation. See the ‘revdbayes’ website for more information, documentation and examples.

Fitting log-Linear Models in Sparse Contingency Tables (eMLEloglin)
Log-linear modeling is a popular method for the analysis of contingency table data. When the table is sparse, the data can fall on the boundary of the convex support, and we say that ‘the MLE does not exist’ in the sense that some parameters cannot be estimated. However, an extended MLE always exists, and a subset of the original parameters will be estimable. The ‘eMLEloglin’ package determines which sampling zeros contribute to the non-existence of the MLE. These problematic zero cells can be removed from the contingency table and the model can then be fit (as far as is possible) using the glm() function.

# If you did not already know: “Unit of Analysis”

One of the most important ideas in a research project is the unit of analysis. The unit of analysis is the major entity that you are analyzing in your study. For instance, any of the following could be a unit of analysis in a study:
• individuals
• groups
• artifacts (books, photos, newspapers)
• geographical units (town, census tract, state)
• social interactions (dyadic relations, divorces, arrests)
Why is it called the ‘unit of analysis’ and not something else (like, the unit of sampling)? Because it is the analysis you do in your study that determines what the unit is. For instance, if you are comparing the children in two classrooms on achievement test scores, the unit is the individual child because you have a score for each child. On the other hand, if you are comparing the two classes on classroom climate, your unit of analysis is the group, in this case the classroom, because you only have a classroom climate score for the class as a whole and not for each individual student. For different analyses in the same study you may have different units of analysis. If you decide to base an analysis on student scores, the individual is the unit. But you might decide to compare average classroom performance. In this case, since the data that goes into the analysis is the average itself (and not the individuals’ scores) the unit of analysis is actually the group. Even though you had data at the student level, you use aggregates in the analysis. In many areas of social research these hierarchies of analysis units have become particularly important and have spawned a whole area of statistical analysis sometimes referred to as hierarchical modeling. This is true in education, for instance, where we often compare classroom performance but collected achievement data at the individual student level. …
Unit of Analysis

# Book Memo: “Learning Analytics”

 Measurement Innovations to Support Employee Development The potential to improve education due to the large amounts of data on learning and learners is unprecedented and has created an information gap in understanding what to do with all the raw data. Providing a framework for understanding how to work with learning analytics, authors John R. Mattox II and Jean Martin show L&D and HR practitioners the power that effective analytics has on building an organization and the impact this power has on performance, talent management, and competitive advantage. Martin and Mattox focus on aligning training with business needs and answering the questions “Is training effective?” and “How can it improved or made more effective?” Beginning with an explanation of what learning analytics is and the business need for it, they move on to applying business intelligence principles, linking learning to impact, connecting training content with business needs, optimizing investments in learning, and placing learning development within the larger scope of talent management. Chapters include case studies from Hilton Hotels, Shell Oil, and American Express to highlight best practice and to provide examples of how companies apply various methodologies across a range of industries.

# R Packages worth a look

In confirmatory factor analysis (CFA), structural constraints typically ensure that the model is identified up to all possible reflections, i.e., column sign changes of the matrix of loadings. Such reflection invariance is problematic for Bayesian CFA when the reflection modes are not well separated in the posterior distribution. Imposing rotational constraints — fixing some loadings to be zero or positive in order to pick a factor solution that corresponds to one reflection mode — may not provide a satisfactory solution for Bayesian CFA. The function ‘relabel’ uses the relabeling algorithm of Erosheva and Curtis to correct for sign invariance in MCMC draws from CFA models. The MCMC draws should come from Bayesian CFA models that are fit without rotational constraints.

Selection of Number of Clusters via Normalized Clustering Instability (cstab)
Selection of the number of clusters in cluster analysis using stability methods.

Visualization for ‘diffrprojects’ (diffrprojectswidget)
Interactive visualizations and tabulations for diffrprojects. All presentations are based on the htmlwidgets framework allowing for interactivity via HTML and Javascript, Rstudio viewer integration, RMarkdown integration, as well as Shiny compatibility.

Continuous Threshold Expectile Regression (cthreshER)
Estimation and inference methods for the continuous threshold expectile regression. It can fit the continuous threshold expectile regression and test the existence of change point, for the paper, ‘Feipeng Zhang and Qunhua Li (2016). A continuous threshold expectile regression, submitted.’

Systematic Conservation Prioritization in R (prioritizr)
Solve systematic reserve design problems using integer programming techniques. To solve problems most efficiently users can install three optional packages not available on CRAN: the ‘gurobi’ optimizer (available from <http://…/> ) and the conservation prioritization package ‘marxan’ (available from <https://…/marxan> ).

# If you did not already know: “Web Ontology Language (OWL)”

The Web Ontology Language (OWL) is a family of knowledge representation languages for authoring ontologies. Ontologies are a formal way to describe taxonomies and classification networks, essentially defining the structure of knowledge for various domains: the nouns representing classes of objects and the verbs representing relations between the objects. Ontologies resemble class hierarchies in object-oriented programming but there are several critical differences. Class hierarchies are meant to represent structures used in source code that evolve fairly slowly (typically monthly revisions) where as ontologies are meant to represent information on the Internet and are expected to be evolving almost constantly. Similarly, ontologies are typically far more flexible as they are meant to represent information on the Internet coming from all sorts of heterogeneous data sources. Class hierarchies on the other hand are meant to be fairly static and rely on far less diverse and more structured sources of data such as corporate databases. The OWL languages are characterized by formal semantics. They are built upon a W3C XML standard for objects called the Resource Description Framework (RDF). OWL and RDF have attracted significant academic, medical and commercial interest. In October 2007, a new W3C working group was started to extend OWL with several new features as proposed in the OWL 1.1 member submission. W3C announced the new version of OWL on 27 October 2009. This new version, called OWL 2, soon found its way into semantic editors such as Protégé and semantic reasoners such as Pellet, RacerPro, FaCT++ and HermiT. The OWL family contains many species, serializations, syntaxes and specifications with similar names. OWL and OWL2 are used to refer to the 2004 and 2009 specifications, respectively. Full species names will be used, including specification version (for example, OWL2 EL). When referring more generally, OWL Family will be used. … Web Ontology Language (OWL)

# Distilled News

Recently I wanted to extract a table from a pdf file so that I could work with the table in R. Specifically, I wanted to get data on layoffs in California from the California Employment Development Department. The EDD publishes a list of all of the layoffs in the state that fall under the WARN act here. Unfortunately, the tables are available only in pdf format. I wanted an interactive version of the data that I could work with in R and export to a csv file. Fortunately, the tabulizer package in R makes this a cinch. In this post, I will use this scenario as a working example to show how to extract data from a pdf file using the tabulizer package in R.
Change Point Detection (CPD) refers to the problem of estimating the time at which the statistical properties of a time series… well… change. It originates in the 1950s, as a method used to automatically detect failures in industrial processes (quality control) and it is currently an active area of research that can boast of having a website on its own. CPD is a generally interesting problem with lots of potential applications other than quality control, ranging from predicting anomalies in the electricity market (and, more generally, in financial markets) to detecting security attacks in a network system or even detecting electrical activity in the brain. The point is to have an algorithm that can automatically detect changes in the properties of the time series for us to make the appropriate decisions. Whatever the application, the general framework is always the same: the underlying probability distribution function of the time series is assumed to change at one (or more) moments in time.
After spending a day the other week struggling to make sense of a federal data set shared in an archaic format (ASCII fixed format dat file). It is essential for the effective distribution and sharing of data that it use the minimum amount of disk space and be rapidly accessible for use by potential users. In this post I test four different file formats available to R users. These formats are comma separated values csv (write.csv()), object representation format as a ASCII txt (dput()), a serialized R object (saveRDS()), and a Stata file (write.dta() from the foreign package). For reference, rds files seem to be identical to Rdata files except that they deal with only one object rather than potentially multiple.
We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being grounded in vision enough to allow objective evaluation of individual responses and benchmark progress. We develop a novel two-person chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). Data collection is underway and on completion, VisDial will contain 1 dialog with 10 question-answer pairs on all ~200k images from COCO, with a total of 2M dialog question-answer pairs. We introduce a family of neural encoder-decoder models for Visual Dialog with 3 encoders — Late Fusion, Hierarchical Recurrent Encoder and Memory Network — and 2 decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a set of candidate answers and evaluated on metrics such as mean-reciprocal-rank of human response. We quantify gap between machine and human performance on the Visual Dialog task via human studies. Our dataset, code, and trained models will be released publicly. Putting it all together, we demonstrate the first ‘visual chatbot’!
Dgraph is a distributed, low-latency, high throughput, native graph database. It’s written from scratch in Go[1] language, with concurrency and scalability in mind.

# Whats new on arXiv

Within the framework of functional data analysis, we develop principal component analysis for periodically correlated time series of functions. We define the components of the above analysis including periodic, operator-valued filters, score processes and the inversion formulas. We show that these objects are defined via convergent series under a simple condition requiring summability of the Hilbert-Schmidt norms of the filter coefficients, and that they poses optimality properties. We explain how the Hilbert space theory reduces to an approximate finite-dimensional setting which is implemented in a custom build R package. A data example and a simulation study show that the new methodology is superior to existing tools if the functional time series exhibit periodic characteristics.
Causal graphs, such as directed acyclic graphs (DAGs) and partial ancestral graphs (PAGs), represent causal relationships among variables in a model. Methods exist for learning DAGs and PAGs from data and for converting DAGs to PAGs. However, these methods only output a single causal graph consistent with the independencies/dependencies (the Markov equivalence class $M$) estimated from the data. However, many distinct graphs may be consistent with $M$, and a data modeler may wish to select among these using domain knowledge. In this paper, we present a method that makes this possible. We introduce PAG2ADMG, the first method for enumerating all causal graphs consistent with $M$, under certain assumptions. PAG2ADMG converts a given PAG into a set of acyclic directed mixed graphs (ADMGs). We prove the correctness of the approach and demonstrate its efficiency relative to brute-force enumeration.
Machine learning models, including state-of-the-art deep neural networks, are vulnerable to small perturbations that cause unexpected classification errors. This unexpected lack of robustness raises fundamental questions about their generalization properties and poses a serious concern for practical deployments. As such perturbations can remain imperceptible – commonly called adversarial examples that demonstrate an inherent inconsistency between vulnerable machine learning models and human perception – some prior work casts this problem as a security issue as well. Despite the significance of the discovered instabilities and ensuing research, their cause is not well understood, and no effective method has been developed to address the problem highlighted by adversarial examples. In this paper, we present a novel theory to explain why this unpleasant phenomenon exists in deep neural networks. Based on that theory, we introduce a simple, efficient and effective training approach, Batch Adjusted Network Gradients (BANG), which significantly improves the robustness of machine learning models. While the BANG technique does not rely on any form of data augmentation or the application of adversarial images for training, the resultant classifiers are more resistant to adversarial perturbations while maintaining or even enhancing classification performance overall.
Decision tree is an important method for both induction research and data mining, which is mainly used for model classification and prediction. ID3 algorithm is the most widely used algorithm in the decision tree so far. In this paper, the shortcoming of ID3’s inclining to choose attributes with many values is discussed, and then a new decision tree algorithm which is improved version of ID3. In our proposed algorithm attributes are divided into groups and then we apply the selection measure 5 for these groups. If information gain is not good then again divide attributes values into groups. These steps are done until we get good classification/misclassification ratio. The proposed algorithms classify the data sets more accurately and efficiently.
Many prediction tasks contain uncertainty. In the case of next-frame or future prediction the uncertainty is inherent in the task itself, as it is impossible to foretell what exactly is going to happen in the future. Another source of uncertainty or ambiguity is the way data is labeled. Sometimes not all objects of interest are annotated in a given image or the annotation is ambiguous, e.g. in the form of occluded joints in human pose estimation. We present a method that is able to handle these problems by predicting not a single output but multiple hypotheses. More precisely, we propose a framework for re-formulating existing single prediction models as multiple hypothesis prediction (MHP) problems as well as a meta loss and an optimization procedure to train the resulting MHP model. We consider three entirely different applications, i.e. future prediction, image classification and human pose estimation, and demonstrate how existing single hypothesis predictors (SHPs) can be turned into MHPs. The performed experiments show that the resulting MHP outperforms the existing SHP and yields additional insights regarding the variation and ambiguity of the predictions.
Fully convolutional neural networks give accurate, per-pixel prediction for input images and have applications like semantic segmentation. However, a typical FCN usually requires lots of floating point computation and large run-time memory, which effectively limits its usability. We propose a method to train Bit Fully Convolution Network (BFCN), a fully convolutional neural network that has low bit-width weights and activations. Because most of its computation-intensive convolutions are accomplished between low bit-width numbers, a BFCN can be accelerated by an efficient bit-convolution implementation. On CPU, the dot product operation between two bit vectors can be reduced to bitwise operations and popcounts, which can offer much higher throughput than 32-bit multiplications and additions. To validate the effectiveness of BFCN, we conduct experiments on the PASCAL VOC 2012 semantic segmentation task and Cityscapes. Our BFCN with 1-bit weights and 2-bit activations, which runs 7.8x faster on CPU or requires less than 1\% resources on FPGA, can achieve comparable performance as the 32-bit counterpart.
We present a method for inducing new dialogue systems from very small amounts of unannotated dialogue data, showing how word-level exploration using Reinforcement Learning (RL), combined with an incremental and semantic grammar – Dynamic Syntax (DS) – allows systems to discover, generate, and understand many new dialogue variants. The method avoids the use of expensive and time-consuming dialogue act annotations, and supports more natural (incremental) dialogues than turn-based systems. Here, language generation and dialogue management are treated as a joint decision/optimisation problem, and the MDP model for RL is constructed automatically. With an implemented system, we show that this method enables a wide range of dialogue variations to be automatically captured, even when the system is trained from only a single dialogue. The variants include question-answer pairs, over- and under-answering, self- and other-corrections, clarification interaction, split-utterances, and ellipsis. This generalisation property results from the structural knowledge and constraints present within the DS grammar, and highlights some limitations of recent systems built using machine learning techniques only.
Although support vector machines (SVMs) are theoretically well understood, their underlying optimization problem becomes very expensive if, for example, hundreds of thousands of samples and a non-linear kernel are considered. Several approaches have been proposed in the past to address this serious limitation. In this work we investigate a decomposition strategy that learns on small, spatially defined data chunks. Our contributions are two fold: On the theoretical side we establish an oracle inequality for the overall learning method using the hinge loss, and show that the resulting rates match those known for SVMs solving the complete optimization problem with Gaussian kernels. On the practical side we compare our approach to learning SVMs on small, randomly chosen chunks. Here it turns out that for comparable training times our approach is significantly faster during testing and also reduces the test error in most cases significantly. Furthermore, we show that our approach easily scales up to 10 million training samples: including hyper-parameter selection using cross validation, the entire training only takes a few hours on a single machine. Finally, we report an experiment on 32 million training samples.
Distributed representations of words have been shown to capture lexical semantics, as demonstrated by their effectiveness in word similarity and analogical relation tasks. But, these tasks only evaluate lexical semantics indirectly. In this paper, we study whether it is possible to utilize distributed representations to generate dictionary definitions of words, as a more direct and transparent representation of the embeddings’ semantics. We introduce definition modeling, the task of generating a definition for a given word and its embedding. We present several definition model architectures based on recurrent neural networks, and experiment with the models over multiple data sets. Our results show that a model that controls dependencies between the word being defined and the definition words performs significantly better, and that a character-level convolution layer designed to leverage morphology can complement word-level embeddings. Finally, an error analysis suggests that the errors made by a definition model may provide insight into the shortcomings of word embeddings.