M4CD  In this paper, we propose a robust change detection method for intelligent visual surveillance. This method, named M4CD, includes three major steps. Firstly, a sample-based background model that integrates color and texture cues is built and updated over time. Secondly, multiple heterogeneous features (including brightness variation, chromaticity variation, and texture variation) are extracted by comparing the input frame with the background model, and a multi-source learning strategy is designed to online estimate the probability distributions for both foreground and background. The three features are approximately conditionally independent, making multi-source learning feasible. Pixel-wise foreground posteriors are then estimated with Bayes' rule. Finally, Markov random field (MRF) optimization and heuristic post-processing techniques are used sequentially to improve accuracy. In particular, a two-layer MRF model is constructed to represent pixel-based and superpixel-based contextual constraints compactly. Experimental results on the CDnet dataset indicate that M4CD is robust under complex environments and ranks among the top methods.
MAC Network  We present the MAC network, a novel fully differentiable neural network architecture, designed to facilitate explicit and expressive reasoning. Drawing inspiration from first principles of computer organization, MAC moves away from monolithic black-box neural architectures towards a design that encourages both transparency and versatility. The model approaches problems by decomposing them into a series of attention-based reasoning steps, each performed by a novel recurrent Memory, Attention, and Composition (MAC) cell that maintains a separation between control and memory. By stringing the cells together and imposing structural constraints that regulate their interaction, MAC effectively learns to perform iterative reasoning processes that are directly inferred from the data in an end-to-end approach. We demonstrate the model’s strength, robustness and interpretability on the challenging CLEVR dataset for visual reasoning, achieving a new state-of-the-art 98.9% accuracy, halving the error rate of the previous best model. More importantly, we show that the model is computationally efficient and data-efficient, in particular requiring 5x less data than existing models to achieve strong results.
Machine Learning  Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders. The core of machine learning deals with representation and generalization. Representation of data instances and functions evaluated on these instances are part of all machine learning systems. Generalization is the property that the system will perform well on unseen data instances; the conditions under which this can be guaranteed are a key object of study in the subfield of computational learning theory.
Machine Learning Algorithms alphabetically  A list of machine learning algorithms 
Machine Learning Algorithms by Category  A list of machine learning algorithms 
Machine Learning Canvas  A framework to connect the dots between data collection, machine learning, and value creation 
Machine Listening Intelligence  This manifesto paper will introduce machine listening intelligence, an integrated research framework for acoustic and musical signals modelling, based on signal processing, deep learning and computational musicology. 
Machine Reasoning  Imagine that the toddler who was once pushing the glass off the table now understands the physics of movement and gravity. Even without having encountered this situation before, the toddler can surmise what will inevitably happen. The toddler can apply the same logic to another object on the table, adapting that knowledge and applying it to a TV remote on the same table, because he knows why it happens. That’s machine reasoning. Machine reasoning is a more human-like approach within the AI spectrum that is highly relevant to big data investigations, because it allows for more flexible adaptation than machine learning. However, machine reasoning requires heuristics and curation, which is usually done by knowledgeable domain experts. This process is where machine reasoning may be difficult for companies to scale: it requires a great deal of expert human effort for this curation to take place. Machine reasoning is best applied in deterministic scenarios, that is, determining whether something is true or not, or whether something will happen or not. Knowing this, it’s clear why machine learning and machine reasoning work well together.
Machine Teaching  In this paper, we consider the problem of machine teaching, the inverse problem of machine learning. Different from traditional machine teaching which views the learners as batch algorithms, we study a new paradigm where the learner uses an iterative algorithm and a teacher can feed examples sequentially and intelligently based on the current performance of the learner. We show that the teaching complexity in the iterative case is very different from that in the batch case. Instead of constructing a minimal training set for learners, our iterative machine teaching focuses on achieving fast convergence in the learner model. Depending on the level of information the teacher has from the learner model, we design teaching algorithms which can provably reduce the number of teaching examples and achieve faster convergence than learning without teachers. We also validate our theoretical findings with extensive experiments on different data distribution and real image datasets. 
Machine Vision (MV) 
Machine vision (MV) is the technology and methods used to provide imaging-based automatic inspection and analysis for such applications as automatic inspection, process control, and robot guidance in industry. The scope of MV is broad. MV is related to, though distinct from, computer vision.
Machines Talking To Machines (M2M) 
We propose Machines Talking To Machines (M2M), a framework combining automation and crowdsourcing to rapidly bootstrap end-to-end dialogue agents for goal-oriented dialogues in arbitrary domains. M2M scales to new tasks with just a task schema and an API client from the dialogue system developer, but it is also customizable to cater to task-specific interactions. Compared to the Wizard-of-Oz approach for data collection, M2M achieves greater diversity and coverage of salient dialogue flows while maintaining the naturalness of individual utterances. In the first phase, a simulated user bot and a domain-agnostic system bot converse to exhaustively generate dialogue ‘outlines’, i.e. sequences of template utterances and their semantic parses. In the second phase, crowd workers provide contextual rewrites of the dialogues to make the utterances more natural while preserving their meaning. The entire process can finish within a few hours. We propose a new corpus of 3,000 dialogues spanning 2 domains collected with M2M, and present comparisons with popular dialogue datasets on the quality and diversity of the surface forms and dialogue flows.
MAESTRO  We present MAESTRO, a framework to describe and analyze CNN dataflows, and predict performance and energy efficiency when running neural network layers across various hardware configurations. This includes two components: (i) a concise language to describe arbitrary dataflows and (ii) an analysis framework that accepts the dataflow description, hardware resource description, and DNN layer description as inputs and generates buffer requirements, buffer access counts, network-on-chip (NoC) bandwidth requirements, and roofline performance information. We demonstrate both components across several dataflows as case studies.
MAgent  We introduce MAgent, a platform to support research and development of many-agent reinforcement learning. Unlike previous research platforms on single or multi-agent reinforcement learning, MAgent focuses on supporting the tasks and the applications that require hundreds to millions of agents. Within the interactions among a population of agents, it enables not only the study of learning algorithms for agents’ optimal policies, but more importantly, the observation and understanding of individual agents’ behaviors and social phenomena emerging from the AI society, including communication languages, leadership, and altruism. MAgent is highly scalable and can host up to one million agents on a single GPU server. MAgent also provides flexible configurations for AI researchers to design their customized environments and agents. In this demo, we present three environments designed on MAgent and show emerged collective intelligence by learning from scratch.
Magnetic Laplacian Matrix  MagneticMap 
Magnitude-Shape Plot  This article proposes a new graphical tool, the magnitude-shape (MS) plot, for visualizing both the magnitude and shape outlyingness of multivariate functional data. The proposed tool builds on the recent notion of functional directional outlyingness, which measures the centrality of functional data by simultaneously considering the level and the direction of their deviation from the central region. The MS-plot intuitively presents not only levels but also directions of magnitude outlyingness on the horizontal axis or plane, and demonstrates shape outlyingness on the vertical axis. A dividing curve or surface is provided to separate non-outlying data from the outliers. Both the simulated data and the practical examples confirm that the MS-plot is superior to existing tools for visualizing centrality and detecting outliers for functional data.
Mahalanobis Distance  The Mahalanobis distance is a descriptive statistic that provides a relative measure of a data point’s distance (residual) from a common point. It is a unitless measure introduced by P. C. Mahalanobis in 1936. The Mahalanobis distance is used to identify and gauge the similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant. In other words, it acts as a multivariate effect size.
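A minimal Python sketch of the distance for two-dimensional data, inverting the 2x2 covariance matrix by hand (the points, covariances, and function name are illustrative, not from the entry):

```python
import math

def mahalanobis_2d(x, mean, cov):
    """D_M(x) = sqrt((x - mean)^T cov^-1 (x - mean)) for 2-D data."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]  # 2x2 matrix inverse
    dx = (x[0] - mean[0], x[1] - mean[1])
    # quadratic form dx^T * inv * dx
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)

# With an identity covariance the Mahalanobis distance reduces to Euclidean:
d_euclid = mahalanobis_2d((3.0, 4.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]])
print(d_euclid)  # 5.0

# Larger variance in the second coordinate shrinks distances along that axis:
d_scaled = mahalanobis_2d((0.0, 2.0), (0.0, 0.0), [[1.0, 0.0], [0.0, 4.0]])
print(d_scaled)  # 1.0
```

The second call illustrates the scale-invariance mentioned above: a point two units away along a high-variance axis is "closer", in Mahalanobis terms, than the raw Euclidean distance suggests.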
Malware Analysis and Attribution using Genetic Information (MAAGI) 
Artificial intelligence methods have often been applied to perform specific functions or tasks in the cyber-defense realm. However, as adversary methods become more complex and difficult to divine, piecemeal efforts to understand cyberattacks, and malware-based attacks in particular, are not providing sufficient means for malware analysts to understand the past, present and future characteristics of malware. In this paper, we present the Malware Analysis and Attribution using Genetic Information (MAAGI) system. The underlying idea behind the MAAGI system is that there are strong similarities between malware behavior and biological organism behavior, and applying biologically inspired methods to corpora of malware can help analysts better understand the ecosystem of malware attacks. Due to the sophistication of the malware and the analysis, the MAAGI system relies heavily on artificial intelligence techniques to provide this capability. It has already yielded promising results over its development life, and will hopefully inspire more integration between the artificial intelligence and cyber-defense communities.
Managed Memory Computing (MMC) 
Aggregated data cubes are the most effective form of storage of aggregated or summarized data for quick analysis. This technology is driven by Online Analytical Processing technology. Utilizing these data cubes involves intense disk I/O operations, which at times lowers the speed for users of data. Conventional in-memory processing does not rely on stored and summarized or aggregated data but brings all the relevant data to memory; this technology then utilizes intense processing and large amounts of memory to perform all calculations and aggregations while in memory. Managed Memory Computing blends the best of both methods, allowing users to define data cubes with pre-structured and aggregated data, providing a logical business layer to users, and offering in-memory computation. These features make the response time for user interactions far superior and enable the most balanced approach between disk I/O and in-memory processing. The hybrid approach of Managed Memory Computing provides analysis, dashboards, graphical interaction, ad hoc querying, presentation, and discussion-driven analytics at blazing speeds, making the Business Intelligence Tool ready for everything from an interactive session in the boardroom to a production planning meeting on the factory floor.
Managed R Archive Network (MRAN) 
Revolution Analytics’ Managed R Archive Network 
Mandolin  Markov Logic Networks join probabilistic modeling with first-order logic and have been shown to integrate well with the Semantic Web foundations. While several approaches have been devised to tackle the subproblems of rule mining, grounding, and inference, no comprehensive workflow has been proposed so far. In this paper, we fill this gap by introducing a framework called Mandolin, which implements a workflow for knowledge discovery specifically on RDF datasets. Our framework imports knowledge from referenced graphs, creates similarity relationships among similar literals, and relies on state-of-the-art techniques for rule mining, grounding, and inference computation. We show that our best configuration scales well and achieves at least comparable results with respect to other statistical relational learning algorithms on link prediction.
Manhattan Distance  Taxicab geometry, considered by Hermann Minkowski in 19th century Germany, is a form of geometry in which the usual distance function or metric of Euclidean geometry is replaced by a new metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. The taxicab metric is also known as rectilinear distance, L1 distance or norm, city block distance, Manhattan distance, or Manhattan length, with corresponding variations in the name of the geometry. The latter names allude to the grid layout of most streets on the island of Manhattan, which causes the shortest path a car could take between two intersections in the borough to have length equal to the intersections’ distance in taxicab geometry. 
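A small Python sketch contrasting the taxicab (L1) metric with the Euclidean (L2) one (function names are ours):

```python
import math

def manhattan(p, q):
    """Sum of absolute coordinate differences (city-block / L1 distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Ordinary straight-line (L2) distance, for comparison."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two "intersections" three blocks east and four blocks north apart:
print(manhattan((0, 0), (3, 4)))  # 7   -- the distance a taxicab must drive
print(euclidean((0, 0), (3, 4)))  # 5.0 -- the straight-line distance
```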
Manhattan Plot  A Manhattan plot is a type of scatter plot, usually used to display data with a large number of data points – many of non-zero amplitude, and with a distribution of higher-magnitude values, for instance in genome-wide association studies (GWAS). It gains its name from the similarity of such a plot to the Manhattan skyline: a profile of skyscrapers towering above the lower level “buildings” which vary around a lower height.
Manifold Learning  Manifold Learning (often also referred to as nonlinear dimensionality reduction) pursues the goal of embedding data that originally lies in a high dimensional space in a lower dimensional space, while preserving characteristic properties. This is possible because for any high dimensional data to be interesting, it must be intrinsically low dimensional. For example, images of faces might be represented as points in a high dimensional space (let’s say your camera has 5MP – so your images, considering each pixel consists of three values, lie in a 15M dimensional space), but not every 5MP image is a face. Faces lie on a submanifold in this high dimensional space. A submanifold is locally Euclidean, i.e. if you take two very similar points, for example two images of identical twins, you can interpolate between them and still obtain an image on the manifold, but globally not Euclidean – if you take two images that are very different – for example Arnold Schwarzenegger and Hillary Clinton – you cannot interpolate between them. I develop algorithms that map these high dimensional data points into a low dimensional space, while preserving local neighborhoods. This can be interpreted as a nonlinear generalization of PCA.
ManiFool  Deep convolutional neural networks have been shown to be vulnerable to arbitrary geometric transformations. However, there is no systematic method to measure the invariance properties of deep networks to such transformations. We propose ManiFool as a simple yet scalable algorithm to measure the invariance of deep networks. In particular, our algorithm measures the robustness of deep networks to geometric transformations in a worst-case regime, as they can be problematic for sensitive applications. Our extensive experimental results show that ManiFool can be used to measure the invariance of fairly complex networks on high dimensional datasets, and that these values can be used for analyzing the reasons behind it. Furthermore, we build on ManiFool to propose a new adversarial training scheme and we show its effectiveness in improving the invariance properties of deep neural networks.
MannKendall Trend Test (MK Test) 
Given n consecutive observations of a time series z_t, t = 1, …, n, Mann (1945) suggested using the Kendall rank correlation of z_t with t, t = 1, …, n, to test for monotonic trend. ➚ “Kendall Rank Correlation Coefficient” 
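The Kendall correlation of z_t with t reduces to counting the signs of all pairwise differences z_j - z_i for i < j, which gives the Mann-Kendall S statistic. A minimal Python sketch (the function name is ours; the full test additionally requires the variance of S and a normal approximation to obtain a p-value):

```python
def mann_kendall_s(z):
    """Mann-Kendall S: S > 0 suggests an increasing monotonic trend,
    S < 0 a decreasing one, and S near 0 no trend."""
    s, n = 0, len(z)
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = z[j] - z[i]
            s += (diff > 0) - (diff < 0)  # sign of the pairwise difference
    return s

print(mann_kendall_s([1, 2, 3, 4, 5]))  # 10  (all 10 pairs increasing)
print(mann_kendall_s([5, 4, 3, 2, 1]))  # -10 (all 10 pairs decreasing)
```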
Map-Based Multi-Policy Reinforcement Learning (MMPRL) 
In order for robots to perform mission-critical tasks, it is essential that they are able to quickly adapt to changes in their environment as well as to injuries and/or other bodily changes. Deep reinforcement learning has been shown to be successful in training robot control policies for operation in complex environments. However, existing methods typically employ only a single policy. This can limit the adaptability, since a large environmental modification might require a completely different behavior compared to the learning environment. To solve this problem, we propose Map-based Multi-Policy Reinforcement Learning (MMPRL), which aims to search and store multiple policies that encode different behavioral features while maximizing the expected reward in advance of the environment change. Thanks to these policies, which are stored into a multidimensional discrete map according to their behavioral features, adaptation can be performed within reasonable time without retraining the robot. An appropriate pre-trained policy from the map can be recalled using Bayesian optimization. Our experiments show that MMPRL enables robots to quickly adapt to large changes without requiring any prior knowledge on the type of injuries that could occur. A highlight of the learned behaviors can be found here: https://youtu.be/qcCepAKL32U .
Maple  Maple combines the world’s most powerful math engine with an interface that makes it extremely easy to analyze, explore, visualize, and solve mathematical problems. 
mapnik  Mapnik is a high-powered rendering library that can take GIS data from a number of sources (ESRI shapefiles, PostGIS databases, etc.) and use them to render beautiful 2-dimensional maps. It’s used as the underlying rendering solution for a lot of online mapping services, most notably including MapQuest and the OpenStreetMap project, so it’s a truly production-quality framework. And, despite being written in C++, it comes with bindings for Python and Node, so you can leverage it in the language of your choice. Render Google Maps Tiles with Mapnik and Python 
MapReduce for C (MR4C) 
MR4C is an implementation framework that allows you to run native code within the Hadoop execution framework. Pairing the performance and flexibility of natively developed algorithms with the unfettered scalability and throughput inherent in Hadoop, MR4C enables large-scale deployment of advanced data processing applications.
Marian  We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.
Marimekko Chart  The Marimekko name has been adopted within business and the management consultancy industry to refer to a bar chart where all the bars are of equal height, there are no spaces between the bars, and the bars are in turn each divided into segments of different width. The design of the ‘marimekko’ chart is said to resemble a Marimekko print. The chart’s design encodes two variables (such as percentage of sales and market share), but it is criticised for making the data hard to perceive and to compare visually. 
Marked Point Process (MPP) 
A simple temporal point process (SPP) is an important class of time series, where the sample realization of the process is solely composed of the times at which events occur. Particular examples of point process data are neuronal spike patterns or spike trains, and a large number of distance and similarity metrics for those data have been proposed. A marked point process (MPP) is an extension of a simple temporal point process, in which a certain vector-valued mark is associated with each of the temporal points in the SPP. Analyses of MPPs are of practical importance because instances of MPPs include recordings of natural disasters such as earthquakes and tornadoes. In this paper, we introduce an R package mmpp, which implements a number of distance and similarity metrics for SPP, and also extends those metrics for dealing with MPP. mmpp 
Marker Passing  
Marker-Assisted Mini-Pooling (mMPA) 
mMPA 
Market Basket Analysis (MBA) 
Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. For example, if you are in an English pub and you buy a pint of beer and don’t buy a bar meal, you are more likely to buy crisps (US: chips) at the same time than somebody who didn’t buy beer. The set of items a customer buys is referred to as an itemset, and market basket analysis seeks to find relationships between purchases. Typically the relationship will be in the form of a rule: IF {beer, no bar meal} THEN {crisps}. The probability that a customer will buy beer without a bar meal (i.e. that the antecedent is true) is referred to as the support for the rule. The conditional probability that a customer will purchase crisps is referred to as the confidence. The algorithms for performing market basket analysis are fairly straightforward (Berry and Linoff is a reasonable introductory resource for this). The complexities mainly arise in exploiting taxonomies, avoiding combinatorial explosions (a supermarket may stock 10,000 or more line items), and dealing with the large amounts of transaction data that may be available. A major difficulty is that a large number of the rules found may be trivial for anyone familiar with the business. Although the volume of data has been reduced, we are still asking the user to find a needle in a haystack. Requiring rules to have a high minimum support level and a high confidence level risks missing any exploitable result we might have found. One partial solution to this problem is differential market basket analysis, as described below.
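The support and confidence computations can be sketched in a few lines of Python over a toy transaction list (the itemsets are invented for illustration, and the sketch uses a plain {beer} antecedent rather than the negated {beer, no bar meal} antecedent of the pub example):

```python
# Toy transactions: each one is the set of items in a single basket.
transactions = [
    {"beer", "crisps"},
    {"beer", "crisps", "nuts"},
    {"beer", "bar meal"},
    {"wine", "crisps"},
    {"beer", "crisps"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transactions."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Rule: IF {beer} THEN {crisps}
print(support({"beer"}, transactions))                          # 0.8
print(round(confidence({"beer"}, {"crisps"}, transactions), 2)) # 0.75
```

Real implementations (e.g. Apriori) prune the exponential space of candidate itemsets using the observation that an itemset can only be frequent if all of its subsets are frequent.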
Marketing Attribution  Attribution is the process of identifying a set of user actions (‘events’) that contribute in some manner to a desired outcome, and then assigning a value to each of these events. Marketing attribution provides a level of understanding of what combination of events influence individuals to engage in a desired behavior, typically referred to as a conversion. Attribution is the process of assigning credit to various marketing efforts when a sale is generated. In the modern world, this is no easy task. There are myriad ways to touch a customer today and the goal of attribution is to tease out the impact that each touch had in convincing you to make a purchase. Was it the email you were sent? Or the Google link you clicked? Or the banner ad you clicked when visiting a different site? Or the ad you saw with your video on YouTube? Or one of many other potential touch points? Or is it a mix? It is quite common today for a customer to have been exposed to multiple influences in the lead up to a purchase. How do you attribute the relationship? The question is not simply academic because it has real-world consequences. Budgets are set based on performance. So, the person in charge of Google advertising has a huge motivation to ensure that they get all the credit they deserve. Also, accurate attribution will allow resources to be properly focused on the approaches that truly work best. https://…/1029 
Markov Blanket  In machine learning, the Markov blanket for a node A in a Bayesian network is the set of nodes ∂A composed of A’s parents, its children, and its children’s other parents. In a Markov network, the Markov blanket of a node is its set of neighboring nodes. A Markov blanket may also be denoted by MB(A). The Markov blanket of a node contains all the variables that shield the node from the rest of the network. This means that the Markov blanket of a node is the only knowledge needed to predict the behavior of that node. The term was coined by Pearl in 1988. In a Bayesian network, the values of the parents and children of a node evidently give information about that node; however, its children’s parents also have to be included, because they can be used to explain away the node in question. In a Markov random field, the Markov blanket for a node is simply its adjacent nodes.
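A small Python sketch of reading a node's Markov blanket off a Bayesian network stored as a parent-to-children edge dict (the toy graph and function name are ours):

```python
# Directed edges of a toy Bayesian network: "A": ["C"] means A is a parent of C.
edges = {
    "A": ["C"],
    "B": ["C"],
    "C": ["D"],
    "E": ["D"],
}

def markov_blanket(node, edges):
    """Parents + children + children's other parents ("spouses")."""
    parents = {p for p, kids in edges.items() if node in kids}
    children = set(edges.get(node, []))
    spouses = {p for c in children
                 for p, kids in edges.items() if c in kids} - {node}
    return parents | children | spouses

# C's blanket: parents A and B, child D, and D's other parent E.
print(sorted(markov_blanket("C", edges)))  # ['A', 'B', 'D', 'E']
```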
Markov Brains  Markov Brains are a class of evolvable artificial neural networks (ANN). They differ from conventional ANNs in many aspects, but the key difference is that instead of a layered architecture, with each node performing the same function, Markov Brains are networks built from individual computational components. These computational components interact with each other, receive inputs from sensors, and control motor outputs. The function of the computational components, their connections to each other, as well as connections to sensors and motors are all subject to evolutionary optimization. Here we describe in detail how a Markov Brain works, what techniques can be used to study them, and how they can be evolved. 
Markov Chain  A Markov chain (discrete-time Markov chain or DTMC), named after Andrey Markov, is a mathematical system that undergoes transitions from one state to another on a state space. It is a random process usually characterized as memoryless: the next state depends only on the current state and not on the sequence of events that preceded it. This specific kind of ‘memorylessness’ is called the Markov property. Markov chains have many applications as statistical models of real-world processes. http://…/9789814451505 
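A minimal Python sketch of the memoryless property: simulating a two-state chain where each step consults only the current state's transition row (the states and probabilities are illustrative):

```python
import random

# Transition probabilities: current state -> {next state: probability}.
P = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def step(state, rng):
    """Draw the next state using only the current state's row of P."""
    r, cum = rng.random(), 0.0
    for nxt, p in P[state].items():
        cum += p
        if r < cum:
            return nxt
    return nxt  # guard against floating-point rounding

rng = random.Random(0)
state, chain = "sunny", []
for _ in range(10):
    state = step(state, rng)
    chain.append(state)
print(chain)  # each step depended only on the immediately preceding state
```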
Markov Chain Las Vegas (MCLV) 
We propose a Las Vegas transformation of Markov Chain Monte Carlo (MCMC) estimators of Restricted Boltzmann Machines (RBMs). We denote our approach Markov Chain Las Vegas (MCLV). MCLV gives statistical guarantees in exchange for random running times. MCLV uses a stopping set built from the training data and has a maximum number of Markov chain steps K (referred to as MCLV-K). We present an MCLV-K gradient estimator (LVS-K) for RBMs and explore the correspondence and differences between LVS-K and Contrastive Divergence (CD-K), with LVS-K significantly outperforming CD-K in training RBMs over the MNIST dataset, indicating MCLV to be a promising direction in learning generative models.
Markov Chain Monte Carlo (MCMC) 
In statistics, Markov chain Monte Carlo (MCMC) methods (which include random walk Monte Carlo methods) are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a large number of steps is then used as a sample of the desired distribution. The quality of the sample improves as a function of the number of steps. Usually it is not hard to construct a Markov chain with the desired properties. The more difficult problem is to determine how many steps are needed to converge to the stationary distribution within an acceptable error. A good chain will have rapid mixing: the stationary distribution is reached quickly starting from an arbitrary position (described further under Markov chain mixing time).
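A hedged Python sketch of one classic MCMC algorithm, random-walk Metropolis, targeting a standard normal density; after discarding an initial burn-in, the chain's states serve as (correlated) samples from the target (the function names and tuning constants are ours):

```python
import math
import random

def target(x):
    """Unnormalized density of the standard normal N(0, 1)."""
    return math.exp(-0.5 * x * x)

def metropolis(n_steps, burn_in, rng):
    x, samples = 0.0, []
    for i in range(n_steps):
        proposal = x + rng.uniform(-1.0, 1.0)  # symmetric random-walk proposal
        # Accept with probability min(1, target(proposal) / target(x));
        # only this ratio is needed, so the normalizing constant cancels.
        if rng.random() < target(proposal) / target(x):
            x = proposal
        if i >= burn_in:  # keep states only after the chain has mixed a while
            samples.append(x)
    return samples

rng = random.Random(42)
samples = metropolis(20000, 2000, rng)
mean = sum(samples) / len(samples)
print(round(mean, 2))  # close to 0, the target's mean (value depends on seed)
```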
Markov Chain Neural Network  In this work we present a modified neural network model which is capable of simulating Markov Chains. We show how to express and train such a network, how to ensure given statistical properties reflected in the training data, and we demonstrate several applications where the network produces non-deterministic outcomes. One example is a random walker model, e.g. useful for simulation of Brownian motions, or a natural Tic-Tac-Toe network which ensures non-deterministic game behavior.
Markov Cluster Algorithm (MCL) 
The MCL algorithm is short for the Markov Cluster Algorithm, a fast and scalable unsupervised cluster algorithm for graphs (also known as networks) based on simulation of (stochastic) flow in graphs. MCL 
Markov Decision Process (MDP) 
Markov decision processes (MDPs), named after Andrey Markov, provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying a wide range of optimization problems solved via dynamic programming and reinforcement learning. MDPs were known at least as early as the 1950s (cf. Bellman 1957). A core body of research on Markov decision processes resulted from Ronald A. Howard’s book published in 1960, Dynamic Programming and Markov Processes. They are used in a wide area of disciplines, including robotics, automated control, economics, and manufacturing. 
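A minimal value-iteration sketch in Python on a made-up two-state MDP, illustrating the dynamic-programming solution the entry mentions (the states, actions, rewards, and discount factor are invented for illustration):

```python
# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 5.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 1.0)],
           "go":   [(1.0, "s0", 0.0)]},
}
gamma = 0.9  # discount factor

V = {s: 0.0 for s in transitions}
for _ in range(200):
    # Bellman backup: V(s) = max over actions of E[reward + gamma * V(next)]
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in transitions[s].values())
         for s in transitions}

print({s: round(v, 2) for s, v in V.items()})  # {'s0': 23.26, 's1': 20.93}
```

Each sweep applies the Bellman optimality operator, a gamma-contraction, so the values converge geometrically to the optimal value function; the optimal policy then picks the maximizing action in each state.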
Markov Decision Process for Diversifying the Search Results in Information Retrieval (MDP-DIV) 
Recently, some studies have utilized the Markov Decision Process for diversifying (MDP-DIV) the search results in information retrieval. Though promising performances can be delivered, MDP-DIV suffers from a very slow convergence, which hinders its usability in real applications. In this paper, we aim to promote the performance of MDP-DIV by speeding up the convergence rate without much accuracy sacrifice. The slow convergence is incurred by two main reasons: the large action space and data scarcity. On the one hand, the sequential decision making at each position needs to evaluate the query-document relevance for all the candidate set, which results in a huge searching space for MDP; on the other hand, due to the data scarcity, the agent has to proceed more ‘trial and error’ interactions with the environment. To tackle this problem, we propose MDP-DIV-kNN and MDP-DIV-NTN methods. The MDP-DIV-kNN method adopts a k nearest neighbor strategy, i.e., discarding the k nearest neighbors of the recently-selected action (document), to reduce the diversification searching space. The MDP-DIV-NTN employs a pre-trained diversification neural tensor network (NTN-DIV) as the evaluation model, and combines the results with MDP to produce the final ranking solution. The experiment results demonstrate that the two proposed methods indeed accelerate the convergence rate of MDP-DIV, which is 3x faster, while the accuracies produced barely degrade, or are even better.
Markov Logic Networks (MLN) 
A Markov logic network (or MLN) is a probabilistic logic which applies the ideas of a Markov network to first-order logic, enabling uncertain inference. Markov logic networks generalize first-order logic, in the sense that, in a certain limit, all unsatisfiable statements have a probability of zero, and all tautologies have probability one. Markov Logic Networks 
Markov Random Field (MRF) 
In the domain of physics and probability, a Markov random field (often abbreviated as MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph. A Markov random field is similar to a Bayesian network in its representation of dependencies; the differences being that Bayesian networks are directed and acyclic, whereas Markov networks are undirected and may be cyclic. Thus, a Markov network can represent certain dependencies that a Bayesian network cannot (such as cyclic dependencies); on the other hand, it can’t represent certain dependencies that a Bayesian network can (such as induced dependencies). 
Markov switch smooth-transition HYGARCH model  The HYGARCH model is primarily used to model long-range dependence in volatility. We propose a Markov switch smooth-transition HYGARCH model, where the volatility in each state is a time-dependent convex combination of GARCH and FIGARCH. This model provides a flexible structure to capture different levels of volatility as well as short and long memory effects. The necessary and sufficient condition for asymptotic stability is derived. Forecasting of the conditional variance is studied using all past information in a parsimonious way. Bayesian estimation based on Gibbs sampling is provided. A simulation study is given to evaluate the estimation and model stability. The competitive performance of the proposed model is shown by comparing it with the HYGARCH and smooth-transition HYGARCH models on a period of the \textit{S}\&\textit{P}500 indices, based on volatility and value-at-risk forecasts. 
Mashboard  Also called a real-time dashboard, a mashboard is a Web 2.0 buzzword used to describe analytic mashups that allow businesses to create or add components that analyze and present data, look up inventory, accept orders, and perform other tasks without ever having to access the system that carries out the transaction. 
Masked Autoencoder for Distribution Estimation (MADE) 
There has been a lot of recent interest in designing neural network models to estimate a distribution from a set of examples. We introduce a simple modification for autoencoder neural networks that yields powerful generative models. Our method masks the autoencoder’s parameters to respect autoregressive constraints: each input is reconstructed only from previous inputs in a given ordering. Constrained this way, the autoencoder outputs can be interpreted as a set of conditional probabilities, and their product as the full joint probability. We can also train a single network that can decompose the joint probability in multiple different orderings. Our simple framework can be applied to multiple architectures, including deep ones. Vectorized implementations, such as on GPUs, are simple and fast. Experiments demonstrate that this approach is competitive with state-of-the-art tractable distribution estimators. At test time, the method is significantly faster and scales better than other autoregressive estimators. 
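The mask construction behind this idea can be sketched in a few lines. The layer sizes below are hypothetical; degrees are assigned to hidden units as in the construction described above, and the masks then zero out any connection that would violate the ordering:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 4, 8          # input dimension and hidden units (illustrative sizes)

# Each input gets its position 1..D; each hidden unit a degree in 1..D-1.
m_in = np.arange(1, D + 1)
m_hid = rng.integers(1, D, size=H)

# A hidden unit h may see input d only if m_hid[h] >= m_in[d];
# output d may see hidden unit h only if m_in[d] > m_hid[h].
M1 = (m_hid[:, None] >= m_in[None, :]).astype(float)   # hidden mask, (H, D)
M2 = (m_in[:, None] > m_hid[None, :]).astype(float)    # output mask, (D, H)

# The composed connectivity is strictly lower-triangular in (output, input):
# output d depends only on inputs at positions < d, i.e. autoregressive.
conn = M2 @ M1
assert np.all(np.triu(conn) == 0)
```

Elementwise-multiplying a layer's weight matrix by its mask before each forward pass enforces the conditional-probability interpretation without changing the architecture otherwise.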
MaskGAN  Neural text generation models are often autoregressive language models or seq2seq models. These models generate text by sampling words sequentially, with each word conditioned on the previous word, and are state-of-the-art for several machine translation and summarization benchmarks. These benchmarks are often defined by validation perplexity even though this is not a direct measure of the quality of the generated text. Additionally, these models are typically trained via maximum likelihood and teacher forcing. These methods are well-suited to optimizing perplexity but can result in poor sample quality, since generating text requires conditioning on sequences of words that may never have been observed at training time. We propose to improve sample quality using Generative Adversarial Networks (GANs), which explicitly train the generator to produce high-quality samples and have shown a lot of success in image generation. GANs were originally designed to output differentiable values, so discrete language generation is challenging for them. We claim that validation perplexity alone is not indicative of the quality of text generated by a model. We introduce an actor-critic conditional GAN that fills in missing text conditioned on the surrounding context. We show, qualitatively and quantitatively, evidence that this produces more realistic conditional and unconditional text samples compared to a maximum-likelihood trained model. 
Mass Displacement Network (MDN) 
Despite the large improvements in performance attained by using deep learning in computer vision, one can often further improve results with additional post-processing that exploits the geometric nature of the underlying task. This commonly involves displacing the posterior distribution of a CNN in a way that makes it more appropriate for the task at hand, e.g. better aligned with local image features, or more compact. In this work we integrate this geometric post-processing within a deep architecture, introducing a differentiable and probabilistically sound counterpart to the common geometric voting technique used for evidence accumulation in vision. We refer to the resulting neural models as Mass Displacement Networks (MDNs), and apply them to human pose estimation in two distinct setups: (a) landmark localization, where we collapse a distribution to a point, allowing for precise localization of body keypoints, and (b) communication across body parts, where we transfer evidence from one part to the other, allowing for a globally consistent pose estimate. We evaluate on large-scale pose estimation benchmarks, such as the MPII Human Pose and COCO datasets, and report systematic improvements when compared to strong baselines. 
Mass Personalization  Mass personalization is defined as custom tailoring by a company in accordance with its end users’ tastes and preferences. From a collaborative engineering perspective, mass customization can be viewed as a collaborative effort between customers and manufacturers, who have different sets of priorities and need to jointly search for solutions that best match customers’ individual specific needs with manufacturers’ customization capabilities. The main difference between mass customization and mass personalization is that customization is the ability for a company to give its customers an opportunity to create and choose a product to certain specifications, but it does have limits. The clothing industry has also adopted the mass customization paradigm, and some footwear retailers are producing mass-customized shoes. The gaming market is seeing personalization in the new custom controller industry; a notable company called “Experience Custom” gives customers the opportunity to order personalized gaming controllers. A website knowing a user’s location and buying habits will present offers and suggestions tailored to the user’s demographics; this is an example of mass personalization. The personalization is not individual; rather, the user is first classified and then the personalization is based on the group they belong to. Behavioral targeting represents a concept that is similar to mass personalization. 
Massive Online Analysis (MOA) 
MOA (Massive Online Analysis) is a free open-source software framework specific to data stream mining with concept drift. It is written in Java and developed at the University of Waikato, New Zealand. MOA allows users to build and run experiments in machine learning or data mining on evolving data streams. It includes a set of learners and stream generators that can be used from the Graphical User Interface (GUI), the command line, and the Java API. MOA contains several collections of machine learning algorithms for classification, regression, clustering, outlier detection and recommendation engines. http://moa.cms.waikato.ac.nz 
Massive Open Online Course (MOOC) 
A Massive Open Online Course (MOOC) is an online course aimed at unlimited participation and open access via the web. In addition to traditional course materials such as videos, readings, and problem sets, MOOCs provide interactive user forums that help build a community for students, professors, and teaching assistants (TAs). MOOCs are a recent development in distance education which began to emerge in 2012. 
Matchbox  We present a probabilistic model for generating personalised recommendations of items to users of a web service. The Matchbox system makes use of content information in the form of user and item meta data, in combination with collaborative filtering information from previous user behavior, in order to predict the value of an item for a user. Users and items are represented by feature vectors which are mapped into a low-dimensional ‘trait space’ in which similarity is measured in terms of inner products. The model can be trained from different types of feedback in order to learn user-item preferences. Here we present three alternatives: direct observation of an absolute rating each user gives to some items, observation of a binary preference (like/don’t like), and observation of a set of ordinal ratings on a user-specific scale. Efficient inference is achieved by approximate message passing involving a combination of Expectation Propagation (EP) and Variational Message Passing. We also include a dynamics model which allows an item’s popularity, a user’s taste or a user’s personal rating scale to drift over time. By using Assumed-Density Filtering (ADF) for training, the model requires only a single pass through the training data. This is an online learning algorithm capable of incrementally taking account of new data, so the system can immediately reflect the latest user preferences. We evaluate the performance of the algorithm on the MovieLens and Netflix data sets, consisting of approximately 1,000,000 and 100,000,000 ratings respectively. This demonstrates that training the model using the online ADF approach yields state-of-the-art performance, with the option of improving performance further, if computational resources are available, by performing multiple EP passes over the training data. 
MatchZoo  In recent years, deep neural models have been widely adopted for text matching tasks, such as question answering and information retrieval, showing improved performance compared with previous methods. In this paper, we introduce the MatchZoo toolkit that aims to facilitate the designing, comparing and sharing of deep text matching models. Specifically, the toolkit provides a unified data preparation module for different text matching problems, a flexible layer-based model construction process, and a variety of training objectives and evaluation metrics. In addition, the toolkit has implemented two schools of representative deep text matching models, namely representation-focused models and interaction-focused models. Finally, users can easily modify existing models, and create and share their own models for text matching in MatchZoo. 
Math Kernel Library (MKL) 
Intel Math Kernel Library (Intel MKL) is a library of optimized math routines for science, engineering, and financial applications. Core math functions include BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, and vector math. The routines in MKL are hand-optimized to exploit Intel’s multi-core and many-core processors. The library supports Intel and compatible processors and is available for Windows, Linux and OS X operating systems. MKL functions are optimized with each new processor release from Intel. 
Mathematica  Mathematica is a computational software program used in many scientific, engineering, mathematical and computing fields, based on symbolic mathematics. It was conceived by Stephen Wolfram and is developed by Wolfram Research of Champaign, Illinois. The Wolfram Language is the programming language used in Mathematica. 
Mathematical Statistics  Mathematical statistics is the application of mathematics to statistics, which was originally conceived as the science of the state – the collection and analysis of facts about a country: its economy, land, military, population, and so forth. Mathematical techniques which are used for this include mathematical analysis, linear algebra, stochastic analysis, differential equations, and measuretheoretic probability theory. 
Mathematics  Mathematics (from Greek μάθημα máthēma, ‘knowledge, study, learning’), often shortened to maths or math, is the study of topics such as quantity (numbers), structure, space, and change. There is a range of views among mathematicians and philosophers as to the exact scope and definition of mathematics. Mathematicians seek out patterns and use them to formulate new conjectures. Mathematicians resolve the truth or falsity of conjectures by mathematical proof. When mathematical structures are good models of real phenomena, then mathematical reasoning can provide insight or predictions about nature. Through the use of abstraction and logic, mathematics developed from counting, calculation, measurement, and the systematic study of the shapes and motions of physical objects. Practical mathematics has been a human activity for as far back as written records exist. The research required to solve mathematical problems can take years or even centuries of sustained inquiry. 
MathJax  A JavaScript display engine for mathematics that works in all browsers. 
MATLAB  MATLAB is the high-level language and interactive environment used by millions of engineers and scientists worldwide. It lets you explore and visualize ideas and collaborate across disciplines, including signal and image processing, communications, control systems, and computational finance. You can use MATLAB in projects such as modeling energy consumption to build smart power grids, developing control algorithms for hypersonic vehicles, analyzing weather data to visualize the track and intensity of hurricanes, and running millions of simulations to pinpoint optimal dosing for antibiotics. 
Matricized-Tensor Times Khatri-Rao Product (MTTKRP) 
The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational bottleneck for algorithms computing CP decompositions of tensors. In this paper, we develop shared-memory parallel algorithms for MTTKRP involving dense tensors. The algorithms cast nearly all of the computation as matrix operations in order to use optimized BLAS subroutines, and they avoid reordering tensor entries in memory. We benchmark sequential and parallel performance of our implementations, demonstrating high sequential performance and efficient parallel scaling. We use our parallel implementation to compute a CP decomposition of a neuroimaging data set and achieve a speedup of up to $7.4\times$ over existing parallel software. 
Matrix Calculus  In mathematics, matrix calculus is a specialized notation for doing multivariable calculus, especially over spaces of matrices. It collects the various partial derivatives of a single function with respect to many variables, and/or of a multivariate function with respect to a single variable, into vectors and matrices that can be treated as single entities. This greatly simplifies operations such as finding the maximum or minimum of a multivariate function and solving systems of differential equations. The notation used here is commonly used in statistics and engineering, while the tensor index notation is preferred in physics. Two competing notational conventions split the field of matrix calculus into two separate groups. The two groups can be distinguished by whether they write the derivative of a scalar with respect to a vector as a column vector or a row vector. Both of these conventions are possible even under the common assumption that vectors should be treated as column vectors when combined with matrices (rather than row vectors). A single convention can be somewhat standard throughout a field that commonly uses matrix calculus (e.g. econometrics, statistics, estimation theory and machine learning). However, even within a given field, different authors can be found using competing conventions, and authors of both groups often write as though their specific convention were standard. Serious mistakes can result when combining results from different authors without carefully verifying that compatible notations are used; therefore, great care should be taken to ensure notational consistency. Definitions of these two conventions and comparisons between them are collected in the layout conventions section. 
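As a small worked example of how the two layout conventions differ, consider the standard quadratic-form identity for $f(x) = x^\top A x$ with $x \in \mathbb{R}^n$:

```latex
\frac{\partial\, x^\top A x}{\partial x} =
  \begin{cases}
    x^\top (A + A^\top), & \text{numerator layout (result is a row vector)},\\
    (A + A^\top)\, x,    & \text{denominator layout (result is a column vector)}.
  \end{cases}
```

For symmetric $A$ both forms reduce to $2Ax$ (up to transposition), but silently mixing the two conventions within one derivation is exactly the kind of error the caution above warns against.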
Matrix Decomposition  In the mathematical discipline of linear algebra, a matrix decomposition or matrix factorization is a factorization of a matrix into a product of matrices. There are many different matrix decompositions; each finds use among a particular class of problems. 
Matrix-centric Neural Networks  We present a new distributed representation in deep neural nets wherein the information is represented in native form as a matrix. This differs from current neural architectures that rely on vector representations. We consider matrices as central to the architecture, and they compose the input, hidden and output layers. The model representation is more compact and elegant: the number of parameters grows only with the largest dimension of the incoming layer rather than the number of hidden units. We derive feed-forward nets that map an input matrix into an output matrix, and recurrent nets which map a sequence of input matrices into a sequence of output matrices. Experiments on handwritten digit recognition, face reconstruction, sequence-to-sequence learning and EEG classification demonstrate the efficacy and compactness of the matrix-centric architectures. 
Matrix-Variate Gaussian (MVG) 
Differential privacy mechanism design has traditionally been tailored for a scalar-valued query function. Although many mechanisms such as the Laplace and Gaussian mechanisms can be extended to a matrix-valued query function by adding i.i.d. noise to each element of the matrix, this method is often suboptimal, as it forfeits an opportunity to exploit the structural characteristics typically associated with matrix analysis. To address this challenge, we propose a novel differential privacy mechanism called the Matrix-Variate Gaussian (MVG) mechanism, which adds a matrix-valued noise drawn from a matrix-variate Gaussian distribution, and we rigorously prove that the MVG mechanism preserves $(\epsilon,\delta)$-differential privacy. Furthermore, we introduce the concept of directional noise, made possible by the design of the MVG mechanism. Directional noise allows the impact of the noise on the utility of the matrix-valued query function to be moderated. Finally, we experimentally demonstrate the performance of our mechanism using three matrix-valued queries on three privacy-sensitive datasets. We find that the MVG mechanism notably outperforms four previous state-of-the-art approaches and provides comparable utility to the non-private baseline. Our work thus presents a promising prospect for both future research and implementation of differential privacy for matrix-valued query functions. 
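As a quick illustration of the matrix-variate Gaussian distribution itself (not of the privacy mechanism above), one can sample from it via Cholesky factors of the row and column covariances; the covariance matrices below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 2
M = np.zeros((n, p))                      # mean matrix
U = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.5]])           # row covariance (n x n)
V = np.array([[1.0, 0.3],
              [0.3, 2.0]])                # column covariance (p x p)

# If G has i.i.d. N(0, 1) entries, then M + A @ G @ B.T is matrix-variate
# Gaussian with row covariance A A^T = U and column covariance B B^T = V.
A = np.linalg.cholesky(U)
B = np.linalg.cholesky(V)
samples = np.array([M + A @ rng.normal(size=(n, p)) @ B.T
                    for _ in range(20000)])

# Each entry satisfies Var(X[i, j]) = U[i, i] * V[j, j]; e.g. Var(X[0, 0]) = 2.
emp = samples[:, 0, 0].var()
```

Equivalently, vec(X) is multivariate Gaussian with covariance the Kronecker product of V and U, which is what makes the Cholesky construction valid.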
Matroid  In combinatorics, a branch of mathematics, a matroid is a structure that captures and generalizes the notion of linear independence in vector spaces. There are many equivalent ways to define a matroid, the most significant being in terms of independent sets, bases, circuits, closed sets or flats, closure operators, and rank functions. Matroid theory borrows extensively from the terminology of linear algebra and graph theory, largely because it is the abstraction of various notions of central importance in these fields. Matroids have found applications in geometry, topology, combinatorial optimization, network theory and coding theory. 
Matryoshka Network  In this paper, we develop novel, efficient 2D encodings for 3D geometry, which enable reconstructing full 3D shapes from a single image at high resolution. The key idea is to pose 3D shape reconstruction as a 2D prediction problem. To that end, we first develop a simple baseline network that predicts entire voxel tubes at each pixel of a reference view. By leveraging well-proven architectures for 2D pixel-prediction tasks, we attain state-of-the-art results, clearly outperforming purely voxel-based approaches. We scale this baseline to higher resolutions by proposing a memory-efficient shape encoding, which recursively decomposes a 3D shape into nested shape layers, similar to the pieces of a Matryoshka doll. This allows reconstructing highly detailed shapes with complex topology, as demonstrated in extensive experiments; we clearly outperform previous octree-based approaches despite having a much simpler architecture using standard network components. Our Matryoshka networks further enable reconstructing shapes from IDs or shape similarity, as well as shape sampling. 
Matthews Correlation Coefficient (MCC) 
The Matthews Correlation Coefficient (MCC) has a range of −1 to +1, where −1 indicates a completely wrong binary classifier and +1 indicates a completely correct binary classifier. Using the MCC allows one to gauge how well a classification model/function is performing. Another method for evaluating classifiers is the ROC curve. 
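For a 2×2 confusion matrix the coefficient is MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A minimal sketch, with made-up counts:

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from 2x2 confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0  # common convention when a margin is empty

# A perfect classifier scores +1, a fully inverted one -1:
assert mcc(tp=50, fp=0, fn=0, tn=50) == 1.0
assert mcc(tp=0, fp=50, fn=50, tn=0) == -1.0
```

Unlike plain accuracy, the numerator vanishes for any classifier no better than chance, which is why MCC is favored for imbalanced classes.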
Maucha Diagrams  This diagram was proposed by Rezső Maucha in 1932 as a way to visualize the relative ionic composition of water samples. 
Maxima Units Search (MUS) 
An algorithm for extracting identity submatrices of small rank and pivotal units from large and sparse matrices is proposed. The procedure has already been satisfactorily applied for solving the label switching problem in Bayesian mixture models. Here we introduce it on its own and explore possible applications in different contexts. 
Maximal Information Coefficient (MIC) 
In statistics, the maximal information coefficient (MIC) is a measure of the strength of the linear or nonlinear association between two variables X and Y. The MIC belongs to the maximal information-based nonparametric exploration (MINE) class of statistics. In a simulation study, MIC outperformed some selected low-power tests; however, concerns have been raised regarding reduced statistical power in detecting some associations in settings with low sample size when compared to powerful methods such as distance correlation and HHG, which outperformed MIC in later comparisons. It is claimed that MIC approximately satisfies a property called equitability, which is illustrated by selected simulation studies. It was later proved that no non-trivial coefficient can exactly satisfy the equitability property as defined by Reshef et al. Some criticisms of MIC are addressed by Reshef et al. in further studies published on arXiv. 
Maximal Label Search (MLS) 
Many graph search algorithms use a vertex labeling to compute an ordering of the vertices. We examine such algorithms which compute a peo (perfect elimination ordering) of a chordal graph, and corresponding algorithms which compute an meo (minimal elimination ordering) of a non-chordal graph, an ordering used to compute a minimal triangulation of the input graph. We express all known peo-computing search algorithms as instances of a generic algorithm called MLS (maximal label search) and generalize Algorithm MLS into CompMLS, which can compute any peo. We then extend these algorithms to versions which compute an meo, and likewise generalize all known meo-computing search algorithms. We show that not all minimal triangulations can be computed by such a graph search, and, more surprisingly, that all these search algorithms compute the same set of minimal triangulations, even though the computed meos are different. Finally, we present a complexity analysis of these algorithms. An extended abstract of part of this paper was published in WG 2005. 
Maximally Divergent Intervals (MDI) 
Automatic detection of anomalies in space- and time-varying measurements is an important tool in several fields, e.g., fraud detection, climate analysis, or healthcare monitoring. We present an algorithm for detecting anomalous regions in multivariate spatio-temporal time series, which allows for spotting the interesting parts in large amounts of data, including video and text data. In contrast to existing techniques for detecting isolated anomalous data points, we propose the ‘Maximally Divergent Intervals’ (MDI) framework for unsupervised detection of coherent spatial regions and time intervals characterized by a high Kullback-Leibler divergence compared with all other data given. In this regard, we define an unbiased Kullback-Leibler divergence that allows for ranking regions of different size, and show how to enable the algorithm to run on large-scale data sets in reasonable time using an interval proposal technique. Experiments on both synthetic and real data from various domains, such as climate analysis, video surveillance, and text forensics, demonstrate that our method is widely applicable and a valuable tool for finding interesting events in different types of data. 
Maximum a posteriori (MAP) 
In Bayesian statistics, a maximum a posteriori probability (MAP) estimate is a mode of the posterior distribution. The MAP can be used to obtain a point estimate of an unobserved quantity on the basis of empirical data. It is closely related to Fisher’s method of maximum likelihood (ML), but employs an augmented optimization objective which incorporates a prior distribution over the quantity one wants to estimate. MAP estimation can therefore be seen as a regularization of ML estimation. 
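Formally, the MAP estimate augments the log-likelihood with a log-prior term:

```latex
\hat{\theta}_{\mathrm{MAP}}
  = \arg\max_{\theta}\; p(\theta \mid x)
  = \arg\max_{\theta}\; p(x \mid \theta)\, p(\theta)
  = \arg\max_{\theta}\; \left[ \log p(x \mid \theta) + \log p(\theta) \right],
```

so a flat prior recovers the ML estimate, while, for example, a Gaussian prior on $\theta$ yields the familiar $\ell_2$-regularized form of ML estimation.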
Maximum Causal Tsallis Entropy (MCTE) 
In this paper, we propose a novel maximum causal Tsallis entropy (MCTE) framework for imitation learning which can efficiently learn a sparse multi-modal policy distribution from demonstrations. We provide a full mathematical analysis of the proposed framework. First, the optimal solution of an MCTE problem is shown to be a sparsemax distribution, whose supporting set can be adjusted. The proposed method has advantages over a softmax distribution in that it can exclude unnecessary actions by assigning them zero probability. Second, we prove that an MCTE problem is equivalent to robust Bayes estimation in the sense of the Brier score. Third, we propose a maximum causal Tsallis entropy imitation learning (MCTEIL) algorithm with a sparse mixture density network (sparse MDN), modeling mixture weights using a sparsemax distribution. In particular, we show that the causal Tsallis entropy of an MDN encourages exploration and efficient mixture utilization, whereas Boltzmann-Gibbs entropy is less effective. We validate the proposed method in two simulation studies, where MCTEIL outperforms existing imitation learning methods in terms of average returns and in learning multi-modal policies. 
Maximum Complex Correntropy Criterion (MCCC) 
Recent studies have demonstrated that correntropy is an efficient tool for analyzing higher-order statistical moments in non-Gaussian noise environments. Although correntropy has been used with complex data, no theoretical study had been pursued to elucidate its properties, nor how best to use it for optimization. This paper presents a probabilistic interpretation for correntropy using complex-valued data, called complex correntropy. A recursive solution for the maximum complex correntropy criterion (MCCC) is introduced based on a fixed-point solution. This technique is applied to a simple system identification case study, and the results demonstrate prominent advantages when compared to the complex recursive least squares (RLS) algorithm. By using such a probabilistic interpretation, correntropy can be applied to solve several problems involving complex data in a more straightforward way. Keywords: complex-valued data, correntropy, maximum complex correntropy criterion, fixed-point algorithm. 
Maximum Entropy Flow Networks 
Maximum Entropy Spectral Analysis (MESA) 
Maximum entropy spectral analysis (MESA), also known as Burg’s method, estimates the power spectral density of a signal by selecting, among all spectra consistent with the measured autocorrelation values, the one with maximum entropy; the resulting estimate is equivalent to fitting an autoregressive (AR) model to the data. 
Maximum Inner Product Search (MIPS) 
Maximum inner product search (MIPS) is the problem of finding, for a given query vector, the vector in a database that has the largest inner product with the query. It arises in recommender systems and in the output layers of large classification models, and is commonly accelerated by reducing it to nearest-neighbor search or by specialized hashing, tree, and quantization methods. 
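A brute-force reference solution for MIPS is a single matrix-vector product over the database; the data below is hypothetical, and practical systems replace this linear scan with approximate sublinear methods:

```python
import numpy as np

rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 16))   # hypothetical item vectors
query = rng.normal(size=16)

# Exact MIPS: score every item against the query, take the argmax.
scores = database @ query
best = int(scores.argmax())
```

Note that, unlike nearest-neighbor search under Euclidean distance, the inner product is not a metric, which is why MIPS needs either a reduction trick or dedicated index structures.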
Maximum Likelihood (ML) 
In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model’s parameters. The method of maximum likelihood corresponds to many well-known estimation methods in statistics. For example, one may be interested in the heights of adult female penguins, but be unable to measure the height of every single penguin in a population due to cost or time constraints. Assuming that the heights are normally (Gaussian) distributed with some unknown mean and variance, the mean and variance can be estimated with MLE while only knowing the heights of some sample of the overall population. MLE would accomplish this by taking the mean and variance as parameters and finding particular parametric values that make the observed results the most probable (given the model). In general, for a fixed set of data and underlying statistical model, the method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the ‘agreement’ of the selected model with the observed data, and for discrete random variables it indeed maximizes the probability of the observed data under the resulting distribution. Maximum-likelihood estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution and many other problems. However, in some complicated problems difficulties do occur: in such problems, maximum-likelihood estimators are unsuitable or do not exist. 
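For the Gaussian example above, the MLEs have a closed form: the sample mean and the 1/n sample variance jointly maximize the likelihood. A minimal sketch with a made-up sample of heights:

```python
def gaussian_mle(values):
    """Closed-form Gaussian MLEs: sample mean and the biased (1/n)
    sample variance, which jointly maximize the likelihood."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n  # note 1/n, not 1/(n-1)
    return mu, var

# Hypothetical sample of heights (cm):
mu, var = gaussian_mle([68.0, 70.5, 71.0, 69.5, 71.0])
```

The 1/n variance estimator is biased downward; the familiar 1/(n−1) correction is not the MLE, which illustrates that maximum likelihood does not automatically yield unbiased estimates.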
Maximum Likelihood Estimates (MLE) 
In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model’s parameters. 
Maximum Margin Interval Trees  Learning a regression function using censored or interval-valued output data is an important problem in fields such as genomics and medicine. The goal is to learn a real-valued prediction function, and the training output labels indicate an interval of possible values. Whereas most existing algorithms for this task are linear models, in this paper we investigate learning nonlinear tree models. We propose to learn a tree by minimizing a margin-based discriminative objective function, and we provide a dynamic programming algorithm for computing the optimal solution in log-linear time. We show empirically that this algorithm achieves state-of-the-art speed and prediction accuracy on a benchmark of several data sets. 
Maximum Margin Principal Components  Principal Component Analysis (PCA) is a very successful dimensionality reduction technique, widely used in predictive modeling. A key factor in its widespread use in this domain is the fact that the projection of a dataset onto its first $K$ principal components minimizes the sum of squared errors between the original data and the projected data over all possible rank-$K$ projections. Thus, PCA provides optimal low-rank representations of data for least-squares linear regression under standard modeling assumptions. On the other hand, when the loss function for a prediction problem is not the least-squares error, PCA is typically a heuristic choice of dimensionality reduction, in particular for classification problems under the zero-one loss. In this paper we target classification problems by proposing a straightforward alternative to PCA that aims to minimize the difference in margin distribution between the original and the projected data. Extensive experiments show that our simple approach typically outperforms PCA on any particular dataset in terms of classification error, though this difference is not always statistically significant, and, despite being a filter method, is frequently competitive with Partial Least Squares (PLS) and Lasso on a wide range of datasets. 
Maximum Mean Discrepancy (MMD) 
The core idea in maximum mean discrepancy (MMD) in a reproducing kernel Hilbert space (RKHS) is to match two distributions based on the mean of features in the Hilbert space induced by a kernel K. This is justified because when K is universal there is an injection between the space of distributions and the space of mean feature vectors lying in its RKHS. From a practical perspective too, the MMD approach is appealing because, unlike parametric density estimation methods, it can be applied to arbitrary domains and to high-dimensional data, and is computationally tractable. This approach was earlier used in the covariate shift problem (Gretton et al., 2009), the two-sample problem (Gretton et al., 2012a), and recently in (Zhang et al., 2013) for estimating class ratios. 
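As a concrete illustration, the plug-in estimate of squared MMD with a Gaussian kernel can be written in a few lines of pure Python. This is a sketch for scalar data; the function names are ours, not from the cited works.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel k(x, y) = exp(-(x - y)^2 / (2 * sigma^2)) for scalars."""
    return math.exp(-((x - y) ** 2) / (2 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased plug-in estimate of MMD^2:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    xx = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    yy = sum(gaussian_kernel(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    xy = sum(gaussian_kernel(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return xx + yy - 2 * xy

# Identical samples give 0; well-separated samples approach 2.
print(mmd_squared([0.0, 0.1, 0.2], [0.0, 0.1, 0.2]))     # 0.0
print(mmd_squared([0.0, 0.1, 0.2], [10.0, 10.1, 10.2]))  # close to 2
```

In a two-sample test, this statistic is compared against a threshold obtained, for example, by permutation.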
Maximum Variance Total Variation Denoising (MVTV) 
We consider the problem of estimating a regression function in the common situation where the number of features is small, where interpretability of the model is a high priority, and where simple linear or additive models fail to provide adequate performance. To address this problem, we present Maximum Variance Total Variation denoising (MVTV), an approach that is conceptually related both to CART and to the more recent CRISP algorithm, a state-of-the-art alternative method for interpretable nonlinear regression. MVTV divides the feature space into blocks of constant value and fits the value of all blocks jointly via a convex optimization routine. Our method is fully data-adaptive, in that it incorporates highly robust routines for tuning all hyperparameters automatically. We compare our approach against CART and CRISP via both a complexity-accuracy trade-off metric and a human study, demonstrating that MVTV is a more powerful and interpretable method. 
Maximum-Margin Markov Network (M3N) 
In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum Margin Markov (M3) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 
Max-Mahalanobis Linear Discriminant Analysis (MMLDA) 
A deep neural network (DNN) consists of a nonlinear transformation from an input to a feature representation, followed by a common softmax linear classifier. Though many efforts have been devoted to designing a proper architecture for the nonlinear transformation, little investigation has been done on the classifier part. In this paper, we show that a properly designed classifier can improve robustness to adversarial attacks and lead to better prediction results. Specifically, we define a Max-Mahalanobis distribution (MMD) and theoretically show that if the input is distributed as an MMD, the linear discriminant analysis (LDA) classifier will have the best robustness to adversarial examples. We further propose a novel Max-Mahalanobis linear discriminant analysis (MMLDA) network, which explicitly maps a complicated data distribution in the input space to an MMD in the latent feature space and then applies LDA to make predictions. Our results demonstrate that the MMLDA networks are significantly more robust to adversarial attacks, and have better performance in class-biased classification. 
Max-Margin Deep Generative Models (mmDGMs) 
Deep generative models (DGMs) are effective at learning multi-layered representations of complex data and performing inference on input data by exploiting their generative ability. However, the discriminative ability of DGMs for making accurate predictions remains relatively weak. This paper presents max-margin deep generative models (mmDGMs) and a class-conditional variant (mmDCGMs), which explore the strongly discriminative principle of max-margin learning to improve the predictive performance of DGMs in both supervised and semi-supervised learning, while retaining the generative capability. In semi-supervised learning, we use the predictions of a max-margin classifier as the missing labels instead of performing full posterior inference, for efficiency; we also introduce additional max-margin and label-balance regularization terms on unlabeled data, for effectiveness. We develop an efficient doubly stochastic subgradient algorithm for the piecewise linear objectives in different settings. Empirical results on various datasets demonstrate that: (1) max-margin learning can significantly improve the prediction performance of DGMs while retaining the generative ability; (2) in supervised learning, mmDGMs are competitive with the best fully discriminative networks when employing convolutional neural networks as the generative and recognition models; and (3) in semi-supervised learning, mmDCGMs can perform efficient inference and achieve state-of-the-art classification results on several benchmarks. 
Maxout Network  We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout’s fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state-of-the-art classification performance. 
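The maxout activation itself is simply a max taken over groups of linear pre-activations. A minimal list-based sketch (names ours):

```python
def maxout(z, num_pieces):
    """Maxout activation: partition the pre-activations z into groups of
    size num_pieces and output the max of each group."""
    assert len(z) % num_pieces == 0, "z must divide evenly into groups"
    return [max(z[i:i + num_pieces]) for i in range(0, len(z), num_pieces)]

# Four pre-activations, two linear pieces per unit -> two maxout outputs.
print(maxout([0.3, -1.2, 2.5, 0.7], 2))  # [0.3, 2.5]
```

In a real layer, each group of pre-activations comes from its own affine transform of the layer input; the max makes each unit a piecewise-linear convex function.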
McDiarmid Drift Detection Method (MDDM) 
Increasingly, Internet of Things (IoT) domains, such as sensor networks, smart cities, and social networks, generate vast amounts of data. Such data are not only unbounded and rapidly evolving; their content also changes dynamically over time, often in unforeseen ways. These variations are due to so-called concept drifts, caused by changes in the underlying data generation mechanisms. In a classification setting, concept drift causes previously learned models to become inaccurate, unsafe and even unusable. Accordingly, concept drifts need to be detected, and handled, as soon as possible. In medical applications and military zones, for example, changes in behavior should be detected in near real-time, to avoid potential loss of life. To this end, we introduce the McDiarmid Drift Detection Method (MDDM), which utilizes McDiarmid’s inequality in order to detect concept drift. The MDDM approach proceeds by sliding a window over prediction results and associating window entries with weights. Higher weights are assigned to the most recent entries, in order to emphasize their importance. As instances are processed, the detection algorithm compares a weighted mean of elements inside the sliding window with the maximum weighted mean observed so far. A significant difference between the two weighted means, upper-bounded by McDiarmid’s inequality, implies a concept drift. Our extensive experimentation against synthetic and real-world data streams shows that our novel method outperforms the state-of-the-art. Specifically, MDDM yields shorter detection delays as well as lower false negative rates, while maintaining high classification accuracies. 
McNemar Test  In statistics, McNemar’s test is a statistical test used on paired nominal data. It is applied to 2 × 2 contingency tables with a dichotomous trait, with matched pairs of subjects, to determine whether the row and column marginal frequencies are equal (that is, whether there is “marginal homogeneity”). It is named after Quinn McNemar, who introduced it in 1947. An application of the test in genetics is the transmission disequilibrium test for detecting linkage disequilibrium. 
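For a 2 × 2 paired table with discordant counts b (first yes, second no) and c (first no, second yes), the test statistic is (b − c)²/(b + c), or (|b − c| − 1)²/(b + c) with Edwards' continuity correction, and is referred to a chi-square distribution with one degree of freedom. A minimal sketch (names ours):

```python
def mcnemar_statistic(b, c, correction=True):
    """McNemar chi-square statistic from the discordant cells of a 2x2
    paired table. With Edwards' continuity correction the statistic is
    (|b - c| - 1)^2 / (b + c); without it, (b - c)^2 / (b + c)."""
    if correction:
        return (abs(b - c) - 1) ** 2 / (b + c)
    return (b - c) ** 2 / (b + c)

# Discordant counts b=10, c=25: the uncorrected statistic is
# (10 - 25)^2 / 35 = 225 / 35, well above the 5% chi-square(1)
# critical value of about 3.84, so marginal homogeneity is rejected.
print(mcnemar_statistic(10, 25, correction=False))
```

For small b + c, an exact binomial version of the test is usually preferred over the chi-square approximation.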
Mean Absolute Deviation (MAD) 
The mean absolute deviation (MAD), also referred to as the mean deviation (or sometimes the average absolute deviation), is the mean of the absolute deviations of a set of data about the data’s mean. In other words, it is the average distance of the data set from its mean. Note that the same abbreviation is also used for the median absolute deviation, a distinct measure. MAD has been proposed for use in place of the standard deviation since it corresponds better to real life. Because the MAD is a simpler measure of variability than the standard deviation, it can be used as a pedagogical tool to help motivate the standard deviation. 
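The definition translates directly into code (a sketch, names ours):

```python
def mean_absolute_deviation(data):
    """Mean of the absolute deviations of the data about its mean."""
    m = sum(data) / len(data)
    return sum(abs(x - m) for x in data) / len(data)

# Mean of [2, 2, 3, 4, 14] is 5; absolute deviations are [3, 3, 2, 1, 9],
# so the MAD is 18 / 5 = 3.6.
print(mean_absolute_deviation([2, 2, 3, 4, 14]))  # 3.6
```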
Mean Absolute Percentage Deviation (MAPD) 
The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of accuracy of a method for constructing fitted time series values in statistics, specifically in trend estimation. It usually expresses accuracy as a percentage: MAPE = (100/n) * sum over t of |(A_t - F_t) / A_t|, where A_t is the actual value and F_t the forecast value. 
Mean Absolute Percentage Error (MAPE) 
The mean absolute percentage error (MAPE), also known as mean absolute percentage deviation (MAPD), is a measure of accuracy of a method for constructing fitted time series values in statistics, specifically in trend estimation. It usually expresses accuracy as a percentage: MAPE = (100/n) * sum over t of |(A_t - F_t) / A_t|, where A_t is the actual value and F_t the forecast value. 
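The usual formula, MAPE = (100/n) * sum of |(A_t − F_t)/A_t|, can be sketched directly (names ours; note it is undefined when any actual value is zero):

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent:
    (100 / n) * sum(|(A_t - F_t) / A_t|)."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - f) / a) for a, f in zip(actual, forecast))

# Per-point errors of 10%, 0% and 5% average out to a MAPE of 5%.
print(mape([100, 200, 400], [110, 200, 380]))
```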
Mean Directional Accuracy (MDA) 
Mean Directional Accuracy (MDA), also known as Mean Direction Accuracy, is a measure of the prediction accuracy of a forecasting method in statistics. It compares the forecast direction (upward or downward) to the actual realized direction. In simple words, MDA provides the probability that the forecasting method under study can detect the correct direction of the time series. MDA is a popular metric for forecasting performance in economics and finance, where analysts are often interested only in the directional movement of the variable of interest. As an example in macroeconomics, a monetary authority may want to know the direction of inflation, so as to raise interest rates if inflation is predicted to rise, or lower them if it is predicted to drop. Another example can be found in financial planning, where the user wants to know whether demand is on an increasing or a decreasing trend. 
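A minimal sketch, under the common convention that each direction is taken relative to the previous actual value (conventions vary across references; names ours):

```python
def mean_directional_accuracy(actual, forecast):
    """Fraction of time steps at which the forecast change and the actual
    change, both measured from the previous actual value, share a sign."""
    hits = 0
    for t in range(1, len(actual)):
        actual_up = actual[t] - actual[t - 1] >= 0
        forecast_up = forecast[t] - actual[t - 1] >= 0
        hits += actual_up == forecast_up
    return hits / (len(actual) - 1)

# The forecast calls the direction correctly on 3 of the 4 transitions.
actual = [10, 12, 11, 13, 12]
forecast = [10, 11, 12, 14, 11]
print(mean_directional_accuracy(actual, forecast))  # 0.75
```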
Mean Field Reinforcement Learning (MFRL) 
Existing multi-agent reinforcement learning methods are typically limited to a small number of agents. When the number of agents grows large, learning becomes intractable due to the curse of dimensionality and the exponential growth of agent interactions. In this paper, we present Mean Field Reinforcement Learning, where the interactions within the population of agents are approximated by those between a single agent and the average effect from the overall population or neighboring agents; the interplay between the two entities is mutually reinforcing: the learning of the individual agent’s optimal policy depends on the dynamics of the population, while the dynamics of the population change according to the collective patterns of the individual policies. We develop practical mean field Q-learning and mean field Actor-Critic algorithms and analyze the convergence of the solution. Experiments on resource allocation, Ising model estimation, and battle game tasks verify the learning effectiveness of our mean field approaches in handling many-agent interactions in a population. 
Mean Field Residual Network  We study randomly initialized residual networks using mean field theory and the theory of difference equations. Classical feedforward neural networks, such as those with tanh activations, exhibit exponential behavior on the average when propagating inputs forward or gradients backward. The exponential forward dynamics causes rapid collapsing of the input space geometry, while the exponential backward dynamics causes drastic vanishing or exploding gradients. We show, in contrast, that by adding skip connections, the network will, depending on the nonlinearity, adopt subexponential forward and backward dynamics, and in many cases in fact polynomial. The exponents of these polynomials are obtained through analytic methods and proved and verified empirically to be correct. In terms of the ‘edge of chaos’ hypothesis, these subexponential and polynomial laws allow residual networks to ‘hover over the boundary between stability and chaos,’ thus preserving the geometry of the input space and the gradient information flow. In our experiments, for each activation function we study here, we initialize residual networks with different hyperparameters and train them on MNIST. Remarkably, our initialization time theory can accurately predict test time performance of these networks, by tracking either the expected amount of gradient explosion or the expected squared distance between the images of two input vectors. Importantly, we show, theoretically as well as empirically, that common initializations such as the Xavier or the He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth. Finally, we have made mathematical contributions by deriving several new identities for the kernels of powers of ReLU functions by relating them to the zeroth Bessel function of the second kind. 
Mean Shift  Mean shift is a non-parametric feature-space analysis technique for locating the maxima of a density function, a so-called mode-seeking algorithm. Application domains include cluster analysis in computer vision and image processing. http://…/MeanShiftTheory.pdf 
Mean Shift Clustering  The mean shift algorithm is a nonparametric clustering technique which does not require prior knowledge of the number of clusters, and does not constrain the shape of the clusters. http://…/mean_shift.pdf http://…/meanshift 
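The basic fixed-point iteration with a Gaussian kernel can be sketched in one dimension: each point is repeatedly moved to the kernel-weighted mean of all points until it settles at a mode, and points that converge to the same mode form one cluster. A toy illustration, not a production implementation (names ours):

```python
import math

def mean_shift_1d(points, bandwidth=1.0, iters=50):
    """Move each point to the Gaussian-weighted mean of all points until it
    converges to a local density maximum (a mode); return the rounded modes."""
    modes = []
    for x in points:
        for _ in range(iters):
            weights = [math.exp(-((x - p) ** 2) / (2 * bandwidth ** 2))
                       for p in points]
            x = sum(w * p for w, p in zip(weights, points)) / sum(weights)
        modes.append(round(x, 3))
    return modes

# Two well-separated groups collapse onto two modes, one per cluster,
# without the number of clusters being specified in advance.
data = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
print(mean_shift_1d(data, bandwidth=0.5))  # [1.0, 1.0, 1.0, 9.0, 9.0, 9.0]
```

The bandwidth is the only tuning parameter: it controls how smooth the estimated density is and therefore how many modes (clusters) survive.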
Mean Squared Error (MSE) 
In statistics, the mean squared error (MSE) of an estimator measures the average of the squares of the “errors”, that is, the differences between the estimator and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss or quadratic loss. The difference occurs because of randomness or because the estimator doesn’t account for information that could produce a more accurate estimate. 
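The empirical version is a one-liner (a sketch, names ours):

```python
def mean_squared_error(estimates, targets):
    """Average of the squared differences between estimated and true values."""
    return sum((e - t) ** 2 for e, t in zip(estimates, targets)) / len(targets)

# Squared errors of 1, 0 and 4 average to 5/3.
print(mean_squared_error([2.0, 3.0, 5.0], [3.0, 3.0, 3.0]))
```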
Meaningful Purposive Interaction Analysis (MPIA) 
This book introduces Meaningful Purposive Interaction Analysis (MPIA) theory, which combines social network analysis (SNA) with latent semantic analysis (LSA) to help create and analyse a meaningful learning landscape from the digital traces left by a learning community in the co-construction of knowledge. The hybrid algorithm is implemented in the statistical programming language and environment R, introducing packages which capture – through matrix algebra – elements of learners’ work with more knowledgeable others and resourceful content artefacts. The book provides comprehensive package-by-package application examples, and code samples that guide the reader through the MPIA model to show how the MPIA landscape can be constructed and the learner’s journey mapped and analysed. This building block application will allow the reader to progress to using and building analytics to guide students and support decision-making in learning. 
Measure Differential Equations (MDE) 
A new type of differential equations for probability measures on Euclidean spaces, called Measure Differential Equations (briefly MDEs), is introduced. MDEs correspond to Probability Vector Fields, which map measures on a Euclidean space to measures on its tangent bundle. Solutions are intended in a weak sense, and existence, uniqueness and continuous dependence results are proved under suitable conditions. The latter are expressed in terms of the Wasserstein metric on the base and fiber of the tangent bundle. MDEs represent a natural measure-theoretic generalization of Ordinary Differential Equations via a monoid morphism mapping sums of vector fields to fiber convolution of the corresponding Probability Vector Fields. Various examples, including finite-speed diffusion and concentration, are shown, together with relationships to Partial Differential Equations. Finally, MDEs are also natural mean-field limits of multi-particle systems, with convergence results extending the classical Dobrushin approach. 
Measure Forecast Accuracy  
MEBoost  The class imbalance problem has been a challenging research problem in the fields of machine learning and data mining, as most real-life datasets are imbalanced. Several existing machine learning algorithms try to maximize classification accuracy by correctly identifying majority class samples while ignoring the minority class. However, the minority class instances usually represent a higher interest than the majority class. Recently, several cost-sensitive methods, ensemble models and sampling techniques have been used in the literature in order to classify imbalanced datasets. In this paper, we propose MEBoost, a new boosting algorithm for imbalanced datasets. MEBoost mixes two different weak learners with boosting to improve performance on imbalanced datasets. MEBoost is an alternative to existing techniques such as SMOTEBoost, RUSBoost, Adaboost, etc. The performance of MEBoost has been evaluated on 12 benchmark imbalanced datasets against state-of-the-art ensemble methods like SMOTEBoost, RUSBoost, Easy Ensemble, EUSBoost, and DataBoost. Experimental results show significant improvement over the other methods, and it can be concluded that MEBoost is an effective and promising algorithm for dealing with imbalanced datasets. 
Mechanical Turk (MTurk) 
Amazon Mechanical Turk (MTurk) is a crowdsourcing Internet marketplace that enables individuals and businesses (known as Requesters) to coordinate the use of human intelligence to perform tasks that computers are currently unable to do. It is one of the sites of Amazon Web Services. Employers are able to post jobs known as HITs (Human Intelligence Tasks), such as choosing the best among several photographs of a storefront, writing product descriptions, or identifying performers on music CDs. Workers (called Providers in Mechanical Turk’s Terms of Service, or, more colloquially, Turkers) can then browse among existing jobs and complete them for a monetary payment set by the employer. To place jobs, the requesting programs use an open application programming interface (API), or the more limited MTurk Requester site. Employers are restricted to US-based entities. 
Median Absolute Deviation (MAD) 
In statistics, the median absolute deviation (MAD) is a robust measure of the variability of a univariate sample of quantitative data. It can also refer to the population parameter that is estimated by the MAD calculated from a sample. Consider the data (1, 1, 2, 2, 4, 6, 9). It has a median value of 2. The absolute deviations about 2 are (1, 1, 0, 0, 2, 4, 7) which in turn have a median value of 1 (because the sorted absolute deviations are (0, 0, 1, 1, 2, 4, 7)). So the median absolute deviation for this data is 1. 
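The worked example above can be reproduced directly (a sketch, names ours):

```python
from statistics import median

def median_absolute_deviation(data):
    """Median of the absolute deviations of the data about its median."""
    m = median(data)
    return median(abs(x - m) for x in data)

# Median of (1, 1, 2, 2, 4, 6, 9) is 2; the absolute deviations about 2
# are (1, 1, 0, 0, 2, 4, 7), whose median is 1.
print(median_absolute_deviation([1, 1, 2, 2, 4, 6, 9]))  # 1
```

Unlike the standard deviation, a single wild outlier barely moves this statistic, which is what makes it a robust scale estimate.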
Median Polish  The median polish is an exploratory data analysis procedure proposed by the statistician John Tukey. It finds an additively-fit model for data in a two-way layout table (usually, results from a factorial experiment) of the form row effect + column effect + overall median. STMedianPolish 
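A simplified sketch of the sweeping procedure: row medians are repeatedly swept into row effects and column medians into column effects, with a running overall term (real implementations differ in how effect medians are folded back and in their stopping rules; names ours):

```python
from statistics import median

def median_polish(table, iters=10):
    """Tukey-style median polish on a two-way table. Returns the overall
    term, row effects, column effects, and the residual table."""
    rows, cols = len(table), len(table[0])
    res = [row[:] for row in table]
    overall, row_eff, col_eff = 0.0, [0.0] * rows, [0.0] * cols
    for _ in range(iters):
        for i in range(rows):                  # sweep row medians
            m = median(res[i])
            row_eff[i] += m
            res[i] = [v - m for v in res[i]]
        m = median(row_eff)                    # fold into the overall term
        overall += m
        row_eff = [v - m for v in row_eff]
        for j in range(cols):                  # sweep column medians
            m = median(res[i][j] for i in range(rows))
            col_eff[j] += m
            for i in range(rows):
                res[i][j] -= m
    return overall, row_eff, col_eff, res

table = [[1, 3], [3, 5], [5, 7]]               # exactly additive data
overall, row_eff, col_eff, res = median_polish(table)
print(overall, row_eff, col_eff)               # 4.0 [-2.0, 0.0, 2.0] [-1.0, 1.0]
```

On this exactly additive table the residuals vanish, and overall + row effect + column effect reconstructs every cell.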
Mediation  In statistics, a mediation model is one that seeks to identify and explicate the mechanism or process that underlies an observed relationship between an independent variable and a dependent variable via the inclusion of a third explanatory variable, known as a mediator variable. Rather than hypothesizing a direct causal relationship between the independent variable and the dependent variable, a mediational model hypothesizes that the independent variable influences the mediator variable, which in turn influences the dependent variable. Thus, the mediator variable serves to clarify the nature of the relationship between the independent and dependent variables. In other words, mediating relationships occur when a third variable plays an important role in governing the relationship between the other two variables. mediation,mma,mlma 
Medoid  Medoids are representative objects of a data set or of a cluster within a data set whose average dissimilarity to all the objects in the cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always members of the data set. Medoids are most commonly used on data where a mean or centroid cannot be defined, such as 3-D trajectories or in the gene expression context. The term is used in computer science in data clustering algorithms. 
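A direct implementation of the definition for points in the plane (a sketch, names ours):

```python
def medoid(points):
    """Return the member of the data set whose total Euclidean distance
    to all other members is minimal."""
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

# The centroid of these points is (3.2, 0), which is not a member of the
# data set; the medoid is an actual data point.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(medoid(pts))  # (2, 0)
```

Note that, like the median, the medoid is far less sensitive to the outlier at (10, 0) than the centroid is.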
Medusa  Applications such as web search and social networking have been moving from centralized to decentralized cloud architectures to improve their scalability. MapReduce, a programming framework for processing large amounts of data using thousands of machines in a single cloud, also needs to be scaled out to multiple clouds to adapt to this evolution. The challenge of building a multi-cloud distributed architecture is substantial. Notwithstanding, the ability to deal with the new types of faults introduced by such a setting, such as the outage of a whole datacenter or an arbitrary fault caused by a malicious cloud insider, increases the endeavor considerably. In this paper we propose Medusa, a platform that allows MapReduce computations to scale out to multiple clouds and tolerate several types of faults. Our solution fulfills four objectives. First, it is transparent to the user, who writes her typical MapReduce application without modification. Second, it does not require any modification to the widely used Hadoop framework. Third, the proposed system goes well beyond the fault tolerance offered by MapReduce to tolerate arbitrary faults, cloud outages, and even malicious faults caused by corrupt cloud insiders. Fourth, it achieves this increased level of fault tolerance at reasonable cost. We performed an extensive experimental evaluation in the ExoGENI testbed, demonstrating that our solution significantly reduces execution time when compared to traditional methods that achieve the same level of resilience. 
Memetic Algorithms (MA) 
Memetic algorithms (MA) represent one of the recent growing areas of research in evolutionary computation. The term MA is now widely used as a synergy of evolutionary or any population-based approach with separate individual learning or local improvement procedures for problem search. Quite often, MA are also referred to in the literature as Baldwinian evolutionary algorithms (EA), Lamarckian EAs, cultural algorithms, or genetic local search. A Gentle Introduction to Memetic Algorithms 
Memetic Graph Clustering  ➘ “VieClus” 
Memory Attention-aware Recommender System (MARS) 
In this paper, we study the problem of modeling users’ diverse interests. Previous methods usually learn a fixed user representation, which has a limited ability to represent distinct interests of a user. In order to model users’ various interests, we propose a Memory Attention-aware Recommender System (MARS). MARS utilizes a memory component and a novel attentional mechanism to learn deep adaptive user representations. Trained in an end-to-end fashion, MARS adaptively summarizes users’ interests. In the experiments, MARS outperforms seven state-of-the-art methods on three real-world datasets in terms of recall and mean average precision. We also demonstrate that MARS has great interpretability to explain its recommendation results, which is important in many recommendation scenarios. 
Memory Augmented Control Network (MACN) 
Planning problems in partially observable environments cannot be solved directly with convolutional networks and require some form of memory. But even memory networks with sophisticated addressing schemes are unable to learn intelligent reasoning satisfactorily due to the complexity of simultaneously learning to access memory and plan. To mitigate these challenges we introduce the Memory Augmented Control Network (MACN). The proposed network architecture consists of three main parts. The first part uses convolutions to extract features and the second part uses a neural network-based planning module to pre-plan in the environment. The third part uses a network controller that learns to store those specific instances of past information that are necessary for planning. The performance of the network is evaluated in discrete grid world environments for path planning in the presence of simple and complex obstacles. We show that our network learns to plan and can generalize to new environments. 
Memory Augmented Neural Network (MANN) 
Deep learning typically requires training a very capable architecture using large datasets. However, many important learning problems demand an ability to draw valid inferences from small size datasets, and such problems pose a particular challenge for deep learning. In this regard, various researches on ‘meta-learning’ are being actively conducted. Recent work has suggested a Memory Augmented Neural Network (MANN) for meta-learning. MANN is an implementation of a Neural Turing Machine (NTM) with the ability to rapidly assimilate new data in its memory, and use this data to make accurate predictions. In models such as MANN, the input data samples and their appropriate labels from the previous step are bound together in the same memory locations. This often leads to memory interference when performing a task as these models have to retrieve a feature of an input from a certain memory location and read only the label information bound to that location. In this paper, we tried to address this issue by presenting a more robust MANN. We revisited the idea of meta-learning and proposed a new memory augmented neural network by explicitly splitting the external memory into feature and label memories. The feature memory is used to store the features of input data samples and the label memory stores their labels. Hence, when predicting the label of a given input, our model uses its feature memory unit as a reference to extract the stored feature of the input, and based on that feature, it retrieves the label information of the input from the label memory unit. In order for the network to function in this framework, a new memory-writing module to encode label information into the label memory in accordance with the meta-learning task structure is designed. Here, we demonstrate that our model outperforms MANN by a large margin in supervised one-shot classification tasks using the Omniglot and MNIST datasets. 
Streaming MANN: A Streaming-Based Inference for Energy-Efficient Memory-Augmented Neural Networks 
Memory Networks  We describe a new class of learning models called memory networks. Memory networks reason with inference components combined with a long-term memory component; they learn how to use these jointly. The long-term memory can be read and written to, with the goal of using it for prediction. We investigate these models in the context of question answering (QA) where the long-term memory effectively acts as a (dynamic) knowledge base, and the output is a textual response. We evaluate them on a large-scale QA task, and a smaller, but more complex, toy task generated from a simulated world. In the latter, we show the reasoning power of such models by chaining multiple supporting sentences to answer questions that require understanding the intension of verbs. 
Memory-Efficient Convolution (MEC) 
Convolution is a critical component in modern deep neural networks, thus several algorithms for convolution have been developed. Direct convolution is simple but suffers from poor performance. As an alternative, multiple indirect methods have been proposed, including im2col-based convolution, FFT-based convolution, and the Winograd-based algorithm. However, all these indirect methods have high memory overhead, which creates performance degradation and offers a poor trade-off between performance and memory consumption. In this work, we propose a memory-efficient convolution, or MEC, with compact lowering, which reduces memory overhead substantially and accelerates the convolution process. MEC lowers the input matrix in a simple yet efficient/compact way (i.e., with much less memory overhead), and then executes multiple small matrix multiplications in parallel to complete the convolution. Additionally, the reduced memory footprint improves memory subsystem efficiency, improving performance. Our experimental results show that MEC reduces memory consumption significantly with good speedup on both mobile and server platforms, compared with other indirect convolution algorithms. 
Mendelian Randomization  The basic idea behind Mendelian Randomization is the following. In a simple, randomly mating population Mendel’s laws tell us that at any genomic locus (a measured spot in the genome) the allele (genetic material you got) you get is assigned at random. At the chromosome level this is very close to true due to properties of meiosis (here is an example of how this looks in very cartoonish form in yeast). http://…/018150.full.pdf 
MentorNet  Recent studies have discovered that deep networks are capable of memorizing the entire data even when the labels are completely random. Since deep models are trained on big data where labels are often noisy, the ability to overfit noise can lead to poor performance. To overcome the overfitting on corrupted training data, we propose a novel technique to regularize deep networks in the data dimension. This is achieved by learning a neural network called MentorNet to supervise the training of the base network, namely, StudentNet. Our work is inspired by curriculum learning and advances the theory by learning a curriculum from data by neural networks. We demonstrate the efficacy of MentorNet on several benchmarks. Comprehensive experiments show that it is able to significantly improve the generalization performance of the state-of-the-art deep networks on corrupted training data. 
Merge and Select  In this article we introduce Merge and Select – a methodology – and factorMerger – an R package – for the exploration and visualization of k-group comparisons. Comparison of k groups is one of the most important issues in exploratory analyses, with countless applications. The classical solution is to test the null hypothesis that observations from all groups come from the same distribution. If the global null hypothesis is rejected, a more detailed analysis of differences among pairs of groups is performed. The traditional approach is to use pairwise post hoc tests in order to verify which groups differ significantly. However, this approach fails with a large number of groups, in both the interpretation and visualization layers. The Merge and Select methodology solves this problem by using an easy-to-understand description of LRT-based similarity among groups. 
MergeNet  We present here a novel network architecture called MergeNet for discovering small obstacles in on-road scenes in the context of autonomous driving. The basis of the architecture rests on the central consideration of training with less amount of data, since the physical setup and the annotation process for small obstacles is hard to scale. For making effective use of the limited data, we propose a multi-stage training procedure involving weight-sharing, separate learning of low and high level features from the RGB-D input, and a refining stage which learns to fuse the obtained complementary features. The model is trained and evaluated on the Lost and Found dataset and is able to achieve state-of-the-art results with just 135 images, in comparison to the 1000 images used by the previous benchmark. Additionally, we also compare our results with recent methods trained on 6000 images and show that our method achieves comparable performance with only 1000 training samples. 
MergeShuffle  This article introduces MergeShuffle, an extremely efficient algorithm to generate random permutations (or to randomly permute an existing array). It is easy to implement, runs in $n\log_2 n + O(1)$ time, is in-place, uses $n\log_2 n + \Theta(n)$ random bits, and can be parallelized across any number of processes in a shared-memory PRAM model. Our preliminary simulations using OpenMP suggest it is more efficient than the Rao-Sandelius algorithm, one of the fastest existing random permutation algorithms. We also show how to further reduce the number of random bits consumed by introducing a second algorithm, BalancedShuffle, a variant of the Rao-Sandelius algorithm that is more conservative in the way it recursively partitions arrays to be shuffled. While this algorithm is of lesser practical interest, we believe it may be of theoretical value. Our full code is available at: https://…/mergeshuffle 
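The split-and-merge idea can be sketched as follows. This is an illustrative reconstruction, not the paper's bit-efficient merge: here the merge step simply draws the next element from the left or right half with probability proportional to how many elements remain in each, which is enough to keep the result uniformly distributed.

```python
import random

def merge_shuffle(a, rng=random):
    """Uniformly shuffle list `a` by recursively shuffling two halves
    and merging them at random. Sketch only: the published MergeShuffle
    uses a more frugal, bit-level merge to reach n log2 n + O(n) bits."""
    n = len(a)
    if n < 2:
        return list(a)
    mid = n // 2
    left = merge_shuffle(a[:mid], rng)
    right = merge_shuffle(a[mid:], rng)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        # pick from left with probability (#left remaining)/(#total remaining)
        if rng.randrange((len(left) - i) + (len(right) - j)) < len(left) - i:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out

shuffled = merge_shuffle(list(range(10)))
```

Because each half is independently uniform and the interleaving pattern is chosen uniformly among all possible interleavings, the merged result is a uniform random permutation.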
mermaid  Generation of diagrams and flowcharts from text, in a manner similar to Markdown. Ever wanted to simplify documentation and avoid heavy tools like Visio when explaining your code? This is why mermaid was born: a simple markdown-like script language for generating charts from text via JavaScript. 
Mesa  Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google's Internet advertising business. Mesa is designed to satisfy a complex and challenging set of user and systems requirements, including near real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Specifically, Mesa handles petabytes of data, processes millions of row updates per second, and serves billions of queries that fetch trillions of rows per day. Mesa is geo-replicated across multiple datacenters and provides consistent and repeatable query answers at low latency, even when an entire datacenter fails. 
Message Importance Divergence (MID) 
Information transfer, which reveals the state variation of variables, can play a vital role in big data analytics and processing. In fact, a measure for information transfer can reflect system change from the statistics by using the variable distributions, similar to KL divergence and Rényi divergence. Furthermore, in terms of information transfer in big data, small-probability events dominate the importance of the total message to some degree. Therefore, it is significant to design an information transfer measure based on message importance that emphasizes small-probability events. In this paper, we propose the message importance divergence (MID) and investigate its characteristics and applications in three respects. First, the message importance transfer capacity based on MID is presented to offer an upper bound on information transfer with disturbance. Then, we utilize the MID to guide queue length selection, a fundamental problem in the caching operation of mobile edge computing. Finally, we extend the MID to the continuous case and discuss its robustness by using it to measure information distance. 
Message Passing Algorithms  Constraint Satisfaction Problems (CSPs) are defined over a set of variables whose state must satisfy a number of constraints. We study a class of algorithms called Message Passing Algorithms, which aim at finding the probability distribution of the variables over the space of satisfying assignments. These algorithms involve passing local messages (according to some message update rules) over the edges of a factor graph constructed corresponding to the CSP. 
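A minimal sum-product illustration of the message idea (factor values are made up for this sketch): for a tiny factor graph with two binary variables, the message from a pairwise factor to a variable sums out the other variable, and the resulting belief matches the brute-force marginal.

```python
import numpy as np

# Tiny factor graph: variables x, y in {0, 1}, unary factors g, h,
# and a pairwise factor f(x, y). All values here are illustrative.
g = np.array([0.4, 0.6])           # unary factor on x
h = np.array([0.7, 0.3])           # unary factor on y
f = np.array([[0.9, 0.1],
              [0.2, 0.8]])         # pairwise factor f(x, y)

# message from factor f to variable x: m(x) = sum_y f(x, y) * h(y)
m_f_to_x = f @ h
belief_x = g * m_f_to_x
belief_x /= belief_x.sum()         # normalized marginal p(x)

# brute-force check: enumerate the joint p(x, y) ∝ g(x) h(y) f(x, y)
joint = g[:, None] * h[None, :] * f
marg_x = joint.sum(axis=1)
marg_x /= marg_x.sum()
assert np.allclose(belief_x, marg_x)
```

On tree-shaped factor graphs this message schedule is exact; on loopy graphs the same updates are iterated and yield approximate marginals.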
Message Passing Interface (MPI) 
Message Passing Interface (MPI) is a standardized and portable message-passing system designed by a group of researchers from academia and industry to function on a wide variety of parallel computers. The standard defines the syntax and semantics of a core of library routines useful to a wide range of users writing portable message-passing programs in Fortran or the C programming language. There are several well-tested and efficient implementations of MPI, including some that are free or in the public domain. These fostered the development of a parallel software industry, and thereby encouraged the development of portable and scalable large-scale parallel applications. 
Message Understanding Conference (MUC) 
The Message Understanding Conferences (MUC) were initiated and financed by DARPA (Defense Advanced Research Projects Agency) to encourage the development of new and better methods of information extraction. The character of this competition—many concurrent research teams competing against one another—required the development of standards for evaluation, e.g. the adoption of metrics like precision and recall. 
M-Estimation  In statistics, M-estimators are a broad class of estimators, which are obtained as the minima of sums of functions of the data. Least-squares estimators are a special case of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation. More generally, an M-estimator may be defined to be a zero of an estimating function. This estimating function is often the derivative of another statistical function. For example, a maximum-likelihood estimate is often defined to be a zero of the derivative of the likelihood function with respect to the parameter; thus, a maximum-likelihood estimator is often a critical point of the score function. In many applications, such M-estimators can be thought of as estimating characteristics of the population. geex 
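As a concrete illustration (a hypothetical helper, not from any particular package), the Huber M-estimate of location can be computed by iteratively reweighted means: points with residual at most c get full weight, while outliers get weight c/|r|, which bounds their influence in a way the plain mean does not.

```python
import statistics

def huber_location(xs, c=1.345, iters=50):
    """Huber M-estimate of location via iteratively reweighted means.
    Illustrative sketch: residuals |r| <= c contribute quadratically
    (weight 1); larger residuals contribute linearly (weight c/|r|)."""
    mu = statistics.median(xs)              # robust starting point
    for _ in range(iters):
        w = [1.0 if abs(x - mu) <= c else c / abs(x - mu) for x in xs]
        mu = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
    return mu

data = [1.0, 2.0, 3.0, 4.0, 100.0]          # one gross outlier
estimate = huber_location(data)             # stays near the bulk of the data
```

On this data the plain mean is 22, while the Huber estimate remains near the four uncontaminated points.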
Meta Bag Algorithm  
Meta Networks  Deep neural networks have been successfully applied in applications with a large amount of labeled data. However, there are major drawbacks of the neural networks that are related to rapid generalization with small data and continual learning of new concepts without forgetting. We present a novel meta-learning method, Meta Networks (MetaNet), that acquires meta-level knowledge across tasks and shifts its inductive bias via fast parameterization for rapid generalization. When tested on the standard one-shot learning benchmarks, our MetaNet models achieved near human-level accuracy. We demonstrated several appealing properties of MetaNet relating to generalization and continual learning. 
Meta-Analysis for Pathway Enrichment (MAPE) 
Motivation: Many pathway analysis (or gene set enrichment analysis) methods have been developed to identify enriched pathways under different biological states within a genomic study. As more and more microarray datasets accumulate, meta-analysis methods have also been developed to integrate information among multiple studies. Currently, most meta-analysis methods for combining genomic studies focus on biomarker detection, and meta-analysis for pathway analysis has not been systematically pursued. Results: We investigated two approaches of meta-analysis for pathway enrichment (MAPE) by combining statistical significance across studies at the gene level (MAPE_G) or at the pathway level (MAPE_P). Simulation results showed increased statistical power of meta-analysis approaches compared to a single study analysis and showed complementary advantages of MAPE_G and MAPE_P under different scenarios. We also developed an integrated method (MAPE_I) that incorporates advantages of both approaches. Comprehensive simulations and applications to real data on drug response of breast cancer cell lines and lung cancer tissues were evaluated to compare the performance of the three MAPE variations. MAPE_P has the advantage of not requiring gene matching across studies. When MAPE_G and MAPE_P show complementary advantages, the hybrid version MAPE_I is generally recommended. MetaPath 
MetaBags  Ensembles are popular methods for solving practical supervised learning problems. They reduce the risk of having underperforming models in production-grade software. Although critical, methods for learning heterogeneous regression ensembles have not been proposed at large scale, whereas in the classical ML literature, stacking, cascading and voting are mostly restricted to classification problems. Regression poses distinct learning challenges that may result in poor performance, even when using well established homogeneous ensemble schemas such as bagging or boosting. In this paper, we introduce MetaBags, a novel, practically useful stacking framework for regression. MetaBags is a meta-learning algorithm that learns a set of meta-decision trees designed to select one base model (i.e. expert) for each query, and focuses on inductive bias reduction. A set of meta-decision trees are learned using different types of meta-features, specially created for this purpose, to then be bagged at the meta-level. This procedure is designed to learn a model with a fair bias-variance trade-off, and its improvement over base model performance is correlated with the prediction diversity of different experts on specific input space subregions. The proposed method and meta-features are designed in such a way that they enable good predictive performance even in subregions of space which are not adequately represented in the available training data. An exhaustive empirical testing of the method was performed, evaluating both generalization error and scalability of the approach on synthetic, open and real-world application datasets. The obtained results show that our method significantly outperforms existing state-of-the-art approaches. 
Meta-Cognitive Machine Learning  Machine learning is usually defined in behaviourist terms, where external validation is the primary mechanism of learning. In this paper, I argue for a more holistic interpretation in which finding more probable, efficient and abstract representations is as central to learning as performance. In other words, machine learning should be extended with strategies to reason over its own learning process, leading to so-called meta-cognitive machine learning. As such, the de facto definition of machine learning should be reformulated in these intrinsically multi-objective terms, taking into account not only the task performance but also internal learning objectives. To this end, we suggest defining a 'model entropy function' that quantifies the efficiency of the internal learning processes. It is conjectured that the minimization of this model entropy leads to concept formation. Besides philosophical aspects, some initial illustrations are included to support these claims. 
MetaForest  A requirement of classic meta-analysis is that the studies being aggregated are conceptually similar, and ideally, close replications. However, in many fields, there is substantial heterogeneity between studies on the same topic. Similar research questions are studied in different laboratories, using different methods, instruments, and samples. Classic meta-analysis lacks the power to assess more than a handful of univariate moderators, or to investigate interactions between moderators, and nonlinear effects. MetaForest, by contrast, has substantial power to explore heterogeneity in meta-analysis. It can identify important moderators from a larger set of potential candidates, even with as little as 20 studies (Van Lissa, in preparation). This is an appealing quality, because many meta-analyses have small sample sizes. Moreover, MetaForest yields a measure of variable importance which can be used to identify important moderators, and offers partial prediction plots to explore the shape of the marginal relationship between moderators and effect size. metaforest 
Metatrace  Reinforcement learning (RL) has had many successes in both 'deep' and 'shallow' settings. In both cases, significant hyperparameter tuning is often required to achieve good performance. Furthermore, when nonlinear function approximation is used, non-stationarity in the state representation can lead to learning instability. A variety of techniques exist to combat this, most notably large experience replay buffers or the use of multiple parallel actors. These techniques come at the cost of moving away from the online RL problem as it is traditionally formulated (i.e., a single agent learning online without maintaining a large database of training examples). Meta-learning can potentially help with both these issues by tuning hyperparameters online and allowing the algorithm to more robustly adjust to non-stationarity in a problem. This paper applies meta-gradient descent to derive a set of step-size tuning algorithms specifically for online RL control with eligibility traces. Our novel technique, Metatrace, makes use of an eligibility trace analogous to methods like $TD(\lambda)$. We explore tuning both a single scalar step-size and a separate step-size for each learned parameter. We evaluate Metatrace first for control with linear function approximation in the classic mountain car problem and then in a noisy, non-stationary version. Finally, we apply Metatrace for control with nonlinear function approximation in 5 games in the Arcade Learning Environment, where we explore how it impacts learning speed and robustness to initial step-size choice. Results show that the meta-step-size parameter of Metatrace is easy to set, Metatrace can speed learning, and Metatrace can allow an RL algorithm to deal with non-stationarity in the learning task. 
Meta-Unsupervised-Learning  We introduce a new paradigm to investigate unsupervised learning, reducing unsupervised learning to supervised learning. Specifically, we mitigate the subjectivity in unsupervised decision-making by leveraging knowledge acquired from prior, possibly heterogeneous, supervised learning tasks. We demonstrate the versatility of our framework via comprehensive expositions and detailed experiments on several unsupervised problems such as (a) clustering, (b) outlier detection, and (c) similarity prediction under a common umbrella of meta-unsupervised-learning. We also provide rigorous PAC-agnostic bounds to establish the theoretical foundations of our framework, and show that our framing of meta-clustering circumvents Kleinberg's impossibility theorem for clustering. 
Metcalfe’s Law  Metcalfe’s law states that the value of a telecommunications network is proportional to the square of the number of connected users of the system (n^2). First formulated in this form by George Gilder in 1993, and attributed to Robert Metcalfe in regard to Ethernet, Metcalfe’s law was originally presented, circa 1980, not in terms of users, but rather of ‘compatible communicating devices’ (for example, fax machines, telephones, etc.). Only more recently with the launch of the Internet did this law carry over to users and networks as its original intent was to describe Ethernet purchases and connections. The law is also very much related to economics and business management, especially with competitive companies looking to merge with one another. In the real world, requirements of Pareto efficiency imply that the law will not hold. 
Method of Codifferential Descent (MCD) 
The method of codifferential descent (MCD) was developed by Professor V.F. Demyanov for solving a large class of nonsmooth nonconvex optimization problems. ➚ "Generalised Method of Codifferential Descent" 
Method of Moments (MM) 
In statistics, the method of moments is a method of estimation of population parameters. One starts by deriving equations that relate the population moments (i.e., the expected values of powers of the random variable under consideration) to the parameters of interest. Then a sample is drawn and the population moments are estimated from the sample. The equations are then solved for the parameters of interest, using the sample moments in place of the (unknown) population moments. This results in estimates of those parameters. The method of moments was introduced by Karl Pearson in 1894. momentchi2 
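The procedure can be made concrete with the standard textbook Gamma example (shown here for illustration): equating the population moments $E[X] = k\theta$ and $\mathrm{Var}[X] = k\theta^2$ to the sample mean and variance and solving gives closed-form estimators.

```python
def gamma_moment_estimates(xs):
    """Method-of-moments estimators for a Gamma(shape k, scale theta):
    solving mean = k*theta and var = k*theta^2 gives
    k = mean^2 / var and theta = var / mean."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    k = mean ** 2 / var
    theta = var / mean
    return k, theta
```

By construction the fitted distribution reproduces the sample mean exactly: k * theta equals the sample mean.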
Method of Simulated Moments (MSM) 
In econometrics, the method of simulated moments (MSM) (also called simulated method of moments) is a structural estimation technique introduced by Daniel McFadden. It extends the generalized method of moments to cases where theoretical moment functions cannot be evaluated directly, such as when moment functions involve highdimensional integrals. MSM’s earliest and principal applications have been to research in industrial organization, after its development by Ariel Pakes, David Pollard, and others, though applications in consumption are emerging. 
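A toy sketch of the idea (a hypothetical helper, purely for illustration): match the sample mean of the data to a mean computed by simulation, using common random numbers so the simulated moment is smooth in the parameter. Here the moment of an Exponential is known analytically, so MSM is not actually needed; its value is in settings where moments can only be simulated.

```python
import random

def msm_exponential_rate(data, n_sim=20000, seed=1):
    """Toy method-of-simulated-moments estimate of an Exponential rate.
    The moment condition equates the sample mean of `data` with a
    simulated mean; a fixed batch of Exp(1) draws rescaled by 1/lam
    (common random numbers) makes the condition monotone in lam,
    so bisection solves it."""
    rng = random.Random(seed)
    base = [rng.expovariate(1.0) for _ in range(n_sim)]    # Exp(1) draws
    target = sum(data) / len(data)

    def sim_mean(lam):
        # mean of Exp(lam) draws obtained by rescaling the common draws
        return sum(e / lam for e in base) / n_sim

    lo, hi = 1e-6, 1e6
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if sim_mean(mid) > target:     # simulated mean too large: raise rate
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With data drawn from Exp(rate = 2), the recovered rate is close to 2.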
Metric  In mathematics, a metric or distance function is a function that defines a distance between each pair of elements of a set. A set with a metric is called a metric space. A metric induces a topology on a set, but not all topologies can be generated by a metric. A topological space whose topology can be described by a metric is called metrizable. 
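Spelled out, the axioms a distance function $d : X \times X \to \mathbb{R}$ must satisfy for all $x, y, z \in X$ are:

```latex
\begin{align*}
  &d(x, y) \ge 0                     && \text{(non-negativity)}\\
  &d(x, y) = 0 \iff x = y            && \text{(identity of indiscernibles)}\\
  &d(x, y) = d(y, x)                 && \text{(symmetry)}\\
  &d(x, z) \le d(x, y) + d(y, z)     && \text{(triangle inequality)}
\end{align*}
```

The induced topology mentioned above is the one generated by the open balls $B(x, r) = \{ y : d(x, y) < r \}$.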
Metric Expression Network (MEnet) 
Recent CNN-based saliency models have achieved great performance on public datasets; however, most of them are sensitive to distortion (e.g., noise, compression). In this paper, an end-to-end generic salient object segmentation model called Metric Expression Network (MEnet) is proposed to overcome this drawback. Within this architecture, we construct a new topological metric space, with the implicit metric determined by the deep network. In this way, we succeed in semantically grouping all the pixels of the observed image within this latent space into two regions: a salient region and a non-salient region. With this method, all feature extractions are carried out at the pixel level, which makes the output boundaries of salient objects fine-grained. Experimental results show that the proposed metric can generate robust saliency maps that allow for object segmentation. By testing the method on several public benchmarks, we show that MEnet achieves good performance. Furthermore, the proposed method outperforms previous CNN-based methods on distorted images. 
Metric Optimization Engine (MOE) 
MOE (Metric Optimization Engine) is an efficient way to optimize a system's parameters when evaluating parameters is time-consuming or expensive. It is an open source, machine learning tool for solving these global, black box optimization problems in an optimal way. Here are some examples of when you could use MOE: 1. Optimizing a system's click-through rate (CTR). 2. Optimizing tunable parameters of a machine-learning prediction method. 3. Optimizing the design of an engineering system. 4. Optimizing the parameters of a real-world experiment. 
Metric-Constrained Kernel Union-of-Subspaces (MC-KUoS) 
Modern information processing relies on the axiom that high-dimensional data lie near low-dimensional geometric structures. This paper revisits the problem of data-driven learning of these geometric structures and puts forth two new nonlinear geometric models for data describing 'related' objects/phenomena. The first of these models straddles the two extremes of the subspace model and the union-of-subspaces model, and is termed the metric-constrained union-of-subspaces (MC-UoS) model. The second of these models, suited for data drawn from a mixture of nonlinear manifolds, generalizes the kernel subspace model, and is termed the metric-constrained kernel union-of-subspaces (MC-KUoS) model. The main contributions of this paper in this regard include the following. First, it motivates and formalizes the problems of MC-UoS and MC-KUoS learning. Second, it presents algorithms that efficiently learn an MC-UoS or an MC-KUoS underlying data of interest. Third, it extends these algorithms to the case when parts of the data are missing. Last, but not least, it reports the outcomes of a series of numerical experiments involving both synthetic and real data that demonstrate the superiority of the proposed geometric models and learning algorithms over existing approaches in the literature. These experiments also help clarify the connections between this work and the literature on (subspace and kernel k-means) clustering. GitXiv 
Metric-Constrained Union-of-Subspaces (MC-UoS) 
Modern information processing relies on the axiom that high-dimensional data lie near low-dimensional geometric structures. This paper revisits the problem of data-driven learning of these geometric structures and puts forth two new nonlinear geometric models for data describing 'related' objects/phenomena. The first of these models straddles the two extremes of the subspace model and the union-of-subspaces model, and is termed the metric-constrained union-of-subspaces (MC-UoS) model. The second of these models, suited for data drawn from a mixture of nonlinear manifolds, generalizes the kernel subspace model, and is termed the metric-constrained kernel union-of-subspaces (MC-KUoS) model. The main contributions of this paper in this regard include the following. First, it motivates and formalizes the problems of MC-UoS and MC-KUoS learning. Second, it presents algorithms that efficiently learn an MC-UoS or an MC-KUoS underlying data of interest. Third, it extends these algorithms to the case when parts of the data are missing. Last, but not least, it reports the outcomes of a series of numerical experiments involving both synthetic and real data that demonstrate the superiority of the proposed geometric models and learning algorithms over existing approaches in the literature. These experiments also help clarify the connections between this work and the literature on (subspace and kernel k-means) clustering. GitXiv 
MetricsGraphics.js  MetricsGraphics.js is a library built on top of D3 that is optimized for visualizing and laying out time-series data. It provides a simple way to produce common types of graphics in a principled, consistent and responsive way. The library currently supports line charts, scatterplots and histograms, as well as features like rug plots and basic linear regression. metricsgraphics 
Metropolis-Adjusted Langevin Algorithm (MALA) 
The Metropolis-Adjusted Langevin Algorithm (MALA) is a Markov chain Monte Carlo method which creates a Markov chain reversible with respect to a given target distribution $\pi^N$, with Lebesgue density on $\mathbb{R}^N$; it can hence be used to approximately sample the target distribution. When the dimension N is large, a key question is to determine the computational cost of the algorithm as a function of N. One approach to this question, which we adopt here, is to derive diffusion limits for the algorithm. The family of target measures that we consider in this paper are, in general, in non-product form and are of interest in applied problems as they arise in Bayesian nonparametric statistics and in the study of conditioned diffusions. Furthermore, we study the situation, which arises in practice, where the algorithm is started out of stationarity. We thereby significantly extend previous works which consider either only measures of product form, when the Markov chain is started out of stationarity, or measures defined via a density with respect to a Gaussian, when the Markov chain is started in stationarity. We prove that, in the non-stationary regime, the computational cost of the algorithm is of the order $N^{1/2}$ with dimension, as opposed to what is known to happen in the stationary regime, where the cost is of the order $N^{1/3}$. Counterstrike: Defending Deep Learning Architectures Against Adversarial Samples by Langevin Dynamics with Supervised Denoising Autoencoder 
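The mechanics of MALA in one dimension can be sketched as follows (an illustrative toy, not the high-dimensional setting analyzed above): the proposal drifts along the gradient of the log-density, and the Metropolis-Hastings correction must account for the asymmetric proposal density $q$.

```python
import math
import random

def mala_standard_normal(n_steps=20000, eps=0.9, seed=0):
    """Illustrative MALA sampler for a standard normal target.
    For log pi(x) = -x^2/2 the Langevin proposal is
    x' = x + (eps^2/2) * grad log pi(x) + eps * xi,  xi ~ N(0, 1),
    accepted with the MH ratio that includes the asymmetric q."""
    rng = random.Random(seed)
    log_pi = lambda x: -0.5 * x * x
    grad = lambda x: -x

    def log_q(x_to, x_from):
        # log proposal density q(x_to | x_from), up to a constant
        mean = x_from + 0.5 * eps * eps * grad(x_from)
        return -((x_to - mean) ** 2) / (2.0 * eps * eps)

    x, samples = 0.0, []
    for _ in range(n_steps):
        prop = x + 0.5 * eps * eps * grad(x) + eps * rng.gauss(0.0, 1.0)
        log_alpha = (log_pi(prop) + log_q(x, prop)
                     - log_pi(x) - log_q(prop, x))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = prop
        samples.append(x)
    return samples
```

With these settings the chain's empirical mean and variance approach 0 and 1, the moments of the target.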
Metropolis-Hastings Algorithm  In statistics and in statistical physics, the Metropolis-Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult. This sequence can be used to approximate the distribution (i.e., to generate a histogram), or to compute an integral (such as an expected value). Metropolis-Hastings and other MCMC algorithms are generally used for sampling from multi-dimensional distributions, especially when the number of dimensions is high. For single-dimensional distributions, other methods are usually available (e.g. adaptive rejection sampling) that can directly return independent samples from the distribution, and are free from the problem of autocorrelated samples that is inherent in MCMC methods. http://…/1504.01896 
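A minimal random-walk Metropolis-Hastings sketch (illustrative, with a made-up target): only an unnormalized log-density is needed, because with a symmetric proposal the acceptance probability reduces to min(1, pi(x')/pi(x)) and the normalizing constant cancels.

```python
import math
import random

def metropolis_hastings(log_target, n_steps=30000, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian
    proposal; `log_target` may be unnormalized."""
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(n_steps):
        prop = x + step * rng.gauss(0.0, 1.0)
        log_alpha = log_target(prop) - log_target(x)
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = prop
        samples.append(x)
    return samples

# e.g. sample an unnormalized Laplace density pi(x) ∝ exp(-|x|)
draws = metropolis_hastings(lambda x: -abs(x))
```

The empirical mean and variance of the chain approach the Laplace(0, 1) values of 0 and 2.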
Metzler Matrix  In mathematics, a Metzler matrix is a matrix in which all the off-diagonal components are nonnegative (equal to or greater than zero). It is named after the American economist Lloyd Metzler. Metzler matrices appear in stability analysis of time-delayed differential equations and positive linear dynamical systems. Their properties can be derived by applying the properties of nonnegative matrices to matrices of the form M + aI, where M is a Metzler matrix. 
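The definition is directly checkable; note that the diagonal entries are unconstrained and may be negative, which is exactly what distinguishes a Metzler matrix from a nonnegative one.

```python
def is_metzler(m):
    """Check the Metzler property: every off-diagonal entry >= 0."""
    return all(m[i][j] >= 0
               for i in range(len(m))
               for j in range(len(m[i]))
               if i != j)

# diagonal entries may be negative; only off-diagonals are constrained
assert is_metzler([[-3.0, 2.0],
                   [0.5, -1.0]])
assert not is_metzler([[1.0, -0.1],
                       [0.0, 1.0]])
```

For a linear system x' = Mx with M Metzler, the nonnegative orthant is invariant, which is why these matrices model positive systems.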
MFCMT  Discriminative Correlation Filter (DCF)-based tracking algorithms exploiting conventional handcrafted features have achieved impressive results in terms of both accuracy and robustness. Template handcrafted features have shown excellent performance, but they perform poorly when the appearance of the target changes rapidly, as with fast motions and fast deformations. In contrast, statistical handcrafted features are insensitive to fast state changes, but they yield inferior performance in scenarios with illumination variations and background clutter. In this work, to achieve an efficient tracking performance, we propose a novel visual tracking algorithm, named MFCMT, based on a complementary ensemble model with multiple features, including Histograms of Oriented Gradients (HOGs), Color Names (CNs) and Color Histograms (CHs). Additionally, to improve tracking results and prevent target drift, we introduce an effective fusion method that exploits relative entropy to coalesce all basic response maps and obtain an optimal response. Furthermore, we suggest a simple but efficient update strategy to boost tracking performance. Comprehensive evaluations conducted on two tracking benchmarks demonstrate that our method is competitive with numerous state-of-the-art trackers. Our tracker achieves impressive performance with faster speed on these benchmarks. 
Micro-Macro Multilevel Modeling  MicroMacroMultilevel 
Microsoft Project Oxford  Set of technologies dubbed Project Oxford that allows developers to create smarter apps, which can do things like recognize faces and interpret natural language even if the app developers are not experts in those fields. “If you are an app developer, you could just take the API capabilities and not worry about the machine learning aspect,” said Vijay Vokkaarne, a principal group program manager with Bing, whose team is working on the speech aspect of Project Oxford. 
MILABOT  We present MILABOT: a deep reinforcement learning chatbot developed by the Montreal Institute for Learning Algorithms (MILA) for the Amazon Alexa Prize competition. MILABOT is capable of conversing with humans on popular small talk topics through both speech and text. The system consists of an ensemble of natural language generation and retrieval models, including neural network and template-based models. By applying reinforcement learning to crowdsourced data and real-world user interactions, the system has been trained to select an appropriate response from the models in its ensemble. The system has been evaluated through A/B testing with real-world users, where it performed significantly better than other systems. The results highlight the potential of coupling ensemble systems with deep reinforcement learning as a fruitful path for developing real-world, open-domain conversational agents. 
Miller-Hagberg Algorithm  We present an efficient algorithm to generate random graphs with a given sequence of expected degrees. Existing algorithms run in $O(N^2)$ time, where N is the number of nodes. We prove that our algorithm runs in $O(N+M)$ expected time, where M is the expected number of edges. If the expected degrees are chosen from a distribution with finite mean, this is $O(N)$ as $N \to \infty$. 
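The $O(N^2)$ baseline being improved upon is the Chung-Lu construction, sketched below for illustration (this is the naive algorithm, not Miller-Hagberg itself, which reaches $O(N+M)$ by sorting the weights and geometrically skipping over unlikely edges).

```python
import random

def chung_lu_graph(weights, seed=0):
    """Naive O(N^2) sampler for a graph with given expected degrees.
    Edge (i, j) is included independently with probability
    p_ij = min(1, w_i * w_j / sum(w)), so node i's expected degree is
    approximately w_i. Shown as the baseline the Miller-Hagberg
    algorithm accelerates."""
    rng = random.Random(seed)
    s = sum(weights)
    edges = []
    for i in range(len(weights)):
        for j in range(i + 1, len(weights)):
            if rng.random() < min(1.0, weights[i] * weights[j] / s):
                edges.append((i, j))
    return edges
```

With 50 nodes of weight 4, each pair appears with probability 0.08, giving roughly 98 expected edges.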
MiMatrix  In this paper, we present a co-designed petascale high-density GPU cluster to expedite distributed deep learning training with synchronous Stochastic Gradient Descent (SSGD). The architecture of our heterogeneous cluster is inspired by the Harvard architecture. According to their different roles in the system, nodes are configured with different specifications. Based on the topology of the whole system's network and the properties of the different types of nodes, we develop and implement a novel job-server parallel software framework, named 'MiMatrix', for distributed deep learning training. Compared to the parameter server framework, in which the parameter server is a bottleneck of data transfer in the AllReduce algorithm of SSGD, the job server undertakes all controlling, scheduling and monitoring tasks without model data transfer. In MiMatrix, we propose a novel GPUDirect Remote Direct Memory Access (RDMA)-aware parallel AllReduce algorithm executed by the computing servers, in which both computation and handshake messages are $O(1)$ at each epoch. 
Min/Max Algorithm  This paper focuses on modeling violent crime rates against population over the years 1960-2014 for the United States via a cubic spline based method. We propose a new min/max algorithm for knot detection and estimation in cubic spline regression. We employ least squares estimation to find potential regression coefficients based upon the cubic spline model and the knots chosen by the min/max algorithm. We then utilize the best subsets regression method to aid in model selection, in which we find the minimum value of the Bayesian Information Criterion. Finally, we report the $R_{adj}^{2}$ as a measure of overall goodness-of-fit of our selected model. Among the fifty states and Washington D.C., we have found 42 out of 51 with an $R_{adj}^{2}$ value greater than $90\%$. We also present an overall model for the United States as a whole. Our method can serve as a unified model for violent crime rates over future years. 
Mined Semantic Analysis (MSA) 
Mined Semantic Analysis (MSA) is a novel distributional semantics approach which employs data mining techniques. MSA embraces knowledge-driven analysis of natural language. It uncovers implicit relations between concepts by mining for their associations in target encyclopedic corpora. MSA exploits not only target corpus content but also its knowledge graph (e.g., the 'See also' link graph of Wikipedia). Empirical results show competitive performance of MSA compared to prior state-of-the-art methods for measuring semantic relatedness on benchmark data sets. Additionally, we introduce the first analytical study to examine the statistical significance of results reported by different semantic relatedness methods. Our study shows that top-performing results could be statistically equivalent though mathematically different. The study positions MSA as one of the state-of-the-art methods for measuring semantic relatedness. 
Mini-Batch AUC Optimization (MBA) 
Area under the receiver operating characteristic curve (AUC) is an important metric for a wide range of signal processing and machine learning problems, and scalable methods for optimizing AUC have recently been proposed. However, handling very large datasets remains an open challenge for this problem. This paper proposes a novel approach to AUC maximization, based on sampling mini-batches of positive/negative instance pairs and computing U-statistics to approximate a global risk minimization problem. The resulting algorithm is simple, fast, and learning-rate free. We show that the number of samples required for good performance is independent of the number of pairs available, which is a quadratic function of the number of positive and negative instances. Extensive experiments show the practical utility of the proposed method. 
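The U-statistic view of AUC can be illustrated directly (a sketch of the pair-sampling estimate only; the MBA paper optimizes a surrogate of this quantity rather than merely estimating it): AUC is the fraction of positive/negative score pairs ranked correctly, and a mini-batch of sampled pairs estimates the same quantity without enumerating all len(pos) * len(neg) pairs.

```python
import random

def auc_u_statistic(pos, neg):
    """Exact AUC as a U-statistic: fraction of (positive, negative)
    score pairs ranked correctly, with ties counting one half."""
    total = sum(1.0 if p > q else 0.5 if p == q else 0.0
                for p in pos for q in neg)
    return total / (len(pos) * len(neg))

def minibatch_auc_estimate(pos, neg, n_pairs=2000, seed=0):
    """Estimate the same U-statistic from sampled pairs (ties ignored
    here for brevity). The number of sampled pairs needed for a good
    estimate does not grow with the quadratic number of pairs."""
    rng = random.Random(seed)
    hits = sum(1.0 if rng.choice(pos) > rng.choice(neg) else 0.0
               for _ in range(n_pairs))
    return hits / n_pairs
```

For perfectly separated scores both the exact statistic and the mini-batch estimate equal 1.0.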
Mini-batch Tempered MCMC (MINT-MCMC) 
In this paper we propose a general framework for performing MCMC with only a mini-batch of data. We show that by estimating the Metropolis-Hastings ratio with only a mini-batch of data, one is essentially sampling from the true posterior raised to a known temperature. We show by experiments that our method, Mini-batch Tempered MCMC (MINT-MCMC), can efficiently explore multiple modes of a posterior distribution. As an application, we demonstrate MINT-MCMC as an inference tool for Bayesian neural networks. We also show that a cyclic version of our algorithm can be applied to build an ensemble of neural networks with little additional training cost. 
Minimal Support Vector Machine (Minimal SVM) 
Support Vector Machine (SVM) is an efficient classification approach, which finds a hyperplane to separate data from different classes. This hyperplane is determined by support vectors. In existing SVM formulations, the objective function uses the L2 norm or L1 norm on slack variables. The number of support vectors is a measure of generalization error. In this work, we propose a Minimal SVM, which uses the L0.5 norm on slack variables. The resulting model further reduces the number of support vectors and increases the classification performance. 
Minimally Sufficient Statistic  In using a statistic to estimate a parameter in a probability distribution, it is important to remember that there can be multiple sufficient statistics for the same parameter. Indeed, the entire data set, X1, …, Xn, can be a sufficient statistic – it certainly contains all of the information that is needed to estimate the parameter. However, using all n variables is not very satisfying as a sufficient statistic, because it doesn’t reduce the information in any meaningful way – and a more compact, concise statistic is better than a complicated, multidimensional statistic. If we can use a lower-dimensional statistic that still contains all necessary information for estimating the parameter, then we have truly reduced our data set without stripping any value from it. 
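A standard textbook illustration (not taken from the entry above): for $X_1,\dots,X_n$ i.i.d. $N(\theta,1)$, the factorization theorem shows the one-dimensional sample mean $T(x)=\bar{x}$ is sufficient, since $\sum_i (x_i-\theta)^2 = \sum_i (x_i-\bar{x})^2 + n(\bar{x}-\theta)^2$ gives

```latex
f(x_1,\dots,x_n;\theta)
= \underbrace{(2\pi)^{-n/2}\exp\Big(-\tfrac12\sum_{i=1}^n (x_i-\bar{x})^2\Big)}_{h(x)}\;
\underbrace{\exp\Big(-\tfrac{n}{2}(\bar{x}-\theta)^2\Big)}_{g(T(x);\theta)}
```

so all information about $\theta$ flows through $\bar{x}$ alone, a far more compact summary than the full sample.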
Minimax Concave Penalty (MCP) 
regnet 
Minimax Regularization  The classical approach to regularization is to design norms enhancing smoothness or sparsity and then to use this norm, or some power of this norm, as a regularization function. The choice of the regularization function (for instance, a power function) in terms of the norm is mostly dictated by computational convenience rather than theoretical considerations. In this work, we design regularization functions that are motivated by theoretical arguments. To that end we introduce a concept of optimal regularization called ‘minimax regularization’ and, as a proof of concept, we show how to construct such a regularization function for the $\ell_1^d$ norm in the random design setup. We develop a similar construction for the deterministic design setup. It appears that the resulting regularized procedures are different from the one used in the LASSO in both setups. 
Minimizing Approximated Information Criteria (MIC) 
coxphMIC 
Minimum Correlation Regularization  In social networks, heterogeneous multimedia data correlate with each other, such as videos and their corresponding tags on YouTube and image-text pairs on Facebook. Nearest neighbor retrieval across multiple modalities on large data sets has become a hot yet challenging problem. Hashing is expected to be an efficient solution, since it represents data as binary codes. As the bitwise XOR operations can be handled fast, the retrieval time is greatly reduced. Few existing multimodal hashing methods consider the correlation among hashing bits. The correlation has a negative impact on hashing codes: when the hashing code length becomes longer, the retrieval performance improvement becomes slower. In this paper, we propose a minimum correlation regularization (MCR) for multimodal hashing. First, the sigmoid function is used to embed the data matrices. Then, the MCR is applied to the output of the sigmoid function. As the output of the sigmoid function approximates a binary code matrix, the proposed MCR can efficiently decorrelate the hashing codes. Experiments show that the superiority of the proposed method becomes greater as the code length increases. 
Minimum Description Length (MDL) 
The minimum description length (MDL) principle is a formalization of Occam’s razor in which the best hypothesis for a given set of data is the one that leads to the best compression of the data. MDL was introduced by Jorma Rissanen in 1978. It is an important concept in information theory and computational learning theory. 
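A toy illustration of the principle (the 8-bit model cost is an arbitrary illustrative choice, not part of the MDL literature): the best hypothesis minimizes model bits plus the $-\log_2$ likelihood of the data under that hypothesis.

```python
import math

def description_length(seq, p_one, model_bits):
    """Total code length in bits: model cost plus -log2 likelihood of the data."""
    data_bits = 0.0
    for bit in seq:
        prob = p_one if bit == 1 else 1.0 - p_one
        data_bits += -math.log2(prob)
    return model_bits + data_bits

# Compare a parameter-free fair-coin hypothesis against a biased-coin
# hypothesis that spends, say, 8 bits encoding its estimated parameter.
seq = [1] * 28 + [0] * 4          # a heavily biased binary sample
fair = description_length(seq, 0.5, model_bits=0)        # 32 bits of data cost
biased = description_length(seq, 28 / 32, model_bits=8)  # cheaper despite model cost
```

Here the biased hypothesis compresses the data enough to pay for its own description, so MDL prefers it; on a near-balanced sequence the fair coin would win.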
Minimum Incremental Coding Length (MICL) 
We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We demonstrate the asymptotic optimality of this criterion for Gaussian distributions and analyze its relationships to classical classifiers. The theoretical results clarify the connections between our approach and popular classifiers such as maximum a posteriori (MAP), regularized discriminant analysis (RDA), $k$-nearest neighbor ($k$-NN), and support vector machine (SVM), as well as unsupervised methods based on lossy coding. Our formulation induces several good effects on the resulting classifier. First, minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small sample setting. Second, compression provides a uniform means of handling classes of varying dimension. The new criterion and its kernel and local versions perform competitively on synthetic examples, as well as on real imagery data such as handwritten digits and face images. On these problems, the performance of our simple classifier approaches the best reported results, without using domain-specific information. 
Minimum Spanning Tree (MST) 
Given a connected, undirected graph, a spanning tree of that graph is a subgraph that is a tree and connects all the vertices together. A single graph can have many different spanning trees. We can also assign a weight to each edge, which is a number representing how unfavorable it is, and use this to assign a weight to a spanning tree by computing the sum of the weights of the edges in that spanning tree. A minimum spanning tree (MST) or minimum weight spanning tree is then a spanning tree with weight less than or equal to the weight of every other spanning tree. More generally, any undirected graph (not necessarily connected) has a minimum spanning forest, which is a union of minimum spanning trees for its connected components. http://…/43mst http://…/t0000021.pdf 
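The MST of a weighted graph can be found greedily; a minimal sketch of Kruskal's algorithm (one standard choice, with a simple union-find):

```python
def kruskal_mst(n, edges):
    """Kruskal's algorithm: n vertices (0..n-1), edges as (weight, u, v) tuples.
    Returns the edges of a minimum spanning tree (or forest if disconnected)."""
    parent = list(range(n))

    def find(x):  # union-find root lookup with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for w, u, v in sorted(edges):      # consider edges in order of weight
        ru, rv = find(u), find(v)
        if ru != rv:                   # adding this edge creates no cycle
            parent[ru] = rv
            mst.append((w, u, v))
    return mst
```

On a 4-vertex graph with edges of weight 1, 2, 3, 4 where the weight-3 edge closes a cycle, the algorithm keeps the other three edges for a total weight of 7.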
Mining High Utility Itemsets using PUN-Lists (MIP) 
In this paper, we propose a novel data structure called the PUN-list, which maintains both the utility information about an itemset and a utility upper bound, to facilitate the processing of mining high utility itemsets. Based on PUN-lists, we present a method, called MIP (Mining high utility Itemsets using PUN-Lists), for fast mining of high utility itemsets. The efficiency of MIP is achieved with three techniques. First, itemsets are represented by a highly condensed data structure, the PUN-list, which avoids costly, repeated utility computation. Second, the utility of an itemset can be efficiently calculated by scanning the PUN-list of the itemset, and the PUN-lists of long itemsets can be quickly constructed from the PUN-lists of short itemsets. Third, by employing the utility upper bound lying in the PUN-lists as the pruning strategy, MIP directly discovers high utility itemsets from the search space, called the set-enumeration tree, without generating numerous candidates. Extensive experiments on various synthetic and real datasets show that the PUN-list is very effective, since MIP is at least an order of magnitude faster than recently reported algorithms on average. 
Minka’s Expectation Propagation  
Minkowski Distance  The Minkowski distance is a metric on Euclidean space which can be considered a generalization of both the Euclidean distance and the Manhattan distance. 
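A minimal sketch of the definition, $d_p(x,y) = \big(\sum_i |x_i - y_i|^p\big)^{1/p}$, which recovers Manhattan distance at $p=1$ and Euclidean distance at $p=2$:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

For the points (0, 0) and (3, 4), the Euclidean (p=2) distance is 5 and the Manhattan (p=1) distance is 7; as p grows, the distance approaches the Chebyshev (max-coordinate) distance.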
Minkowski Weighted K-Means (MWK-Means) 
This paper represents another step in overcoming a drawback of K-Means, its lack of defense against noisy features, using feature weights in the criterion. The Weighted K-Means method by Huang et al. (2008, 2004, 2005) is extended to the corresponding Minkowski metric for measuring distances. Under the Minkowski metric the feature weights become intuitively appealing feature rescaling factors in a conventional K-Means criterion. To see how this can be used in addressing another issue of K-Means, the initial setting, a method to initialize K-Means with anomalous clusters is adapted. The Minkowski metric based method is experimentally validated on datasets from the UCI Machine Learning Repository and generated sets of Gaussian clusters, both as they are and with additional uniform random noise features, and appears to be competitive in comparison with other K-Means based feature weighting algorithms. The problem we are tackling here relates to the fact that K-Means treats all features in a dataset as if they had the same degree of relevance. However, we do know that in most datasets different features will have different degrees of relevance. It is not just a matter of feature selection (in which we say: features a and b are relevant but c isn’t), but of feature weighting. 
Min-Norm Training  In this work, we propose a new training method for finding minimum weight norm solutions in overparameterized neural networks (NNs). This method seeks to improve training speed and generalization performance by framing NN training as a constrained optimization problem wherein the sum of the norm of the weights in each layer of the network is minimized, under the constraint of exactly fitting the training data. It draws inspiration from support vector machines (SVMs), which are able to generalize well despite often having an infinite number of free parameters in their primal form, and from recent theoretical generalization bounds on NNs which suggest that lower norm solutions generalize better. To solve this constrained optimization problem, our method employs Lagrange multipliers that act as integrators of error over training and identify ‘support vector’-like examples. The method can be implemented as a wrapper around gradient-based methods and uses standard backpropagation of gradients from the NN for both regression and classification versions of the algorithm. We provide theoretical justifications for the effectiveness of this algorithm in comparison to early stopping and $L_2$-regularization using simple, analytically tractable settings. In particular, we show faster convergence to the max-margin hyperplane in a shallow network (compared to vanilla gradient descent); faster convergence to the minimum-norm solution in a linear chain (compared to $L_2$-regularization); and initialization-independent generalization performance in a deep linear network. Finally, using the MNIST dataset, we demonstrate that this algorithm can boost test accuracy and identify difficult examples in real-world datasets. 
Min-Max Scaling  An alternative approach to Z-score normalization (or standardization) is the so-called Min-Max scaling (often also simply called “normalization” – a common cause for ambiguities). In this approach, the data is scaled to a fixed range – usually 0 to 1. 
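A minimal sketch of the transformation, $x' = (x - \min)/(\max - \min)$, generalized to an arbitrary target range:

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so min maps to new_min and max maps to new_max."""
    lo, hi = min(values), max(values)
    span = hi - lo  # assumes the values are not all identical
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]
```

For example, [10, 20, 30] scales to [0.0, 0.5, 1.0]. In practice the min and max are computed on the training data only and reused on test data, so test values can fall slightly outside the target range.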
MINT  We propose a test of independence of two multivariate random vectors, given a sample from the underlying population. Our approach, which we call MINT, is based on the estimation of mutual information, whose decomposition into joint and marginal entropies facilitates the use of recently developed efficient entropy estimators derived from nearest neighbour distances. The proposed critical values, which may be obtained from simulation (in the case where one marginal is known) or resampling, guarantee that the test has nominal size, and we provide local power analyses, uniformly over classes of densities whose mutual information satisfies a lower bound. Our ideas may be extended to provide new goodness-of-fit tests of normal linear models based on assessing the independence of our vector of covariates and an appropriately defined notion of an error vector. The theory is supported by numerical studies on both simulated and real data. 
Min-Wise Independent Permutations Locality Sensitive Hashing Scheme (MinHash) 
In computer science, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder (1997), and initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets of words. 
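A toy sketch of the idea (the linear hash family and parameters here are illustrative choices, not from Broder's paper): keep, for each of many random hash functions, the minimum hash value over a set; the fraction of positions where two signatures agree estimates the Jaccard similarity of the sets.

```python
import random

def minhash_signature(items, num_hashes=128, seed=0):
    """MinHash signature: minimum value of each random hash function over the set."""
    rng = random.Random(seed)
    prime = (1 << 61) - 1
    # A simple illustrative hash family: h(x) = (a*hash(x) + b) mod prime.
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % prime for x in items) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)
```

The two sets must be hashed with the same seed so the signatures are comparable; more hash functions give a lower-variance estimate.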
Missing View Imputation with Generative Adversarial Networks (VIGAN) 
In an era where big data is becoming the norm, we are becoming less concerned with the quantity of the data for our models than with its quality. With such large amounts of data collected from multiple heterogeneous sources come associated problems, often missing views. As most models cannot handle the whole-view-missing problem, it presents a significant challenge when conducting any multi-view analysis, especially in the context of very large and heterogeneous datasets. However, if dealt with properly, joint learning from these complementary sources can be advantageous. In this work, we present a method for imputing missing views based on generative adversarial networks, called VIGAN, which combines cross-domain relations given unpaired data with multi-view relations given paired data. In our model, VIGAN first learns a bidirectional mapping between view X and view Y using a cycle-consistent adversarial network. Moreover, we incorporate a denoising multimodal autoencoder to refine the initial approximation by making use of the joint representation. Empirical results give evidence indicating VIGAN offers competitive results compared to other methods on both numeric and image data. 
Mix and Match (M&M) 
We introduce Mix&Match (M&M) – a training framework designed to facilitate rapid and effective learning in RL agents, especially those that would be too slow or too challenging to train otherwise. The key innovation is a procedure that allows us to automatically form a curriculum over agents. Through such a curriculum we can progressively train more complex agents by, effectively, bootstrapping from solutions found by simpler agents. In contradistinction to typical curriculum learning approaches, we do not gradually modify the tasks or environments presented, but instead use a process to gradually alter how the policy is represented internally. We show the broad applicability of our method by demonstrating significant performance gains in three different experimental setups: (1) We train an agent able to control more than 700 actions in a challenging 3D first-person task; using our method to progress through an action-space curriculum, we achieve both faster training and better final performance than one obtains using traditional methods. (2) We further show that M&M can be used successfully to progress through a curriculum of architectural variants defining an agent’s internal state. (3) Finally, we illustrate how a variant of our method can be used to improve agent performance in a multi-task setting. 
MIXed data Multilevel Anomaly Detection (MIXMAD) 
Anomalies are those deviating from the norm. Unsupervised anomaly detection often translates to identifying low-density regions. Major problems arise when data is high-dimensional and a mix of discrete and continuous attributes. We propose MIXMAD, which stands for MIXed data Multilevel Anomaly Detection, an ensemble method that estimates the sparse regions across multiple levels of abstraction of mixed data. The hypothesis is that, in domains where multiple data abstractions exist, a data point may be anomalous with respect to the raw representation or to more abstract representations. To this end, our method sequentially constructs an ensemble of Deep Belief Nets (DBNs) with varying depths. Each DBN is an energy-based detector at a predefined abstraction level. At the bottom level of each DBN, there is a Mixed-variate Restricted Boltzmann Machine that models the density of mixed data. Predictions across the ensemble are finally combined via rank aggregation. The proposed MIXMAD is evaluated on high-dimensional real-world datasets of different characteristics. The results demonstrate that, for anomaly detection, (a) multilevel abstraction of high-dimensional and mixed data is a sensible strategy, and (b) empirically, MIXMAD is superior to popular unsupervised detection methods for both homogeneous and mixed data. 
Mixed Markov Models (MMM) 
Markov random fields can encode complex probabilistic relationships involving multiple variables and admit efficient procedures for probabilistic inference. However, from a knowledge engineering point of view, these models suffer from a serious limitation. The graph of a Markov field must connect all pairs of variables that are conditionally dependent even for a single choice of values of the other variables. This makes it hard to encode interactions that occur only in a certain context and are absent in all others. Furthermore, the requirement that two variables be connected unless always conditionally independent may lead to excessively dense graphs, obscuring the independencies present among the variables and leading to computationally prohibitive inference algorithms. Mumford proposed an alternative modeling framework where the graph need not be rigid and completely determined a priori. Mixed Markov models contain node-valued random variables that, when instantiated, augment the graph by a set of transient edges. A single joint probability distribution relates the values of regular and node-valued variables. In this article, we study the analytical and computational properties of mixed Markov models. In particular, we show that positive mixed models have a local Markov property that is equivalent to their global factorization. We also describe a computationally efficient procedure for answering probabilistic queries in mixed Markov models. 
Mixed Membership Models (MMM) 
… We have reviewed and seen mixture models in detail. And we’ve seen hierarchical models – particularly those that capture nested structure in the data. 1. We will now combine these ideas to form mixed membership models, which is a powerful modeling methodology. 2. The basic ideas are: • Data are grouped. • Each group is modeled with a mixture. • The mixture components are shared across all the groups. • The mixture proportions vary from group to group. … mixedMem 
Mixed Neighbourhood Selection (MNS) 
MNS 
Mixed-Data Sampling (MIDAS) 
Mixed-data sampling (MIDAS) is an econometric regression or filtering method developed by Ghysels et al. The regression models can be viewed in some cases as substitutes for the Kalman filter when applied in the context of mixed frequency data. Bai, Ghysels and Wright (2010) examine the relationship between MIDAS regressions and Kalman filter state space models applied to mixed frequency data. In general, the latter involve a system of equations, whereas in contrast MIDAS regressions involve a (reduced form) single equation. As a consequence, MIDAS regressions might be less efficient, but also less prone to specification errors. In cases where the MIDAS regression is only an approximation, the approximation errors tend to be small. 
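A common ingredient of MIDAS regressions (a sketch of one standard weighting scheme, not Ghysels et al.'s code) is the exponential Almon lag polynomial, which maps two parameters to a full set of normalized weights over the high-frequency lags, so one low-frequency regressor can be built from many high-frequency observations:

```python
import math

def exp_almon_weights(n_lags, theta1, theta2):
    """Exponential Almon lag weights: w_k proportional to exp(theta1*k + theta2*k^2),
    normalized to sum to 1 over the high-frequency lags."""
    raw = [math.exp(theta1 * k + theta2 * k * k) for k in range(1, n_lags + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def midas_aggregate(high_freq_values, theta1, theta2):
    """Aggregate one low-frequency regressor from high-frequency lagged values."""
    w = exp_almon_weights(len(high_freq_values), theta1, theta2)
    return sum(wi * xi for wi, xi in zip(w, high_freq_values))
```

With theta1 = theta2 = 0 the scheme reduces to an equal-weighted average; a negative theta2 makes recent lags count more, and the two thetas are estimated jointly with the regression coefficients.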
Mixture Density Network  The core idea is to have a neural net that predicts an entire (and possibly complex) distribution. In this example we’re predicting a mixture-of-Gaussians distribution via its sufficient statistics. This means that the network knows what it doesn’t know: it will predict diffuse distributions in situations where the target variable is very noisy, and it will predict a much more peaky distribution in nearly deterministic parts. 
Mixture Likelihood Ratio Test  We explore the fundamental limits of heterogeneous distributed detection in an anonymous sensor network with n sensors and a single fusion center. The fusion center collects the single observation from each of the n sensors to detect a binary parameter. The sensors are clustered into multiple groups, and different groups follow different distributions under a given hypothesis. The key challenge for the fusion center is the anonymity of sensors — although it knows the exact number of sensors and the distribution of observations in each group, it does not know which group each sensor belongs to. It is hence natural to consider it as a composite hypothesis testing problem. First, we propose an optimal test called the mixture likelihood ratio test, which is a randomized threshold test based on the ratio of the uniform mixture of all the possible distributions under one hypothesis to that under the other hypothesis. Optimality is shown by first arguing that there exists an optimal test that is symmetric, that is, it does not depend on the order of observations across the sensors, and then proving that the mixture likelihood ratio test is optimal among all symmetric tests. Second, we focus on the Neyman-Pearson setting and characterize the error exponent of the worst-case type-II error probability as n tends to infinity, assuming the number of sensors in each group is proportional to n. Finally, we generalize our result to find the collection of all achievable type-I and type-II error exponents, showing that the boundary of the region can be obtained by solving a convex optimization problem. Our results elucidate the price of anonymity in heterogeneous distributed detection. The results are also applied to distributed detection under Byzantine attacks, which hints that the conventional approach based on simple hypothesis testing might be too pessimistic. 
Mixture Model (MM) 
In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the subpopulation to which an individual observation belongs. Formally a mixture model corresponds to the mixture distribution that represents the probability distribution of observations in the overall population. However, while problems associated with “mixture distributions” relate to deriving the properties of the overall population from those of the subpopulations, “mixture models” are used to make statistical inferences about the properties of the subpopulations given only observations on the pooled population, without subpopulation identity information. 
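Concretely, a finite mixture density is a weighted sum of component densities, with non-negative weights summing to 1. A minimal sketch for a one-dimensional Gaussian mixture:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """Finite Gaussian mixture density: weighted sum of component densities.
    The weights are the mixture proportions and must sum to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))
```

Because the weights sum to 1 and each component integrates to 1, the mixture is itself a valid density; fitting the weights and component parameters from pooled data (typically via EM) is the "mixture model" inference problem the entry describes.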
Mixture of Experts (MoE) 
Mixture of experts refers to a machine learning technique where multiple experts (learners) are used to divide the problem space into homogeneous regions. An example from the computer vision domain is combining a neural network model for human detection with another for pose estimation. If the output is conditioned on multiple levels of probabilistic gating functions, the mixture is called a hierarchical mixture of experts. A gating network decides which expert to use for each input region. Learning thus consists of 1) learning the parameters of individual learners and 2) learning the parameters of the gating network. Globally Consistent Algorithms for Mixture of Experts 
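The gating mechanism can be sketched as follows (a toy illustration with hypothetical experts, not a specific published architecture): the gate scores each expert for the input, the scores are softmax-normalized, and the final prediction is the weighted combination of expert outputs.

```python
import math

def softmax(zs):
    """Numerically stable softmax over a list of scores."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def mixture_of_experts(x, experts, gate_scores):
    """Combine expert predictions with softmax gating weights.
    experts: list of callables; gate_scores: callable returning one score per expert."""
    weights = softmax(gate_scores(x))
    return sum(w * expert(x) for w, expert in zip(weights, experts))
```

For example, with one expert specialized on negative inputs and one on positive inputs, a gate that scores by the sign of x routes each input almost entirely to the appropriate expert; training adjusts both the experts and the gate jointly.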
MLJAR  MLJAR is a platform for rapid prototyping, development and deploying pattern recognition algorithms. It works with many data types – basically all data are arrays 🙂 mljar 
MLPerf  The MLPerf effort aims to build a common set of benchmarks that enables the machine learning (ML) field to measure system performance for both training and inference from mobile devices to cloud services. We believe that a widely accepted benchmark suite will benefit the entire community, including researchers, developers, builders of machine learning frameworks, cloud service providers, hardware manufacturers, application providers, and end users. 
MNIST Database (MNIST) 
The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. 
mobile Question Answering (mQA) 
In this paper, we present a novel proposal for Question Answering through mobile devices. Thus, an architecture for a mobile Question Answering system based on WAP technologies is deployed. The proposed architecture moves the issue of Question Answering to the context of mobility. This paradigm ensures that QA is seen as an activity that provides entertainment and pleasure; this characteristic gives QA an added value. Furthermore, the method for answering definition questions is very precise: it could answer almost 90% of the questions, and it never replies with wrong or unsupported answers. Considering that the mobile phone has had a boom in the last years and that a lot of people already have mobile telephones (approximately 3.5 billion), we propose an architecture for a new mobile system that makes QA something natural and effective for work in all fields of development. The new mobile technology can thus help us to achieve our prospects of growth. This system provides the user with permanent communication at any time, anywhere, and on any device (PDAs, cell phones, NDS, etc.). 
MobiRNN  In this paper, we explore optimizations to run Recurrent Neural Network (RNN) models locally on mobile devices. RNN models are widely used for Natural Language Processing, Machine Translation, and other tasks. However, existing mobile applications that use RNN models do so on the cloud. To address privacy and efficiency concerns, we show how RNN models can be run locally on mobile devices. Existing work on porting deep learning models to mobile devices focus on Convolution Neural Networks (CNNs) and cannot be applied directly to RNN models. In response, we present MobiRNN, a mobilespecific optimization framework that implements GPU offloading specifically for mobile GPUs. Evaluations using an RNN model for activity recognition shows that MobiRNN does significantly decrease the latency of running RNN models on phones. 
MOCHA  Federated learning poses new statistical and systems challenges in training machine learning models over distributed networks of devices. In this work, we show that multi-task learning is naturally suited to handle the statistical challenges of this setting, and propose a novel systems-aware optimization method, MOCHA, that is robust to practical systems issues. Our method and theory for the first time consider issues of high communication cost, stragglers, and fault tolerance for distributed multi-task learning. The resulting method achieves significant speedups compared to alternatives in the federated setting, as we demonstrate through simulations on real-world federated datasets. 
modAL  modAL is a modular active learning framework for Python, aimed at making active learning research and practice simpler. Its distinguishing features are (i) a clear and modular object-oriented design and (ii) full compatibility with scikit-learn models and workflows. These features make fast prototyping and easy extensibility possible, aiding the development of real-life active learning pipelines and novel algorithms as well. modAL is fully open source, hosted on GitHub at https://…/modAL. To assure code quality, extensive unit tests are provided and continuous integration is applied. In addition, detailed documentation with several tutorials is also available for ease of use. The framework is available on PyPI and distributed under the MIT license. 
Model Average Double Robust (MADR) 
Estimates average treatment effects using model average double robust (MADR) estimation. The MADR estimator is defined as a weighted average of double robust estimators, where each double robust estimator corresponds to a specific choice of the outcome model and the propensity score model. The MADR estimator extends the desirable double robustness property by achieving consistency under the much weaker assumption that either the true propensity score model or the true outcome model be within a specified, possibly large, class of models. madr 
Model Averaging  
Model Based Clustering for Mixed Data (clustMD) 
A model-based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded. 
Model Based Machine Learning (MBML) 
Several decades of research in the field of machine learning have resulted in a multitude of different algorithms for solving a broad range of problems. To tackle a new application, a researcher typically tries to map their problem onto one of these existing methods, often influenced by their familiarity with specific algorithms and by the availability of corresponding software implementations. In this study, we describe an alternative methodology for applying machine learning, in which a bespoke solution is formulated for each new application. The solution is expressed through a compact modelling language, and the corresponding custom machine learning code is then generated automatically. This model-based approach offers several major advantages, including the opportunity to create highly tailored models for specific scenarios, as well as rapid prototyping and comparison of a range of alternative models. Furthermore, newcomers to the field of machine learning do not have to learn about the huge range of traditional methods, but instead can focus their attention on understanding a single modelling environment. In this study, we show how probabilistic graphical models, coupled with efficient inference algorithms, provide a very flexible foundation for model-based machine learning, and we outline a large-scale commercial application of this framework involving tens of millions of users. 
Model Confidence Set (MCS) 
The Model Confidence Set (MCS) procedure was recently developed by Hansen et al. (2011). Hansen’s procedure consists of a sequence of tests which permits the construction of a set of ‘superior’ models, for which the null hypothesis of Equal Predictive Ability (EPA) is not rejected at a certain confidence level. The EPA statistic is calculated for an arbitrary loss function, meaning that we could test models on various aspects, for example point forecasts. MCS 
Model Explanation System (MES) 
We propose a general model explanation system (MES) for “explaining” the output of black box classifiers. In this introduction we use the motivating example of a classifier trained to detect fraud in a credit card transaction history. The key aspect is that we provide explanations applicable to a single prediction, rather than provide an interpretable set of parameters. The labels in the provided examples are usually negative. Hence, we focus on explaining positive predictions (alerts). In many classification applications, but especially in fraud detection, there is an expectation of false positives. Alerts are given to a human analyst before any further action is taken. Analysts often insist on understanding “why” there was an alert, since an opaque alert makes it difficult for them to proceed. Analogous scenarios occur in computer vision, credit risk, spam detection, etc. Furthermore, the MES framework is useful for model criticism. In the world of generative models, practitioners often generate synthetic data from a trained model to get an idea of “what the model is doing”. Our MES framework augments such tools. As an added benefit, MES is applicable to completely non-probabilistic black boxes that only provide hard labels. In Section 3 we use MES to visualize the decisions of a face recognition system. 
Model Management Deep Neural Network (MMdnn) 
MMdnn is a set of tools to help users interoperate among different deep learning frameworks, e.g., model conversion and visualization. It converts models between Caffe, Keras, MXNet, Tensorflow, CNTK, PyTorch and CoreML: a comprehensive, cross-framework solution to convert, visualize and diagnose deep neural network models. The ‘MM’ in MMdnn stands for model management and ‘dnn’ is an acronym for deep neural network. Basically, it converts DNN models trained in one framework into others. The major features include: • Model File Converter: converting DNN models between frameworks • Model Code Snippet Generator: generating training or inference code snippets for frameworks • Model Visualization: visualizing DNN network architecture and parameters for frameworks • Model compatibility testing (ongoing). This project is designed and developed by Microsoft Research (MSR). We also encourage researchers and students to leverage this project to analyze DNN models, and we welcome any new ideas to extend it. 
Model Selection  Model selection is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. However, the task can also involve the design of experiments such that the data collected is well-suited to the problem of model selection. Given candidate models of similar predictive or explanatory power, the simplest model is most likely to be the best choice. Konishi & Kitagawa (2008, p.75) state, ‘The majority of the problems in statistical inference can be considered to be problems related to statistical modeling’. Relatedly, Sir David Cox (2006, p.197) has said, ‘How translation from subject-matter problem to statistical model is done is often the most critical part of an analysis’. 
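A minimal sketch of criterion-based model selection, assuming a Gaussian AIC of the form n·ln(RSS/n) + 2k and invented toy data (any information criterion could be substituted):

```python
import math

def aic(rss, n, k):
    """Gaussian AIC up to an additive constant: n * ln(RSS/n) + 2k."""
    return n * math.log(rss / n) + 2 * k

def fit_line(xs, ys):
    """Closed-form simple linear regression (intercept and slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# toy data: a linear trend plus small alternating noise
xs = list(range(10))
ys = [2.0 * x + 1.0 + (0.1 if x % 2 else -0.1) for x in xs]

# candidate 1: intercept-only model (k = 1 parameter)
ybar = sum(ys) / len(ys)
rss_mean = sum((y - ybar) ** 2 for y in ys)
# candidate 2: straight line (k = 2 parameters)
a, b = fit_line(xs, ys)
rss_line = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

best = "line" if aic(rss_line, len(ys), 2) < aic(rss_mean, len(ys), 1) else "mean"
```

The criterion trades goodness of fit (RSS) against complexity (k), so the line wins here despite costing one extra parameter.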
Model, MetaModel and Anomaly Detection (M3A) 
‘Alice’ is submitting one web search per five minutes, for three hours in a row – is that normal? How can we detect abnormal search behaviors, among Alice and other users? Is there any distinct pattern in Alice’s (or other users’) search behavior? We studied what is probably the largest publicly available query log, containing more than 30 million queries from 0.6 million users. In this paper, we present a novel user- and group-level framework, M3A: Model, MetaModel and Anomaly detection. For each user, we discover and explain a surprising, bimodal pattern of the inter-arrival time (IAT) of landed queries (queries with user click-through). Specifically, the model CamelLog is proposed to describe such an IAT distribution; we then notice the correlations among its parameters at the group level. Thus, we further propose the meta-model MetaClick, to capture and explain the two-dimensional, heavy-tailed distribution of the parameters. Combining CamelLog and MetaClick, the proposed M3A has the following strong points: (1) accurate modeling of the marginal IAT distribution, (2) quantitative interpretations, and (3) anomaly detection. 
Model-Averaged Confidence Intervals  MuMIn 
Model-Averaged Tail Area Wald Confidence Interval (MATA-Wald) 
MATA 
Model-averaged Wald Confidence Intervals  
Model-Based Clustering  Sample observations arise from a distribution that is a mixture of two or more components. Each component is described by a density function and has an associated probability or “weight” in the mixture. In principle, we can adopt any probability model for the components, but typically we will assume that components are p-variate normal distributions. (This does not necessarily mean things are easy: inference is tractable, however.) Thus, the probability model for clustering will often be a mixture of multivariate normal distributions. Each component in the mixture is what we call a cluster. mclust, SelvarMix 
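A minimal sketch of the underlying idea, fitting a two-component one-dimensional normal mixture by EM (illustration only; packages such as mclust handle the general p-variate case, and the data here are simulated):

```python
import math
import random

def em_gmm_1d(data, iters=60):
    """Minimal EM for a two-component 1-D Gaussian mixture."""
    mu = [min(data), max(data)]       # crude initialization
    sigma = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w[k] / (sigma[k] * math.sqrt(2 * math.pi))
                 * math.exp(-(x - mu[k]) ** 2 / (2 * sigma[k] ** 2))
                 for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)
    return mu, sigma, w

random.seed(0)
data = ([random.gauss(0.0, 1.0) for _ in range(200)]
        + [random.gauss(8.0, 1.0) for _ in range(200)])
mu, sigma, w = em_gmm_1d(data)
```

Each fitted component plays the role of one cluster; assigning a point to the component with the larger responsibility gives the clustering.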
Model-Based Pricing (MBP) 
Data analytics using machine learning (ML) has become ubiquitous in science, business intelligence, journalism and many other domains. While a lot of work focuses on reducing the training cost, inference runtime and storage cost of ML models, little work studies how to reduce the cost of data acquisition, which potentially leads to a loss of sellers’ revenue and buyers’ affordability and efficiency. In this paper, we propose a model-based pricing (MBP) framework, which instead of pricing the data, directly prices ML model instances. We first formally describe the desired properties of the MBP framework, with a focus on avoiding arbitrage. Next, we show a concrete realization of the MBP framework via a noise injection approach, which provably satisfies the desired formal properties. Based on the proposed framework, we then provide algorithmic solutions on how the seller can assign prices to models under different market scenarios (such as to maximize revenue). Finally, we conduct extensive experiments, which validate that the MBP framework can provide high revenue to the seller, high affordability to the buyer, and also operate at low runtime cost. 
Model-Based Priors for Model-Free Reinforcement Learning (MBMF) 
Reinforcement Learning is divided into two main paradigms: model-free and model-based. Each of these two paradigms has strengths and limitations, and has been successfully applied to real world domains that are appropriate to its corresponding strengths. In this paper, we present a new approach aimed at bridging the gap between these two paradigms. We aim to take the best of the two paradigms and combine them in an approach that is at the same time data-efficient and cost-savvy. We do so by learning a probabilistic dynamics model and leveraging it as a prior for the intertwined model-free optimization. As a result, our approach can exploit the generality and structure of the dynamics model, but is also capable of ignoring its inevitable inaccuracies, by directly incorporating the evidence provided by the direct observation of the cost. As a proof-of-concept, we demonstrate on simulated tasks that our approach outperforms purely model-based and model-free approaches, as well as the approach of simply switching from a model-based to a model-free setting. 
Model-Based Value Expansion  Recent model-free reinforcement learning algorithms have proposed incorporating learned dynamics models as a source of additional data with the intention of reducing sample complexity. Such methods hold the promise of incorporating imagined data coupled with a notion of model uncertainty to accelerate the learning of continuous control tasks. Unfortunately, they rely on heuristics that limit usage of the dynamics model. We present model-based value expansion, which controls for uncertainty in the model by only allowing imagination to fixed depth. By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, we improve value estimation, which, in turn, reduces the sample complexity of learning. 
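The fixed-depth estimate can be sketched as follows, where `policy`, `dynamics`, `reward` and `v_hat` stand in for the learned components (toy sketch, not the authors' implementation; the constant-reward check is invented):

```python
def value_expansion(s, policy, dynamics, reward, v_hat, H, gamma=0.99):
    """H-step model-based value expansion: imagine H steps with the
    learned dynamics model, then bootstrap with the value function."""
    total, discount = 0.0, 1.0
    for _ in range(H):
        a = policy(s)
        total += discount * reward(s, a)
        s = dynamics(s, a)
        discount *= gamma
    return total + discount * v_hat(s)

# toy check: constant reward 1 and v_hat = 1/(1-gamma) give exactly 1/(1-gamma)
estimate = value_expansion(
    0,
    policy=lambda s: None,          # action is irrelevant in this toy model
    dynamics=lambda s, a: s + 1,
    reward=lambda s, a: 1.0,
    v_hat=lambda s: 1.0 / (1.0 - 0.99),
    H=5,
)
```

Only the first H steps rely on the (possibly inaccurate) model; everything beyond the horizon is absorbed into the bootstrapped value.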
Model-Implied Instrumental Variable (MIIV) 
Model-implied instrumental variables are the observed variables in the model that can serve as instrumental variables in a given equation. 
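For illustration, the classic single-instrument estimate of a slope is a ratio of covariances; MIIVs supply the choice of instrument z within an SEM equation. The simulation below (invented parameters, generic IV logic rather than any MIIV software) shows the instrument recovering a slope that OLS gets wrong under confounding:

```python
import random

def cov(a, b):
    """Sample covariance."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

random.seed(42)
n = 4000
z = [random.gauss(0, 1) for _ in range(n)]          # instrument
u = [random.gauss(0, 1) for _ in range(n)]          # unobserved confounder
x = [zi + ui + random.gauss(0, 0.5) for zi, ui in zip(z, u)]
y = [2.0 * xi + ui + random.gauss(0, 0.5) for xi, ui in zip(x, u)]

beta_ols = cov(x, y) / cov(x, x)   # biased upward by the confounder
beta_iv = cov(z, y) / cov(z, x)    # instrument-based estimate of the slope
```

Because z is correlated with x but not with the confounder u, the IV ratio stays close to the true slope of 2.0.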
Model-Implied Instrumental Variable – Generalized Method of Moments (MIIV-GMM) 
The common maximum likelihood (ML) estimator for structural equation models (SEMs) has optimal asymptotic properties under ideal conditions (e.g., correct structure, no excess kurtosis, etc.) that are rarely met in practice. This paper proposes model-implied instrumental variable – generalized method of moments (MIIV-GMM) estimators for latent variable SEMs that are more robust than ML to violations of both the model structure and distributional assumptions. Under less demanding assumptions, the MIIV-GMM estimators are consistent, asymptotically unbiased, asymptotically normal, and have an asymptotic covariance matrix. They are ‘distribution-free,’ robust to heteroscedasticity, and have overidentification goodness-of-fit J-tests with asymptotic chi-square distributions. In addition, MIIV-GMM estimators are ‘scalable’ in that they can estimate and test the full model or any subset of equations, and hence allow better pinpointing of those parts of the model that fit and do not fit the data. An empirical example illustrates MIIV-GMM estimators. Two simulation studies explore their finite sample properties and find that they perform well across a range of sample sizes. 
Moderated Regression  ➘ “Moderation” pequod 
Moderation  In statistics and regression analysis, moderation occurs when the relationship between two variables depends on a third variable. The third variable is referred to as the moderator variable or simply the moderator. The effect of a moderating variable is characterized statistically as an interaction; that is, a categorical (e.g., sex, race, class) or quantitative (e.g., level of reward) variable that affects the direction and/or strength of the relation between dependent and independent variables. Specifically within a correlational analysis framework, a moderator is a third variable that affects the zero-order correlation between two other variables, or the value of the slope of the dependent variable on the independent variable. In analysis of variance (ANOVA) terms, a basic moderator effect can be represented as an interaction between a focal independent variable and a factor that specifies the appropriate conditions for its operation. pequod 
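A minimal numeric sketch: with a binary moderator, the interaction coefficient is the difference between the x→y slopes in the two moderator groups (coefficients below are invented for illustration):

```python
def slope(xs, ys):
    """OLS slope of y on x (closed form)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

xs = [i / 10 for i in range(50)]
# invented model: y = 1 + 2x + 0.5m + 1.5*x*m, with a binary moderator m
y_m0 = [1 + 2.0 * x for x in xs]            # moderator m = 0: slope 2.0
y_m1 = [1 + 0.5 + 3.5 * x for x in xs]      # moderator m = 1: slope 3.5
interaction = slope(xs, y_m1) - slope(xs, y_m0)   # the moderation effect
```

The recovered interaction of 1.5 is exactly the coefficient on the x·m product term, which is how moderation is usually tested in a single regression.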
Modha-Spangler Clustering  Modha-Spangler clustering uses a brute-force strategy to maximize the cluster separation simultaneously in the continuous and categorical variables. kamila 
Modified Generative Adversarial Network (MSGAN) 
Correcting measured detector-level distributions to particle-level is essential to make data usable outside the experimental collaborations. The term unfolding is used to describe this procedure. A new method of unfolding the data using a modified Generative Adversarial Network (MSGAN) is presented here. Applied to various distributions, it is demonstrated to perform on par with, or better than, currently used methods. 
ModSpace  Mango Solutions have developed a configurable software application to allow statisticians, programmers and analysts to centralise and manage the often-complex statistical knowledge (held in SAS, R, Matlab and other languages, documents, data, images etc). The application was designed to provide a centralised platform for analysts to store, share and reuse complex analytical IP in an approach which helps enforce business and coding standards and promote collaboration and continual improvement within teams. ModSpace has proved especially valuable for teams working in diverse geographic locations as it promotes increased interaction between sites and individuals. The easy-to-use tool contains intuitive searching capabilities, enabling analysts to reuse their code and reduce the duplication of effort. The system also supports quality assurance with the use of audit trails, version control and an archiving functionality, which allows valuable historic information to be accessed without interfering with day-to-day activities. The system can be configured for different coding style templates which promote standards and can identify current/legacy and customer specific standards. Managers are also able to take advantage of the powerful reporting environment which allows them to track usage within their teams, spot trends and identify areas of process improvement. http://…/#sthash.ZGls4IJx.dpuf 
Modular Attention Network (MAttNet) 
In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), two types of attention are utilized: language-based attention that learns the module weights as well as the word/phrase attention that each module should focus on; and visual attention that allows the subject and relationship modules to focus on relevant image components. Module weights combine scores from all three modules dynamically to output an overall score. Experiments show that MAttNet outperforms previous state-of-the-art methods by a large margin on both bounding-box-level and pixel-level comprehension tasks. 
Modular Generative Adversarial Network (ModularGAN) 
Existing methods for multi-domain image-to-image translation (or generation) attempt to directly map an input image (or a random vector) to an image in one of the output domains. However, most existing methods have limited scalability and robustness, since they require building independent models for each pair of domains in question. This leads to two significant shortcomings: (1) the need to train an exponential number of pairwise models, and (2) the inability to leverage data from other domains when training a particular pairwise mapping. Inspired by recent work on module networks, this paper proposes ModularGAN for multi-domain image generation and image-to-image translation. ModularGAN consists of several reusable and composable modules that carry out different functions (e.g., encoding, decoding, transformations). These modules can be trained simultaneously, leveraging data from all domains, and then combined to construct specific GAN networks at test time, according to the specific image translation task. This leads to ModularGAN’s superior flexibility in generating (or translating to) an image in any desired domain. Experimental results demonstrate that our model not only presents compelling perceptual results but also outperforms state-of-the-art methods on multi-domain facial attribute transfer. 
Modular, Optimal Learning Testing Environment (MOLTE) 
We address the relative paucity of empirical testing of learning algorithms (of any type) by introducing a new public-domain, Modular, Optimal Learning Testing Environment (MOLTE) for Bayesian ranking and selection problems, stochastic bandits or sequential experimental design problems. The Matlab-based simulator allows the comparison of a number of learning policies (represented as a series of .m modules) in the context of a wide range of problems (each represented in its own .m module), which makes it easy to add new algorithms and new test problems. State-of-the-art policies and various problem classes are provided in the package. The choice of problems and policies is guided through a spreadsheet-based interface. Different graphical metrics are included. MOLTE is designed to be compatible with parallel computing to scale up from local desktop to clusters and clouds. We offer MOLTE as an easy-to-use tool for the research community that will make it possible to perform much more comprehensive testing, spanning a broader selection of algorithms and test problems. We demonstrate the capabilities of MOLTE through a series of comparisons of policies on a starter library of test problems. We also address the problem of tuning and constructing priors, which has been largely overlooked in the optimal learning literature. We envision MOLTE as a modest spur to provide researchers an easy environment to study interesting questions involved in optimal learning. 
Modularity  Modularity is one measure of the structure of networks or graphs. It was designed to measure the strength of division of a network into modules (also called groups, clusters or communities). Networks with high modularity have dense connections between the nodes within modules but sparse connections between nodes in different modules. Modularity is often used in optimization methods for detecting community structure in networks. However, it has been shown that modularity suffers from a resolution limit and, therefore, is unable to detect small communities. Biological networks, including animal brains, exhibit a high degree of modularity. 
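For a given partition, Newman's modularity can be computed directly from its definition, Q = (1/2m) Σ_ij (A_ij − k_i k_j / 2m) δ(c_i, c_j). The two-triangle graph below is a small invented example:

```python
def modularity(adj, comm):
    """Newman modularity: Q = (1/2m) * sum_ij (A_ij - k_i*k_j/2m) * [c_i == c_j]."""
    n = len(adj)
    k = [sum(row) for row in adj]     # node degrees
    two_m = sum(k)                    # 2m = total degree
    q = sum(adj[i][j] - k[i] * k[j] / two_m
            for i in range(n) for j in range(n)
            if comm[i] == comm[j])
    return q / two_m

# two triangles joined by a single edge, split into their natural communities
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
q = modularity(adj, [0, 0, 0, 1, 1, 1])
```

The natural two-community split of this graph gives Q = 5/14 ≈ 0.357, while a random partition would score near zero.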
Module Graphical Lasso (MGL) 
We propose module graphical lasso (MGL), an aggressive dimensionality reduction and network estimation technique for a high-dimensional Gaussian graphical model (GGM). MGL achieves scalability, interpretability and robustness by exploiting the modularity property of many real-world networks. Variables are organized into tightly coupled modules and a graph structure is estimated to determine the conditional independencies among modules. MGL iteratively learns the module assignment of variables, the latent variables, each corresponding to a module, and the parameters of the GGM of the latent variables. In synthetic data experiments, MGL outperforms the standard graphical lasso and three other methods that incorporate latent variables into GGMs. 
Moment Matching Method  The moment-matching methods are also called Krylov subspace methods, as well as Padé approximation methods. They belong to the projection-based MOR methods. These methods are applicable to non-parametric linear time-invariant systems, often descriptor systems … momentchi2 
Monalytics  To effectively manage large-scale data centers and utility clouds, operators must understand current system and application behaviors. This requires continuous monitoring along with online analysis of the data captured by the monitoring system. As a result, there is a need to move to systems in which both tasks can be performed in an integrated fashion, thereby better able to drive online system management. Coining the term ‘monalytics’ to refer to the combined monitoring and analysis systems used for managing large-scale data center systems, this paper articulates principles for monalytics systems, describes software approaches for implementing them, and provides experimental evaluations justifying principles and implementation approach. Specific technical contributions include consideration of scalability across both ‘space’ and ‘time’, the ability to dynamically deploy and adjust monalytics functionality at multiple levels of abstraction in target systems, and the capability to operate across the range of application to hypervisor layers present in large-scale data center or cloud computing systems. Our monalytics implementation targets virtualized systems and cloud infrastructures, via the integration of its functionality into the Xen hypervisor. 
MongoDB  MongoDB (from humongous) is a cross-platform document-oriented database. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster. Released under a combination of the GNU Affero General Public License and the Apache License, MongoDB is free and open-source software. First developed by the software company 10gen (now MongoDB Inc.) in October 2007 as a component of a planned platform as a service product, the company shifted to an open source development model in 2009, with 10gen offering commercial support and other services. Since then, MongoDB has been adopted as backend software by a number of major websites and services, including Craigslist, eBay, Foursquare, SourceForge, Viacom, and The New York Times among others. As of 2014, MongoDB was the most popular NoSQL database system. 
Monte Carlo Tree Search (MCTS) 
In computer science, Monte Carlo tree search (MCTS) is a heuristic search algorithm for making decisions in certain decision processes, most notably employed in game playing. The leading example of its use is in contemporary computer Go programs, but it is also used in other board games, as well as real-time video games and non-deterministic games such as poker. A Survey of Monte Carlo Tree Search Methods 
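A minimal single-player MCTS sketch (UCT selection, one expansion per iteration, random rollouts; the toy two-step game and its leaf rewards are invented for illustration, and real game-playing MCTS additionally alternates players):

```python
import math
import random

def mcts(root, actions, step, rollout, n_iter=2000, c=1.4):
    """Minimal single-player MCTS: UCT selection, expansion,
    random rollout, and backpropagation of the reward."""
    stats = {}      # state -> (visits, total reward)
    children = {}   # expanded state -> {action: child state}

    def select(s):
        """Descend through expanded nodes, picking children by UCT."""
        path = [s]
        while s in children:
            n_parent = stats[s][0]

            def uct(a):
                v, r = stats.get(children[s][a], (0, 0.0))
                if v == 0:
                    return float("inf")   # try unvisited children first
                return r / v + c * math.sqrt(math.log(n_parent) / v)

            s = children[s][max(children[s], key=uct)]
            path.append(s)
        return path

    for _ in range(n_iter):
        path = select(root)
        leaf = path[-1]
        if actions(leaf):                 # expand non-terminal leaves
            children[leaf] = {a: step(leaf, a) for a in actions(leaf)}
        reward = rollout(leaf)
        for s in path:                    # backpropagate
            v, r = stats.get(s, (0, 0.0))
            stats[s] = (v + 1, r + reward)

    # recommend the most-visited root action
    return max(children[root],
               key=lambda a: stats.get(children[root][a], (0, 0.0))[0])

# invented toy game: two binary choices, reward known only at the leaves
REWARDS = {("L", "L"): 1.0, ("L", "R"): 0.0, ("R", "L"): 0.3, ("R", "R"): 0.4}

def actions(s):
    return [] if len(s) == 2 else ["L", "R"]

def step(s, a):
    return s + (a,)

def rollout(s):
    while len(s) < 2:
        s = s + (random.choice(["L", "R"]),)
    return REWARDS[s]

random.seed(1)
best = mcts((), actions, step, rollout)
```

Although "R" has the better worst case, the search concentrates visits on "L" because its subtree contains the single best leaf (reward 1.0), which UCT exploits once discovered.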
MOOC Replication Framework (MORF) 
The MOOC Replication Framework (MORF) is a novel software system for feature extraction, model training/testing, and evaluation of predictive dropout models in Massive Open Online Courses (MOOCs). MORF makes large-scale replication of complex machine-learned models tractable and accessible for researchers, and enables public research on privacy-protected data. It does so by focusing on the high-level operations of an extract-train-test-evaluate workflow, and enables researchers to encapsulate their implementations in portable, fully reproducible software containers which are executed on data with a known schema. MORF’s workflow allows researchers to use data in analysis without providing them access to the underlying data directly, preserving privacy and data security. During execution, containers are sandboxed for security and data leakage and parallelized for efficiency, allowing researchers to create and test new models rapidly, on large-scale multi-institutional datasets that were previously inaccessible to most researchers. MORF is provided both as a Python API (the MORF Software, for institutions to use on their own MOOC data) and in a platform-as-a-service (PaaS) model with a web API and a high-performance computing environment (the MORF Platform). 
Morpheo  Morpheo is a transparent and secure machine learning platform collecting and analysing large datasets. It aims at building state-of-the-art prediction models in various fields where data are sensitive. Indeed, it offers strong privacy of data and algorithms, by preventing anyone from reading the data, apart from the owner and the chosen algorithms. Computations in Morpheo are orchestrated by a blockchain infrastructure, thus offering total traceability of operations. Morpheo aims at building an attractive economic ecosystem around data prediction by channelling crypto-money from prediction requests to useful data and algorithm providers. Morpheo is designed to handle multiple data sources in a transfer learning approach in order to mutualize knowledge acquired from large datasets for applications with smaller but similar datasets. 
MorphNet  We introduce MorphNet, a single model that combines morphological analysis and disambiguation. Traditionally, analysis of morphologically complex languages has been performed in two stages: (i) a morphological analyzer based on finite-state transducers produces all possible morphological analyses of a word; (ii) a statistical disambiguation model picks the correct analysis based on the context for each word. MorphNet uses a sequence-to-sequence recurrent neural network to combine analysis and disambiguation. We show that when trained with text labeled with correct morphological analyses, MorphNet obtains state-of-the-art or comparable results for nine different datasets in seven different languages. 
Mountain Plot  A mountain plot (or “folded empirical cumulative distribution plot”) is created by computing a percentile for each ranked difference between a new method and a reference method. To get a folded plot, the following transformation is performed for all percentiles above 50: percentile = 100 – percentile. These percentiles are then plotted against the differences between the two methods (Krouwer & Monti, 1995). The mountain plot is a useful complementary plot to the Bland & Altman plot. In particular, the mountain plot offers the following advantages: • It is easier to find the central 95% of the data, even when the data are not Normally distributed. • Different distributions can be compared more easily. mountainplot 
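The folded-percentile computation behind a mountain plot can be sketched directly from the definition (hypothetical helper, not the mountainplot package's interface; plotting is omitted):

```python
def mountain_points(new, ref):
    """Folded empirical CDF coordinates: (difference, folded percentile).
    Percentiles above 50 are folded via percentile = 100 - percentile."""
    diffs = sorted(n - r for n, r in zip(new, ref))
    m = len(diffs)
    pts = []
    for rank, d in enumerate(diffs, start=1):
        pct = 100.0 * rank / m
        pts.append((d, pct if pct <= 50 else 100.0 - pct))
    return pts

# invented toy paired measurements from a "new" and a "reference" method
pts = mountain_points([1, 2, 3, 4, 5], [3, 3, 3, 3, 3])
```

Plotting the resulting (difference, folded percentile) pairs produces the characteristic peak at the median difference, which makes the central 95% of differences easy to read off.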
Moving Average  In statistics, a moving average (rolling average or running average) is a calculation to analyze data points by creating a series of averages of different subsets of the full data set. It is also called a moving mean (MM) or rolling mean and is a type of finite impulse response filter. Variations include: simple, and cumulative, or weighted forms (described below). seismicRoll 
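A minimal sketch of the simple (unweighted) form, averaging each consecutive window of the series:

```python
def moving_average(xs, window):
    """Simple moving average over consecutive length-`window` subsets."""
    return [sum(xs[i:i + window]) / window
            for i in range(len(xs) - window + 1)]
```

The weighted and cumulative variants mentioned above differ only in how the window's points are combined.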
MPACT  Action classification is a widely known and popular task that offers an approach towards video understanding. The absence of an easy-to-use platform containing state-of-the-art (SOTA) models presents an issue for the community. Given that individual research code is not written with an end user in mind and in certain cases code is not released, even for published articles, the importance of a common unified platform capable of delivering results while removing the burden of developing an entire system cannot be overstated. To try and overcome these issues, we develop a TensorFlow-based unified platform to abstract away unnecessary overheads in terms of an end-to-end pipeline setup in order to allow the user to quickly and easily prototype action classification models. With the use of a consistent coding style across different models and seamless data flow between various submodules, the platform lends itself to the quick generation of results on a wide range of SOTA methods across a variety of datasets. All of these features are made possible through the use of fully predefined training and testing blocks built on top of a small but powerful set of modular functions that handle asynchronous data loading, model initialization, metric calculations, saving and loading of checkpoints, and logging of results. The platform is geared towards easily creating models, with the minimum requirement being the definition of a network architecture and preprocessing steps from a large custom selection of layers and preprocessing functions. MPACT currently houses four SOTA activity classification models: I3D, C3D, ResNet50+LSTM and TSN. The classification performance achieved by these models is 43.86% for ResNet50+LSTM on HMDB51, while C3D and TSN achieve 93.66% and 85.25% on UCF101, respectively. 
MPDCompress  Deep neural networks (DNNs) have become the state-of-the-art technique for machine learning tasks in various applications. However, due to their size and computational complexity, large DNNs are not readily deployable on edge devices in real-time. To manage complexity and accelerate computation, network compression techniques based on pruning and quantization have been proposed and shown to be effective in reducing network size. However, such network compression can result in irregular matrix structures that are mismatched with modern hardware-accelerated platforms, such as graphics processing units (GPUs) designed to perform DNN matrix multiplications in a structured (block-based) way. We propose MPDCompress, a DNN compression algorithm based on matrix permutation decomposition via random mask generation. In-training application of the masks molds the synaptic weight connection matrix to a subgraph separation format. Aided by the random permutations, a hardware-desirable block matrix is generated, allowing for a more efficient implementation and compression of the network. To show versatility, we empirically verify MPDCompress on several network models, compression rates, and image datasets. On the LeNet-300-100 model (MNIST dataset), Deep MNIST, and CIFAR10, we achieve 10X network compression with less than 1% accuracy loss compared to non-compressed accuracy performance. On AlexNet for the full ImageNet ILSVRC2012 dataset, we achieve 8X network compression with less than 1% accuracy loss, with top-5 and top-1 accuracies of 79.6% and 56.4%, respectively. Finally, we observe that the algorithm can offer inference speedups across various hardware platforms, with 4X faster operation achieved on several mobile GPUs. 
mQAPViz  Modern digital products and services are instrumental in understanding users' activities and behaviors. In doing so, we have to extract relevant relationships and patterns from extensive data collections efficiently. Data visualization algorithms are essential tools in transforming data into narratives. Unfortunately, very few visualization algorithms can handle a significant amount of data. In this study, we address the visualization of large-scale datasets as a multi-objective optimization problem. We propose mQAPViz, a divide-and-conquer multi-objective optimization algorithm to compute large-scale data visualizations. Our method employs the Multi-Objective Quadratic Assignment Problem (mQAP) as the mathematical foundation to solve the visualization task at hand. The algorithm applies advanced machine learning sampling techniques and efficient data structures to scale to millions of data objects. The divide-and-conquer strategy can efficiently handle millions of objects, which the algorithm allocates onto a layout that allows the visualization of a whole dataset. Experimental results on real-world and large datasets demonstrate that mQAPViz is a competitive alternative to compute large-scale visualizations that we can employ to inform the development and improvement of digital applications. 
MQGrad  One of the most significant bottleneck in training large scale machine learning models on parameter server (PS) is the communication overhead, because it needs to frequently exchange the model gradients between the workers and servers during the training iterations. Gradient quantization has been proposed as an effective approach to reducing the communication volume. One key issue in gradient quantization is setting the number of bits for quantizing the gradients. Small number of bits can significantly reduce the communication overhead while hurts the gradient accuracies, and vise versa. An ideal quantization method would dynamically balance the communication overhead and model accuracy, through adjusting the number bits according to the knowledge learned from the immediate past training iterations. Existing methods, however, quantize the gradients either with fixed number of bits, or with predefined heuristic rules. In this paper we propose a novel adaptive quantization method within the framework of reinforcement learning. The method, referred to as MQGrad, formalizes the selection of quantization bits as actions in a Markov decision process (MDP) where the MDP states records the information collected from the past optimization iterations (e.g., the sequence of the loss function values). During the training iterations of a machine learning algorithm, MQGrad continuously updates the MDP state according to the changes of the loss function. Based on the information, MDP learns to select the optimal actions (number of bits) to quantize the gradients. Experimental results based on a benchmark dataset showed that MQGrad can accelerate the learning of a large scale deep neural network while keeping its prediction accuracies. 
MR3  Recommender systems (RSs) provide an effective way of alleviating the information overload problem by selecting personalized items for different users. Latent factors based collaborative filtering (CF) has become the popular approach for RSs due to its accuracy and scalability. Recently, online social networks and user-generated content provide diverse sources for recommendation beyond ratings. Although social matrix factorization (Social MF) and topic matrix factorization (Topic MF) successfully exploit social relations and item reviews, respectively, both of them ignore some useful information. In this paper, we investigate effective data fusion by combining the aforementioned approaches. First, we propose a novel model, MR3, to jointly model three sources of information (i.e., ratings, item reviews, and social relations) effectively for rating prediction by aligning the latent factors and hidden topics. Second, we incorporate the implicit feedback from ratings into the proposed model to enhance its capability and to demonstrate its flexibility. We achieve more accurate rating prediction on real-life datasets over various state-of-the-art methods. Furthermore, we measure the contribution from each of the three data sources and the impact of implicit feedback from ratings, followed by the sensitivity analysis of hyperparameters. Empirical studies demonstrate the effectiveness and efficacy of our proposed model and its extension. 
MRNetProduct2Vec  E-commerce websites such as Amazon, Alibaba, Flipkart, and Walmart sell billions of products. Machine learning (ML) algorithms involving products are often used to improve the customer experience and increase revenue, e.g., product similarity, recommendation, and price estimation. The products are required to be represented as features before training an ML algorithm. In this paper, we propose an approach called MRNetProduct2Vec for creating generic embeddings of products within an e-commerce ecosystem. We learn a dense and low-dimensional embedding where a diverse set of signals related to a product are explicitly injected into its representation. We train a Discriminative Multi-task Bidirectional Recurrent Neural Network (RNN), where the input is a product title fed through a Bidirectional RNN and, at the output, product labels corresponding to fifteen different tasks are predicted. The task set includes several intrinsic characteristics of a product such as price, weight, size, color, popularity, and material. We evaluate the proposed embedding quantitatively and qualitatively. We demonstrate that it is almost as good as the sparse and extremely high-dimensional TF-IDF representation in spite of having less than 3% of the TF-IDF dimension. We also use a multimodal autoencoder for comparing products from different language-regions and show preliminary yet promising qualitative results. 
MS MARCO  Microsoft Machine Reading Comprehension (MS MARCO) is a new large-scale dataset for reading comprehension and question answering. In MS MARCO, all questions are sampled from real anonymized user queries. The context passages, from which answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human-generated, provided the annotators could summarize an answer from the passages. 
MTNet  ➘ “TNet” 
m-TSNE  Multivariate time series (MTS) have become increasingly common in healthcare domains, where human vital signs and laboratory results are collected for predictive diagnosis. Recently, there have been increasing efforts to visualize healthcare MTS data based on star charts or parallel coordinates. However, such techniques might not be ideal for visualizing a large MTS dataset, since it is difficult to obtain insights or interpretations due to the inherent high dimensionality of MTS. In this paper, we propose ‘m-TSNE’: a simple and novel framework to visualize high-dimensional MTS data by projecting them into a low-dimensional (2D or 3D) space while capturing the underlying data properties. Our framework is easy to use and provides interpretable insights for healthcare professionals to understand MTS data. We evaluate our visualization framework on two real-world datasets and demonstrate that our m-TSNE results show patterns that are easy to understand, while the other methods’ visualizations may have limitations in interpretability. 
M-UCB  Multi-armed bandit (MAB) is a class of online learning problems in which a learning agent aims to maximize its expected cumulative reward while repeatedly selecting arms with unknown reward distributions. In this paper, we consider a scenario in which the arms’ reward distributions may change in a piecewise-stationary fashion at unknown time steps. By connecting change-detection techniques with classic UCB algorithms, we motivate and propose a learning algorithm called M-UCB, which can detect and adapt to changes, for the considered scenario. We also establish an $O(\sqrt{MKT\log T})$ regret bound for M-UCB, where $T$ is the number of time steps, $K$ is the number of arms, and $M$ is the number of stationary segments. Comparison with the best available lower bound shows that M-UCB is nearly optimal in $T$ up to a logarithmic factor. We also compare M-UCB with state-of-the-art algorithms in a numerical experiment based on a public Yahoo! dataset. In this experiment, M-UCB achieves about $50\%$ regret reduction with respect to the best performing state-of-the-art algorithm. 
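In the spirit of this entry, a UCB1 learner can be wrapped with a simple change detector that resets all statistics when the reward distribution appears to shift. The window-based detector, its threshold, and the class below are illustrative simplifications, not the paper’s exact algorithm:

```python
import math

def ucb_index(mean, count, t):
    # UCB1 index: empirical mean plus an exploration bonus
    return mean + math.sqrt(2.0 * math.log(t) / count)

class ChangeDetectingUCB:
    """Toy piecewise-stationary bandit agent: UCB1 arm selection plus a
    window-based change detector that restarts learning after a shift."""

    def __init__(self, n_arms, window=20, threshold=0.5):
        self.n_arms = n_arms
        self.window = window
        self.threshold = threshold
        self.reset()

    def reset(self):
        self.counts = [0] * self.n_arms
        self.means = [0.0] * self.n_arms
        self.history = [[] for _ in range(self.n_arms)]
        self.t = 0

    def select(self):
        self.t += 1
        # play each arm once before trusting the indices
        for a in range(self.n_arms):
            if self.counts[a] == 0:
                return a
        return max(range(self.n_arms),
                   key=lambda a: ucb_index(self.means[a], self.counts[a], self.t))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
        h = self.history[arm]
        h.append(reward)
        if len(h) > self.window:
            h.pop(0)
        # change detection: compare the means of the two window halves
        if len(h) == self.window:
            half = self.window // 2
            if abs(sum(h[:half]) / half - sum(h[half:]) / half) > self.threshold:
                self.reset()  # distribution shift detected -> restart
```

A full M-UCB analysis ties the window size and threshold to the regret bound; here they are free parameters chosen by hand.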
Muller Plot  A Muller plot combines information about the succession of different OTUs (genotypes, phenotypes, species, …) with information about the dynamics of their abundances (populations or frequencies) over time. Muller plots may be used to visualize evolutionary dynamics. They may also be employed in the study of diversity and its dynamics; that is, how diversity emerges and how it changes over time. (An example Muller plot, produced by the MullerPlot package in R, shows the evolutionary dynamics of an artificial community.) They are called Muller plots in honor of Hermann Joseph Muller, who used them to explain his idea of Muller’s ratchet. ggmuller 
Multi-Agent System (MAS) 
A multi-agent system (M.A.S.) is a computerized system composed of multiple interacting intelligent agents within an environment. Multi-agent systems can be used to solve problems that are difficult or impossible for an individual agent or a monolithic system to solve. Intelligence may include some methodic, functional, or procedural approach, algorithmic search, or reinforcement learning. Although there is considerable overlap, a multi-agent system is not always the same as an agent-based model (ABM). The goal of an ABM is to search for explanatory insight into the collective behavior of agents (which don’t necessarily need to be “intelligent”) obeying simple rules, typically in natural systems, rather than to solve specific practical or engineering problems. The terminology of ABM tends to be used more often in the sciences, and MAS in engineering and technology. Topics where multi-agent systems research may deliver an appropriate approach include online trading, disaster response, and modelling social structures. 
Multi-Attribute Utility Theory (MAUT) 
mau 
Multi Expression Programming (MEP) 
In this paper, a new evolutionary paradigm called Multi Expression Programming (MEP), intended for solving computationally difficult problems, is proposed. A new encoding method is designed. MEP individuals are linear entities that encode complex computer programs. In this paper, MEP is used for solving computationally difficult problems such as symbolic regression, game strategy discovery, and heuristic generation. Other exciting applications of MEP are suggested; some of them are currently under development. MEP is compared with Gene Expression Programming (GEP) using a well-known test problem. For the considered problems, MEP performs better than GEP. Evolving TSP heuristics using Multi Expression Programming 
Multi-Advisor Reinforcement Learning  This article deals with a novel branch of Separation of Concerns, called Multi-Advisor Reinforcement Learning (MAd-RL), where a single-agent RL problem is distributed to $n$ learners, called advisors. Each advisor tries to solve the problem with a different focus. Their advice is then communicated to an aggregator, which is in control of the system. For the local training, three off-policy bootstrapping methods are proposed and analysed: local-max bootstraps with the local greedy action, rand-policy bootstraps with respect to the random policy, and agg-policy bootstraps with respect to the aggregator’s greedy policy. MAd-RL is positioned as a generalisation of Reinforcement Learning with Ensemble methods. An experiment is held on a simplified version of the Ms. Pac-Man Atari game. The results confirm the theoretical relative strengths and weaknesses of each method. 
Multi-Agent Path Finding  An explanation of the hot topic ‘multi-agent path finding’. 
Multi-agent Reinforcement Learning (MARL) 
To achieve general intelligence, agents must learn how to interact with others in a shared environment: this is the challenge of multi-agent reinforcement learning (MARL). The simplest form is independent reinforcement learning (InRL), where each agent treats its experience as part of its (non-stationary) environment. Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents 
Multi-agent Soft Q-learning  Policy gradient methods are often applied to reinforcement learning in continuous multi-agent games. These methods perform local search in the joint-action space and, as we show, they are susceptible to a game-theoretic pathology known as relative overgeneralization. To resolve this issue, we propose Multi-agent Soft Q-learning, which can be seen as the analogue of applying Q-learning to continuous controls. We compare our method to MADDPG, a state-of-the-art approach, and show that our method achieves better coordination in multi-agent cooperative tasks, converging to better local optima in the joint action space. 
Multi-Agent Systems (MAS)  ➘ “Multi-Agent System (MAS)” 
Multi-Armed Bandit  In probability theory, the multi-armed bandit problem (sometimes called the K- or N-armed bandit problem) is the problem a gambler faces at a row of slot machines, sometimes known as “one-armed bandits”, when deciding which machines to play, how many times to play each machine, and in which order to play them. When played, each machine provides a random reward from a distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. 
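One classic strategy for this explore/exploit trade-off is ε-greedy: with probability ε pull a random arm, otherwise pull the arm with the best empirical mean. The sketch below uses Bernoulli arms; the arm probabilities, ε, and horizon are illustrative:

```python
import random

def epsilon_greedy(means, counts, epsilon, rng):
    """Pick an arm: explore uniformly with probability epsilon, else exploit."""
    if rng.random() < epsilon or not any(counts):
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda a: means[a])

def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy gambler on Bernoulli arms."""
    rng = random.Random(seed)
    k = len(true_means)
    means, counts = [0.0] * k, [0] * k
    total = 0.0
    for _ in range(steps):
        a = epsilon_greedy(means, counts, epsilon, rng)
        r = 1.0 if rng.random() < true_means[a] else 0.0  # Bernoulli reward
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # incremental mean update
        total += r
    return total, counts
```

Over a long horizon the empirical means converge and the better arm absorbs most of the pulls; more refined strategies (UCB, Thompson sampling) replace the fixed ε with adaptive exploration.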
MultI-class learNing Algorithm for data Streams (MINAS) 
Novelty detection has been presented in the literature as a one-class problem. In this case, new examples are classified as either belonging to the target class or not. The examples not explained by the model are detected as belonging to a class named novelty. However, novelty detection is much more general, especially in data stream scenarios, where the number of classes might be unknown before learning and new classes can appear at any time. In this case, the novelty concept is composed of different classes. This work presents a new algorithm to address novelty detection in multi-class data stream problems, the MINAS algorithm. Moreover, we also present a new experimental methodology to evaluate novelty detection methods in multi-class problems. The data used in the experiments include artificial and real datasets. Experimental results show that MINAS is able to discover novelties in multi-class problems. 
Multicollinearity  In statistics, multicollinearity (also collinearity) is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others. In the case of perfect multicollinearity the predictor matrix is singular and therefore cannot be inverted. Under these circumstances, the ordinary least-squares estimator \hat{\beta} = (X’X)^{-1}X’y does not exist. Note that in statements of the assumptions underlying regression analyses such as ordinary least squares, the phrase ‘no multicollinearity’ is sometimes used to mean the absence of perfect multicollinearity, which is an exact (non-stochastic) linear relation among the regressors. 
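A standard diagnostic for this phenomenon is the variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the others. A minimal numpy sketch (the simulated data and the usual rule-of-thumb cutoffs like 5 or 10 are illustrative):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the design matrix X."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        # regress column j on the remaining columns, with an intercept
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2) if r2 < 1.0 else np.inf)
    return out
```

On data where one column nearly duplicates another, the VIFs of the involved columns explode while independent columns stay near 1.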
Multi-Context Label Embedding (MCLE) 
Label embedding plays an important role in zero-shot learning. Side information such as attributes, semantic text representations, and label hierarchy are commonly used as the label embedding in zero-shot classification tasks. However, the label embedding used in prior works considers either only a single context of the label, or multiple contexts without dependency. Therefore, different contexts of the label may not be well aligned in the embedding space to preserve the relatedness between labels, which results in poor interpretability of the label embedding. In this paper, we propose a Multi-Context Label Embedding (MCLE) approach to incorporate multiple label contexts, e.g., label hierarchy and attributes, within a unified matrix factorization framework. To be specific, we model each single context by a matrix factorization formula and introduce a shared variable to capture the dependency among different contexts. Furthermore, we enforce a sparsity constraint on our multi-context framework to strengthen the interpretability of the learned label embedding. Extensive experiments on two real-world datasets demonstrate the superiority of our MCLE in label description and zero-shot image classification. 
Multi-Dimensional Recurrent Neural Network (MDRNN) 
Some of the properties that make RNNs suitable for one-dimensional sequence learning tasks are also desirable in multi-dimensional domains. This paper introduces multi-dimensional recurrent neural networks (MDRNNs), thereby extending the potential applicability of RNNs to vision, video processing, medical imaging, and many other areas. 
Multidimensional Scaling (MDS) 
Multidimensional scaling (MDS) is a means of visualizing the level of similarity of individual cases of a dataset. It refers to a set of related ordination techniques used in information visualization, in particular to display the information contained in a distance matrix. An MDS algorithm aims to place each object in N-dimensional space such that the between-object distances are preserved as well as possible. Each object is then assigned coordinates in each of the N dimensions. The number of dimensions N of an MDS plot can exceed 2 and is specified a priori. Choosing N=2 optimizes the object locations for a two-dimensional scatterplot. 
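One concrete instance of this idea is classical (Torgerson) MDS, which recovers coordinates from a distance matrix by double-centering and an eigendecomposition; other MDS variants instead minimize a stress function iteratively, so this sketch is one algorithm among several:

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS: coordinates whose pairwise Euclidean
    distances approximate the input distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:n_components]  # largest eigenvalues first
    L = np.sqrt(np.maximum(vals[idx], 0.0))
    return vecs[:, idx] * L                      # scale eigenvectors
```

When D contains exact Euclidean distances of points that already live in N dimensions, the embedding reproduces them perfectly (up to rotation and reflection).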
Multi-Directional Recurrent Neural Network (M-RNN) 
Missing data is a ubiquitous problem. It is especially challenging in medical settings because many streams of measurements are collected at different – and often irregular – times. Accurate estimation of those missing measurements is critical for many reasons, including diagnosis, prognosis and treatment. Existing methods address this estimation problem by interpolating within data streams or imputing across data streams (both of which ignore important information), or by ignoring the temporal aspect of the data and imposing strong assumptions about the nature of the data-generating process and/or the pattern of missing data (both of which are especially problematic for medical data). We propose a new approach, based on a novel deep learning architecture that we call a Multi-directional Recurrent Neural Network (M-RNN), that interpolates within data streams and imputes across data streams. We demonstrate the power of our approach by applying it to five real-world medical datasets. We show that it provides dramatically improved estimation of missing measurements in comparison to 11 state-of-the-art benchmarks (including Spline and Cubic Interpolations, MICE, MissForest, matrix completion and several RNN methods); typical improvements in Root Mean Square Error are between 35% and 50%. Additional experiments based on the same five datasets demonstrate that the improvements provided by our method are extremely robust. 
Multifractal Detrended Fluctuation Analysis (MFDFA) 
Fractal structures are found in biomedical time series from a wide range of physiological phenomena. The multifractal spectrum identifies the deviations in fractal structure within time periods with large and small fluctuations. MFDFA 
Multi-Function Recurrent Units (MuFuRU) 
Recurrent neural networks such as the GRU and LSTM have found wide adoption in natural language processing and achieve state-of-the-art results for many tasks. These models are characterized by a memory state that can be written to and read from by applying gated composition operations to the current input and the previous state. However, they cover only a small subset of potentially useful compositions. We propose Multi-Function Recurrent Units (MuFuRUs) that allow for arbitrary differentiable functions as composition operations. Furthermore, MuFuRUs allow for an input- and state-dependent choice of these composition operations that is learned. Our experiments demonstrate that the additional functionality helps in different sequence modeling tasks, including the evaluation of propositional logic formulae, language modeling, and sentiment analysis. 
Multi-Instance Learning (MIL) 
In machine learning, multiple-instance learning (MIL) is a variation on supervised learning. Instead of receiving a set of instances which are individually labeled, the learner receives a set of labeled bags, each containing many instances. In the simple case of multiple-instance binary classification, a bag may be labeled negative if all the instances in it are negative. On the other hand, a bag is labeled positive if there is at least one instance in it which is positive. From a collection of labeled bags, the learner tries to either (i) induce a concept that will label individual instances correctly or (ii) learn how to label bags without inducing the concept. Take image classification, for example, as in Amores (2013). Given an image, we want to know its target class based on its visual content. For instance, the target class might be ‘beach’, where the image contains both ‘sand’ and ‘water’. In MIL terms, the image is described as a bag X = {X_1, …, X_N}, where each X_i is the feature vector (called an instance) extracted from the corresponding i-th region in the image and N is the total number of regions (instances) partitioning the image. The bag is labeled positive (‘beach’) if it contains both ‘sand’ region instances and ‘water’ region instances. Multiple-instance learning was originally proposed under this name by Dietterich, Lathrop & Lozano-Pérez (1997), but earlier examples of similar research exist, for instance in the work on handwritten digit recognition by Keeler, Rumelhart & Leow (1990). Recent reviews of the MIL literature include Amores (2013), which provides an extensive review and comparative study of the different paradigms, and Foulds & Frank (2010), which provides a thorough review of the different assumptions used by different paradigms in the literature. Examples of where MIL is applied are: • Molecule activity • Predicting binding sites of Calmodulin binding proteins • Predicting function for alternatively spliced isoforms Li, Menon & et al. (2014), Eksi et al. (2013) • Image classification Maron & Ratan (1998) • Text or document categorization Kotzias et al. (2015) • Predicting functional binding sites of MicroRNA targets Bandyopadhyay, Ghosh & et al. (2015) Numerous researchers have worked on adapting classical classification techniques, such as support vector machines or boosting, to work within the context of multiple-instance learning. Multiple Instance Learning: Algorithms and Applications 
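The standard binary MIL assumption described above reduces to a one-liner; the image names and region labels below are invented for illustration:

```python
def bag_label(instance_labels):
    """Standard MIL assumption: a bag is positive iff at least one of its
    instances is positive; it is negative only if all instances are negative."""
    return int(any(instance_labels))

# Toy bags of per-region labels (1 = positive region); names are made up.
bags = {
    "beach_photo":  [0, 1, 1, 0],  # contains positive regions -> positive bag
    "desert_photo": [0, 0, 0, 0],  # all regions negative -> negative bag
}
bag_labels = {name: bag_label(regions) for name, regions in bags.items()}
```

A MIL learner only ever sees `bag_labels`, never the per-region labels, which is exactly what makes the problem harder than ordinary supervised learning.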
Multi-Item Gamma Poisson Shrinker (MGPS) 
MGPS is a disproportionality method that utilizes an empirical Bayesian model to detect the magnitude of drug-event associations in drug safety databases. MGPS calculates adjusted reporting ratios for pairs of drug-event combinations. The adjusted reporting ratio values are termed the EBGM or the “Empirical Bayes Geometric Mean.” EBGM values indicate the strength of the reporting relationship between a particular drug and event pair. openEBGM 
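The quantity that MGPS adjusts is the raw relative reporting ratio: observed reports for a drug-event pair divided by the count expected if drug and event were reported independently. The sketch below computes only that unshrunk ratio (the empirical-Bayes Gamma-Poisson shrinkage that produces the EBGM is not reproduced), and the counts are invented:

```python
def relative_reporting_ratio(n_ij, n_i, n_j, n_total):
    """Observed count for a drug-event pair over the count expected
    under independence: E = n_i * n_j / n_total."""
    expected = n_i * n_j / n_total
    return n_ij / expected

# Hypothetical database: 10,000 reports, 100 mention the drug,
# 200 mention the event, 20 mention both.
rr = relative_reporting_ratio(20, 100, 200, 10_000)
```

MGPS then shrinks such ratios toward 1 with a Gamma-Poisson mixture prior, which damps the huge ratios that arise from small counts.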
Multilabel Feature Selection (MLFS) 
Multilabel feature selection: A comprehensive review and guiding experiments 
Multi-Layer Convolutional Sparse Coding (ML-CSC) 
The recently proposed Multi-Layer Convolutional Sparse Coding (ML-CSC) model, consisting of a cascade of convolutional sparse layers, provides a new interpretation of Convolutional Neural Networks (CNNs). Under this framework, the computation of the forward pass in a CNN is equivalent to a pursuit algorithm aiming to estimate the nested sparse representation vectors – or feature maps – from a given input signal. Despite having served as a pivotal connection between CNNs and sparse modeling, a deeper understanding of the ML-CSC is still lacking: there are no pursuit algorithms that can serve this model exactly, nor are there conditions to guarantee a non-empty model. While one can easily obtain signals that approximately satisfy the ML-CSC constraints, it remains unclear how to simply sample from the model and, more importantly, how one can train the convolutional filters from real data. In this work, we propose a sound pursuit algorithm for the ML-CSC model by adopting a projection approach. We provide new and improved bounds on the stability of the solution of such a pursuit and we analyze different practical alternatives to implement it in practice. We show that the training of the filters is essential to allow for non-trivial signals in the model, and we derive an online algorithm to learn the dictionaries from real data, effectively resulting in cascaded sparse convolutional layers. Last, but not least, we demonstrate the applicability of the ML-CSC model for several applications in an unsupervised setting, providing competitive results. Our work represents a bridge between matrix factorization, sparse dictionary learning and sparse autoencoders, and we analyze these connections in detail. 
Multi-Layer Fast ISTA (ML-FISTA) 
Parsimonious representations in data modeling are ubiquitous and central for processing information. Motivated by the recent Multi-Layer Convolutional Sparse Coding (ML-CSC) model, we herein generalize the traditional Basis Pursuit regression problem to a multi-layer setting, introducing similar sparsity-enforcing penalties at different representation layers in a symbiotic relation between synthesis and analysis sparse priors. We propose and analyze different iterative algorithms to solve this new problem in practice. We prove that the presented multi-layer Iterative Soft Thresholding (ML-ISTA) and multi-layer Fast ISTA (ML-FISTA) converge to the global optimum of our multi-layer formulation at a rate of $\mathcal{O}(1/k)$ and $\mathcal{O}(1/k^2)$, respectively. We further show how these algorithms effectively implement particular recurrent neural networks that generalize feedforward architectures without any increase in the number of parameters. We demonstrate the different architectures resulting from unfolding the iterations of the proposed multi-layer pursuit algorithms, providing a principled way to construct deep recurrent CNNs from feedforward ones. We demonstrate the emerging constructions by training them in an end-to-end manner, consistently improving the performance of classical networks without introducing extra filters or parameters. 
Multi-Layer Iterative Soft Thresholding (ML-ISTA) 
Parsimonious representations in data modeling are ubiquitous and central for processing information. Motivated by the recent Multi-Layer Convolutional Sparse Coding (ML-CSC) model, we herein generalize the traditional Basis Pursuit regression problem to a multi-layer setting, introducing similar sparsity-enforcing penalties at different representation layers in a symbiotic relation between synthesis and analysis sparse priors. We propose and analyze different iterative algorithms to solve this new problem in practice. We prove that the presented multi-layer Iterative Soft Thresholding (ML-ISTA) and multi-layer Fast ISTA (ML-FISTA) converge to the global optimum of our multi-layer formulation at a rate of $\mathcal{O}(1/k)$ and $\mathcal{O}(1/k^2)$, respectively. We further show how these algorithms effectively implement particular recurrent neural networks that generalize feedforward architectures without any increase in the number of parameters. We demonstrate the different architectures resulting from unfolding the iterations of the proposed multi-layer pursuit algorithms, providing a principled way to construct deep recurrent CNNs from feedforward ones. We demonstrate the emerging constructions by training them in an end-to-end manner, consistently improving the performance of classical networks without introducing extra filters or parameters. 
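The single-layer building block that this entry generalizes is the classic ISTA iteration for Basis Pursuit, min_x ½‖Ax − y‖² + λ‖x‖₁: a gradient step followed by the soft-thresholding proximal operator. A minimal numpy sketch (step size from the spectral norm; problem sizes and λ are illustrative):

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ista(A, y, lam, n_iter=500):
    """Iterative Soft Thresholding for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2   # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = soft_threshold(x - (A.T @ (A @ x - y)) / L, lam / L)
    return x
```

ML-ISTA chains such updates through several dictionaries, and FISTA adds a momentum term to upgrade the O(1/k) rate to O(1/k²); neither extension is shown here.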
Multi-Layer K-Means (MLKM) 
Data-target association is an important step in multi-target localization for the intelligent operation of unmanned systems in numerous applications such as search and rescue, traffic management and surveillance. The objective of this paper is to present an innovative data association learning approach named multi-layer K-means (MLKM), based on leveraging the advantages of some existing machine learning approaches, including K-means, K-means++, and deep neural networks. To enable accurate data association from different sensors for efficient target localization, MLKM relies on the clustering capabilities of K-means++ structured in a multi-layer framework with an error correction feature that is motivated by the backpropagation that is well-known in deep learning research. To show the effectiveness of the MLKM method, numerous simulation examples are conducted to compare its performance with K-means, K-means++, and deep neural networks. 
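The K-means++ component that MLKM builds on is a seeding rule: each new center is sampled with probability proportional to its squared distance from the nearest center chosen so far. The 1-D sketch below shows just that rule (the multi-layer structure and error correction of MLKM are not reproduced):

```python
import random

def kmeans_pp_seeds(points, k, rng):
    """k-means++ seeding for 1-D data: sample each new center with
    probability proportional to squared distance to the nearest center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        total = sum(d2)
        r, acc = rng.random() * total, 0.0
        for p, w in zip(points, d2):  # weighted sampling by d2
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```

On well-separated clusters this seeding almost always places the initial centers in different clusters, which is exactly why it improves on uniformly random initialization.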
Multi-Layer Vector Approximate Message Passing (ML-VAMP) 
Deep generative networks provide a powerful tool for modeling complex data in a wide range of applications. In inverse problems that use these networks as generative priors on data, one must often perform inference of the inputs of the networks from the outputs. Inference is also required for sampling during stochastic training of these generative models. This paper considers inference in a deep stochastic neural network where the parameters (e.g., weights, biases and activation functions) are known and the problem is to estimate the values of the input and hidden units from the output. While several approximate algorithms have been proposed for this task, there are few analytic tools that can provide rigorous guarantees on the reconstruction error. This work presents a novel and computationally tractable output-to-input inference method called Multi-Layer Vector Approximate Message Passing (ML-VAMP). The proposed algorithm, derived from expectation propagation, extends earlier AMP methods that are known to achieve the replica predictions for optimality in simple linear inverse problems. Our main contribution shows that the mean-squared error (MSE) of ML-VAMP can be exactly predicted in a certain large system limit (LSL) where the number of layers is fixed and the weight matrices are random and orthogonally invariant with dimensions that grow to infinity. ML-VAMP is thus a principled method for output-to-input inference in deep networks with a rigorous and precise performance achievability result in high dimensions. 
Multilevel Model (MLM) 
Multilevel models (also hierarchical linear models, nested models, mixed models, random-coefficient models, random-effects models, random parameter models, or split-plot designs) are statistical models of parameters that vary at more than one level. These models can be seen as generalizations of linear models (in particular, linear regression), although they can also extend to nonlinear models. These models became much more popular after sufficient computing power and software became available. 
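What "parameters that vary at more than one level" means is easiest to see in the data-generating process of the simplest multilevel model, a random-intercept model: each group draws its own intercept around a grand mean, and observations add within-group noise. The parameter values below are arbitrary:

```python
import random

def simulate_multilevel(n_groups, n_per_group, mu=2.0, tau=1.0, sigma=0.5, seed=0):
    """Two-level (random-intercept) data: group j has intercept
    mu + u_j with u_j ~ N(0, tau^2); observations add e ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    data = {}
    for j in range(n_groups):
        u_j = rng.gauss(0.0, tau)  # group-level deviation from the grand mean
        data[j] = [mu + u_j + rng.gauss(0.0, sigma) for _ in range(n_per_group)]
    return data
```

Fitting such a model (e.g., with a mixed-effects routine) recovers mu, tau, and sigma and partially pools the group means toward the grand mean; the simulation only shows the two variance levels that make that pooling sensible.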
Multilevel Networks Analysis  Described in Lazega et al. (2008) <doi:10.1016/j.socnet.2008.02.001> and in Lazega and Snijders (2016, ISBN:978-3-319-24520-1). multinets 
Multilinear Class-Specific Discriminant Analysis  There has been a great effort to transfer linear discriminant techniques that operate on vector data to high-order data, generally referred to as Multilinear Discriminant Analysis (MDA) techniques. Many existing works focus on maximizing the ratio of inter-class to intra-class variances defined on tensor data representations. However, there has not been any attempt to employ class-specific discrimination criteria for tensor data. In this paper, we propose a multilinear subspace learning technique suitable for applications requiring class-specific tensor models. The method maximizes the discrimination of each individual class in the feature space while retaining the spatial structure of the input. We evaluate the efficiency of the proposed method on two problems, i.e. facial image analysis and stock price prediction based on limit order book data. 
Multilinear Subspace Learning (MSL) 
Multilinear subspace learning (MSL) aims to learn a specific small part of a large space of multidimensional objects having a particular desired property. It is a dimensionality reduction approach for finding a low-dimensional representation with certain preferred characteristics of high-dimensional tensor data through direct mapping, without going through vectorization. The term tensor in MSL refers to multidimensional arrays. Examples of tensor data include images (2D/3D), video sequences (3D/4D), and hyperspectral cubes (3D/4D). The mapping from a high-dimensional tensor space to a low-dimensional tensor space or vector space is called a multilinear projection. MSL methods are higher-order generalizations of linear subspace learning methods such as principal component analysis (PCA), linear discriminant analysis (LDA) and canonical correlation analysis (CCA). In the literature, MSL is also referred to as tensor subspace learning or tensor subspace analysis. Research on MSL has progressed from heuristic exploration in the 2000s to systematic investigation in the 2010s. 
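The multilinear projection mentioned above is a sequence of mode-n (tensor-times-matrix) products, one factor matrix per mode, applied without ever vectorizing the tensor. A numpy sketch with illustrative shapes:

```python
import numpy as np

def mode_n_product(T, M, n):
    """Multiply tensor T by matrix M along mode n (tensor-times-matrix):
    contracts M's columns with T's n-th axis."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, n)), 0, n)

def multilinear_project(T, factors):
    """Project a tensor into a smaller tensor space: one factor matrix
    per mode, applied in sequence, with no vectorization."""
    for n, M in enumerate(factors):
        T = mode_n_product(T, M, n)
    return T
```

With identity factor matrices the tensor is unchanged; with short-and-wide factors each mode shrinks, which is exactly the low-dimensional tensor representation MSL seeks (MSL methods differ in how the factors are learned, which this sketch does not cover).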
Multilingual Question Answering (mQA) 
In this paper, we present the mQA model, which is able to answer questions about the content of an image. The answer can be a sentence, a phrase, or a single word. Our model contains four components: a Long Short-Term Memory (LSTM) network to extract the question representation, a Convolutional Neural Network (CNN) to extract the visual representation, an LSTM for storing the linguistic context in an answer, and a fusing component to combine the information from the first three components and generate the answer. We construct a Freestyle Multilingual Image Question Answering (FM-IQA) dataset to train and evaluate our mQA model. It contains over 120,000 images and 250,000 freestyle Chinese question-answer pairs and their English translations. The quality of the generated answers of our mQA model on this dataset is evaluated by human judges through a Turing Test. Specifically, we mix the answers provided by humans and our model. The human judges need to distinguish our model from the human. They also provide a score (i.e. 0, 1, 2; the larger the better) indicating the quality of the answer. We propose strategies to monitor the quality of this evaluation process. The experiments show that in 64.7% of cases, the human judges cannot distinguish our model from humans. The average score is 1.454 (1.918 for human). 
Multimodal Attribute Extraction  The broad goal of information extraction is to derive structured information from unstructured data. However, most existing methods focus solely on text, ignoring other types of unstructured data such as images, video and audio, which comprise an increasing portion of the information on the web. To address this shortcoming, we propose the task of multimodal attribute extraction. Given a collection of unstructured and semi-structured contextual information about an entity (such as a textual description, or visual depictions), the task is to extract the entity’s underlying attributes. In this paper, we provide a dataset containing mixed-media data for over 2 million product items along with 7 million attribute-value pairs describing the items, which can be used to train attribute extractors in a weakly supervised manner. We provide a variety of baselines which demonstrate the relative effectiveness of the individual modes of information towards solving the task, as well as study human performance. 
Multimodal Dynamic Timetable Model (Multimodal DTM) 
We present multimodal DTM, a new model for multimodal journey planning in public (schedule-based) transport networks. Multimodal DTM constitutes an extension of the dynamic timetable model (DTM), developed originally for unimodal journey planning. Multimodal DTM exhibits a very fast query algorithm, meeting the demand for real-time responses to best-journey queries, and an extremely fast update algorithm for updating the timetable information in case of delays. In particular, an experimental study on real-world metropolitan networks demonstrates that our methods compare favorably with other state-of-the-art approaches when public transport is considered along with traveling that is unrestricted with respect to departure time (walking and electric vehicles). 
Multimodal Intelligent inteRactIon for Autonomous systeMs (MIRIAM) 
We present MIRIAM (Multimodal Intelligent inteRactIon for Autonomous systeMs), a multimodal interface to support situation awareness of autonomous vehicles through chat-based interaction. The user is able to chat about the vehicle’s plan, objectives, previous activities and mission progress. The system is mixed-initiative in that it proactively sends messages about key events, such as fault warnings. We will demonstrate MIRIAM using SeeByte’s SeeTrack command and control interface and Neptune autonomy simulator. 
Multimodal Learning  Information in the real world usually comes in different modalities. For example, images are usually associated with tags and text explanations, and texts contain images to express the main idea of an article more clearly. Different modalities are characterized by very different statistical properties. For instance, images are usually represented as pixel intensities or outputs of feature extractors, while texts are represented as discrete word-count vectors. Because of the distinct statistical properties of different information resources, it is very important to discover the relationship between different modalities. Multimodal learning models the joint representation of different modalities, and a multimodal learning model is also capable of filling in a missing modality given the observed ones. One such model combines two deep Boltzmann machines, each corresponding to one modality, with an additional hidden layer placed on top of the two Boltzmann machines to give the joint representation. 
Multimodal Machine Learning  Our experience of the world is multimodal – we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multidisciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research. 
Multimodal Named Entity Recognition (MNER) 
We introduce a new task called Multimodal Named Entity Recognition (MNER) for noisy user-generated data such as tweets or Snapchat captions, which comprise short text with accompanying images. These social media posts often come in inconsistent or incomplete syntax and lexical notations with very limited surrounding textual contexts, bringing significant challenges for NER. To this end, we create a new dataset for MNER called SnapCaptions (Snapchat image-caption pairs submitted to public and crowdsourced stories with fully annotated named entities). We then build upon the state-of-the-art BiLSTM word/character-based NER models with 1) a deep image network which incorporates relevant visual context to augment textual information, and 2) a generic modality-attention module which learns to attenuate irrelevant modalities while amplifying the most informative ones to extract contexts from, adaptive to each sample and token. The proposed MNER model with modality attention significantly outperforms the state-of-the-art text-only NER models by successfully leveraging provided visual contexts, opening up potential applications of MNER on myriads of social media platforms. 
multimodal sparse Bayesian dictionary learning (MSBDL) 
The purpose of this paper is to address the problem of learning dictionaries for multimodal datasets, i.e. datasets collected from multiple data sources. We present an algorithm called multimodal sparse Bayesian dictionary learning (MSBDL). The MSBDL algorithm is able to leverage information from all available data modalities through a joint sparsity constraint on each modality’s sparse codes without restricting the coefficients themselves to be equal. Our framework offers a considerable amount of flexibility to practitioners and addresses many of the shortcomings of existing multimodal dictionary learning approaches. Unlike existing approaches, MSBDL allows the dictionaries for each data modality to have different cardinality. In addition, MSBDL can be used in numerous scenarios, from small datasets to extensive datasets with large dimensionality. MSBDL can also be used in supervised settings and allows for learning multimodal dictionaries concurrently with classifiers for each modality. 
MultiNet  Representation learning of networks via embeddings has garnered popularity and has witnessed significant progress recently. Such representations have been effectively used for classic network-based machine learning tasks like link prediction, community detection, and network alignment. However, most existing network embedding techniques largely focus on developing distributed representations for traditional flat networks and are unable to capture representations for multilayer networks. Large-scale networks such as social networks and human brain tissue networks, for instance, can be effectively captured in multiple layers. In this work, we propose MultiNet, a fast and scalable embedding technique for multilayer networks. Our work adds a new wrinkle to the recently introduced family of network embeddings like node2vec, LINE, DeepWalk, SIGNet, sub2vec, graph2vec, and OhmNet. We demonstrate the usability of MultiNet by leveraging it to reconstruct the friends and followers network on Twitter using network layers mined from the body of tweets, such as the mentions network and the retweet network. This is a work-in-progress paper and our preliminary contribution to multilayer network embeddings. 
Multinomial Probit Bayesian Additive Regression Trees (MPBART) 
mpbart 
Multi-Objective Deep Reinforcement Learning (MODRL) 
This paper presents a new multi-objective deep reinforcement learning (MODRL) framework based on deep Q-networks. We propose linear and non-linear methods to develop the MODRL framework, which includes both single-policy and multi-policy strategies. The experimental results on a deep sea treasure environment indicate that the proposed approach is able to converge to the optimal Pareto solutions. The proposed framework is generic, which allows implementation of different deep reinforcement learning algorithms in various complex environments. Details of the framework implementation are available at http://…/drl.htm. 
Multi-Objective Optimization  Multi-objective optimization (also known as multi-objective programming, vector optimization, multicriteria optimization, multiattribute optimization or Pareto optimization) is an area of multiple-criteria decision making that is concerned with mathematical optimization problems involving more than one objective function to be optimized simultaneously. Multi-objective optimization has been applied in many fields of science, including engineering, economics and logistics, where optimal decisions need to be taken in the presence of trade-offs between two or more conflicting objectives. Minimizing cost while maximizing comfort while buying a car, and maximizing performance whilst minimizing fuel consumption and emission of pollutants of a vehicle are examples of multi-objective optimization problems involving two and three objectives, respectively. In practical problems, there can be more than three objectives. For a non-trivial multi-objective optimization problem, no single solution exists that simultaneously optimizes each objective. In that case, the objective functions are said to be conflicting, and there exists a (possibly infinite) number of Pareto optimal solutions. A solution is called non-dominated, Pareto optimal, Pareto efficient or noninferior if none of the objective functions can be improved in value without degrading some of the other objective values. Without additional subjective preference information, all Pareto optimal solutions are considered equally good (as vectors cannot be ordered completely). Researchers study multi-objective optimization problems from different viewpoints and, thus, there exist different solution philosophies and goals when setting and solving them. The goal may be to find a representative set of Pareto optimal solutions, and/or quantify the trade-offs in satisfying the different objectives, and/or find a single solution that satisfies the subjective preferences of a human decision maker (DM). 
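The Pareto-dominance definition above translates directly into code. The following minimal sketch (for minimization problems, with objectives represented as plain tuples) filters a finite set of candidate points down to its non-dominated (Pareto optimal) subset:

```python
def dominates(a, b):
    """a dominates b (minimization): a is no worse in every objective
    and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep exactly the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

This brute-force filter is quadratic in the number of points; dedicated multi-objective solvers maintain the front incrementally instead.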
Multi-Objective Programming  Multi-objective optimization (also known as multi-objective programming, vector optimization, multicriteria optimization, multiattribute optimization or Pareto optimization) is an area of multiple-criteria decision making that is concerned with mathematical optimization problems involving more than one objective function to be optimized simultaneously. Multi-objective optimization has been applied in many fields of science, including engineering, economics and logistics (see the section on applications for detailed examples), where optimal decisions need to be taken in the presence of trade-offs between two or more conflicting objectives. Minimizing cost while maximizing comfort while buying a car, and maximizing performance whilst minimizing fuel consumption and emission of pollutants of a vehicle are examples of multi-objective optimization problems involving two and three objectives, respectively. In practical problems, there can be more than three objectives. For a non-trivial multi-objective optimization problem, there does not exist a single solution that simultaneously optimizes each objective. In that case, the objective functions are said to be conflicting, and there exists a (possibly infinite) number of Pareto optimal solutions. A solution is called non-dominated, Pareto optimal, Pareto efficient or noninferior if none of the objective functions can be improved in value without degrading some of the other objective values. Without additional subjective preference information, all Pareto optimal solutions are considered equally good (as vectors cannot be ordered completely). Researchers study multi-objective optimization problems from different viewpoints and, thus, there exist different solution philosophies and goals when setting and solving them. The goal may be to find a representative set of Pareto optimal solutions, and/or quantify the trade-offs in satisfying the different objectives, and/or find a single solution that satisfies the subjective preferences of a human decision maker (DM). 
Multi-Parameter Regression (MPR) 
mpr 
Multiple Correspondence Analysis (MCA) 
In statistics, multiple correspondence analysis (MCA) is a data analysis technique for nominal categorical data, used to detect and represent underlying structures in a data set. It does this by representing data as points in a low-dimensional Euclidean space. The procedure thus appears to be the counterpart of principal component analysis for categorical data. MCA is an extension of simple correspondence analysis (CA) in that it is applicable to a large set of categorical variables. GDAtools 
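The "counterpart of PCA for categorical data" idea can be sketched concretely: one-hot encode each categorical variable into a complete indicator matrix and run correspondence analysis (an SVD of standardized residuals) on it. This minimal NumPy sketch computes principal row coordinates only; a full MCA implementation (e.g. GDAtools or FactoMineR in R) would also handle the Burt-matrix variant, inertia corrections, and column coordinates:

```python
import numpy as np

def mca_row_coordinates(codes, n_components=2):
    """codes: 2-D integer array (n observations x q categorical variables),
    each column holding category codes. Returns principal row coordinates."""
    n, q = codes.shape
    # one-hot encode each variable and stack the blocks: the indicator matrix
    blocks = []
    for j in range(q):
        levels = np.unique(codes[:, j])
        blocks.append((codes[:, j][:, None] == levels[None, :]).astype(float))
    Z = np.hstack(blocks)
    P = Z / Z.sum()                          # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)      # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    # principal row coordinates: D_r^{-1/2} U diag(s)
    return (U * s)[:, :n_components] / np.sqrt(r)[:, None]
```

Observations with similar category profiles end up close together in the resulting low-dimensional Euclidean space, mirroring what PCA does for quantitative data.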
Multiple Criteria Decision Making (MCDM) 
Multiple-criteria decision-making (MCDM) or multiple-criteria decision analysis (MCDA) is a sub-discipline of operations research that explicitly considers multiple criteria in decision-making environments. Whether in our daily lives or in professional settings, there are typically multiple conflicting criteria that need to be evaluated in making decisions. Cost or price is usually one of the main criteria. Some measure of quality is typically another criterion that is in conflict with the cost. In purchasing a car, cost, comfort, safety, and fuel economy may be some of the main criteria we consider. It is unusual that the cheapest car is the most comfortable and the safest one. In portfolio management, we are interested in getting high returns while at the same time reducing our risks. Again, the stocks that have the potential of bringing high returns typically also carry high risks of losing money. In a service industry, customer satisfaction and the cost of providing service are two conflicting criteria that would be useful to consider. In our daily lives, we usually weigh multiple criteria implicitly, and we may be comfortable with the consequences of such decisions that are made based only on intuition. On the other hand, when stakes are high, it is important to properly structure the problem and explicitly evaluate multiple criteria. In making the decision of whether to build a nuclear power plant or not, and where to build it, there are not only very complex issues involving multiple criteria, but there are also multiple parties who are deeply affected by the consequences. Structuring complex problems well and considering multiple criteria explicitly leads to more informed and better decisions. There have been important advances in this field since the start of the modern multiple-criteria decision-making discipline in the early 1960s. 
A variety of approaches and methods, many implemented by specialized decision-making software, have been developed for their application in an array of disciplines, ranging from politics and business to the environment and energy. 
Multiple Factor Analysis (MFA) 
Multiple factor analysis (MFA) is a factorial method devoted to the study of tables in which a group of individuals is described by a set of variables (quantitative and/or qualitative) structured in groups. It may be seen as an extension of: • Principal component analysis (PCA) when variables are quantitative, • Multiple correspondence analysis (MCA) when variables are qualitative, • Factor analysis of mixed data (FAMD) when the active variables belong to the two types. FactoMineR, MFAg 
Multiple Instance Learning (MIL) 
Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and allows leveraging weakly labeled data. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas is described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight on how the problem characteristics affect MIL algorithms, recommendations for future benchmarking, and promising avenues for research. 
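The bag/instance relationship is easiest to see under the standard MIL assumption: a bag is positive if and only if it contains at least one positive instance. A minimal sketch (assuming some instance-level scorer already produces a score per instance) is then just max pooling over each bag:

```python
def predict_bags(instance_scores_per_bag, threshold=0.5):
    """Standard MIL assumption: a bag is positive iff at least one of its
    instances is positive, so the bag score is the maximum instance score
    (max pooling over the bag), compared against a decision threshold."""
    return [int(max(scores) >= threshold) for scores in instance_scores_per_bag]
```

Many of the problem characteristics the survey catalogues (bag composition, label ambiguity) amount to this assumption failing, which is why other aggregation rules and bag-level representations exist.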
Multiple Response Permutation Procedure (MRPP) 
Multiple Response Permutation Procedure (MRPP) provides a test of whether there is a significant difference between two or more groups of sampling units. vegan, Blossom 
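The procedure can be sketched in a few lines: compute a group-size-weighted mean of within-group average pairwise distances (the delta statistic), then judge its significance by permuting the group labels. This sketch uses Euclidean distance and one common weighting choice; implementations such as vegan's `mrpp` offer several distance and weighting options:

```python
import itertools
import random
import numpy as np

def mrpp(groups, n_perm=99, seed=0):
    """groups: list of 2-D arrays, one per group (>= 2 rows each).
    Returns (observed delta, permutation p-value)."""
    rng = random.Random(seed)
    data = np.vstack(groups)
    sizes = [len(g) for g in groups]
    n = len(data)

    def delta(labels):
        labels = np.array(labels)
        total = 0.0
        for k in range(len(sizes)):
            pts = data[labels == k]
            d = [np.linalg.norm(a - b)
                 for a, b in itertools.combinations(pts, 2)]
            total += (len(pts) / n) * (sum(d) / len(d))
        return total

    obs_labels = [k for k, s in enumerate(sizes) for _ in range(s)]
    obs = delta(obs_labels)
    # small deltas indicate tight groups, so count permutations at least as tight
    count = sum(delta(rng.sample(obs_labels, n)) <= obs for _ in range(n_perm))
    return obs, (count + 1) / (n_perm + 1)
```

A small observed delta relative to the permutation distribution means the grouping explains more clustering than chance.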
Multiple-Criteria Decision Analysis (MCDA) 

Multiple-Output Regression  Predicting multivariate responses in multiple linear regression. See also: Multi-output Decision Tree Regression, Multiple Output Regression 
Multiplicative Integration (MI) 
We introduce a general and simple structural design called Multiplicative Integration (MI) to improve recurrent neural networks (RNNs). MI changes the way in which information from different sources flows and is integrated in the computational building block of an RNN, while introducing almost no extra parameters. The new structure can be easily embedded into many popular RNN models, including LSTMs and GRUs. We empirically analyze its learning behaviour and conduct evaluations on several tasks using different RNN models. Our experimental results demonstrate that Multiplicative Integration can provide a substantial performance boost over many of the existing RNN models. 
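The abstract does not spell out the formula, but the idea is to replace the additive pre-activation phi(Wx + Uh + b) of a recurrent block with a form built around the Hadamard product of the two information flows, with a few extra gating vectors (alpha, beta1, beta2 here). A minimal vanilla-RNN step in this style, as a NumPy sketch:

```python
import numpy as np

def mi_rnn_step(x, h, Wx, Uh, alpha, beta1, beta2, b):
    """One vanilla-RNN step with a Multiplicative-Integration-style block:
    tanh(alpha * (Wx@x) * (Uh@h) + beta1 * (Wx@x) + beta2 * (Uh@h) + b)
    instead of the usual additive tanh(Wx@x + Uh@h + b).
    alpha/beta1/beta2 are per-unit vectors (or scalars), so the change
    adds almost no parameters."""
    a, c = Wx @ x, Uh @ h
    return np.tanh(alpha * a * c + beta1 * a + beta2 * c + b)
```

Setting alpha = 0 and beta1 = beta2 = 1 recovers the ordinary additive block, which makes the design easy to drop into existing LSTM/GRU cells.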
Multipolar Analytics  The layer-cake best-practice model of analytics (operational systems and external data feeding data marts and a data warehouse, with BI tools as the cherry on the top) is rapidly becoming obsolete. It’s being replaced by a new, multipolar model where data is collected and analyzed in multiple places, according to the type of data and analysis required: • New HTAP systems (traditional operational data and real-time analytics) • Traditional data warehouses (finance, budgets, corporate KPIs, etc.) • Hadoop/Spark (sensor and poly-structured data, long-term storage and analysis) • Standalone BI systems (personal and departmental analytics, including spreadsheets) 
Multi-Range Reasoning Unit (MRU) 
We propose MRU (Multi-Range Reasoning Units), a new fast compositional encoder for machine comprehension (MC). Our proposed MRU encoders are characterized by multi-ranged gating, executing a series of parameterized contract-and-expand layers for learning gating vectors that benefit from long- and short-term dependencies. The aims of our approach are as follows: (1) learning representations that are concurrently aware of long- and short-term context, (2) modeling relationships between intra-document blocks, and (3) fast and efficient sequence encoding. We show that our proposed encoder demonstrates promising results both as a standalone encoder and as a complementary building block. We conduct extensive experiments on three challenging MC datasets, namely RACE, SearchQA and NarrativeQA, achieving highly competitive performance on all. On the RACE benchmark, our model outperforms DFN (Dynamic Fusion Networks) by 1.5%–6% without using any recurrent or convolution layers. Similarly, we achieve competitive performance relative to AMANDA on the SearchQA benchmark and to BiDAF on the NarrativeQA benchmark without using any LSTM/GRU layers. Finally, incorporating MRU encoders with standard BiLSTM architectures further improves performance, achieving state-of-the-art results. 
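The contract-and-expand idea can be caricatured without any learned parameters: contract a sequence by pooling non-overlapping blocks of some range r, expand back to the original length, and turn the multi-range views into a gate on the original sequence. The real MRU layers are parameterized; this NumPy sketch only illustrates the mechanism, and the averaging and sigmoid-gate choices are illustrative assumptions:

```python
import numpy as np

def contract_expand(X, r):
    """Contract a sequence (seq_len, d) by averaging non-overlapping blocks
    of size r, then expand back by repeating each block average r times
    (truncated to the original length)."""
    n, d = X.shape
    n_pad = (-n) % r
    Xp = np.vstack([X, np.zeros((n_pad, d))]) if n_pad else X
    blocks = Xp.reshape(-1, r, d).mean(axis=1)
    return np.repeat(blocks, r, axis=0)[:n]

def multi_range_gate(X, ranges):
    """Combine contract-expand views at several ranges into a sigmoid gate
    applied elementwise to the original sequence."""
    z = sum(contract_expand(X, r) for r in ranges) / len(ranges)
    gate = 1.0 / (1.0 + np.exp(-z))
    return gate * X
```

Each position's gate thus mixes block-level (long-range) summaries with its own (short-range) content, which is the intuition behind gating vectors that "benefit from long- and short-term dependencies".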
Multiregression Dynamic Models (MDM) 
Multiregression dynamic models are defined to preserve certain conditional independence structures over time across a multivariate time series. They are non-Gaussian and yet they can often be updated in closed form. The first two moments of their one-step-ahead forecast distribution can be easily calculated. Furthermore, they can be built to contain all the features of the univariate dynamic linear model and promise more efficient identification of causal structures in a time series than has been possible in the past. multdyn 
Multi-Relevance Transfer Learning (MRTL) 
Transfer learning aims to facilitate learning tasks in a label-scarce target domain by leveraging knowledge from a related source domain with plenty of labeled data. Oftentimes we may have multiple domains with little or no labeled data as targets waiting to be solved. Most existing efforts tackle target domains separately by modeling the ‘source-target’ pairs without exploring the relatedness between them, which would cause loss of crucial information and thus fail to achieve optimal capability of knowledge transfer. In this paper, we propose a novel and effective approach called Multi-Relevance Transfer Learning (MRTL) for this purpose, which can simultaneously transfer different knowledge from the source and exploit the shared common latent factors between target domains. Specifically, we formulate the problem as an optimization task based on a collective nonnegative matrix tri-factorization framework. The proposed approach achieves both source-target transfer and target-target leveraging by sharing multiple decomposed latent subspaces. Further, an alternating minimization learning algorithm is developed with convergence guarantee. Empirical study validates the performance and effectiveness of MRTL compared to the state-of-the-art methods. 
Multi-Resolution Scanning (MRS) 
MRS 
Multi-Robot Transfer Learning  Multi-robot transfer learning allows a robot to use data generated by a second, similar robot to improve its own behavior. The potential advantages are reducing the time of training and the unavoidable risks that exist during the training phase. Transfer learning algorithms aim to find an optimal transfer map between different robots. In this paper, we investigate, through a theoretical study of single-input single-output (SISO) systems, the properties of such optimal transfer maps. We first show that the optimal transfer learning map is, in general, a dynamic system. The main contribution of the paper is to provide an algorithm for determining the properties of this optimal dynamic map, including its order and regressors (i.e., the variables it depends on). The proposed algorithm does not require detailed knowledge of the robots’ dynamics, but relies on basic system properties easily obtainable through simple experimental tests. We validate the proposed algorithm experimentally through an example of transfer learning between two different quadrotor platforms. Experimental results show that an optimal dynamic map, with correct properties obtained from our proposed algorithm, achieves a 60–70% reduction of transfer learning error compared to the cases when the data is directly transferred or transferred using an optimal static map. 
Multi-Scale Deep Neural Network (MSDNN) 
Salient object detection is a fundamental problem and has received a great deal of attention in computer vision. Recently, deep learning models have become a powerful tool for image feature extraction. In this paper, we propose a multi-scale deep neural network (MSDNN) for salient object detection. The proposed model first extracts global high-level features and context information over the whole source image with a recurrent convolutional neural network (RCNN). Then several stacked deconvolutional layers are adopted to get the multi-scale feature representation and obtain a series of saliency maps. Finally, we investigate a fusion convolution module (FCM) to build a final pixel-level saliency map. The proposed model is extensively evaluated on four salient object detection benchmark datasets. Results show that our deep model significantly outperforms 12 other state-of-the-art approaches. 
Multiset Dimension  We introduce a variation of the metric dimension, called the multiset dimension. The representation multiset of a vertex $v$ with respect to $W$ (which is a subset of the vertex set of a graph $G$), $r_m(v|W)$, is defined as the multiset of distances between $v$ and the vertices in $W$, together with their multiplicities. If $r_m(u|W) \neq r_m(v|W)$ for every pair of distinct vertices $u$ and $v$, then $W$ is called a resolving set of $G$. If $G$ has a resolving set, then the cardinality of a smallest resolving set is called the multiset dimension of $G$, denoted by $md(G)$. If $G$ does not contain a resolving set, we write $md(G) = \infty$. We present basic results on the multiset dimension. We also study graphs of given diameter and give some sufficient conditions for a graph to have an infinite multiset dimension. 
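The definitions above are directly computable for small graphs: obtain distances by breadth-first search, collect the multiset $r_m(v|W)$ of distances from each vertex to $W$, and check pairwise distinctness. A small Python sketch (graphs as adjacency dictionaries, assumed connected):

```python
from collections import Counter, deque

def distances_from(adj, src):
    """BFS distances from src in an unweighted graph given as {v: [neighbors]}."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def is_resolving_multiset(adj, W):
    """W resolves G in the multiset sense if the multisets
    r_m(v|W) = {{ d(v, w) : w in W }} are pairwise distinct over all v."""
    reps = []
    for v in adj:
        d = distances_from(adj, v)
        # a Counter frozen into a hashable object represents the multiset
        reps.append(frozenset(Counter(d[w] for w in W).items()))
    return len(reps) == len(set(reps))
```

On the path 0-1-2-3, the endpoint {0} resolves the graph (distances 0, 1, 2, 3 are all distinct), while the symmetric set {1, 2} does not, since the two endpoints receive the same multiset {1, 2}.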
Multi-State Adaptive Dynamic Principal Component Analysis  mvMonitoring 
Multi-State Markov Model  
Multi-Task Attention Network (MTAN) 
In this paper, we propose a novel multi-task learning architecture, which incorporates recent advances in attention mechanisms. Our approach, the Multi-Task Attention Network (MTAN), consists of a single shared network containing a global feature pool, together with task-specific soft-attention modules, which are trainable in an end-to-end manner. These attention modules allow for learning of task-specific features from the global pool, whilst simultaneously allowing for features to be shared across different tasks. The architecture can be built upon any feed-forward neural network, is simple to implement, and is parameter efficient. Experiments on the CityScapes dataset show that our method outperforms several baselines in both single-task and multi-task learning, and is also more robust to the various weighting schemes in the multi-task loss function. We further explore the effectiveness of our method through experiments over a range of task complexities, and show how our method scales well with task complexity compared to baselines. 
Multi-Task Determinantal Point Process (multi-task DPP) 
Determinantal point processes (DPPs) have received significant attention in recent years as an elegant model for a variety of machine learning tasks, due to their ability to elegantly model set diversity and item quality or popularity. Recent work has shown that DPPs can be effective models for product recommendation and basket completion tasks. We present an enhanced DPP model that is specialized for the task of basket completion, the multi-task DPP. We view the basket completion problem as a multi-class classification problem, and leverage ideas from tensor factorization and multi-class classification to design the multi-task DPP model. We evaluate our model on several real-world datasets, and find that the multi-task DPP provides significantly better predictive quality than a number of state-of-the-art models. 
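The basic DPP mechanics behind basket completion are simple to sketch: with a positive semidefinite kernel matrix L, the probability of a set S is proportional to det(L_S), so a candidate item can be scored by the determinant ratio obtained when adding it to the basket. The multi-task model's tensor-factorization machinery is not shown here; this NumPy sketch illustrates only the plain-DPP scoring rule:

```python
import numpy as np

def basket_completion_scores(L, basket, candidates):
    """Score each candidate item i by det(L_{S+i}) / det(L_S), the
    unnormalized DPP probability ratio for adding i to basket S.
    Diverse, high-quality candidates get higher scores."""
    S = list(basket)
    base = np.linalg.det(L[np.ix_(S, S)])
    scores = {}
    for i in candidates:
        T = S + [i]
        scores[i] = np.linalg.det(L[np.ix_(T, T)]) / base
    return scores
```

Because det(L_S) shrinks when items are similar, an item nearly identical to something already in the basket is penalized, while a complementary item is favored; that is the "set diversity" property the entry mentions.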
Multi-Task Multiple Kernel Relationship Learning (MK-MTRL) 
This paper presents a novel multi-task multiple-kernel learning framework that efficiently learns the kernel weights by leveraging the relationship across multiple tasks. The idea is to automatically infer this task relationship in the \textit{RKHS} space corresponding to the given base kernels. The problem is formulated as a regularization-based approach called \textit{Multi-Task Multiple Kernel Relationship Learning} (\textit{MK-MTRL}), which models the task relationship matrix from the weights learned from latent feature spaces of task-specific base kernels. Unlike previous work, the proposed formulation allows one to incorporate prior knowledge for simultaneously learning several related tasks. We propose an alternating minimization algorithm to learn the model parameters, kernel weights and task relationship matrix. In order to tackle large-scale problems, we further propose a two-stage \textit{MK-MTRL} online learning algorithm and show that it significantly reduces the computational time, and also achieves performance comparable to that of the joint learning framework. Experimental results on benchmark datasets show that the proposed formulations outperform several state-of-the-art multi-task learning methods. 
Multivariate Adaptive Regression Splines (MARS) 
Deep neural networks (DNNs) generate much richer function spaces than shallow networks. Since the function spaces induced by shallow networks have several approximation-theoretic drawbacks, this does not, however, necessarily explain the success of deep networks. In this article we take another route by comparing the expressive power of DNNs with ReLU activation function to piecewise linear spline methods. We show that MARS (multivariate adaptive regression splines) is improperly learnable by DNNs in the sense that for any given function that can be expressed as a function in MARS with $M$ parameters there exists a multilayer neural network with $O(M \log (M/\varepsilon))$ parameters that approximates this function up to sup-norm error $\varepsilon$. We show a similar result for expansions with respect to the Faber-Schauder system. Based on this, we derive risk comparison inequalities that bound the statistical risk of fitting a neural network by the statistical risk of spline-based methods. This shows that deep networks perform better or only slightly worse than the considered spline methods. We provide a constructive proof for the function approximations. earth 
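The connection to ReLU networks is easy to see once MARS models are written out: they are sums of coefficients times products of hinge functions max(0, ±(x_j - knot)), and a hinge is exactly a shifted ReLU. This sketch only evaluates a given MARS model; the forward/backward fitting procedure (as in the earth package) is not shown:

```python
import numpy as np

def hinge(x, knot, sign=1.0):
    """MARS basis function max(0, sign * (x - knot)) - a shifted ReLU."""
    return np.maximum(0.0, sign * (x - knot))

def mars_predict(X, intercept, terms):
    """Evaluate a MARS model on X (n x p).
    terms: list of (coef, [(feature_index, knot, sign), ...]);
    each term is a coefficient times a product of hinge functions."""
    y = np.full(len(X), intercept, dtype=float)
    for coef, factors in terms:
        t = np.ones(len(X))
        for j, knot, sign in factors:
            t *= hinge(X[:, j], knot, sign)
        y += coef * t
    return y
```

Single hinges are one-layer ReLU units; it is the products of hinges that require depth to emulate, which is where the $O(M \log(M/\varepsilon))$ multilayer construction comes in.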
Multivariate Bayesian Model with Shrinkage Priors (MBSP) 
The method is described in Bai and Ghosh (2018) <arXiv:1711.07635>. MBSP 
Multivariate Count Autoregression  We study the problems of modeling and inference for multivariate count time series data with Poisson marginals. The focus is on linear and log-linear models. For studying the properties of such processes we develop a novel conceptual framework which is based on copulas. However, our approach does not impose the copula on a vector of counts; instead, the joint distribution is determined by imposing a copula function on a vector of associated continuous random variables. This specific construction avoids conceptual difficulties resulting from the joint distribution of discrete random variables, yet it keeps the properties of the Poisson process marginally. We employ Markov chain theory and the notion of weak dependence to study ergodicity and stationarity of the models we consider. We obtain easily verifiable conditions for both linear and log-linear models under both theoretical frameworks. Suitable estimating equations are suggested for estimating unknown model parameters. The large sample properties of the resulting estimators are studied in detail. The work concludes with some simulations and a real data example. 
Multivariate D-Vine Time Series Model (mDvine) 
This paper proposes a novel semiparametric multivariate D-vine time series model (mDvine) that enables the simultaneous copula-based modeling of both temporal and cross-sectional dependence for multivariate time series. To construct the mDvine, we first build a semiparametric univariate D-vine time series model (uDvine) based on a D-vine. The uDvine generalizes the existing first-order copula-based Markov chain models to Markov chains of an arbitrary order. Building upon the uDvine, we then construct the mDvine by joining multiple uDvines via another parametric copula. As a simple and tractable model, the mDvine provides flexible models for the marginal behavior of time series and can also generate sophisticated temporal and cross-sectional dependence structures. Probabilistic properties of both the uDvine and mDvine are studied in detail. Furthermore, robust and computationally efficient procedures, including a sequential model selection method and a two-stage MLE, are proposed for model estimation and inference, and their statistical properties are investigated. Numerical experiments are conducted to demonstrate the flexibility of the mDvine, and to examine the performance of the sequential model selection procedure and the two-stage MLE. Real data applications on the Australian electricity price and the Ireland wind speed data demonstrate the superior performance of the mDvine compared to traditional multivariate time series models. 
Multivariate Imputation by Chained Equations (MICE) 
Multivariate imputation by chained equations (MICE) is a particular multiple imputation technique (Raghunathan et al., 2001; Van Buuren, 2007). MICE operates under the assumption that given the variables used in the imputation procedure, the missing data are Missing At Random (MAR), which means that the probability that a value is missing depends only on observed values and not on unobserved values (Schafer & Graham, 2002). In other words, after controlling for all of the available data (i.e., the variables included in the imputation model) “any remaining missingness is completely random” (Graham, 2009). Implementing MICE when data are not MAR could result in biased estimates. In the remainder of this paper, we assume that the MICE procedures are used with data that are MAR. mice 
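The chained-equations loop itself is short: initialize missing entries, then repeatedly cycle through the incomplete variables, regressing each on all the others and refreshing its missing values from the fit. This deterministic NumPy sketch uses plain least-squares predictions; real MICE (e.g. the mice R package) instead draws imputations stochastically and produces multiple completed datasets to propagate imputation uncertainty:

```python
import numpy as np

def chained_equations_impute(X, n_iter=10):
    """MICE-style single imputation sketch. X: 2-D float array with NaN
    marking missing entries. Missing values start at column means, then
    each incomplete column is cyclically re-imputed from a linear
    regression on all other columns."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]          # crude initialization
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ beta   # refresh imputations
    return X
```

Under MAR, cycling until the imputations stabilize lets each variable's imputation model condition on the current best guesses for every other variable, which is the essence of the chained-equations idea.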
Multivariate Locally Stationary Wavelet Analysis (mvLSW) 
mvLSW 
Multivariate Ordinal Regression Model  mvord 
Multivariate Process Capability Indices (MPCI) 
MPCI 
Multivariate Range Boxes  dynRB 
Multivariate Response Regression Models  
Multivariate Statistics  Multivariate statistics is a form of statistics encompassing the simultaneous observation and analysis of more than one outcome variable. The application of multivariate statistics is multivariate analysis. Multivariate statistics concerns understanding the different aims and background of each of the different forms of multivariate analysis, and how they relate to each other. The practical implementation of multivariate statistics to a particular problem may involve several types of univariate and multivariate analysis in order to understand the relationships between variables and their relevance to the actual problem being studied. In addition, multivariate statistics is concerned with multivariate probability distributions, in terms of both: 1. how these can be used to represent the distributions of observed data; 2. how they can be used as part of statistical inference, particularly where several different quantities are of interest to the same analysis. Certain types of problem involving multivariate data, for example simple linear regression and multiple regression, are NOT usually considered as special cases of multivariate statistics because the analysis is dealt with by considering the (univariate) conditional distribution of a single outcome variable given the other variables. 
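The last point above, that regression is handled through the conditional distribution of one outcome given the others, can be made concrete with a fitted multivariate normal. In the bivariate case the conditional mean slope is Sigma_yx / Sigma_xx, which coincides exactly with the ordinary least-squares slope:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.standard_normal(n)
y = 1.5 * x + rng.standard_normal(n)       # outcome related to x
data = np.column_stack([y, x])

mu = data.mean(axis=0)                      # fitted mean vector
Sigma = np.cov(data, rowvar=False)          # fitted covariance matrix

# Conditional distribution of y given x under a multivariate normal:
# E[y | x] = mu_y + (Sigma_yx / Sigma_xx) * (x - mu_x)
slope_mvn = Sigma[0, 1] / Sigma[1, 1]

# Ordinary least-squares slope for comparison
A = np.c_[np.ones(n), x]
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
slope_ols = beta[1]
```

The two slopes agree to numerical precision, which is exactly why simple and multiple regression are usually treated as univariate conditional analyses rather than as multivariate statistics proper.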
Multivariate Subjective Fiducial Inference  The aim of this paper is to firmly establish subjective fiducial inference as a rival to the more conventional schools of statistical inference, and to show that Fisher’s intuition concerning the importance of the fiducial argument was correct. In particular, methodology outlined in an earlier paper will be modified, enhanced and extended to deal with general inferential problems in which various parameters are unknown. Although the resulting theory is classified as being ‘subjective’, it is shown that this is simply due to the argument that all probability statements made about fixed but unknown parameters must be inherently subjective, rather than due to a need to emphasize how different the fiducial probabilities that can be derived using this theory are from objective probabilities. Some important examples of the application of this theory are presented. 
Multiway Data Analysis  Multiway data analysis is a method of analyzing large data sets by representing the data as a multidimensional array. The proper choice of array dimensions and analysis techniques can reveal patterns in the underlying data undetected by other methods. 
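A basic operation in multiway analysis is matricization (unfolding) of the multidimensional array along a chosen mode, after which standard matrix techniques such as the SVD can be applied mode by mode. A small numpy sketch, assuming one common unfolding convention (mode on the rows, remaining modes flattened into columns):

```python
import numpy as np

# A 3-way array: e.g. subjects x variables x time points
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 5, 6))

def unfold(tensor, mode):
    """Mode-k unfolding: put mode k on the rows and flatten the
    remaining modes into the columns."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

X0 = unfold(X, 0)   # shape (4, 30)
X1 = unfold(X, 1)   # shape (5, 24)
X2 = unfold(X, 2)   # shape (6, 20)
```

Each unfolding rearranges, but does not alter, the entries, so quantities such as the Frobenius norm are preserved, and low-rank structure in an unfolding reveals patterns along that mode.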
MuProp  Deep neural networks are powerful parametric models that can be trained efficiently using the backpropagation algorithm. Stochastic neural networks combine the power of large parametric functions with that of graphical models, which makes it possible to learn very complex distributions. However, as backpropagation is not directly applicable to stochastic networks that include discrete sampling operations within their computational graph, training such networks remains difficult. We present MuProp, an unbiased gradient estimator for stochastic networks, designed to make this task easier. MuProp improves on the likelihood-ratio estimator by reducing its variance using a control variate based on the first-order Taylor expansion of a mean-field network. Crucially, unlike prior attempts at using backpropagation for training stochastic networks, the resulting estimator is unbiased and well behaved. 
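The control-variate idea can be checked on a toy problem. The sketch below (a single Bernoulli unit with a quadratic loss plus a constant offset, not the paper's full network setup) compares the plain likelihood-ratio estimator with a MuProp-style estimator that subtracts the first-order Taylor expansion of the loss around the mean-field value and adds its expected gradient back analytically:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, target = 0.3, 0.4
n = 200_000

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
p = sigmoid(theta)                       # mean-field value of the unit

# Toy loss; the constant offset mimics a loss with a large baseline
# component, which inflates the likelihood-ratio estimator's variance
f  = lambda b: (b - target) ** 2 + 2.0
df = lambda b: 2.0 * (b - target)        # derivative of the loss

b = (rng.random(n) < p).astype(float)    # Bernoulli samples
score = b - p                            # d log p(b) / d theta

# Plain likelihood-ratio (REINFORCE) estimator
g_lr = f(b) * score

# MuProp-style estimator: Taylor expansion around p as a control variate,
# with its exact expected gradient df(p) * dp/dtheta added back
taylor = f(p) + df(p) * (b - p)
g_muprop = (f(b) - taylor) * score + df(p) * p * (1 - p)

# Closed-form gradient of E[f(b)] for checking unbiasedness
true_grad = p * (1 - p) * (1 - 2 * target)
```

Both estimators average to the true gradient, but the MuProp-style estimator does so with far smaller variance because the control variate absorbs the part of the loss its linearization explains.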
Murphy Diagram  In the context of probability forecasts for binary weather events, displays of this type have a rich tradition that can be traced to Thompson and Brier (1955) and Murphy (1977). More recent examples include the papers by Schervish (1989), Richardson (2000), Wilks (2001), Mylne (2002), and Berrocal et al. (2010), among many others. Murphy (1977) distinguished three kinds of diagrams that reflect the economic decisions involved. The negatively oriented expense diagram shows the mean raw loss or expense of a given forecast scheme; the positively oriented value diagram takes the unconditional or climatological forecast as reference and plots the difference in expense between this reference forecast and the forecast at hand; and lastly, the relative-value diagram plots the ratio of the utility of a given forecast and the utility of an oracle forecast. The displays introduced above are similar to the value diagrams of Murphy, and we refer to them as Murphy diagrams. Murphy diagrams in R 
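The expense and value diagrams can be computed directly in the classic cost-loss setting. A minimal sketch, assuming a user who protects (at cost C = alpha * L) whenever the forecast probability reaches their cost-loss ratio alpha, evaluated over a grid of alpha values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
p_fcst = rng.uniform(0.05, 0.95, n)          # calibrated forecast probabilities
y = (rng.random(n) < p_fcst).astype(float)   # binary event outcomes
base_rate = y.mean()

def mean_expense(forecast, y, alpha, L=1.0):
    """Mean cost-loss expense: pay C = alpha * L when protecting,
    otherwise risk the full loss L if the event occurs."""
    protect = forecast >= alpha
    return np.mean(np.where(protect, alpha * L, L * y))

alphas = np.linspace(0.05, 0.95, 19)
# negatively oriented expense diagram for the forecast at hand
expense_fcst = np.array([mean_expense(p_fcst, y, a) for a in alphas])
# climatological reference: constant forecast at the base rate
expense_clim = np.array([mean_expense(np.full(n, base_rate), y, a)
                         for a in alphas])
# positively oriented value diagram: savings relative to climatology
value = expense_clim - expense_fcst
```

Plotting `expense_fcst` and `value` against `alphas` reproduces the first two of Murphy's diagram types; for a calibrated forecast the value curve is nonnegative across all cost-loss ratios.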
Mutual Information Neural Estimator (MINE) 
We argue that the estimation of the mutual information between high dimensional continuous random variables is achievable by gradient descent over neural networks. This paper presents a Mutual Information Neural Estimator (MINE) that is linearly scalable in dimensionality as well as in sample size. MINE is backpropable and we prove that it is strongly consistent. We illustrate a handful of applications in which MINE is successfully applied to enhance the properties of generative models in both unsupervised and supervised settings. We apply our framework to estimate the information bottleneck, and apply it in tasks related to supervised classification problems. Our results demonstrate substantial added flexibility and improvement in these settings. 
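The bound MINE maximizes, the Donsker-Varadhan representation I(X;Z) >= E_joint[T] - log E_marginal[exp(T)], can be illustrated without a neural network. The sketch below (an assumption-laden toy, not the paper's architecture) uses a statistics function T that is linear in quadratic features, which suffices for a correlated Gaussian pair, and optimizes it by gradient ascent; the objective is concave in the parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 20_000, 0.8
x = rng.standard_normal(n)
z = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

def feats(a, b):
    # quadratic features: rich enough for the optimal log-density ratio
    # of a bivariate Gaussian
    return np.column_stack([a * b, a**2, b**2])

F_joint = feats(x, z)                    # samples from the joint
F_marg = feats(x, rng.permutation(z))    # product of marginals via shuffling

theta = np.zeros(3)
lr = 0.02
for _ in range(2000):
    # gradient of  E_joint[T] - log E_marg[exp(T)]  with T = F @ theta
    t_m = F_marg @ theta
    w = np.exp(t_m - t_m.max())
    w /= w.sum()                          # softmax weights (stable)
    grad = F_joint.mean(axis=0) - w @ F_marg
    theta += lr * grad                    # ascent on a concave objective

t_m = F_marg @ theta
log_mean_exp = np.log(np.mean(np.exp(t_m - t_m.max()))) + t_m.max()
mi_est = (F_joint @ theta).mean() - log_mean_exp
true_mi = -0.5 * np.log(1 - rho**2)       # about 0.51 nats for rho = 0.8
```

MINE replaces the fixed feature map with a neural network trained by backpropagation, which is what makes the estimator scale to high-dimensional variables.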
mxnet  MXNet is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity. At its core, MXNet contains a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly. A graph optimization layer on top of that makes symbolic execution fast and memory efficient. MXNet is portable and lightweight, scaling effectively to multiple GPUs and multiple machines. MXNet is also more than a deep learning project. It is also a collection of blueprints and guidelines for building deep learning systems, along with interesting insights into DL systems for hackers. mxnet 
MXNET-MPI  Existing Deep Learning frameworks exclusively use either the Parameter Server (PS) approach or MPI parallelism. In this paper, we discuss the drawbacks of such approaches and propose a generic framework supporting both PS and MPI programming paradigms, coexisting at the same time. The key advantage of the new model is to embed the scaling benefits of MPI parallelism into the loosely coupled PS task model. Apart from providing a practical usage model of MPI in the cloud, such a framework allows for novel communication-avoiding algorithms that do parameter averaging in Stochastic Gradient Descent (SGD) approaches. We show how MPI and PS models can synergistically apply algorithms such as Elastic SGD to improve the rate of convergence against existing approaches. These new algorithms directly help scale SGD cluster-wide. Further, we also optimize the critical component of the framework, namely global aggregation or all-reduce, using a novel concept of tensor collectives. These treat a group of vectors on a node as a single object, allowing the existing single-vector algorithms to be directly applicable. We back our claims with sufficient empirical evidence using the large-scale ImageNet 1K dataset. Our framework is built upon MXNET, but the design is generic and can be adapted to other popular DL infrastructures. 
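The elastic-averaging update behind Elastic SGD can be sketched as a single-process simulation. The toy below (exact gradients, quadratic local losses; a real deployment runs these updates across MPI ranks or a parameter server) couples each worker's parameter to a shared center through an elastic term:

```python
import numpy as np

# Each "worker" holds a local parameter x_i and a local quadratic loss
# 0.5 * (x_i - c_i)^2; an elastic penalty rho pulls the workers and a
# shared center variable z toward each other.
centers = np.array([1.0, 2.0, 3.0, 4.0])   # minimizers of the local losses
x = np.zeros(4)                            # per-worker parameters
z = 0.0                                    # shared center (averaged) variable
eta, rho = 0.1, 1.0

for _ in range(500):
    grads = x - centers                    # gradients of the local losses
    x_new = x - eta * (grads + rho * (x - z))   # worker update
    z = z + eta * rho * np.sum(x - z)           # center pulled toward workers
    x = x_new

# At the fixed point the center z equals the minimizer of the average
# loss (here, the mean of the local minimizers), and each worker sits
# between its own minimizer and the center.
```

The periodic exchange of `x - z` terms is exactly the parameter-averaging traffic that the framework maps onto either PS pulls or MPI all-reduce collectives.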