d3.compose  Compose complex, data-driven visualizations from reusable charts and components with d3. · Get started quickly with standard charts and components · Lay out charts and components automatically · Powerful foundation for creating custom charts and components 
DAGOR  Effective overload control for large-scale online service systems is crucial for protecting the system backend from overload. Conventionally, the design of overload control is ad hoc for each individual service. However, service-specific overload control could be detrimental to the overall system due to intricate service dependencies or flawed service implementations. Service developers usually have difficulty accurately estimating the dynamics of the actual workload during service development. Therefore, it is essential to decouple overload control from the service logic. In this paper, we propose DAGOR, an overload control scheme designed for the account-oriented microservice architecture. DAGOR is service-agnostic and system-centric. It manages overload at the microservice granule such that each microservice monitors its load status in real time and triggers load shedding in a collaborative manner among its relevant services when overload is detected. DAGOR has been used in the WeChat backend for five years. Experimental results show that DAGOR can sustain a high service success rate even when the system is experiencing overload, while ensuring fairness in the overload control. 
Daleel  In this paper we present Daleel, a multi-criteria adaptive decision-making framework developed to find the optimal IaaS deployment strategy. 
DALEX  Predictive modeling is invaded by elastic, yet complex methods such as neural networks or ensembles (model stacking, boosting or bagging). Such methods are usually described by a large number of parameters or hyperparameters – a price that one needs to pay for elasticity. The sheer number of parameters makes models hard to understand. This paper describes a consistent collection of explainers for predictive models, a.k.a. black boxes. Each explainer is a technique for exploration of a black box model. Presented approaches are model-agnostic, which means that they extract useful information from any predictive method regardless of its internal structure. Each explainer is linked with a specific aspect of a model. Some are useful in decomposing predictions, some serve better in understanding performance, while others are useful in understanding importance and conditional responses of a particular variable. Every explainer presented in this paper works for a single model or for a collection of models. In the latter case, models can be compared against each other. Such comparison helps to find strengths and weaknesses of different approaches and gives additional possibilities for model validation. Presented explainers are implemented in the DALEX package for R. They are based on a uniform standardized grammar of model exploration which may be easily extended. The current implementation supports the most popular frameworks for classification and regression. 
Damerau-Levenshtein Distance  In information theory and computer science, the Damerau-Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a string metric for measuring the edit distance between two sequences. Informally, the Damerau-Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other. The Damerau-Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical single-character edit operations (insertions, deletions and substitutions). In his seminal paper, Damerau stated that these four operations correspond to more than 80% of all human misspellings. Damerau’s paper considered only misspellings that could be corrected with at most one edit operation. While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, Damerau-Levenshtein distance has also seen uses in biology to measure the variation between protein sequences. 
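The four operations described above can be illustrated with a short dynamic-programming sketch. Note that this is the restricted variant usually called optimal string alignment (OSA), which never edits a substring more than once and can therefore differ from the unrestricted Damerau-Levenshtein distance; the function name `osa_distance` is chosen here for illustration only.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]
```

For example, `osa_distance("ca", "ac")` counts one transposition, whereas the classical Levenshtein distance would need two substitutions.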
Damped Least-Squares (DLS) 
In mathematics and computing, the Levenberg-Marquardt algorithm (LMA), also known as the damped least-squares (DLS) method, is used to solve nonlinear least squares problems. These minimization problems arise especially in least squares curve fitting. The LMA interpolates between the Gauss-Newton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For well-behaved functions and reasonable starting parameters, the LMA tends to be a bit slower than the GNA. LMA can also be viewed as Gauss-Newton using a trust region approach. The LMA is a very popular curve-fitting algorithm used in many software applications for solving generic curve-fitting problems. However, as for many fitting algorithms, the LMA finds only a local minimum, which is not necessarily the global minimum. onls 
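To make the interpolation between Gauss-Newton and gradient descent concrete, here is a minimal sketch that fits the two-parameter model y = a·exp(b·x). The model, the starting values, the factor-of-ten damping schedule and the function name `fit_exponential_lm` are all assumptions made for illustration; production implementations use more careful step control.

```python
import math

def fit_exponential_lm(xs, ys, a=1.0, b=0.0, lam=1e-3, max_iter=500, tol=1e-12):
    """Fit y = a * exp(b * x) with a basic Levenberg-Marquardt loop."""
    def sse(a_, b_):
        # Sum of squared residuals; an overflowing trial step counts as "worse".
        try:
            return sum((y - a_ * math.exp(b_ * x)) ** 2 for x, y in zip(xs, ys))
        except OverflowError:
            return math.inf

    cost = sse(a, b)
    for _ in range(max_iter):
        # Accumulate J^T J (g11, g12, g22) and J^T r (h1, h2).
        g11 = g12 = g22 = h1 = h2 = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            j1, j2 = e, a * x * e  # d(model)/da, d(model)/db
            r = y - a * e          # residual
            g11 += j1 * j1
            g12 += j1 * j2
            g22 += j2 * j2
            h1 += j1 * r
            h2 += j2 * r
        # Damped normal equations: (J^T J + lam * diag(J^T J)) delta = J^T r.
        m11, m22 = g11 * (1 + lam), g22 * (1 + lam)
        det = m11 * m22 - g12 * g12
        if det == 0.0:
            lam *= 10
            continue
        da = (h1 * m22 - g12 * h2) / det
        db = (m11 * h2 - g12 * h1) / det
        new_cost = sse(a + da, b + db)
        if new_cost < cost:
            # Accept the step and move toward Gauss-Newton (less damping).
            a, b, cost = a + da, b + db, new_cost
            lam /= 10
            if abs(da) + abs(db) < tol:
                break
        else:
            # Reject the step and move toward gradient descent (more damping).
            lam *= 10
            if lam > 1e15:
                break
    return a, b
```

A small damping value makes each step close to a Gauss-Newton step; a large one shrinks it toward a short gradient-descent step, which is what lets the method recover from poor starting points.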
DancingLines  Nowadays, events usually burst and are propagated online through multiple modern media like social networks and search engines. Various studies discuss event dissemination trends on an individual medium, while few focus on event popularity analysis from a cross-platform perspective. Challenges come from the vast diversity of events and media, limited access to aligned datasets across different media and a great deal of noise in the datasets. In this paper, we design DancingLines, an innovative scheme that captures and quantitatively analyzes event popularity between pairwise text media. It contains two models: TFSW, a semantic-aware popularity quantification model, based on an integrated weight coefficient leveraging Word2Vec and TextRank; and wDTWCD, a pairwise event popularity time series alignment model matching different event phases adapted from Dynamic Time Warping. We also propose three metrics to interpret event popularity trends between pairwise social platforms. Experimental results on eighteen real-world event datasets from an influential social network and a popular search engine validate the effectiveness and applicability of our scheme. DancingLines is demonstrated to have broad application potential for discovering knowledge about various aspects of events and different media. 
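For readers unfamiliar with the alignment technique that the second model is adapted from, the following is a sketch of classic Dynamic Time Warping on two numeric series. It is not the wDTWCD model from the paper; the absolute-difference cost and the function name `dtw_distance` are assumptions for illustration.

```python
def dtw_distance(s, t):
    """Classic Dynamic Time Warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(s), len(t)
    # d[i][j] = minimal cumulative cost of aligning s[:i] with t[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            d[i][j] = cost + min(
                d[i - 1][j],      # s advances, t stays (stretch t)
                d[i][j - 1],      # t advances, s stays (stretch s)
                d[i - 1][j - 1],  # both advance
            )
    return d[n][m]
```

Because one point may be matched to several points of the other series, two popularity curves with the same shape but different pacing align at low cost.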
Dark Data  The total amount of data in every organization is far, far greater than anyone including, most crucially, their Information Technology group knows about. Moreover this “missing” data that can’t be seen and currently can’t be made use of is also the very stuff that holds the organization together. This is what we call “Dark Data.” 
Dark Knowledge  A simple way to improve classification performance is to average the predictions of a large ensemble of different classifiers. This is great for winning competitions but requires too much computation at test time for practical applications such as speech recognition. In a widely ignored paper in 2006, Caruana and his collaborators showed that the knowledge in the ensemble could be transferred to a single, efficient model by training the single model to mimic the log probabilities of the ensemble average. This technique works because most of the knowledge in the learned ensemble is in the relative probabilities of extremely improbable wrong answers. For example, the ensemble may give an image of a BMW a probability of one in a billion of being a garbage truck but this is still far greater (in the log domain) than its probability of being a carrot. This ‘dark knowledge’, which is practically invisible in the class probabilities, defines a similarity metric over the classes that makes it much easier to learn a good classifier. http://…/darkknowledgeneuralnetwork.html http://…/geoffhintonsdarkknowledge http://…/1503.02531v1.pdf 
DARPA Open Catalog  Welcome to the DARPA Open Catalog, which contains a curated list of DARPA-sponsored software and peer-reviewed publications. DARPA sponsors fundamental and applied research in a variety of areas that may lead to experimental results and reusable technology designed to benefit multiple government domains. The DARPA Open Catalog organizes publicly releasable material from DARPA programs. DARPA has an open strategy to help increase the impact of government investments. DARPA is interested in building communities around government-funded research. DARPA plans to continue to make available information generated by DARPA programs, including software, publications, data, and experimental results. The table on this page lists the programs currently participating in the catalog. 
dask-searchcv  This library provides implementations of Scikit-Learn’s GridSearchCV and RandomizedSearchCV. They implement many (but not all) of the same parameters, and should be a drop-in replacement for the subset that they do implement. For certain problems, these implementations can be more efficient than those in Scikit-Learn, as they can avoid expensive repeated computations. 
Dat  Build data pipelines – Dat is an open source project that provides a streaming interface between every file format and data storage backend. 
Data Acceleration  Data technologies are evolving rapidly, but organizations have adopted most of these in piecemeal fashion. As a result, enterprise data – whether related to customer interactions, business performance, computer notifications, or external events in the business environment – is vastly underutilized. Moreover, companies’ data ecosystems have become complex and littered with data silos. This makes the data more difficult to access, which in turn limits the value that organizations can get out of it. Indeed, according to a recent Gartner, Inc. report, 85 percent of Fortune 500 organizations will be unable to exploit Big Data for competitive advantage through 2015. Furthermore, a recent Accenture study found that half of all companies have concerns about the accuracy of their data, and the majority of executives are unclear about the business outcomes they are getting from their data analytics programs. To unlock the value hidden in their data, companies must start treating data as a supply chain, enabling it to flow easily and usefully through the entire organization – and eventually throughout each company’s ecosystem of partners, including suppliers and customers. The time is right for this approach. For one thing, new external data sources are becoming available, providing fresh opportunities for data insights. In addition, the tools and technology required to build a better data platform are available and in use. These provide a foundation on which companies can construct an integrated, end-to-end data supply chain. 
Data Acquisition  Data acquisition is the process of sampling signals that measure real world physical conditions and converting the resulting samples into digital numeric values that can be manipulated by a computer. Data acquisition systems (abbreviated with the acronym DAS or DAQ) typically convert analog waveforms into digital values for processing. 
Data Aggregation  In statistics, aggregate data describes data combined from several measurements. When data are aggregated, groups of observations are replaced with summary statistics based on those observations. In economics, aggregate data or data aggregates describes high-level data that is composed from a multitude or combination of other more individual data. 
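As a small illustration of replacing groups of observations with summary statistics, the sketch below groups records by a key and keeps only the count, mean, minimum and maximum of each group. The record layout and the function name `aggregate` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def aggregate(rows, key, value):
    """Replace each group of observations with summary statistics."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        k: {"count": len(v), "mean": mean(v), "min": min(v), "max": max(v)}
        for k, v in groups.items()
    }

# Hypothetical observations: individual sales per region.
rows = [
    {"region": "north", "sales": 10},
    {"region": "north", "sales": 20},
    {"region": "south", "sales": 5},
]
summary = aggregate(rows, key="region", value="sales")
```

After aggregation the individual observations are no longer recoverable; only the per-group summaries remain.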
Data Analysis  Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains. 
Data Analytics  Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains. Data analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information. Data analytics is used in many industries to allow companies and organizations to make better business decisions and in the sciences to verify or disprove existing models or theories. Definition 
Data Archaeology  Data archaeology refers to the art and science of recovering computer data encoded and/or encrypted in now-obsolete media or formats. Data archaeology can also refer to recovering information from damaged electronic formats after natural or man-made disasters. 
Data as a Service (DaaS) 
Data as a Service, or DaaS, is a cousin of software as a service. Like all members of the “as a Service” (aaS) family, DaaS is based on the concept that the product, data in this case, can be provided on demand to the user regardless of geographic or organizational separation of provider and consumer. Additionally, the emergence of service-oriented architecture (SOA) has rendered the actual platform on which the data resides also irrelevant. This development has enabled the recent emergence of the relatively new concept of DaaS. Data provided as a service was at first primarily used in Web mashups, but now is being increasingly employed both commercially and, less commonly, within organisations such as the UN. Traditionally, most enterprises have used data stored in a self-contained repository, for which software was specifically developed to access and present the data in a human-readable form. One result of this paradigm is the bundling of both the data and the software needed to interpret it into a single package, sold as a consumer product. As the number of bundled software/data packages proliferated and required interaction among one another, another layer of interface was required. These interfaces, collectively known as enterprise application integration (EAI), often tended to encourage vendor lock-in, as it is generally easy to integrate applications that are built upon the same foundation technology. The result of the combined software/data consumer package and required EAI middleware has been an increased amount of software for organizations to manage and maintain, simply for the use of particular data. In addition to routine maintenance costs, a cascade of software updates is required as the format of the data changes. The existence of this situation contributes to the attractiveness of DaaS to data consumers, because it allows for the separation of data cost and usage from that of a specific software or platform. 
Data Assimilation  Data assimilation is the process by which observations are incorporated into a computer model of a real system. Applications of data assimilation arise in many fields of geosciences, perhaps most importantly in weather forecasting and hydrology. Data assimilation proceeds by analysis cycles. In each analysis cycle, observations of the current (and possibly past) state of a system are combined with the results from a numerical model (the forecast) to produce an analysis, which is considered as ‘the best’ estimate of the current state of the system. This is called the analysis step. Essentially, the analysis step tries to balance the uncertainty in the data and in the forecast. The model is then advanced in time and its result becomes the forecast in the next analysis cycle. Book: Data Assimilation Book: Data Assimilation Data Assimilation Lecture Notes 
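The analysis cycle can be sketched for a single scalar state. The version below assumes Gaussian errors and weights forecast and observation by their variances, as in a one-dimensional Kalman filter update; the names `analysis_step` and `assimilate`, and the simple additive growth of forecast uncertainty, are illustrative assumptions rather than any particular operational scheme.

```python
def analysis_step(forecast, forecast_var, obs, obs_var):
    """Blend a forecast and an observation according to their uncertainties."""
    # The gain is near 1 when the forecast is uncertain, near 0 when the observation is.
    gain = forecast_var / (forecast_var + obs_var)
    analysis = forecast + gain * (obs - forecast)
    analysis_var = (1 - gain) * forecast_var
    return analysis, analysis_var

def assimilate(observations, state, var, obs_var, model, model_var):
    """Run analysis cycles: analyse, then advance the model to the next forecast."""
    analyses = []
    for obs in observations:
        state, var = analysis_step(state, var, obs, obs_var)
        analyses.append(state)
        # The model result becomes the forecast of the next cycle;
        # its uncertainty is assumed to grow by model_var each step.
        state, var = model(state), var + model_var
    return analyses
```

With equal forecast and observation variances the analysis lands halfway between the two, and its variance is half that of the forecast.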
Data Augmentation  Data augmentation adds value to base data by adding information derived from internal and external sources within an enterprise. Data is one of the core assets for an enterprise, making data management essential. Data augmentation can be applied to any form of data, but may be especially useful for customer data, sales patterns, and product sales, where additional information can help provide more in-depth insight. Data augmentation can help reduce the manual intervention required to develop meaningful information and insight from business data, as well as significantly enhance data quality. Data augmentation is one of the last steps in enterprise data management, after monitoring, profiling and integration. Some of the common techniques used in data augmentation include: · Extrapolation Technique: Based on heuristics. The relevant fields are updated or provided with values. · Tagging Technique: Common records are tagged to a group, making it easier to understand and differentiate for the group. · Aggregation Technique: Using mathematical values of averages and means, values are estimated for relevant fields if needed. · Probability Technique: Based on heuristics and analytical statistics, values are populated based on the probability of events. https://…/01jcgsart.pdf 
Data Blending  Data blending is the process of combining data from multiple sources to reveal deeper intelligence that drives better business decision-making. Data blending differs from data integration and data warehousing in that its primary use is not to create the single, unified version of the truth that is stored in systems of record. Rather, business and data analysts use data blending to build an analytic dataset to assist in answering a specific business question and driving a particular business process. 
Data Broker / Information Broker  An information broker (independent information professional, information consultant, or data broker) collects information, often about individual people. The data are then sold to companies that use it to target advertising and marketing towards specific groups, to verify a person’s identity including for purposes of fraud detection, and to sell to individuals and organizations so they can research particular individuals. Critics, including consumer protection organizations, say the industry is secretive and unaccountable, and should be better regulated. 
Data Business Model  According to Wikipedia, a business model “describes the rationale of how an organization creates, delivers, and captures value.” A Data Business Model is a business model where data is an indispensable component. If you remove the data, the business fails (or at least suffers greatly). To take one example, Amazon’s data is core to their business. Their historical transaction data helps them figure out how much inventory to hold and how to price products. Additionally, data about product views and purchases powers the recommendation engine, which drives a large portion of sales. Furthermore, product reviews drive traffic and SEO. As icing on the cake, all of this is a virtuous cycle: recommendations drive purchases, which result in more reviews, which lead to better SEO and more traffic, which results in more visitors and better recommendations. If Amazon weren’t so effective at using data, it would be a much smaller company. The best part of data business models is that they often have the same kind of positive feedback loop as Amazon. In each business model, the more you use data to make money, the more data you get as a result, which helps you make more money in the future. 
Data Communications  Data Communications concerns the transmission of digital messages to devices external to the message source. “External” devices are generally thought of as being independently powered circuitry that exists beyond the chassis of a computer or other digital message source. As a rule, the maximum permissible transmission rate of a message is directly proportional to signal power, and inversely proportional to channel noise. It is the aim of any communications system to provide the highest possible transmission rate at the lowest possible power and with the least possible noise. 
Data Cube Materialization  Data cube materialization is a classical database operator introduced in Gray et al. (Data Mining and Knowledge Discovery, Vol. 1), which is critical for many analysis tasks. Nandi et al. (Transactions on Knowledge and Data Engineering, Vol. 6) first studied cube materialization for large-scale datasets using the MapReduce framework, and proposed a sophisticated modification of a simple broadcast algorithm to handle a dataset with a 216GB cube size within 25 minutes with 2k machines in 2012. 
Data Curation  Data curation is a term used to indicate management activities required to maintain research data long-term such that it is available for reuse and preservation. In science, data curation may indicate the process of extraction of important information from scientific texts, such as research articles by experts, to be converted into an electronic format, such as an entry of a biological database. The term is also used in the humanities, where increasing cultural and scholarly data from digital humanities projects requires the expertise and analytical practices of data curation. In broad terms, curation means a range of activities and processes done to create, manage, maintain, and validate a component. 
Data Decorations  Alberto Cairo left a comment about ‘data decorations’. This is a name he’s using to describe something like the windshield-wiper chart I discussed the other day. It seems like the visual elements were purely ornamental and added nothing to the experience – one might argue that the experience was worse than just staring at the data table. 
Data Distillery  The paper tackles the unsupervised estimation of the effective dimension of a sample of dependent random vectors. The proposed method uses the principal components (PC) decomposition of sample covariance to establish a low-rank approximation that helps uncover the hidden structure. The number of PCs to be included in the decomposition is determined via a Probabilistic Principal Components Analysis (PPCA) embedded in a penalized profile likelihood criterion. The choice of penalty parameter is guided by a data-driven procedure that is justified via analytical derivations and extensive finite sample simulations. Application of the proposed penalized PPCA is illustrated with three gene expression datasets in which the number of cancer subtypes is estimated from all expression measurements. The analyses point towards hidden structures in the data, e.g. additional subgroups, that could be of scientific interest. 
Data Driven Business Model (DDBM) 
This paper contributes by providing a definition of a data-driven business model as a business model that relies on data as a key resource. 
Data Driven Documents (D3) 
D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation. Visualizing Data with D3.js D3 Tips and Tricks Awesome D3 
Data Envelopment Analysis (DEA) 
Data envelopment analysis (DEA) is a nonparametric method in operations research and economics for the estimation of production frontiers. It is used to empirically measure productive efficiency of decision making units (or DMUs). Although DEA has a strong link to production theory in economics, the tool is also used for benchmarking in operations management, where a set of measures is selected to benchmark the performance of manufacturing and service operations. Data Envelopment Analysis rDEA 
Data Federation  In most cases, if the term federation is used, it refers to combining autonomously operating objects. For example, states can be federated to form one country. If we apply this common explanation to data federation, it means combining autonomous data stores to form one large data store. Therefore, we propose the following definition: ‘Data federation is a form of data virtualization where the data stored in a heterogeneous set of autonomous data stores is made accessible to data consumers as one integrated data store by using on-demand data integration.’ This definition is based on the following concepts: · Data virtualization: Data federation is a form of data virtualization. Note that not all forms of data virtualization imply data federation. For example, if an organization wants to virtualize the database of one application, no need exists for data federation. But data federation always results in data virtualization. · Heterogeneous set of data stores: Data federation should make it possible to bring data together from data stores using different storage structures, different access languages, and different APIs. An application using data federation should be able to access different types of database servers and files with various formats; it should be able to integrate data from all those data sources; it should offer features for transforming the data; and it should allow the applications and tools to access the data through various APIs and languages. · Autonomous data stores: Data stores accessed by data federation are able to operate independently; in other words, they can be used outside the scope of data federation. · One integrated data store: Regardless of how and where data is stored, it should be presented as one integrated data set. This implies that data federation involves transformation, cleansing, and possibly even enrichment of data. 
· On-demand integration: This refers to when the data from a heterogeneous set of data stores is integrated. With data federation, integration takes place on the fly, and not in batch. Only when the data consumers ask for data is the data accessed and integrated. So the data is not stored in an integrated way, but remains in its original location and format. Spark Reaches for the Holy Grail: Federated Queries 
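The concepts listed above can be sketched with two deliberately different stores: a relational database and a CSV file. Both remain where they are, and the data is combined only when a consumer asks for it. The table layout, the CSV content and the function names are hypothetical; an in-memory SQLite database and a string stand in for real, independently operated stores.

```python
import csv
import io
import sqlite3

def make_customer_db():
    """Autonomous store 1: a relational database (in-memory for this sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
    return conn

# Autonomous store 2: a CSV file with a different structure and access method.
ORDERS_CSV = "customer_id,amount\n1,30.0\n2,12.5\n1,7.5\n"

def federated_totals(conn, orders_csv):
    """Integrate both stores at query time and present one combined data set."""
    totals = {}
    for row in csv.DictReader(io.StringIO(orders_csv)):
        cid = int(row["customer_id"])  # transform text fields into typed values
        totals[cid] = totals.get(cid, 0.0) + float(row["amount"])
    return [
        {"id": cid, "name": name, "total": totals.get(cid, 0.0)}
        for cid, name in conn.execute("SELECT id, name FROM customers ORDER BY id")
    ]
```

Nothing is copied into an integrated store: each call to `federated_totals` reads the sources again, so a change in either store is visible in the next query.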
Data Fusion  Data fusion is the process of integration of multiple data and knowledge representing the same real-world object into a consistent, accurate, and useful representation. Data fusion processes are often categorized as low, intermediate or high, depending on the processing stage at which fusion takes place. Low level data fusion combines several sources of raw data to produce new raw data. The expectation is that fused data is more informative and synthetic than the original inputs. For example, sensor fusion is also known as (multi-sensor) data fusion and is a subset of information fusion. 
Data Hoarding 
http://…/Wikipedia:Avoid_datahoarding http://…hoardingpcavisualizationdecisions.html 
Data Illustrator  Data Illustrator: Augmenting Vector Design Tools with Lazy Data Binding for Expressive Visualization Authoring. Building graphical user interfaces for visualization authoring is challenging as one must reconcile the tension between flexible graphics manipulation and procedural visualization generation based on a graphical grammar or declarative languages. To better support designers’ workflows and practices, we propose Data Illustrator, a novel visualization framework. In our approach, all visualizations are initially vector graphics; data binding is applied when necessary and only constrains interactive manipulation to that data-bound property. The framework augments graphic design tools with new concepts and operators, and describes the structure and generation of a variety of visualizations. Based on the framework, we design and implement a visualization authoring system. The system extends interaction techniques in modern vector design tools for direct manipulation of visualization configurations and parameters. We demonstrate the expressive power of our approach through a variety of examples. A qualitative study shows that designers can use our framework to compose visualizations. 
Data Impartment  
Data Journalism / Data Driven Journalism  Data-driven journalism, often shortened to “ddj”, is a term in use since 2009/2010, to describe a journalistic process based on analyzing and filtering large data sets for the purpose of creating a news story. Main drivers for this process are newly available resources such as “open source” software and “open data”. This approach to journalism builds on older practices, most notably on CAR (acronym for “computer-assisted reporting”), a label used mainly in the US for decades. Other labels for partially similar approaches are “precision journalism”, based on a book by Philipp Meyer, published in 1972, where he advocated the use of techniques from social sciences in researching stories. 
Data Justice  As a handful of data platforms generate massive amounts of user data, the barriers to entry rise since potential competitors have little data themselves to entice advertisers compared to the incumbents who have both the concentrated processing power and supply of user data to dominate particular sectors. The upshot of this market power by big data platforms is that the marketplace is doing little to create options for consumers that might alleviate the misuse of consumer data or encourage big data platforms to better compensate users who are willing to share their data. Data Justice has been launched as a project to promote public education and new alliances to challenge the danger of big data to workers, consumers and the public. 
Data Lake  A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question. The term data lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization’s data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop’s cluster nodes of commodity computers. Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried. 
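A toy, in-memory sketch of these ideas follows: a flat store in which every raw object gets a unique identifier and a set of metadata tags, and in which a schema is applied only when the data is read. The `DataLake` class and its method names are invented for illustration and do not correspond to any specific product.

```python
import json
import uuid

class DataLake:
    """Flat store of raw objects, each with a unique ID and metadata tags."""

    def __init__(self):
        self._objects = {}  # object_id -> (raw bytes, tags); no folder hierarchy

    def put(self, raw: bytes, **tags) -> str:
        """Store data in its native format and return its unique identifier."""
        object_id = uuid.uuid4().hex
        self._objects[object_id] = (raw, tags)
        return object_id

    def find(self, **tags):
        """Return IDs of all objects whose metadata matches every given tag."""
        return [
            oid
            for oid, (_, t) in self._objects.items()
            if all(t.get(k) == v for k, v in tags.items())
        ]

    def read(self, object_id, parser):
        """Apply a schema only at query time by parsing the raw bytes."""
        raw, _ = self._objects[object_id]
        return parser(raw)

lake = DataLake()
web_id = lake.put(b'{"clicks": 3}', source="web", format="json")
sensor_id = lake.put(b"1,2,3", source="sensor", format="csv")
```

The store never interprets the bytes it holds; `lake.read(web_id, json.loads)` decides how to interpret them only when a question is asked.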
Data Leakage  Data Leakage is the creation of unexpected additional information in the training data, allowing a model or machine learning algorithm to make unrealistically good predictions. Leakage is a pervasive challenge in applied machine learning, causing models to overstate their generalization performance and often rendering them useless in the real world. It can be caused by human or mechanical error, and can be intentional or unintentional in both cases. 
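One common way leakage arises is in preprocessing, when statistics are computed on the full data set before it is split. The sketch below contrasts a scaler fitted on training and held-out data together with one fitted on the training data alone; the numbers and function names are made up for illustration.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the mean and standard deviation used for standardization."""
    return mean(values), pstdev(values)

def scale(values, scaler):
    """Standardize values with a previously fitted scaler."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Hypothetical split: the held-out value is very different from the training data.
train, test = [1.0, 2.0, 3.0], [10.0]

# Leaky: the scaler has seen the held-out value, so information about the
# test set flows into every training feature.
leaky_scaler = fit_scaler(train + test)

# Clean: the scaler is fitted on training data only and merely applied to test data.
clean_scaler = fit_scaler(train)
clean_test = scale(test, clean_scaler)
```

In the leaky version the training features already encode how extreme the held-out point is, which is information a deployed model would not have.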
Data Learning  Technology is generating a huge and growing availability of observations of diverse nature. This big data is placing data learning as a central scientific discipline. It includes collection, storage, preprocessing, visualization and, essentially, statistical analysis of enormous batches of data. In this paper, we discuss the role of statistics regarding some of the issues raised by big data in this new paradigm and also propose the name of data learning to describe all the activities that allow one to obtain relevant knowledge from this new source of information. 
Data Lineage  Data lineage is generally defined as a kind of data life cycle that includes the data’s origins and where it moves over time. This term can also describe what happens to data as it goes through diverse processes. Data lineage can help with efforts to analyze how information is used and to track key bits of information that serve a particular purpose. How to track and visualize data lineage 
Data Lineage Analysis  “Data lineage is defined as a data life cycle that includes the data’s origins and where it moves over time.” It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources. It also enables replaying specific portions or inputs of the dataflow for stepwise debugging or regenerating lost output. In fact, database systems have already used such information, called data provenance, to address similar validation and debugging challenges. Data provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins. The generated evidence supports essential forensic activities such as data-dependency analysis, error/compromise detection and recovery, and auditing and compliance analysis. “Lineage is a simple type of why-provenance.” 
Data Literacy  A statistical understanding and experience of applying analysis techniques to real data through code and visualization. Data literacy will be the fundamental skill for the 21st century. It’s also extremely easy to learn. The best way to develop this skill is to simply work with datasets. 
Data Mining (DM) 
Data mining (the analysis step of the “Knowledge Discovery in Databases” process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, postprocessing of discovered structures, visualization, and online updating. 
Data Mining Reality Check (DMRC) 
Is a means for ensuring the validity of data mining results. Data Mining Reality Check is technology for giving you the probability distribution against which to compare the best performance from your data mining exercise – that is, the probability distribution of your best network relative to the benchmark, viewed as a random variable generated by a random process in which really there is nothing better than the benchmark. 
Data Normalization  Data normalization is the process of reducing data to its canonical form. For instance, database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. In the field of software security, a common vulnerability is unchecked malicious input. The mitigation for this problem is proper input validation. Before input validation can be performed, the input must be normalized, i.e., any encoding (for instance HTML encoding) must be removed and the input data reduced to a single common character set. 
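The input-normalization step described above can be sketched in Python. This is a minimal illustration rather than a complete defence: the function names and the "letters and digits only" rule are hypothetical, and Unicode NFKC folding is an assumed choice for "a single common character set". Only the standard-library calls `html.unescape` and `unicodedata.normalize` are used.

```python
import html
import unicodedata


def normalize_input(raw: str) -> str:
    """Reduce untrusted text to one canonical form before validating it."""
    previous = None
    text = raw
    # Repeat until nothing changes, so double-encoded input is fully decoded.
    while text != previous:
        previous = text
        # Decode HTML entities, then fold compatibility characters
        # (for example full-width letters) into their plain equivalents.
        text = unicodedata.normalize("NFKC", html.unescape(text))
    return text


def is_safe_username(raw: str) -> bool:
    """Hypothetical validation rule: ASCII letters and digits only."""
    text = normalize_input(raw)
    return text.isascii() and text.isalnum()


if __name__ == "__main__":
    for candidate in ["alice42", "&amp;lt;script&amp;gt;", "ａｄｍｉｎ"]:
        print(repr(candidate), "->", repr(normalize_input(candidate)),
              is_safe_username(candidate))
```

Validating the normalized text rather than the raw text matters because the same dangerous string can arrive in several encodings; checking only the raw form would let the encoded variants through.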
Data Paring  The problem that needs to be more discussed is data paring. The need for this is fairly obvious: data is growing exponentially, and growing your compute data exponentially will require budgets that aren’t realistic. One of the keys to winning at Big Data will be ignoring the noise. As the amount of data increases exponentially, the amount of interesting data doesn’t; I would bet that for most purposes the interesting data added is a tiny percentage of the new data that is added to the overall pool of data. 
Data Partitioning  Data partitioning in data mining is the division of the whole data available into two or three non-overlapping sets: the training set, the validation set, and the test set. If the data set is very large, often only a portion of it is selected for the partitions. Partitioning is normally used when the model for the data at hand is being chosen from a broad set of models. The basic idea of data partitioning is to keep a subset of available data out of analysis, and to use it later for verification of the model. 
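A three-way partition of this kind can be sketched with the Python standard library. The 60/20/20 proportions, the fixed seed and the function name `partition` are illustrative assumptions, not part of the definition above.

```python
import random


def partition(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and cut them into three non-overlapping sets."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],                 # used to fit candidate models
        shuffled[n_train:n_train + n_val],  # used to choose among them
        shuffled[n_train + n_val:],         # held out for final verification
    )


train_set, validation_set, test_set = partition(range(100))

if __name__ == "__main__":
    print(len(train_set), len(validation_set), len(test_set))
```

Because the slices are taken from a single shuffled list, no row can appear in more than one set, which is the property the held-out test set relies on.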
Data Pattern Processing  
Data Plumbing  
Data Preprocessing  Data preprocessing is an important step in the data mining process. The phrase “garbage in, garbage out” is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values (e.g., Income: −100), impossible data combinations (e.g., Sex: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of data come first and foremost before running an analysis. 
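A screening pass for the three kinds of problem named above (out-of-range values, impossible combinations, missing values) might look like the following Python sketch. The records, field names and rules are made up for illustration; real screening rules depend on the data set.

```python
# Hypothetical survey records, one dict per respondent.
records = [
    {"income": 52000, "sex": "Female", "pregnant": "Yes"},
    {"income": -100, "sex": "Male", "pregnant": "No"},
    {"income": 41000, "sex": "Male", "pregnant": "Yes"},
    {"income": None, "sex": "Female", "pregnant": "No"},
]


def problems(record):
    """Return a list of data-quality problems found in one record."""
    found = []
    income = record.get("income")
    if income is None:
        found.append("missing income")
    elif income < 0:
        found.append("income out of range")
    if record.get("sex") == "Male" and record.get("pregnant") == "Yes":
        found.append("impossible combination")
    return found


# Map record index -> problems, keeping only records that have any.
report = {}
for index, record in enumerate(records):
    issues = problems(record)
    if issues:
        report[index] = issues

if __name__ == "__main__":
    for index, issues in report.items():
        print(index, issues)
```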
Data Profiling  Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to: 1. Find out whether existing data can easily be used for other purposes 2. Improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to a category 3. Give metrics on data quality including whether the data conforms to particular standards or patterns 4. Assess the risk involved in integrating data for new applications, including the challenges of joins 5. Assess whether metadata accurately describes the actual values in the source database 6. Understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns. 7. Have an enterprise view of all data, for uses such as master data management where key data is needed, or data governance for improving data quality. 
Data Science  Data science is a buzz word reflecting the application of statistics by advances in computer science. Data science is the study of the generalizable extraction of knowledge from data, yet the key word is science. It incorporates varying elements and builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products. Data Science is not restricted to only big data, although the fact that data is scaling up makes big data an important aspect of data science. 
Data Science Maturity Model (DSMM) 
Many organizations have been underwhelmed by the return on their investment in data science. This is due to a narrow focus on tools, rather than a broader consideration of how data science teams work and how they fit within the larger organization. To help data science practitioners and leaders identify their existing gaps and direct future investment, Domino has developed a framework called the Data Science Maturity Model (DSMM). The DSMM assesses how reliably and sustainably a data science team can deliver value for their organization. The model consists of four levels of maturity and is split along five dimensions that apply to all analytical organizations. By design, the model is not specific to any given industry – it applies as much to an insurance company as it does to a manufacturer. 
Data Science Virtual Machine (DSVM) 
The Data Science Virtual Machine runs on Windows Server 2012 and contains popular tools for data exploration, modeling and development activities. The main tools included are Microsoft R Server Developer Edition (an enterprise-ready, scalable R framework), Anaconda Python distribution, Julia Pro developer edition, Jupyter notebooks for R, Python and Julia, Visual Studio Community Edition with Python, R and node.js tools, Power BI desktop, and SQL Server 2016 Developer edition, including support for in-database analytics using Microsoft R Server. It also includes open source deep learning tools like Microsoft Cognitive Toolkit (CNTK 2.0) and mxnet, and ML algorithms like xgboost and Vowpal Wabbit. The Azure SDK and libraries on the VM allow you to build your applications using various services in the cloud that are part of the Cortana Analytics Suite, which includes Azure Machine Learning, Azure data factory, Stream Analytics and SQL Datawarehouse, Hadoop, Data Lake, Spark and more. You can deploy models as web services in the cloud on Azure Machine Learning, or deploy them either in the cloud or on-premises using the Microsoft R Server operationalization. 
Data Science-as-a-Service (DSaaS) 
Data Science as a Service is Analyze’s unique approach to providing today’s business and government with the most advanced and efficient big data and data science analytics. With more than 25 years combined experience in data science and cybersecurity, Analyze helps organizations stay ahead of the competition, increase revenue, and improve operational efficiency. 
Data Standardization  When approaching data for modeling, some standard procedures should be used to prepare the data for modeling: 1. First the data should be filtered, and any outliers removed from the data (watch for a future post on how to scrub your raw data removing only legitimate outliers). 2. The data should be normalized or standardized to bring all of the variables into proportion with one another. For example, if one variable is 100 times larger than another (on average), then your model may be better behaved if you normalize/standardize the two variables to be approximately equivalent. Technically though, whether normalized/standardized, the coefficients associated with each variable will scale appropriately to adjust for the disparity in the variable sizes. 
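One common way to bring variables into proportion is z-score standardization: subtract the mean and divide by the standard deviation. The sketch below assumes that technique (the entry above does not prescribe one) and uses made-up `income` and `age` columns whose raw scales differ by roughly three orders of magnitude.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Illustrative columns on very different scales.
income = [30_000, 45_000, 60_000, 75_000]
age = [25, 35, 45, 55]

if __name__ == "__main__":
    print(standardize(income))
    print(standardize(age))
```

After standardization both columns are expressed in units of their own standard deviation, so neither dominates a distance computation or a gradient step merely because of its original units.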
Data Stewardship  In metadata, a data steward is a person that is responsible for maintaining a data element in a metadata registry. A data steward is a broad job role that incorporates processes, policies, guidelines and responsibilities for administering organizations’ entire data in compliance with business and/or regulatory obligations. A data steward’s responsibility stems from an understanding of the business domain and the interaction of business processes with data entities/elements. A data steward ensures that there are documented procedures and guidelines for data access and use. A data steward may share some responsibilities with a data custodian, and work with database/warehouse administrators and other related staff to plan and execute an enterprise-wide data governance, control and compliance policy. Data stewardship roles are common when organizations are attempting to exchange data precisely and consistently between computer systems and reuse data-related resources. Master data management often makes references to the need for data stewardship for its implementation to succeed. 
Data Stream Mining  Data Stream Mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that in many applications of data stream mining can be read only once or a small number of times using limited computing and storage capabilities. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery. In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the data stream. Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion. Often, concepts from the field of incremental learning, a generalization of incremental heuristic search, are applied to cope with structural changes, online learning and real-time demands. In many applications, especially those operating within non-stationary environments, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift. 
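The constraint that a stream can be read only once with limited storage can be illustrated with a single-pass estimator. The sketch below uses Welford's online algorithm for the running mean and variance; it is a generic example of one-pass processing, not a method taken from the entry above, and the class name `RunningStats` and the sample readings are hypothetical.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's online algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summary, then discard it."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of everything seen so far."""
        return self._m2 / self.count if self.count else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    # Any iterator works; each value is visited exactly once.
    for reading in iter([2, 4, 4, 4, 5, 5, 7, 9]):
        stats.update(reading)
    print(stats.count, stats.mean, stats.variance)
```

The summary occupies three numbers regardless of how long the stream runs, which is what makes it usable when the raw records cannot be stored.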
Data Structure Graph  A Data Structure Graph is a group of atomic entities that are related to each other, stored in a repository, then moved from one persistence layer to another, rendered as a Graph. 
Data Understanding  In the field of machine learning, data understanding is the practice of getting initial insights in unknown datasets. Such knowledge-intensive tasks require a lot of documentation, which is necessary for data scientists to grasp the meaning of the data. Usually, documentation is separate from the data in various external documents, diagrams, spreadsheets and tools, which causes considerable lookup overhead. Moreover, other supporting applications are not able to consume and utilize such unstructured data. That is why we propose a methodology that uses a single semantic model that interlinks data with its documentation. Hence, data scientists are able to directly look up the connected information about the data by simply following links. Equally, they can browse the documentation which always refers to the data. Furthermore, the model can be used by other approaches providing additional support, like searching, comparing, integrating or visualizing data. To showcase our approach we also demonstrate an early prototype. 
Data Version Control (DVC) 
DVC makes your data science projects reproducible by automatically building a data dependency graph (DAG). Your code and its dependencies can be shared easily through Git, and your data through cloud storage (AWS S3, GCP), in a single DVC environment. 
Data Visualization  Data visualization or data visualisation is viewed by many disciplines as a modern equivalent of visual communication. It is not owned by any one field, but rather finds interpretation across many (e.g. it is viewed as a modern branch of descriptive statistics by some, but also as a grounded theory development tool by others). It involves the creation and study of the visual representation of data, meaning ‘information that has been abstracted in some schematic form, including attributes or variables for the units of information’. A primary goal of data visualization is to communicate information clearly and efficiently to users via the information graphics selected, such as tables and charts. Effective visualization helps users in analyzing and reasoning about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphic (i.e., showing comparisons or showing causality) follows the task. Tables are generally used where users will look up a specific measure of a variable, while charts of various types are used to show patterns or relationships in the data for one or more variables. Data visualization is both an art and a science. The rate at which data is generated has increased, driven by an increasingly information-based economy. Data created by internet activity and an expanding number of sensors in the environment, such as satellites and traffic cameras, are referred to as ‘Big Data’. Processing, analyzing and communicating this data present a variety of ethical and analytical challenges for data visualization. The field of data science and practitioners called data scientists have emerged to help address this challenge. 
Data Warehouse (DW) 
In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a database used for reporting and data analysis. Integrating data from one or more disparate sources creates a central repository of data, a data warehouse (DW). Data warehouses store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons. 
Data2Vis  Rapidly creating effective visualizations using expressive grammars is challenging for users who have limited time and limited skills in statistics and data visualization. Even high-level, dedicated visualization tools often require users to manually select among data attributes, decide which transformations to apply, and specify mappings between visual encoding variables and raw or transformed attributes. In this paper, we introduce Data2Vis, a neural translation model, for automatically generating visualizations from given datasets. We formulate visualization generation as a sequence-to-sequence translation problem where data specification is mapped to a visualization specification in a declarative language (Vega-Lite). To this end, we train a multilayered Long Short-Term Memory (LSTM) model with attention on a corpus of visualization specifications. Qualitative results show that our model learns the vocabulary and syntax for a valid visualization specification, appropriate transformations (count, bins, mean) and how to use common data selection patterns that occur within data visualizations. Our model generates visualizations that are comparable to manually created visualizations in a fraction of the time, with potential to learn more complex visualization strategies at scale. 
Data-as-a-Service (DaaS) 
Data as a Service, or DaaS, is a cousin of software as a service. Like all members of the “as a Service” (aaS) family, DaaS is based on the concept that the product, data in this case, can be provided on demand to the user regardless of geographic or organizational separation of provider and consumer. Additionally, the emergence of service-oriented architecture (SOA) has rendered the actual platform on which the data resides also irrelevant. This development has enabled the recent emergence of the relatively new concept of DaaS. Data provided as a service was at first primarily used in Web mashups, but now is being increasingly employed both commercially and, less commonly, within organisations such as the UN. 
Data-Driven Threshold Machine (DTM) 
We present a novel distribution-free approach, the data-driven threshold machine (DTM), for a fundamental problem at the core of many learning tasks: choosing a threshold for a given pre-specified level that bounds the tail probability of the maximum of a (possibly dependent but stationary) random sequence. We do not assume a data distribution, but rather rely on the asymptotic distribution of extremal values, and reduce the problem to estimating three parameters of the extreme value distributions and the extremal index. We take special care of data dependence by estimating the extremal index, since in many settings, such as scan statistics, change-point detection, and extreme bandits, dependence in the sequence of statistics can be significant. Key features of our DTM also include robustness and computational efficiency, and it requires only one sample path to form a reliable estimate of the threshold, in contrast to the Monte Carlo sampling approach, which requires drawing a large number of sample paths. We demonstrate the good performance of DTM via numerical examples in various dependent settings. 
Dataflow Matrix Machine (DMM) 
Dataflow matrix machines generalize neural nets by replacing streams of numbers with streams of vectors (or other kinds of linear streams admitting a notion of linear combination of several streams) and adding a few more changes on top of that, namely arbitrary input and output arities for activation functions, countable-sized networks with finite dynamically changeable active part capable of unbounded growth, and a very expressive self-referential mechanism. While recurrent neural networks are Turing-complete, they form an esoteric programming platform, not conducive to practical general-purpose programming. Dataflow matrix machines are more suitable as a general-purpose programming platform, although it remains to be seen whether this platform can be made fully competitive with more traditional programming platforms currently in use. At the same time, dataflow matrix machines retain the key property of recurrent neural networks: programs are expressed via matrices of real numbers, and continuous changes to those matrices produce arbitrarily small variations in the programs associated with those matrices. Spaces of vector-like elements are of particular importance in this context. In particular, we focus on the vector space $V$ of finite linear combinations of strings, which can be also understood as the vector space of finite prefix trees with numerical leaves, the vector space of ‘mixed rank tensors’, or the vector space of recurrent maps. This space, and a family of spaces of vector-like elements derived from it, are sufficiently expressive to cover all cases of interest we are currently aware of, and allow a compact and streamlined version of dataflow matrix machines based on a single space of vector-like elements and variadic neurons. We call elements of these spaces V-values. Their role in our context is somewhat similar to the role of S-expressions in Lisp. 
Dataflow Reuse Algorithms  Distributed Stream Processing Systems (DSPS) like Apache Storm and Spark Streaming enable composition of continuous dataflows that execute persistently over data streams. They are used by Internet of Things (IoT) applications to analyze sensor data from Smart City cyber-infrastructure, and make active utility management decisions. As the ecosystem of such IoT applications that leverage shared urban sensor streams continues to grow, applications will perform duplicate preprocessing and analytics tasks. This offers the opportunity to collaboratively reuse the outputs of overlapping dataflows, thereby improving the resource efficiency. In this paper, we propose dataflow reuse algorithms that, given a submitted dataflow, identify the intersection of reusable tasks and streams from a collection of running dataflows to form a merged dataflow. Similar algorithms to unmerge dataflows when they are removed are also proposed. We implement these algorithms for the popular Apache Storm DSPS, and validate their performance and resource savings for 35 synthetic dataflows based on public OPMW workflows with diverse arrival and departure distributions, and on 21 real IoT dataflows from RIoTBench. 
Dataflow-Flavored Model of Computation (MoC) 
The majority of contemporary mobile devices and personal computers are based on heterogeneous computing platforms that consist of a number of CPU cores and one or more Graphics Processing Units (GPUs). Despite the high volume of these devices, there are few existing programming frameworks that target full and simultaneous utilization of all CPU and GPU devices of the platform. This article presents a dataflow-flavored Model of Computation (MoC) that has been developed for deploying signal processing applications to heterogeneous platforms. The presented MoC is dynamic and allows describing applications with data-dependent run-time behavior. On top of the MoC, formal design rules are presented that enable application descriptions to be simultaneously dynamic and decidable. Decidability guarantees compile-time application analyzability for deadlock freedom and bounded memory. The presented MoC and the design rules are realized in a novel Open Source programming environment ‘PRUNE’ and demonstrated with representative application examples from the domains of image processing, computer vision and wireless communications. Experimental results show that the proposed approach outperforms the state-of-the-art in analyzability, flexibility and performance. 
Datasheets for Datasets  Currently there is no standard way to identify how a dataset was created, and what characteristics, motivations, and potential skews it represents. To begin to address this issue, we propose the concept of a datasheet for datasets, a short document to accompany public datasets, commercial APIs, and pretrained models. The goal of this proposal is to enable better communication between dataset creators and users, and help the AI community move toward greater transparency and accountability. By analogy, in computer hardware, it has become industry standard to accompany everything from the simplest components (e.g., resistors), to the most complex microprocessor chips, with datasheets detailing standard operating characteristics, test results, recommended usage, and other information. We outline some of the questions a datasheet for datasets should answer. These questions focus on when, where, and how the training data was gathered, its recommended use cases, and, in the case of human-centric datasets, information regarding the subjects’ demographics and consent as applicable. We develop prototypes of datasheets for two well-known datasets: Labeled Faces in The Wild and the Pang & Lee Polarity Dataset. 
Data-to-Decisions (D2D) 

Datification / Datafication  A concept that tracks the conception, development, storage and marketing of all types of data, both for business and life. It has grown in popularity of late to capture how data measures things and organizations in order to compete and win. It is about making business visible. 
Dato  Dato (formerly known as GraphLab): Your app drives business. From inspiration to production, build intelligent apps fast with the power of Dato’s machine learning platform. Data science at scale has never been easier. 
Dawid-Skene Algorithm (DSA) 
More and more online communities classify contributions based on collaborative ratings of these contributions. A popular method for such a rating-based classification is the Dawid-Skene algorithm (DSA). However, despite its popularity, DSA has two major shortcomings: (1) It is vulnerable to raters with a low competence, i.e., a low probability of rating correctly. (2) It is defenseless against collusion attacks. In a collusion attack, raters coordinate to rate the same data objects with the same value to artificially increase their remuneration. Error Rate Analysis of Labeling by Crowdsourcing Fast Dawid-Skene 
DAWNBENCH  DAWNBench is a benchmark suite for end-to-end deep learning training and inference. Computation time and cost are critical resources in building deep models, yet many existing benchmarks focus solely on model accuracy. DAWNBench provides a reference set 
DC.js  dc.js is a javascript charting library with native crossfilter support and allowing highly efficient exploration on large multidimensional dataset (inspired by crossfilter’s demo). It leverages d3 engine to render charts in css friendly svg format. Charts rendered using dc.js are naturally data driven and reactive therefore providing instant feedback on user’s interaction. The main objective of this project is to provide an easy yet powerful javascript library which can be utilized to perform data visualization and analysis in browser as well as on mobile device. 
DCDistance  Text Mining is a field that aims at extracting information from textual data. One of the challenges of this field of study comes from the preprocessing stage, in which a vector (and structured) representation should be extracted from unstructured data. The common extraction creates large and sparse vectors representing the importance of each term to a document. As such, this usually leads to the curse of dimensionality that plagues most machine learning algorithms. To cope with this issue, in this paper we propose a new supervised feature extraction and reduction algorithm, named DCDistance, that creates features based on the distance between a document and a representative of each class label. As such, the proposed technique can reduce the feature set by more than 99% of the original set. Additionally, this algorithm was also capable of improving the classification accuracy over a set of benchmark datasets when compared to traditional and state-of-the-art feature selection algorithms. 
DCM Bandits  Search engines recommend a list of web pages. The user examines this list, from the first page to the last, and may click on multiple attractive pages. This type of user behavior can be modeled by the dependent click model (DCM). In this work, we propose DCM bandits, an online learning variant of the DCM model where the objective is to maximize the probability of recommending a satisfactory item. The main challenge of our problem is that the learning agent does not observe the reward. It only observes the clicks. This imbalance between the feedback and rewards makes our setting challenging. We propose a computationally efficient learning algorithm for our problem, which we call dcmKL-UCB; derive gap-dependent upper bounds on its regret under reasonable assumptions; and prove a matching lower bound up to logarithmic factors. We experiment with dcmKL-UCB on both synthetic and real-world problems. Our algorithm outperforms a range of baselines and performs well even when our modeling assumptions are violated. To the best of our knowledge, this is the first regret-optimal online learning algorithm for learning to rank with multiple clicks in a cascade-like model. 
De Bruijn Entropy  De Bruijn entropy and string similarity 
Debagging  It is easy to convert a sentence into a bag of words, but it is much harder to convert a bag of words into a meaningful sentence. We name the latter the debagging problem. 
De-Biased Sparse PCA  Sparse principal component analysis (sPCA) has become one of the most widely used techniques for dimensionality reduction in high-dimensional datasets. The main challenge underlying sPCA is to estimate the first vector of loadings of the population covariance matrix, provided that only a certain number of loadings are non-zero. In this paper, we propose confidence intervals for individual loadings and for the largest eigenvalue of the population covariance matrix. Given an independent sample $X^i \in \mathbb{R}^p$, $i = 1,\ldots,n$, generated from an unknown distribution with an unknown covariance matrix $\Sigma_0$, our aim is to estimate the first vector of loadings and the largest eigenvalue of $\Sigma_0$ in a setting where $p \gg n$. Next to the high dimensionality, another challenge lies in the inherent non-convexity of the problem. We base our methodology on a Lasso-penalized M-estimator which, despite non-convexity, may be solved by a polynomial-time algorithm such as coordinate or gradient descent. We show that our estimator achieves the minimax optimal rates in $\ell_1$- and $\ell_2$-norm. We identify the bias in the Lasso-based estimator and propose a de-biased sparse PCA estimator for the vector of loadings and for the largest eigenvalue of the covariance matrix $\Sigma_0$. Our main results provide theoretical guarantees for asymptotic normality of the de-biased estimator. The major conditions we impose are sparsity in the first eigenvector of small order $\sqrt{n}/\log p$ and sparsity of the same order in the columns of the inverse Hessian matrix of the population risk. 
Decentralized High-Dimensional Bayesian Optimization (DEC-HBO) 
This paper presents a novel decentralized high-dimensional Bayesian optimization (DEC-HBO) algorithm that, in contrast to existing HBO algorithms, can exploit the interdependent effects of various input components on the output of the unknown objective function f for boosting the BO performance and still preserve scalability in the number of input dimensions without requiring prior knowledge or the existence of a low (effective) dimension of the input space. To realize this, we propose a sparse yet rich factor graph representation of f to be exploited for designing an acquisition function that can be similarly represented by a sparse factor graph and hence be efficiently optimized in a decentralized manner using distributed message passing. Despite richly characterizing the interdependent effects of the input components on the output of f with a factor graph, DEC-HBO can still guarantee no-regret performance asymptotically. Empirical evaluation on synthetic and real-world experiments (e.g., sparse Gaussian process model with 1811 hyperparameters) shows that DEC-HBO outperforms the state-of-the-art HBO algorithms. 
Decision Analysis (DA) 
Decision analysis (DA) is the discipline comprising the philosophy, theory, methodology, and professional practice necessary to address important decisions in a formal manner. Decision analysis includes many procedures, methods, and tools for identifying, clearly representing, and formally assessing important aspects of a decision, for prescribing a recommended course of action by applying the maximum expected utility action axiom to a well-formed representation of the decision, and for translating the formal representation of a decision and its corresponding recommendation into insight for the decision maker and other stakeholders. 
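The maximum expected utility rule mentioned above can be illustrated with a minimal Python sketch. The actions, probabilities, and utilities are invented for illustration and are not taken from any source; the function names `expected_utility` and `best_action` are likewise hypothetical.

```python
# Hypothetical decision problem: every action maps to a list of
# (probability, utility) pairs. All numbers are invented for illustration.

def expected_utility(outcomes):
    """Probability-weighted sum of utilities for one action."""
    return sum(p * u for p, u in outcomes)

def best_action(decision):
    """Return the action whose expected utility is largest."""
    return max(decision, key=lambda action: expected_utility(decision[action]))

decision = {
    "launch_now": [(0.6, 100.0), (0.4, -50.0)],
    "run_pilot": [(0.9, 40.0), (0.1, -10.0)],
    "do_nothing": [(1.0, 0.0)],
}

for action, outcomes in decision.items():
    print(action, expected_utility(outcomes))
print("recommended:", best_action(decision))
```

In a real analysis the hard part is eliciting the probabilities and utilities; the final selection step is as simple as shown here.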
Decision Model and Notation (DMN) 
The primary goal of DMN is to provide an industry standard modelling notation for decision management and business rules that is readily understandable by all business users: from the business analysts who need to create initial decision requirements and then more detailed decision models, to the technical developers responsible for automating the decisions in processes, and finally, to the business people who will manage and monitor those decisions. The submission has been designed to be complementary to and useable alongside the OMG Business Process Model & Notation (BPMN) standard and will ensure that decision models are interchangeable across organizations. 
Decision Scientist  Decision Scientists build decision support tools to enable decision makers to make decisions, or take action, under uncertainty with a data-centric bias. Traditional analytics falls under this domain. Often decision makers like linear solutions that provide simple, explainable, socializable decision making frameworks. That is, they are looking for a rationale. Data Scientists build machines to make decisions about large-scale complex dynamical processes that are typically too fast (velocity, veracity, volume, etc.) for a human operator/manager. They typically don’t concern themselves with whether the algorithm is explainable or socializable, but are more concerned with whether it is functional, reliable, accurate, and robust. 
Decision Stream  Various modifications of decision trees have been extensively used during the past years due to their high efficiency and interpretability. Selection of relevant features for splitting the tree nodes is a key property of their architecture, and at the same time their major shortcoming: the recursive partitioning of nodes leads to a geometric reduction of data quantity in the leaf nodes, which causes excessive model complexity and data overfitting. In this paper, we present a novel architecture, a Decision Stream, aimed at overcoming this problem. Instead of building an acyclic tree structure during the training process, we propose merging nodes from different branches based on their similarity, which is estimated with two-sample test statistics. To evaluate the proposed solution, we test it on several common machine learning problems: credit scoring, Twitter sentiment analysis, aircraft flight control, MNIST and CIFAR image classification, and synthetic data classification and regression. Our experimental results reveal that the proposed approach significantly outperforms the standard decision tree method on both regression and classification tasks, yielding a prediction error decrease of up to 35%. 
Decision Stump  A decision stump is a machine learning model consisting of a one-level decision tree. That is, it is a decision tree with one internal node (the root) which is immediately connected to the terminal nodes (its leaves). A decision stump makes a prediction based on the value of just a single input feature. Decision stumps are sometimes also called 1-rules. 
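As a rough illustration of the definition above, the following Python sketch fits a decision stump for binary labels by exhaustively trying every feature and threshold and keeping the split with the fewest training errors. The function names and the toy data are assumptions made for this example; practical implementations usually select splits with an impurity measure instead.

```python
def fit_stump(X, y):
    """Pick the single (feature, threshold) split with the fewest training errors.

    X: list of equal-length lists of numbers; y: list of 0/1 labels.
    """
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            # Try both label assignments for the two leaves.
            for left, right in ((0, 1), (1, 0)):
                errors = sum(
                    (left if row[j] <= t else right) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, left, right)
    _, j, t, left, right = best
    return {"feature": j, "threshold": t, "left": left, "right": right}

def predict_stump(stump, row):
    """One comparison at the root decides which leaf label is returned."""
    if row[stump["feature"]] <= stump["threshold"]:
        return stump["left"]
    return stump["right"]

# Toy data: the label depends only on the first feature.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = [0, 0, 1, 1]
stump = fit_stump(X, y)
print(stump, predict_stump(stump, [2.5, 9.0]))
```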
Decision Support (DS) 
The term Decision Support (DS) is used often and in a variety of contexts related to decision making. Recently, for example, it is often mentioned in connection with Data Warehouses and On-Line Analytical Processing (OLAP). Another recent trend is to associate DS with Data Mining. This is the case in the project SolEuNet, which attempts to exploit these two approaches in a complementary way in order to support difficult real-life problem solving. Unfortunately, although the term ‘Decision Support’ seems rather intuitive and simple, it is in fact very loosely defined. It means different things to different people and in different contexts. Also, its meaning has shifted in recent history. Nowadays, DS is probably most often associated with Data Warehouses and OLAP. A decade ago, it was coupled with Decision Support Systems (DSS). Before that, there was a close link with Operations Research (OR) and Decision Analysis (DA). This causes a lot of confusion and misunderstanding, and provokes requests for clarification. The confusion is further exemplified by the multitude of related terms and acronyms that are either equal to, or start with ‘DS’: Decision Support, Decision Sciences, Decision Systems, Decision Support Systems, etc. This paper attempts to clarify these issues. We take the viewpoint that Decision Support is a broad, generic term that encompasses all aspects related to supporting people in making decisions. First, we present the results of a survey of WWW documents related to DS. On this basis, and on the basis of relevant literature and our previous experience in the field of DS, we provide a classification of DS and related disciplines. DS itself is given a role within Decision Making and Decision Sciences. Some of the most prominent DS disciplines are briefly overviewed: Operations Research, Decision Analysis, Decision Support Systems, Data Warehousing and OLAP, and Group Decision Support. 
Decision Support System (DSS) 
A Decision Support System (DSS) is a computer-based information system that supports business or organizational decision-making activities. DSSs serve the management, operations, and planning levels of an organization (usually mid and higher management) and help to make decisions, which may be rapidly changing and not easily specified in advance (unstructured and semi-structured decision problems). Decision support systems can be fully computerized, human-powered, or a combination of both. While academics have perceived DSS as a tool to support the decision making process, DSS users see DSS as a tool to facilitate organizational processes. Some authors have extended the definition of DSS to include any system that might support decision making. Sprague (1980) defines DSS by its characteristics: 1. DSS tends to be aimed at the less well structured, underspecified problems that upper level managers typically face; 2. DSS attempts to combine the use of models or analytic techniques with traditional data access and retrieval functions; 3. DSS specifically focuses on features which make it easy to use by non-computer people in an interactive mode; and 4. DSS emphasizes flexibility and adaptability to accommodate changes in the environment and the decision making approach of the user. DSSs include knowledge-based systems. A properly designed DSS is an interactive software-based system intended to help decision makers compile useful information from a combination of raw data, documents, personal knowledge, and business models in order to identify and solve problems and make decisions. Typical information that a decision support application might gather and present includes: · inventories of information assets (including legacy and relational data sources, cubes, data warehouses, and data marts), · comparative sales figures between one period and the next, · projected revenue figures based on product sales assumptions. 
Decision Theory  Decision theory or theory of choice in economics, psychology, philosophy, mathematics, computer science, and statistics is concerned with identifying the values, uncertainties and other issues relevant in a given decision, its rationality, and the resulting optimal decision. It is closely related to the field of game theory; decision theory is concerned with the choices of individual agents whereas game theory is concerned with interactions of agents whose decisions affect each other. 
Decision Tree Based Missing Value Imputation Technique (DMI) 
The ‘Decision tree based Missing value Imputation technique’ (DMI) makes use of an EM algorithm and a decision tree (DT) algorithm. 
Decision Tree Learning / Classification and Regression Trees (CART) 
Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item’s target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. 
Declarative Statistics  In this work we introduce declarative statistics, a suite of declarative modelling tools for statistical analysis. Statistical constraints represent the key building block of declarative statistics. First, we introduce a range of relevant counting and matrix constraints and associated decompositions, some of which are novel, that are instrumental in the design of statistical constraints. Second, we introduce a selection of novel statistical constraints and associated decompositions, which constitute a self-contained toolbox that can be used to tackle a wide range of problems typically encountered by statisticians. Finally, we deploy these statistical constraints to a wide range of application areas drawn from classical statistics and we contrast our framework against established practices. 
Deconvolutional Paragraph Representation Learning  Learning latent representations from long text sequences is an important first step in many natural language processing applications. Recurrent Neural Networks (RNNs) have become a cornerstone for this challenging task. However, the quality of sentences during RNN-based decoding (reconstruction) decreases with the length of the text. We propose a sequence-to-sequence, purely convolutional and deconvolutional autoencoding framework that is free of the above issue, while also being computationally efficient. The proposed method is simple, easy to implement and can be leveraged as a building block for many applications. We show empirically that compared to RNNs, our framework is better at reconstructing and correcting long paragraphs. Quantitative evaluation on semi-supervised text classification and summarization tasks demonstrates the potential for better utilization of long unlabeled text data. 
Decoupled Learning  Incorporating encoding-decoding nets with adversarial nets has been widely adopted in image generation tasks. We observe that the state-of-the-art achievements were obtained by carefully balancing the reconstruction loss and adversarial loss, and such balance shifts with different network structures, datasets, and training strategies. Empirical studies have demonstrated that an inappropriate weight between the two losses may cause instability, and it is tricky to search for the optimal setting, especially when lacking prior knowledge on the data and network. This paper makes a first attempt to relax the need for manual balancing by proposing the concept of decoupled learning, where a novel network structure is designed that explicitly disentangles the backpropagation paths of the two losses. Experimental results demonstrate the effectiveness, robustness, and generality of the proposed method. The other contribution of the paper is the design of a new evaluation metric to measure the image quality of generative models. We propose the so-called normalized relative discriminative score (NRDS), which introduces the idea of relative comparison, rather than providing absolute estimates like existing metrics. 
Decoupled Network  Inner product-based convolution has been a central component of convolutional neural networks (CNNs) and the key to learning visual representations. Inspired by the observation that CNN-learned features are naturally decoupled, with the norm of features corresponding to the intra-class variation and the angle corresponding to the semantic difference, we propose a generic decoupled learning framework which models the intra-class variation and semantic difference independently. Specifically, we first reparametrize the inner product to a decoupled form and then generalize it to the decoupled convolution operator which serves as the building block of our decoupled networks. We present several effective instances of the decoupled convolution operator. Each decoupled operator is well motivated and has an intuitive geometric interpretation. Based on these decoupled operators, we further propose to directly learn the operator from data. Extensive experiments show that such decoupled reparameterization renders significant performance gain with easier convergence and stronger robustness. 
Deducer  An R Graphical User Interface (GUI) for Everyone: Deducer is designed to be a free, easy-to-use alternative to proprietary data analysis software such as SPSS, JMP, and Minitab. It has a menu system for common data manipulation and analysis tasks, and an Excel-like spreadsheet in which to view and edit data frames. The goal of the project is twofold. 1. Provide an intuitive graphical user interface (GUI) for R, encouraging non-technical users to learn and perform analyses without programming getting in their way. 2. Increase the efficiency of expert R users when performing common tasks by replacing hundreds of keystrokes with a few mouse clicks. Also, as much as possible the GUI should not get in their way if they just want to do some programming. Deducer is designed to be used with the Java-based R console JGR, though it supports a number of other R environments (e.g. Windows RGUI and RTerm). 
Deductive Reasoning  Deductive reasoning, also called deductive logic or logical deduction, is the process of reasoning from one or more statements (premises) to reach a logically certain conclusion. Deductive reasoning goes in the same direction as that of the conditionals, and links premises with conclusions. If all premises are true, the terms are clear, and the rules of deductive logic are followed, then the conclusion reached is necessarily true. Deductive reasoning (‘top-down logic’) contrasts with inductive reasoning (‘bottom-up logic’) in the following way: in deductive reasoning, a conclusion is reached reductively by applying general rules which hold over the entirety of a closed domain of discourse, narrowing the range under consideration until only the conclusion(s) is left. In inductive reasoning, the conclusion is reached by generalizing or extrapolating from specific cases to general rules, i.e., there is epistemic uncertainty. However, the inductive reasoning mentioned here is not the same as induction used in mathematical proofs – mathematical induction is actually a form of deductive reasoning. Deductive reasoning also differs from abductive reasoning in the direction of the reasoning relative to the conditionals: deductive reasoning goes in the same direction as the conditionals, whereas abductive reasoning goes in the opposite direction. 
Deductron  The current paper is a study in Recurrent Neural Networks (RNN), motivated by the lack of examples simple enough so that they can be thoroughly understood theoretically, but complex enough to be realistic. We constructed an example of structured data, motivated by problems from imagetotext conversion (OCR), which requires longterm memory to decode. Our data is a simple writing system, encoding characters ‘X’ and ‘O’ as their upper halves, which is possible due to symmetry of the two characters. The characters can be connected, as in some languages using cursive, such as Arabic (abjad). The string ‘XOOXXO’ may be encoded as ‘${\vee}{\wedge}\kern1.5pt{\wedge}{\vee}\kern1.5pt{\vee}{\wedge}$’. It follows that we may need to know arbitrarily long past to decode a current character, thus requiring longterm memory. Subsequently we constructed an RNN capable of decoding sequences encoded in this manner. Rather than by training, we constructed our RNN ‘by inspection’, i.e. we guessed its weights. This involved a sequence of steps. We wrote a conventional program which decodes the sequences as the example above. Subsequently, we interpreted the program as a neural network (the only example of this kind known to us). Finally, we generalized this neural network to discover a new RNN architecture whose instance is our handcrafted RNN. It turns out to be a 3 layer network, where the middle layer is capable of performing simple logical inferences; thus the name ‘deductron’. It is demonstrated that it is possible to train our network by simulated annealing. Also, known variants of stochastic gradient descent (SGD) methods are shown to work. 
Deduplication with Hadoop (Dedoop) 
Entity Matching for Big Data: Automatically matching entities (objects) and ontologies are key technologies to semantically integrate heterogeneous data. These match techniques are needed to identify equivalent data objects (duplicates) or semantically equivalent metadata elements (ontology concepts, schema attributes). The proposed techniques demand very high resources that limit their applicability to large-scale (Big Data) problems unless a powerful cloud infrastructure can be utilized. This is because the (fuzzy) match approaches basically have quadratic complexity, since all elements to be matched are compared with each other. For sufficient match quality, multiple match algorithms need to be applied and combined within so-called match workflows, adding further resource requirements as well as a significant optimization problem to select matchers and configure their combination. 
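The quadratic cost described above is easy to see in a naive single-machine sketch: every record is compared with every other record, giving n(n-1)/2 comparisons. The Python example below uses the standard-library `difflib.SequenceMatcher` as a stand-in fuzzy matcher; the records, the 0.8 threshold, and the function name `match_pairs` are assumptions for illustration and do not describe how Dedoop itself is implemented.

```python
from difflib import SequenceMatcher
from itertools import combinations

def match_pairs(records, threshold=0.8):
    """Compare all pairs of records; return (matches, number_of_comparisons)."""
    matches, comparisons = [], 0
    for a, b in combinations(records, 2):
        comparisons += 1
        # Case-insensitive string similarity in [0, 1].
        similarity = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if similarity >= threshold:
            matches.append((a, b))
    return matches, comparisons

records = ["Apple iPhone 12", "apple iphone 12", "Samsung Galaxy S21", "Nokia 3310"]
matches, comparisons = match_pairs(records)
print(matches, comparisons)
```

Doubling the number of records roughly quadruples the number of comparisons, which is why such workloads are typically partitioned and distributed.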
Deep Abstract Q-Network  We examine the problem of learning and planning on high-dimensional domains with long horizons and sparse rewards. Recent approaches have shown great successes in many Atari 2600 domains. However, domains with long horizons and sparse rewards, such as Montezuma’s Revenge and Venture, remain challenging for existing methods. Methods using abstraction (Dietterich 2000; Sutton, Precup, and Singh 1999) have been shown to be useful in tackling long-horizon problems. We combine recent techniques of deep reinforcement learning with existing model-based approaches using an expert-provided state abstraction. We construct toy domains that elucidate the problem of long horizons, sparse rewards and high-dimensional inputs, and show that our algorithm significantly outperforms previous methods on these domains. Our abstraction-based approach outperforms Deep Q-Networks (Mnih et al. 2015) on Montezuma’s Revenge and Venture, and exhibits backtracking behavior that is absent from previous methods. 
Deep Alignment Network (DAN) 
In this paper, we propose Deep Alignment Network (DAN), a robust face alignment method based on a deep neural network architecture. DAN consists of multiple stages, where each stage improves the locations of the facial landmarks estimated by the previous stage. Our method uses entire face images at all stages, contrary to the recently proposed face alignment methods that rely on local patches. This is possible thanks to the use of landmark heatmaps which provide visual information about landmark locations estimated at the previous stages of the algorithm. The use of entire face images rather than patches allows DAN to handle face images with large variation in head pose and difficult initializations. An extensive evaluation on two publicly available datasets shows that DAN reduces the state-of-the-art failure rate by up to 70%. Our method has also been submitted for evaluation as part of the Menpo challenge. 
Deep Appearance map (DAM) 
We propose a deep representation of appearance, i.e., the relation of color, surface orientation, viewer position, material and illumination. Previous approaches have used deep learning to extract classic appearance representations relating to reflectance model parameters (e.g., Phong) or illumination (e.g., HDR environment maps). We suggest directly representing appearance itself as a network we call a deep appearance map (DAM). This is a 4D generalization of 2D reflectance maps, which hold the view direction fixed. First, we show how a DAM can be learned from images or video frames and later be used to synthesize appearance, given new surface orientations and viewer positions. Second, we demonstrate how another network can be used to map from an image or video frames to a DAM network to reproduce this appearance, without using a lengthy optimization such as stochastic gradient descent (learning-to-learn). Finally, we generalize this to an appearance estimation-and-segmentation task, where we map from an image showing multiple materials to multiple networks reproducing their appearance, as well as per-pixel segmentation. 
Deep Approximately Orthogonal Nonnegative Matrix Factorization  Nonnegative Matrix Factorization (NMF) is a widely used technique for data representation. Inspired by the expressive power of deep learning, several NMF variants equipped with deep architectures have been proposed. However, these methods mostly rely only on nonnegativity while ignoring task-specific features of data. In this paper, we propose a novel deep approximately orthogonal nonnegative matrix factorization method where both nonnegativity and orthogonality are imposed, with the aim of performing hierarchical clustering by using different levels of abstraction of the data. Experiments on two face image datasets showed that the proposed method achieved better clustering performance than other deep matrix factorization methods and state-of-the-art single-layer NMF variants. 
Deep Asymmetric Multitask Feature Learning (Deep-AMTFL) 
We propose Deep Asymmetric Multitask Feature Learning (Deep-AMTFL) which can learn deep representations shared across multiple tasks while effectively preventing negative transfer that may happen in the feature sharing process. Specifically, we introduce an asymmetric autoencoder term that allows predictors for the confident tasks to have high contribution to the feature learning while suppressing the influences of less confident task predictors. This allows learning less noisy representations, and allows weak predictors to exploit knowledge from the strong predictors via the shared latent features. Such asymmetric knowledge transfer through shared features is also more scalable and efficient than inter-task asymmetric transfer. We validate our Deep-AMTFL model on multiple benchmark datasets for multitask learning and image classification, on which it significantly outperforms existing symmetric and asymmetric multitask learning models, by effectively preventing negative transfer in deep feature learning. 
Deep Attention GAN (DAGAN) 
Unsupervised image translation, which aims to translate between two independent sets of images, faces the challenge of discovering the correct correspondences without paired data. Existing works build upon the Generative Adversarial Network (GAN) such that the distribution of the translated images is indistinguishable from the distribution of the target set. However, such set-level constraints cannot learn the instance-level correspondences (e.g., aligned semantic parts in an object configuration task). This limitation often results in false positives (e.g., geometric or semantic artifacts), and further leads to the mode collapse problem. To address the above issues, we propose a novel framework for instance-level image translation by Deep Attention GAN (DAGAN). Such a design enables DAGAN to decompose the task of translating samples from two sets into translating instances in a highly structured latent space. Specifically, we jointly learn a deep attention encoder, and the instance-level correspondences can consequently be discovered by attending to the learned instance pairs. Therefore, the constraints can be exploited on both the set level and the instance level. Comparisons against several state-of-the-art methods demonstrate the superiority of our approach, and its broad application capability (e.g., pose morphing, data augmentation) pushes the margin of the domain translation problem. 
Deep Auto-Encoder and Q-Network (DAQN) 
Deep reinforcement learning usually requires a large number of training images and executed actions to obtain sufficient results. When it is extended to a real task in a real environment with an actual robot, the method requires even more training images because of the complexity and noise of the input images, and executing a large number of actions on the real robot also becomes a serious problem. Therefore, we propose an extended deep reinforcement learning method that applies a generative model to initialize the network, reducing the number of training trials. In this paper, we used a deep Q-network as the deep reinforcement learning method and a deep auto-encoder as the generative model. We conducted experiments on three different tasks: a cart-pole game, an Atari game, and a real game with an actual robot. The proposed method trained more efficiently than the previous method on all tasks, and in particular 2.5 times faster on a task with real environment images. 
Deep Back-Projection Network (DBPN) 
The feed-forward architectures of recently proposed deep super-resolution networks learn representations of low-resolution inputs, and the non-linear mapping from those to high-resolution output. However, this approach does not fully address the mutual dependencies of low- and high-resolution images. We propose Deep Back-Projection Networks (DBPN), which exploit iterative up- and down-sampling layers, providing an error feedback mechanism for projection errors at each stage. We construct mutually connected up- and down-sampling stages, each of which represents different types of image degradation and high-resolution components. We show that extending this idea to allow concatenation of features across up- and down-sampling stages (Dense DBPN) allows us to further improve super-resolution, yielding superior results and in particular establishing new state-of-the-art results for large scaling factors such as 8x across multiple data sets. 
Deep Bayesian Active Semi-Supervised Learning  In many applications the process of generating label information is expensive and time consuming. We present a new method that combines active and semi-supervised deep learning to achieve high generalization performance from a deep convolutional neural network with as few known labels as possible. In a setting where a small amount of labeled data as well as a large amount of unlabeled data is available, our method first learns the labeled data set. This initialization is followed by an expectation maximization algorithm, where further training reduces classification entropy on the unlabeled data by targeting a low entropy fit which is consistent with the labeled data. In addition, the algorithm queries an oracle at a specified frequency for the labels of data with entropy above a certain entropy quantile. Using this active learning component we obtain an agile labeling process that achieves high accuracy, but requires only a small amount of known labels. For the MNIST dataset we report an error rate of 2.06% using only 300 labels and 1.06% for 1000 labels. These results are obtained without employing any special network architecture or data augmentation. 
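The oracle-query step described above (requesting labels for samples whose predictive entropy lies above a quantile) can be sketched in plain Python as follows. This is only an illustration of entropy-based selection under assumed inputs: the class probabilities are invented, the nearest-rank quantile is one of several possible definitions, and `select_for_oracle` is a hypothetical name rather than the authors' implementation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_oracle(predictions, quantile=0.75):
    """Return indices of samples whose entropy reaches the given quantile."""
    entropies = [entropy(p) for p in predictions]
    ranked = sorted(entropies)
    # Nearest-rank cutoff; an assumption, other quantile rules work as well.
    cutoff = ranked[min(len(ranked) - 1, int(quantile * len(ranked)))]
    return [i for i, e in enumerate(entropies) if e >= cutoff]

# Invented softmax outputs for four unlabeled samples and three classes.
predictions = [
    [0.98, 0.01, 0.01],  # confident
    [0.40, 0.30, 0.30],  # uncertain
    [0.90, 0.05, 0.05],  # fairly confident
    [0.34, 0.33, 0.33],  # close to uniform, most uncertain
]
print(select_for_oracle(predictions))
```

Lowering `quantile` sends more samples to the oracle, trading labeling cost against faster reduction of uncertainty.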
Deep Bayesian Regression Model (DBRM) 
Regression models are used for inference and prediction in a wide range of applications, providing a powerful scientific tool for researchers and analysts from different fields. In many research fields the amount of available data as well as the number of potential explanatory variables is rapidly increasing. Variable selection and model averaging have become extremely important tools for improving inference and prediction. However, linear models are often not sufficient, and the complex relationship between input variables and a response is better described by introducing nonlinearities and complex functional interactions. Deep learning models have been extremely successful in terms of prediction, although they are often difficult to specify and potentially suffer from overfitting. The aim of this paper is to bring the ideas of deep learning into a statistical framework which yields more parsimonious models and allows model uncertainty to be quantified. To this end we introduce the class of deep Bayesian regression models (DBRM) consisting of a generalized linear model combined with a comprehensive nonlinear feature space, where nonlinear features are generated just like in deep learning but combined with variable selection in order to include only important features. DBRM can easily be extended to include latent Gaussian variables to model complex correlation structures between observations, which does not seem to be easily possible with existing deep learning approaches. Two different algorithms based on MCMC are introduced to fit DBRM and to perform Bayesian inference. The predictive performance of these algorithms is compared with a large number of state-of-the-art algorithms. Furthermore, we illustrate how DBRM can be used for model inference in various applications. 
Deep Belief Networks (DBN) 
In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively a type of deep neural network, composed of multiple layers of latent variables (“hidden units”), with connections between the layers but not between units within each layer. When trained on a set of examples in an unsupervised way, a DBN can learn to probabilistically reconstruct its inputs. The layers then act as feature detectors on inputs. After this learning step, a DBN can be further trained in a supervised way to perform classification. DBNs can be viewed as a composition of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, where each sub-network’s hidden layer serves as the visible layer for the next. This also leads to a fast, layer-by-layer unsupervised training procedure, where contrastive divergence is applied to each sub-network in turn, starting from the “lowest” pair of layers (the lowest visible layer being a training set). The observation, due to Hinton’s student Teh, that DBNs can be trained greedily, one layer at a time, has been called a breakthrough in deep learning. 
Deep Broad Learning (DBL) 
Deep learning has demonstrated the power of detailed modeling of complex high-order (multivariate) interactions in data. For some learning tasks there is power in learning models that are not only Deep but also Broad. By Broad, we mean models that incorporate evidence from large numbers of features. This is of particular value in applications where many different features and combinations of features all carry small amounts of information about the class. The most accurate models will integrate all that information. In this paper, we propose an algorithm for Deep Broad Learning called DBL. The proposed algorithm has a tunable parameter $n$ that specifies the depth of the model. It provides straightforward paths towards out-of-core learning for large data. We demonstrate that DBL learns models from large quantities of data with accuracy that is highly competitive with the state of the art. 
Deep Canonical Correlation Analysis (DCCA) 

Deep Coherence Model (DCM) 
In this paper, we propose a novel deep coherence model (DCM) using a convolutional neural network architecture to capture text coherence. The text coherence problem is investigated from a new perspective: learning sentence distributional representations and modeling text coherence simultaneously. In particular, the model captures the interactions between sentences by computing the similarities of their distributional representations. Further, it can be easily trained in an end-to-end fashion. The proposed model is evaluated on a standard Sentence Ordering task. The experimental results demonstrate its effectiveness and promise in coherence assessment, showing a significant improvement over the state of the art by a wide margin. 
Deep Collaborative Autoencoder (DCAE) 
In recent years, deep neural networks have yielded state-of-the-art performance on several tasks. Although some recent works have focused on combining deep learning with recommendation, we highlight three issues of existing works. First, most works perform deep content feature learning and resort to matrix factorization, which cannot effectively model the highly complex user-item interaction function. Second, due to the difficulty of training deep neural networks, existing models utilize a shallow architecture and thus limit the expressive potential of deep learning. Third, neural network models easily overfit in the implicit setting, because negative interactions are not taken into account. To tackle these issues, we present a novel recommender framework called Deep Collaborative Autoencoder (DCAE) for both explicit feedback and implicit feedback, which can effectively capture the relationship between interactions via its non-linear expressiveness. To optimize the deep architecture of DCAE, we develop a three-stage pre-training mechanism that combines supervised and unsupervised feature learning. Moreover, we propose a popularity-based error reweighting module and a sparsity-aware data-augmentation strategy for DCAE to prevent overfitting in the implicit setting. Extensive experiments on three real-world datasets demonstrate that DCAE can significantly advance the state-of-the-art.
Deep Collaborative Weight-Based Classification (DeepCWC)
One of the biggest problems in deep learning is the difficulty of retaining consistent robustness when transferring a model trained on one dataset to another dataset. To overcome this problem, deep transfer learning has been used to execute various vision tasks by applying a pre-trained deep model to a diverse dataset. However, the robustness was often far from state-of-the-art. We propose a collaborative weight-based classification method for deep transfer learning (DeepCWC). The method performs the L2-norm based collaborative representation on the original images, as well as on the deep features extracted by pre-trained deep models. Two distance vectors are obtained from the two sets of representation coefficients and then fused together via the collaborative weight. The two feature sets are complementary, and the original images provide information that compensates for what is missing in the transferred deep model. A series of experiments conducted on both small and large vision datasets demonstrated the robustness of the proposed DeepCWC in both face recognition and object recognition tasks.
Deep Complex Network  At present, the vast majority of building blocks, techniques, and architectures for deep learning are based on real-valued operations and representations. However, recent work on recurrent neural networks and older fundamental theoretical analysis suggests that complex numbers could have a richer representational capacity and could also facilitate noise-robust memory retrieval mechanisms. Despite their attractive properties and potential for opening up entirely new neural architectures, complex-valued deep neural networks have been marginalized due to the absence of the building blocks required to design such models. In this work, we provide the key atomic components for complex-valued deep neural networks and apply them to convolutional feed-forward networks and convolutional LSTMs. More precisely, we rely on complex convolutions and present algorithms for complex batch-normalization and complex weight initialization strategies for complex-valued neural nets, and we use them in experiments with end-to-end training schemes. We demonstrate that such complex-valued models are competitive with their real-valued counterparts. We test deep complex models on several computer vision tasks, on music transcription using the MusicNet dataset and on Speech Spectrum Prediction using the TIMIT dataset. We achieve state-of-the-art performance on these audio-related tasks.
Deep Component Analysis (DeepCA) 
Despite a lack of theoretical understanding, deep neural networks have achieved unparalleled performance in a wide range of applications. On the other hand, shallow representation learning with component analysis is associated with rich intuition and theory, but smaller capacity often limits its usefulness. To bridge this gap, we introduce Deep Component Analysis (DeepCA), an expressive multilayer model formulation that enforces hierarchical structure through constraints on latent variables in each layer. For inference, we propose a differentiable optimization algorithm implemented using recurrent Alternating Direction Neural Networks (ADNNs) that enable parameter learning using standard backpropagation. By interpreting feed-forward networks as single-iteration approximations of inference in our model, we provide both a novel theoretical perspective for understanding them and a practical technique for constraining predictions with prior knowledge. Experimentally, we demonstrate performance improvements on a variety of tasks, including single-image depth prediction with sparse output constraints.
Deep Continuous Clustering  Clustering high-dimensional datasets is hard because inter-point distances become less informative in high-dimensional spaces. We present a clustering algorithm that performs nonlinear dimensionality reduction and clustering jointly. The data is embedded into a lower-dimensional space by a deep autoencoder. The autoencoder is optimized as part of the clustering process. The resulting network produces clustered data. The presented approach does not rely on prior knowledge of the number of ground-truth clusters. Joint nonlinear dimensionality reduction and clustering are formulated as optimization of a global continuous objective. We thus avoid discrete reconfigurations of the objective that characterize prior clustering algorithms. Experiments on datasets from multiple domains demonstrate that the presented algorithm outperforms state-of-the-art clustering schemes, including recent methods that use deep networks.
Deep Convolutional Decision Jungle (CDJ) 
We propose a novel method called deep convolutional decision jungle (CDJ) and its learning algorithm for image classification. The CDJ maintains the structure of standard convolutional neural networks (CNNs), i.e. multiple layers of multiple response maps, fully connected. Each response map, or node, in both the convolutional and fully-connected layers selectively responds to class labels such that each data sample travels via a specific soft route of those activated nodes. The proposed method CDJ automatically learns features, whereas decision forests and jungles require pre-defined feature sets. Compared to CNNs, the method embeds the benefits of using data-dependent discriminative functions, which better handle multi-modal/heterogeneous data; further, the method offers more diverse sparse network responses, which in turn can be used for cost-effective learning/classification. The network is learnt by combining conventional softmax and proposed entropy losses in each layer. The entropy loss, as used in decision tree growing, measures the purity of data activation according to the class label distribution. The back-propagation rule for the proposed loss function is derived from stochastic gradient descent (SGD) optimization of CNNs. We show that our proposed method outperforms state-of-the-art methods on three public image classification benchmarks and one face verification dataset. We also demonstrate the use of auxiliary data labels, when available, which helps our method to learn more discriminative routing and representations and leads to improved classification.
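The purity measure mentioned above can be illustrated with a minimal sketch. It assumes that each sample contributes its non-negative activation at a node as a weight toward its class, and it computes the Shannon entropy of the resulting class distribution; the function name `activation_entropy` is hypothetical, and this is not the exact loss used in the CDJ paper.

```python
import math
from collections import defaultdict


def activation_entropy(activations, labels):
    """Entropy (in bits) of the class distribution at one node.

    Each sample adds its activation to the mass of its class, so a node
    that fires for a single class only has entropy 0 (perfectly pure).
    """
    mass = defaultdict(float)
    for activation, label in zip(activations, labels):
        mass[label] += activation
    total = sum(mass.values())
    if total == 0:
        return 0.0  # a node that never fires carries no class information
    entropy = 0.0
    for class_mass in mass.values():
        if class_mass > 0:
            p = class_mass / total
            entropy -= p * math.log2(p)
    return entropy


labels = ["cat", "cat", "dog", "dog"]
pure = activation_entropy([1.0, 1.0, 0.0, 0.0], labels)   # fires for cats only
mixed = activation_entropy([1.0, 1.0, 1.0, 1.0], labels)  # fires for everything
```

A node that responds to one class only scores 0 bits, while a node that responds equally to two classes scores 1 bit, which is why minimizing such a quantity encourages class-specific routing.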
Deep Convolutional Generative Adversarial Networks (DCGAN) 
In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. 
Deep Convolutional Neural Network (DCN) 
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks such as visual object and speech recognition. The key factor complicating such tasks is the presence of numerous nuisance variables, for instance, the unknown object position, orientation, and scale in object recognition or the unknown voice pronunciation, pitch, and speed in speech recognition. Recently, a new breed of deep learning algorithms has emerged for high-nuisance inference tasks; they are constructed from many layers of alternating linear and nonlinear processing units and are trained using large-scale algorithms and massive amounts of training data. The recent success of deep learning systems is impressive (they now routinely yield pattern recognition systems with near- or super-human capabilities), but a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on a Bayesian generative probabilistic model that explicitly captures variation due to nuisance variables. The graphical structure of the model enables it to be learned from data using classical expectation-maximization techniques. Furthermore, by relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks (DCNs) and random decision forests (RDFs), providing insights into their successes and shortcomings as well as a principled route to their improvement.
Deep Convolutional Sparse Coding (DCSC) 
Deep Convolutional Sparse Coding (DCSC) is a framework reminiscent of deep convolutional neural networks (DCNNs), but by omitting the learning of the dictionaries one can more transparently analyse the role of the activation function and its ability to recover activation paths through the layers. Papyan, Romano, and Elad conducted an analysis of such an architecture, demonstrated the relationship with DCNNs and proved conditions under which the DCSC is guaranteed to recover specific activation paths. A technical innovation of their work highlights that one can view the efficacy of the ReLU nonlinear activation function of a DCNN through a new variant of the tensor’s sparsity, referred to as stripe-sparsity. Using this they proved that representations with an activation density proportional to the ambient dimension of the data are recoverable.
Deep Co-Space (DCS)
Aiming at improving performance of visual classification in a cost-effective manner, this paper proposes an incremental semi-supervised learning paradigm called Deep Co-Space (DCS). Unlike many conventional semi-supervised learning methods, which usually operate within a fixed feature space, our DCS gradually propagates information from labeled samples to unlabeled ones along with deep feature learning. We regard deep feature learning as a series of steps pursuing feature transformation, i.e., projecting the samples from a previous space into a new one, which tends to select the reliable unlabeled samples with respect to this setting. Specifically, for each unlabeled image instance, we measure its reliability by calculating the category variations of feature transformation from two different neighborhood variation perspectives, and merge them into a unified sample mining criterion derived from the Hellinger distance. Then, those samples that keep a stable correlation to their neighboring samples (i.e., having small category variation in distribution) across the successive feature space transformations are automatically assigned labels and incorporated into the model for incremental training in terms of classification. Our extensive experiments on standard image classification benchmarks (e.g., Caltech-256 and SUN-397) demonstrate that the proposed framework is capable of effectively mining from large-scale unlabeled images, which boosts image classification performance and achieves promising results compared to other semi-supervised learning methods.
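For readers unfamiliar with the Hellinger distance mentioned above, the following minimal sketch computes the standard distance between two discrete category distributions. It shows only the textbook definition, not the paper's sample mining criterion; the function name and the example distributions are hypothetical.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions.

    Both inputs are sequences of probabilities over the same categories.
    The result lies in [0, 1]: 0 for identical distributions and 1 for
    distributions with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same categories")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2.0)


# Hypothetical category distributions of one sample's neighborhood,
# measured in two successive feature spaces.
before = [0.7, 0.2, 0.1]
after = [0.6, 0.3, 0.1]
drift = hellinger_distance(before, after)
```

A small value indicates that the category distribution barely changed between the two feature spaces, which corresponds to the "small category variation" that marks a sample as reliable.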
Deep Curiosity Loop (DCL) 
Inspired by infants’ intrinsic motivation to learn, which values informative sensory channels contingent on their immediate social environment, we developed a deep curiosity loop (DCL) architecture. The DCL is composed of a learner, which attempts to learn a forward model of the agent’s state-action transition, and a novel reinforcement-learning (RL) component, namely, an Action-Convolution Deep Q-Network, which uses the learner’s prediction error as reward. The environment for our agent is composed of visual social scenes, composed of sitcom video streams, thereby both the learner and the RL are constructed as deep convolutional neural networks. The agent’s learner learns to predict the zeroth order of the dynamics of visual scenes, resulting in intrinsic rewards proportional to changes within its social environment. The sources of these socially informative changes within the sitcom are predominantly motions of faces and hands, leading to the unsupervised curiosity-based learning of social interaction features. The face and hand detection is represented by the value function and the social interaction optical-flow is represented by the policy. Our results suggest that face and hand detection are emergent properties of curiosity-based learning embedded in social environments.
Deep Data  What we call ‘deep data’ is a combination of experts’ domain knowledge of the area … combined with data science. 
Deep Density Networks (DDN) 
Building robust online content recommendation systems requires learning complex interactions between user preferences and content features. The field has evolved rapidly in recent years from traditional multi-armed bandit and collaborative filtering techniques, with new methods integrating Deep Learning models that make it possible to capture non-linear feature interactions. Despite progress, the dynamic nature of online recommendations still poses great challenges, such as finding the delicate balance between exploration and exploitation. In this paper we provide a novel method, Deep Density Networks (DDN), which deconvolves measurement and data uncertainties and predicts the probability density of CTR (Click Through Rate), enabling us to perform more efficient exploration of the feature space. We show the usefulness of using DDN online in a real-world content recommendation system that serves billions of recommendations per day, and present online and offline results to evaluate the benefit of using DDN.
Deep Differential Recurrent Neural Network (DDRNN) 
Due to the special gating schemes of Long Short-Term Memory (LSTM), LSTMs have shown greater potential to process complex sequential information than the traditional Recurrent Neural Network (RNN). The conventional LSTM, however, fails to take into consideration the impact of salient spatio-temporal dynamics present in the sequential input data. This problem was first addressed by the differential Recurrent Neural Network (dRNN), which uses a differential gating scheme known as Derivative of States (DoS). DoS uses higher orders of internal state derivatives to analyze the change in information gain caused by the salient motions between the successive frames. The weighted combination of several orders of DoS is then used to modulate the gates in dRNN. While each individual order of DoS is good at modeling a certain level of salient spatio-temporal sequences, the sum of all the orders of DoS could distort the detected motion patterns. To address this problem, we propose to control the LSTM gates via individual orders of DoS and stack multiple levels of LSTM cells in an increasing order of state derivatives. The proposed model progressively builds up the ability of the LSTM gates to detect salient dynamical patterns in deeper stacked layers modeling higher orders of DoS, and thus the proposed LSTM model is termed deep differential Recurrent Neural Network (d2RNN). The effectiveness of the proposed model is demonstrated on two publicly available human activity datasets: NUS-HGA and Violent-Flows. The proposed model outperforms both LSTM and non-LSTM based state-of-the-art algorithms.
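As a rough illustration of what "orders of DoS" means, the sketch below approximates the n-th order derivative of a sequence of internal states by repeated finite differences between successive time steps. Treating the derivative as a plain difference of scalar states is an assumption of this sketch, and the function name is hypothetical; it does not reproduce the gating equations of dRNN or d2RNN.

```python
def derivative_of_states(states, order):
    """Approximate the order-th derivative of a state sequence.

    Each pass replaces the sequence by the differences between
    consecutive entries, so the output has len(states) - order entries.
    """
    if order < 0:
        raise ValueError("order must be non-negative")
    sequence = list(states)
    for _ in range(order):
        sequence = [later - earlier
                    for earlier, later in zip(sequence, sequence[1:])]
    return sequence


# A scalar state growing quadratically over five time steps.
states = [0, 1, 4, 9, 16]
first_order = derivative_of_states(states, 1)   # how fast the state changes
second_order = derivative_of_states(states, 2)  # how the change itself changes
```

In the stacked design described above, a lower layer would be driven by a low-order signal such as `first_order` and a deeper layer by a higher-order signal such as `second_order`, rather than by a weighted sum of all orders.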
Deep Directional Statistics  ➘ “Directional Statistics” Deep Directional Statistics: Pose Estimation with Uncertainty Quantification 
Deep Discrete Supervised Hashing (DDSH) 
Hashing has been widely used for large-scale search due to its low storage cost and fast query speed. By using supervised information, supervised hashing can significantly outperform unsupervised hashing. Recently, discrete supervised hashing and deep hashing have been two representative lines of progress in supervised hashing. On one hand, hashing is essentially a discrete optimization problem. Hence, utilizing supervised information to directly guide the discrete (binary) coding procedure can avoid sub-optimal solutions and improve the accuracy. On the other hand, deep hashing, which integrates deep feature learning and hash-code learning into an end-to-end architecture, can enhance the feedback between feature learning and hash-code learning. The key in discrete supervised hashing is to adopt supervised information to directly guide the discrete coding procedure in hashing. The key in deep hashing is to adopt the supervised information to directly guide the deep feature learning procedure. However, no existing work uses the supervised information to directly guide both the discrete coding procedure and the deep feature learning procedure in the same framework. In this paper, we propose a novel deep hashing method, called deep discrete supervised hashing (DDSH), to address this problem. DDSH is the first deep hashing method which can utilize supervised information to directly guide both the discrete coding procedure and the deep feature learning procedure, and thus enhance the feedback between these two important procedures. Experiments on three real datasets show that DDSH can outperform other state-of-the-art baselines, including both discrete hashing and deep hashing baselines, for image retrieval.
Deep Distance Metric Learning (DDML) 
Deep distance metric learning (DDML) is proposed to learn image similarity metrics in an end-to-end manner based on a convolutional neural network.
Deep Echo State Network (deepESN) 
The study of deep recurrent neural networks (RNNs) and, in particular, of deep Reservoir Computing (RC) is gaining increasing research attention in the neural networks community. The recently introduced deep Echo State Network (deepESN) model opened the way to an extremely efficient approach for designing deep neural networks for temporal data. At the same time, the study of deepESNs has shed light on the intrinsic properties of state dynamics developed by hierarchical compositions of recurrent layers, i.e. on the bias of depth in RNN architectural design. In this paper, we summarize the advancements in the development, analysis and applications of deepESNs.
Deep Euclidean Feature Representations through Adaptation on the Grassmann Manifold (DEFRAG) 
We propose a novel technique for training deep networks with the objective of obtaining feature representations that exist in a Euclidean space and exhibit strong clustering behavior. Our desired feature representations have three traits: they can be compared using a standard Euclidean distance metric, samples from the same class are tightly clustered, and samples from different classes are well separated. However, most deep networks do not enforce such feature representations. The DEFRAG training technique consists of two steps: first, good feature clustering behavior is encouraged through an auxiliary loss function based on the Silhouette clustering metric. Then the feature space is retracted onto a Grassmann manifold to ensure that the $L_2$ norm forms a similarity metric. The DEFRAG technique achieves state-of-the-art results on standard classification datasets using a relatively small network architecture with significantly fewer parameters than many standard networks.
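The Silhouette metric that the auxiliary loss is based on can be computed as in the following minimal sketch. It implements the standard (non-differentiable) mean silhouette coefficient with Euclidean distances, not the DEFRAG loss itself; the function name and the toy feature vectors are hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient of labelled points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the distance to itself is 0, so it does not affect the sum).
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        scores.append(0.0 if denominator == 0 else (b - a) / denominator)
    return sum(scores) / len(scores)


features = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = silhouette_score(features, [0, 0, 1, 1])
badly_assigned = silhouette_score(features, [0, 1, 0, 1])
```

Scores close to 1 correspond to the two clustering traits listed above (tight classes, well-separated classes), while negative scores indicate that samples sit closer to another class than to their own.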
Deep Evolutionary Network Structured Representation (DENSER) 
Deep Evolutionary Network Structured Representation (DENSER) is a novel approach to automatically design Artificial Neural Networks (ANNs) using Evolutionary Computation (EC). The algorithm not only searches for the best network topology (e.g., number of layers, type of layers), but also tunes hyper-parameters, such as learning parameters or data augmentation parameters. The automatic design is achieved using a representation with two distinct levels, where the outer level encodes the general structure of the network, i.e., the sequence of layers, and the inner level encodes the parameters associated with each layer. The allowed layers and hyper-parameter value ranges are defined by means of a human-readable Context-Free Grammar. DENSER was used to evolve ANNs for two widely used image classification benchmarks, obtaining an average accuracy of up to 94.27% on the CIFAR-10 dataset and of 78.75% on CIFAR-100. To the best of our knowledge, our CIFAR-100 results are the highest performing models generated by methods that aim at the automatic design of Convolutional Neural Networks (CNNs), and are amongst the best for manually designed and fine-tuned CNNs.
Deep Expander Network (XNet) 
Deep Neural Networks, while being unreasonably effective for several vision tasks, have their usage limited by the computational and memory requirements, both during training and inference stages. Analyzing and improving the connectivity patterns between layers of a network has resulted in several compact architectures like GoogleNet, ResNet and DenseNet-BC. In this work, we utilize results from graph theory to develop an efficient connection pattern between consecutive layers. Specifically, we use expander graphs, which have excellent connectivity properties, to develop a sparse network architecture, the deep expander network (X-Net). The X-Nets are shown to have high connectivity for a given level of sparsity. We also develop highly efficient training and inference algorithms for such networks. Experimental results show that we can achieve similar or better accuracy than DenseNet-BC with two-thirds the number of parameters and FLOPs on several image classification benchmarks. We hope that this work motivates other approaches to utilize results from graph theory to develop efficient network architectures.
Deep Factor Alpha  Deep Factor Alpha provides a framework for extracting nonlinear factor information to explain the time-series cross-section properties of asset returns. Sorting securities based on firm characteristics is viewed as a nonlinear activation function which can be implemented within a deep learning architecture. Multi-layer deep learners are constructed to augment traditional long-short factor models. Searching firm characteristic space over deep architectures of nonlinear transformations is compatible with the economic goal of eliminating mispricing Alphas. Joint estimation of factors and betas is achieved with stochastic gradient descent. To illustrate our methodology, we design long-short latent factors in a train-validation-testing framework of US stock market asset returns from 1975 to 2017. We perform an out-of-sample study to analyze Fama-French factors, in both the cross-section and time-series, versus their deep learning counterparts. Finally, we conclude with directions for future research.
Deep Feature Factorization  We propose Deep Feature Factorization (DFF), a method capable of localizing similar semantic concepts within an image or a set of images. We use DFF to gain insight into a deep convolutional neural network’s learned features, where we detect hierarchical cluster structures in feature space. This is visualized as heat maps, which highlight semantically matching regions across a set of images, revealing what the network ‘perceives’ as similar. DFF can also be used to perform co-segmentation and co-localization, and we report state-of-the-art results on these tasks.
Deep Feature Synthesis (DFS) 
In this paper, we develop the Data Science Machine, which is able to derive predictive models from raw data automatically. To achieve this automation, we first propose and develop the Deep Feature Synthesis algorithm for automatically generating features for relational datasets. The algorithm follows relationships in the data to a base field, and then sequentially applies mathematical functions along that path to create the final feature. Second, we implement a generalizable machine learning pipeline and tune it using a novel Gaussian Copula process based approach. We entered the Data Science Machine in 3 data science competitions that featured 906 other data science teams. Our approach beats 615 teams in these data science competitions. In 2 of the 3 competitions we beat a majority of competitors, and in the third, we achieved 94% of the best competitor’s score. In the best case, with an ongoing competition, we beat 85.6% of the teams and achieved 95.7% of the top submission’s score. Deep Feature Synthesis: How Automated Feature Engineering Works
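The idea of following a relationship to a base field and applying functions along the path can be sketched as follows, assuming a single one-to-many relationship between a hypothetical `customers` table and a hypothetical `orders` table. The table names, the `synthesize_features` helper and the choice of aggregations are illustrative assumptions, not the Data Science Machine's actual implementation.

```python
from statistics import mean

# Aggregation functions applied along the child -> parent relationship.
AGGREGATIONS = {"SUM": sum, "MEAN": mean, "MAX": max, "COUNT": len}


def synthesize_features(parent_ids, child_rows, child_name, key, field):
    """Build one aggregate feature per function for every parent row.

    child_rows is a list of dicts; `key` names the column that points to
    the parent and `field` names the base column being aggregated.
    Parents without any child rows get None for every feature.
    """
    grouped = {}
    for row in child_rows:
        grouped.setdefault(row[key], []).append(row[field])
    features = {}
    for parent_id in parent_ids:
        values = grouped.get(parent_id, [])
        features[parent_id] = {
            f"{name}({child_name}.{field})": fn(values) if values else None
            for name, fn in AGGREGATIONS.items()
        }
    return features


orders = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]
features = synthesize_features([1, 2, 3], orders, "orders", "customer_id", "amount")
```

Stacking such steps across several relationships (for example, aggregating order items into orders and then orders into customers) is what produces the "deep" features referred to in the name.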
Deep Frame Interpolation  This work presents a supervised learning based approach to the computer vision problem of frame interpolation. The presented technique could also be used in cartoon animation, since drawing each individual frame consumes a noticeable amount of time. Most existing solutions to this problem use unsupervised methods and focus only on real-life videos that already have a high frame rate. However, experiments show that such methods do not work as well when the frame rate becomes low and object displacements between frames become large. This is because interpolating large-displacement motion requires knowledge of the motion structure, so simple techniques such as frame averaging start to fail. In this work a deep convolutional neural network is used to solve the frame interpolation problem. In addition, it is shown that incorporating prior information such as optical flow improves the interpolation quality significantly.
Deep Gaussian Covariance Network (DGCP) 
The correlation length scale and the noise variance are the most commonly used hyperparameters of Gaussian processes. Typically, stationary covariance functions are used, which depend only on the distances between input points and are thus invariant to translations in the input space. The hyperparameters are commonly optimized by maximizing the log marginal likelihood. This works quite well if the distances are uniformly distributed. In the case of a locally adapted or even sparse input space, the prediction at a test point can be worse depending on its position. A possible solution is to use a non-stationary covariance function whose hyperparameters are calculated by a deep neural network, so that the correlation length scales and possibly the noise variance depend on the test point. Furthermore, different types of covariance functions are trained simultaneously, so that the Gaussian process prediction is an additive overlay of different covariance matrices. The right combination of covariance functions and its hyperparameters are learned by the deep neural network. Additionally, the Gaussian process can be trained in batches or online, and so it can handle arbitrarily large data sets. We call this framework Deep Gaussian Covariance Network (DGCP). Further extensions to this framework are also possible, for example to sequentially dependent problems such as time series or to local mixtures of experts. The basic framework and some extension possibilities are presented in this work. Moreover, a comparison with some recent state-of-the-art surrogate model methods is performed, including for a time-dependent problem.
Deep Gaussian Mixture Model  Deep learning is a hierarchical inference method formed by subsequent multiple layers of learning able to more efficiently describe complex relationships. In this work, Deep Gaussian Mixture Models are introduced and discussed. A Deep Gaussian Mixture model (DGMM) is a network of multiple layers of latent variables, where, at each layer, the variables follow a mixture of Gaussian distributions. Thus, the deep mixture model consists of a set of nested mixtures of linear models, which globally provide a nonlinear model able to describe the data in a very flexible way. In order to avoid overparameterized solutions, dimension reduction by factor models can be applied at each layer of the architecture thus resulting in deep mixtures of factor analysers. 
Deep Generalized Canonical Correlation Analysis (DGCCA) 
We present Deep Generalized Canonical Correlation Analysis (DGCCA), a method for learning nonlinear transformations of arbitrarily many views of data, such that the resulting transformations are maximally informative of each other. While methods for nonlinear two-view representation learning (Deep CCA, (Andrew et al., 2013)) and linear many-view representation learning (Generalized CCA (Horst, 1961)) exist, DGCCA is the first CCA-style multiview representation learning technique that combines the flexibility of nonlinear (deep) representation learning with the statistical power of incorporating information from many independent sources, or views. We present the DGCCA formulation as well as an efficient stochastic optimization algorithm for solving it. We learn DGCCA representations on two distinct datasets for three downstream tasks: phonetic transcription from acoustic and articulatory measurements, and recommending hashtags and friends on a dataset of Twitter users. We find that DGCCA representations soundly beat existing methods at phonetic transcription and hashtag recommendation, and in general perform no worse than standard linear many-view techniques.
Deep Generative Markov State Model (DeepGenMSM) 
We propose a deep generative Markov State Model (DeepGenMSM) learning framework for inference of metastable dynamical systems and prediction of trajectories. After unsupervised training on time series data, the model contains (i) a probabilistic encoder that maps from high-dimensional configuration space to a small-sized vector indicating the membership to metastable (long-lived) states, (ii) a Markov chain that governs the transitions between metastable states and facilitates analysis of the long-time dynamics, and (iii) a generative part that samples the conditional distribution of configurations in the next time step. The model can be operated in a recursive fashion to generate trajectories to predict the system evolution from a defined starting state and propose new configurations. The DeepGenMSM is demonstrated to provide accurate estimates of the long-time kinetics and generate valid distributions for molecular dynamics (MD) benchmark systems. Remarkably, we show that DeepGenMSMs are able to make long time-steps in molecular configuration space and generate physically realistic structures in regions that were not seen in training data.
Deep Gradient Compression (DGC) 
Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including CIFAR-10, ImageNet, Penn Treebank, and Librispeech Corpus. In these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270x to 600x without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. Deep gradient compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile.
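The core idea of sending only a small fraction of the gradient can be sketched as top-k sparsification with local accumulation: each worker transmits only its largest-magnitude entries and keeps the rest in a local residual that is added to the next gradient. This is a simplified sketch under that assumption; it deliberately omits the four accuracy-preserving methods listed above (momentum correction, local gradient clipping, momentum factor masking, warm-up training), and the function name and `keep_ratio` parameter are hypothetical.

```python
def sparsify_step(gradient, residual, keep_ratio):
    """Select the largest accumulated gradient entries for transmission.

    Returns (sent, new_residual): `sent` maps index -> value for the
    entries to communicate, and `new_residual` holds everything that was
    held back so it can be added to the next step's gradient.
    """
    accumulated = [r + g for r, g in zip(residual, gradient)]
    k = max(1, int(len(accumulated) * keep_ratio))
    # Indices of the k entries with the largest absolute value.
    top = sorted(range(len(accumulated)),
                 key=lambda i: abs(accumulated[i]),
                 reverse=True)[:k]
    sent = {i: accumulated[i] for i in top}
    new_residual = [0.0 if i in sent else value
                    for i, value in enumerate(accumulated)]
    return sent, new_residual


gradient = [0.1, -2.0, 0.3, 0.05]
residual = [0.0, 0.0, 0.0, 0.0]
sent_1, residual = sparsify_step(gradient, residual, keep_ratio=0.25)
# The small entries were not dropped: they accumulate until they matter.
sent_2, residual = sparsify_step([0.1, 0.0, 0.3, 0.05], residual, keep_ratio=0.25)
```

With `keep_ratio=0.001` this scheme would transmit 0.1% of the entries per step, which matches the order of magnitude of the 99.9% redundancy figure quoted above.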
Deep Graph Translation  Inspired by the tremendous success of deep generative models on generating continuous data like image and audio, in the most recent year a few deep graph generative models have been proposed to generate discrete data such as graphs. They are typically unconditioned generative models which have no control over the modes of the graphs being generated. In contrast, in this paper we are interested in a new problem named Deep Graph Translation: given an input graph, we want to infer a target graph based on their underlying (both global and local) translation mapping. Graph translation could be highly desirable in many applications such as disaster management and rare event forecasting, where the rare and abnormal graph patterns (e.g., traffic congestion and terrorism events) will be inferred prior to their occurrence even without historical data on the abnormal patterns for this graph (e.g., a road network or human contact network). To achieve this, we propose a novel Graph-Translation-Generative Adversarial Network (GT-GAN) which will generate a graph translator from input to target graphs. GT-GAN consists of a graph translator where we propose new graph convolution and deconvolution layers to learn the global and local translation mapping. A new conditional graph discriminator has also been proposed to classify target graphs by conditioning on input graphs. Extensive experiments on multiple synthetic and real-world datasets demonstrate the effectiveness and scalability of the proposed GT-GAN.
Deep Hashing Neural Network (HNN) 
In this paper we propose a synergistic melding of neural networks and decision trees into a deep hashing neural network (HNN) having a modeling capability exponential with respect to its number of neurons. We first derive a soft decision tree named neural decision tree allowing the optimization of an arbitrary decision function at each split node. We then rewrite this soft space partitioning as a new kind of neural network layer, namely the hashing layer (HL), which can be seen as a generalization of the known softmax layer. This HL can easily replace the standard last layer of an ANN in any known network topology and thus can be used after a convolutional or recurrent neural network, for example. We present the modeling capacity of this deep hashing function on small datasets where one can reach at least equally good results as standard neural networks by diminishing the number of output neurons. We also show that, for the case where the number of output neurons is large, the neural network can mitigate the absence of linear decision boundaries by learning for each difficult class a collection of not necessarily connected sub-regions of the space, leading to more flexible decision surfaces. Finally, the HNN can be seen as a deep locality-sensitive hashing function which can be trained in a supervised or unsupervised setting, as we will demonstrate for classification and regression problems.
Deep Hyperalignment (DHA) 
This paper proposes Deep Hyperalignment (DHA) as a regularized, scalable, deep extension of the Hyperalignment (HA) method, which is well-suited for applying functional alignment to fMRI datasets with nonlinearity, high dimensionality (broad ROI), and a large number of subjects. Unlike previous methods, DHA is not limited by a restricted fixed kernel function. Further, it uses a parametric approach, rank-$m$ Singular Value Decomposition (SVD), and stochastic gradient descent for optimization. Therefore, DHA has a suitable time complexity for large datasets, and DHA does not require the training data when it computes the functional alignment for a new subject. Experimental studies on multi-subject fMRI analysis confirm that the DHA method achieves superior performance to other state-of-the-art HA algorithms.
Deep Hyperspherical Learning  ➘ “Hyperspherical Convolution” 
Deep Incremental Boosting  This paper introduces Deep Incremental Boosting, a new technique derived from AdaBoost, specifically adapted to work with Deep Learning methods, that reduces the required training time and improves generalisation. We draw inspiration from Transfer of Learning approaches to reduce the startup time to training each incremental Ensemble member. We show a set of experiments that outlines some preliminary results on some common Deep Learning datasets and discuss the potential improvements Deep Incremental Boosting brings to traditional Ensemble methods in Deep Learning. 
Deep Information Network  We describe a novel classifier with a tree structure, designed using information theory concepts. This Information Network is made of information nodes, that compress the input data, and multiplexers, that connect two or more input nodes to an output node. Each information node is trained, independently of the others, to minimize a local cost function that minimizes the mutual information between its input and output with the constraint of keeping a given mutual information between its output and the target (information bottleneck). We show that the system is able to provide good results in terms of accuracy, while it shows many advantages in terms of modularity and reduced complexity. 
Deep Invertible Network (iRevNet) 
It is widely believed that the success of deep convolutional networks is based on progressively discarding uninformative variability about the input with respect to the problem at hand. This is supported empirically by the difficulty of recovering images from their hidden representations in most commonly used network architectures. In this paper we show via a one-to-one mapping that this loss of information is not a necessary condition to learn representations that generalize well on complicated problems, such as ImageNet. Via a cascade of homeomorphic layers, we build the iRevNet, a network that can be fully inverted up to the final projection onto the classes, i.e. no information is discarded. Building an invertible architecture is difficult, for one, because the local inversion is ill-conditioned; we overcome this by providing an explicit inverse. An analysis of the iRevNet's learned representations suggests an alternative explanation for the success of deep networks by a progressive contraction and linear separation with depth. To shed light on the nature of the model learned by the iRevNet, we reconstruct linear interpolations between natural image representations.
Deep Kernelized Autoencoder  In this paper we introduce the deep kernelized autoencoder, a neural network model that allows an explicit approximation of (i) the mapping from an input space to an arbitrary, userspecified kernel space and (ii) the backprojection from such a kernel space to input space. The proposed method is based on traditional autoencoders and is trained through a new unsupervised loss function. During training, we optimize both the reconstruction accuracy of input samples and the alignment between a kernel matrix given as prior and the inner products of the hidden representations computed by the autoencoder. Kernel alignment provides control over the hidden representation learned by the autoencoder. Experiments have been performed to evaluate both reconstruction and kernel alignment performance. Additionally, we applied our method to emulate kPCA on a denoising task obtaining promising results. 
Deep k-Means  The current trend of pushing CNNs deeper with convolutions has created a pressing demand to achieve higher compression gains on CNNs where convolutions dominate the computation and parameter amount (e.g., GoogLeNet, ResNet and Wide ResNet). Further, the high energy consumption of convolutions limits their deployment on mobile devices. To this end, we propose a simple yet effective scheme for compressing convolutions through applying k-means clustering on the weights: compression is achieved through weight-sharing, by only recording $K$ cluster centers and weight assignment indexes. We then introduce a novel spectrally relaxed $k$-means regularization, which tends to make hard assignments of convolutional layer weights to $K$ learned cluster centers during re-training. We additionally propose an improved set of metrics to estimate energy consumption of CNN hardware implementations, whose estimation results are verified to be consistent with a previously proposed energy estimation tool extrapolated from actual hardware measurements. We finally evaluate Deep $k$-Means across several CNN models in terms of both compression ratio and energy consumption reduction, observing promising results without incurring accuracy loss. The code is available at https://…/DeepKMeans Deep $k$-Means: Jointly Clustering with $k$-Means and Learning Representations
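Weight sharing by clustering, as described above, can be sketched with plain Lloyd's k-means on a flat list of scalar weights. This is only an illustration under simplifying assumptions: the function name `kmeans_1d` and the evenly spaced initialization are hypothetical choices, and the spectrally relaxed regularization and re-training step from the entry are not implemented.

```python
def kmeans_1d(weights, k, iters=20):
    """Plain Lloyd's k-means on scalar weights; returns (centers, assignments)."""
    lo, hi = min(weights), max(weights)
    # Deterministic start: k centers spread evenly over the weight range.
    centers = [lo + (hi - lo) * (j + 0.5) / k for j in range(k)]
    assign = [0] * len(weights)
    for _ in range(iters):
        # Assignment step: each weight goes to its nearest center.
        assign = [min(range(k), key=lambda j: abs(w - centers[j])) for w in weights]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [w for w, a in zip(weights, assign) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = sum(members) / len(members)
    return centers, assign


weights = [0.11, 0.09, -0.52, -0.48, 0.10, -0.50]
centers, assign = kmeans_1d(weights, k=2)
# Compressed form: store only `centers` plus one small index per weight.
shared = [centers[a] for a in assign]
```

With $K$ centers, each weight needs only an index of about $\log_2 K$ bits instead of a full floating-point value, which is where the compression comes from.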
Deep kNearest Neighbors (DkNN) 
Deep neural networks (DNNs) enable innovative applications of machine learning like image recognition, machine translation, or malware detection. However, deep learning is often criticized for its lack of robustness in adversarial settings (e.g., vulnerability to adversarial inputs) and general inability to rationalize its predictions. In this work, we exploit the structure of deep learning to enable new learning-based inference and decision strategies that achieve desirable properties such as robustness and interpretability. We take a first step in this direction and introduce the Deep k-Nearest Neighbors (DkNN). This hybrid classifier combines the k-nearest neighbors algorithm with representations of the data learned by each layer of the DNN: a test input is compared to its neighboring training points according to the distance that separates them in the representations. We show that the labels of these neighboring points afford confidence estimates for inputs outside the model’s training manifold, including malicious inputs like adversarial examples, and therein provide protection against inputs that are outside the model’s understanding. This is because the nearest neighbors can be used to estimate the nonconformity of, i.e., the lack of support for, a prediction in the training data. The neighbors also constitute human-interpretable explanations of predictions. We evaluate the DkNN algorithm on several datasets, and show that the confidence estimates accurately identify inputs outside the model, and that the explanations provided by nearest neighbors are intuitive and useful in understanding model failures.
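The per-layer neighbor lookup can be sketched roughly as follows. This is a simplified illustration, not the DkNN algorithm itself: it replaces the nonconformity-based confidence estimate with a plain vote fraction, and the names `nearest_labels` and `dknn_vote` as well as the toy representations are hypothetical.

```python
from collections import Counter


def nearest_labels(query, points, labels, k):
    """Labels of the k training points closest to query (squared Euclidean distance)."""
    order = sorted(range(len(points)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(query, points[i])))
    return [labels[i] for i in order[:k]]


def dknn_vote(query_reprs, train_reprs, labels, k):
    """query_reprs[l] and train_reprs[l] hold the representations at layer l."""
    votes = Counter()
    for q, pts in zip(query_reprs, train_reprs):
        votes.update(nearest_labels(q, pts, labels, k))
    label, count = votes.most_common(1)[0]
    # Fraction of all neighbors (over all layers) that agree with the winner.
    return label, count / sum(votes.values())


labels = [0, 0, 1, 1]
train_reprs = [
    [[0.0], [0.1], [1.0], [1.1]],                      # layer 1 representations
    [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.2]],  # layer 2 representations
]
query_reprs = [[0.05], [0.1, 0.0]]
pred, support = dknn_vote(query_reprs, train_reprs, labels, k=2)
```

A low support value would indicate that the neighbors disagree across layers, which is the kind of signal the entry describes as lack of support for a prediction in the training data.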
Deep KnowledgeAware Network (DKN) 
Online news recommender systems aim to address the information explosion of news and make personalized recommendation for users. In general, news language is highly condensed, full of knowledge entities and common sense. However, existing methods are unaware of such external knowledge and cannot fully discover latent knowledgelevel connections among news. The recommended results for a user are consequently limited to simple patterns and cannot be extended reasonably. Moreover, news recommendation also faces the challenges of high timesensitivity of news and dynamic diversity of users’ interests. To solve the above problems, in this paper, we propose a deep knowledgeaware network (DKN) that incorporates knowledge graph representation into news recommendation. DKN is a contentbased deep recommendation framework for clickthrough rate prediction. The key component of DKN is a multichannel and wordentityaligned knowledgeaware convolutional neural network (KCNN) that fuses semanticlevel and knowledgelevel representations of news. KCNN treats words and entities as multiple channels, and explicitly keeps their alignment relationship during convolution. In addition, to address users’ diverse interests, we also design an attention module in DKN to dynamically aggregate a user’s history with respect to current candidate news. Through extensive experiments on a real online news platform, we demonstrate that DKN achieves substantial gains over stateoftheart deep recommendation models. We also validate the efficacy of the usage of knowledge in DKN. 
Deep Laplacian Pyramid SuperResolution Network  Convolutional neural networks have recently demonstrated highquality reconstruction for single image superresolution. However, existing methods often require a large number of network parameters and entail heavy computational loads at runtime for generating highaccuracy superresolution results. In this paper, we propose the deep Laplacian Pyramid SuperResolution Network for fast and accurate image superresolution. The proposed network progressively reconstructs the subband residuals of highresolution images at multiple pyramid levels. In contrast to existing methods that involve the bicubic interpolation for preprocessing (which results in large feature maps), the proposed method directly extracts features from the lowresolution input space and thereby entails low computational loads. We train the proposed network with deep supervision using the robust Charbonnier loss functions and achieve highquality image reconstruction. Furthermore, we utilize the recursive layers to share parameters across as well as within pyramid levels, and thus drastically reduce the number of parameters. Extensive quantitative and qualitative evaluations on benchmark datasets show that the proposed algorithm performs favorably against the stateoftheart methods in terms of runtime and image quality. 
Deep Layer Aggregation  Convolutional networks have had great success in image classification and other areas of computer vision. Recent efforts have designed deeper or wider networks to improve performance; as convolutional blocks are usually stacked together, blocks at different depths represent information at different scales. Recent models have explored ‘skip’ connections to aggregate information across layers, but heretofore such skip connections have themselves been ‘shallow’, projecting to a single fusion node. In this paper, we investigate new deep-across-layer architectures to aggregate the information from multiple layers. We propose novel iterative and hierarchical structures for deep layer aggregation. The former can produce deep high-resolution representations from a network whose final layers have low resolution, while the latter can effectively combine scale information from all blocks. Results show that our proposed architectures can make use of network parameters and features more efficiently without dictating convolution module structure. We also show transfer of the learned networks to semantic segmentation tasks and achieve better results than alternative networks with baseline training settings.
Deep Learning  Deep learning is a set of algorithms in machine learning that attempt to model highlevel abstractions in data by using architectures composed of multiple nonlinear transformations. Deep learning is part of a broader family of machine learning methods based on learning representations. An observation (e.g., an image) can be represented in many ways (e.g., a vector of pixels), but some representations make it easier to learn tasks of interest (e.g., is this the image of a human face?) from examples, and research in this area attempts to define what makes better representations and how to create models to learn these representations. Various deep learning architectures such as deep neural networks, convolutional deep neural networks, and deep belief networks have been applied to fields like computer vision, automatic speech recognition, natural language processing, and music/audio signal recognition where they have been shown to produce stateoftheart results on various tasks. 
Deep Learning Accelerator Unit (DLAU) 
As an emerging field of machine learning, deep learning shows excellent ability in solving complex learning problems. However, the size of the networks becomes increasingly large due to the demands of practical applications, which poses a significant challenge for constructing high-performance implementations of deep learning neural networks. In order to improve the performance as well as to maintain the low power cost, in this paper we design DLAU, which is a scalable accelerator architecture for large-scale deep learning networks using FPGA as the hardware prototype. The DLAU accelerator employs three pipelined processing units to improve the throughput and utilizes tile techniques to explore locality for deep learning applications. Experimental results on the state-of-the-art Xilinx FPGA board demonstrate that the DLAU accelerator is able to achieve up to 36.1x speedup compared to the Intel Core2 processors, with the power consumption at 234mW.
Deep Learning Approximation  Neural networks offer high-accuracy solutions to a range of problems, but are costly to run in production systems because of computational and memory requirements during a forward pass. Given a trained network, we propose a technique called Deep Learning Approximation to build a faster network in a tiny fraction of the time required for training by only manipulating the network structure and coefficients without requiring re-training or access to the training data. Speedup is achieved by applying a sequential series of independent optimizations that reduce the floating-point operations (FLOPs) required to perform a forward pass. First, lossless optimizations are applied, followed by lossy approximations using singular value decomposition (SVD) and low-rank matrix decomposition. The optimal approximation is chosen by weighing the relative accuracy loss and FLOP reduction according to a single parameter specified by the user. On PASCAL VOC 2007 with the YOLO network, we show an end-to-end 2x speedup in a network forward pass with a 5% drop in mAP that can be regained by fine-tuning.
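The FLOP saving from a low-rank factorization can be checked with simple arithmetic. The sketch below assumes a single dense layer $y = Wx$ with $W$ of shape $(m, n)$ replaced by $y = U(Vx)$ with $U$ of shape $(m, r)$ and $V$ of shape $(r, n)$, and it counts multiply-adds only; the function names are hypothetical, and the accuracy side of the trade-off is not modeled.

```python
def dense_flops(m, n):
    """Multiply-adds for y = W x with W of shape (m, n)."""
    return m * n


def low_rank_flops(m, n, r):
    """Multiply-adds for y = U (V x) with U: (m, r) and V: (r, n)."""
    return r * (m + n)


def break_even_rank(m, n):
    """Largest rank r for which the factorized layer is strictly cheaper."""
    return (m * n - 1) // (m + n)


m, n, r = 1024, 1024, 128
speedup = dense_flops(m, n) / low_rank_flops(m, n, r)
max_useful_rank = break_even_rank(m, n)
```

For this square layer the factorization only pays off below rank 512, and a rank of 128 cuts the multiply-adds by a factor of four; whether the resulting accuracy loss is acceptable is the part governed by the user-specified parameter mentioned above.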
Deep Learning Impact (DLI) 
Deep Learning Impact (DLI) is a set of software tools to help users develop AI models with the leading open source deep learning frameworks, like TensorFlow and Caffe, for the deployment and prediction phases of deep learning. DLI enables users to run distributed deep learning workloads on x86 and Power, and complements the PowerAI deep learning software distribution. 
Deep Learning Library (DLL) 
Deep Learning Library (DLL) is a new library for machine learning with deep neural networks that focuses on speed. It supports feedforward neural networks such as fullyconnected Artificial Neural Networks (ANNs) and Convolutional Neural Networks (CNNs). It also has very comprehensive support for Restricted Boltzmann Machines (RBMs) and Convolutional RBMs. Our main motivation for this work was to propose and evaluate novel software engineering strategies with potential to accelerate runtime for training and inference. Such strategies are mostly independent of the underlying deep learning algorithms. On three different datasets and for four different neural network models, we compared DLL to five popular deep learning frameworks. Experimentally, it is shown that the proposed framework is systematically and significantly faster on CPU and GPU. In terms of classification performance, similar accuracies as the other frameworks are reported. 
Deep Learning Virtual Machine (DLVM) 
Many current approaches to deep learning make use of highlevel toolkits such as TensorFlow, Torch, or Caffe. Toolkits such as Caffe have a layerbased programming framework with hardcoded gradients specified for each layer type, making research using novel layer types problematic. Toolkits such as Torch and TensorFlow define a computation graph in a host language such as Python, where each node represents a linear algebra operation parallelized as a compute kernel on GPU and stores the result of evaluation; some of these toolkits subsequently perform runtime interpretation over that graph, storing the results of forward calculations and reverseaccumulated gradients at each node. This approach is more flexible, but these toolkits take a very limited and adhoc approach to performing optimization. Also problematic are the facts that most toolkits lack type safety, and target only a single (usually GPU) architecture, limiting users’ abilities to make use of heterogeneous and emerging hardware architectures. We introduce a novel framework for highlevel programming that addresses all of the above shortcomings. 
Deep Linear Discriminant Analysis (DeepLDA) 
We introduce Deep Linear Discriminant Analysis (DeepLDA) which learns linearly separable latent representations in an end-to-end fashion. Classic LDA extracts features which preserve class separability and is used for dimensionality reduction for many classification problems. The central idea of this paper is to put LDA on top of a deep neural network. This can be seen as a non-linear extension of classic LDA. Instead of maximizing the likelihood of target labels for individual samples, we propose an objective function that pushes the network to produce feature distributions which (a) have low variance within the same class and (b) have high variance between different classes. Our objective is derived from the general LDA eigenvalue problem and still allows training with stochastic gradient descent and back-propagation.
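For orientation, the classic LDA eigenvalue problem referred to above can be written in its textbook form as follows. The symbols ($S_w$, $S_b$, $\mu_c$, $\mu$, $N_c$) are standard notation chosen here for illustration and are not taken from the paper, whose exact objective and any regularization are not given in this entry.

$$
S_w = \sum_{c=1}^{C} \sum_{x_i \in c} (x_i - \mu_c)(x_i - \mu_c)^\top, \qquad
S_b = \sum_{c=1}^{C} N_c \, (\mu_c - \mu)(\mu_c - \mu)^\top, \qquad
S_b \, e = v \, S_w \, e
$$

Here $x_i$ would be the features produced by the network, $\mu_c$ and $N_c$ the mean and size of class $c$, and $\mu$ the overall mean. Eigenvectors $e$ with large eigenvalues $v$ are directions with high between-class and low within-class variance, which matches properties (a) and (b) of the objective described above.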
Deep Local Binary Patterns (Deep LBP) 
Local Binary Pattern (LBP) is a traditional descriptor for texture analysis that gained attention in the last decade. Thanks to properties such as invariance to illumination translation and scaling, LBPs achieved state-of-the-art results in several applications. However, LBPs are not able to capture high-level features from the image, merely encoding features with low abstraction levels. In this work, we propose Deep LBP, which borrows ideas from the deep learning community to improve LBP expressiveness. By using parametrized data-driven LBP, we enable successive applications of the LBP operators with increasing abstraction levels. We validate the relevance of the proposed idea in several datasets from a wide range of applications. Deep LBP improved the performance of traditional and multiscale LBP in all cases.
Deep Loopy Neural Network  Existing deep learning models may encounter great challenges in handling graph structured data. In this paper, we introduce a new deep learning model for graph data specifically, namely the deep loopy neural network. Significantly different from the previous deep models, inside the deep loopy neural network, there exist a large number of loops created by the extensive connections among nodes in the input graph data, which makes model learning an infeasible task. To resolve such a problem, in this paper, we will introduce a new learning algorithm for the deep loopy neural network specifically. Instead of learning the model variables based on the original model, in the proposed learning algorithm, errors will be backpropagated through the edges in a group of extracted spanning trees. Extensive numerical experiments have been done on several realworld graph datasets, and the experimental results demonstrate the effectiveness of both the proposed model and the learning algorithm in handling graph data. 
Deep Matching and Validation Network (DMVN) 
Image splicing is a very common image manipulation technique that is sometimes used for malicious purposes. A splicing detection and localization algorithm usually takes an input image and produces a binary decision indicating whether the input image has been manipulated, and also a segmentation mask that corresponds to the spliced region. Most existing splicing detection and localization pipelines suffer from two main shortcomings: 1) they use handcrafted features that are not robust against subsequent processing (e.g., compression), and 2) each stage of the pipeline is usually optimized independently. In this paper we extend the formulation of the underlying splicing problem to consider two input images, a query image and a potential donor image. Here the task is to estimate the probability that the donor image has been used to splice the query image, and obtain the splicing masks for both the query and donor images. We introduce a novel deep convolutional neural network architecture, called Deep Matching and Validation Network (DMVN), which simultaneously localizes and detects image splicing. The proposed approach does not depend on handcrafted features and uses raw input images to create deep learned representations. Furthermore, the DMVN is end-to-end optimized to produce the probability estimates and the segmentation masks. Our extensive experiments demonstrate that this approach outperforms state-of-the-art splicing detection methods by a large margin in terms of both AUC score and speed.
Deep Matching Autoencoder (DMAE) 
Increasingly many real world tasks involve data in multiple modalities or views. This has motivated the development of many effective algorithms for learning a common latent space to relate multiple domains. However, most existing crossview learning algorithms assume access to paired data for training. Their applicability is thus limited as the paired data assumption is often violated in practice: many tasks have only a small subset of data available with pairing annotation, or even no paired data at all. In this paper we introduce Deep Matching Autoencoders (DMAE), which learn a common latent space and pairing from unpaired multimodal data. Specifically we formulate this as a crossdomain representation learning and object matching problem. We simultaneously optimise parameters of representation learning autoencoders and the pairing of unpaired multimodal data. This framework elegantly spans the full regime from fully supervised, semisupervised, and unsupervised (no paired data) multimodal learning. We show promising results in image captioning, and on a new task that is uniquely enabled by our methodology: unsupervised classifier learning. 
Deep Mean Maps (DMM) 
The use of distributions and high-level features from deep architectures has become commonplace in modern computer vision. Both of these methodologies have separately achieved a great deal of success in many computer vision tasks. However, there has been little work attempting to leverage the power of these two methodologies jointly. To this end, this paper presents the Deep Mean Maps (DMMs) framework, a novel family of methods to non-parametrically represent distributions of features in convolutional neural network models. DMMs are able both to classify images using the distribution of top-level features and to tune the top-level features for performing this task. We show how to implement DMMs using a special mean map layer composed of typical CNN operations, making both forward and backward propagation simple.
Deep Memory (MemGEN) 
We propose a new learning paradigm called Deep Memory. It has the potential to completely revolutionize the Machine Learning field. Surprisingly, this paradigm has not been reinvented yet, unlike Deep Learning. At the core of this approach is the Learning By Heart principle, well studied in primary schools all over the world. Inspired by poem recitation, or by $\pi$ decimal memorization, we propose a concrete algorithm that mimics human behavior. We implement this paradigm on the task of generative modeling, and apply it to images, natural language and even the $\pi$ decimals as long as one can print them as text. The proposed algorithm even generated this paper, in a one-shot learning setting. In carefully designed experiments, we show that the generated samples are indistinguishable from the training examples, as measured by any statistical tests or metrics.
Deep MetaLearning  Fewshot learning remains challenging for metalearning that learns a learning algorithm (metalearner) from many related tasks. In this work, we argue that this is due to the lack of a good representation for metalearning, and propose deep metalearning to integrate the representation power of deep learning into metalearning. The framework is composed of three modules, a concept generator, a metalearner, and a concept discriminator, which are learned jointly. The concept generator, e.g. a deep residual net, extracts a representation for each instance that captures its highlevel concept, on which the metalearner performs fewshot learning, and the concept discriminator recognizes the concepts. By learning to learn in the concept space rather than in the complicated instance space, deep metalearning can substantially improve vanilla metalearning, which is demonstrated on various fewshot image recognition problems. For example, on 5way1shot image recognition on CIFAR100 and CUB200, it improves Matching Nets from 50.53% and 56.53% to 58.18% and 63.47%, improves MAML from 49.28% and 50.45% to 56.65% and 64.63%, and improves MetaSGD from 53.83% and 53.34% to 61.62% and 66.95%, respectively. 
Deep Motion Boundary Detection (MoBoNet) 
Motion boundary detection is a crucial yet challenging problem. Prior methods focus on analyzing the gradients and distributions of optical flow fields, or use handcrafted features for motion boundary learning. In this paper, we propose the first dedicated endtoend deep learning approach for motion boundary detection, which we term as MoBoNet. We introduce a refinement network structure which takes source input images, initial forward and backward optical flows as well as corresponding warping errors as inputs and produces highresolution motion boundaries. Furthermore, we show that the obtained motion boundaries, through a fusion subnetwork we design, can in turn guide the optical flows for removing the artifacts. The proposed MoBoNet is generic and works with any optical flows. Our motion boundary detection and the refined optical flow estimation achieve results superior to the state of the art. 
Deep Multimodal Attention Network (DMAN) 
Learning social media data embeddings with deep models has attracted extensive research interest and enabled many applications, such as link prediction, classification, and cross-modal search. However, for social images which contain both link information and multimodal contents (e.g., text description and visual content), simply employing the embedding learnt from network structure or data content results in sub-optimal social image representation. In this paper, we propose a novel social image embedding approach called Deep Multimodal Attention Networks (DMAN), which employs a deep model to jointly embed multimodal contents and link information. Specifically, to effectively capture the correlations between multimodal contents, we propose a multimodal attention network to encode the fine-granularity relation between image regions and textual words. To leverage the network structure for embedding learning, a novel Siamese-Triplet neural network is proposed to model the links among images. With the joint deep model, the learnt embedding can capture both the multimodal contents and the nonlinear network information. Extensive experiments are conducted to investigate the effectiveness of our approach in the applications of multi-label classification and cross-modal search. Compared to state-of-the-art image embeddings, our proposed DMAN achieves significant improvement in the tasks of multi-label classification and cross-modal search.
Deep Multimodal Subspace Clustering Network  We present convolutional neural network (CNN) based approaches for unsupervised multimodal subspace clustering. The proposed framework consists of three main stages – multimodal encoder, selfexpressive layer, and multimodal decoder. The encoder takes multimodal data as input and fuses them to a latent space representation. We investigate early, late and intermediate fusion techniques and propose three different encoders corresponding to them for spatial fusion. The selfexpressive layers and multimodal decoders are essentially the same for different spatial fusionbased approaches. In addition to various spatial fusionbased methods, an affinity fusionbased network is also proposed in which the selfexpressiveness layer corresponding to different modalities is enforced to be the same. Extensive experiments on three datasets show that the proposed methods significantly outperform the stateoftheart multimodal subspace clustering methods. 
Deep Multiscale Model Learning  The objective of this paper is to design novel multilayer neural network architectures for multiscale simulations of flows taking into account the observed data and physical modeling concepts. Our approaches use deep learning concepts combined with local multiscale model reduction methodologies to predict flow dynamics. Using reducedorder model concepts is important for constructing robust deep learning architectures since the reducedorder models provide fewer degrees of freedom. Flow dynamics can be thought of as multilayer networks. More precisely, the solution (e.g., pressures and saturations) at the time instant $n+1$ depends on the solution at the time instant $n$ and input parameters, such as permeability fields, forcing terms, and initial conditions. One can regard the solution as a multilayer network, where each layer, in general, is a nonlinear forward map and the number of layers relates to the internal time steps. We will rely on rigorous model reduction concepts to define unknowns and connections for each layer. In each layer, our reducedorder models will provide a forward map, which will be modified (‘trained’) using available data. It is critical to use reducedorder models for this purpose, which will identify the regions of influence and the appropriate number of variables. Because of the lack of available data, the training will be supplemented with computational data as needed and the interpolation between datarich and datadeficient models. We will also use deep learning algorithms to train the elements of the reduced model discrete system. We will present main ingredients of our approach and numerical results. Numerical results show that using deep learning and multiscale models, we can improve the forward models, which are conditioned to the available data. 
Deep Mutual Learning (DML) 
Model distillation is an effective and widely used technique to transfer knowledge from a teacher to a student network. The typical application is to transfer from a powerful large network or ensemble to a small network that is better suited to low-memory or fast-execution requirements. In this paper, we present a deep mutual learning (DML) strategy where, rather than one-way transfer between a static predefined teacher and a student, an ensemble of students learn collaboratively and teach each other throughout the training process. Our experiments show that a variety of network architectures benefit from mutual learning and achieve compelling results on the CIFAR-100 recognition and Market-1501 person re-identification benchmarks. Surprisingly, no prior powerful teacher network is necessary: mutual learning of a collection of simple student networks works, and moreover outperforms distillation from a more powerful yet static teacher.
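As a rough illustration of how two students might teach each other, the sketch below adds to each student's supervised cross-entropy a KL term that pulls its predictions toward its peer's. The abstract does not spell out the loss, so the KL mimicry term, the equal weighting of the two terms, and all function names here are assumptions made for illustration only.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    # Supervised loss for a single example with an integer class label.
    return -math.log(probs[label])

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions over the same classes.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_losses(logits_a, logits_b, label):
    # Each student minimises its own cross-entropy plus a term that
    # penalises disagreement with the peer's current prediction.
    # (Assumed form of the mutual-learning objective.)
    pa, pb = softmax(logits_a), softmax(logits_b)
    loss_a = cross_entropy(pa, label) + kl_divergence(pb, pa)
    loss_b = cross_entropy(pb, label) + kl_divergence(pa, pb)
    return loss_a, loss_b
```

When both students already agree, the KL terms vanish and each loss reduces to plain cross-entropy; in training, each student would treat the peer's prediction as a fixed target for its own update.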
Deep Nearest Neighbor Descent (DNND) 
Most density-based clustering methods rely heavily on how well the underlying density is estimated. However, density estimation is itself a challenging problem, especially the determination of the kernel bandwidth. A large bandwidth can lead to an over-smoothed density estimate in which the number of density peaks is smaller than the true number of clusters, while a small bandwidth can lead to an under-smoothed density estimate in which spurious density peaks, the so-called ‘ripple noise’, are generated. In this paper, we propose a density-based hierarchical clustering method, called Deep Nearest Neighbor Descent (DNND), which learns the underlying density structure layer by layer and captures the cluster structure at the same time. Over-smoothed density estimation can be largely avoided, and the negative effect of the underestimated cases can also be largely reduced. Overall, DNND shows not only a strong capability for discovering the underlying cluster structure but also remarkable reliability due to its insensitivity to parameters.
Deep Nested Agent Framework  Deep hierarchical reinforcement learning has gained a lot of attention in recent years due to its ability to produce state-of-the-art results in challenging environments where non-hierarchical frameworks fail to learn useful policies. However, as problem domains become more complex, deep hierarchical reinforcement learning can become inefficient, leading to longer convergence times and poor performance. We introduce the Deep Nested Agent framework, which is a variant of deep hierarchical reinforcement learning where information from the main agent is propagated to the low-level nested agent by incorporating this information into the nested agent’s state. We demonstrate the effectiveness and performance of the Deep Nested Agent framework by applying it to three scenarios in Minecraft with comparisons to a deep non-hierarchical single agent framework, as well as a deep hierarchical framework.
Deep Neural Decision Forests  We present Deep Neural Decision Forests, a novel approach that unifies classification trees with the representation learning functionality known from deep convolutional networks, by training them in an end-to-end manner. To combine these two worlds, we introduce a stochastic and differentiable decision tree model, which steers the representation learning usually conducted in the initial layers of a (deep) convolutional network. Our model differs from conventional deep networks because a decision forest provides the final predictions, and it differs from conventional decision forests since we propose a principled, joint and global optimization of split and leaf node parameters. We show experimental results on benchmark machine learning datasets such as MNIST and ImageNet and find on-par or superior results when compared to state-of-the-art deep models. Most remarkably, we obtain Top-5 errors of only 7.84%/6.38% on ImageNet validation data when integrating our forests in a single-crop, single/seven model GoogLeNet architecture, respectively. Thus, even without any form of training data set augmentation we are improving on the 6.67% error obtained by the best GoogLeNet architecture (7 models, 144 crops).
Deep Neural Decision Tree (DNDT) 
Deep neural networks have proven powerful at processing perceptual data, such as images and audio. However, for tabular data, tree-based models are more popular. A nice property of tree-based models is their natural interpretability. In this work, we present Deep Neural Decision Trees (DNDT), tree models realised by neural networks. A DNDT is intrinsically interpretable, as it is a tree. Yet as it is also a neural network (NN), it can be easily implemented in NN toolkits and trained with gradient descent rather than greedy splitting. We evaluate DNDT on several tabular datasets, verify its efficacy, and investigate similarities and differences between DNDT and vanilla decision trees. Interestingly, DNDT self-prunes at both the split and feature level.
Deep Optimistic Linear Support Learning (DOL) 
We propose Deep Optimistic Linear Support Learning (DOL) to solve highdimensional multiobjective decision problems where the relative importances of the objectives are not known a priori. Using features from the highdimensional inputs, DOL computes the convex coverage set containing all potential optimal solutions of the convex combinations of the objectives. To our knowledge, this is the first time that deep reinforcement learning has succeeded in learning multiobjective policies. In addition, we provide a testbed with two experiments to be used as a benchmark for deep multiobjective reinforcement learning. 
Deep PermSet Net  We present a novel approach for learning to predict sets with unknown permutation and cardinality using deep neural networks. Even though the output of many realworld problems, e.g. object detection, are naturally expressed as sets of entities, existing deep learning architectures hinder a trivial extension to deal with this unstructured output. Even deep architectures that handle sequential data, such as recurrent neural networks, can only output an ordered set and may not guarantee a valid solution, i.e. a set with unique elements. In this paper, we derive a mathematical formulation for set prediction using feedforward neural networks, where the output has unknown and unfixed cardinality and permutation. Specifically, in our formulation we incorporate the permutation as unobservable variable and estimate its distribution during the learning process using alternating optimization. We demonstrate the validity of this formulation on two relevant problems including object detection and a complex CAPTCHA test. 
Deep Policy Inference Q-Network (DPIQN)
We present DPIQN, a deep policy inference Qnetwork that targets multiagent systems composed of controllable agents, collaborators, and opponents that interact with each other. We focus on one challenging issue in such systems—modeling agents with varying strategies—and propose to employ ‘policy features’ learned from raw observations (e.g., raw images) of collaborators and opponents by inferring their policies. DPIQN incorporates the learned policy features as a hidden vector into its own deep Qnetwork (DQN), such that it is able to predict better Q values for the controllable agents than the stateoftheart deep reinforcement learning models. We further propose an enhanced version of DPIQN, called deep recurrent policy inference Qnetwork (DRPIQN), for handling partial observability. Both DPIQN and DRPIQN are trained by an adaptive training procedure, which adjusts the network’s attention to learn the policy features and its own Qvalues at different phases of the training process. We present a comprehensive analysis of DPIQN and DRPIQN, and highlight their effectiveness and generalizability in various multiagent settings. Our models are evaluated in a classic soccer game involving both competitive and collaborative scenarios. Experimental results performed on 1 vs. 1 and 2 vs. 2 games show that DPIQN and DRPIQN demonstrate superior performance to the baseline DQN and deep recurrent Qnetwork (DRQN) models. We also explore scenarios in which collaborators or opponents dynamically change their policies, and show that DPIQN and DRPIQN do lead to better overall performance in terms of stability and mean scores. 
Deep Private-Feature Extractor (DPFE)
We present and evaluate Deep PrivateFeature Extractor (DPFE), a deep model which is trained and evaluated based on information theoretic constraints. Using the selective exchange of information between a user’s device and a service provider, DPFE enables the user to prevent certain sensitive information from being shared with a service provider, while allowing them to extract approved information using their model. We introduce and utilize the logrank privacy, a novel measure to assess the effectiveness of DPFE in removing sensitive information and compare different models based on their accuracyprivacy tradeoff. We then implement and evaluate the performance of DPFE on smartphones to understand its complexity, resource demands, and efficiency tradeoffs. Our results on benchmark image datasets demonstrate that under moderate resource utilization, DPFE can achieve high accuracy for primary tasks while preserving the privacy of sensitive features. 
Deep Product Quantization (DPQ) 
Despite their widespread adoption, Product Quantization techniques were recently shown to be inferior to other hashing techniques. In this work, we present an improved Deep Product Quantization (DPQ) technique that leads to more accurate retrieval and classification than the latest state of the art methods, while having similar computational complexity and memory footprint as the Product Quantization method. To our knowledge, this is the first work to introduce a representation that is inspired by Product Quantization and which is learned endtoend, and thus benefits from the supervised signal. DPQ explicitly learns soft and hard representations to enable an efficient and accurate asymmetric search, by using a straightthrough estimator. A novel loss function, Joint Central Loss, is introduced, which both improves the retrieval performance, and decreases the discrepancy between the soft and the hard representations. Finally, by using a normalization technique, we improve the results for crossdomain category retrieval. 
Deep Q-Network (DQN)
The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching realworld complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from highdimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, lowdimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Qnetwork, that can learn successful policies directly from highdimensional sensory inputs using endtoend reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Qnetwork agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between highdimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks. 
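To make the learning signal concrete, the sketch below computes the standard one-step Q-learning target and squared temporal-difference error that a Q-network can be regressed onto. It is a simplified illustration under stated assumptions: the function names and the default discount factor are hypothetical, and the network itself and any other training machinery are omitted.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    # One-step Q-learning target: r + gamma * max_a' Q(s', a').
    # For terminal transitions there is no future value to bootstrap from.
    if done:
        return reward
    return reward + gamma * max(next_q_values)

def squared_td_error(q_value, target):
    # Regression loss between the current estimate Q(s, a) and the target.
    return (target - q_value) ** 2
```

For example, with reward 1.0, next-state action values [0.5, 2.0] and gamma = 0.9, the target is 1.0 + 0.9 * 2.0 = 2.8.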
Deep Quaternion Network  The field of deep learning has seen significant advancement in recent years. However, much of the existing work has been focused on realvalued numbers. Recent work has shown that a deep learning system using the complex numbers can be deeper for a set parameter budget compared to its realvalued counterpart. In this work, we explore the benefits of generalizing one step further into the hypercomplex numbers, quaternions specifically, and provide the architecture components needed to build deep quaternion networks. We go over quaternion convolutions, present a quaternion weight initialization scheme, and present algorithms for quaternion batchnormalization. These pieces are tested by endtoend training on the CIFAR10 and CIFAR100 data sets to show the improved convergence to a realvalued network. 
Deep Rendering Mixture Model (DRMM) 
A Probabilistic Framework for Deep Learning 
Deep Rendering Model (DRM) 
In this paper, we develop a new theoretical framework that provides insights into both the successes and shortcomings of deep learning systems, as well as a principled route to their design and improvement. Our framework is based on a generative probabilistic model that explicitly captures variation due to latent nuisance variables. The Rendering Model (RM) explicitly models nuisance variation through a rendering function that combines the taskspecific variables of interest (e.g., object class in an object recognition task) and the collection of nuisance variables. The Deep Rendering Model (DRM) extends the RM in a hierarchical fashion by rendering via a product of affine nuisance transformations across multiple levels of abstraction. The graphical structures of the RM and DRM enable inference via message passing, using, for example, the sumproduct or maxsum algorithms, and training via the expectationmaximization (EM) algorithm. A key element of the framework is the relaxation of the RM/DRM generative model to a discriminative one in order to optimize the biasvariance tradeoff. 
Deep Residual Hashing  In this paper, we define an extension of the supersymmetric hyperbolic nonlinear sigma model introduced by Zirnbauer. We show that it arises as a weak joint limit of a time-changed version, introduced by Sabot and Tarrès, of the vertex-reinforced jump process. It describes the asymptotics of rescaled crossing numbers, rescaled fluctuations of local times, asymptotic local times on a logarithmic scale, endpoints of paths, and last exit trees.
Deep Residual Network (ResNet) 
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8× deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to the ILSVRC & COCO 2015 competitions, where we also won 1st place on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
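The idea of learning a residual function with reference to the layer input can be sketched as y = F(x) + x, where F is a small stack of layers. The block below is a minimal fully connected version in plain Python; the two-layer form of F, the placement of the ReLU activations, and all names are assumptions for illustration rather than the architecture evaluated in the paper.

```python
def relu(v):
    # Element-wise rectified linear unit.
    return [max(0.0, x) for x in v]

def matvec(w, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(wi * xi for wi, xi in zip(row, v)) for row in w]

def residual_block(x, w1, w2):
    # F(x) is two linear layers with a ReLU in between; the shortcut
    # adds the unchanged input x back before the final activation.
    h = relu(matvec(w1, x))
    f = matvec(w2, h)
    return relu([fi + xi for fi, xi in zip(f, x)])
```

With all weights at zero, F(x) = 0 and the block passes a non-negative input through unchanged, which illustrates why an identity mapping is easy for such a block to represent.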
Deep Rewiring (DEEP R) 
Neuromorphic hardware tends to pose limits on the connectivity of deep networks that one can run on it. Generic hardware and software implementations of deep learning also run more efficiently on sparse networks. Several methods exist for pruning connections of a neural network after it has been trained without connectivity constraints. We present an algorithm, DEEP R, that enables us to directly train a sparsely connected neural network. DEEP R automatically rewires the network during supervised training so that connections are placed where they are most needed for the task, while the total number of connections stays strictly bounded at all times. We demonstrate that DEEP R can be used to train very sparse feedforward and recurrent neural networks on standard benchmark tasks with just a minor loss in performance. DEEP R is based on a rigorous theoretical foundation that views rewiring as stochastic sampling of network configurations from a posterior.
Deep Ritz Method  We propose a deep learning based method, the Deep Ritz Method, for numerically solving variational problems, particularly the ones that arise from partial differential equations. The Deep Ritz method is naturally nonlinear, naturally adaptive and has the potential to work in rather high dimensions. The framework is quite simple and fits well with the stochastic gradient descent method used in deep learning. We illustrate the method on several problems including some eigenvalue problems. 
Deep Roots  We propose a new method for training computationally efficient and compact convolutional neural networks (CNNs) using a novel sparse connection structure that resembles a tree root. Our sparse connection structure facilitates a significant reduction in computational cost and number of parameters of stateoftheart deep CNNs without compromising accuracy. We validate our approach by using it to train more efficient variants of stateoftheart CNN architectures, evaluated on the CIFAR10 and ILSVRC datasets. Our results show similar or higher accuracy than the baseline architectures with much less compute, as measured by CPU and GPU timings. For example, for ResNet 50, our model has 40% fewer parameters, 45% fewer floating point operations, and is 31% (12%) faster on a CPU (GPU). For the deeper ResNet 200 our model has 25% fewer floating point operations and 44% fewer parameters, while maintaining stateoftheart accuracy. For GoogLeNet, our model has 7% fewer parameters and is 21% (16%) faster on a CPU (GPU). 
Deep Rotation Equivariant Network (DREN) 
Recently, learning equivariant representations has attracted considerable research attention. Dieleman et al. introduce four operations which can be inserted into a CNN to learn deep representations equivariant to rotation. However, in their approach feature maps must be copied and rotated four times in each layer, which incurs considerable running time and memory overhead. To address this problem, we propose the Deep Rotation Equivariant Network (DREN), consisting of cycle layers, isotonic layers and decycle layers. Our proposed layers apply rotation transformations to filters rather than feature maps, achieving a speed-up of more than 2 times with even less memory overhead. We evaluate DRENs on the Rotated MNIST and CIFAR-10 datasets and demonstrate that they can improve the performance of state-of-the-art architectures. Our code is released on GitHub.
Deep Saliency Hashing (DSaH) 
In recent years, hashing methods have proven efficient for large-scale Web media search. However, existing general hashing methods have limited discriminative power for describing fine-grained objects that share a similar overall appearance but have subtle differences. To solve this problem, we introduce, for the first time, an attention mechanism to the learning of hashing codes. Specifically, we propose a novel deep hashing model, named deep saliency hashing (DSaH), which automatically mines salient regions and learns semantic-preserving hashing codes simultaneously. DSaH is a two-step end-to-end model consisting of an attention network and a hashing network. Our loss function contains three basic components: the semantic loss, the saliency loss, and the quantization loss. The saliency loss guides the attention network to mine discriminative regions from pairs of images. We conduct extensive experiments on both fine-grained and general retrieval datasets for performance evaluation. Experimental results on Oxford Flowers-17 and Stanford Dogs-120 demonstrate that our DSaH performs the best on the fine-grained retrieval task and beats the existing best retrieval performance (DPSH) by approximately 12%. DSaH also outperforms several state-of-the-art hashing methods on general datasets, including CIFAR-10 and NUS-WIDE.
Deep SelfOrganization  Human professionals are often required to make decisions based on complex multivariate time series measurements in an online setting, e.g. in health care. Since human cognition is not optimized to work well in highdimensional spaces, these decisions benefit from interpretable lowdimensional representations. However, many representation learning algorithms for time series data are difficult to interpret. This is due to nonintuitive mappings from data features to salient properties of the representation and nonsmoothness over time. To address this problem, we propose to couple a variational autoencoder to a discrete latent space and introduce a topological structure through the use of selforganizing maps. This allows us to learn discrete representations of time series, which give rise to smooth and interpretable embeddings with superior clustering performance. Furthermore, to allow for a probabilistic interpretation of our method, we integrate a Markov model in the latent space. This model uncovers the temporal transition structure, improves clustering performance even further and provides additional explanatory insights as well as a natural representation of uncertainty. We evaluate our model on static (Fashion)MNIST data, a time series of linearly interpolated (Fashion)MNIST images, a chaotic Lorenz attractor system with two macro states, as well as on a challenging real world medical time series application. In the latter experiment, our representation uncovers meaningful structure in the acute physiological state of a patient. 
Deep Sparse Subspace Clustering  In this paper, we present a deep extension of Sparse Subspace Clustering, termed Deep Sparse Subspace Clustering (DSSC). Regularized by the unit sphere distribution assumption for the learned deep features, DSSC can infer a new data affinity matrix by simultaneously satisfying the sparsity principle of SSC and the nonlinearity given by neural networks. One of the appealing advantages brought by DSSC is: when original realworld data do not meet the classspecific linear subspace distribution assumption, DSSC can employ neural networks to make the assumption valid with its hierarchical nonlinear transformations. To the best of our knowledge, this is among the first deep learning based subspace clustering methods. Extensive experiments are conducted on four realworld datasets to show the proposed DSSC is significantly superior to 12 existing methods for subspace clustering. 
Deep Subspace Clustering Network  We present a novel deep neural network architecture for unsupervised subspace clustering. This architecture is built upon deep autoencoders, which nonlinearly map the input data into a latent space. Our key idea is to introduce a novel selfexpressive layer between the encoder and the decoder to mimic the ‘selfexpressiveness’ property that has proven effective in traditional subspace clustering. Being differentiable, our new selfexpressive layer provides a simple but effective way to learn pairwise affinities between all data points through a standard backpropagation procedure. Being nonlinear, our neuralnetwork based method is able to cluster data points having complex (often nonlinear) structures. We further propose pretraining and finetuning strategies that let us effectively learn the parameters of our subspace clustering networks. Our experiments show that the proposed method significantly outperforms the stateoftheart unsupervised subspace clustering methods. 
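The ‘self-expressiveness’ property says that each data point can be written as a combination of other points from the same subspace. A minimal sketch of the corresponding reconstruction penalty on latent codes is shown below. The function name, the squared-error form, and the choice to ignore the diagonal of the coefficient matrix (so that no point reconstructs itself) are assumptions for illustration; any regularizer on the coefficients is left out.

```python
def self_expression_loss(Z, C):
    # Z: list of n latent vectors, each of length d.
    # C: n x n coefficient matrix; C[i][j] is the weight of point j
    #    when reconstructing point i. Diagonal entries are ignored.
    n, d = len(Z), len(Z[0])
    loss = 0.0
    for i in range(n):
        recon = [
            sum(C[i][j] * Z[j][k] for j in range(n) if j != i)
            for k in range(d)
        ]
        loss += sum((Z[i][k] - recon[k]) ** 2 for k in range(d))
    return loss
```

In a network, the entries of C would be trainable parameters, and the magnitudes of the learned coefficients could then serve as pairwise affinities for a downstream clustering step.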
Deep Successor Reinforcement Learning (DSR) 
Learning robust value functions given raw observations and rewards is now possible with modelfree and modelbased deep reinforcement learning algorithms. There is a third alternative, called Successor Representations (SR), which decomposes the value function into two components — a reward predictor and a successor map. The successor map represents the expected future state occupancy from any given state and the reward predictor maps states to scalar rewards. The value function of a state can be computed as the inner product between the successor map and the reward weights. In this paper, we present DSR, which generalizes SR within an endtoend deep reinforcement learning framework. DSR has several appealing properties including: increased sensitivity to distal reward changes due to factorization of reward and world dynamics, and the ability to extract bottleneck states (subgoals) given successor maps trained under a random policy. We show the efficacy of our approach on two diverse environments given raw pixel observations — simple gridworld domains (MazeBase) and the Doom game engine. 
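The decomposition described above can be illustrated in a small tabular setting, where the value of a state is the inner product of its row of the successor map with the reward weights. The sketch assumes discounted occupancies under a fixed policy with a known transition matrix; the function names and the fixed-point iteration are illustrative choices, not the deep architecture of the paper.

```python
def successor_matrix(P, gamma, iters=200):
    # Iterate M <- I + gamma * P * M, which converges to the expected
    # discounted future occupancy of every state from every start state.
    n = len(P)
    M = [[0.0] * n for _ in range(n)]
    for _ in range(iters):
        PM = [
            [sum(P[i][k] * M[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        M = [
            [(1.0 if i == j else 0.0) + gamma * PM[i][j] for j in range(n)]
            for i in range(n)
        ]
    return M

def state_value(successor_row, reward_weights):
    # V(s) = <successor map of s, reward weights>.
    return sum(m * w for m, w in zip(successor_row, reward_weights))
```

For a two-state chain in which state 0 moves to state 1 and state 1 loops on itself, with gamma = 0.5 and reward 1 only in state 1, the values come out as approximately 1.0 and 2.0. Because the rewards enter only through the final inner product, changing the reward weights updates the values without recomputing the successor map.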
Deep Super Learning  Deep learning has become very popular for tasks such as predictive modeling and pattern recognition in handling big data. Deep learning is a powerful machine learning method that extracts lower level features and feeds them forward for the next layer to identify higher level features that improve performance. However, deep neural networks have drawbacks, which include many hyperparameters and infinite architectures, opaqueness into results, and relatively slower convergence on smaller datasets. While traditional machine learning algorithms can address these drawbacks, they are not typically capable of the performance levels achieved by deep neural networks. To improve performance, ensemble methods are used to combine multiple base learners. Super learning is an ensemble that finds the optimal combination of diverse learning algorithms. This paper proposes deep super learning as an approach which achieves log loss and accuracy results competitive to deep neural networks while employing traditional machine learning algorithms in a hierarchical structure. The deep super learner is flexible, adaptable, and easy to train with good performance across different tasks using identical hyperparameter values. Using traditional machine learning requires fewer hyperparameters, allows transparency into results, and has relatively fast convergence on smaller datasets. Experimental results show that the deep super learner has superior performance compared to the individual base learners, singlelayer ensembles, and in some cases deep neural networks. Performance of the deep super learner may further be improved with taskspecific tuning. 
Deep Survival  Previous research has shown that neural networks can model survival data in situations in which some patients’ death times are unknown, e.g. rightcensored. However, neural networks have rarely been shown to outperform their linear counterparts such as the Cox proportional hazards model. In this paper, we run simulated experiments and use real survival data to build upon the riskregression architecture proposed by Faraggi and Simon. We demonstrate that our model, DeepSurv, not only works as well as the standard linear Cox proportional hazards model but actually outperforms it in predictive ability on survival data with linear and nonlinear risk functions. We then show that the neural network can also serve as a recommender system by including a categorical variable representing a treatment group. This can be used to provide personalized treatment recommendations based on an individual’s calculated risk. We provide an open source Python module that implements these methods in order to advance research on deep learning and survival analysis. 
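Since the model builds on Cox proportional hazards, one natural training objective is the negative Cox partial log-likelihood evaluated on the network's predicted risk scores. The sketch below computes it for a small batch. It assumes no special handling of tied event times and no regularization, and the function and argument names are hypothetical rather than taken from the authors' Python module.

```python
import math

def neg_cox_partial_log_likelihood(risk_scores, times, events):
    # risk_scores: predicted log-risk h(x_i) for each patient.
    # times: observed time (event or censoring) for each patient.
    # events: 1 if the event was observed, 0 if right-censored.
    total = 0.0
    for h_i, t_i, e_i in zip(risk_scores, times, events):
        if not e_i:
            continue  # censored patients contribute only via risk sets
        # Risk set: everyone still under observation at time t_i.
        log_risk_set = math.log(
            sum(math.exp(h_j)
                for h_j, t_j in zip(risk_scores, times) if t_j >= t_i)
        )
        total += h_i - log_risk_set
    return -total
```

With all risk scores equal to zero, each observed event contributes the log of its risk-set size, which gives a quick sanity check for an implementation.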
Deep Symbolic Network (DSN) 
We introduce the Deep Symbolic Network (DSN) model, which aims at becoming the whitebox version of Deep Neural Networks (DNN). The DSN model provides a simple, universal yet powerful structure, similar to DNN, to represent any knowledge of the world, which is transparent to humans. The conjecture behind the DSN model is that any type of real world objects sharing enough common features are mapped into human brains as a symbol. Those symbols are connected by links, representing the composition, correlation, causality, or other relationships between them, forming a deep, hierarchical symbolic network structure. Powered by such a structure, the DSN model is expected to learn like humans, because of its unique characteristics. First, it is universal, using the same structure to store any knowledge. Second, it can learn symbols from the world and construct the deep symbolic networks automatically, by utilizing the fact that real world objects have been naturally separated by singularities. Third, it is symbolic, with the capacity of performing causal deduction and generalization. Fourth, the symbols and the links between them are transparent to us, and thus we will know what it has learned or not – which is the key for the security of an AI system. Fifth, its transparency enables it to learn with relatively small data. Sixth, its knowledge can be accumulated. Last but not least, it is more friendly to unsupervised learning than DNN. We present the details of the model, the algorithm powering its automatic learning ability, and describe its usefulness in different use cases. The purpose of this paper is to generate broad interest to develop it within an open source project centered on the Deep Symbolic Network (DSN) model towards the development of general AI. 
Deep Temporal Clustering (DTC) 
Unsupervised learning of time series data, also known as temporal clustering, is a challenging problem in machine learning. Here we propose a novel algorithm, Deep Temporal Clustering (DTC), to naturally integrate dimensionality reduction and temporal clustering into a single end-to-end learning framework, fully unsupervised. The algorithm utilizes an autoencoder for temporal dimensionality reduction and a novel temporal clustering layer for cluster assignment. Then it jointly optimizes the clustering objective and the dimensionality reduction objective. Based on requirement and application, the temporal clustering layer can be customized with any temporal similarity metric. Several similarity metrics and state-of-the-art algorithms are considered and compared. To gain insight into temporal features that the network has learned for its clustering, we apply a visualization method that generates a region of interest heatmap for the time series. The viability of the algorithm is demonstrated using time series data from diverse domains, ranging from earthquakes to spacecraft sensor data. In each case, we show that the proposed algorithm outperforms traditional methods. The superior performance is attributed to the fully integrated temporal dimensionality reduction and clustering criterion.
Deep Tensor Decomposition (DeepTD) 
In this paper we study the problem of learning the weights of a deep convolutional neural network. We consider a network where convolutions are carried out over non-overlapping patches with a single kernel in each layer. We develop an algorithm for simultaneously learning all the kernels from the training data. Our approach, dubbed Deep Tensor Decomposition (DeepTD), is based on a rank-1 tensor decomposition. We theoretically investigate DeepTD under a realizable model for the training data where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to planted convolutional kernels. We show that DeepTD is data-efficient and provably works as soon as the sample size exceeds the total number of convolutional weights in the network. We carry out a variety of numerical experiments to investigate the effectiveness of DeepTD and verify our theoretical findings.
Deep Texture Encoding Network (Deep TEN) 
We propose a Deep Texture Encoding Network (Deep-TEN) with a novel Encoding Layer integrated on top of convolutional layers, which ports the entire dictionary learning and encoding pipeline into a single model. Current methods build from distinct components, using standard encoders with separate off-the-shelf features such as SIFT descriptors or pre-trained CNN features for material recognition. Our new approach provides an end-to-end learning framework, where the inherent visual vocabularies are learned directly from the loss function. The features, dictionaries and the encoding representation for the classifier are all learned simultaneously. The representation is orderless and therefore is particularly useful for material and texture recognition. The Encoding Layer generalizes robust residual encoders such as VLAD and Fisher Vectors, and has the property of discarding domain-specific information, which makes the learned convolutional features easier to transfer. Additionally, joint training using multiple datasets of varied sizes and class labels is supported, resulting in increased recognition performance. The experimental results show superior performance as compared to state-of-the-art methods using gold-standard databases such as MINC-2500, Flickr Material Database, KTH-TIPS-2b, and two recent databases, 4D-Light-Field-Material and GTOS. The source code for the complete system is publicly available.
Deep Transfer Network (DTN) 
In recent years, deep learning models have become increasingly popular for intelligent condition monitoring, diagnosis and prognostics of mechanical systems and structures. Previous studies, however, accept by default a major assumption: that the training and testing data are drawn from the same feature distribution. Unfortunately, this assumption is mostly invalid in real applications, resulting in a certain lack of applicability for the traditional diagnosis approaches. Inspired by the idea of transfer learning, which leverages the knowledge learnt from rich labeled data in a source domain to facilitate diagnosing a new but similar target task, this paper proposes a new intelligent fault diagnosis framework, the deep transfer network (DTN), which generalizes the deep learning model to the domain adaptation scenario. By extending the marginal distribution adaptation (MDA) to joint distribution adaptation (JDA), the proposed framework can exploit the discrimination structures associated with the labeled data in the source domain to adapt the conditional distribution of unlabeled target data, and thus guarantee a more accurate distribution matching. Extensive empirical evaluations on three fault datasets validate the applicability and practicability of DTN, while achieving many state-of-the-art transfer results in terms of diverse operating conditions, fault severities and fault types.
Deep Variational Canonical Correlation Analysis (VCCA) 
We present deep variational canonical correlation analysis (VCCA), a deep multi-view learning model that extends the latent variable model interpretation of linear CCA (Bach and Jordan, 2005) to nonlinear observation models parameterized by deep neural networks (DNNs). Marginal data likelihood as well as inference are intractable under this model. We derive a variational lower bound of the data likelihood by parameterizing the posterior density of the latent variables with another DNN, and approximate the lower bound via Monte Carlo sampling. Interestingly, the resulting model resembles that of multi-view autoencoders (Ngiam et al., 2011), with the key distinction of an additional sampling procedure at the bottleneck layer. We also propose a variant of VCCA called VCCA-private which can, in addition to the ‘common variables’ underlying both views, extract the ‘private variables’ within each view. We demonstrate that VCCA-private is able to disentangle the shared and private information for multi-view data without hard supervision.
Deep Visual Explanation (DVE) 
The practical impact of deep learning on complex supervised learning problems has been significant, so much so that almost every Artificial Intelligence problem, or at least a portion thereof, has been somehow recast as a deep learning problem. The applications appeal is significant, but this appeal is increasingly challenged by what some call the challenge of explainability, or more generally the more traditional challenge of debuggability: if the outcomes of a deep learning process produce unexpected results (e.g., less than expected performance of a classifier), then there is little available in the way of theories or tools to help investigate the potential causes of such unexpected behavior, especially when this behavior could impact people’s lives. We describe a preliminary framework to help address this issue, which we call ‘deep visual explanation’ (DVE). ‘Deep,’ because it is the development and performance of deep neural network models that we want to understand. ‘Visual,’ because we believe that the most rapid insight into a complex multidimensional model is provided by appropriate visualization techniques, and ‘Explanation,’ because in the spectrum from instrumentation by inserting print statements to the abductive inference of explanatory hypotheses, we believe that the key to understanding deep learning relies on the identification and exposure of hypotheses about the performance behavior of a learned deep model. In the exposition of our preliminary framework, we use relatively straightforward image classification examples and a variety of choices on initial configuration of a deep model building scenario. By careful but not complicated instrumentation, we expose classification outcomes of deep models using visualization, and also show initial results for one potential application of interpretability. 
Deep Visual-Semantic Embedding Model (DeViSE) 
Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches state-of-the-art performance on the 1000-class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zero-shot predictions by up to 65%, achieving hit rates of up to 10% across thousands of novel labels never seen by the visual model.
Deep Web  The Deep Web, Deep Net, Invisible Web, or Hidden Web, refers to the content on the World Wide Web that is not indexed by standard search engines. Computer scientist Mike Bergman is credited with coining the term in 2000. 
DeepAM  Computer programs written in one language are often required to be ported to other languages to support multiple devices and environments. When programs use language-specific APIs (Application Programming Interfaces), it is very challenging to migrate these APIs to the corresponding APIs written in other languages. Existing approaches mine API mappings from projects that have corresponding versions in two languages. They rely on the sparse availability of bilingual projects, thus producing a limited number of API mappings. In this paper, we propose an intelligent system called DeepAM for automatically mining API mappings from a large-scale code corpus without bilingual projects. The key component of DeepAM is based on the multimodal sequence-to-sequence learning architecture that aims to learn joint semantic representations of bilingual API sequences from big source code data. Experimental results indicate that DeepAM significantly increases the accuracy of API mappings as well as the number of API mappings, when compared with the state-of-the-art approaches.
DeepArchitect  In deep learning, performance is strongly affected by the choice of architecture and hyperparameters. While there has been extensive work on automatic hyperparameter optimization for simple spaces, complex spaces such as the space of deep architectures remain largely unexplored. As a result, the choice of architecture is done manually by the human expert through a slow trial and error process guided mainly by intuition. In this paper we describe a framework for automatically designing and training deep models. We propose an extensible and modular language that allows the human expert to compactly represent complex search spaces over architectures and their hyperparameters. The resulting search spaces are treestructured and therefore easy to traverse. Models can be automatically compiled to computational graphs once values for all hyperparameters have been chosen. We can leverage the structure of the search space to introduce different model search algorithms, such as random search, Monte Carlo tree search (MCTS), and sequential modelbased optimization (SMBO). We present experiments comparing the different algorithms on CIFAR10 and show that MCTS and SMBO outperform random search. In addition, these experiments show that our framework can be used effectively for model discovery, as it is possible to describe expressive search spaces and discover competitive models without much effort from the human expert. Code for our framework and experiments has been made publicly available. 
DeepBalance  Class imbalance problems manifest in domains such as financial fraud detection or network intrusion analysis, where the prevalence of one class is much higher than another. Typically, practitioners are more interested in predicting the minority class than the majority class as the minority class may carry a higher misclassification cost. However, classifier performance deteriorates in the face of class imbalance as oftentimes classifiers may predict every point as the majority class. Methods for dealing with class imbalance include cost-sensitive learning or resampling techniques. In this paper, we introduce DeepBalance, an ensemble of deep belief networks trained with balanced bootstraps and random feature selection. We demonstrate that our proposed method outperforms baseline resampling methods such as SMOTE and under- and oversampling in metrics such as AUC and sensitivity when applied to highly imbalanced financial transaction data. Additionally, we explore performance and training time implications of various model parameters. Furthermore, we show that our model is easily parallelizable, which can reduce training times. Finally, we present an implementation of DeepBalance in R.
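The idea of a balanced bootstrap can be sketched in a few lines: draw the same number of examples from every class, with replacement, so that each ensemble member trains on a class-balanced resample. The snippet below is only an illustration in Python (the DeepBalance implementation itself is in R); the function name, the `per_class` parameter and the default of using the minority-class size are assumptions.

```python
import random
from collections import defaultdict


def balanced_bootstrap(labels, rng, per_class=None):
    """Return indices of a class-balanced bootstrap sample (hypothetical helper).

    labels:    class label of each training example
    rng:       a random.Random instance, passed in for reproducibility
    per_class: draws per class; assumed to default to the minority-class size
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    if per_class is None:
        per_class = min(len(indices) for indices in by_class.values())
    sample = []
    for label in sorted(by_class):
        # Sampling with replacement is what makes this a bootstrap.
        sample.extend(rng.choices(by_class[label], k=per_class))
    rng.shuffle(sample)
    return sample


# 95 majority-class and 5 minority-class examples.
labels = [0] * 95 + [1] * 5
indices = balanced_bootstrap(labels, random.Random(0))
```

Calling the function once per ensemble member, each with a different seed, would give every member its own balanced view of the data.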
DEEPBEAM  Multichannel speech enhancement with ad-hoc sensors has been a challenging task. Speech model guided beamforming algorithms are able to recover natural sounding speech, but the speech models tend to be oversimplified or the inference would otherwise be too complicated. On the other hand, deep learning based enhancement approaches are able to learn complicated speech distributions and perform efficient inference, but they are unable to deal with a variable number of input channels. Also, deep learning approaches introduce a lot of errors, particularly in the presence of unseen noise types and settings. We have therefore proposed an enhancement framework called DEEPBEAM, which combines the two complementary classes of algorithms. DEEPBEAM introduces a beamforming filter to produce natural sounding speech, but the filter coefficients are determined with the help of a monaural speech enhancement neural network. Experiments on synthetic and real-world data show that DEEPBEAM is able to produce clean, dry and natural sounding speech, and is robust against unseen noise.
DeepBoost  We present a new ensemble learning algorithm, DeepBoost, which can use as base classifiers a hypothesis set containing deep decision trees, or members of other rich or complex families, and succeed in achieving high accuracy without overfitting the data. The key to the success of the algorithm is a capacity-conscious criterion for the selection of the hypotheses. We give new data-dependent learning bounds for convex ensembles expressed in terms of the Rademacher complexities of the subfamilies composing the base classifier set, and the mixture weight assigned to each subfamily. Our algorithm directly benefits from these guarantees since it seeks to minimize the corresponding learning bound. We give a full description of our algorithm, including the details of its derivation, and report the results of several experiments showing that its performance compares favorably to that of AdaBoost and Logistic Regression and their L1-regularized variants.
DeepDetect  DeepDetect is a deep learning API and server written in C++11. It makes state of the art deep learning easy to work with and integrate into existing applications. 
DeepDive  DeepDive is a new type of system that enables developers to analyze data on a deeper level than ever before. DeepDive is a trained system: it uses machine learning techniques to leverage domain-specific knowledge and incorporates user feedback to improve the quality of its analysis.
DeepDSL  In recent years, Deep Learning (DL) has found great success in domains such as multimedia understanding. However, the complex nature of multimedia data makes it difficult to develop DL-based software. The state-of-the-art tools, such as Caffe, TensorFlow, Torch7, and CNTK, while successful in their applicable domains, are programming libraries with a fixed user interface, internal representation, and execution environment. This makes it difficult to implement portable and customized DL applications. In this paper, we present DeepDSL, a domain specific language (DSL) embedded in Scala, that compiles deep networks written in DeepDSL to Java source code. DeepDSL provides (1) intuitive constructs to support compact encoding of deep networks; (2) symbolic gradient derivation of the networks; (3) static analysis for memory consumption and error detection; and (4) DSL-level optimization to improve memory and runtime efficiency. DeepDSL programs are compiled into compact, efficient, customizable, and portable Java source code, which operates the CUDA and CUDNN interfaces running on Nvidia GPU via a Java Native Interface (JNI) library. We evaluated DeepDSL with a number of popular DL networks. Our experiments show that the compiled programs have very competitive runtime performance and memory efficiency compared to the existing libraries.
DeepER  Entity Resolution (ER) is a fundamental problem with many applications. Machine learning (ML)-based and rule-based approaches have been widely studied for decades, with many efforts being geared towards which features/attributes to select, which similarity functions to employ, and which blocking function to use – complicating the deployment of an ER system as a turn-key system. In this paper, we present DeepER, a turn-key ER system powered by deep learning (DL) techniques. The central idea is that distributed representations and representation learning from DL can alleviate the above human efforts for tuning existing ER systems. DeepER makes several notable contributions: encoding a tuple as a distributed representation of attribute values, building classifiers using these representations and a semantic-aware blocking based on LSH, and learning and tuning the distributed representations for ER. We evaluate our algorithms on multiple benchmark datasets and achieve competitive results while requiring minimal interaction with experts.
DeepFeat  A deep feature based saliency model (DeepFeat) is developed to leverage the understanding of the prediction of human fixations. Traditional saliency models often predict human visual attention relying on a few levels of image cues. Although such models predict fixations on a variety of image complexities, their approaches are limited to the incorporated features. In this study, we aim to provide an intuitive interpretation of convolutional neural network deep features by combining low- and high-level visual factors. We exploit four evaluation metrics to evaluate the correspondence between the proposed framework and the ground-truth fixations. The key findings of the results demonstrate that the DeepFeat algorithm, which incorporates bottom-up and top-down saliency maps, outperforms the individual bottom-up and top-down approaches. Moreover, in comparison to nine state-of-the-art saliency models, our proposed DeepFeat model achieves satisfactory performance based on all four evaluation metrics.
DeepFirearm  There are great demands for automatically regulating inappropriate appearance of shocking firearm images in social media or identifying firearm types in forensics. Image retrieval techniques have great potential to solve these problems. To facilitate research in this area, we introduce Firearm 14k, a large dataset consisting of over 14,000 images in 167 categories. It can be used for both fine-grained recognition and retrieval of firearm images. Recent advances in image retrieval are mainly driven by fine-tuning state-of-the-art convolutional neural networks for the retrieval task. The conventional single margin contrastive loss, known for its simplicity and good performance, has been widely used. We find that it performs poorly on the Firearm 14k dataset due to: (1) Loss contributed by positive and negative image pairs is unbalanced during the training process. (2) A huge domain gap exists between this dataset and ImageNet. We propose to deal with the unbalanced loss by employing a double margin contrastive loss. We tackle the domain gap issue with a two-stage training strategy, where we first fine-tune the network for classification, and then fine-tune it for retrieval. Experimental results show that our approach outperforms the conventional single margin approach by a large margin (up to 88.5% relative improvement) and even surpasses the strong triplet-loss-based approach.
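To make the difference between single and double margin contrastive losses concrete, here is a minimal per-pair sketch of one common formulation. The exact form used for Firearm 14k is not given in the abstract, so the squared hinge, the margin values and the function name are all assumptions chosen for illustration.

```python
def double_margin_contrastive_loss(distance, is_positive, pos_margin=0.5, neg_margin=1.5):
    """Loss for one image pair under a double margin scheme (illustrative form).

    distance:    embedding distance between the two images (non-negative)
    is_positive: True if both images belong to the same category
    pos_margin:  positive pairs are penalized only beyond this distance (assumed value)
    neg_margin:  negative pairs are penalized only within this distance (assumed value)
    """
    if is_positive:
        # Positive pairs that are already close enough contribute nothing.
        return max(distance - pos_margin, 0.0) ** 2
    # Negative pairs that are already far enough apart contribute nothing.
    return max(neg_margin - distance, 0.0) ** 2


# A positive pair inside its margin is ignored; one outside it is penalized.
close_positive = double_margin_contrastive_loss(0.3, True)
far_positive = double_margin_contrastive_loss(1.0, True)
near_negative = double_margin_contrastive_loss(1.0, False)
```

Setting `pos_margin=0.0` recovers the single margin variant, in which every positive pair with non-zero distance keeps contributing loss; the second margin is one way to stop already well-matched positive pairs from dominating training.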
DeepFuse  We present a novel deep learning architecture for fusing static multi-exposure images. Current multi-exposure fusion (MEF) approaches use hand-crafted features to fuse the input sequence. However, the weak hand-crafted representations are not robust to varying input conditions. Moreover, they perform poorly for extreme exposure image pairs. Thus, it is highly desirable to have a method that is robust to varying input conditions and capable of handling extreme exposure without artifacts. Deep representations are known to be robust to input conditions and have shown phenomenal performance in a supervised setting. However, the stumbling block in using deep learning for MEF was the lack of sufficient training data and an oracle to provide the ground truth for supervision. To address the above issues, we have gathered a large dataset of multi-exposure image stacks for training, and to circumvent the need for ground truth images, we propose an unsupervised deep learning framework for MEF utilizing a no-reference quality metric as the loss function. The proposed approach uses a novel CNN architecture trained to learn the fusion operation without a reference ground truth image. The model fuses a set of common low-level features extracted from each image to generate artifact-free, perceptually pleasing results. We perform extensive quantitative and qualitative evaluation and show that the proposed technique outperforms existing state-of-the-art approaches for a variety of natural images.
DeepGraph  The topological (or graph) structures of real-world networks are known to be predictive of multiple dynamic properties of the networks. Conventionally, a graph structure is represented using an adjacency matrix or a set of hand-crafted structural features. These representations either fail to highlight local and global properties of the graph or suffer from a severe loss of structural information. An effective graph representation is lacking, which hinders the realization of the predictive power of network structures. In this study, we propose to learn the representation of a graph, or the topological structure of a network, through a deep learning model. This end-to-end prediction model, named DeepGraph, takes the raw adjacency matrix of a real-world network as input and outputs a prediction of the growth of the network. The adjacency matrix is first represented using a graph descriptor based on the heat kernel signature, which is then passed through a multi-column, multi-resolution convolutional neural network. Extensive experiments on five large collections of real-world networks demonstrate that the proposed prediction model significantly improves the effectiveness of existing methods, including linear or nonlinear regressors that use hand-crafted features, graph kernels, and competing deep learning methods.
DeepJDOT  In computer vision, one is often confronted with problems of domain shifts, which occur when one applies a classifier trained on a source dataset to target data sharing similar characteristics (e.g. same classes), but also different latent data structures (e.g. different acquisition conditions). In such a situation, the model will perform poorly on the new data, since the classifier is specialized to recognize visual cues specific to the source domain. In this work we explore a solution, named DeepJDOT, to tackle this problem: through a measure of discrepancy on joint deep representations/labels based on optimal transport, we not only learn new data representations aligned between the source and target domain, but also simultaneously preserve the discriminative information used by the classifier. We applied DeepJDOT to a series of visual recognition tasks, where it compares favorably against state-of-the-art deep domain adaptation methods.
DeepLab  DeepLab is a state-of-the-art deep learning model for semantic image segmentation, where the goal is to assign semantic labels (e.g., person, dog, cat and so on) to every pixel in the input image.
DeepLearningKit  In this paper we present DeepLearningKit – an open source framework that supports using pre-trained deep learning models (convolutional neural networks) for iOS, OS X and tvOS. DeepLearningKit is developed in Metal in order to utilize the GPU efficiently and Swift for integration with applications, e.g. iOS-based mobile apps on iPhone/iPad, tvOS-based apps for the big screen, or OS X desktop applications. The goal is to support using deep learning models trained with popular frameworks such as Caffe, Torch, TensorFlow, Theano, Pylearn, Deeplearning4J and Mocha. Given the massive GPU resources and time required to train Deep Learning models, we suggest an App Store like model to distribute and download pre-trained and reusable Deep Learning models.
Deeply Supervised Object Detector (DSOD) 
We present Deeply Supervised Object Detector (DSOD), a framework that can learn object detectors from scratch. State-of-the-art object detectors rely heavily on off-the-shelf networks pre-trained on large-scale classification datasets like ImageNet, which incurs learning bias due to the differences in both the loss functions and the category distributions between classification and detection tasks. Model fine-tuning for the detection task could alleviate this bias to some extent but not fundamentally. Besides, transferring pre-trained models from classification to detection between discrepant domains is even more difficult (e.g. RGB to depth images). A better solution to tackle these two critical problems is to train object detectors from scratch, which motivates our proposed DSOD. Previous efforts in this direction mostly failed due to much more complicated loss functions and limited training data in object detection. In DSOD, we contribute a set of design principles for training object detectors from scratch. One of the key findings is that deep supervision, enabled by dense layer-wise connections, plays a critical role in learning a good detector. Combining this with several other principles, we develop DSOD following the single-shot detection (SSD) framework. Experiments on PASCAL VOC 2007, 2012 and MS COCO datasets demonstrate that DSOD can achieve better results than the state-of-the-art solutions with much more compact models. For instance, DSOD outperforms SSD on all three benchmarks with real-time detection speed, while requiring only 1/2 the parameters of SSD and 1/10 the parameters of Faster RCNN. Our code and models are available at: https://…/DSOD .
Deeply-Recursive Network (DR-ResNet) 
The estimation of crowd count in images has a wide range of applications such as video surveillance, traffic monitoring, public safety and urban planning. Recently, convolutional neural network (CNN) based approaches have been shown to be more effective in crowd counting than traditional methods that use hand-crafted features. However, the existing CNN-based methods still suffer from a large number of parameters and large storage requirements, which demand substantial storage and computing resources and thus limit real-world application. Consequently, we propose a deeply-recursive network (DR-ResNet) based on ResNet blocks for crowd counting. The recursive structure makes the network deeper while keeping the number of parameters unchanged, which enhances the network's capability to capture statistical regularities in the context of the crowd. Besides, we generate a new dataset from the video-monitoring data of Beijing bus station. Experimental results demonstrate that the proposed method outperforms most state-of-the-art methods with far fewer parameters.
Deeply-Supervised Nets (DSN) 
Our proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural network (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce a ‘companion objective’ to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layer-wise pre-training). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental results on benchmark datasets show significant performance gain over existing methods (e.g. all state-of-the-art results on MNIST, CIFAR-10, CIFAR-100, and SVHN).
DeepMatch  We study optimal covariate balance for causal inferences from observational data when rich covariates and complex relationships necessitate flexible modeling with neural networks. Standard approaches such as propensity weighting and matching/balancing fail in such settings due to miscalibrated propensity nets and inappropriate covariate representations, respectively. We propose a new method based on adversarial training of a weighting and a discriminator network that effectively addresses this methodological gap. This is demonstrated through new theoretical characterizations of the method as well as empirical results using both fully connected architectures to learn complex relationships and convolutional architectures to handle image confounders, showing how this new method can enable strong causal analyses in these challenging settings. 
DeepPath  We study the problem of learning to reason in large scale knowledge graphs (KGs). More specifically, we describe a novel reinforcement learning framework for learning multi-hop relational paths: we use a policy-based agent with continuous states based on knowledge graph embeddings, which reasons in a KG vector space by sampling the most promising relation to extend its path. In contrast to prior work, our approach includes a reward function that takes the accuracy, diversity, and efficiency into consideration. Experimentally, we show that our proposed method outperforms a path-ranking based algorithm and knowledge graph embedding methods on Freebase and Never-Ending Language Learning datasets.
DeepProbe  Information extraction and user intention identification are central topics in modern query understanding and recommendation systems. In this paper, we propose DeepProbe, a generic information-directed interaction framework which is built around an attention-based sequence to sequence (seq2seq) recurrent neural network. DeepProbe can rephrase, evaluate, and even actively ask questions, leveraging the generative ability and likelihood estimation made possible by seq2seq models. DeepProbe makes decisions based on a derived uncertainty (entropy) measure conditioned on user inputs, possibly with multiple rounds of interactions. Three applications, namely a rewriter, a relevance scorer and a chatbot for ad recommendation, were built around DeepProbe, with the first two serving as precursory building blocks for the third. We first use the seq2seq model in DeepProbe to rewrite a user query into a standard query form, which is submitted to an ordinary recommendation system. Secondly, we evaluate DeepProbe’s seq2seq model-based relevance scoring. Finally, we build a chatbot prototype capable of making active user interactions, which can ask questions that maximize information gain, allowing for a more efficient user intention identification process. We evaluate the first two applications by 1) comparing with baselines by BLEU and AUC, and 2) human judge evaluation. Both demonstrate significant improvements compared with current state-of-the-art systems, proving their value as useful tools on their own, and at the same time laying a good foundation for the ongoing chatbot application.
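The entropy-based decision mentioned above can be illustrated with a toy example: compute the Shannon entropy of a distribution over candidate intents and ask a clarifying question only when the uncertainty is high. This is a sketch of the general idea rather than DeepProbe's actual procedure; the function names, the threshold rule and its value are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution (zero entries are skipped)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_ask_question(intent_probs, threshold=0.5):
    """Hypothetical rule: ask the user a question when uncertainty exceeds a threshold."""
    return entropy(intent_probs) > threshold


# A confident prediction needs no follow-up; an ambiguous one does.
confident = should_ask_question([0.97, 0.02, 0.01])
ambiguous = should_ask_question([0.4, 0.35, 0.25])
```

In a system built around a seq2seq model, the probabilities would presumably be derived from the model's likelihood estimates for candidate rewrites or intents, conditioned on everything the user has said so far.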
DeepProbLog  We introduce DeepProbLog, a probabilistic logic programming language that incorporates deep learning by means of neural predicates. We show how existing inference and learning techniques can be adapted for the new language. Our experiments demonstrate that DeepProbLog supports both symbolic and subsymbolic representations and inference, 1) program induction, 2) probabilistic (logic) programming, and 3) (deep) learning from examples. To the best of our knowledge, this work is the first to propose a framework where generalpurpose neural networks and expressive probabilisticlogical modeling and reasoning are integrated in a way that exploits the full expressiveness and strengths of both worlds and can be trained endtoend based on examples. 
DeepRank  This paper concerns a deep learning approach to relevance ranking in information retrieval (IR). Existing deep IR models such as DSSM and CDSSM directly apply neural networks to generate ranking scores, without an explicit understanding of relevance. According to the human judgment process, a relevance label is generated by the following three steps: 1) relevant locations are detected, 2) local relevances are determined, 3) local relevances are aggregated to output the relevance label. In this paper we propose a new deep learning architecture, namely DeepRank, to simulate the above human judgment process. Firstly, a detection strategy is designed to extract the relevant contexts. Then, a measure network is applied to determine the local relevances by utilizing a convolutional neural network (CNN) or two-dimensional gated recurrent units (2D-GRU). Finally, an aggregation network with sequential integration and a term gating mechanism is used to produce a global relevance score. DeepRank captures important IR characteristics well, including exact/semantic matching signals, proximity heuristics, query term importance, and diverse relevance requirements. Experiments on both the benchmark LETOR dataset and large-scale clickthrough data show that DeepRank can significantly outperform learning-to-rank methods and existing deep learning methods. 
DeepSense  Mobile sensing applications usually require time-series inputs from sensors. Some applications, such as tracking, can use sensed acceleration and rate of rotation to calculate displacement based on physical system models. Other applications, such as activity recognition, extract manually designed features from sensor inputs for classification. Such applications face two challenges. On one hand, on-device sensor measurements are noisy. For many mobile applications, it is hard to find a distribution that exactly describes the noise in practice. Unfortunately, calculating target quantities based on physical system and noise models is only as accurate as the noise assumptions. Similarly, in classification applications, although manually designed features have proven to be effective, it is not always straightforward to find the most robust features to accommodate diverse sensor noise patterns and user behaviors. To this end, we propose DeepSense, a deep learning framework that directly addresses the aforementioned noise and feature customization challenges in a unified manner. DeepSense integrates convolutional and recurrent neural networks to exploit local interactions among similar mobile sensors, merge local interactions of different sensory modalities into global interactions, and extract temporal relationships to model signal dynamics. DeepSense thus provides a general signal estimation and classification framework that accommodates a wide range of applications. We demonstrate the effectiveness of DeepSense using three representative and challenging tasks: car tracking with motion sensors, heterogeneous human activity recognition, and user identification with biometric motion analysis. DeepSense significantly outperforms the state-of-the-art methods for all three tasks. In addition, DeepSense is feasible to implement on smartphones due to its moderate energy consumption and low latency. 
DeepSOFA  Traditional methods for assessing illness severity and predicting inhospital mortality among critically ill patients require manual, timeconsuming, and errorprone calculations that are further hindered by the use of static variable thresholds derived from aggregate patient populations. These coarse frameworks do not capture timesensitive individual physiological patterns and are not suitable for instantaneous assessment of patients’ acuity trajectories, a critical task for the ICU where conditions often change rapidly. Furthermore, they are illsuited to capitalize on the emerging availability of streaming electronic health record data. We propose a novel acuity score framework (DeepSOFA) that leverages temporal patient measurements in conjunction with deep learning models to make accurate assessments of a patient’s illness severity at any point during their ICU stay. We compare DeepSOFA with SOFA baseline models using the same predictors and find that at any point during an ICU admission, DeepSOFA yields more accurate predictions of inhospital mortality. 
DeepTag  In many under-resourced settings, clinicians lack the time and expertise to annotate patients with standard medical diagnosis codes. Veterinary medicine is an example of this: clinical encounters are largely captured in free text notes which are not labeled with diagnosis codes. The lack of such standard coding makes it challenging to apply data science to improve patient care. It is also a major impediment to translational research, where, for example, we would like to leverage veterinary data to inform drug development for humans. We develop a deep learning algorithm, DeepTag, to automatically infer diagnosis codes from veterinarian free text notes. DeepTag is trained on a newly curated dataset of 112,558 veterinary notes manually annotated by experts. DeepTag extends multi-task LSTM with an improved hierarchical objective that captures structures between diseases. To foster human-machine collaboration, DeepTag also learns to abstain on examples where it is uncertain and defer them to human experts, resulting in improved performance of the model. DeepTag accurately infers disease codes from free text even in challenging out-of-domain settings where the text comes from different clinics than the ones used for training. It enables automated disease annotation across a broad range of clinical diagnoses with minimal pre-processing. The technical framework in this work can be applied in other medical domains that currently lack medical coding infrastructure. 
DeepThin  As the industry deploys increasingly large and complex neural networks to mobile devices, more pressure is put on the memory and compute resources of those devices. Deep compression, or compression of deep neural network weight matrices, is a technique to stretch resources for such scenarios. Existing compression methods cannot effectively compress models smaller than 12% of their original size. We develop a new compression technique, DeepThin, building on existing research in the area of low rank factorization. We identify and break artificial constraints imposed by low rank approximations by combining rank factorization with a reshaping process that adds nonlinearity to the approximation function. We deploy DeepThin as a pluggable library integrated with TensorFlow that enables users to seamlessly compress models at different granularities. We evaluate DeepThin on two state-of-the-art acoustic models, TFKaldi and DeepSpeech, comparing it to previous compression work (Pruning, HashNet, and Rank Factorization), empirical limit study approaches, and hand-tuned models. For TFKaldi, our DeepThin networks show better word error rates (WER) than competing methods at practically all tested compression rates, achieving an average of 60% relative improvement over rank factorization, 57% over pruning, 23% over hand-tuned same-size networks, and 6% over the computationally expensive HashedNets. For DeepSpeech, DeepThin-compressed networks achieve better test loss than all other compression methods, reaching a 28% better result than rank factorization, 27% better than pruning, 20% better than hand-tuned same-size networks, and 12% better than HashedNets. DeepThin also provides inference performance benefits ranging from 2X to 14X speedups, depending on the compression ratio and platform cache sizes. 
DeepWalk  We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. We demonstrate DeepWalk’s latent representations on several multilabel network classification tasks for social networks such as BlogCatalog, Flickr, and YouTube. Our results show that DeepWalk outperforms challenging baselines which are allowed a global view of the network, especially in the presence of missing information. DeepWalk’s representations can provide F1 scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk’s representations are able to outperform all baseline methods while using 60% less training data. DeepWalk is also scalable. It is an online learning algorithm which builds useful incremental results, and is trivially parallelizable. These qualities make it suitable for a broad class of real world applications such as network classification, and anomaly detection. 
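The walk-generation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the adjacency-dict graph format, and the parameters `walk_length` and `walks_per_vertex` are assumptions made for the example. The resulting walks play the role of "sentences" that would then be passed to a language model such as skip-gram, which is not shown here.

```python
import random


def truncated_random_walks(adjacency, walk_length, walks_per_vertex, seed=0):
    """Generate truncated random walks starting from every vertex.

    adjacency: dict mapping each vertex to a list of its neighbors
               (hypothetical input format chosen for this sketch).
    """
    rng = random.Random(seed)
    vertices = list(adjacency)
    walks = []
    for _ in range(walks_per_vertex):
        rng.shuffle(vertices)  # visit start vertices in a fresh random order
        for start in vertices:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # dead end: truncate this walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks


# Toy undirected graph; every vertex has at least one neighbor.
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
walks = truncated_random_walks(graph, walk_length=5, walks_per_vertex=2)
```

Each walk only touches the local neighborhood of its start vertex, which is why the walks can be generated independently and in parallel.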
DeepXplore  Deep learning (DL) systems are increasingly deployed in securitycritical domains including selfdriving cars and malware detection, where the correctness and predictability of a system’s behavior for cornercase inputs are of great importance. However, systematic testing of largescale DL systems with thousands of neurons and millions of parameters for all possible cornercases is a hard problem. Existing DL testing depends heavily on manually labeled data and therefore often fails to expose different erroneous behaviors for rare inputs. We present DeepXplore, the first whitebox framework for systematically testing realworld DL systems. We address two problems: (1) generating inputs that trigger different parts of a DL system’s logic and (2) identifying incorrect behaviors of DL systems without manual effort. First, we introduce neuron coverage for estimating the parts of DL system exercised by a set of test inputs. Next, we leverage multiple DL systems with similar functionality as crossreferencing oracles and thus avoid manual checking for erroneous behaviors. We demonstrate how finding inputs triggering differential behaviors while achieving high neuron coverage for DL algorithms can be represented as a joint optimization problem and solved efficiently using gradientbased optimization techniques. DeepXplore finds thousands of incorrect cornercase behaviors in stateoftheart DL models trained on five popular datasets. For all tested DL models, on average, DeepXplore generated one test input demonstrating incorrect behavior within one second while running on a commodity laptop. The inputs generated by DeepXplore achieved 33.2% higher neuron coverage on average than existing testing methods. We further show that the test inputs generated by DeepXplore can also be used to retrain the corresponding DL model to improve classification accuracy or identify polluted training data. 
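A minimal sketch of the neuron coverage idea, assuming that a neuron counts as "covered" when its activation exceeds a threshold for at least one test input. The function name, the list-of-lists activation format, and the threshold value are illustrative assumptions, not DeepXplore's actual API.

```python
def neuron_coverage(activations, threshold=0.0):
    """Fraction of neurons activated above `threshold` by at least one input.

    activations: one row per test input, one column per neuron
                 (hypothetical format chosen for this sketch).
    """
    if not activations:
        return 0.0
    num_neurons = len(activations[0])
    covered = set()
    for row in activations:
        for idx, value in enumerate(row):
            if value > threshold:
                covered.add(idx)
    return len(covered) / num_neurons


# Two test inputs, four neurons: neurons 0 and 1 exceed 0.5 at least once.
acts = [
    [0.0, 0.9, 0.1, 0.0],
    [0.7, 0.2, 0.0, 0.0],
]
coverage = neuron_coverage(acts, threshold=0.5)
```

Adding a test input raises coverage only if it activates a neuron that no earlier input activated, which is what makes the metric useful as a search objective.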
DefenseGAN  In recent years, deep neural network approaches have been widely adopted for machine learning tasks, including classification. However, they were shown to be vulnerable to adversarial perturbations: carefully crafted small perturbations can cause misclassification of legitimate images. We propose DefenseGAN, a new framework leveraging the expressive capability of generative models to defend deep neural networks against such attacks. DefenseGAN is trained to model the distribution of unperturbed images. At inference time, it finds a close output to a given image which does not contain the adversarial changes. This output is then fed to the classifier. Our proposed method can be used with any classification model and does not modify the classifier structure or training procedure. It can also be used as a defense against any attack as it does not assume knowledge of the process for generating the adversarial examples. We empirically show that DefenseGAN is consistently effective against different attack methods and improves on existing defense strategies. Our code has been made publicly available at https://…/defensegan. 
Deferred Acceptance Algorithm (DAA) 
The Deferred Acceptance Algorithm (DAA) goes back to Gale and Shapley (1962). They introduce a rather simple algorithm that finds a stable matching, for example for college admissions or in a marriage market. In a marriage market where M men have preferences over W women, and men take the role of the proposing party, the DAA produces what is called the M-stable matching: each man weakly prefers the M-stable matching to any other stable matching. “Stable” means that no man and woman would both prefer each other to their assigned partners. This is quite a strong result. Variations of this algorithm are used in hospital assignments in the USA, whereby recently graduated doctors submit preferences over hospitals, and hospitals submit preferences over graduates. Another application is the kidney exchange, where the algorithm is used to find the best match between a set of donors and a set of receivers. matchingMarkets 
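The proposal-and-rejection loop can be sketched in a few lines. This is an illustrative implementation under simplifying assumptions (one-to-one matching, complete strict preference lists on both sides); the function and variable names are made up for the example and are unrelated to the matchingMarkets package.

```python
def deferred_acceptance(proposer_prefs, receiver_prefs):
    """Return a stable matching {proposer: receiver}, proposers proposing.

    Both arguments map each agent to a list of agents on the other side,
    ordered from most to least preferred (assumed complete and strict).
    """
    # rank[r][p] = position of proposer p in receiver r's list (lower = better)
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next list index to try
    held = {}                                     # receiver -> proposer held
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        if next_choice[p] >= len(proposer_prefs[p]):
            continue  # p has been rejected by everyone and stays unmatched
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = held.get(r)
        if current is None:
            held[r] = p                # r tentatively accepts p
        elif rank[r][p] < rank[r][current]:
            held[r] = p                # r trades up and releases `current`
            free.append(current)
        else:
            free.append(p)             # r rejects p, who proposes again later
    return {p: r for r, p in held.items()}


men = {"m1": ["w1", "w2"], "m2": ["w1", "w2"]}
women = {"w1": ["m2", "m1"], "w2": ["m1", "m2"]}
matching = deferred_acceptance(men, women)
# Both men prefer w1, but w1 prefers m2, so m1 ends up with w2.
```

Acceptances are "deferred" in the sense that a receiver only holds a proposal tentatively and may drop it when a better one arrives; nothing is final until no proposer is left free.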
Define Differential Message Importance Measure  Data collection is a fundamental problem in the scenario of big data, where the size of sampling sets plays a very important role, especially in the characterization of data structure. This paper considers the information collection process by taking message importance into account, and gives a distributionfree criterion to determine how many samples are required in big data structure characterization. Similar to differential entropy, we define differential message importance measure (DMIM) as a measure of message importance for continuous random variable. The DMIM for many common densities is discussed, and highprecision approximate values for normal distribution are given. Moreover, it is proved that the change of DMIM can describe the gap between the distribution of a set of sample values and a theoretical distribution. In fact, the deviation of DMIM is equivalent to KolmogorovSmirnov statistic, but it offers a new way to characterize the distribution goodnessoffit. Numerical results show some basic properties of DMIM and the accuracy of the proposed approximate values. Furthermore, it is also obtained that the empirical distribution approaches the real distribution with decreasing of the DMIM deviation, which contributes to the selection of suitable sampling points in actual system. 
Definition Extraction Tool (DefExt) 
We present DefExt, an easy-to-use semi-supervised Definition Extraction Tool. DefExt is designed to extract from a target corpus those textual fragments where a term is explicitly mentioned together with its core features, i.e. its definition. It works on the back of a Conditional Random Fields based sequential labeling algorithm and a bootstrapping approach. Bootstrapping enables the model to gradually become more aware of the idiosyncrasies of the target corpus. In this paper we describe the main components of the toolkit as well as experimental results stemming from both automatic and manual evaluation. We release DefExt as open source along with the necessary files to run it on any Unix machine. We also provide access to training and test data for immediate use. 
Deflated Deterministic Parallel Analysis (DDPA) 
➘ “Deterministic Parallel Analysis” 
Deformable Convolution  ➘ “Deformable Convolutional Networks” 
Deformable Convolutional Networks  Convolutional neural networks (CNNs) are inherently limited in modeling geometric transformations due to the fixed geometric structures in their building modules. In this work, we introduce two new modules to enhance the transformation modeling capacity of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks. Extensive experiments validate the effectiveness of our approach on sophisticated vision tasks of object detection and semantic segmentation. The code will be released. 
Deformable RoI Pooling  ➘ “Deformable Convolutional Networks” 
Deformable Volume Network (Devon) 
We propose a lightweight neural network model, Deformable Volume Network (Devon), for learning optical flow. Devon benefits from a multi-stage framework to iteratively refine its prediction. Each stage is by itself a neural network with an identical architecture. The optical flow between two stages is propagated with a newly proposed module, the deformable cost volume. The deformable cost volume does not distort the original images or their feature maps and therefore avoids the artifacts associated with warping, a common drawback in previous models. Devon has only one million parameters. Experiments show that Devon achieves comparable results to previous neural network models, despite its small size. 
Degradation Data Analysis  Given that products are more frequently being designed with higher reliability and developed in a shorter amount of time, it is often not possible to test new designs to failure under normal operating conditions. In some cases, it is possible to infer the reliability behavior of unfailed test samples with only the accumulated test time information and assumptions about the distribution. However, this generally leads to a great deal of uncertainty in the results. Another option in this situation is the use of degradation analysis. Degradation analysis involves the measurement of performance data that can be directly related to the presumed failure of the product in question. Many failure mechanisms can be directly linked to the degradation of part of the product, and degradation analysis allows the analyst to extrapolate to an assumed failure time based on the measurements of degradation over time. 
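As a rough sketch of the extrapolation step, the example below fits a straight line to degradation measurements and solves for the time at which the fitted line crosses a failure threshold. The linear trend, the threshold value, and all names and data are assumptions made for illustration; real degradation analyses often use exponential, power, or other models and quantify the uncertainty of the projection.

```python
def fit_line(times, values):
    """Ordinary least-squares fit: values ~ slope * times + intercept."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    s_tt = sum((t - mean_t) ** 2 for t in times)
    s_tv = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = s_tv / s_tt
    return slope, mean_v - slope * mean_t


def projected_failure_time(times, values, failure_threshold):
    """Time at which the fitted degradation line reaches the threshold."""
    slope, intercept = fit_line(times, values)
    if slope == 0:
        raise ValueError("no degradation trend; threshold is never reached")
    return (failure_threshold - intercept) / slope


# Hypothetical wear measurements (mm) on an unfailed unit at four inspections.
hours = [0, 100, 200, 300]
wear_mm = [0.0, 0.5, 1.0, 1.5]
t_fail = projected_failure_time(hours, wear_mm, failure_threshold=5.0)
# With wear growing 0.005 mm per hour, the 5.0 mm limit is projected
# at roughly 1000 hours.
```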
Degree Penalty  Network embedding aims to learn low-dimensional representations of the vertexes in a network while preserving the structure and inherent properties of the network. Existing network embedding works primarily focus on preserving the microscopic structure, such as the first- and second-order proximity of vertexes, while the macroscopic scale-free property is largely ignored. The scale-free property describes the fact that vertex degrees follow a heavy-tailed distribution (i.e., only a few vertexes have high degrees) and is a critical property of real-world networks, such as social networks. In this paper, we study the problem of learning representations for scale-free networks. We first theoretically analyze the difficulty of embedding and reconstructing a scale-free network in the Euclidean space, by converting our problem to the sphere packing problem. Then, we propose the ‘degree penalty’ principle for designing scale-free property preserving network embedding algorithms: punishing the proximity between high-degree vertexes. We introduce two implementations of our principle by utilizing spectral techniques and a skip-gram model respectively. Extensive experiments on six datasets show that our algorithms are able to not only reconstruct heavy-tailed degree distributions, but also outperform state-of-the-art embedding models in various network mining tasks, such as vertex classification and link prediction. 
Degree Weighted Lasso  DWLasso 
DeGrootFriedkin Model (DF) 
The DeGroot-Friedkin model contains two stages and studies the evolution of self-confidence, i.e., how confident an individual is in her opinions on a sequence of issues. In the first stage, individuals update their opinions on a particular issue according to the classical DeGroot model, and in the second stage, the self-confidence for the next issue is governed by the reflected appraisal mechanism. The reflected appraisal mechanism, in simple words, describes the phenomenon that individuals’ self-appraisals on some dimension (e.g., self-confidence, self-esteem) are influenced by the appraisals of other individuals on them. 
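The first stage, classical DeGroot opinion averaging, can be sketched as repeated multiplication by a row-stochastic influence matrix. The matrix, the initial opinions, and the function names below are made-up illustrations; the second stage (updating self-weights between issues through reflected appraisal) is deliberately not shown.

```python
def degroot_step(weights, opinions):
    """One DeGroot update: each individual takes a weighted average of opinions."""
    return [sum(w * x for w, x in zip(row, opinions)) for row in weights]


def degroot_consensus(weights, opinions, steps=200):
    """Iterate the DeGroot update a fixed number of times."""
    for _ in range(steps):
        opinions = degroot_step(weights, opinions)
    return opinions


# Row-stochastic influence matrix; the diagonal entries are the self-weights
# (self-confidence) that a DeGroot-Friedkin model would adapt across issues.
W = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
x0 = [1.0, 0.0, 0.0]
final = degroot_consensus(W, x0)
# Opinions converge to a consensus of about 0.25, the initial opinions
# weighted by each individual's social power (0.25, 0.5, 0.25).
```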
Delaunay Diagram  
DelayedAcceptance Markov Chain Monte Carlo (DAMCMC) 
Delayed-acceptance Markov chain Monte Carlo (DAMCMC) samples from a probability distribution via a two-stage version of the Metropolis-Hastings algorithm, combining the target distribution with a ‘surrogate’ (i.e. an approximate and computationally cheaper version) of said distribution. DAMCMC accelerates MCMC sampling in complex applications, while still targeting the exact distribution. 
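A minimal sketch of the two-stage accept/reject step, assuming a one-dimensional state and a symmetric Gaussian random-walk proposal. The target and surrogate densities, the step size, and all names are illustrative assumptions. A proposal is first screened with the cheap surrogate; only proposals that survive are evaluated on the expensive target, and the second-stage ratio corrects for the surrogate's error so that the chain still targets the exact distribution.

```python
import math
import random


def delayed_acceptance_mh(log_target, log_surrogate, x0, n_steps,
                          step_size=1.0, seed=0):
    """Two-stage Metropolis-Hastings with a symmetric random-walk proposal."""
    rng = random.Random(seed)
    x = x0
    samples = []
    full_evaluations = 0  # how often the expensive target had to be evaluated
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, step_size)
        # Stage 1: cheap screening with the surrogate only.
        log_r1 = log_surrogate(y) - log_surrogate(x)
        if rng.random() < math.exp(min(0.0, log_r1)):
            # Stage 2: correct with the true target divided by the surrogate.
            full_evaluations += 1
            log_r2 = (log_target(y) - log_target(x)) - log_r1
            if rng.random() < math.exp(min(0.0, log_r2)):
                x = y
        samples.append(x)
    return samples, full_evaluations


def log_target(x):
    """Standard normal log-density up to a constant (stand-in for a costly model)."""
    return -0.5 * x * x


def log_surrogate(x):
    """Deliberately imperfect approximation: a normal with standard deviation 1.5."""
    return -0.5 * (x / 1.5) ** 2


samples, n_full = delayed_acceptance_mh(log_target, log_surrogate,
                                        x0=0.0, n_steps=20000)
```

In a real implementation the target value at the current state would be cached rather than recomputed, so each surviving proposal costs exactly one expensive evaluation.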
DELIP  Partially observable Markov decision processes (POMDPs) are a powerful abstraction for tasks that require decision making under uncertainty, and capture a wide range of real world tasks. Today, effective planning approaches exist that generate effective strategies given blackbox models of a POMDP task. Yet, an open question is how to acquire accurate models for complex domains. In this paper we propose DELIP, an approach to model learning for POMDPs that utilizes amortized structured variational inference. We empirically show that our model leads to effective control strategies when coupled with stateoftheart planners. Intuitively, modelbased approaches should be particularly beneficial in environments with changing reward structures, or where rewards are initially unknown. Our experiments confirm that DELIP is particularly effective in this setting. 
DELSTM  We present a deep learning model, DELSTM, for the simulation of a stochastic process with underlying nonlinear dynamics. The deep learning model aims to approximate the probability density function of a stochastic process via numerical discretization and the underlying nonlinear dynamics is modeled by the Long ShortTerm Memory (LSTM) network. After the numerical discretization by a softmax function, the function estimation problem is solved by a multilabel classification problem. A penalized maximum log likelihood method is proposed to impose smoothness in the predicted probability distribution. It is shown that LSTM is a state space model, where the internal dynamics consists of a system of relaxation processes. A sequential Monte Carlo method is outlined to compute the time evolution of the probability distribution. The behavior of DELSTM is investigated by using the OrnsteinUhlenbeck process and noisy observations of MackeyGlass equation and forced Van der Pol oscillators. While the probability distribution computed by the conventional maximum log likelihood method makes a good prediction of the first and second moments, the KullbackLeibler divergence shows that the penalized maximum log likelihood method results in a probability distribution closer to the ground truth. It is shown that DELSTM makes a good prediction of the probability distribution without assuming any distributional properties of the noise. For a multiplestep forecast, it is found that the prediction uncertainty, denoted by the 95% confidence interval, does not grow monotonically in time. For a chaotic system, MackeyGlass time series, the 95% confidence interval first grows, then exhibits an oscillatory behavior, instead of growing indefinitely, while for the forced Van der Pol oscillator, the prediction uncertainty does not grow in time even for 3,000step forecast. 
Delta Epsilon Alpha Star  Delta Epsilon Alpha Star is a minimal coverage, real-time robotic search algorithm that yields a moderately aggressive search path with minimal backtracking. Search performance is bounded by placing a combinatorial bound, epsilon and delta, on the maximum deviation from the theoretical shortest path and the probability at which further deviations can occur. Additionally, we formally define the notion of PAC-admissibility, a relaxed admissibility criterion for algorithms, and show that PAC-admissible algorithms are better suited to robotic search situations than epsilon-admissible or strict algorithms. 
DelugeNets  Human brains are adept at dealing with the deluge of information they continuously receive, by suppressing the non-essential inputs and focusing on the important ones. Inspired by such capability, we propose Deluge Networks (DelugeNets), a novel class of neural networks facilitating massive cross-layer information inflows from preceding layers to succeeding layers. The connections between layers in DelugeNets are efficiently established through cross-layer depthwise convolutional layers with learnable filters, acting as a flexible selection mechanism. By virtue of the massive cross-layer information inflows, DelugeNets can propagate information across many layers with greater flexibility and utilize network parameters more effectively, compared to existing ResNet models. Experiments show the superior performance of DelugeNets in terms of both classification accuracy and parameter efficiency. Remarkably, a DelugeNet model with just 20.2M parameters achieves a state-of-the-art error of 19.02% on the CIFAR-100 dataset, outperforming a DenseNet model with 27.2M parameters. Moreover, DelugeNet performs comparably to ResNet-200 on the ImageNet dataset with merely half of the computations needed by the latter. 
Demand Sensing  Demand Sensing is a next generation forecasting method that leverages new mathematical techniques and near realtime information to create an accurate forecast of demand, based on the current realities of the supply chain. The typical performance of demand sensing systems reduces nearterm forecast error by 30% or more compared to traditional timeseries forecasting techniques. The jump in forecast accuracy helps companies manage the effects of market volatility and gain the benefits of a demanddriven supply chain, including more efficient operations, increased service levels, and a range of financial benefits including higher revenue, better profit margins, less inventory, better perfect order performance and a shorter cashtocash cycle time. Gartner, Inc. insight on demand sensing can be found in its report, “Supply Chain Strategy for Manufacturing Leaders: The Handbook for Becoming Demand Driven.” 
Deming Regression  In statistics, Deming regression, named after W. Edwards Deming, is an errorsinvariables model which tries to find the line of best fit for a twodimensional dataset. It differs from the simple linear regression in that it accounts for errors in observations on both the x and the y axis. It is a special case of total least squares, which allows for any number of predictors and a more complicated error structure. deming 
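A small sketch of the closed-form Deming fit for one predictor. The parameter `delta` is the assumed ratio of the error variance in y to the error variance in x; `delta=1` gives orthogonal regression. The function name and data are illustrative and are not the API of the deming package.

```python
import math


def deming_regression(x, y, delta=1.0):
    """Return (slope, intercept) of the Deming regression line.

    delta: ratio var(errors in y) / var(errors in x), assumed known.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x) / (n - 1)
    s_yy = sum((yi - mean_y) ** 2 for yi in y) / (n - 1)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    if s_xy == 0:
        raise ValueError("x and y are uncorrelated; slope is undefined")
    diff = s_yy - delta * s_xx
    slope = (diff + math.sqrt(diff ** 2 + 4 * delta * s_xy ** 2)) / (2 * s_xy)
    return slope, mean_y - slope * mean_x


# Noise-free points on y = 2x + 1: any delta recovers the same line.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = deming_regression(xs, ys)
```

As `delta` grows large, errors in x become negligible relative to errors in y and the estimate approaches the ordinary least-squares slope.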
Democratization  Democratization is defined as the action/development of making something accessible to everyone, to the ‘common masses.’ History provides democratization lessons from the Industrial and Information Revolutions. Both of these moments in history were driven by the standardization of parts, tools, architectures, interfaces, designs and trainings that allowed for the creation of common platforms. Instead of being dependent upon a ‘high priesthood’ of specialists to assemble your guns or cars or computer systems, organizations of all sizes were able to leverage common platforms to build their own sources of customer, business and financial differentiation. 
DempsterianShaferian Belief Network  Shenoy and Shafer {Shenoy:90} demonstrated that, both for Dempster-Shafer Theory and for probability theory, it is possible to efficiently calculate marginals of joint belief distributions (by so-called local computations) provided that the joint distribution can be decomposed (factorized) into a belief network. A number of algorithms exist for decomposing a probabilistic joint belief distribution into a Bayesian (belief) network from data. For example, Spirtes, Glymour and Scheines {Spirtes:90b} formulated a conjecture that a direct dependence test and a head-to-head meeting test would suffice to construct a Bayesian network from data in such a way that Pearl’s concept of d-separation {Geiger:90} applies. This paper is intended to transfer the approach of Spirtes, Glymour and Scheines {Spirtes:90b} onto the ground of the Dempster-Shafer Theory (DST). For this purpose, a frequentist interpretation of the DST developed in {Klopotek:93b} is exploited. A special notion of conditionality for DST is introduced and demonstrated to behave with respect to Pearl’s d-separation {Geiger:90} much the same way as conditional probability (though some differences like non-uniqueness are evident). Based on this, an algorithm analogous to that from {Spirtes:90b} is developed. The notion of a partially oriented graph (pog) is introduced and within this graph the notion of pd-separation is defined. If the direct dependence test and the head-to-head meeting test are used to orient the pog, then its pd-separation is shown to be equivalent to Pearl’s d-separation for any compatible dag. 
Dempster’s Rule of Combination  How to combine two independent sets of probability mass assignments in specific situations. In case different sources express their beliefs over the frame in terms of belief constraints such as in case of giving hints or in case of expressing preferences, then Dempster’s rule of combination is the appropriate fusion operator. dst 
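A short sketch of the rule for two mass functions over the same frame: masses of intersecting focal sets are multiplied and assigned to the intersection, mass falling on the empty set (the conflict) is discarded, and the remainder is renormalized. The dict-of-frozensets representation and the names are assumptions made for this example, not the API of the dst package.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions given as {frozenset: mass} dictionaries."""
    combined = {}
    conflict = 0.0
    for b, mass_b in m1.items():
        for c, mass_c in m2.items():
            intersection = b & c
            if intersection:
                combined[intersection] = (combined.get(intersection, 0.0)
                                          + mass_b * mass_c)
            else:
                conflict += mass_b * mass_c  # mass landing on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    return {a: mass / (1.0 - conflict) for a, mass in combined.items()}


A = frozenset({"a"})
B = frozenset({"b"})
AB = frozenset({"a", "b"})  # the whole frame, i.e. "don't know"

m1 = {A: 0.6, AB: 0.4}      # source 1 leans towards "a"
m2 = {B: 0.5, AB: 0.5}      # source 2 leans towards "b"
fused = dempster_combine(m1, m2)
# Conflict is 0.6 * 0.5 = 0.3, so the fused masses are
# {a}: 0.3/0.7, {b}: 0.2/0.7, {a, b}: 0.2/0.7.
```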
Dempster–Shafer Theory (DST) 
The Dempster-Shafer theory (DST) is a mathematical theory of evidence. It allows one to combine evidence from different sources and arrive at a degree of belief (represented by a belief function) that takes into account all the available evidence. The theory was first developed by Arthur P. Dempster and Glenn Shafer. In a narrow sense, the term Dempster-Shafer theory refers to the original conception of the theory by Dempster and Shafer. However, it is more common to use the term in the wider sense of the same general approach, as adapted to specific kinds of situations. In particular, many authors have proposed different rules for combining evidence, often with a view to handling conflicts in evidence better. dst 
Dendrogram  A dendrogram (from Greek dendron “tree” and gramma “drawing”) is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. 
Denoising Autoencoder (dA) 
The idea behind denoising autoencoders is simple. In order to force the hidden layer to discover more robust features and prevent it from simply learning the identity, we train the autoencoder to reconstruct the input from a corrupted version of it. The denoising autoencoder is a stochastic version of the autoencoder. Intuitively, a denoising autoencoder does two things: try to encode the input (preserve the information about the input), and try to undo the effect of a corruption process stochastically applied to the input of the autoencoder. The latter can only be done by capturing the statistical dependencies between the inputs. The denoising autoencoder can be understood from different perspectives (the manifold learning perspective, stochastic operator perspective, bottomup – information theoretic perspective, topdown – generative model perspective), all of which are explained in. See also section 7.2 of for an overview of autoencoders. In , the stochastic corruption process randomly sets some of the inputs (as many as half of them) to zero. Hence the denoising autoencoder is trying to predict the corrupted (i.e. missing) values from the uncorrupted (i.e., nonmissing) values, for randomly selected subsets of missing patterns. Note how being able to predict any subset of variables from the rest is a sufficient condition for completely capturing the joint distribution between a set of variables (this is how Gibbs sampling works). To convert the autoencoder class into a denoising autoencoder class, all we need to do is to add a stochastic corruption step operating on the input. The input can be corrupted in many ways, but in this tutorial we will stick to the original corruption mechanism of randomly masking entries of the input by making them zero. 
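The masking corruption step described above can be sketched on its own, independent of any particular autoencoder class. The function name, the plain-list input format, and the parameter `corruption_level` are assumptions made for this illustration; the encoder, decoder, and training loop are not shown.

```python
import random


def mask_corrupt(inputs, corruption_level, rng):
    """Zero each input entry independently with probability `corruption_level`."""
    return [0.0 if rng.random() < corruption_level else value
            for value in inputs]


rng = random.Random(0)
clean = [1.0] * 1000
noisy = mask_corrupt(clean, corruption_level=0.3, rng=rng)

# During training, the autoencoder would receive `noisy` as its input,
# while the reconstruction loss is computed against `clean`.
```

Feeding the corrupted vector in while scoring against the clean one is what prevents the network from simply learning the identity mapping.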
Denoising Random Forest  This paper proposes a novel type of random forests called a denoising random forests that are robust against noises contained in test samples. Such noisecorrupted samples cause serious damage to the estimation performances of random forests, since unexpected child nodes are often selected and the leaf nodes that the input sample reaches are sometimes far from those for a clean sample. Our main idea for tackling this problem originates from a binary indicator vector that encodes a traversal path of a sample in the forest. Our proposed method effectively employs this vector by introducing denoising autoencoders into random forests. A denoising autoencoder can be trained with indicator vectors produced from clean and noisy input samples, and nonleaf nodes where incorrect decisions are made can be identified by comparing the input and output of the trained denoising autoencoder. Multiple traversal paths with respect to the nodes with incorrect decisions caused by the noises can then be considered for the estimation. 
Dense Adaptive Cascade Forest (daForest) 
Recent research has shown that deep ensembles of forests can achieve a large increase in classification accuracy compared with general ensemble learning methods, especially when only a small amount of training data is available. In this paper, we take full advantage of this observation and introduce the Dense Adaptive Cascade Forest (daForest), which performs better than the original Cascade Forest. Notably, daForest can handle high-dimensional sparse data without any preprocessing of the raw data, such as PCA or other dimensionality reduction methods. Our model is distinguished by three major features. First, it incorporates the SAMME.R boosting algorithm; boosting gives the model the ability to keep improving as the number of layers increases, which is not possible in a stacking model or a plain cascade forest. Second, the model connects each layer to its subsequent layers in a feed-forward fashion, which to some extent enhances its ability to resist degeneration. We use the term degeneration for the phenomenon, observed when training stacking models, in which accuracy rises slightly over the first few layers as the number of layers grows and then drops quickly. Third, we add a hyperparameter optimization layer before the first classification layer of the proposed deep model; it searches for the optimal hyperparameters and sets up the model in a brief period, nearly halving the training time without much impact on final performance. Experimental results show that daForest performs particularly well on both high-dimensional low-order features and low-dimensional high-order features, in some cases even outperforming neural networks and achieving state-of-the-art results. 
Dense Convolutional Network (DenseNet) 
Recent work has shown that convolutional networks can be substantially deeper, more accurate and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper we embrace this observation and introduce the Dense Convolutional Network (DenseNet), where each layer is directly connected to every other layer in a feedforward fashion. Whereas traditional convolutional networks with L layers have L connections, one between each layer and its subsequent layer (treating the input as layer 0), our network has L(L+1)/2 direct connections. For each layer, the feature maps of all preceding layers are treated as separate inputs whereas its own feature maps are passed on as inputs to all subsequent layers. Our proposed connectivity pattern has several compelling advantages: it alleviates the vanishing gradient problem and strengthens feature propagation; despite the increase in connections, it encourages feature reuse and leads to a substantial reduction of parameters; its models tend to generalize surprisingly well. We evaluate our proposed architecture on five highly competitive object recognition benchmark tasks. The DenseNet obtains significant improvements over the stateoftheart on all five of them (e.g., yielding 3.74% test error on CIFAR10, 19.25% on CIFAR100 and 1.59% on SVHN). 
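The connection count quoted above follows from each layer receiving the outputs of all layers before it. The sketch below simply counts those links; the function name is an assumption for this example and nothing here comes from the paper's code.

```python
def dense_connection_count(num_layers):
    """Count direct connections when layer l (1..L) reads from layers 0..l-1."""
    # Layer l has exactly l incoming connections (the input counts as layer 0).
    return sum(l for l in range(1, num_layers + 1))

L = 5
plain = L                             # one connection per layer in a plain chain
dense = dense_connection_count(L)     # equals L * (L + 1) // 2
```

For `L = 5` the plain chain has 5 connections and the dense pattern has 15.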
Dense Transformer Networks  The key idea of current deep learning methods for dense prediction is to apply a model on a regular patch centered on each pixel to make pixelwise predictions. These methods are limited in the sense that the patches are determined by network architecture instead of learned from data. In this work, we propose the dense transformer networks, which can learn the shapes and sizes of patches from data. The dense transformer networks employ an encoderdecoder architecture, and a pair of dense transformer modules are inserted into each of the encoder and decoder paths. The novelty of this work is that we provide technical solutions for learning the shapes and sizes of patches from data and efficiently restoring the spatial correspondence required for dense prediction. The proposed dense transformer modules are differentiable, thus the entire network can be trained. We apply the proposed networks on natural and biological image segmentation tasks and show superior performance is achieved in comparison to baseline methods. 
Densely Connected Convolutional Network (DenseNet) 
Classical approaches for estimating optical flow have achieved rapid progress in the last decade. However, most of them are too slow to be applied in realtime video analysis. Due to the great success of deep learning, recent work has focused on using CNNs to solve such dense prediction problems. In this paper, we investigate a new deep architecture, Densely Connected Convolutional Networks (DenseNet), to learn optical flow. This specific architecture is ideal for the problem at hand as it provides shortcut connections throughout the network, which leads to implicit deep supervision. We extend current DenseNet to a fully convolutional network to learn motion estimation in an unsupervised manner. Evaluation results on three standard benchmarks demonstrate that DenseNet is a better fit than other widely adopted CNN architectures for optical flow estimation. 
DenseNMT  Recently, neural machine translation has achieved remarkable progress by introducing welldesigned deep neural networks into its encoderdecoder framework. From the optimization perspective, residual connections are adopted to improve learning performance for both encoder and decoder in most of these deep architectures, and advanced attention connections are applied as well. Inspired by the success of the DenseNet model in computer vision problems, in this paper, we propose a densely connected NMT architecture (DenseNMT) that is able to train more efficiently for NMT. The proposed DenseNMT not only allows dense connection in creating new features for both encoder and decoder, but also uses the dense attention structure to improve attention quality. Our experiments on multiple datasets show that DenseNMT structure is more competitive and efficient. 
Density-based spatial clustering of applications with noise (DBSCAN) 
Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of corresponding nodes. DBSCAN is one of the most common clustering algorithms and also one of the most cited in the scientific literature. OPTICS can be seen as a generalization of DBSCAN to multiple ranges, effectively replacing the ε parameter with a maximum search radius. dbscan 
Deontic Logic  Deontic logic is the field of philosophical logic that is concerned with obligation, permission, and related concepts. Alternatively, a deontic logic is a formal system that attempts to capture the essential logical features of these concepts. Typically, a deontic logic uses OA to mean it is obligatory that A, (or it ought to be (the case) that A), and PA to mean it is permitted (or permissible) that A. Stanford Encyclopedia of Philosophy:Deontic Logic 
Dependence Modeling  ➚ “Copula” http://…/9781466583221 http://…/7699_chap01.pdf 
Dependency Network  The dependency network approach provides a new system level analysis of the activity and topology of directed networks. The approach extracts causal topological relations between the network’s nodes (when the network structure is analyzed), and provides an important step towards inference of causal activity relations between the network nodes (when analyzing the network activity). This methodology has originally been introduced for the study of financial data, it has been extended and applied to other systems, such as the immune system, and semantic networks. In the case of network activity, the analysis is based on partial correlations, which are becoming ever more widely used to investigate complex systems. In simple words, the partial (or residual) correlation is a measure of the effect (or contribution) of a given node, say j, on the correlations between another pair of nodes, say i and k. Using this concept, the dependency of one node on another node, is calculated for the entire network. This results in a directed weighted adjacency matrix, of a fully connected network. Once the adjacency matrix has been constructed, different algorithms can be used to construct the network, such as a threshold network, Minimal Spanning Tree (MST), Planar Maximally Filtered Graph (PMFG), and others. 
DeployR  DeployR is a serverbased framework that provides simple, secure R integration for application developers. It’s available in two editions: DeployR Open, which is free and opensource; and Revolution R Enterprise DeployR, which adds a scalable grid framework and enterprise authentication features for production applications integrated with R. If you’re looking for an overview of what DeployR is and how you can use it to access R from other applications, we’ve just released a new white paper, Using DeployR to Solve the R Integration Problem. DeployR Data I/O 
Depth-first Search (DFS) 
Depth-first search (DFS) is an algorithm for traversing or searching tree or graph data structures. One starts at the root (selecting some arbitrary node as the root in the case of a graph) and explores as far as possible along each branch before backtracking. A version of depth-first search was investigated in the 19th century by French mathematician Charles Pierre Trémaux as a strategy for solving mazes. 
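The traversal described above can be sketched with an explicit stack. This is one common iterative formulation, assuming the graph is given as an adjacency-list dictionary; the toy graph and the function name `dfs` are made up for the example.

```python
def dfs(graph, root):
    """Return nodes in depth-first order, starting from root."""
    visited_order = []
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        visited_order.append(node)
        # Push neighbours in reverse so the first-listed neighbour is explored first.
        for neighbour in reversed(graph.get(node, [])):
            if neighbour not in seen:
                stack.append(neighbour)
    return visited_order

graph = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": ["F"], "F": []}
order = dfs(graph, "A")   # ['A', 'B', 'D', 'E', 'F', 'C']
```

The search follows A, B, D to a dead end, backtracks to E and F, and only then visits C.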
Descriptive Statistics  Descriptive statistics is the discipline of quantitatively describing the main features of a collection of information, or the quantitative description itself. Descriptive statistics are distinguished from inferential statistics (or inductive statistics), in that descriptive statistics aim to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, are not developed on the basis of probability theory. Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example in a paper reporting on a study involving human subjects, there typically appears a table giving the overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure group), and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, and the proportion of subjects with related comorbidities. 
Design of Experiments (DoE) 
The design of experiments (DOE, DOX, or experimental design) is the design of any task that aims to describe or explain the variation of information under conditions that are hypothesized to reflect the variation. The term is generally associated with experiments in which the design introduces conditions that directly affect the variation, but may also refer to the design of quasiexperiments, in which natural conditions that influence the variation are selected for observation. In its simplest form, an experiment aims at predicting the outcome by introducing a change of the preconditions, which is represented by one or more independent variables, also referred to as ‘input variables’ or ‘predictor variables.’ The change in one or more independent variables is generally hypothesized to result in a change in one or more dependent variables, also referred to as ‘output variables’ or ‘response variables.’ The experimental design may also identify control variables that must be held constant to prevent external factors from affecting the results. Experimental design involves not only the selection of suitable independent, dependent, and control variables, but planning the delivery of the experiment under statistically optimal conditions given the constraints of available resources. There are multiple approaches for determining the set of design points (unique combinations of the settings of the independent variables) to be used in the experiment. Main concerns in experimental design include the establishment of validity, reliability, and replicability. For example, these concerns can be partially addressed by carefully choosing the independent variable, reducing the risk of measurement error, and ensuring that the documentation of the method is sufficiently detailed. Related concerns include achieving appropriate levels of statistical power and sensitivity. Correctly designed experiments advance knowledge in the natural and social sciences and engineering. 
Other applications include marketing and policy making. 
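One of the simplest ways to generate the set of design points mentioned above is a full factorial design, which enumerates every combination of factor levels. The sketch below assumes hypothetical factors (`temperature_c`, `time_min`, `catalyst`) chosen purely for illustration.

```python
from itertools import product

# Hypothetical independent variables and their levels (illustrative only).
factors = {
    "temperature_c": [150, 180],
    "time_min": [10, 20, 30],
    "catalyst": ["A", "B"],
}

def full_factorial(factors):
    """Return every unique combination of factor levels as a list of dicts."""
    names = list(factors)
    level_lists = (factors[name] for name in names)
    return [dict(zip(names, levels)) for levels in product(*level_lists)]

design = full_factorial(factors)   # 2 * 3 * 2 = 12 design points
```

Full factorial designs grow multiplicatively with the number of factors, which is one reason other approaches to choosing design points exist.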
DesignExecuteExamineDeploy – Framework (DEED) 

Despeckling Residual Neural Network (DRNN) 
Unsupervised Despeckling 
DetailPreserving Pooling (DPP) 
Most convolutional neural networks use some method for gradually downscaling the size of the hidden layers. This is commonly referred to as pooling, and is applied to reduce the number of parameters, improve invariance to certain distortions, and increase the receptive field size. Since pooling by nature is a lossy process, it is crucial that each such layer maintains the portion of the activations that is most important for the network’s discriminability. Yet, simple maximization or averaging over blocks, max or average pooling, or plain downsampling in the form of strided convolutions are the standard. In this paper, we aim to leverage recent results on image downscaling for the purposes of deep learning. Inspired by the human visual system, which focuses on local spatial changes, we propose detailpreserving pooling (DPP), an adaptive pooling method that magnifies spatial changes and preserves important structural detail. Importantly, its parameters can be learned jointly with the rest of the network. We analyze some of its theoretical properties and show its empirical benefits on several datasets and networks, where DPP consistently outperforms previous pooling approaches. 
Detection Network (DetNet) 
In this paper we consider Multiple-Input-Multiple-Output (MIMO) detection using deep neural networks. We introduce two different deep architectures: a standard fully connected multi-layer network, and a Detection Network (DetNet) which is specifically designed for the task. The structure of DetNet is obtained by unfolding the iterations of a projected gradient descent algorithm into a network. We compare the accuracy and runtime complexity of the proposed approaches and achieve state-of-the-art performance while maintaining low computational requirements. Furthermore, we manage to train a single network to detect over an entire distribution of channels. Finally, we consider detection with soft outputs and show that the networks can easily be modified to produce soft decisions. 
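To make the "unfolding" idea concrete, the sketch below shows a plain projected gradient descent detector for the model y = Hx + noise with x in {-1, +1}^K. It is not DetNet itself: the signal model, the box projection, the fixed step size, and all names are assumptions for this example. DetNet would replace each iteration with a network layer that has learned parameters.

```python
import numpy as np

def projected_gradient_detect(H, y, num_iters=200):
    """Approximately minimise ||y - Hx||^2 over the box [-1, 1]^K, then round to +/-1."""
    step = 1.0 / np.linalg.norm(H, 2) ** 2      # 1 / largest eigenvalue of H^T H
    x = np.zeros(H.shape[1])
    for _ in range(num_iters):
        grad = H.T @ (H @ x - y)                # gradient of 0.5 * ||y - Hx||^2
        x = np.clip(x - step * grad, -1.0, 1.0) # projection onto the box
    return np.sign(x)

rng = np.random.default_rng(0)
N, K = 20, 4                                    # receive antennas, transmitted symbols
H = rng.standard_normal((N, K))                 # toy channel matrix
x_true = rng.choice([-1.0, 1.0], size=K)
y = H @ x_true + 0.01 * rng.standard_normal(N)  # low-noise observation
x_hat = projected_gradient_detect(H, y)
```

Each loop iteration has the same form (a linear step followed by a projection), which is what makes it natural to treat the iterations as layers.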
Determinantal Point Process (DPP) 
In mathematics, a determinantal point process is a stochastic point process, the probability distribution of which is characterized as a determinant of some function. Such processes arise as important tools in random matrix theory, combinatorics, and physics. Improving the Diversity of TopN Recommendation via Determinantal Point Process Optimized Algorithms to Sample Determinantal Point Processes 
Deterministic Heron Inference (Heron) 
Bayesian graphical models have been shown to be a powerful tool for discovering uncertainty and causal structure from realworld data in many application fields. Current inference methods primarily follow different kinds of tradeoffs between computational complexity and predictive accuracy. At one end of the spectrum, variational inference approaches perform well in computational efficiency, while at the other end, Gibbs sampling approaches are known to be relatively accurate for prediction in practice. In this paper, we extend an existing Gibbs sampling method, and propose a new deterministic Heron inference (Heron) for a family of Bayesian graphical models. In addition to the support for nontrivial distributability, one more benefit of Heron is that it is able to not only allow us to easily assess the convergence status but also largely improve the running efficiency. We evaluate Heron against the standard collapsed Gibbs sampler and stateoftheart state augmentation method in inference for wellknown graphical models. Experimental results using publicly available reallife data have demonstrated that Heron significantly outperforms the baseline methods for inferring Bayesian graphical models. 
Deterministic Parallel Analysis (DPA) 
Factor analysis is widely used in many application areas. The first step, choosing the number of factors, remains a serious challenge. One of the most popular methods is parallel analysis (PA), which compares the observed factor strengths to simulated ones under a noise-only model. This paper presents a deterministic version of PA (DPA), which is faster and more reproducible than PA. We show that DPA selects large factors and does not select small factors, just as [Dobriban, 2017] shows for PA. Both PA and DPA are prone to a shadowing phenomenon in which a strong factor makes it hard to detect smaller but more interesting factors. We develop a deflated version of DPA (DDPA) that counters shadowing. By raising the decision threshold in DDPA, a new method (DDPA+) also improves estimation accuracy. We illustrate our methods on data from the Human Genome Diversity Project (HGDP). There PA and DPA select seemingly too many factors, while DDPA+ selects only a few. A Matlab implementation is available. 
Deterministic Statistical Machine  A system in which, for example, you input a data set and then specify the question you are asking (Is variable Y related to variable X? Can I predict Z from W?). Depending on the question, it uses a deterministic set of methods to analyze the data: say, regression for inference, linear discriminant analysis for prediction, etc. The method is fixed and deterministic for each question. It also performs a prespecified set of checks for outliers, confounders, missing data, maybe even data fudging. It generates a report with a markdown tool and then immediately publishes the result. 
Deterministic Stretchy Regression  An extension of the regularized leastsquares in which the estimation parameters are stretchable is introduced and studied in this paper. The solution of this ridge regression with stretchable parameters is given in primal and dual spaces and in closedform. Essentially, the proposed solution stretches the covariance computation by a power term, thereby compressing or amplifying the estimation parameters. To maintain the computation of power root terms within the real space, an input transformation is proposed. The results of an empirical evaluation in both synthetic and realworld data illustrate that the proposed method is effective for compressive learning with highdimensional data. 
Determinized Sparse Partially Observable Tree (DESPOT) 
The partially observable Markov decision process (POMDP) provides a principled general framework for planning under uncertainty, but solving POMDPs optimally is computationally intractable, due to the ‘curse of dimensionality’ and the ‘curse of history’. To overcome these challenges, we introduce the Determinized Sparse Partially Observable Tree (DESPOT), a sparse approximation of the standard belief tree, for online planning under uncertainty. A DESPOT focuses online planning on a set of randomly sampled scenarios and compactly captures the ‘execution’ of all policies under these scenarios. We show that the best policy obtained from a DESPOT is nearoptimal, with a regret bound that depends on the representation size of the optimal policy. Leveraging this result, we give an anytime online planning algorithm, which searches a DESPOT for a policy that optimizes a regularized objective function. Regularization balances the estimated value of a policy under the sampled scenarios and the policy size, thus avoiding overfitting. The algorithm demonstrates strong experimental results, compared with some of the best online POMDP algorithms available. It has also been incorporated into an autonomous driving system for realtime vehicle control. 
DetMCD Algorithm  Most algorithms for highly robust estimators of multivariate location and scatter start by drawing a large number of random subsets. For instance, the FAST-MCD algorithm of Rousseeuw and Van Driessen starts in this way, and then takes so-called concentration steps to obtain a more accurate approximation to the MCD. The FAST-MCD algorithm is affine equivariant but not permutation invariant. In this article, we present a deterministic algorithm, denoted as DetMCD, which does not use random subsets and is even faster. It computes a small number of deterministic initial estimators, followed by concentration steps. DetMCD is permutation invariant and very close to affine equivariant. DetMCD 
Detrended Fluctuation Analysis (DFA) 
In stochastic processes, chaos theory and time series analysis, detrended fluctuation analysis (DFA) is a method for determining the statistical self-affinity of a signal. It is useful for analysing time series that appear to be long-memory processes (diverging correlation time, e.g. power-law decaying autocorrelation function) or 1/f noise. The obtained exponent is similar to the Hurst exponent, except that DFA may also be applied to signals whose underlying statistics (such as mean and variance) or dynamics are non-stationary (changing with time). It is related to measures based upon spectral techniques such as autocorrelation and Fourier transform. Peng et al. introduced DFA in 1994 in a paper that has been cited over 2000 times as of 2013 and represents an extension of the (ordinary) fluctuation analysis (FA), which is affected by non-stationarities. 
Deviance  In statistics, deviance is a quality-of-fit statistic for a model that is often used for statistical hypothesis testing. It is a generalization of the idea of using the sum of squares of residuals in ordinary least squares to cases where model-fitting is achieved by maximum likelihood. 
Deviance Information Criterion (DIC) 
The deviance information criterion (DIC) is a hierarchical modeling generalization of the AIC (Akaike information criterion) and BIC (Bayesian information criterion, also known as the Schwarz criterion). It is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. Like AIC and BIC it is an asymptotic approximation as the sample size becomes large. It is only valid when the posterior distribution is approximately multivariate normal. The idea is that models with smaller DIC should be preferred to models with larger DIC. 
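For reference, the commonly used definition of DIC can be written as follows. The entry above does not state it, so the notation here is an assumption: θ denotes the model parameters, y the data, C a constant that cancels when comparing models, and the expectation is taken over the posterior (in practice, over the MCMC samples).

```latex
D(\theta) = -2 \log p(y \mid \theta) + C, \qquad
\bar{D} = \mathbb{E}_{\theta \mid y}\left[ D(\theta) \right], \qquad
p_D = \bar{D} - D(\bar{\theta}), \qquad
\mathrm{DIC} = \bar{D} + p_D = D(\bar{\theta}) + 2\,p_D
```

Here \bar{D} rewards fit, while p_D (the effective number of parameters) penalises complexity, which is why a smaller DIC is preferred.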
Dex  This paper introduces Dex, a reinforcement learning environment toolkit specialized for training and evaluation of continual learning methods as well as general reinforcement learning problems. We also present the novel continual learning method of incremental learning, where a challenging environment is solved using optimal weight initialization learned from first solving a similar easier environment. We show that incremental learning can produce vastly superior results than standard methods by providing a strong baseline method across ten Dex environments. We finally develop a saliency method for qualitative analysis of reinforcement learning, which shows the impact incremental learning has on network attention. 
Dfuntest  New ideas in distributed systems (algorithms or protocols) are commonly tested by simulation, because experimenting with a prototype deployed on a realistic platform is cumbersome. However, a prototype not only measures performance but also verifies assumptions about the underlying system. We developed dfuntest – a testing framework for distributed applications that defines abstractions and test structure, and automates experiments on distributed platforms. Dfuntest aims to be jUnit’s analogue for distributed applications; a framework that enables the programmer to write robust and flexible scenarios of experiments. Dfuntest requires minimal bindings that specify how to deploy and interact with the application. Dfuntest’s abstractions allow execution of a scenario on a single machine, a cluster, a cloud, or any other distributed infrastructure, e.g. on PlanetLab. A scenario is a procedure; thus, our framework can be used both for functional tests and for performance measurements. We show how to use dfuntest to deploy our DHT prototype on 60 PlanetLab nodes and verify whether the prototype maintains a correct topology. 
dhSegment  In recent years there have been multiple successful attempts at tackling document processing problems separately by designing task-specific hand-tuned strategies. We argue that the diversity of historical document processing tasks makes it impractical to solve them one at a time and shows a need for generic approaches that can handle the variability of historical series. In this paper, we address multiple tasks simultaneously, such as page extraction, baseline extraction, layout analysis, and the extraction of multiple typologies of illustrations and photographs. We propose an open-source implementation of a CNN-based pixel-wise predictor coupled with task-dependent post-processing blocks. We show that a single CNN architecture can be used across tasks with competitive results. Moreover, most of the task-specific post-processing steps can be decomposed into a small number of simple and standard reusable operations, adding to the flexibility of our approach. 
Diagonalwise Refactorization  Depthwise convolutions provide significant performance benefits owing to the reduction in both parameters and multadds. However, training depthwise convolution layers with GPUs is slow in current deep learning frameworks because their implementations cannot fully utilize the GPU capacity. To address this problem, in this paper we present an efficient method (called diagonalwise refactorization) for accelerating the training of depthwise convolution layers. Our key idea is to rearrange the weight vectors of a depthwise convolution into a large diagonal weight matrix so as to convert the depthwise convolution into one single standard convolution, which is well supported by the cuDNN library that is highlyoptimized for GPU computations. We have implemented our training method in five popular deep learning frameworks. Evaluation results show that our proposed method gains $15.4\times$ training speedup on Darknet, $8.4\times$ on Caffe, $5.4\times$ on PyTorch, $3.5\times$ on MXNet, and $1.4\times$ on TensorFlow, compared to their original implementations of depthwise convolutions. 
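The key idea above, that a depthwise convolution equals a standard convolution whose weight is zero everywhere except on the channel diagonal, can be checked numerically. The sketch below is a 1D NumPy illustration under that reading of the abstract; it is not the authors' implementation, and all function names are assumptions for this example.

```python
import numpy as np

def depthwise_conv1d(x, w):
    """x: (C, T), w: (C, K). Each channel is filtered only by its own kernel."""
    C, T = x.shape
    K = w.shape[1]
    out = np.zeros((C, T - K + 1))
    for c in range(C):
        for t in range(T - K + 1):
            out[c, t] = np.dot(w[c], x[c, t:t + K])
    return out

def standard_conv1d(x, W):
    """x: (C_in, T), W: (C_out, C_in, K). Each output channel sums over all inputs."""
    C_out, _, K = W.shape
    T = x.shape[1]
    out = np.zeros((C_out, T - K + 1))
    for t in range(T - K + 1):
        out[:, t] = np.einsum("oik,ik->o", W, x[:, t:t + K])
    return out

def diagonalize(w):
    """Place the C depthwise kernels on the diagonal of a (C, C, K) weight tensor."""
    C, K = w.shape
    W = np.zeros((C, C, K))
    idx = np.arange(C)
    W[idx, idx, :] = w
    return W

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 10))
w = rng.standard_normal((3, 4))
W = diagonalize(w)
same = np.allclose(depthwise_conv1d(x, w), standard_conv1d(x, W))
```

The diagonal weight tensor does more arithmetic on paper (most entries are zero), so the speedups reported above come from using a heavily optimized standard-convolution kernel rather than from fewer operations.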
Diagram Generating Function (DGF) 
The recentlyintroduced selflearning Monte Carlo method is a generalpurpose numerical method that speeds up Monte Carlo simulations by training an effective model to propose uncorrelated configurations in the Markov chain. We implement this method in the framework of continuous time Monte Carlo method with auxiliary field in quantum impurity models. We introduce and train a diagram generating function (DGF) to model the probability distribution of auxiliary field configurations in continuous imaginary time, at all orders of diagrammatic expansion. By using DGF to propose global moves in configuration space, we show that the selflearning continuoustime Monte Carlo method can significantly reduce the computational complexity of the simulation. 
Diceware  Diceware is a method for creating passphrases, passwords, and other cryptographic variables using ordinary dice as a hardware random number generator. For each word in the passphrase, five rolls of a die are required. The numbers from 1 to 6 that come up in the rolls are assembled as a five-digit number, e.g. 43146. That number is then used to look up a word in a word list. In the English list 43146 corresponds to munch. Lists have been compiled for several languages, including English, Finnish, German, Italian, Polish, Romanian, Russian, Spanish and Swedish. A Diceware word list is any list of 6^5 = 7,776 unique words, preferably ones the user will find easy to spell and to remember. The contents of the word list do not have to be protected or concealed in any way, as the security of a Diceware passphrase lies in the number of words selected and the number of words each selected word could be taken from. The level of unpredictability of a Diceware passphrase can be easily calculated: each word adds 12.9 bits of entropy to the passphrase (that is, $\log_2(6^5)$ bits). Originally, in 1995, Diceware creator Arnold Reinhold considered five words (64 bits) the minimum length needed by average users. Since 2014, however, Reinhold has recommended that at least six words (77 bits) be used. This level of unpredictability assumes that a potential attacker knows three things: that Diceware has been used to generate the passphrase, the particular word list used, and exactly how many words make up the passphrase. If the attacker has less information, the entropy can be greater than 12.9 bits per word. If words were simply concatenated rather than separated by spaces, concatenating could form words that are already in the word list. For example, ‘in’ and ‘put’ form ‘input’; all three words can be found in the above-mentioned word list. 
This could slightly decrease the entropy, when compared with the recommended method of using spaces to separate each word in the list. riceware 
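The entropy figures quoted in this entry can be reproduced directly. The sketch below also simulates the five rolls for one word using Python's `secrets` module as a stand-in for physical dice; that substitution and the function names are assumptions for this example, and no real word list is included.

```python
import math
import secrets

WORDS_IN_LIST = 6 ** 5                        # 7,776 possible five-roll outcomes
BITS_PER_WORD = math.log2(WORDS_IN_LIST)      # about 12.9 bits

def passphrase_entropy_bits(num_words):
    """Entropy when the attacker knows the method, the list, and the word count."""
    return num_words * BITS_PER_WORD

def roll_key():
    """Simulate five die rolls and join them into a lookup key such as '43146'."""
    return "".join(str(secrets.randbelow(6) + 1) for _ in range(5))

five_word_bits = passphrase_entropy_bits(5)   # about 64.6
six_word_bits = passphrase_entropy_bits(6)    # about 77.5
key = roll_key()
```

The entry's figures of 64 and 77 bits for five and six words are these values rounded down.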
dIconomy  “dIconomy” = “Digital Economy” 
Dictionary Learning (DL) 
Dictionary Learning Algorithms for Sparse Representation Dictionary Learning 
Dictionary Learning – Separating the Particularity and the Commonality (DLCOPAR) 
Empirically, we find that, despite the classspecific features owned by the objects appearing in the images, the objects from different categories usually share some common patterns, which do not contribute to the discrimination of them. Concentrating on this observation and under the general dictionary learning (DL) framework, we propose a novel method to explicitly learn a common pattern pool (the commonality) and classspecific dictionaries (the particularity) for classification. We call our method DLCOPAR, which can learn the most compact and most discriminative classspecific dictionaries used for classification. The proposed DLCOPAR is extensively evaluated both on synthetic data and on benchmark image databases in comparison with existing DLbased classification methods. The experimental results demonstrate that DLCOPAR achieves very promising performances in various applications, such as face recognition, handwritten digit recognition, scene classification and object recognition. 
DIDMDN  Single image rain streak removal is an extremely challenging problem due to the presence of non-uniform rain densities in images. We present a novel density-aware multi-stream densely connected convolutional neural network-based algorithm, called DIDMDN, for joint rain density estimation and deraining. The proposed method enables the network itself to automatically determine the rain-density information and then efficiently remove the corresponding rain streaks guided by the estimated rain-density label. To better characterize rain streaks with different scales and shapes, a multi-stream densely connected deraining network is proposed which efficiently leverages features from different scales. Furthermore, a new dataset containing images with rain-density labels is created and used to train the proposed density-aware network. Extensive experiments on synthetic and real datasets demonstrate that the proposed method achieves significant improvements over the recent state-of-the-art methods. In addition, an ablation study is performed to demonstrate the improvements obtained by different modules in the proposed method. Code can be found at: https://…/hezhangsprinter 
Difference Compression  Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: quantization for low bandwidth networks, and decentralization for high latency networks. In this paper, we explore a natural question: can the combination of both decentralization and quantization lead to a system that is robust to both bandwidth and latency? Although the system implication of such a combination is trivial, the underlying theoretical principle and algorithm design are challenging: simply quantizing data sent in a decentralized training algorithm would accumulate the error. In this paper, we develop a framework of quantized, decentralized training and propose two different strategies, which we call extrapolation compression and difference compression. We analyze both algorithms and prove both converge at the rate of $O(1/\sqrt{nT})$ where $n$ is the number of workers and $T$ is the number of iterations, matching the convergence rate for full precision, centralized training. We evaluate our algorithms on training deep learning models, and find that our proposed algorithm significantly outperforms the best of the merely decentralized and merely quantized algorithms for networks with both high latency and low bandwidth. 
Difference of Convex Functions Algorithm (DCA) 
DC programming and its DC algorithm (DCA) address the problem of minimizing a function $f = g - h$ (with $g, h$ being lower semicontinuous proper convex functions on $\mathbb{R}^n$) on the whole space. Based on local optimality conditions and DC duality, DCA has been successfully applied to many different nondifferentiable nonconvex optimization problems, for which it quite often gave global solutions and proved to be more robust and more efficient than related standard methods, especially in the large-scale setting. The computational efficiency of DCA suggests a deeper and more complete study of DC programming, using the special class of DC programs (when either $g$ or $h$ is polyhedral convex) called polyhedral DC programs. A DCA-Like Algorithm and its Accelerated Version with Application in Data Visualization 
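As a rough illustration of the iteration, the standard DCA step linearizes $h$ at the current point and minimizes the remaining convex model: take $y_k \in \partial h(x_k)$, then $x_{k+1} \in \arg\min_x \{ g(x) - \langle y_k, x \rangle \}$. The sketch below applies this to a toy one-dimensional problem chosen purely for illustration (it does not come from the entry above): $f(x) = x^4 - x^2$ with $g(x) = x^4$ and $h(x) = x^2$.

```python
import math


def dca_quartic(x0: float, iters: int = 50) -> float:
    """Run DCA on the toy function f(x) = x**4 - x**2.

    Decomposition (an illustrative choice): g(x) = x**4, h(x) = x**2.
    """
    x = x0
    for _ in range(iters):
        y = 2.0 * x  # y_k = h'(x_k), the gradient of h at the current point
        # x_{k+1} minimizes the convex model g(x) - y*x, i.e. solves 4*x**3 = y.
        x = math.copysign(abs(y / 4.0) ** (1.0 / 3.0), y)
    return x
```

Started from a positive point, the iterates approach $1/\sqrt{2}$, a global minimizer of this toy function; started exactly at 0 they stay at 0, which is a critical point but not a minimizer. In general DCA only guarantees convergence to a critical point.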
Differences-in-Differences (DID) 
Difference in differences (sometimes ‘Difference-in-Differences’, ‘DID’, or ‘DD’) is a statistical technique used in econometrics and quantitative sociology, which attempts to mimic an experimental research design using observational study data. It calculates the effect of a treatment (i.e., an explanatory variable or an independent variable) on an outcome (i.e., a response variable or dependent variable) by comparing the average change over time in the outcome variable for the treatment group to the average change over time for the control group. This method may be subject to certain biases (mean reversion bias, etc.), although it is intended to eliminate some of the effect of selection bias. In contrast to a within-subjects estimate of the treatment effect (which measures differences over time) or a between-subjects estimate of the treatment effect (which measures the difference between the treatment and control groups), the DID measures the difference in the differences between the treatment and control group over time. 
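The comparison of average changes can be written out directly. The following is a minimal sketch of the basic two-group, two-period estimate; the function name and the sample numbers in the usage note are illustrative only.

```python
from statistics import mean


def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post) -> float:
    """Difference in differences from four samples of the outcome variable."""
    # Average change over time in the treatment group.
    treated_change = mean(treat_post) - mean(treat_pre)
    # Average change over time in the control group.
    control_change = mean(ctrl_post) - mean(ctrl_pre)
    # The DID estimate is the difference between those two changes.
    return treated_change - control_change
```

With made-up outcomes `did_estimate([10, 12], [15, 17], [9, 11], [11, 13])`, the treated group rises by 5 and the control group by 2, so the estimate is 3.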
Differentiable Architecture Search (DARTS) 
This paper addresses the scalability challenge of architecture search by formulating the task in a differentiable manner. Unlike conventional approaches of applying evolution or reinforcement learning over a discrete and non-differentiable search space, our method is based on the continuous relaxation of the architecture representation, allowing efficient search of the architecture using gradient descent. Extensive experiments on CIFAR-10, ImageNet, Penn Treebank and WikiText-2 show that our algorithm excels in discovering high-performance convolutional architectures for image classification and recurrent architectures for language modeling, while being orders of magnitude faster than state-of-the-art non-differentiable techniques. 
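One way to picture the continuous relaxation is as a softmax-weighted mixture of candidate operations, so that the choice among them is governed by real-valued parameters that gradient descent could adjust. The sketch below uses scalar functions as stand-ins for network operations; the names `mixed_op` and `alphas` are assumptions made for illustration, and no training loop is shown.

```python
import math


def softmax(alphas):
    """Numerically stable softmax over a list of real-valued parameters."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_op(x, ops, alphas):
    """Relaxed choice among candidate operations: a weighted sum of all of
    them, with weights given by softmax(alphas)."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))


def discretize(ops, alphas):
    """Recover a discrete choice by keeping the highest-weighted operation."""
    return ops[max(range(len(ops)), key=alphas.__getitem__)]
```

Equal `alphas` average the candidates; as one parameter grows relative to the others, the mixture approaches the single corresponding operation.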
Differentiable Lasso (dlasso) 
DLASSO 
Differentiable Particle Filter (DPF) 
We present differentiable particle filters (DPFs): a differentiable implementation of the particle filter algorithm with learnable motion and measurement models. Since DPFs are end-to-end differentiable, we can efficiently train their models by optimizing end-to-end state estimation performance, rather than proxy objectives such as model accuracy. DPFs encode the structure of recursive state estimation with prediction and measurement update that operate on a probability distribution over states. This structure represents an algorithmic prior that improves learning performance in state estimation problems while enabling explainability of the learned model. Our experiments on simulated and real data show substantial benefits from end-to-end learning with algorithmic priors, e.g. reducing error rates by ~80%. Our experiments also show that, unlike long short-term memory networks, DPFs learn localization in a policy-agnostic way and thus greatly improve generalization. Source code is available at https://…/differentiableparticlefilters. 
Differential Evolution (DE) 
In evolutionary computation, differential evolution (DE) is a method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. Such methods are commonly known as metaheuristics as they make few or no assumptions about the problem being optimized and can search very large spaces of candidate solutions. However, metaheuristics such as DE do not guarantee that an optimal solution is ever found. DE is used for multidimensional real-valued functions but does not use the gradient of the problem being optimized, which means DE does not require the optimization problem to be differentiable, as is required by classic optimization methods such as gradient descent and quasi-Newton methods. DE can therefore also be used on optimization problems that are not even continuous, are noisy, change over time, etc. DE optimizes a problem by maintaining a population of candidate solutions and creating new candidate solutions by combining existing ones according to its simple formulae, and then keeping whichever candidate solution has the best score or fitness on the optimization problem at hand. In this way the optimization problem is treated as a black box that merely provides a measure of quality given a candidate solution, and the gradient is therefore not needed. DE is originally due to Storn and Price. Books have been published on theoretical and practical aspects of using DE in parallel computing, multiobjective optimization and constrained optimization, and the books also contain surveys of application areas. http://…/9783540209508 
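The population loop described above can be sketched in a few lines. This is one common variant (often labelled DE/rand/1/bin), not necessarily the exact scheme the entry has in mind; the parameter values `F=0.8` and `CR=0.9`, the bound clipping, and the fixed seed are illustrative assumptions.

```python
import random


def differential_evolution(f, bounds, pop_size=20, F=0.8, CR=0.9,
                           generations=200, seed=0):
    """Minimize the black-box function f over box bounds [(lo, hi), ...]."""
    if pop_size < 4:
        raise ValueError("pop_size must be at least 4")
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [f(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Choose three distinct population members other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees one mutated coordinate
            trial = []
            for j in range(dim):
                if j == j_rand or rng.random() < CR:
                    # Combine existing candidates: base plus scaled difference.
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)  # clip back into the box
                else:
                    v = pop[i][j]
                trial.append(v)
            trial_score = f(trial)
            # Greedy selection: keep whichever candidate scores better.
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]
```

Only calls to `f` are needed, so the objective may be non-differentiable or noisy; a call such as `differential_evolution(lambda x: sum(v * v for v in x), [(-5, 5), (-5, 5)])` returns the best point found and its score.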
Differential Generative Adversarial Network (DGAN) 
In face-related applications with a publicly available dataset, synthesizing non-linear facial variations (e.g., facial expression, head pose, illumination, etc.) through a generative model is helpful in addressing the lack of training data. In reality, however, there is insufficient data to even train the generative model for face synthesis. In this paper, we propose Differential Generative Adversarial Networks (DGAN) that can perform photo-realistic face synthesis even when training data is small. Two adversarial networks are devised to ensure that the generator approximates a face manifold, which can express face changes as desired. Experimental results demonstrate that the proposed method is robust to the amount of training data and that the synthesized images are useful for improving the performance of a face expression classifier. 
Differential Item Functioning (DIF) 
Differential item functioning (DIF), also referred to as measurement bias, occurs when people from different groups (commonly gender or ethnicity) with the same latent trait (ability/skill) have a different probability of giving a certain response on a questionnaire or test. DIF analysis provides an indication of unexpected behavior of items on a test. An item does not display DIF merely because people from different groups have a different probability of giving a certain response; it displays DIF if and only if people from different groups with the same underlying true ability have a different probability of giving a certain response. Common procedures for assessing DIF are Mantel-Haenszel, item response theory (IRT) based methods, and logistic regression. difR 
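To make the "same ability, different probability" condition concrete, the Mantel-Haenszel approach stratifies test takers by a matching score and pools the per-stratum 2x2 tables into a common odds ratio. The sketch below computes only that ratio, $\sum_k (A_k D_k / N_k) \, / \, \sum_k (B_k C_k / N_k)$; the tuple layout and function name are assumptions for illustration, and this is not the interface of the difR package.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across ability strata.

    Each stratum is a tuple (a, b, c, d):
      a = reference group, correct     b = reference group, incorrect
      c = focal group, correct         d = focal group, incorrect
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        num += a * d / n
        den += b * c / n
    # Raises ZeroDivisionError if no stratum has b*c > 0.
    return num / den
```

A ratio near 1 means the two groups have similar odds of a correct response once ability is matched; values far from 1 flag the item for possible DIF.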
Differential Message Importance Measure (DMIM) 
Information collection is a fundamental problem in big data, where the size of sampling sets plays a very important role. This work considers the information collection process by taking message importance into account. Similar to differential entropy, we define the differential message importance measure (DMIM) as a measure of message importance for a continuous random variable. It is proved that the change of DMIM can describe the gap between the distribution of a set of sample values and a theoretical distribution. In fact, the deviation of DMIM is equivalent to the Kolmogorov-Smirnov statistic, but it offers a new way to characterize the distribution goodness-of-fit. Numerical results show some basic properties of DMIM and the accuracy of the proposed approximate values. Furthermore, it is also shown that the empirical distribution approaches the real distribution as the DMIM deviation decreases, which contributes to the selection of suitable sampling points in actual systems. 
Differential Privacy  In cryptography, differential privacy aims to provide means to maximize the accuracy of queries from statistical databases while minimizing the chances of identifying its records. 
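The trade-off between accuracy and identifiability is often illustrated with the Laplace mechanism, a standard construction that the entry above does not name. The sketch below answers a counting query (sensitivity 1) with Laplace noise of scale `1 / epsilon`; the function name and parameters are illustrative assumptions, and a production system would need a carefully vetted noise source.

```python
import random


def private_count(records, predicate, epsilon, rng=None):
    """Noisy count of records matching predicate, via the Laplace mechanism."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.SystemRandom()
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one record changes a count by at most 1,
    # so the noise scale is sensitivity / epsilon = 1 / epsilon.
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise
```

Smaller `epsilon` gives stronger privacy and noisier answers; larger `epsilon` gives answers closer to the true count.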
Differential Privacy Cleaning  Data cleaning, or the process of detecting and repairing inaccurate or corrupt records in the data, is inherently human-driven. State-of-the-art systems assume cleaning experts can access the data (or a sample of it) to tune the cleaning process. However, in many cases, privacy constraints disallow unfettered access to the data. To address this challenge, we observe and provide empirical evidence that data cleaning can be achieved without access to the sensitive data, but with access to a (noisy) query interface that supports a small set of linear counting query primitives. Motivated by this, we present DPClean, a first-of-a-kind system that allows engineers to tune data cleaning workflows while ensuring differential privacy. In DPClean, a cleaning engineer can pose sequences of aggregate counting queries with error tolerances. A privacy engine translates each query into a differentially private mechanism that returns an answer with error matching the specified tolerance, and allows the data owner to track the overall privacy loss. With extensive experiments using human and simulated cleaning engineers on blocking and matching tasks, we demonstrate that our approach is able to achieve high cleaning quality while ensuring a reasonable privacy loss. 
Differentially Private Regression for Discrete-Time Survival Analysis  In survival analysis, regression models are used to understand the effects of explanatory variables (e.g., age, sex, weight, etc.) on the survival probability. However, for sensitive survival data such as medical data, there are serious concerns about the privacy of individuals in the data set when medical data is used to fit the regression models. The closest work addressing such privacy concerns is the work on Cox regression which linearly projects the original data to a lower dimensional space. However, the weakness of this approach is that there is no formal privacy guarantee for such a projection. In this work, we aim to propose solutions for the regression problem in survival analysis with the protection of differential privacy, which is a gold standard of privacy protection in data privacy research. To this end, we extend the Output Perturbation and Objective Perturbation approaches, which were originally proposed to protect differential privacy for Empirical Risk Minimization (ERM) problems. In addition, we also propose a novel sampling approach based on the Markov Chain Monte Carlo (MCMC) method to practically guarantee differential privacy with better accuracy. We show that our proposed approaches achieve good accuracy as compared to the non-private results while guaranteeing differential privacy for individuals in the private data set. 
Diffix  A longstanding open problem is that of how to get high quality statistics through direct queries to databases containing information about individuals without revealing information specific to those individuals. Diffix is a new framework for anonymous database query that adds noise based on the filter conditions in the query. A previous paper described Diffix for a simplified query semantics. This paper extends that description to include a wide variety of common features found in SQL. It describes attacks associated with various features, and the anonymization steps used to defend against those attacks. This paper describes the version of Diffix used for bounty program sponsored by Aircloak starting December 2017. 
DiffPool  Recently, graph neural networks (GNNs) have revolutionized the field of graph representation learning through effectively learned node embeddings, and achieved state-of-the-art results in tasks such as node classification and link prediction. However, current GNN methods are inherently flat and do not learn hierarchical representations of graphs, a limitation that is especially problematic for the task of graph classification, where the goal is to predict the label associated with an entire graph. Here we propose DiffPool, a differentiable graph pooling module that can generate hierarchical representations of graphs and can be combined with various graph neural network architectures in an end-to-end fashion. DiffPool learns a differentiable soft cluster assignment for nodes at each layer of a deep GNN, mapping nodes to a set of clusters, which then form the coarsened input for the next GNN layer. Our experimental results show that combining existing GNN methods with DiffPool yields an average improvement of 5–10% accuracy on graph classification benchmarks, compared to all existing pooling approaches, achieving a new state-of-the-art on four out of five benchmark data sets. 
DiffSharp  DiffSharp is an algorithmic differentiation or automatic differentiation (AD) library for the .NET ecosystem, which is targeted by the C# and F# languages, among others. The library has been designed with machine learning applications in mind, allowing very succinct implementations of models and optimization routines. DiffSharp is implemented in F# and exposes forward and reverse AD operators as general nestable higher-order functions, usable by any .NET language. It provides high-performance linear algebra primitives (scalars, vectors, and matrices, with a generalization to tensors underway) that are fully supported by all the AD operators, and which use a BLAS/LAPACK backend via the highly optimized OpenBLAS library. DiffSharp currently uses operator overloading, but we are developing a transformation-based version of the library using F#’s ‘code quotation’ metaprogramming facility. Work on a CUDA-based GPU backend is also underway. 
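The idea of forward-mode AD through operator overloading, exposed as a higher-order function, can be sketched with dual numbers. The code below is a language-neutral illustration written in Python; it is not DiffSharp's API, the names `Dual` and `diff` are hypothetical, it supports only addition and multiplication, and it does not handle nested differentiation correctly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative with respect to the input."""
    value: float
    deriv: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative 0)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def diff(f, x: float) -> float:
    """Higher-order operator: derivative of f at x by forward-mode AD."""
    return f(Dual(x, 1.0)).deriv
```

For example, `diff(lambda x: 3 * x * x + 2 * x + 1, 2.0)` propagates derivatives alongside values and yields the derivative of the polynomial at 2.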
Diffusion Based Network Embedding  In network embedding, random walks play a fundamental role in preserving network structures. However, random walk based embedding methods have two limitations. First, random walk methods are fragile when the sampling frequency or the number of node sequences changes. Second, in disequilibrium networks such as highly biased networks, random walk methods often perform poorly due to the lack of global network information. To address these limitations, we propose in this paper a network diffusion based embedding method. To address the first limitation, our method employs a diffusion driven process to capture both depth information and breadth information. The time dimension is also attached to node sequences, which can strengthen information preservation. To address the second limitation, our method uses the network inference technique based on cascades to capture the global network information. To verify the performance, we conduct experiments on node classification tasks using the learned representations. Results show that, compared with random walk based methods, diffusion based models are more robust when samples for each node are rare. We also conduct experiments on a highly imbalanced network. Results show that the proposed model is more robust under the biased network structure. 
Diffusion Map  Diffusion maps is a machine learning algorithm introduced by R. R. Coifman and S. Lafon. It computes a family of embeddings of a data set into Euclidean space (often lowdimensional) whose coordinates can be computed from the eigenvectors and eigenvalues of a diffusion operator on the data. The Euclidean distance between points in the embedded space is equal to the “diffusion distance” between probability distributions centered at those points. Different from other dimensionality reduction methods such as principal component analysis (PCA) and multidimensional scaling (MDS), diffusion maps is a nonlinear method that focuses on discovering the underlying manifold that the data has been sampled from. By integrating local similarities at different scales, diffusion maps gives a global description of the dataset. Compared with other methods, the diffusion maps algorithm is robust to noise perturbation and is computationally inexpensive. 
digGOF  This paper concerns the problem of applying the generalized goodness-of-fit (gGOF) type tests for analyzing correlated data. The gGOF family broadly covers the maximum-based testing procedures by ordered input $p$-values, such as the false discovery rate procedure, the Kolmogorov-Smirnov type statistics, the $\phi$-divergence family, etc. A data analysis framework and a novel $p$-value calculation approach are developed under the Gaussian mean model and the generalized linear model (GLM). We reveal the influence of data transformations on the signal-to-noise ratio and the statistical power under both sparse and dense signal patterns and various correlation structures. In particular, the innovated transformation (IT), which is shown to be equivalent to the marginal model-fitting under the GLM, is often preferred for detecting sparse signals in correlated data. We propose a testing strategy called the digGOF, which combines a double-adaptation procedure (i.e., adapting to both the statistic’s formula and the truncation scheme of the input $p$-values) and the IT within the gGOF family. It features efficient computation and robust adaptation to the family-retained advantages for given data. Relevant approaches are assessed by extensive simulations and by genetic studies of Crohn’s disease and amyotrophic lateral sclerosis. Computations have been included in the R package SetTest available on CRAN. 
Digital Analytics  Digital analytics is the analysis of qualitative and quantitative data from your business and the competition to drive a continual improvement of the online experience that your customers and potential customers have which translates to your desired outcomes (both online and offline). One of the most important steps of digital analytics is determining what your ultimate business objectives or outcomes are and how you expect to measure those outcomes. In the online world, there are five common business objectives: · For ecommerce sites, an obvious objective is selling products or services. · For lead generation sites, the goal is to collect user information for sales teams to connect with potential leads. · For content publishers, the goal is to encourage engagement and frequent visitation. · For online informational or support sites, helping users find the information they need at the right time is of primary importance. · For branding, the main objective is to drive awareness, engagement and loyalty. There are key actions on any website or mobile application that tie back to a business’ objectives. The actions can indicate an objective, like a purchase on an ecommerce site, has been fully met. These are ‘macro’ conversions. Some of the actions on a site might also be behavioral indicators that a customer hasn’t fully reached your main objectives but is coming closer, like, in the ecommerce example, signing up to receive an email coupon or a new product notification. These are ‘micro’ conversions. It’s important to measure both micro and macro conversions so that you are equipped with more behavioral data to understand what experiences help drive the right outcomes for your site. 
Digital Analytics Association (DAA) 
The Digital Analytics Association makes analytics professionals more effective and valuable through professional development and community. 
Digital Asset Management (DAM) 
Digital asset management (DAM) consists of management tasks and decisions surrounding the ingestion, annotation, cataloguing, storage, retrieval and distribution of digital assets. Digital photographs, animations, videos and music exemplify the target areas of media asset management (a subcategory of DAM). Digital asset management systems (DAMS) include computer software and hardware systems that aid in the process of digital asset management. The term “digital asset management” (DAM) also refers to the protocol for downloading, renaming, backing up, rating, grouping, archiving, optimizing, maintaining, thinning, and exporting files. The “media asset management” (MAM) subcategory of digital asset management mainly addresses audio, video and other media content. The more recent concept of enterprise content management (ECM) often deals with solutions which address similar features but in a wider range of industries or applications. 
Digital Hoarding  Digital hoarding (also known as e-hoarding) is excessive acquisition of, and reluctance to delete, electronic material no longer valuable to the user. The behavior includes the mass storage of digital artifacts and the retention of unnecessary or irrelevant electronic data. The term is increasingly common in pop culture, used to describe the habitual characteristics of compulsive hoarding, but in cyberspace. As with physical space, in which excess items are described as ‘clutter’ or ‘junk,’ excess digital media is often referred to as ‘digital clutter.’ 
Digital Native (DN) 
The term Digital Native was coined and popularized by education consultant Marc Prensky in his 2001 article entitled Digital Natives, Digital Immigrants, in which he relates the contemporaneous decline in American education to educators’ failure to understand the needs of modern students. His article posited that ‘the arrival and rapid dissemination of digital technology in the last decade of the 20th century’ had fundamentally changed the way students think and process information, making it impossible for them to excel academically using the outdated teaching methods of the day. In other words, children raised in the post-digital, media-saturated world require a media-rich learning environment to hold their attention. Contextually, his ideas were introduced after a decade of worry over increased diagnosis of children with ADD and ADHD, which itself turned out to be largely overblown. Prensky did not strictly define the Digital Native in his 2001 article, but the term was later, somewhat arbitrarily, applied to children born after 1980, due to the fact that computer bulletin board systems and Usenet were already in use at the time. The idea became popular among educators and parents whose children fell within Prensky’s definition of a Digital Native, and has since been embraced as an effective marketing tool. 
Digital Twin  Digital twin refers to a digital replica of physical assets, processes and systems that can be used for various purposes. The digital representation provides both the elements and the dynamics of how an Internet of Things device operates and lives throughout its life cycle. Digital twins integrate artificial intelligence, machine learning and software analytics with data to create living digital simulation models that update and change as their physical counterparts change. A digital twin continuously learns and updates itself from multiple sources to represent its near real-time status, working condition or position. This learning system learns from itself, using sensor data that conveys various aspects of its operating condition; from human experts, such as engineers with deep and relevant industry domain knowledge; from other similar machines; from other similar fleets of machines; and from the larger systems and environment of which it may be a part. A digital twin also integrates historical data from past machine usage to factor into its digital model. 
Dijkstra Algorithm  Dijkstra’s algorithm, conceived by computer scientist Edsger Dijkstra in 1956 and published in 1959, is a graph search algorithm that solves the single-source shortest path problem for a graph with non-negative edge costs, producing a shortest path tree. This algorithm is often used in routing and as a subroutine in other graph algorithms. 
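A minimal sketch of the algorithm with a binary heap follows. The adjacency-list format (`{node: [(neighbor, cost), ...]}`) and the function names are assumptions made for illustration; the `parent` map returned alongside the distances encodes the shortest path tree.

```python
import heapq


def dijkstra(graph, source):
    """Single-source shortest paths on a graph with non-negative edge costs.

    Returns (dist, parent) for every node reachable from source.
    """
    dist = {source: 0}
    parent = {source: None}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter route was already found
        for v, w in graph.get(u, []):
            if w < 0:
                raise ValueError("edge costs must be non-negative")
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, parent


def path_to(parent, target):
    """Walk the shortest path tree back from a reachable target to the source."""
    path = []
    while target is not None:
        path.append(target)
        target = parent[target]
    return path[::-1]
```

Calling `path_to` with a node that is not in `parent` raises `KeyError`, which signals that the node is unreachable from the source.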
Dilated Recurrent Neural Network (DILATEDRNN) 
Notoriously, learning with recurrent neural networks (RNNs) on long sequences is a difficult task. There are three major challenges: 1) extracting complex dependencies, 2) vanishing and exploding gradients, and 3) efficient parallelization. In this paper, we introduce a simple yet effective RNN connection structure, the DILATEDRNN, which simultaneously tackles all these challenges. The proposed architecture is characterized by multi-resolution dilated recurrent skip connections and can be combined flexibly with different RNN cells. Moreover, the DILATEDRNN reduces the number of parameters and enhances training efficiency significantly, while matching state-of-the-art performance (even with Vanilla RNN cells) in tasks involving very long-term dependencies. To provide a theory-based quantification of the architecture’s advantages, we introduce a memory capacity measure – the mean recurrent length, which is more suitable for RNNs with long skip connections than existing measures. We rigorously prove the advantages of the DILATEDRNN over other recurrent neural architectures. 
Dilated-Residual U-Net Deep Learning Network (DRUNET) 
Given that the neural and connective tissues of the optic nerve head (ONH) exhibit complex morphological changes with the development and progression of glaucoma, their simultaneous isolation from optical coherence tomography (OCT) images may be of great interest for the clinical diagnosis and management of this pathology. A deep learning algorithm was designed and trained to digitally stain (i.e. highlight) 6 ONH tissue layers by capturing both the local (tissue texture) and contextual information (spatial arrangement of tissues). The overall dice coefficient (mean of all tissues) was $0.91 \pm 0.05$ when assessed against manual segmentations performed by an expert observer. We offer here a robust segmentation framework that could be extended for the automated parametric study of the ONH tissues. 
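The Dice coefficient reported above measures overlap between a predicted segmentation and a reference one: twice the size of the intersection divided by the sum of the two sizes. A minimal sketch, assuming each segmentation is given as a collection of pixel identifiers (a representation chosen here for illustration) and treating two empty masks as a perfect match:

```python
def dice_coefficient(pred, truth) -> float:
    """Dice overlap 2*|A & B| / (|A| + |B|) between two pixel sets."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # convention assumed here: two empty masks agree fully
    return 2 * len(pred & truth) / (len(pred) + len(truth))
```

The value ranges from 0 (no overlap) to 1 (identical masks); a per-tissue score would be computed for each layer and then averaged.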
Dimensional Clustering  This paper introduces a new clustering technique, called dimensional clustering, which clusters each data point by its latent pointwise dimension, which is a measure of the dimensionality of the data set local to that point. Pointwise dimension is invariant under a broad class of transformations. As a result, dimensional clustering can be usefully applied to a wide range of datasets. Concretely, we present a statistical model which estimates the pointwise dimension of a dataset around the points in that dataset using the distance of each point from its $n^{\text{th}}$ nearest neighbor. We demonstrate the applicability of our technique to the analysis of dynamical systems, images, and complex human movements. 
Dimensional Collapse  … When I stared at the plot, I ask myself, why not map the xaxis information of the points to the very first one according to the yaxis ‘connections’. When everything goes well and all done, all the grey points should be mapped along the red arrows to the first marks of the groups, and there should be only 4 marks leave on xaxis: a, b, d and g, instead of 9 marks in the first place. And the yaxis information, after contributing all the ‘connection rules’, can be put away now, since the left xaxis marks are exactly what I want: the final flags. It is why I like to call it ‘Dimensional Collapse’. … 
Dimensionality Reduction  In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction. 
DimmWitted  A storage abstraction that captures the access patterns of popular statistical analytics tasks and a prototype called DimmWitted. 
dimple  dimple is a simple-to-use charting API powered by D3.js. The aim of dimple is to open up the power and flexibility of d3 to analysts. It aims to give a gentle learning curve and minimal code to achieve something productive. It also exposes the d3 objects so you can pick them up and run to create some really cool stuff. 
DiNoDB  As data sets grow in size, analytics applications struggle to get instant insight into large datasets. Modern applications involve heavy batch processing jobs over large volumes of data and at the same time require efficient ad hoc interactive analytics on temporary data. Existing solutions, however, typically focus on one of these two aspects, largely ignoring the need for synergy between the two. Consequently, interactive queries need to reiterate costly passes through the entire dataset (e.g., data loading) that may provide meaningful return on investment only when data is queried over a long period of time. In this paper, we propose DiNoDB, an interactive-speed query engine for ad hoc queries on temporary data. DiNoDB avoids the expensive loading and transformation phase that characterizes both traditional RDBMSs and current interactive analytics solutions. It is tailored to modern workflows found in machine learning and data exploration use cases, which often involve iterations of cycles of batch and interactive analytics on data that is typically useful for a narrow processing window. The key innovation of DiNoDB is to piggyback on the batch processing phase the creation of metadata that DiNoDB exploits to expedite the interactive queries. Our experimental analysis demonstrates that DiNoDB achieves very good performance for a wide range of ad hoc queries compared to alternatives such as Hive, Stado, SparkSQL and Impala. 
Dionysius  We address the following problem: How do we incorporate user-item interaction signals as part of the relevance model in a large-scale personalized recommendation system such that (1) the ability to interpret the model and explain recommendations is retained, and (2) the existing infrastructure designed for the (user profile) content-based model can be leveraged? We propose Dionysius, a hierarchical graphical model based framework and system for incorporating user interactions into recommender systems, with minimal change to the underlying infrastructure. We learn a hidden fields vector for each user by considering the hierarchy of interaction signals, and replace the user profile-based vector with this learned vector, thereby not expanding the feature space at all. Thus, our framework allows the use of existing recommendation infrastructure that supports content-based features. We implemented and deployed this system as part of the recommendation platform at LinkedIn for more than one year. We validated the efficacy of our approach through extensive offline experiments with different model choices, as well as online A/B testing experiments. Our deployment of this system as part of the job recommendation engine resulted in significant improvement in the quality of retrieved results, thereby generating improved user experience and positive impact for millions of users. 
DiracNet  Deep neural networks with skip-connections, such as ResNet, show excellent performance in various image classification benchmarks. It has been observed, though, that the initial motivation behind them (training deeper networks) does not actually hold true, and the benefits come from increased capacity rather than from depth. Motivated by this, and inspired by ResNet, we propose a simple Dirac weight parameterization, which allows us to train very deep plain networks without skip-connections and achieve nearly the same performance. This parameterization has a minor computational cost at training time and no cost at all at inference. We are able to achieve 95.5% accuracy on CIFAR-10 with a 34-layer deep plain network, surpassing a 1001-layer deep ResNet and approaching Wide ResNet. Our parameterization also mostly eliminates the need for careful initialization in residual and non-residual networks. The code and models for our experiments are available at https://…/diracnets 
Direct Optimization  The Deep Learning (DL) community sees many novel topologies published each year. Achieving high performance on each new topology remains challenging, as each requires some level of manual effort. This issue is compounded by the proliferation of frameworks and hardware platforms. The current approach, which we call ‘direct optimization’, requires deep changes within each framework to improve the training performance for each hardware backend (CPUs, GPUs, FPGAs, ASICs) and requires $\mathcal{O}(fp)$ effort, where $f$ is the number of frameworks and $p$ is the number of platforms. While optimized kernels for deep-learning primitives are provided via libraries like Intel Math Kernel Library for Deep Neural Networks (MKL-DNN), there are several compiler-inspired ways in which performance can be further optimized. Building on our experience creating neon (a fast deep learning library on GPUs), we developed Intel nGraph, a soon to be open-sourced C++ library to simplify the realization of optimized deep learning performance across frameworks and hardware platforms. Initially-supported frameworks include TensorFlow, MXNet, and Intel neon framework. Initial backends are Intel Architecture CPUs (CPU), the Intel(R) Nervana Neural Network Processor(R) (NNP), and NVIDIA GPUs. Currently supported compiler optimizations include efficient memory management and data layout abstraction. In this paper, we describe our overall architecture and its core components. In the future, we envision extending nGraph API support to a wider range of frameworks, hardware (including FPGAs and ASICs), and compiler optimizations (training versus inference optimizations, multi-node and multi-device scaling via efficient sub-graph partitioning, and HW-specific compounding of operations). 
Directed Acyclic Graph (DAG) 
In mathematics and computer science, a directed acyclic graph (DAG) is a directed graph with no directed cycles. That is, it is formed by a collection of vertices and directed edges, each edge connecting one vertex to another, such that there is no way to start at some vertex v and follow a sequence of edges that eventually loops back to v again. dimple 
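The "no directed cycles" condition can be checked mechanically. The following is a minimal sketch, assuming the graph is given as a list of vertices and a list of (source, target) edge pairs; it uses Kahn's algorithm, a standard technique not named in the entry itself, and the function names `topological_order` and `is_dag` are illustrative only.

```python
from collections import deque


def topological_order(vertices, edges):
    """Return a topological order of the vertices, or None if a directed cycle exists."""
    successors = {v: [] for v in vertices}
    indegree = {v: 0 for v in vertices}
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1

    # Vertices with no incoming edges can be placed first.
    ready = deque(v for v in vertices if indegree[v] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    # If some vertex was never released, it lies on (or behind) a directed cycle.
    return order if len(order) == len(vertices) else None


def is_dag(vertices, edges):
    """True when the directed graph has no directed cycle."""
    return topological_order(vertices, edges) is not None
```

For instance, the chain a → b → c is a DAG, while adding the edge c → a creates a loop back to a, so `is_dag` would return False for that graph.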
Directed Acyclic Graph AutoRegressive (DAGAR) 

Directed Exploration Learning (DEL) 
We address reinforcement learning problems with finite state and action spaces where the underlying MDP has some known structure that could be potentially exploited to minimize the exploration of suboptimal (state, action) pairs. For any arbitrary structure, we derive problem-specific regret lower bounds satisfied by any learning algorithm. These lower bounds are made explicit for unstructured MDPs and for those whose transition probabilities and average reward function are Lipschitz continuous w.r.t. the state and action. For Lipschitz MDPs, the bounds are shown not to scale with the sizes $S$ and $A$ of the state and action spaces, i.e., they are smaller than $c \log T$ where $T$ is the time horizon and the constant $c$ only depends on the Lipschitz structure, the span of the bias function, and the minimal action suboptimality gap. This contrasts with unstructured MDPs where the regret lower bound typically scales as $SA \log T$. We devise DEL (Directed Exploration Learning), an algorithm that matches our regret lower bounds. We further simplify the algorithm for Lipschitz MDPs, and show that the simplified version is still able to efficiently exploit the structure. 
Directed Graph  In mathematics, and more specifically in graph theory, a directed graph (or digraph) is a graph, or set of nodes connected by edges, where the edges have a direction associated with them. In formal terms, a digraph is a pair G=(V,A) (sometimes G=(V,E)) of: · a set V, whose elements are called vertices or nodes, · a set A of ordered pairs of vertices, called arcs, directed edges, or arrows (and sometimes simply edges with the corresponding set named E instead of A). It differs from an ordinary or undirected graph, in that the latter is defined in terms of unordered pairs of vertices, which are usually called edges. A digraph is called ‘simple’ if it has no loops, and no multiple arcs (arcs with same starting and ending nodes). A directed multigraph, in which the arcs constitute a multiset, rather than a set, of ordered pairs of vertices may have loops (that is, ‘self-loops’ with same starting and ending node) and multiple arcs. Some, but not all, texts allow a digraph, without the qualification simple, to have self-loops, multiple arcs, or both. 
directional Bat Algorithm (dBA) 
The bat algorithm (BA) is a recent optimization algorithm based on swarm intelligence and inspired by the echolocation behavior of bats. One of the issues in the standard bat algorithm is the premature convergence that can occur due to the low exploration ability of the algorithm under some conditions. To overcome this deficiency, directional echolocation is introduced to the standard bat algorithm to enhance its exploration and exploitation capabilities. In addition to such directional echolocation, three other improvements have been embedded into the standard bat algorithm to enhance its performance. The newly proposed approach, namely the directional Bat Algorithm (dBA), has then been tested using several standard and nonstandard benchmarks from the CEC’2005 benchmark suite. The performance of dBA has been compared with ten other algorithms and BA variants using nonparametric statistical tests. The statistical test results show the superiority of the directional bat algorithm. 
Directional Statistics  Directional statistics is the subdiscipline of statistics that deals with directions (unit vectors in $\mathbb{R}^n$), axes (lines through the origin in $\mathbb{R}^n$) or rotations in $\mathbb{R}^n$. More generally, directional statistics deals with observations on compact Riemannian manifolds. The fact that 0 degrees and 360 degrees are identical angles, so that for example 180 degrees is not a sensible mean of 2 degrees and 358 degrees, provides one illustration that special statistical methods are required for the analysis of some types of data (in this case, angular data). Other examples of data that may be regarded as directional include statistics involving temporal periods (e.g. time of day, week, month, year, etc.), compass directions, dihedral angles in molecules, orientations, rotations and so on. Directional 
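The 2 degrees / 358 degrees example can be made concrete. The sketch below is a minimal illustration, assuming the usual circular mean (average the unit vectors, then take the angle of the resultant); the function name `circular_mean_degrees` is hypothetical and not taken from any package mentioned in this entry.

```python
import math


def circular_mean_degrees(angles):
    """Mean direction of angles given in degrees, returned in [0, 360)."""
    # Represent each angle as a unit vector and sum the components.
    sin_sum = sum(math.sin(math.radians(a)) for a in angles)
    cos_sum = sum(math.cos(math.radians(a)) for a in angles)
    # The direction of the resultant vector is the circular mean.
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0


arithmetic = (2 + 358) / 2                    # 180.0, pointing the wrong way
circular = circular_mean_degrees([2, 358])    # very close to 0 (equivalently 360)
```

Because of floating-point rounding the circular result may come out as a value extremely close to either 0 or 360, which denote the same direction.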
Dirichlet Distribution  When it comes to recommendation systems and natural language processing, data that can be modeled as a multinomial or as a vector of counts is ubiquitous. For example, if there are 2 possible user-generated ratings (like and dislike), then each item is represented as a vector of 2 counts. In a higher dimensional case, each document may be expressed as a count of words, and the vector size is large enough to encompass all the important words in that corpus of documents. The Dirichlet distribution is one of the basic probability distributions for describing this type of data. 
Dirichlet Lasso (DLASSO) 
Selection of the most important predictor variables in regression analysis is one of the key problems statistical research has been concerned with for a long time. In this article, we propose a methodology, Dirichlet Lasso (abbreviated as DLASSO), to address this issue in a Bayesian framework. In many modern regression settings, large sets of predictor variables are grouped and the coefficients belonging to any one of these groups are either all redundant or all important in predicting the response; we say in those cases that the predictors exhibit a group structure. We show that DLASSO is particularly useful where the group structure is not fully known. We exploit the clustering property of Dirichlet Process priors to infer the possibly missing group information. The Dirichlet Process has the advantage of simultaneously clustering the variable coefficients and selecting the best set of predictor variables. We compare the predictive performance of DLASSO to Group Lasso and ordinary Lasso with real data and simulation studies. Our results demonstrate that the predictive performance of DLASSO is almost as good as that of Group Lasso when group label information is given, and superior to the ordinary Lasso for missing group information. For high dimensional data (e.g., genetic data) with missing group information, DLASSO will be a powerful approach to variable selection since it provides superior predictive performance and higher statistical accuracy. 
Dirichlet Process (DP) 
In probability theory, a Dirichlet process is a way of assigning a probability distribution over probability distributions. That is, a Dirichlet process is a probability distribution whose domain is itself a set of probability distributions. The probability distributions in the domain are almost surely discrete and may be infinite dimensional. Assigning an arbitrary probability distribution over a domain of infinite dimensional probability distributions would require an infinite amount of computational resources. The main function of the Dirichlet process is that it allows the specification of a distribution over infinite dimensional distributions in a way that uses only finite resources. 
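One common way to see how a finite computation can approximate such an object is the stick-breaking construction, which is a standard representation of the Dirichlet process but is not described in the entry above. The following is a minimal sketch under that assumption: it truncates the construction after a fixed number of atoms, and the names `stick_breaking_weights`, `alpha`, and `num_atoms` are illustrative.

```python
import random


def stick_breaking_weights(alpha, num_atoms, rng):
    """Truncated stick-breaking: return (weights, leftover_mass).

    alpha     -- concentration parameter (> 0); larger values spread mass over more atoms
    num_atoms -- truncation level, i.e. how many pieces of the unit stick to break off
    rng       -- a random.Random instance, passed in for reproducibility
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any atom
    for _ in range(num_atoms):
        fraction = rng.betavariate(1.0, alpha)  # share of the remaining stick to break off
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


rng = random.Random(0)
weights, leftover = stick_breaking_weights(alpha=2.0, num_atoms=50, rng=rng)
# Pair each weight with a draw from a base distribution (here a standard normal,
# chosen only for illustration) to obtain one approximate, discrete random distribution.
atoms = [rng.gauss(0.0, 1.0) for _ in weights]
```

The weights plus the leftover mass sum to one, and the leftover shrinks as the truncation level grows, which is why a finite number of atoms is usually an adequate approximation.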
Dirichlet Process Mixture Model (DPMM) 
The Dirichlet process is a family of nonparametric Bayesian models which are commonly used for density estimation, semiparametric modelling and model selection/averaging. Dirichlet processes are nonparametric in the sense that they have an infinite number of parameters. Since they are treated in a Bayesian approach, we are able to construct large models with infinitely many parameters, which we integrate out to avoid overfitting. 
DirNet  Recurrent neural networks (RNNs) achieve cutting-edge performance on a variety of problems. However, due to their high computational and memory demands, deploying RNNs on resource-constrained mobile devices is a challenging task. To guarantee minimum accuracy loss with higher compression rate and driven by the mobile resource requirement, we introduce a novel model compression approach, DirNet, based on an optimized fast dictionary learning algorithm, which 1) dynamically mines the dictionary atoms of the projection dictionary matrix within each layer to adjust the compression rate and 2) adaptively changes the sparsity of sparse codes across the hierarchical layers. Experimental results on a language model and an ASR model trained with a 1000h speech dataset demonstrate that our method significantly outperforms prior approaches. Evaluated on off-the-shelf mobile devices, we are able to reduce the size of the original model by eight times with real-time model inference and negligible accuracy loss. 
Disaggregation  Disaggregation is the breakdown of observations, usually within a common branch of a hierarchy, to a more detailed level than that at which detailed observations are taken. 
Disciplined Convex Optimization  An object-oriented modeling language for disciplined convex programming (DCP). It allows the user to formulate convex optimization problems in a natural way following mathematical convention and DCP rules. The system analyzes the problem, verifies its convexity, converts it into a canonical form, and hands it off to an appropriate solver to obtain the solution. ➘ “Disciplined Convex Programming” CVXR 
Disciplined Convex Programming (DCP) 
Convex programming is a subclass of nonlinear programming (NLP) that unifies and generalizes least squares (LS), linear programming (LP), and convex quadratic programming (QP). It has become quite popular recently for a number of reasons, including its attractive theoretical properties, efficient numerical algorithms, and practical applications. Nevertheless, there remains a significant impediment to the more widespread adoption of convex programming: the high level of expertise required to use it. We introduce a new modeling methodology called disciplined convex programming. As the term ‘disciplined’ suggests, the methodology imposes a set of conventions that one must follow when constructing convex programs. The conventions are simple and teachable, taken from basic principles of convex analysis, and inspired by the practices of those who regularly study and apply convex optimization today. The conventions do not limit generality; but they do allow much of the manipulation and transformation required to analyze and solve convex programs to be automated. ➘ “Disciplined Convex Optimization” 
Discounted Cumulative Gain (DCG) 
Discounted cumulative gain (DCG) is a measure of ranking quality. In information retrieval, it is often used to measure effectiveness of web search engine algorithms or related applications. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom with the gain of each result discounted at lower ranks. 
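The accumulation described above is commonly written as $\mathrm{DCG}_p = \sum_{i=1}^{p} \frac{rel_i}{\log_2(i+1)}$, where $rel_i$ is the graded relevance of the result at rank $i$. The entry itself does not fix a particular discount, so the sketch below assumes this logarithmic form; the function name `dcg` is illustrative.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevance scores.

    relevances[0] is the relevance of the top result. Each gain is divided by
    log2(rank + 1), so results further down the list contribute less.
    """
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


ranking = [3, 2, 3, 0, 1, 2]
score = dcg(ranking)                               # about 6.861
best_possible = dcg(sorted(ranking, reverse=True)) # same documents, ideal order
```

Sorting the same relevance scores in descending order gives the highest achievable DCG for that result set, which is the quantity often used to normalize the score.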
Discourse Analysis (DA) 
Discourse analysis (DA), or discourse studies, is a general term for a number of approaches to analyzing written, vocal, or sign language use or any significant semiotic event. The objects of discourse analysis (discourse, writing, conversation, communicative event) are variously defined in terms of coherent sequences of sentences, propositions, speech, or turns-at-talk. Contrary to much of traditional linguistics, discourse analysts not only study language use ‘beyond the sentence boundary’, but also prefer to analyze ‘naturally occurring’ language use, and not invented examples. Text linguistics is related. The essential difference between discourse analysis and text linguistics is that discourse analysis aims at revealing socio-psychological characteristics of a person/persons rather than text structure. Discourse analysis has been taken up in a variety of social science disciplines, including linguistics, education, sociology, anthropology, social work, cognitive psychology, social psychology, area studies, cultural studies, international relations, human geography, communication studies, and translation studies, each of which is subject to its own assumptions, dimensions of analysis, and methodologies. 
Discover, Access, Distill (DAD) 
DAD comprises: · Discover: Find and identify the sources of good data, and the metrics. Sometimes request the data to be created (work with data engineers and business analysts) · Access: Access the data, sometimes via an API, a web crawler, an Internet download, a database access, or sometimes in-memory within a database. · Distill: Extract the essence from data, the stuff that leads to decisions, increased ROI, and actions (such as determining optimum bid prices in an automated bidding system). It involves · Exploring the data (creating a data dictionary and exploratory analysis) · Cleaning (removing impurities) · Refining (data summarization, sometimes multiple layers of summarization or hierarchical summarization) · Analyzing: statistical analyses (sometimes including stuff like experimental design that can take place even before the Access stage), both automated and manual. Might or might not require statistical modeling · Presenting results or integrating results in some automated process 
Discrete Choice  In economics, discrete choice models, or qualitative choice models, describe, explain, and predict choices between two or more discrete alternatives, such as entering or not entering the labor market, or choosing between modes of transport. Such choices contrast with standard consumption models in which the quantity of each good consumed is assumed to be a continuous variable. In the continuous case, calculus methods (e.g. first-order conditions) can be used to determine the optimum amount chosen, and demand can be modeled empirically using regression analysis. On the other hand, discrete choice analysis examines situations in which the potential outcomes are discrete, such that the optimum is not characterized by standard first-order conditions. Thus, instead of examining ‘how much’ as in problems with continuous choice variables, discrete choice analysis examines ‘which one.’ However, discrete choice analysis can also be used to examine the chosen quantity when only a few distinct quantities must be chosen from, such as the number of vehicles a household chooses to own and the number of minutes of telecommunications service a customer decides to purchase. Techniques such as logistic regression and probit regression can be used for empirical analysis of discrete choice. Book: Discrete Choice Analysis 
Discrete Dantzig Selector  We propose a new high-dimensional linear regression estimator: the Discrete Dantzig Selector, which minimizes the number of nonzero regression coefficients, subject to a budget on the maximal absolute correlation between the features and the residuals. We show that the estimator can be expressed as a solution to a Mixed Integer Linear Optimization (MILO) problem, a computationally tractable framework that enables the computation of provably optimal global solutions. Our approach has the appealing characteristic that even if we terminate the optimization problem at an early stage, it exits with a certificate of sub-optimality on the quality of the solution. We develop new discrete first order methods, motivated by recent algorithmic developments in first order continuous convex optimization, to obtain high quality feasible solutions for the Discrete Dantzig Selector problem. Our proposal leads to advantages over the off-the-shelf state-of-the-art integer programming algorithms, which include superior upper bounds obtained for a given computational budget. When a solution obtained from the discrete first order methods is passed as a warm-start to a MILO solver, the performance of the latter improves significantly. Exploiting problem specific information, we propose enhanced MILO formulations that further improve the algorithmic performance of the MILO solvers. We demonstrate, both theoretically and empirically, that, in a wide range of regimes, the statistical properties of the Discrete Dantzig Selector are superior to those of popular $\ell_{1}$-based approaches. For problem instances with $p \approx 2500$ features and $n \approx 900$ observations, our computational framework delivers optimal solutions in a few minutes and certifies optimality within an hour. 
Discrete Event Simulation (DES) 
In the field of simulation, a discrete-event simulation (DES) models the operation of a system as a discrete sequence of events in time. Each event occurs at a particular instant in time and marks a change of state in the system. Between consecutive events, no change in the system is assumed to occur; thus the simulation can directly jump in time from one event to the next. This contrasts with continuous simulation in which the simulation continuously tracks the system dynamics over time. Instead of being event-based, this is called an activity-based simulation; time is broken up into small time slices and the system state is updated according to the set of activities happening in the time slice. Because discrete-event simulations do not have to simulate every time slice, they can typically run much faster than the corresponding continuous simulation. Another alternative to event-based simulation is process-based simulation. In this approach, each activity in a system corresponds to a separate process, where a process is typically simulated by a thread in the simulation program. In this case, the discrete events, which are generated by threads, would cause other threads to sleep, wake, and update the system state. A more recent method is the three-phased approach to discrete event simulation (Pidd, 1998). In this approach, the first phase is to jump to the next chronological event. The second phase is to execute all events that unconditionally occur at that time (these are called B-events). The third phase is to execute all events that conditionally occur at that time (these are called C-events). The three-phase approach is a refinement of the event-based approach in which simultaneous events are ordered so as to make the most efficient use of computer resources. The three-phase approach is used by a number of commercial simulation software packages, but from the user’s point of view, the specifics of the underlying simulation method are generally hidden. simmer,DES 
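The event-based idea of jumping directly from one event to the next can be sketched with a priority queue keyed on event time. This is a minimal illustration only, not the implementation used by any package named in this entry; `run_simulation`, the event names, and the 2.5 time-unit service duration are all hypothetical.

```python
import heapq


def run_simulation(initial_events, handlers, until=float("inf")):
    """Process (time, name) events in chronological order and return the event log.

    initial_events -- list of (time, name) tuples scheduled before the run starts
    handlers       -- dict mapping an event name to a function that takes the
                      current time and returns a list of new (time, name) events
    until          -- events scheduled after this time are not processed
    """
    queue = list(initial_events)
    heapq.heapify(queue)  # earliest event is always at the front
    log = []
    while queue:
        time, name = heapq.heappop(queue)  # the clock jumps straight to the next event
        if time > until:
            break
        log.append((time, name))
        handler = handlers.get(name)
        if handler is not None:
            for new_event in handler(time):
                heapq.heappush(queue, new_event)
    return log


# Each arrival schedules a departure 2.5 time units later; departures schedule nothing.
handlers = {"arrival": lambda now: [(now + 2.5, "departure")]}
log = run_simulation([(0.0, "arrival"), (4.0, "arrival")], handlers)
# log == [(0.0, "arrival"), (2.5, "departure"), (4.0, "arrival"), (6.5, "departure")]
```

Nothing is computed for the stretches of time between events, which is the reason the entry gives for discrete-event simulations typically running faster than time-sliced ones.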
Discrete Fourier Cosine Quadrature Transform (FCQT) 
The Hilbert transform (HT) and associated Gabor analytic signal (GAS) representation are well-known and widely used mathematical formulations for modeling and analysis of signals in various applications. In this study, like the HT, to obtain the quadrature component of a signal, we propose the novel discrete Fourier cosine quadrature transforms (FCQTs) and discrete Fourier sine quadrature transforms (FSQTs), designated as Fourier quadrature transforms (FQTs). Using these FQTs, we propose sixteen Fourier-Singh analytic signal (FSAS) representations with the following properties: (1) the real part of eight FSAS representations is the original signal and the imaginary part is the FCQT of the real part, (2) the imaginary part of eight FSAS representations is the original signal and the real part is the FSQT of the real part, (3) like the GAS, the Fourier spectrum of all the FSAS representations has only positive frequencies; however, unlike the GAS, the real and imaginary parts of the proposed FSAS representations are not orthogonal to each other. The Fourier decomposition method (FDM) is an adaptive data analysis approach to decompose a signal into a set of a small number of Fourier intrinsic band functions which are AM-FM components. This study also proposes a new formulation of the FDM using the discrete cosine transform (DCT) with the GAS and FSAS representations, and demonstrates its efficacy for improved time-frequency-energy representation and analysis of nonlinear and non-stationary time series. 
Discrete Morse Theory  Discrete Morse theory is a tool for determining equivalences between topological spaces arising from discrete mathematical structures. This theory was developed by Robin Forman in the 1990s as a combinatorial analog to Morse theory, developed by Marston Morse in the 1920s. The original theory deals with analyzing such equivalences for general topological spaces, while discrete Morse theory provides similar methods of analysis for topological spaces endowed with additional, discrete structure. For these structures, applications of the discrete theory are often more natural, as well as simpler and more straightforward to apply. Discrete Morse theory has applications throughout many fields of pure and applied mathematics. Within pure mathematics, for example, the theory has been widely applied to problems in geometry, topology, and knot theory; and within computer science, the theory has been used to evaluate data compression algorithms and to bound the complexity of algorithms that determine whether graphs have certain properties, for example, whether all components of a graph are connected. If we wish to know whether a given property holds for a certain topological space, our question can often be reduced to the question of whether the space is equivalent to another space for which the property holds. For example, whether a simple algorithm exists for determining if a graph is connected depends on whether the structure that represents the space of not-connected graphs can be shrunken to a point. Alas, it cannot, so any algorithm for testing graph connectedness must, at least in some cases, conduct an exhaustive search. This result has real-world implications: for example, it means that if we want to test a communications system (say, immediately after a disaster) to determine whether it is still connected, there is no guaranteed way of finding the answer without testing every component individually. 
http://…/Discrete_Morse_theory http://…/Morse_theory http://…/s48forman.pdf TDAmapper 
Discrete Sparklines  
Discrete-Event Systems  A discrete event system is a dynamic system with discrete states, the transitions of which are triggered by events. This provides a general framework for many man-made systems where the system dynamics not only follow physical laws but also the man-made rules. It is difficult to describe the dynamics of these systems using closed-form expressions. In many cases simulation is the only faithful way to describe the system dynamics and for performance evaluation. 
Discrete-Time Method of Successive Approximations (MSA) 
Deep learning is formulated as a discrete-time optimal control problem. This allows one to characterize necessary conditions for optimality and develop training algorithms that do not rely on gradients with respect to the trainable parameters. In particular, we introduce the discrete-time method of successive approximations (MSA), which is based on Pontryagin’s maximum principle, for training neural networks. A rigorous error estimate for the discrete MSA is obtained, which sheds light on its dynamics and the means to stabilize the algorithm. The developed methods are applied to train, in a rather principled way, neural networks with weights that are constrained to take values in a discrete set. We obtain competitive performance and, interestingly, very sparse weights in the case of ternary networks, which may be useful in model deployment in low-memory devices. 
Discriminant Analysis  Discriminant analysis is used to distinguish distinct sets of observations and allocate new observations to previously defined groups. This method is commonly used in biological species classification, in medical classification of tumors, in facial recognition technologies, and in the credit card and insurance industries for determining risk. HiDimDA 
Discriminant Function Analysis  Discriminant function analysis is a statistical analysis to predict a categorical dependent variable (called a grouping variable) by one or more continuous or binary independent variables (called predictor variables). The original dichotomous discriminant analysis was developed by Sir Ronald Fisher in 1936. It is different from an ANOVA or MANOVA, which is used to predict one (ANOVA) or multiple (MANOVA) continuous dependent variables by one or more independent categorical variables. Discriminant function analysis is useful in determining whether a set of variables is effective in predicting category membership. Discriminant analysis is used when groups are known a priori (unlike in cluster analysis). Each case must have a score on one or more quantitative predictor measures, and a score on a group measure. In simple terms, discriminant function analysis is classification: the act of distributing things into groups, classes or categories of the same type. Moreover, it is a useful follow-up procedure to a MANOVA instead of doing a series of one-way ANOVAs, for ascertaining how the groups differ on the composite of dependent variables. In this case, a significant F test allows classification based on a linear combination of predictor variables. Terminology can get confusing here, as in MANOVA, the dependent variables are the predictor variables, and the independent variables are the grouping variables. 
Discriminative k-shot learning  This paper introduces a probabilistic framework for k-shot image classification. The goal is to generalise from an initial large-scale classification task to a separate task comprising new classes and small numbers of examples. The new approach not only leverages the feature-based representation learned by a neural network from the initial task (representational transfer), but also information about the form of the classes (concept transfer). The concept information is encapsulated in a probabilistic model for the final layer weights of the neural network which then acts as a prior when probabilistic k-shot learning is performed. Surprisingly, simple probabilistic models and inference schemes outperform many existing k-shot learning approaches and compare favourably with the state-of-the-art method in terms of error rate. The new probabilistic methods are also able to accurately model uncertainty, leading to well calibrated classifiers, and they are easily extensible and flexible, unlike many recent approaches to k-shot learning. 
Discriminative Model  Discriminative models, also called conditional models, are a class of models used in machine learning for modeling the dependence of an unobserved variable y on an observed variable x. Within a probabilistic framework, this is done by modeling the conditional probability distribution P(y|x), which can be used for predicting y from x. Discriminative models, as opposed to generative models, do not allow one to generate samples from the joint distribution of x and y. However, for tasks such as classification and regression that do not require the joint distribution, discriminative models can yield superior performance. On the other hand, generative models are typically more flexible than discriminative models in expressing dependencies in complex learning tasks. In addition, most discriminative models are inherently supervised and cannot easily be extended to unsupervised learning. Application specific details ultimately dictate the suitability of selecting a discriminative versus generative model. 
Discriminative Optimization (DO) 
Many computer vision problems are formulated as the optimization of a cost function. This approach faces two main challenges: (i) designing a cost function with a local optimum at an acceptable solution, and (ii) developing an efficient numerical method to search for one (or multiple) of these local optima. While designing such functions is feasible in the noiseless case, the stability and location of local optima are mostly unknown under noise, occlusion, or missing data. In practice, this can result in undesirable local optima or not having a local optimum in the expected place. On the other hand, numerical optimization algorithms in high-dimensional spaces are typically local and often rely on expensive first or second order information to guide the search. To overcome these limitations, this paper proposes Discriminative Optimization (DO), a method that learns search directions from data without the need of a cost function. Specifically, DO explicitly learns a sequence of updates in the search space that leads to stationary points that correspond to desired solutions. We provide a formal analysis of DO and illustrate its benefits in the problem of 3D point cloud registration, camera pose estimation, and image denoising. We show that DO performed comparably or outperformed state-of-the-art algorithms in terms of accuracy, robustness to perturbations, and computational efficiency. 
Discriminative PCA (dPCA) 
Principal component analysis (PCA) is widely used for feature extraction and dimensionality reduction, with documented merits in diverse tasks involving high-dimensional data. Standard PCA copes with one dataset at a time, but it is challenged when it comes to analyzing multiple datasets jointly. In certain data science settings, however, one is often interested in extracting the most discriminative information from one dataset of particular interest (a.k.a. target data) relative to the other(s) (a.k.a. background data). To this end, this paper puts forth a novel approach, termed discriminative (d) PCA, for such discriminative analytics of multiple datasets. Under certain conditions, dPCA is proved to be least-squares optimal in recovering the component vector unique to the target data relative to background data. To account for nonlinear data correlations, (linear) dPCA models for one or multiple background datasets are generalized through kernel-based learning. Interestingly, all dPCA variants admit an analytical solution obtainable with a single (generalized) eigenvalue decomposition. Finally, corroborating dimensionality reduction tests using both synthetic and real datasets are provided to validate the effectiveness of the proposed methods.
DisentAngled Representation Learning Agent (DARLA) 
Domain adaptation is an important open problem in deep reinforcement learning (RL). In many scenarios of interest data is hard to obtain, so agents may learn a source policy in a setting where data is readily available, with the hope that it generalises well to the target domain. We propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act. DARLA’s vision is based on learning a disentangled representation of the observed environment. Once DARLA can see, it is able to acquire source policies that are robust to many domain shifts, even with no access to the target domain. DARLA significantly outperforms conventional baselines in zero-shot domain adaptation scenarios, an effect that holds across a variety of RL environments (Jaco arm, DeepMind Lab) and base RL algorithms (DQN, A3C and EC).
DisguiseNet  This paper describes our approach for the Disguised Faces in the Wild (DFW) 2018 challenge. The task is to verify the identity of a person among disguised and impostor images. Given the importance of face verification, it is essential to compare methods on a common platform. Our approach is based on the VGG-face architecture paired with a contrastive loss based on the cosine distance metric. To augment the data set, we source more data from the internet. The experiments show the effectiveness of the approach on the DFW data. We show that adding extra data with noisy labels to the DFW dataset also helps increase the generalization performance of the network. The proposed network achieves a 27.13% absolute increase in accuracy over the DFW baseline.
DISPATCH  This work presents the first algorithm for the problem of weighted online perfect bipartite matching with i.i.d. arrivals. Previous work only considered adversarial arrival sequences. In this problem, we are given a known set of workers, a distribution over job types, and non-negative utility weights for each (worker, job type) pair. At each time step, a job is drawn i.i.d. from the distribution over job types. Upon arrival, the job must be irrevocably assigned to a worker. The goal is to maximize the expected sum of utilities after all jobs are assigned. Our work is motivated by the application of ride-hailing, where jobs represent passengers and workers represent drivers. We introduce DISPATCH, a 0.5-competitive, randomized algorithm, and prove that 0.5-competitive is the best possible. DISPATCH first selects a ‘preferred worker’ and assigns the job to this worker if it is available. The preferred worker is determined based on an optimal solution to a fractional transportation problem. If the preferred worker is not available, DISPATCH randomly selects a worker from the available workers. We show that DISPATCH maintains a uniform distribution over the workers even when the distribution over the job types is non-uniform.
DISsimilarity COefficient Networks (DISCO Nets) 
We present a new type of probabilistic model which we call DISsimilarity COefficient Networks (DISCO Nets). DISCO Nets allow us to efficiently sample from a posterior distribution parametrised by a neural network. During training, DISCO Nets are learned by minimising the dissimilarity coefficient between the true distribution and the estimated distribution. This allows us to tailor the training to the loss related to the task at hand. We empirically show that (i) by modeling uncertainty on the output value, DISCO Nets outperform equivalent nonprobabilistic predictive networks and (ii) DISCO Nets accurately model the uncertainty of the output, outperforming existing probabilistic models based on deep neural networks. 
Dissimilarity Measure  If features are given or can be defined, they can be used to define a distance measure between objects. This can be understood as the Euclidean distance in a properly scaled feature space. For good features this will also result in good dissimilarities. However, as long as dissimilarities are based on features, their performance will be determined by the quality of those features. As features do not describe the full objects, it is possible that two objects that are different have a zero distance based on the available features. This is an essential cause of class overlap in feature spaces. Dissimilarities offer the possibility to overcome this. If the dissimilarity measure is defined in such a way that objects have a zero distance only to themselves or to entirely identical copies of themselves (which thereby should belong to the same class), there is no class overlap.
Distance Based on Conditional Ordered List (DCOL) 
nlnet 
Distance Correlation  In statistics and in probability theory, distance correlation is a measure of statistical dependence between two random variables or two random vectors of arbitrary, not necessarily equal dimension. An important property is that this measure of dependence is zero if and only if the random variables are statistically independent. This measure is derived from a number of other quantities that are used in its specification, specifically: distance variance, distance standard deviation and distance covariance. These take the same roles as the ordinary moments with corresponding names in the specification of the Pearson product-moment correlation coefficient. These distance-based measures can be put into an indirect relationship to the ordinary moments by an alternative formulation (described below) using ideas related to Brownian motion, and this has led to the use of names such as Brownian covariance and Brownian distance covariance. cdcsis
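The sample version of distance correlation can be sketched in a few lines. The example below assumes scalar observations and the usual double-centering of the pairwise distance matrices; the function names are illustrative and are not taken from the cdcsis package or any other library.

```python
import math


def _centered_distances(values):
    """Pairwise |v_i - v_j| matrix with row, column and grand means removed."""
    n = len(values)
    d = [[abs(values[i] - values[j]) for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in d]  # equal to column means (d is symmetric)
    grand_mean = sum(row_means) / n
    return [
        [d[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length sequences of numbers."""
    n = len(x)
    a = _centered_distances(x)
    b = _centered_distances(y)
    # Squared sample distance covariance and distance variances.
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar_x = sum(a[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    dvar_y = sum(b[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    denom = math.sqrt(dvar_x * dvar_y)
    if denom == 0.0:
        return 0.0  # at least one sample is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


# A symmetric parabola has zero Pearson correlation but clear dependence.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v * v for v in x]
dcor_nonlinear = distance_correlation(x, y)
dcor_linear = distance_correlation(x, [2.0 * v + 1.0 for v in x])
```

On this toy data the linear relationship gives a distance correlation of 1, while the parabola, which Pearson correlation cannot detect, still gives a clearly positive value.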
Distance Metric Learning (DML) 
Distance metric learning (DML), which learns a distance metric from labeled ‘similar’ and ‘dissimilar’ data pairs, is widely utilized. Recently, several works have investigated orthogonality-promoting regularization (OPR), which encourages the projection vectors in DML to be close to orthogonal, to achieve three effects: (1) high balancedness: achieving comparable performance on both frequent and infrequent classes; (2) high compactness: using a small number of projection vectors to achieve a ‘good’ metric; (3) good generalizability: alleviating overfitting to training data. While showing promising results, these approaches suffer from three problems. First, they involve solving non-convex optimization problems where achieving the global optimum is NP-hard. Second, there is no theoretical understanding of why OPR can lead to balancedness. Third, the current generalization error analysis of OPR is not directly on the regularizer. In this paper, we address these three issues by (1) seeking convex relaxations of the original non-convex problems so that the global optimum is guaranteed to be achievable; (2) providing a formal analysis of OPR’s capability of promoting balancedness; (3) providing a theoretical analysis that directly reveals the relationship between OPR and generalization performance. Experiments on various datasets demonstrate that our convex methods are more effective in promoting balancedness, compactness, and generalization, and are computationally more efficient, compared with the non-convex methods.
Distance Multivariance  We introduce two new measures for the dependence of $n \ge 2$ random variables: ‘distance multivariance’ and ‘total distance multivariance’. Both measures are based on the weighted $L^2$-distance of quantities related to the characteristic functions of the underlying random variables. They extend distance covariance (introduced by Szekely, Rizzo and Bakirov) and generalized distance covariance (introduced in part I) from pairs of random variables to $n$-tuples of random variables. We show that total distance multivariance can be used to detect the independence of $n$ random variables and has a simple finite-sample representation in terms of distance matrices of the sample points, where distance is measured by a continuous negative definite function. Based on our theoretical results, we present a test for independence of multiple random vectors which is consistent against all alternatives. multivariance
Distance Preservation to Local Mean (DPLM) 
In this paper, we propose a nonlinear dimensionality reduction algorithm for the manifold of Symmetric Positive Definite (SPD) matrices that considers the geometry of SPD matrices and provides a low-dimensional representation of the manifold with high class discrimination. The proposed algorithm tries to preserve the local structure of the data by preserving distance to local mean (DPLM) and also provides an implicit projection matrix. DPLM is linear in the number of training samples and may use label information, when available, to improve performance in classification tasks. We performed several experiments on the multi-class dataset IIa from BCI competition IV. The results show that our approach, used as a dimensionality reduction technique, leads to superior results in comparison with other competitors in the related literature because of its robustness against outliers. The experiments confirm that the combination of DPLM with FGMDM as the classifier leads to state-of-the-art performance on this dataset.
Distance to Kernel and Embedding (D2KE) 
For many machine learning problem settings, particularly with structured inputs such as sequences or sets of objects, a distance measure between inputs can be specified more naturally than a feature representation. However, most standard machine learning models are designed for inputs with a vector feature representation. In this work, we consider the estimation of a function $f:\mathcal{X} \rightarrow \mathbb{R}$ based solely on a dissimilarity measure $d:\mathcal{X}\times\mathcal{X} \rightarrow \mathbb{R}$ between inputs. In particular, we propose a general framework to derive a family of positive definite kernels from a given dissimilarity measure, which subsumes the widely used representative-set method as a special case, and relates to the well-known distance substitution kernel in a limiting case. We show that functions in the corresponding Reproducing Kernel Hilbert Space (RKHS) are Lipschitz-continuous w.r.t. the given distance metric. We provide a tractable algorithm to estimate a function from this RKHS, and show that it enjoys better generalizability than Nearest-Neighbor estimates. Our approach draws from the literature of Random Features, but instead of deriving feature maps from an existing kernel, we construct novel kernels from a random feature map that we specify given the distance measure. We conduct classification experiments with such disparate domains as strings, time series, and sets of vectors, where our proposed framework compares favorably to existing distance-based learning methods such as $k$-nearest-neighbors, distance-substitution kernels, pseudo-Euclidean embedding, and the representative-set method.
Distance to Measure (DTM) 
Data often comes in the form of a point cloud sampled from an unknown compact subset of Euclidean space. The general goal of geometric inference is then to recover geometric and topological features (e.g., Betti numbers, normals) of this subset from the approximating point cloud data. It appears that the study of distance functions allows one to address many of these questions successfully. However, one of the main limitations of this framework is that it does not cope well with outliers or with background noise. In this paper, we show how to extend the framework of distance functions to overcome this problem. Replacing compact subsets by measures, we introduce a notion of distance function to a probability distribution in $\mathbb{R}^d$. These functions share many properties with classical distance functions, which make them suitable for inference purposes. In particular, by considering appropriate level sets of these distance functions, we show that it is possible to reconstruct offsets of sampled shapes with topological guarantees even in the presence of outliers. Moreover, in settings where empirical measures are considered, these functions can be easily evaluated, making them of particular practical interest.
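For an empirical measure on n sample points, the distance to measure is commonly evaluated as the root mean squared distance from the query to its k nearest sample points, where k corresponds to a mass parameter m. The sketch below assumes that form and rounds k up to ceil(m·n) as a simplification; the function name and parameters are illustrative.

```python
import math


def distance_to_measure(query, points, m):
    """Empirical distance to measure at `query` with mass parameter m in (0, 1].

    Uses the k = ceil(m * n) nearest sample points (a simplifying assumption
    when m * n is not an integer).
    """
    n = len(points)
    k = max(1, math.ceil(m * n))
    squared = sorted(
        sum((q - p) ** 2 for q, p in zip(query, pt)) for pt in points
    )
    return math.sqrt(sum(squared[:k]) / k)


# Three points near the origin plus one far outlier.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (100.0, 100.0)]
dtm_inlier = distance_to_measure((0.0, 0.0), cloud, m=0.5)
dtm_outlier = distance_to_measure((100.0, 100.0), cloud, m=0.5)
```

The ordinary distance to the point cloud is zero at both query locations, since both are sample points. The distance to measure stays small at the inlier and is large at the outlier, which illustrates the robustness property described above.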
Distance Weighted Discrimination (DWD) 
High Dimension Low Sample Size statistical analysis is becoming increasingly important in a wide range of applied contexts. In such situations, it is seen that the appealing discrimination method called the Support Vector Machine can be improved. The revealing concept is ‘data piling’ at the margin. This leads naturally to the development of ‘Distance Weighted Discrimination’, which also is based on modern computationally intensive optimization methods, and seems to give improved ‘generalizability’. Another Look at DWD: Thrifty Algorithm and Bayes Risk Consistency in RKHS sdwd 
Distance-Based k-Medoids  This paper proposes a new algorithm for K-medoids clustering which runs like the K-means algorithm and tests several methods for selecting initial medoids. The proposed algorithm calculates the distance matrix once and uses it for finding new medoids at every iterative step. To evaluate the proposed algorithm, we use some real and artificial data sets and compare with the results of other algorithms in terms of the adjusted Rand index. Experimental results show that the proposed algorithm requires significantly less computation time than partitioning around medoids (PAM) while achieving comparable performance. kmed
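The K-means-like iteration on a precomputed distance matrix can be sketched as follows. This is an illustrative reading of the description above, not the paper's or the kmed package's implementation. In particular, taking the first k points as initial medoids is a placeholder assumption, whereas the paper tests several initialization methods.

```python
def kmedoids(dist, k, init=None, max_iter=100):
    """K-means-style k-medoids on a precomputed n x n distance matrix.

    `init` is an optional list of k starting medoid indices; defaulting to
    the first k points is an arbitrary choice made for this sketch.
    """
    n = len(dist)
    medoids = list(init) if init is not None else list(range(k))

    def assign(current):
        # Each point joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: dist[i][current[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # New medoid: the member with the smallest total distance to its cluster.
            best = min(members, key=lambda cand: sum(dist[cand][j] for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)


# Two well-separated groups on a line; the distance matrix is computed once.
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]
medoids, labels = kmedoids(dist, k=2)
```

Because only `dist` is consulted inside the loop, the same code works for any dissimilarity, not just Euclidean distance.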
Distill and Transfer Learning (Distral) 
Most deep reinforcement learning algorithms are data inefficient in complex and rich environments, limiting their applicability to many scenarios. One direction for improving data efficiency is multitask learning with shared neural network parameters, where efficiency may be improved through transfer across related tasks. In practice, however, this is not usually observed, because gradients from different tasks can interfere negatively, making learning unstable and sometimes even less data efficient. Another issue is the different reward schemes between tasks, which can easily lead to one task dominating the learning of a shared model. We propose a new approach for joint training of multiple tasks, which we refer to as Distral (Distill & transfer learning). Instead of sharing parameters between the different workers, we propose to share a ‘distilled’ policy that captures common behaviour across tasks. Each worker is trained to solve its own task while constrained to stay close to the shared policy, while the shared policy is trained by distillation to be the centroid of all task policies. Both aspects of the learning process are derived by optimizing a joint objective function. We show that our approach supports efficient transfer on complex 3D environments, outperforming several related methods. Moreover, the proposed learning process is more robust and more stable—attributes that are critical in deep reinforcement learning. 
DistMult  Knowledge Base Completion: Baselines Strike Back 
Distributed Computing  Distributed computing is a field of computer science that studies distributed systems. A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications. A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs. There are many alternatives for the message passing mechanism, including RPC-like connectors and message queues. A goal and challenge pursued by some computer scientists and practitioners in distributed systems is location transparency; however, this goal has fallen out of favour in industry, as distributed systems are different from conventional non-distributed systems, and the differences, such as network partitions, partial system failures, and partial upgrades, cannot simply be ‘papered over’ by attempts at ‘transparency’ (see the CAP theorem). Distributed computing also refers to the use of distributed systems to solve computational problems. In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers, which communicate with each other by message passing.
Distributed Coordinated Multi-Agent Bidding (DCMAB) 
Real-time advertising allows advertisers to bid for each impression for a visiting user. To optimize a specific goal such as maximizing the revenue led by ad placements, advertisers not only need to estimate the relevance between the ads and the user’s interests, but most importantly require a strategic response with respect to other advertisers bidding in the market. In this paper, we formulate bidding optimization with multi-agent reinforcement learning. To deal with a large number of advertisers, we propose a clustering method and assign each cluster a strategic bidding agent. A practical Distributed Coordinated Multi-Agent Bidding (DCMAB) method has been proposed and implemented to balance the tradeoff between competition and cooperation among advertisers. The empirical study on our industry-scale real-world data has demonstrated the effectiveness of our modeling methods. Our results show that cluster-based bidding largely outperforms single-agent and bandit approaches, and that coordinated bidding achieves better overall objectives than purely self-interested bidding agents.
Distributed Data Shuffling  Data shuffling of training data among different computing nodes (workers) has been identified as a core element to improve the statistical performance of modern large-scale machine learning algorithms. Data shuffling is often considered one of the most significant bottlenecks in such systems due to the heavy communication load. Under a master-worker architecture (where a master has access to the entire dataset and only communication between the master and workers is allowed), coding has recently been proved to considerably reduce the communication load. In this work, we consider a different communication paradigm referred to as distributed data shuffling, where workers, connected by a shared link, are allowed to communicate with one another while no communication between the master and workers is allowed. Under the constraint of uncoded cache placement, we first propose a general coded distributed data shuffling scheme, which achieves the optimal communication load within a factor of two. Then, we propose an improved scheme achieving exact optimality for either large memory size or at most four workers in the system.
Distributed Dynamic Data-intensive Science (D3 Science) 
A common feature across many science and engineering applications is the amount and diversity of data and computation that must be integrated to yield insights. Data sets are growing larger and becoming distributed; and their location, availability and properties are often time-dependent. Collectively, these characteristics give rise to dynamic distributed data-intensive applications. While ‘static’ data applications have received significant attention, the characteristics, requirements, and software systems for the analysis of large volumes of dynamic, distributed data, and data-intensive applications have received relatively less attention. This paper surveys several representative dynamic distributed data-intensive application scenarios, provides a common conceptual framework to understand them, and examines the infrastructure used in support of applications.
Distributed Expectation-Maximization (DEM) 
The family of Expectation-Maximization (EM) algorithms provides a general approach to fitting flexible models for large and complex data. The expectation (E) step of EM-type algorithms is time-consuming in massive data applications because it requires multiple passes through the full data. We address this problem by proposing an asynchronous and distributed generalization of the EM called the Distributed EM (DEM). Using DEM, existing EM-type algorithms are easily extended to massive data settings by exploiting the divide-and-conquer technique and widely available computing power, such as grid computing. The DEM algorithm reserves two groups of computing processes called workers and managers for performing the E step and the maximization step (M step), respectively. The samples are randomly partitioned into a large number of disjoint subsets and are stored on the worker processes. The E step of the DEM algorithm is performed in parallel on all the workers, and every worker communicates its results to the managers at the end of its local E step. The managers perform the M step after they have received results from a $\gamma$-fraction of the workers, where $\gamma$ is a fixed constant in $(0, 1]$. The sequence of parameter estimates generated by the DEM algorithm retains the attractive properties of EM: convergence of the sequence of parameter estimates to a local mode and linear global rate of convergence. Across diverse simulations focused on linear mixed-effects models, the DEM algorithm is significantly faster than competing EM-type algorithms while having a similar accuracy. The DEM algorithm maintains its superior empirical performance on a movie ratings database consisting of 10 million ratings.
Distributed Lag Model  A distributed-lag model is a dynamic model in which the effect of a regressor x on y occurs over time rather than all at once. In the simple case of one explanatory variable and a linear relationship, the model can be written as $y_t = \alpha + \sum_{s=0}^{\infty} \beta_s x_{t-s} + u_t$. This form is very similar to the infinite-moving-average representation of an ARMA process, except that the lag polynomial on the right-hand side is applied to the explanatory variable x rather than to a white-noise process e. The individual coefficients $\beta_s$ are called lag weights, and they collectively comprise the lag distribution. They define the pattern of how x affects y over time.
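To make the role of the lag weights concrete, the sketch below evaluates a finite distributed lag, y_t = alpha + sum over s of beta_s * x_(t-s), with the disturbance term omitted and values of x before the start of the sample treated as zero. The truncation to a finite number of lags, the zero pre-sample values, and the function name are all assumptions made for illustration.

```python
def distributed_lag(x, alpha, lag_weights):
    """Deterministic part of a finite distributed-lag model.

    y[t] = alpha + sum_s lag_weights[s] * x[t - s], with x treated as 0
    before the start of the sample (an assumption of this sketch).
    """
    y = []
    for t in range(len(x)):
        effect = sum(
            beta * x[t - s] for s, beta in enumerate(lag_weights) if t - s >= 0
        )
        y.append(alpha + effect)
    return y


weights = [0.5, 0.3, 0.2]

# A one-period impulse in x traces out the lag distribution itself.
impulse_response = distributed_lag([1, 0, 0, 0, 0], alpha=0.0, lag_weights=weights)

# A permanent unit increase in x accumulates to the sum of the lag weights.
step_response = distributed_lag([1, 1, 1, 1, 1], alpha=0.0, lag_weights=weights)
```

The impulse response shows the pattern of how x affects y over time, and the step response shows the long-run effect settling at the sum of the weights once all lags are active.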
Distributed Lance-William Clustering Algorithm  One important tool is the optimal clustering of data into useful categories. Dividing similar objects into a smaller number of clusters is of importance in many applications. These include search engines, monitoring of academic performance, biology and wireless networks. We first discuss a number of clustering methods. We present a parallel algorithm for the efficient clustering of objects into groups based on their similarity to each other. The input consists of an n by n distance matrix. This matrix has a distance ranking for each pair of objects. The smaller the number, the more similar the two objects are to each other. We utilize parallel processors to calculate a hierarchical clustering of these n items based on this matrix. Another advantage of our method is the distribution of the large n by n matrix. We have implemented our algorithm and have found it to be scalable both in terms of processing speed and storage.
Distributed Machine Learning Toolkit (DMTK) 
Distributed machine learning has become more important than ever in this big data era. Especially in recent years, practices have demonstrated the trend that bigger models tend to generate better accuracies in various applications. However, it remains a challenge for common machine learning researchers and practitioners to learn big models, because the task usually requires a large number of computation resources. In order to enable the training of big models using just a modest cluster and in an efficient manner, we release the Microsoft Distributed Machine Learning Toolkit (DMTK), which contains both algorithmic and system innovations. These innovations make machine learning tasks on big data highly scalable, efficient and flexible. 
Distributed Matrix  A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Three types of distributed matrices have been implemented so far. The basic type is called RowMatrix. A RowMatrix is a row-oriented distributed matrix without meaningful row indices, e.g., a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. We assume that the number of columns is not huge for a RowMatrix so that a single local vector can be reasonably communicated to the driver and can also be stored / operated on using a single node. An IndexedRowMatrix is similar to a RowMatrix but with row indices, which can be used for identifying rows and executing joins. A CoordinateMatrix is a distributed matrix stored in coordinate list (COO) format, backed by an RDD of its entries.
Distributed Robust Algorithm for Countbased Learning (Dracula) 

Distributed Stream Data Processing System (DSDPS) 
In this paper, we focus on general-purpose Distributed Stream Data Processing Systems (DSDPSs), which process unbounded streams of continuous data at scale, in a distributed fashion, in real or near-real time. A fundamental problem in a DSDPS is the scheduling problem, with the objective of minimizing average end-to-end tuple processing time. A widely used solution is to distribute workload evenly over machines in the cluster in a round-robin manner, which is not efficient because it does not consider communication delay. Model-based approaches do not work well either, due to the high complexity of the system environment. We aim to develop a novel model-free approach that can learn to control a DSDPS well from its experience rather than from accurate and mathematically solvable system models, just as a human learns a skill (such as cooking, driving, swimming, etc.). Specifically, we propose, for the first time, to leverage emerging Deep Reinforcement Learning (DRL) for enabling model-free control in DSDPSs, and present the design, implementation and evaluation of a novel and highly effective DRL-based control framework, which minimizes average end-to-end tuple processing time by jointly learning the system environment via collecting very limited runtime statistics data and making decisions under the guidance of powerful Deep Neural Networks. To validate and evaluate the proposed framework, we implemented it based on a widely used DSDPS, Apache Storm, and tested it with three representative applications. Extensive experimental results show that 1) compared to Storm’s default scheduler and the state-of-the-art model-based method, the proposed framework reduces average tuple processing time by 33.5% and 14.0% respectively on average; 2) the proposed framework can quickly reach a good scheduling solution during online learning, which justifies its practicability for online control in DSDPSs.
Distributed Stream Processing Engines (DSPE) 
Distributed stream processing engines (DSPEs) are an emerging family of MapReduce-inspired technologies. These engines allow parallel computation on streams to be expressed, and combine the scalability of distributed processing with the efficiency of streaming algorithms. Examples of these engines include Storm, S4, and Samza.
Distribution Regression  Linear regression is a fundamental and popular statistical method. There are various kinds of linear regression, such as mean regression and quantile regression. In this paper, we propose a new one called distribution regression, which allows a broad spectrum of error distributions in the linear regression. Our method uses a nonparametric technique to estimate regression parameters. Our studies indicate that our method provides a better alternative than mean regression and quantile regression under many settings, particularly for asymmetrical heavy-tailed or multimodal distributions of the error term. Under some regularity conditions, our estimator is $\sqrt{n}$-consistent and asymptotically normal. The proof of the asymptotic normality of our estimator is very challenging because our nonparametric likelihood function cannot be transformed into a sum of independent and identically distributed random variables. Furthermore, a penalized likelihood estimator is proposed and enjoys the so-called oracle property with a diverging number of parameters. Numerical studies also demonstrate the effectiveness and the flexibility of the proposed method.
Distribution Regression Network (DRN) 
We introduce our Distribution Regression Network (DRN) which performs regression from input probability distributions to output probability distributions. Compared to existing methods, DRN learns with fewer model parameters and easily extends to multiple input and multiple output distributions. On synthetic and real-world datasets, DRN performs similarly to or better than the state of the art. Furthermore, DRN generalizes the conventional multilayer perceptron (MLP). In the framework of MLP, each node encodes a real number, whereas in DRN, each node encodes a probability distribution.
Distribution Separation Method (DSM) 

Distributional Adversarial Networks  We propose a framework for adversarial training that relies on a sample rather than a single sample point as the fundamental unit of discrimination. Inspired by discrepancy measures and two-sample tests between probability distributions, we propose two such distributional adversaries that operate and predict on samples, and show how they can be easily implemented on top of existing models. Various experimental results show that generators trained with our distributional adversaries are much more stable and are remarkably less prone to mode collapse than traditional models trained with pointwise prediction discriminators. The application of our framework to domain adaptation also results in considerable improvement over the recent state of the art.
Distributional Variant of Gradient Temporal-Difference (Distributional GTD2) 
We devise a distributional variant of gradient temporal-difference (TD) learning. Distributional reinforcement learning has been demonstrated to outperform regular reinforcement learning in a recent study (Bellemare et al., 2017). In our paper, we design two new algorithms called distributional GTD2 and distributional TDC using the Cramér distance on the distributional version of the Bellman error objective function, which inherit advantages of both the nonlinear gradient TD algorithms and the distributional RL approach. We prove asymptotic almost-sure convergence to a locally optimal solution for general smooth function approximators, which include neural networks that have been widely used in recent studies to solve real-life RL problems. In each step, the computational complexity is linear w.r.t. the number of parameters of the function approximator, so the algorithms can be implemented efficiently for neural networks.
Distributionally Robust Stochastic Optimization (DRSO) 
A central question in statistical learning is to design algorithms that not only perform well on training data, but also generalize to new and unseen data. In this paper, we tackle this question by formulating a distributionally robust stochastic optimization (DRSO) problem, which seeks a solution that minimizes the worst-case expected loss over a family of distributions that are close to the empirical distribution in Wasserstein distances. We establish a connection between such Wasserstein DRSO and regularization. More precisely, we identify a broad class of loss functions, for which the Wasserstein DRSO is asymptotically equivalent to a regularization problem with a gradient-norm penalty. This relation provides new interpretations for problems involving regularization, including a great number of statistical learning problems and discrete choice models (e.g. multinomial logit). The connection suggests a principled way to regularize high-dimensional, non-convex problems. This is demonstrated through two applications: the training of Wasserstein generative adversarial networks (WGANs) in deep learning, and learning heterogeneous consumer preferences with a mixed logit choice model. 
DI-VAE  The encoder-decoder dialog model is one of the most prominent methods used to build dialog systems in complex domains. Yet it is limited because it cannot output interpretable actions as in traditional systems, which hinders humans from understanding its generation process. We present an unsupervised discrete sentence representation learning method that can integrate with any existing encoder-decoder dialog models for interpretable response generation. Building upon variational autoencoders (VAEs), we present two novel models, DI-VAE and DI-VST, that improve VAEs and can discover interpretable semantics via either auto encoding or context predicting. Our methods have been validated on real-world dialog datasets to discover semantic representations and enhance encoder-decoder models with interpretable generation. 
Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples (DECORATE) 
DECORATE (Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples) builds an ensemble of J48 trees by recursively adding artificial samples of the training data (‘Melville, P., & Mooney, R. J. (2005). Creating diversity in ensembles using artificial data. Information Fusion, 6(1), 99-111. doi:10.1016/j.inffus.2004.04.001’). DecorateR 
Diverse Online Feature Selection  Online feature selection has been an active research area in recent years. We propose a novel diverse online feature selection method based on Determinantal Point Processes (DPP). Our model aims to provide diverse features which can be composed in either a supervised or unsupervised framework. The framework aims to promote diversity based on the kernel produced on a feature level, through at most three stages: feature sampling, local criteria and global criteria for feature selection. In the feature sampling stage, we sample the incoming stream of features using conditional DPP. The local criteria are used to assess and select streamed features (i.e. only when they arrive): we use unsupervised scale-invariant methods to remove redundant features and, optionally, supervised methods to introduce label information to assess relevant features. Lastly, the global criteria use regularization methods to select a globally optimal subset of features. This three-stage procedure continues until there are no more features arriving or some predefined stopping condition is met. We demonstrate, based on experiments, that this approach yields better compactness and is comparable to, and in some instances outperforms, other state-of-the-art online feature selection methods. 
Diversity Index  A diversity index is a quantitative measure that reflects how many different types (such as species) there are in a dataset, and simultaneously takes into account how evenly the basic entities (such as individuals) are distributed among those types. The value of a diversity index increases both when the number of types increases and when evenness increases. For a given number of types, the value of a diversity index is maximized when all types are equally abundant. 
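The two properties stated above (the value grows with the number of types and with evenness) can be checked numerically. The following is a minimal sketch using two common diversity indices, the Shannon index and the Gini-Simpson index; the entry itself does not name a specific index, so the choice of these two and the function names are assumptions made for illustration.

```python
import math

def shannon_index(counts):
    """Shannon index H = -sum(p_i * ln(p_i)) over types with a nonzero count."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson_index(counts):
    """Gini-Simpson index 1 - sum(p_i ** 2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical abundance counts: four types, evenly and unevenly distributed.
even = [25, 25, 25, 25]
skewed = [85, 5, 5, 5]

print(shannon_index(even), shannon_index(skewed))
print(simpson_index(even), simpson_index(skewed))
```

For four equally abundant types the Shannon index equals ln 4, its maximum for that number of types, and the skewed counts give a lower value under both indices.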
DI-VST  The encoder-decoder dialog model is one of the most prominent methods used to build dialog systems in complex domains. Yet it is limited because it cannot output interpretable actions as in traditional systems, which hinders humans from understanding its generation process. We present an unsupervised discrete sentence representation learning method that can integrate with any existing encoder-decoder dialog models for interpretable response generation. Building upon variational autoencoders (VAEs), we present two novel models, DI-VAE and DI-VST, that improve VAEs and can discover interpretable semantics via either auto encoding or context predicting. Our methods have been validated on real-world dialog datasets to discover semantic representations and enhance encoder-decoder models with interpretable generation. 
Dixon’s Q Test  In statistics, Dixon’s Q test, or simply the Q test, is used for identification and rejection of outliers. The test assumes normally distributed data and, per Dean and Dixon and others, should be used sparingly and never more than once in a data set. To apply a Q test for bad data, arrange the data in order of increasing values and calculate Q as defined: Q = gap/range, where gap is the absolute difference between the outlier in question and the closest number to it, and range is the difference between the largest and smallest values. If Q > Qtable, where Qtable is a reference value corresponding to the sample size and confidence level, then reject the questionable point. Note that only one point may be rejected from a data set using a Q test. 
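The procedure above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the measurements are made-up sample data, and the critical value Qtable is passed in by the caller, who should look it up in a published table for the chosen sample size and confidence level.

```python
def dixon_q(values):
    """Return (suspect, Q) for the more extreme of the two end points."""
    data = sorted(values)
    if len(data) < 3:
        raise ValueError("need at least three observations")
    spread = data[-1] - data[0]          # range = largest - smallest
    if spread == 0:
        raise ValueError("all values are identical")
    q_low = (data[1] - data[0]) / spread     # gap at the low end / range
    q_high = (data[-1] - data[-2]) / spread  # gap at the high end / range
    if q_high >= q_low:
        return data[-1], q_high
    return data[0], q_low

def reject_outlier(values, q_table):
    """Return the suspect point if Q > q_table, otherwise None."""
    suspect, q = dixon_q(values)
    return suspect if q > q_table else None

# Hypothetical measurements; 0.167 sits apart from the rest.
measurements = [0.189, 0.167, 0.187, 0.183, 0.186,
                0.182, 0.181, 0.184, 0.181, 0.177]
suspect, q = dixon_q(measurements)   # q is 0.010 / 0.022, about 0.455
print(suspect, round(q, 3))

# 0.412 is assumed here as the tabulated critical value for n = 10 at 90%
# confidence; verify it against your own reference table before relying on it.
print(reject_outlier(measurements, q_table=0.412))
```

In line with the entry, the sketch tests a single point only and is not meant to be applied repeatedly to the same data set.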
Django  Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design. Built by experienced developers, it takes care of much of the hassle of Web development, so you can focus on writing your app without needing to reinvent the wheel. It’s free and open source. 
DKPro Similarity  DKPro Similarity is an open source software package for developing text similarity algorithms. The framework is designed to complement DKPro Core, a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. By leveraging the power of the tools available in DKPro Core, it allows for a rich set of similarity computation operations, including the design of full-fledged language processing pipelines and fully customizable processing steps. 
Dlib  Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real world problems. It is used in both industry and academia in a wide range of domains including robotics, embedded devices, mobile phones, and large high performance computing environments. Dlib’s open source licensing allows you to use it in any application, free of charge. dlib 
DLVHEX System  The DLVHEX system implements the HEX-semantics, which integrates answer set programming (ASP) with arbitrary external sources. Since its first release ten years ago, significant advancements were achieved. Most importantly, the exploitation of properties of external sources led to efficiency improvements and flexibility enhancements of the language, and technical improvements on the system side increased user’s convenience. In this paper, we present the current status of the system and point out the most important recent enhancements over early versions. While existing literature focuses on theoretical aspects and specific components, a bird’s eye view of the overall system is missing. In order to promote the system for real-world applications, we further present applications which were already successfully realized on top of DLVHEX. This paper is under consideration for acceptance in Theory and Practice of Logic Programming. 
DNA-GAN  Disentangling factors of variation has always been a challenging problem in representation learning. Existing algorithms suffer from many limitations, such as unpredictable disentangling factors, bad quality of generated images from encodings, lack of identity information, etc. In this paper, we propose a supervised algorithm called DNA-GAN that tries to disentangle different attributes of images. The latent representations of images are DNA-like, in which each individual piece represents an independent factor of variation. By annihilating the recessive piece and swapping a certain piece of two latent representations, we obtain another two different representations which could be decoded into images. In order to obtain realistic images and also disentangled representations, we introduce the discriminator for adversarial training. Experiments on Multi-PIE and CelebA datasets demonstrate the effectiveness of our method and the advantage of overcoming limitations existing in other methods. 
Docker  Build, Ship and Run Any App, Anywhere. Docker: an open platform for distributed applications for developers and sysadmins. Docker is a relatively new open source application and service, which is seeing interest across a number of areas. It uses recent Linux kernel features (containers, namespaces) to shield processes. While its use (superficially) resembles that of virtual machines, it is much more lightweight as it operates at the level of a single process (rather than an emulation of an entire OS layer). This also allows it to start almost instantly and to require very few resources, and hence permits an order of magnitude more deployments per host than a virtual machine. Docker offers a standard interface for creation, distribution and deployment. The shipping container analogy is apt: just as shipping containers (via their standard size and “interface”) allow global trade to prosper, Docker is aiming for nothing less for deployment. A Dockerfile provides a concise, extensible, and executable description of the computational environment. Docker software then builds a Docker image from the Dockerfile. Docker images are analogous to virtual machine images, but smaller and built in discrete, extensible and reusable layers. Images can be distributed and run on any machine that has Docker software installed, including Windows, OS X and of course Linux. Running instances are called Docker containers. A single machine can run hundreds of such containers, including multiple containers running the same image. 
DocTag2Vec  Tagging news articles or blog posts with relevant tags from a collection of predefined ones is coined as document tagging in this work. Accurate tagging of articles can benefit several downstream applications such as recommendation and search. In this work, we propose a novel yet simple approach called DocTag2Vec to accomplish this task. We substantially extend Word2Vec and Doc2Vec, two popular models for learning distributed representations of words and documents. In DocTag2Vec, we simultaneously learn the representation of words, documents, and tags in a joint vector space during training, and employ the simple $k$-nearest neighbor search to predict tags for unseen documents. In contrast to previous multi-label learning methods, DocTag2Vec directly deals with raw text instead of provided feature vectors, and in addition, enjoys advantages like the learning of tag representation, and the ability to handle newly created tags. To demonstrate the effectiveness of our approach, we conduct experiments on several datasets and show promising results against state-of-the-art methods. 
Document Classification  Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done ‘manually’ (or ‘intellectually’) or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification. The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied. Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: The content based approach and the request based approach. 
Document Term Matrix  ➘ “Term Document Matrix” 
Document-Context Language Model (DCLM) 
Text documents are structured on multiple levels of detail: individual words are related by syntax, and larger units of text are related by discourse structure. Existing language models generally fail to account for discourse structure, but it is crucial if we are to have language models that reward coherence and generate coherent texts. We present and empirically evaluate a set of multi-level recurrent neural network language models, called Document-Context Language Models (DCLMs), which incorporate contextual information both within and beyond the sentence. In comparison with word-level recurrent neural network language models, the DCLMs obtain slightly better predictive likelihoods, and considerably better assessments of document coherence. 
Dodgson Score  Dodgson’s method is a voting system proposed by the author, mathematician and logician Charles Dodgson, better known as Lewis Carroll. The method extends the Condorcet method by swapping candidates until a Condorcet winner is found. The winner is the candidate that requires the minimum number of swaps. Dodgson proposed this voting scheme in his 1876 work ‘A method of taking votes on more than two issues’. Given an integer k and an election, it is NP-complete to determine whether or not a candidate can become a Condorcet winner with fewer than k swaps. In Dodgson’s method, each voter submits an ordered list of all candidates according to their own preference (from best to worst). The winner is defined to be the candidate for whom we need to perform the minimum number of pairwise swaps (summed over all ballots) before they become a Condorcet winner. In particular, if there is already a Condorcet winner, they win the election. In short, we must find the voting profile with minimum Kendall tau distance from the input, such that it has a Condorcet winner; that winner is declared the victor. Computing the winner or even the Dodgson score of a candidate (the number of swaps needed to make that candidate a winner) is a $P^{NP}_{\parallel}$-complete problem. Efficient Dodgson-Score Calculation Using Heuristics and Parallel Computing 
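For very small elections the score can be found by exhaustive search. The sketch below is illustrative only and runs in exponential time, consistent with the hardness results mentioned above; it relies on the standard observation that only swaps moving the candidate upward in a ballot can help that candidate. The function names and the sample ballots are hypothetical.

```python
from itertools import product

def is_condorcet_winner(ballots, candidate):
    """True if candidate beats every rival in a strict pairwise majority."""
    rivals = set(ballots[0]) - {candidate}
    for rival in rivals:
        wins = sum(1 for b in ballots if b.index(candidate) < b.index(rival))
        if 2 * wins <= len(ballots):
            return False
    return True

def dodgson_score(ballots, candidate):
    """Minimum number of adjacent swaps that make candidate a Condorcet winner."""
    positions = [b.index(candidate) for b in ballots]
    best = None
    # Try every combination of "move the candidate up m places" per ballot.
    for moves in product(*(range(p + 1) for p in positions)):
        cost = sum(moves)
        if best is not None and cost >= best:
            continue
        shifted = []
        for ballot, pos, m in zip(ballots, positions, moves):
            ranking = list(ballot)
            ranking.insert(pos - m, ranking.pop(pos))  # m adjacent swaps upward
            shifted.append(ranking)
        if is_condorcet_winner(shifted, candidate):
            best = cost
    return best

# Hypothetical election with no Condorcet winner (A and C tie head to head).
ballots = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B"), ("A", "B", "C")]
scores = {c: dodgson_score(ballots, c) for c in "ABC"}
print(scores)                        # candidate with the lowest score wins
print(min(scores, key=scores.get))
```

A candidate who is already a Condorcet winner gets a score of 0, matching the remark that such a candidate wins the election outright.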
Domain Adaptation (DA) 
Domain Adaptation is a field associated with machine learning and transfer learning. This scenario arises when we aim at learning from a source data distribution a well performing model on a different (but related) target data distribution. For instance, one of the tasks of the common spam filtering problem consists in adapting a model from one user (the source distribution) to a new one who receives significantly different emails (the target distribution). Note that, when more than one source distribution is available, the problem is referred to as multi-source domain adaptation. Domain Adaptation with Randomized Expectation Maximization 
Domain Adaptive Low Rank  Deep Neural Networks trained on large datasets can be easily transferred to new domains with far fewer labeled examples by a process called fine-tuning. This has the advantage that representations learned in the large source domain can be exploited on smaller target domains. However, networks designed to be optimal for the source task are often prohibitively large for the target task. In this work we address the compression of networks after domain transfer. We focus on compression algorithms based on low-rank matrix decomposition. Existing methods base compression solely on learned network weights and ignore the statistics of network activations. We show that domain transfer leads to large shifts in network activations and that it is desirable to take this into account when compressing. We demonstrate that considering activation statistics when compressing weights leads to a rank-constrained regression problem with a closed-form solution. Because our method takes into account the target domain, it can more optimally remove the redundancy in the weights. Experiments show that our Domain Adaptive Low Rank (DALR) method significantly outperforms existing low-rank compression techniques. With our approach, the fc6 layer of VGG19 can be compressed more than 4x more than using truncated SVD alone, with only a minor or no loss in accuracy. When applied to domain-transferred networks it allows for compression down to only 5-20% of the original number of parameters with only a minor drop in performance. 
Domain Generating Algorithm (DGA) 
Domain generation algorithms (DGAs) are algorithms seen in various families of malware that are used to periodically generate a large number of domain names that can be used as rendezvous points with their controllers. The large number of potential rendezvous points makes it difficult for law enforcement to effectively shut down botnets, since infected computers will attempt to contact some of these domain names every day to receive updates or commands. Because of public-key cryptography, it is infeasible for law enforcement and other actors to mimic commands from the malware controllers, as some worms will automatically reject any updates not signed by the malware controllers. dga 
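To see why the candidate list is hard to block yet easy for both ends to agree on, it helps to look at a toy generator of the kind analysts reimplement when studying DGAs. The sketch below is a deliberately simplified illustration and does not reproduce any real malware family: the seed, the hashing scheme, the function name and the reserved `.example` suffix are all assumptions.

```python
import hashlib
from datetime import date

def generate_domains(seed, day, count=5, tld=".example"):
    """Derive `count` pseudo-random domain names from a shared seed and a date."""
    domains = []
    for index in range(count):
        material = f"{seed}-{day.isoformat()}-{index}".encode("utf-8")
        digest = hashlib.sha256(material).hexdigest()
        domains.append(digest[:12] + tld)   # first 12 hex characters as the label
    return domains

# Anyone who knows the seed and the date derives the same list, so a defender
# who has recovered the seed can precompute names to watch for or sinkhole.
today = generate_domains("demo-seed", date(2024, 1, 1))
tomorrow = generate_domains("demo-seed", date(2024, 1, 2))
print(today)
print(set(today) & set(tomorrow))   # overlap between consecutive days
```

Because the output depends only on the seed and the date, the list changes daily without any communication between the two parties.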
Domain Knowledge-driven Methodology (DoKnowMe) 
Software engineering considers performance evaluation to be one of the key portions of software quality assurance. Unfortunately, there seems to be a lack of standard methodologies for performance evaluation even in the scope of experimental computer science. Inspired by the concept of ‘instantiation’ in object-oriented programming, we distinguish the generic performance evaluation logic from the distributed and ad hoc relevant studies, and develop an abstract evaluation methodology (by analogy of ‘class’) we name Domain Knowledge-driven Methodology (DoKnowMe). By replacing five predefined domain-specific knowledge artefacts, DoKnowMe could be instantiated into specific methodologies (by analogy of ‘object’) to guide evaluators in performance evaluation of different software and even computing systems. We also propose a generic validation framework with four indicators (i.e. usefulness, feasibility, effectiveness and repeatability), and use it to validate DoKnowMe in the Cloud services evaluation domain. Given the positive and promising validation result, we plan to integrate more common evaluation strategies to improve DoKnowMe and further focus on the performance evaluation of Cloud autoscaler systems. 
Donut  To ensure undisrupted business, large Internet companies need to closely monitor various KPIs (e.g., Page Views, number of online users, and number of orders) of their Web applications, to accurately detect anomalies and trigger timely troubleshooting/mitigation. However, anomaly detection for these seasonal KPIs with various patterns and data quality has been a great challenge, especially without labels. In this paper, we propose Donut, an unsupervised anomaly detection algorithm based on VAE. Thanks to a few of our key techniques, Donut greatly outperforms a state-of-the-art supervised ensemble approach and a baseline VAE approach, and its best F-scores range from 0.75 to 0.9 for the studied KPIs from a top global Internet company. We come up with a novel KDE interpretation of reconstruction for Donut, making it the first VAE-based anomaly detection algorithm with solid theoretical explanation. 
Double Deep Machine Learning  ➘ “ReKopedia” 
Double Path Networks for Sequence to Sequence Learning (DPN-S2S) 
Encoder-decoder based Sequence to Sequence learning (S2S) has made remarkable progress in recent years. Different network architectures have been used in the encoder/decoder. Among them, Convolutional Neural Networks (CNN) and Self Attention Networks (SAN) are the prominent ones. The two architectures achieve similar performances but use very different ways to encode and decode context: CNN uses convolutional layers to focus on the local connectivity of the sequence, while SAN uses self-attention layers to focus on global semantics. In this work we propose Double Path Networks for Sequence to Sequence learning (DPN-S2S), which leverage the advantages of both models by using double path information fusion. During the encoding step, we develop a double path architecture to maintain the information coming from different paths with convolutional layers and self-attention layers separately. To effectively use the encoded context, we develop a cross attention module with gating and use it to automatically pick up the information needed during the decoding step. By deeply integrating the two paths with cross attention, both types of information are combined and well exploited. Experiments show that our proposed method can significantly improve the performance of sequence to sequence learning over state-of-the-art systems. 
DOuble Sparsity Kernel (DOSK) 
Learning with Reproducing Kernel Hilbert Spaces (RKHS) has been widely used in many scientific disciplines. Because an RKHS can be very flexible, it is common to impose a regularization term in the optimization to prevent overfitting. Standard RKHS learning employs the squared norm penalty of the learning function. Despite its success, many challenges remain. In particular, one cannot directly use the squared norm penalty for variable selection or data extraction. Therefore, when there exist noise predictors, or the underlying function has a sparse representation in the dual space, the performance of standard RKHS learning can be suboptimal. In the literature, work has been proposed on how to perform variable selection in RKHS learning, and a data sparsity constraint was considered for data extraction. However, how to learn in an RKHS with both variable selection and data extraction simultaneously remains unclear. In this paper, we propose a unified RKHS learning method, namely, DOuble Sparsity Kernel (DOSK) learning, to overcome this challenge. An efficient algorithm is provided to solve the corresponding optimization problem. We prove that under certain conditions, our new method can asymptotically achieve variable selection consistency. Simulated and real data results demonstrate that DOSK is highly competitive among existing approaches for RKHS learning. 
Double Wedge Plot  A graphical display which indicates outliers and potential level shifts in time series. 
Douglas-Rachford Algorithm  The Douglas-Rachford algorithm is a very popular splitting technique for finding a zero of the sum of two maximally monotone operators. However, the behaviour of the algorithm remains mysterious in the general inconsistent case, i.e., when the sum problem has no zeros. More than a decade ago, however, it was shown that in the (possibly inconsistent) convex feasibility setting, the shadow sequence remains bounded and its weak cluster points solve a best approximation problem. In this paper, we advance the understanding of the inconsistent case significantly by providing a complete proof of the full weak convergence in the convex feasibility setting. In fact, a more general sufficient condition for the weak convergence in the general case is presented. Several examples illustrate the results. On the generalized Douglas-Rachford algorithm for feasibility problems 
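In the convex feasibility setting mentioned above, the iteration and its shadow sequence are easy to write down. The following is a minimal sketch for the consistent case in the plane, assuming two hypothetical sets chosen for illustration: A is the closed unit disc and B is the line y = 0.5. The governing sequence is updated by x <- x + P_B(2 P_A x - x) - P_A x, and the shadow point is P_A x.

```python
import math

def project_disc(p, radius=1.0):
    """Project p onto the closed disc of the given radius centred at the origin."""
    norm = math.hypot(p[0], p[1])
    if norm <= radius:
        return p
    return (p[0] * radius / norm, p[1] * radius / norm)

def project_line(p, height=0.5):
    """Project p onto the horizontal line y = height."""
    return (p[0], height)

def douglas_rachford(x, project_a, project_b, iterations=200):
    """Run the Douglas-Rachford iteration and return the shadow point P_A x."""
    for _ in range(iterations):
        pa = project_a(x)
        reflected = (2 * pa[0] - x[0], 2 * pa[1] - x[1])   # reflection through A
        pb = project_b(reflected)
        x = (x[0] + pb[0] - pa[0], x[1] + pb[1] - pa[1])
    return project_a(x)

shadow = douglas_rachford((3.0, 3.0), project_disc, project_line)
print(shadow)   # a point that lies in the disc and on the line
```

The example only covers a case where the two sets intersect; the inconsistent case studied in the entry would need sets with empty intersection, where the governing sequence itself does not converge.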
DPP-Net  Recent breakthroughs in Neural Architectural Search (NAS) have achieved state-of-the-art performances in applications such as image classification and language modeling. However, these techniques typically ignore device-related objectives such as inference time, memory usage, and power consumption. Optimizing neural architecture for device-related objectives is immensely crucial for deploying deep networks on portable devices with limited computing resources. We propose DPP-Net: Device-aware Progressive Search for Pareto-optimal Neural Architectures, optimizing for both device-related (e.g., inference time and memory usage) and device-agnostic (e.g., accuracy and model size) objectives. DPP-Net employs a compact search space inspired by current state-of-the-art mobile CNNs, and further improves search efficiency by adopting progressive search (Liu et al. 2017). Experimental results on CIFAR-10 are poised to demonstrate the effectiveness of Pareto-optimal networks found by DPP-Net, for three different devices: (1) a workstation with Titan X GPU, (2) NVIDIA Jetson TX1 embedded system, and (3) mobile phone with ARM Cortex-A53. Compared to CondenseNet and NASNet-A (Mobile), DPP-Net achieves better performances: higher accuracy and shorter inference time on various devices. Additional experimental results show that models found by DPP-Net achieve considerably good performance on ImageNet as well. 
Dpush  This paper presents a novel invention, called Dpush, that enables truly scalable, spam-resistant, uncensorable, automatically encrypted and inherently authenticated messaging, thus restoring our ability to exert our right to private communication, and thus a step forward in restoring an uncorrupted democracy. Using a novel combination of a distributed hash table (DHT) and a proof of work (POW), combined in a way that can only be called a synergy, the emergent property of a scalable and spam-resistant unsolicited messaging protocol elegantly emerges. Notable is that the receiver does not need to be online at the time the message is sent. This invention is already implemented and operating within the package that is called MORPHiS, which is a Sybil-resistant enhanced Kademlia DHT implementation combined with an already functioning implementation of Dpush, as well as a polished HTTP Dmail interface to send and receive such messages today. 
DRACO  Distributed model training is vulnerable to worst-case system failures and adversarial compute nodes, i.e., nodes that use malicious updates to corrupt the global model stored at a parameter server (PS). To tolerate node failures and adversarial attacks, recent work suggests using variants of the geometric median to aggregate distributed updates at the PS, in place of bulk averaging. Although median-based update rules are robust to adversarial nodes, their computational cost can be prohibitive in large-scale settings and their convergence guarantees often require relatively strong assumptions. In this work, we present DRACO, a scalable framework for robust distributed training that uses ideas from coding theory. In DRACO, each compute node evaluates redundant gradients that are then used by the parameter server to eliminate the effects of adversarial updates. We present problem-independent robustness guarantees for DRACO and show that the model it produces is identical to the one trained in the adversary-free setup. We provide extensive experiments on real datasets and distributed setups across a variety of large-scale models, where we show that DRACO is several times to orders of magnitude faster than median-based approaches. 
Dragon King Theory  Dragon king (DK) is a double metaphor for an event that is both extremely large in size or impact (a ‘king’) and born of unique origins (a ‘dragon’) relative to its peers (other events from the same system). DK events are generated by, or correspond to, mechanisms such as positive feedback, tipping points, bifurcations, and phase transitions, that tend to occur in nonlinear and complex systems, and serve to amplify DK events to extreme levels. By understanding and monitoring these dynamics, some predictability of such events may be obtained. The theory has been developed by Prof. Didier Sornette, who hypothesizes that many of the crises that we face are in fact DK rather than black swans, i.e., they may be predictable to some degree. Given the importance of crises to the long-term organization of a variety of systems, the DK theory urges that special attention be given to the study and monitoring of extremes, and that a dynamic view be taken. From a scientific viewpoint, such extremes are interesting because they may reveal underlying, often hidden, organizing principles. Practically speaking, one should ambitiously study extreme risks, but not forget that significant uncertainty will almost always be present, and should be rigorously considered in decisions regarding risk management and design. The theory of DK is related to concepts such as: black swan theory, outliers, complex systems, nonlinear dynamics, power laws, extreme value theory, prediction, extreme risks, risk management, etc. 
DREBot  This paper describes an architecture for controlling non-player characters (NPC) in the First Person Shooter (FPS) game Unreal Tournament 2004. Specifically, the DRE-Bot architecture is made up of three reinforcement learners, Danger, Replenish and Explore, which use the tabular Sarsa($\lambda$) algorithm. This algorithm enables the NPC to learn through trial and error, building up experience over time in an approach inspired by human learning. Experimentation is carried out to measure the performance of DRE-Bot when competing against fixed strategy bots that ship with the game. The discount parameter, $\gamma$, and the trace parameter, $\lambda$, are also varied to see if their values have an effect on the performance. 
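The roles of the discount parameter and the trace parameter are easiest to see in the update rule itself. The following is a minimal sketch of one tabular Sarsa(λ) step with accumulating eligibility traces; the entry does not say which trace variant is used, so that choice is an assumption, as are the function name, the learning rate `alpha` and the state and action labels.

```python
from collections import defaultdict

def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next, alpha, gamma, lam):
    """Apply one tabular Sarsa(lambda) update in place and return the TD error."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]   # TD error
    E[(s, a)] += 1.0                                      # accumulating trace
    for key in list(E):
        Q[key] += alpha * delta * E[key]   # credit all recently visited pairs
        E[key] *= gamma * lam              # decay every trace
    return delta

Q = defaultdict(float)   # action values, default 0
E = defaultdict(float)   # eligibility traces, default 0

# Two hypothetical transitions for a bot in a shooter-like setting.
sarsa_lambda_step(Q, E, "low_health", "retreat", 1.0, "safe", "explore",
                  alpha=0.5, gamma=0.9, lam=0.8)
sarsa_lambda_step(Q, E, "safe", "explore", 2.0, "safe", "explore",
                  alpha=0.5, gamma=0.9, lam=0.8)
print(dict(Q))
```

With λ = 0 the traces vanish after each step and the rule reduces to one-step Sarsa; larger λ lets a reward update earlier state-action pairs as well, which is visible in the second step above.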
Drift Analysis  Drift analysis is one of the major tools for analysing evolutionary algorithms and nature-inspired search heuristics. In this chapter we give an introduction to drift analysis and give some examples of how to use it for the analysis of evolutionary algorithms. 
Dropout  Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different ‘thinned’ networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets. 
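The relation between random dropping at training time and the single scaled network at test time can be sketched on one layer's activations. This is a minimal illustration, not the paper's code: the function names, the keep probability of 0.5 and the sample activations are assumptions, and scaling the activations by the keep probability is used here as a stand-in for shrinking the outgoing weights, which has the same effect on a following linear layer.

```python
import random

def dropout_train(activations, keep_prob, rng):
    """Training time: keep each unit with probability keep_prob, else zero it."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]

def dropout_test(activations, keep_prob):
    """Test time: keep all units and scale them down by keep_prob."""
    return [a * keep_prob for a in activations]

rng = random.Random(0)
activations = [1.0, 2.0, -3.0]   # hypothetical hidden-unit outputs
keep_prob = 0.5

# Average many randomly thinned versions of the layer.
samples = 20000
totals = [0.0] * len(activations)
for _ in range(samples):
    thinned = dropout_train(activations, keep_prob, rng)
    totals = [t + v for t, v in zip(totals, thinned)]
average = [t / samples for t in totals]

print(average)                               # close to the scaled outputs
print(dropout_test(activations, keep_prob))  # [0.5, 1.0, -1.5]
```

For a single linear layer the average over thinned networks matches the scaled network in expectation; for deep nonlinear networks the scaled network is only an approximation of that average, as the entry notes.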
DropoutDAgger  While imitation learning is becoming common practice in robotics, this approach often suffers from data mismatch and compounding errors. DAgger is an iterative algorithm that addresses these issues by continually aggregating training data from both the expert and novice policies, but does not consider the impact of safety. We present a probabilistic extension to DAgger, which uses the distribution over actions provided by the novice policy, for a given observation. Our method, which we call DropoutDAgger, uses dropout to train the novice as a Bayesian neural network that provides insight to its confidence. Using the distribution over the novice’s actions, we estimate a probabilistic measure of safety with respect to the expert action, tuned to balance exploration and exploitation. The utility of this approach is evaluated on the MuJoCo HalfCheetah and in a simple driving experiment, demonstrating improved performance and safety compared to other DAgger variants and classic imitation learning. 
Dropping Network  In natural language understanding, many challenges require learning relationships between two sequences for various tasks such as similarity, relatedness, paraphrasing and question matching. Some of these challenges are inherently closer in nature, hence the knowledge acquired from one task is more easily transferred and adapted to another. However, transferring all knowledge might be undesired and can lead to suboptimal results due to negative transfer. Hence, this paper focuses on the transferability of both instances and parameters across natural language understanding tasks using an ensemble-based transfer learning method to circumvent such issues. The primary contribution of this paper is the combination of both Dropout and Bagging for improved transferability in neural networks, referred to as Dropping herein. Secondly, we present a straightforward yet novel approach to incorporating source Dropping Networks to a target task for few-shot learning that mitigates negative transfer. This is achieved by using a decaying parameter chosen according to the slope changes of a smoothed spline error curve at sub-intervals during training. We compare the approach against hard parameter sharing, soft parameter sharing and single-task learning to assess its effectiveness. The aforementioned adjustment leads to improved transfer learning performance and comparable results to the current state of the art using only a few instances from the target task. 
DSDP  The DSDP software is a free open source implementation of an interiorpoint method for semidefinite programming. It provides primal and dual solutions, exploits lowrank structure and sparsity in the data, and has relatively low memory requirements for an interiorpoint method. It allows feasible and infeasible starting points and provides approximate certificates of infeasibility when no feasible solution exists. The dualscaling algorithm implemented in this package has a convergence proof and worstcase polynomial complexity under mild assumptions on the data. The software can be used as a set of subroutines, through Matlab, or by reading and writing to data files. Furthermore, the solver offers scalable parallel performance for large problems and a well documented interface. Some of the most popular applications of semidefinite programming and linear matrix inequalities (LMI) are model control, truss topology design, and semidefinite relaxations of combinatorial and global optimization problems. Rdsdp 
DSLATS  Through the last decade, we have witnessed a surge of Internet of Things (IoT) devices, and with that a greater need to choreograph their actions across both time and space. Although these two problems, namely time synchronization and localization, share many aspects in common, they are traditionally treated separately or combined in centralized approaches that result in an inefficient use of resources, or in solutions that are not scalable in terms of the number of IoT devices. Therefore, we propose DSLATS, a framework comprising three different and independent algorithms to jointly solve time synchronization and localization problems in a distributed fashion. The first two algorithms are based mainly on the distributed Extended Kalman Filter (EKF) whereas the third one uses optimization techniques. No fusion center is required, and the devices only communicate with their neighbors. The proposed methods are evaluated on a custom Ultra-Wideband communication testbed and a quadrotor, representing a network of both static and mobile nodes. Our algorithms achieve up to three microseconds time synchronization accuracy and 30 cm localization error. 
Dual Asymmetric Deep Hashing Learning  Due to the impressive learning power, deep learning has achieved a remarkable performance in supervised hash function learning. In this paper, we propose a novel asymmetric supervised deep hashing method to preserve the semantic structure among different categories and generate the binary codes simultaneously. Specifically, two asymmetric deep networks are constructed to reveal the similarity between each pair of images according to their semantic labels. The deep hash functions are then learned through two networks by minimizing the gap between the learned features and discrete codes. Furthermore, since the binary codes in the Hamming space also should keep the semantic affinity existing in the original space, another asymmetric pairwise loss is introduced to capture the similarity between the binary codes and realvalue features. This asymmetric loss not only improves the retrieval performance, but also contributes to a quick convergence at the training phase. By taking advantage of the twostream deep structures and two types of asymmetric pairwise functions, an alternating algorithm is designed to optimize the deep features and highquality binary codes efficiently. Experimental results on three realworld datasets substantiate the effectiveness and superiority of our approach as compared with stateoftheart. 
Dual Discriminator Generative Adversarial net (D2GAN) 
We propose in this paper a novel approach to tackle the problem of mode collapse encountered in generative adversarial networks (GANs). Our idea is intuitive but proven to be very effective, especially in addressing some key limitations of GAN. In essence, it combines the Kullback-Leibler (KL) and reverse KL divergences into a unified objective function, thus exploiting the complementary statistical properties of these divergences to effectively diversify the estimated density in capturing multi-modes. We term our method dual discriminator generative adversarial nets (D2GAN) which, unlike GAN, has two discriminators; together with a generator, it also has the analogy of a minimax game, wherein one discriminator rewards high scores for samples from the data distribution whilst the other discriminator, conversely, favors data from the generator, and the generator produces data to fool both discriminators. We develop theoretical analysis to show that, given the maximal discriminators, optimizing the generator of D2GAN reduces to minimizing both KL and reverse KL divergences between the data distribution and the distribution induced from the data generated by the generator, hence effectively avoiding the mode collapsing problem. We conduct extensive experiments on synthetic and real-world large-scale datasets (MNIST, CIFAR-10, STL-10, ImageNet), where we have made our best effort to compare our D2GAN with the latest state-of-the-art GAN variants in comprehensive qualitative and quantitative evaluations. The experimental results demonstrate the competitive and superior performance of our approach in generating good quality and diverse samples over baselines, and the capability of our method to scale up to the ImageNet database. 
Dual Lasso Selector  We consider the problem of model selection and estimation in sparse high dimensional linear regression models with strongly correlated variables. First, we study the theoretical properties of the dual Lasso solution, and we show that joint consideration of the Lasso primal and its dual solutions are useful for selecting correlated active variables. Second, we argue that correlations among active predictors are not problematic, and we derive a new weaker condition on the design matrix, called Pseudo Irrepresentable Condition (PIC). Third, we present a new variable selection procedure, Dual Lasso Selector, and we prove that the PIC is a necessary and sufficient condition for consistent variable selection for the proposed method. Finally, by combining the dual Lasso selector further with the Ridge estimation even better prediction performance is achieved. We call the combination (DLSelect+Ridge), it can be viewed as a new combined approach for inference in highdimensional regression models with correlated variables. We illustrate DLSelect+Ridge method and compare it with popular existing methods in terms of variable selection, prediction accuracy, estimation accuracy and computation speed by considering various simulated and real data examples. 
Dual Learning for Machine Translation (dual-NMT) 
While neural machine translation (NMT) has made good progress in the past two years, tens of millions of bilingual sentence pairs are needed for its training. However, human labeling is very costly. To tackle this training data bottleneck, we develop a dual-learning mechanism, which can enable an NMT system to automatically learn from unlabeled data through a dual-learning game. This mechanism is inspired by the following observation: any machine translation task has a dual task, e.g., English-to-French translation (primal) versus French-to-English translation (dual); the primal and dual tasks can form a closed loop, and generate informative feedback signals to train the translation models, even without the involvement of a human labeler. In the dual-learning mechanism, we use one agent to represent the model for the primal task and the other agent to represent the model for the dual task, then ask them to teach each other through a reinforcement learning process. Based on the feedback signals generated during this process (e.g., the language-model likelihood of the output of a model, and the reconstruction error of the original sentence after the primal and dual translations), we can iteratively update the two models until convergence (e.g., using the policy gradient methods). We call the corresponding approach to neural machine translation dual-NMT. Experiments show that dual-NMT works very well on English↔French translation; especially, by learning from monolingual data (with 10% bilingual data for warm start), it achieves a comparable accuracy to NMT trained from the full bilingual data for the French-to-English translation task. 
Dual Path Network (DPN) 
In this work, we present a simple, highly efficient and modularized Dual Path Network (DPN) for image classification which presents a new topology of connection paths internally. By revealing the equivalence of the state-of-the-art Residual Network (ResNet) and Densely Convolutional Network (DenseNet) within the HORNN framework, we find that ResNet enables feature re-usage while DenseNet enables new feature exploration, both of which are important for learning good representations. To enjoy the benefits from both path topologies, our proposed Dual Path Network shares common features while maintaining the flexibility to explore new features through dual path architectures. Extensive experiments on three benchmark datasets, ImageNet-1k, Places365 and PASCAL VOC, clearly demonstrate superior performance of the proposed DPN over the state of the art. In particular, on the ImageNet-1k dataset, a shallow DPN surpasses the best ResNeXt-101 (64x4d) with 26% smaller model size, 25% less computational cost and 8% lower memory consumption, and a deeper DPN (DPN-131) further pushes the state-of-the-art single model performance with more than 3 times faster training speed. Experiments on the Places365 large-scale scene dataset, PASCAL VOC detection dataset, and PASCAL VOC segmentation dataset also demonstrate its consistently better performance than DenseNet, ResNet and the latest ResNeXt model over various applications. 
Dual Principal Component Pursuit (DPCP) 
We extend the theoretical analysis of a recently proposed single subspace learning algorithm, called Dual Principal Component Pursuit (DPCP), to the case where the data are drawn from a union of hyperplanes. To gain insight into the properties of the $\ell_1$ non-convex problem associated with DPCP, we develop a geometric analysis of a closely related continuous optimization problem. Transferring this analysis to the discrete problem, our results state that as long as the hyperplanes are sufficiently separated, the dominant hyperplane is sufficiently dominant and the points are uniformly distributed inside the associated hyperplanes, the non-convex DPCP problem has a unique global solution, equal to the normal vector of the dominant hyperplane. This suggests the correctness of a sequential hyperplane learning algorithm based on DPCP. A thorough experimental evaluation reveals that hyperplane learning schemes based on DPCP dramatically improve over the state-of-the-art methods for the case of synthetic data, while remaining competitive with the state of the art in the case of 3D plane clustering for Kinect data. 
Dual Rectified Linear Units (DReLU) 
Rectified Linear Units (ReLUs) are widely used in feedforward neural networks, and in convolutional neural networks in particular. However, they can be rarely found in recurrent neural networks due to the unboundedness and the positive image of the rectified linear activation function. In this paper, we introduce Dual Rectified Linear Units (DReLUs), a novel type of rectified unit that comes with a positive and negative image that is unbounded. We show that we can successfully replace the tanh activation function in the recurrent step of quasi recurrent neural networks. In addition, DReLUs are less prone to the vanishing gradient problem, they are noise robust, and they induce sparse activations. Therefore, we are able to stack up to eight quasi recurrent layers, making it possible to improve the current stateoftheart in characterlevel language modeling over architectures based on shallow Long ShortTerm Memory (LSTM). 
Dual Supervised Learning  Many supervised learning tasks emerge in dual forms, e.g., English-to-French translation vs. French-to-English translation, speech recognition vs. text to speech, and image classification vs. image generation. Two dual tasks have intrinsic connections with each other due to the probabilistic correlation between their models. This connection is, however, not effectively utilized today, since people usually train the models of two dual tasks separately and independently. In this work, we propose training the models of two dual tasks simultaneously, and explicitly exploiting the probabilistic correlation between them to regularize the training process. For ease of reference, we call the proposed approach dual supervised learning. We demonstrate that dual supervised learning can improve the practical performance of both tasks, for various applications including machine translation, image processing, and sentiment analysis. 
DualPrimal Graph CNN  In recent years, there has been a surge of interest in developing deep learning methods for nonEuclidean structured data such as graphs. In this paper, we propose DualPrimal Graph CNN, a graph convolutional architecture that alternates convolutionlike operations on the graph and its dual. Our approach allows to learn both vertex and edge features and generalizes the previous graph attention (GAT) model. We provide extensive experimental validation showing stateoftheart results on a variety of tasks tested on established graph benchmarks, including CORA and Citeseer citation networks as well as MovieLens, Flixter, Douban and Yahoo Music graphguided recommender systems. 
DualState Recurrent Network (DSRN) 
Advances in image superresolution (SR) have recently benefited significantly from rapid developments in deep neural networks. Inspired by these recent discoveries, we note that many stateoftheart deep SR architectures can be reformulated as a singlestate recurrent neural network (RNN) with finite unfoldings. In this paper, we explore new structures for SR based on this compact RNN view, leading us to a dualstate design, the DualState Recurrent Network (DSRN). Compared to its single state counterparts that operate at a fixed spatial resolution, DSRN exploits both lowresolution (LR) and highresolution (HR) signals jointly. Recurrent signals are exchanged between these states in both directions (both LR to HR and HR to LR) via delayed feedback. Extensive quantitative and qualitative evaluations on benchmark datasets and on a recent challenge demonstrate that the proposed DSRN performs favorably against stateoftheart algorithms in terms of both memory consumption and predictive accuracy. 
DunningKruger Effect  The DunningKruger effect is a cognitive bias wherein unskilled individuals suffer from illusory superiority, mistakenly rating their ability much higher than is accurate. This bias is attributed to a metacognitive inability of the unskilled to recognize their ineptitude. Conversely, highly skilled individuals tend to underestimate their relative competence, erroneously assuming that tasks which are easy for them are also easy for others. 
Duration Analysis  ➘ “Survival Analysis” Duration Analysis and its Applications (Finance) spduration 
Duration and Interval Hidden Markov Model (DIHMM) 
Analysis of sequential event data has been recognized as one of the essential tools in data modeling and analysis field. In this paper, after the examination of its technical requirements and issues to model complex but practical situation, we propose a new sequential data model, dubbed Duration and Interval Hidden Markov Model (DIHMM), that efficiently represents ‘state duration’ and ‘state interval’ of data events. This has significant implications to play an important role in representing practical timeseries sequential data. This eventually provides an efficient and flexible sequential data retrieval. Numerical experiments on synthetic and real data demonstrate the efficiency and accuracy of the proposed DIHMM. 
Dwarf  ➚ “BigDataBench” 
Dyadic Data  Dyadic data refers to a domain with two finite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This type of data arises naturally in many applications ranging from computational linguistics and information retrieval to preference analysis and computer vision. In this paper, we present a systematic, domain-independent framework of learning from dyadic data by statistical mixture models. Our approach covers different models with flat and hierarchical latent class structures. We propose an annealed version of the standard EM algorithm for model fitting which is empirically evaluated on a variety of data sets from different domains. http://…/gonzalezgriffin2012dyadicch.pdf dmm 
Dyadic Network Analysis  dyads 
Dygraphs  dygraphs is a fast, flexible open source JavaScript charting library. It allows users to explore and interpret dense data sets. 
DYNAMIC  In this paper we present DYNAMIC, an opensource C++ library implementing dynamic compressed data structures for string manipulation. Our framework includes useful tools such as searchable partial sums, succinct/gapencoded bitvectors, and entropy/runlength compressed strings and FMindexes. We prove closetooptimal theoretical bounds for the resources used by our structures, and show that our theoretical predictions are empirically tightly verified in practice. To conclude, we turn our attention to applications. We compare the performance of four recentlypublished compression algorithms implemented using DYNAMIC with those of stateoftheart tools performing the same task. Our experiments show that algorithms making use of dynamic compressed data structures can be up to three orders of magnitude more spaceefficient (albeit slower) than classical ones performing the same tasks. 
Dynamic Adaptive Network Intelligence (DANI) 
Accurate representational learning of both the explicit and implicit relationships within data is critical to the ability of machines to perform more complex and abstract reasoning tasks. We describe the efficient weakly supervised learning of such inferences by our Dynamic Adaptive Network Intelligence (DANI) model. We report stateoftheart results for DANI over question answering tasks in the bAbI dataset that have proved difficult for contemporary approaches to learning representation (Weston et al., 2015). 
Dynamic AuthorPersona Topic Model (DAP) 
Topic modeling enables exploration and compact representation of a corpus. The CaringBridge (CB) dataset is a massive collection of journals written by patients and caregivers during a health crisis. Topic modeling on the CB dataset, however, is challenging due to the asynchronous nature of multiple authors writing about their health journeys. To overcome this challenge we introduce the Dynamic AuthorPersona topic model (DAP), a probabilistic graphical model designed for temporal corpora with multiple authors. The novelty of the DAP model lies in its representation of authors by a persona — where personas capture the propensity to write about certain topics over time. Further, we present a regularized variational inference algorithm, which we use to encourage the DAP model’s personas to be distinct. Our results show significant improvements over competing topic models — particularly after regularization, and highlight the DAP model’s unique ability to capture common journeys shared by different authors. 
Dynamic Bayesian Network (DBN) 
A Dynamic Bayesian Network (DBN) is a Bayesian Network which relates variables to each other over adjacent time steps. This is often called a Two-Timeslice BN (2TBN) because it says that at any point in time T, the value of a variable can be calculated from the internal regressors and the immediate prior value (time T-1). DBNs are common in robotics, and have shown potential for a wide range of data mining applications. For example, they have been used in speech recognition, digital forensics, protein sequencing, and bioinformatics. DBN is a generalization of hidden Markov models and Kalman filters. https://…/thesis.pdf http://…/0000006a.pdf 
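The two-timeslice idea can be sketched with the simplest possible DBN: one hidden variable per time step with a single observed child, which is exactly a hidden Markov model. The rain/umbrella variables and every probability below are illustrative assumptions, not part of the entry. Each update first predicts the state at time T from the belief at T-1, then conditions on the evidence at T.

```python
def filter_step(belief, transition, emission, observation):
    """One two-timeslice update: predict from T-1 to T, then condition on evidence."""
    states = list(belief)
    # Prediction: P(X_T) = sum over previous states of P(X_T | X_{T-1}) P(X_{T-1}).
    predicted = {
        s: sum(belief[prev] * transition[prev][s] for prev in states)
        for s in states
    }
    # Correction: weight by the likelihood of the observation, then normalize.
    unnormalized = {s: predicted[s] * emission[s][observation] for s in states}
    total = sum(unnormalized.values())
    return {s: v / total for s, v in unnormalized.items()}

# Hypothetical model: hidden weather state, observed umbrella use.
transition = {
    "rain": {"rain": 0.7, "dry": 0.3},
    "dry": {"rain": 0.3, "dry": 0.7},
}
emission = {
    "rain": {"umbrella": 0.9, "none": 0.1},
    "dry": {"umbrella": 0.2, "none": 0.8},
}

belief = {"rain": 0.5, "dry": 0.5}  # prior at time 0
for obs in ["umbrella", "umbrella"]:
    belief = filter_step(belief, transition, emission, obs)

print({s: round(p, 3) for s, p in belief.items()})
```

Richer DBNs replace the single hidden variable with several interacting variables per slice, but the predict-then-condition structure stays the same.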
Dynamic Capacity Network (DCN) 
We introduce the Dynamic Capacity Network (DCN), a neural network that can adaptively assign its capacity across different portions of the input data. This is achieved by combining modules of two types: lowcapacity subnetworks and highcapacity subnetworks. The lowcapacity subnetworks are applied across most of the input, but also provide a guide to select a few portions of the input on which to apply the highcapacity subnetworks. The selection is made using a novel gradientbased attention mechanism, that efficiently identifies the modules and input features that are most likely to impact the DCN’s output and to which we’d like to devote more capacity. We focus our empirical evaluation on the cluttered MNIST and SVHN image datasets. Our findings indicate that DCNs are able to drastically reduce the number of computations, compared to traditional convolutional neural networks, while maintaining similar performance. 
Dynamic Clustering (DC) 

Dynamic Continuous Indexing (DCI) 

Dynamic Correlation Analysis (DCA) 
In highthroughput data, dynamic correlation between genes, i.e. changing correlation patterns under different biological conditions, can reveal important regulatory mechanisms. Given the complex nature of dynamic correlation, and the underlying conditions for dynamic correlation may not manifest into clinical observations, it is difficult to recover such signal from the data. Current methods seek underlying conditions for dynamic correlation by using certain observed genes as surrogates, which may not faithfully represent true latent conditions. In this study we develop a new method that directly identifies strong latent signals that regulate the dynamic correlation of many pairs of genes, named DCA: Dynamic Correlation Analysis. At the center of the method is a new metric for the identification of gene pairs that are highly likely to be dynamically correlated, without knowing the underlying conditions of the dynamic correlation. We validate the performance of the method with extensive simulations. In real data analysis, the method reveals novel latent factors with clear biological meaning, bringing new insights into the data. 
Dynamic Decision Network (DDN) 
A fully observable dynamic decision network consists of: · a set of state features, each with a domain; · a set of possible actions forming a decision node A, with domain the set of actions; · a two-stage belief network with an action node A, nodes F0 and F1 for each feature F (for the features at time 0 and time 1, respectively), and a conditional probability P(F1 | parents(F1)) such that the parents of F1 can include A and features at times 0 and 1 as long as the resulting network is acyclic; and · a reward function that can be a function of the action and any of the features at times 0 or 1. 
Dynamic Deep Neural Networks (D2NN) 
We introduce Dynamic Deep Neural Networks (D2NN), a new type of feedforward deep neural network that allow selective execution. Given an input, only a subset of D2NN neurons are executed, and the particular subset is determined by the D2NN itself. By pruning unnecessary computation depending on input, D2NNs provide a way to improve computational efficiency. To achieve dynamic selective execution, a D2NN augments a regular feedforward deep neural network (directed acyclic graph of differentiable modules) with one or more controller modules. Each controller module is a subnetwork whose output is a decision that controls whether other modules can execute. A D2NN is trained end to end. Both regular modules and controller modules in a D2NN are learnable and are jointly trained to optimize both accuracy and efficiency. Such training is achieved by integrating backpropagation with reinforcement learning. With extensive experiments of various D2NN architectures on image classification tasks, we demonstrate that D2NNs are general and flexible, and can effectively optimize accuracyefficiency tradeoffs. 
Dynamic Differentiable Reasoning (DDR) 
We present a novel Dynamic Differentiable Reasoning (DDR) framework for jointly learning branching programs and the functions composing them; this resolves a significant nondifferentiability inhibiting recent dynamic architectures. We apply our framework to two settings in two highly compact and data efficient architectures: DDRprog for CLEVR Visual Question Answering and DDRstack for reverse Polish notation expression evaluation. DDRprog uses a recurrent controller to jointly predict and execute modular neural programs that directly correspond to the underlying question logic; it explicitly forks subprocesses to handle logical branching. By effectively leveraging additional structural supervision, we achieve a large improvement over previous approaches in subtask consistency and a small improvement in overall accuracy. We further demonstrate the benefits of structural supervision in the RPN setting: the inclusion of a stack assumption in DDRstack allows our approach to generalize to long expressions where an LSTM fails the task. 
Dynamic Emulation Algorithm (DEA) 
We consider solution of stochastic storage problems through regression Monte Carlo (RMC) methods. Taking a statistical learning perspective, we develop the dynamic emulation algorithm (DEA) that unifies the different existing approaches in a single modular template. We then investigate the two central aspects of regression architecture and experimental design that constitute DEA. For the regression piece, we discuss various nonparametric approaches, in particular introducing the use of Gaussian process regression in the context of stochastic storage. For simulation design, we compare the performance of traditional design (grid discretization), against spacefilling, and several adaptive alternatives. The overall DEA template is illustrated with multiple examples drawing from natural gas storage valuation and optimal control of backup generator in a microgrid. 
Dynamic Filter Network (DFN) 
In a traditional convolutional layer, the learned filters stay fixed after training. In contrast, we introduce a new framework, the Dynamic Filter Network, where filters are generated dynamically conditioned on an input. We show that this architecture is a powerful one, with increased flexibility thanks to its adaptive nature, yet without an excessive increase in the number of model parameters. A wide variety of filtering operations can be learned this way, including local spatial transformations, but also others like selective (de)blurring or adaptive feature extraction. Moreover, multiple such layers can be combined, e.g. in a recurrent architecture. We demonstrate the effectiveness of the dynamic filter network on the tasks of video and stereo prediction, and reach stateoftheart performance on the moving MNIST dataset with a much smaller model. By visualizing the learned filters, we illustrate that the network has picked up flow information by only looking at unlabelled training data. This suggests that the network can be used to pretrain networks for various supervised tasks in an unsupervised way, like optical flow and depth estimation. 
Dynamic Graph Convolutional Networks  Many different classification tasks need to manage structured data, which are usually modeled as graphs. Moreover, these graphs can be dynamic, meaning that the vertices/edges of each graph may change over time. Our goal is to jointly exploit structured data and temporal information through the use of a neural network model. To the best of our knowledge, this task has not been addressed using this kind of architecture. For this reason, we propose two novel approaches, which combine Long Short-Term Memory networks and Graph Convolutional Networks to learn long short-term dependencies together with graph structure. The quality of our methods is confirmed by the promising results achieved. 
Dynamic Graphical Models  Dynamic graphical models for multivariate time series data to estimate directed dynamic networks in functional magnetic resonance imaging (fMRI), see Schwab et al. (2017) <doi:10.1101/198887>. DGM 
Dynamic Linear Model (DLM) 
Dynamic Linear Models (DLMs) or State Space Models define a very general class of non-stationary time series models. DLMs may include terms to model trends, seasonality, covariates and autoregressive components. Other time series models like ARMA models are particular DLMs. The main goals are short-term forecasting, intervention analysis and monitoring. dlm 
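As a rough illustration of short-term forecasting with a DLM, the sketch below filters the simplest member of the class, the local level model, where the observation is y_t = mu_t + v_t and the level follows mu_t = mu_(t-1) + w_t. The function name, the variances and the data are assumptions made for this example; it is not the interface of the dlm package.

```python
def local_level_filter(ys, m0, c0, obs_var, state_var):
    """Kalman filter for the local level DLM.

    Returns the filtered level means and the final filtered variance.
    The filtered mean is also the one-step-ahead forecast of the next observation.
    """
    m, c = m0, c0
    means = []
    for y in ys:
        r = c + state_var      # prior variance of the level at time t
        q = r + obs_var        # one-step forecast variance of y_t
        k = r / q              # Kalman gain
        m = m + k * (y - m)    # correct the level with the forecast error
        c = (1.0 - k) * r      # posterior variance of the level
        means.append(m)
    return means, c

# Illustrative data: a constant signal observed 50 times, with a vague prior.
ys = [5.0] * 50
means, final_var = local_level_filter(ys, m0=0.0, c0=1e6, obs_var=1.0, state_var=0.1)

print(f"filtered level after 50 steps: {means[-1]:.4f}")
print(f"steady-state filtered variance: {final_var:.4f}")
```

Trend, seasonal or regression components extend this by making the state a vector, at which point the scalar updates become matrix operations.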
Dynamic Panel Threshold Model  The dynamic panel threshold model suggested by Stephanie Kremer, Alexander Bick and Dieter Nautz (2013) <doi:10.1007/s0018101205539> extends the original static panel threshold estimation of Hansen (1999) <doi: 10.1016/S03044076(99)000251> and the cross-sectional instrumental variable threshold model of Caner and Hansen (2004) <doi:10.1017/S0266466604205011>, using generalized method of moments type estimators. dtp 
Dynamic Partition Models  We present a new approach for learning compact and intuitive distributed representations with binary encoding. Rather than summing up expert votes as in products of experts, we employ for each variable the opinion of the most reliable expert. Data points are hence explained through a partitioning of the variables into expert supports. The partitions are dynamically adapted based on which experts are active. During the learning phase we adopt a smoothed version of this model that uses separate mixtures for each data dimension. In our experiments we achieve accurate reconstructions of highdimensional data points with at most a dozen experts. 
Dynamic Principal Components  
Dynamic Programming  In mathematics, computer science, economics, and bioinformatics, dynamic programming is a method for solving a complex problem by breaking it down into a collection of simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems and optimal substructure. When applicable, the method takes far less time than naive methods that don’t take advantage of the subproblem overlap (like depth-first search). In order to solve a given problem using a dynamic programming approach, we need to solve different parts of the problem (subproblems), then combine the solutions of the subproblems to reach an overall solution. Often when using a more naive method, many of the subproblems are generated and solved many times. The dynamic programming approach seeks to solve each subproblem only once, thus reducing the number of computations: once the solution to a given subproblem has been computed, it is stored or “memoized”; the next time the same solution is needed, it is simply looked up. This approach is especially useful when the number of repeating subproblems grows exponentially as a function of the size of the input. Dynamic programming algorithms are used for optimization (for example, finding the shortest path between two points, or the fastest way to multiply many matrices). A dynamic programming algorithm will examine the previously solved subproblems and will combine their solutions to give the best solution for the given problem. There are many alternatives, such as using a greedy algorithm, which picks the locally optimal choice at each branch in the road. The locally optimal choice may be a poor choice for the overall solution. While a greedy algorithm does not guarantee an optimal solution, it is often faster to calculate. Fortunately, some greedy algorithms (such as those for minimum spanning trees) are proven to lead to the optimal solution. 
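The effect of memoization can be seen in a small sketch. Fibonacci numbers are used here only as a convenient stand-in for a problem with overlapping subproblems; the function names and the call counters are illustrative. The naive recursion re-solves the same subproblems many times, while the memoized version solves each one once and looks it up afterwards.

```python
from functools import lru_cache

calls = {"naive": 0, "memo": 0}  # how many times each function body runs

def fib_naive(n):
    """Plain recursion: the same subproblems are recomputed over and over."""
    calls["naive"] += 1
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    """Memoized recursion: each subproblem is solved once, then looked up."""
    calls["memo"] += 1
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

naive_result = fib_naive(20)
memo_result = fib_memo(20)

print(naive_result, memo_result)  # both 6765
print(calls)                      # {'naive': 21891, 'memo': 21}
```

The gap between the two counters grows exponentially with n, which is the situation the entry describes as the one where dynamic programming pays off most.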
Dynamic Regression in the Presence of Autocorrelated Residuals (DREGAR) 
DREGAR 
Dynamic Sampling Convolutional Neural Network (DSCNN) 
We present Dynamic Sampling Convolutional Neural Networks (DSCNN), where the position-specific kernels learn from not only the current position but also multiple sampled neighbour regions. During sampling, residual learning is introduced to ease training and an attention mechanism is applied to fuse features from different samples. The kernels are further factorized to reduce parameters. The multiple sampling strategy enlarges the effective receptive fields significantly without requiring more parameters. While DSCNNs inherit the advantages of DFN, namely avoiding feature map blurring by position-specific kernels while keeping translation invariance, they also efficiently alleviate the overfitting issue caused by having many more parameters than normal CNNs. Our model is efficient and can be trained end-to-end via standard backpropagation. We demonstrate the merits of our DSCNNs on both sparse and dense prediction tasks involving object detection and flow estimation. Our results show that DSCNNs enjoy stronger recognition abilities and achieve 81.7% on the VOC2012 detection dataset. Also, DSCNNs obtain much sharper responses in flow estimation on the FlyingChairs dataset compared to multiple FlowNet models’ baselines.
Dynamic Tensor Clustering  Dynamic tensor data are becoming prevalent in numerous applications. Existing tensor clustering methods either fail to account for the dynamic nature of the data, or are inapplicable to a general-order tensor. Also there is often a gap between statistical guarantee and computational efficiency for existing tensor clustering solutions. In this article, we aim to bridge this gap by proposing a new dynamic tensor clustering method, which takes into account both sparsity and fusion structures, and enjoys strong statistical guarantees as well as high computational efficiency. Our proposal is based upon a new structured tensor factorization that encourages both sparsity and smoothness in parameters along the specified tensor modes. Computationally, we develop a highly efficient optimization algorithm that benefits from substantial dimension reduction. In theory, we first establish a non-asymptotic error bound for the estimator from the structured tensor factorization. Built upon this error bound, we then derive the rate of convergence of the estimated cluster centers, and show that the estimated clusters recover the true cluster structures with a high probability. Moreover, our proposed method can be naturally extended to co-clustering of multiple modes of the tensor data. The efficacy of our approach is illustrated via simulations and a brain dynamic functional connectivity analysis from an Autism spectrum disorder study.
Dynamic Time Warping (DTW) 
In time series analysis, dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences which may vary in time or speed. For instance, similarities in walking patterns could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio, and graphics data – indeed, any data which can be turned into a linear sequence can be analyzed with DTW. A well-known application has been automatic speech recognition, to cope with different speaking speeds. Other applications include speaker recognition and online signature recognition. It can also be used in partial shape matching applications. In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restrictions. The sequences are ‘warped’ non-linearly in the time dimension to determine a measure of their similarity independent of certain non-linear variations in the time dimension. This sequence alignment method is often used in time series classification. Although DTW measures a distance-like quantity between two given sequences, it doesn’t guarantee that the triangle inequality holds. Dynamic programming algorithm optimization for spoken word recognition dtwclust, dtwSat, IncDTW
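The optimal match described above is commonly computed with a dynamic programming table. The following Python snippet is a minimal sketch of that textbook formulation, assuming one-dimensional numeric sequences, an absolute-difference local cost, and no window constraint. The function name `dtw_distance` is an illustrative choice and does not reflect the API of the dtwclust, dtwSat or IncDTW packages.

```python
def dtw_distance(a, b):
    """Accumulated cost of the best warping path between sequences a and b,
    using |x - y| as the local cost (an assumption for this sketch)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    # The same shape at a different speed aligns at zero cost.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0

    # A small case where the triangle inequality fails.
    x, y, z = [0], [1, 2], [2, 3, 3]
    print(dtw_distance(x, z))                       # 8.0
    print(dtw_distance(x, y) + dtw_distance(y, z))  # 6.0
```

The second example illustrates the caveat in the entry: going from `x` to `z` directly costs more than going through `y`, so this quantity is not a metric.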
Dynamic Treatment Regimens (DTR) 
In medical research, a dynamic treatment regime (DTR), adaptive intervention, or adaptive treatment strategy is a set of rules for choosing effective treatments for individual patients. Historically, medical research and the practice of medicine tended to rely on an acute care model for the treatment of all medical problems, including chronic illness. Treatment choices made for a particular patient under a dynamic regime are based on that individual’s characteristics and history, with the goal of optimizing his or her long-term clinical outcome. A dynamic treatment regime is analogous to a policy in the field of reinforcement learning, and analogous to a controller in control theory. While most work on dynamic treatment regimes has been done in the context of medicine, the same ideas apply to time-varying policies in other fields, such as education, marketing, and economics. Dynamic treatment regimens (DTRs) are sequential decision rules tailored at each stage by potentially time-varying patient features and intermediate outcomes observed in previous stages. There are three main types of methods, O-learning, Q-learning and P-learning, for learning the optimal dynamic treatment regimes with continuous variables. DTRlearn
Dynamic Variable Effort Deep Neural Networks (DyVEDeep) 
Deep Neural Networks (DNNs) have advanced the state-of-the-art in a variety of machine learning tasks and are deployed in increasing numbers of products and services. However, the computational requirements of training and evaluating large-scale DNNs are growing at a much faster pace than the capabilities of the underlying hardware platforms that they are executed upon. In this work, we propose Dynamic Variable Effort Deep Neural Networks (DyVEDeep) to reduce the computational requirements of DNNs during inference. Previous efforts propose specialized hardware implementations for DNNs, statically prune the network, or compress the weights. Complementary to these approaches, DyVEDeep is a dynamic approach that exploits the heterogeneity in the inputs to DNNs to improve their compute efficiency with comparable classification accuracy. DyVEDeep equips DNNs with dynamic effort mechanisms that, in the course of processing an input, identify how critical a group of computations is to classifying the input. DyVEDeep dynamically focuses its compute effort only on the critical computations, while skipping or approximating the rest. We propose 3 effort knobs that operate at different levels of granularity, viz. neuron, feature and layer levels. We build DyVEDeep versions for 5 popular image recognition benchmarks – one for CIFAR-10 and four for ImageNet (AlexNet, OverFeat, VGG-16 and weight-compressed AlexNet). Across all benchmarks, DyVEDeep achieves a 2.1x-2.6x reduction in the number of scalar operations, which translates to a 1.8x-2.3x performance improvement over a Caffe-based implementation, with < 0.5% loss in accuracy.
Dynamical Atoms Network (DYAN) 
The ability to anticipate the future is essential when making real-time critical decisions, provides valuable information to understand dynamic natural scenes, and can help unsupervised video representation learning. State-of-the-art video prediction is based on LSTM recursive networks and/or generative adversarial network learning. These are complex architectures that need to learn large numbers of parameters, are potentially hard to train, slow to run, and may produce blurry predictions. In this paper, we introduce DYAN, a novel network that has very few parameters and is easy to train, and which produces accurate, high-quality frame predictions significantly faster than previous approaches. DYAN owes its good qualities to its encoder and decoder, which are designed following concepts from systems identification theory and exploit the dynamics-based invariants of the data. Extensive experiments using several standard video datasets show that DYAN is superior at generating frames and that it generalizes well across domains.
Dynamically Expandable Network (DEN) 
We propose a novel deep network architecture for lifelong learning which we refer to as Dynamically Expandable Network (DEN), that can dynamically decide its network capacity as it trains on a sequence of tasks, to learn a compact overlapping knowledge sharing structure among tasks. DEN is efficiently trained in an online manner by performing selective retraining, dynamically expands network capacity upon arrival of each task with only the necessary number of units, and effectively prevents semantic drift by splitting/duplicating units and timestamping them. We validate DEN on multiple public datasets in lifelong learning scenarios, on which it not only significantly outperforms existing lifelong learning methods for deep networks, but also achieves the same level of performance as the batch model with substantially fewer parameters.
Dynamically Routed Network (SkipNet) 
Increasing depth and complexity in convolutional neural networks has enabled significant progress in visual perception tasks. However, incremental improvements in accuracy are often accompanied by exponentially deeper models that push the computational limits of modern hardware. These incremental improvements in accuracy imply that only a small fraction of the inputs require the additional model complexity. As a consequence, for any given image it is possible to bypass multiple stages of computation to reduce the cost of forward inference without affecting accuracy. We exploit this simple observation by learning to dynamically route computation through a convolutional network. We introduce dynamically routed networks (SkipNets) by adding gating layers that route images through existing convolutional networks and formulate the routing problem in the context of sequential decision making. We propose a hybrid learning algorithm which combines supervised learning and reinforcement learning to address the challenges of inherently non-differentiable routing decisions. We show SkipNet reduces computation by 30-90% while preserving the accuracy of the original model on four benchmark datasets. We compare SkipNet with SACT and ACT to show SkipNet achieves better accuracy with lower computation.
DyNet  We describe DyNet, a toolkit for implementing neural network models based on dynamic declaration of network structure. In the static declaration strategy that is used in toolkits like Theano, CNTK, and TensorFlow, the user first defines a computation graph (a symbolic representation of the computation), and then examples are fed into an engine that executes this computation and computes its derivatives. In DyNet’s dynamic declaration strategy, computation graph construction is mostly transparent, being implicitly constructed by executing procedural code that computes the network outputs, and the user is free to use different network structures for each input. Dynamic declaration thus facilitates the implementation of more complicated network architectures, and DyNet is specifically designed to allow users to implement their models in a way that is idiomatic in their preferred programming language (C++ or Python). One challenge with dynamic declaration is that because the symbolic computation graph is defined anew for every training example, its construction must have low overhead. To achieve this, DyNet has an optimized C++ backend and lightweight graph representation. Experiments show that DyNet’s speeds are faster than or comparable with static declaration toolkits, and significantly faster than Chainer, another dynamic declaration toolkit. DyNet is released open-source under the Apache 2.0 license and available at http://…/dynet.
DynMat  To survive in the dynamically evolving world, we accumulate knowledge and improve our skills based on experience. In the process, gaining new knowledge does not disrupt our vigilance to external stimuli. In other words, our learning process is ‘accumulative’ and ‘online’ without interruption. However, despite the recent success, artificial neural networks (ANNs) must be trained offline, and they suffer catastrophic interference between old and new learning, indicating that ANNs’ conventional learning algorithms may not be suitable for building intelligent agents comparable to our brain. In this study, we propose a novel neural network architecture (DynMat) consisting of dual learning systems, inspired by the complementary learning system (CLS) theory suggesting that the brain relies on short- and long-term learning systems to learn continuously. Our experiments show that 1) DynMat can learn a new class without catastrophic interference and 2) it does not strictly require offline training.