d3.compose  Compose complex, datadriven visualizations from reusable charts and components with d3. • Get started quickly with standard charts and components • Layout charts and components automatically • Powerful foundation for creating custom charts and components 
Daleel  In this paper we present Daleel, a multicriteria adaptive decision making framework that is developed to find the optimal IaaS deployment strategy. 
DamerauLevenshtein Distance  In information theory and computer science, the DamerauLevenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a string metric for measuring the edit distance between two sequences. Informally, the DamerauLevenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other. The DamerauLevenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations in addition to the three classical singlecharacter edit operations (insertions, deletions and substitutions). In his seminal paper, Damerau stated that these four operations correspond to more than 80% of all human misspellings. Damerau’s paper considered only misspellings that could be corrected with at most one edit operation. While the original motivation was to measure distance between human misspellings to improve applications such as spell checkers, DamerauLevenshtein distance has also seen uses in biology to measure the variation between protein sequences. 
Damped LeastSquares (DLS) 
In mathematics and computing, the LevenbergMarquardt algorithm (LMA), also known as the damped leastsquares (DLS) method, is used to solve nonlinear least squares problems. These minimization problems arise especially in least squares curve fitting. The LMA interpolates between the GaussNewton algorithm (GNA) and the method of gradient descent. The LMA is more robust than the GNA, which means that in many cases it finds a solution even if it starts very far off the final minimum. For wellbehaved functions and reasonable starting parameters, the LMA tends to be a bit slower than the GNA. LMA can also be viewed as GaussNewton using a trust region approach. The LMA is a very popular curvefitting algorithm used in many software applications for solving generic curvefitting problems. However, as for many fitting algorithms, the LMA finds only a local minimum, which is not necessarily the global minimum. onls 
DancingLines  Nowadays, events usually burst and are propagated online through multiple modern media like social networks and search engines. There exists various research discussing the event dissemination trends on individual medium, while few studies focus on event popularity analysis from a crossplatform perspective. Challenges come from the vast diversity of events and media, limited access to aligned datasets across different media and a great deal of noise in the datasets. In this paper, we design DancingLines, an innovative scheme that captures and quantitatively analyzes event popularity between pairwise text media. It contains two models: TFSW, a semanticaware popularity quantification model, based on an integrated weight coefficient leveraging Word2Vec and TextRank; and wDTWCD, a pairwise event popularity time series alignment model matching different event phases adapted from Dynamic Time Warping. We also propose three metrics to interpret event popularity trends between pairwise social platforms. Experimental results on eighteen realworld event datasets from an influential social network and a popular search engine validate the effectiveness and applicability of our scheme. DancingLines is demonstrated to possess broad application potentials for discovering the knowledge of various aspects related to events and different media. 
Dark Data  The total amount of data in every organization is far, far greater than anyone including, most crucially, their Information Technology group knows about. Moreover this “missing” data that can’t be seen and currently can’t be made use of is also the very stuff that holds the organization together. This is what we call “Dark Data.” 
Dark Knowledge  A simple way to improve classification performance is to average the predictions of a large ensemble of different classifiers. This is great for winning competitions but requires too much computation at test time for practical applications such as speech recognition. In a widely ignored paper in 2006, Caruana and his collaborators showed that the knowledge in the ensemble could be transferred to a single, efficient model by training the single model to mimic the log probabilities of the ensemble average. This technique works because most of the knowledge in the learned ensemble is in the relative probabilities of extremely improbable wrong answers. For example, the ensemble may give an image of a BMW a probability of one in a billion of being a garbage truck but this is still far greater (in the log domain) than its probability of being a carrot. This ‘dark knowledge’, which is practically invisible in the class probabilities, defines a similarity metric over the classes that makes it much easier to learn a good classifier. http://…/darkknowledgeneuralnetwork.html http://…/geoffhintonsdarkknowledge http://…/1503.02531v1.pdf 
DARPA Open Catalog  Welcome to the DARPA Open Catalog, which contains a curated list of DARPAsponsored software and peerreviewed publications. DARPA sponsors fundamental and applied research in a variety of areas that may lead to experimental results and reusable technology designed to benefit multiple government domains. The DARPA Open Catalog organizes publicly releasable material from DARPA programs. DARPA has an open strategy to help increase the impact of government investments. DARPA is interested in building communities around governmentfunded research. DARPA plans to continue to make available information generated by DARPA programs, including software, publications, data, and experimental results. The table on this page lists the programs currently participating in the catalog. 
dasksearchcv  This library provides implementations of ScikitLearn’s GridSearchCV and RandomizedSearchCV. They implement many (but not all) of the same parameters, and should be a dropin replacement for the subset that they do implement. For certain problems, these implementations can be more efficient than those in ScikitLearn, as they can avoid expensive repeated computations. 
Dat  Build data pipelines – Dat is an open source project that provides a streaming interface between every file format and data storage backend. 
Data Acceleration  Data technologies are evolving rapidly, but organizations have adopted most of these in piecemeal fashion. As a result, enterprise data – whether related to customer interactions, business performance, computer notifications, or external events in the business environment – is vastly underutilized. Moreover, companies’ data ecosystems have become complex and littered with data silos. This makes the data more difficult to access, which in turn limits the value that organizations can get out of it. Indeed, according to a recent Gartner, Inc. report, 85 percent of Fortune 500 organizations will be unable to exploit Big Data for competitive advantage through 2015. Furthermore, a recent Accenture study found that half of all companies have concerns about the accuracy of their data, and the majority of executives are unclear about the business outcomes they are getting from their data analytics programs. To unlock the value hidden in their data, companies must start treating data as a supply chain, enabling it to flow easily and usefully through the entire organization – and eventually throughout each company’s ecosystem of partners, including suppliers and customers. The time is right for this approach. For one thing, new external data sources are becoming available, providing fresh opportunities for data insights. In addition, the tools and technology required to build a better data platform are available and in use. These provide a foundation on which companies can construct an integrated, endtoend data supply chain. 
Data Acquisition  Data acquisition is the process of sampling signals that measure real world physical conditions and converting the resulting samples into digital numeric values that can be manipulated by a computer. Data acquisition systems (abbreviated with the acronym DAS or DAQ) typically convert analog waveforms into digital values for processing. 
Data Aggregation  In statistics, aggregate data describes data combined from several measurements. When data are aggregated, groups of observations are replaced with summary statistics based on those observations. In economics, aggregate data or data aggregates describes highlevel data that is composed from a multitude or combination of other more individual data. 
Data Analysis  Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains. 
Data Analytics  Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains. Data analytics (DA) is the science of examining raw data with the purpose of drawing conclusions about that information. Data analytics is used in many industries to allow companies and [organizations] to make better business decisions and in the sciences to verify or disprove existing models or theories. Definition 
Data Archaeology  Data archaeology refers to the art and science of recovering computer data encoded and/or encrypted in now obsolete media or formats. Data archaeology can also refer to recovering information from damaged electronic formats after natural or man made disasters. 
Data as a Service (DaaS) 
Data as a Service, or DaaS, is a cousin of software as a service. Like all members of the “as a Service” (aaS) family, DaaS is based on the concept that the product, data in this case, can be provided on demand to the user regardless of geographic or organizational separation of provider and consumer. Additionally, the emergence of serviceoriented architecture (SOA) has rendered the actual platform on which the data resides also irrelevant. This development has enabled the recent emergence of the relatively new concept of DaaS. Data provided as a service was at first primarily used in Web mashups, but now is being increasingly employed both commercially and, less commonly, within organisations such as the UN. Traditionally, most enterprises have used data stored in a selfcontained repository, for which software was specifically developed to access and present the data in a humanreadable form. One result of this paradigm is the bundling of both the data and the software needed to interpret it into a single package, sold as a consumer product. As the number of bundled software/data packages proliferated and required interaction among one another, another layer of interface was required. These interfaces, collectively known as enterprise application integration (EAI), often tended to encourage vendor lockin, as it is generally easy to integrate applications that are built upon the same foundation technology. The result of the combined software/data consumer package and required EAI middleware has been an increased amount of software for organizations to manage and maintain, simply for the use of particular data. In addition to routine maintenance costs, a cascading amount of software updates are required as the format of the data changes. The existence of this situation contributes to the attractiveness of DaaS to data consumers, because it allows for the separation of data cost and usage from that of a specific software or platform. 
Data Assimilation  Data assimilation is the process by which observations are incorporated into a computer model of a real system. Applications of data assimilation arise in many fields of geosciences, perhaps most importantly in weather forecasting and hydrology. Data assimilation proceeds by analysis cycles. In each analysis cycle, observations of the current (and possibly past) state of a system are combined with the results from a numerical model (the forecast) to produce an analysis, which is considered as ‘the best’ estimate of the current state of the system. This is called the analysis step. Essentially, the analysis step tries to balance the uncertainty in the data and in the forecast. The model is then advanced in time and its result becomes the forecast in the next analysis cycle. Book: Data Assimilation Book: Data Assimilation Data Assimilation Lecture Notes 
Data Augmentation  Data augmentation adds value to base data by adding information derived from internal and external sources within an enterprise. Data is one of the core assets for an enterprise, making data management essential. Data augmentation can be applied to any form of data, but may be especially useful for customer data, sales patterns, product sales, where additional information can help provide more indepth insight. Data augmentation can help reduce the manual interventation required to developed meaningful information and insight of business data, as well as significantly enhance data quality. Data augmentation is of the last steps done in enterprise data management after monitoring, profiling and integration Some of the common techniques used in data augmentation include: • Extrapolation Technique: Based on heuristics. The relevant fields are updated or provided with values. • Tagging Technique: Common records are tagged to a group, making it easier to understand and differentiate for the group. • Aggregation Technique: Using mathematical values of averages and means, values are estimated for relevant fields if needed • Probability Technique: Based on heuristics and analytical statistics, values are populated based on the probability of events. https://…/01jcgsart.pdf 
Data Blending  Data blending is the process of combining data from multiple sources to reveal deeper intelligence that drives better business decisionmaking. Data blending differs from data integration and data warehousing in that its primary use is not to create the single, unified version of the truth that is stored in systems of record. Rather, business and data analysts use data blending to build an analytic dataset to assist in answering a specific business questions and driving a particular business process. 
Data Broker / Information Broker  An information broker (independent information professional, information consultant, or data broker) collects information, often about individual people. The data are then sold to companies that use it to target advertising and marketing towards specific groups, to verify a person’s identity including for purposes of fraud detection, and to sell to individuals and organizations so they can research particular individuals. Critics, including consumer protection organizations, say the industry is secretive and unaccountable, and should be better regulated. 
Data Business Model  According to Wikipedia, a business model “describes the rationale of how an organization creates, delivers, and captures value.” A Data Business Model is a business model where data is an indispensable component. If you remove the data, the business fails (or at least suffers greatly). To take one example, Amazon’s data is core to their business. Their historical transaction data helps them figure out how much inventory to hold and how to price products. Additionally, data about product views and purchases powers the recommendation engine, which drives a large portion of sales. Furthermore, product reviews drive traffic and SEO. As icing on the cake, all of this is a virtuous cycle: recommendations drive purchases, which result in more reviews, which lead to better SEO and more traffic, which results in more visitors and better recommendations. If Amazon wasn’t so effective as using data, it would be a much smaller company. The best part of data business models is that they often have the same kind of positive feedback loop as Amazon. In each business model, the more you use data to make money, the more data you get as a result, which helps you make more money in the future. 
Data Communications  Data Communications concerns the transmission of digital messages to devices external to the message source. “External” devices are generally thought of as being independently powered circuitry that exists beyond the chassis of a computer or other digital message source. As a rule, the maximum permissible transmission rate of a message is directly proportional to signal power, and inversely proportional to channel noise. It is the aim of any communications system to provide the highest possible transmission rate at the lowest possible power and with the least possible noise. 
Data Cube Materialization  Data cube materialization is a classical database operator introduced in Gray et al.~(Data Mining and Knowledge Discovery, Vol.~1), which is critical for many analysis tasks. Nandi et al.~(Transactions on Knowledge and Data Engineering, Vol.~6) first studied cube materialization for large scale datasets using the MapReduce framework, and proposed a sophisticated modification of a simple broadcast algorithm to handle a dataset with a 216GB cube size within 25 minutes with 2k machines in 2012. 
Data Curation  Data curation is a term used to indicate management activities required to maintain research data longterm such that it is available for reuse and preservation. In science, data curation may indicate the process of extraction of important information from scientific texts, such as research articles by experts, to be converted into an electronic format, such as an entry of a biological database. The term is also used in the humanities, where increasing cultural and scholarly data from digital humanities projects requires the expertise and analytical practices of data curation. In broad terms, curation means a range of activities and processes done to create, manage, maintain, and validate a component. 
Data Decorations  Alberto Cairo left a comment about ‘data decorations’. This is a name he’s using to describe something like the windshieldwiper chart I discussed the other day. It seems like the visual elements were purely ornamental and adds nothing to the experience – one might argue that the experience was worse than just staring at the data table. 
Data Driven Business Model (DDBM) 
This paper contributes by providing a definition of a datadriven business model as a business model that relies on data as a key resource. 
Data Driven Documents (D3) 
D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a datadriven approach to DOM manipulation. Visualizing Data with D3.js D3 Tips and Tricks Awesome D3 
Data Envelopment Analysis (DEA) 
Data envelopment analysis (DEA) is a nonparametric method in operations research and economics for the estimation of production frontiers. It is used to empirically measure productive efficiency of decision making units (or DMUs). Although DEA has a strong link to production theory in economics, the tool is also used for benchmarking in operations management, where a set of measures is selected to benchmark the performance of manufacturing and service operations. Data Envelopment Analysis rDEA 
Data Federation  In most cases, if the term federation is used, it refers to combining autonomously operating objects. For example, states can be federated to form one country. If we apply this common explanation to data federation, it means combining autonomous data stores to form one large data store. Therefore, we propose the following definition ‘Data federation is a form of data virtualization where the data stored in a heterogeneous set of autonomous data stores is made accessible to data consumers as one integrated data store by using ondemand data integration.’ This definition is based on the following concepts: • Data virtualization: Data federation is a form of data virtualization. Note that not all forms of data virtualization imply data federation. For example, if an organization wants to virtualize the database of one application, no need exists for data federation. But data federation always results in data virtualization. • Heterogeneous set of data stores: Data federation should make it possible to bring data together from data stores using different storage structures, different access languages, and different APIs. An application using data federation should be able to access different types of database servers and files with various formats; it should be able to integrate data from all those data sources; it should offer features for transforming the data; and it should allow the applications and tools to access the data through various APIs and languages. • Autonomous data stores: Data stores accessed by data federation are able to operate independently; in other words, they can be used outside the scope of data federation. • One integrated data store: Regardless of how and where data is stored, it should be presented as one integrated data set. This implies that data federation involves transformation, cleansing, and possibly even enrichment of data. • Ondemand integration: This refers to when the data from a heterogeneous set of data stores is integrated. With data federation, integration takes place on the fly, and not in batch. When the data consumers ask for data, only then data is accessed and integrated. So the data is not stored in an integrated way, but remains in its original location and format. Spark Reaches for the Holy Grail: Federated Queries 
Data Fusion  Data fusion is the process of integration of multiple data and knowledge representing the same realworld object into a consistent, accurate, and useful representation. Data fusion processes are often categorized as low, intermediate or high, depending on the processing stage at which fusion takes place. Low level data fusion combines several sources of raw data to produce new raw data. The expectation is that fused data is more informative and synthetic than the original inputs. For example, sensor fusion is also known as (multisensor) data fusion and is a subset of information fusion. 
Data Hoarding 
http://…/Wikipedia:Avoid_datahoarding http://…hoardingpcavisualizationdecisions.html 
Data Impartment  
Data Journalism / Data Driven Journalism  Datadriven journalism, often shortened to “ddj”, is a term in use since 2009/2010, to describe a journalistic process based on analyzing and filtering large data sets for the purpose of creating a news story. Main drivers for this process are newly available resources such as “open source” software and “open data”. This approach to journalism builds on older practices, most notably on CAR (acronym for “computerassisted reporting”) a label used mainly in the US for decades. Other labels for partially similar approaches are “precision journalism”, based on a book by Philipp Meyer, published in 1972, where he advocated the use of techniques from social sciences in researching stories. 
Data Justice  As a handful of data platforms generate massive amounts of user data, the barriers to entry rise since potential competitors have little data themselves to entice advertisers compared to the incumbents who have both the concentrated processing power and supply of user data to dominate particular sectors. The upshot of this market power by big data platforms is that the marketplace is doing little to create options for consumers that might alleviate the misuse of consumer data or encourage big data platforms to better compensate users who are willing to share their data. Data Justice has been launched as a project to promote public education and new alliances to challenge the danger of big data to workers, consumers and the public. 
Data Lake  A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question. The term data lake is often associated with Hadooporiented object storage. In such a scenario, an organization’s data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop’s cluster nodes of commodity computers. Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried. 
Data Leakage  Data Leakage is the creation of unexpected additional information in the training data, allowing a model or machine learning algorithm to make unrealistically good predictions. Leakage is a pervasive challenge in applied machine learning, causing models to overrepresent their generalization error and often rendering them useless in the real world. It can caused by human or mechanical error, and can be intentional or unintentional in both cases. 
Data Lineage  Data lineage is generally defined as a kind of data life cycle that includes the data’s origins and where it moves over time. This term can also describe what happens to data as it goes through diverse processes. Data lineage can help with efforts to analyze how information is used and to track key bits of information that serve a particular purpose. How to track and visualize data lineage 
Data Lineage Analysis  “Data lineage is defined as a data life cycle that includes the data’s origins and where it moves over time.” It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources. It also enables replaying specific portions or inputs of the dataflow for stepwise debugging or regenerating lost output. In fact, database systems have used such information, called data provenance, to address similar validation and debugging challenges already. Data provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins. The generated evidence supports essential forensic activities such as datadependency analysis, error/compromise detection and recovery, and auditing and compliance analysis. “Lineage is a simple type of why provenance.” 
Data Literacy  A statistical understanding and experience of applying analysis techniques to real data through code and visualization. Data literacy will be the fundamental skill for the 21st century. It’s also extremely easy to learn. The best way to develop this skill is to simply work with datasets. 
Data Mining (DM) 
Data mining (the analysis step of the “Knowledge Discovery in Databases” process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, postprocessing of discovered structures, visualization, and online updating. 
Data Mining Reality Check (DMRC) 
Is a means for ensuring the validity of data mining results. Data Mining Reality Check is technology for giving you the probability distribution against which to compare the best performance from your data mining exercise – that is, the probability distribution of your best network relative to the benchmark, viewed as a random variable generated by a random process in which really there is nothing better than the benchmark. 
Data Normalization  Data normalization is the process of reducing data to its canonical form. For instance, Database normalization is the process of organizing the fields and tables of a relational database to minimize redundancy and dependency. In the field of software security, a common vulnerability is unchecked malicious input. The mitigation for this problem is proper input validation. Before input validation may be performed, the input must be normalized, i.e., eliminating encoding (for instance HTML encoding) and reducing the input data to a single common character set. 
Data Paring  The problem that needs to be more discussed is data paring. The need for this is fairly obvious: data is growing exponentially, and growing your compute data exponentially will require budgets that aren’t realistic. One of the keys to winning at Big Data will be ignoring the noise. As the amount of data increases exponentially, the amount of interesting data doesn’t; I would bet that for most purposes the interesting data added is a tiny percentage of the new data that is added to the overall pool of data. 
Data Partitioning  Data partitioning in data mining is the division of the whole data available into two or three non overlapping sets: the training set , the validation set , and the test set. If the data set is very large, often only a portion of it is selected for the partitions. Partitioning is normally used when the model for the data at hand is being chosen from a broad set of models. The basic idea of data partitioning is to keep a subset of available data out of analysis, and to use it later for verification of the model. 
Data Pattern Processing  
Data Plumbing  
Data Preprocessing  Data preprocessing is an important step in the data mining process. The phrase “garbage in, garbage out” is particularly applicable to data mining and machine learning projects. Datagathering methods are often loosely controlled, resulting in outofrange values (e.g., Income: 100), impossible data combinations (e.g., Sex: Male, Pregnant: Yes), missing values, etc. Analyzing data that has not been carefully screened for such problems can produce misleading results. Thus, the representation and quality of data is first and foremost before running an analysis. 
Data Profiling  Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to: 1. Find out whether existing data can easily be used for other purposes 2. Improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to a category 3. Give metrics on data quality including whether the data conforms to particular standards or patterns 4. Assess the risk involved in integrating data for new applications, including the challenges of joins 5. Assess whether metadata accurately describes the actual values in the source database 6. Understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns. 7. Have an enterprise view of all data, for uses such as master data management where key data is needed, or data governance for improving data quality. 
Data Science  Data science is a buzz word reflecting the application of statistics by advances in computer science. Data science is the study of the generalizable extraction of knowledge from data, yet the key word is science. It incorporates varying elements and builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition and learning, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products. Data Science is not restricted to only big data, although the fact that data is scaling up makes big data an important aspect of data science. 
Data Science Maturity Model (DSMM) 
Many organizations have been underwhelmed by the return on their investment in data science. This is due to a narrow focus on tools, rather than a broader consideration of how data science teams work and how they fit within the larger organization. To help data science practitioners and leaders identify their existing gaps and direct future investment, Domino has developed a framework called the Data Science Maturity Model (DSMM). The DSMM assesses how reliably and sustainably a data science team can deliver value for their organization. The model consists of four levels of maturity and is split along five dimensions that apply to all analytical organizations. By design, the model is not specific to any given industry — it applies as much to an insurance company as it does to a manufacturer. 
Data Science Virtual Machine (DSVM) 
The Data Science Virtual Machine runs on Windows Server 2012 and contains popular tools for data exploration, modeling and development activities. The main tools included are Microsoft R Server Developer Edition (An enterprise ready scalable R framework), Anaconda Python distribution, Julia Pro developer edition, Jupyter notebooks for R, Python and Julia, Visual Studio Community Edition with Python, R and node.js tools, Power BI desktop, SQL Server 2016 Developer edition including support InDatabase analytics using Microsoft R Server. It also includes open source deep learning tools like Microsoft Cognitive Toolkit (CNTK 2.0) and mxnet; ML algorithms like xgboost, Vowpal Wabbit. The Azure SDK and libraries on the VM allows you to build your applications using various services in the cloud that are part of the Cortana Analytics Suite which includes Azure Machine Learning, Azure data factory, Stream Analytics and SQL Datawarehouse, Hadoop, Data Lake, Spark and more. You can deploy models as web services in the cloud on Azure Machine Learning OR deploy them either on the cloud or onpremises using the Microsoft R Server operationalization. 
Data ScienceasaService (DSaaS) 
Data Science as a Service is Analyze’s unique approach to providing today’s business and government with the most advanced and efficient big data and data science analytics. With more than 25 years combined experience in data science and cybersecurity, Analyze helps organizations stay ahead of the competition, increase revenue, and improve operational efficiency. 
Data Standardization  When approaching data for modeling, some standard procedures should be used to prepare the data for modeling: 1.First the data should be filtered, and any outliers removed from the data (watch for a future post on how to scrub your raw data removing only legitimate outliers). 2.The data should be normalized or standardized to bring all of the variables into proportion with one another. For example, if one variable is 100 times larger than another (on average), then your model may be better behaved if you normalize/standardize the two variables to be approximately equivalent. Technically though, whether normalized/standardized, the coefficients associated with each variable will scale appropriately to adjust for the disparity in the variable sizes. 
Data Stewardship  In metadata, a data steward is a person that is responsible for maintaining a data element in a metadata registry. A data steward is a broad job role that incorporates processes, policies, guidelines and responsibilities for administering organizations’ entire data in compliance with business and/or regulatory obligations. A data steward’s responsibility stems from an understanding of the business domain and the interaction of business processes with data entities/elements. A data steward ensures that there are documented procedures and guidelines for data access and use. A data steward may share some responsibilities with a data custodian, and work with database/warehouse administrators and other related staff to plan and execute an enterprisewide data governance, control and compliance policy. Data stewardship roles are common when organizations are attempting to exchange data precisely and consistently between computer systems and reuse datarelated resources. Master data management often makes references to the need for data stewardship for its implementation to succeed. 
Data Stream Mining  Data Stream Mining is the process of extracting knowledge structures from continuous, rapid data records. A data stream is an ordered sequence of instances that in many applications of data stream mining can be read only once or a small number of times using limited computing and storage capabilities. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery. In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the data stream. Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion. Often, concepts from the field of incremental learning, a generalization of Incremental heuristic search are applied to cope with structural changes, online learning and realtime demands. In many applications, especially operating within nonstationary environments, the distribution underlying the instances or the rules underlying their labeling may change over time, i.e. the goal of the prediction, the class to be predicted or the target value to be predicted, may change over time. This problem is referred to as concept drift. 
Data Structure Graph  A Data Structure Graph is a group of atomic entities that are related to each other, stored in a repository, then moved from one persistence layer to another, rendered as a Graph. 
Data Version Control (DVC) 
DVC makes your data science projects reproducible by automatically building data dependency graph (DAG). Your code and the dependencies could be easily shared by Git, and data – through cloud storage (AWS S3, GCP) in a single DVC environment. 
Data Visualization  Data visualization or data visualisation is viewed by many disciplines as a modern equivalent of visual communication. It is not owned by any one field, but rather finds interpretation across many (e.g. it is viewed as a modern branch of descriptive statistics by some, but also as a grounded theory development tool by others). It involves the creation and study of the visual representation of data, meaning ‘information that has been abstracted in some schematic form, including attributes or variables for the units of information’. A primary goal of data visualization is to communicate information clearly and efficiently to users via the information graphics selected, such as tables and charts. Effective visualization helps users in analyzing and reasoning about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphic (i.e., showing comparisons or showing causality) follows the task. Tables are generally used where users will lookup a specific measure of a variable, while charts of various types are used to show patterns or relationships in the data for one or more variables. Data visualization is both an art and a science. The rate at which data is generated has increased, driven by an increasingly informationbased economy. Data created by internet activity and an expanding number of sensors in the environment, such as satellites and traffic cameras, are referred to as ‘Big Data’. Processing, analyzing and communicating this data present a variety of ethical and analytical challenges for data visualization. The field of data science and practitioners called data scientists have emerged to help address this challenge. 
Data Warehouse (DW) 
In computing, a data warehouse (DW, DWH), or an enterprise data warehouse (EDW), is a database used for reporting and data analysis. Integrating data from one or more disparate sources creates a central repository of data, a data warehouse (DW). Data warehouses store current and historical data and are used for creating trending reports for senior management reporting such as annual and quarterly comparisons. 
DataasaService (DaaS) 
Data as a Service, or DaaS, is a cousin of software as a service. Like all members of the “as a Service” (aaS) family, DaaS is based on the concept that the product, data in this case, can be provided on demand to the user regardless of geographic or organizational separation of provider and consumer. Additionally, the emergence of serviceoriented architecture (SOA) has rendered the actual platform on which the data resides also irrelevant. This development has enabled the recent emergence of the relatively new concept of DaaS. Data provided as a service was at first primarily used in Web mashups, but now is being increasingly employed both commercially and, less commonly, within organisations such as the UN. 
DataDriven Threshold Machine (DTM) 
We present a novel distributionfree approach, the datadriven threshold machine (DTM), for a fundamental problem at the core of many learning tasks: choose a threshold for a given prespecified level that bounds the tail probability of the maximum of a (possibly dependent but stationary) random sequence. We do not assume data distribution, but rather relying on the asymptotic distribution of extremal values, and reduce the problem to estimate three parameters of the extreme value distributions and the extremal index. We specially take care of data dependence via estimating extremal index since in many settings, such as scan statistics, changepoint detection, and extreme bandits, where dependence in the sequence of statistics can be significant. Key features of our DTM also include robustness and the computational efficiency, and it only requires one sample path to form a reliable estimate of the threshold, in contrast to the Monte Carlo sampling approach which requires drawing a large number of sample paths. We demonstrate the good performance of DTM via numerical examples in various dependent settings. 
Dataflow Matrix Machine (DMM) 
Dataflow matrix machines generalize neural nets by replacing streams of numbers with streams of vectors (or other kinds of linear streams admitting a notion of linear combination of several streams) and adding a few more changes on top of that, namely arbitrary input and output arities for activation functions, countablesized networks with finite dynamically changeable active part capable of unbounded growth, and a very expressive selfreferential mechanism. While recurrent neural networks are Turingcomplete, they form an esoteric programming platform, not conductive for practical generalpurpose programming. Dataflow matrix machines are more suitable as a generalpurpose programming platform, although it remains to be seen whether this platform can be made fully competitive with more traditional programming platforms currently in use. At the same time, dataflow matrix machines retain the key property of recurrent neural networks: programs are expressed via matrices of real numbers, and continuous changes to those matrices produce arbitrarily small variations in the programs associated with those matrices. Spaces of vectorlike elements are of particular importance in this context. In particular, we focus on the vector space $V$ of finite linear combinations of strings, which can be also understood as the vector space of finite prefix trees with numerical leaves, the vector space of ‘mixed rank tensors’, or the vector space of recurrent maps. This space, and a family of spaces of vectorlike elements derived from it, are sufficiently expressive to cover all cases of interest we are currently aware of, and allow a compact and streamlined version of dataflow matrix machines based on a single space of vectorlike elements and variadic neurons. We call elements of these spaces Vvalues. Their role in our context is somewhat similar to the role of Sexpressions in Lisp. 
Dataflow Reuse Algorithms  Distributed Stream Processing Systems (DSPS) like Apache Storm and Spark Streaming enable composition of continuous dataflows that execute persistently over data streams. They are used by Internet of Things (IoT) applications to analyze sensor data from Smart City cyberinfrastructure, and make active utility management decisions. As the ecosystem of such IoT applications that leverage shared urban sensor streams continue to grow, applications will perform duplicate preprocessing and analytics tasks. This offers the opportunity to collaboratively reuse the outputs of overlapping dataflows, thereby improving the resource efficiency. In this paper, we propose \emph{dataflow reuse algorithms} that given a submitted dataflow, identifies the intersection of reusable tasks and streams from a collection of running dataflows to form a \emph{merged dataflow}. Similar algorithms to unmerge dataflows when they are removed are also proposed. We implement these algorithms for the popular Apache Storm DSPS, and validate their performance and resource savings for 35 synthetic dataflows based on public OPMW workflows with diverse arrival and departure distributions, and on 21 real IoT dataflows from RIoTBench. 
DatatoDecisions (D2D) 

Datification / Datafication  A concept that tracks the conception, development, storage and marketing of all types of data, both for business and life. It has grown in popularity of late to capture how data measures things and organizations in order to compete and win. It is about making business visible. 
Dato  Dato (formerly known as GraphLab): Your app drives business. From inspiration to production, build intelligent apps fast with the power of Dato’s machine learning platform. Data science at scale has never been easier. 
DawidSkene Algorithm (DSA) 
More and more online communities classify contributions based on collaborative ratings of these contributions. A popular method for such a ratingbased classification is the DawidSkene algorithm (DSA). However, despite its popularity, DSA has two major shortcomings: (1) It is vulnerable to raters with a low competence, i.e., a low probability of rating correctly. (2) It is defenseless against collusion attacks. In a collusion attack, raters coordinate to rate the same data objects with the same value to artificially increase their remuneration. Error Rate Analysis of Labeling by Crowdsourcing 
DC.js  dc.js is a javascript charting library with native crossfilter support and allowing highly efficient exploration on large multidimensional dataset (inspired by crossfilter’s demo). It leverages d3 engine to render charts in css friendly svg format. Charts rendered using dc.js are naturally data driven and reactive therefore providing instant feedback on user’s interaction. The main objective of this project is to provide an easy yet powerful javascript library which can be utilized to perform data visualization and analysis in browser as well as on mobile device. 
DCDistance  Text Mining is a field that aims at extracting information from textual data. One of the challenges of such field of study comes from the preprocessing stage in which a vector (and structured) representation should be extracted from unstructured data. The common extraction creates large and sparse vectors representing the importance of each term to a document. As such, this usually leads to the curseofdimensionality that plagues most machine learning algorithms. To cope with this issue, in this paper we propose a new supervised feature extraction and reduction algorithm, named DCDistance, that creates features based on the distance between a document to a representative of each class label. As such, the proposed technique can reduce the features set in more than 99% of the original set. Additionally, this algorithm was also capable of improving the classification accuracy over a set of benchmark datasets when compared to traditional and stateoftheart features selection algorithms. 
DCM Bandits  Search engines recommend a list of web pages. The user examines this list, from the first page to the last, and may click on multiple attractive pages. This type of user behavior can be modeled by the \emph{dependent click model (DCM)}. In this work, we propose \emph{DCM bandits}, an online learning variant of the DCM model where the objective is to maximize the probability of recommending a satisfactory item. The main challenge of our problem is that the learning agent does not observe the reward. It only observes the clicks. This imbalance between the feedback and rewards makes our setting challenging. We propose a computationallyefficient learning algorithm for our problem, which we call dcmKLUCB; derive gapdependent upper bounds on its regret under reasonable assumptions; and prove a matching lower bound up to logarithmic factors. We experiment with dcmKLUCB on both synthetic and realworld problems. Our algorithm outperforms a range of baselines and performs well even when our modeling assumptions are violated. To the best of our knowledge, this is the first regretoptimal online learning algorithm for learning to rank with multiple clicks in a cascadelike model. 
De Bruijn Entropy  De Bruijn entropy and string similarity 
Debagging  It is easy to convert a sentence into a bag of words, but it is much harder to convert a bag of words into a meaningful sentence. We name the latter the debagging problem. 
Decentralized HighDimensional Bayesian Optimization (DECHBO) 
This paper presents a novel decentralized highdimensional Bayesian optimization (DECHBO) algorithm that, in contrast to existing HBO algorithms, can exploit the interdependent effects of various input components on the output of the unknown objective function f for boosting the BO performance and still preserve scalability in the number of input dimensions without requiring prior knowledge or the existence of a low (effective) dimension of the input space. To realize this, we propose a sparse yet rich factor graph representation of f to be exploited for designing an acquisition function that can be similarly represented by a sparse factor graph and hence be efficiently optimized in a decentralized manner using distributed message passing. Despite richly characterizing the interdependent effects of the input components on the output of f with a factor graph, DECHBO can still guarantee noregret performance asymptotically. Empirical evaluation on synthetic and realworld experiments (e.g., sparse Gaussian process model with 1811 hyperparameters) shows that DECHBO outperforms the stateoftheart HBO algorithms. 
Decision Analysis (DA) 
Decision analysis (DA) is the discipline comprising the philosophy, theory, methodology, and professional practice necessary to address important decisions in a formal manner. Decision analysis includes many procedures, methods, and tools for identifying, clearly representing, and formally assessing important aspects of a decision, for prescribing a recommended course of action by applying the maximum expected utility action axiom to a wellformed representation of the decision, and for translating the formal representation of a decision and its corresponding recommendation into insight for the decision maker and other stakeholders. 
Decision Model and Notation (DMN) 
The primary goal of DMN is to provide an industry standard modelling notation for decision management and business rules that is readily understandable by all business users: from the business analysts who need to create initial decision requirements and then more detailed decision models, to the technical developers responsible for automating the decisions in processes, and finally, to the business people who will manage and monitor those decisions. The submission has been designed to be complementary to and useable alongside the OMG Business Process Model & Notation (BPMN) standard and will ensure that decision models are interchangeable across organizations. 
Decision Scientist  Decision Scientists build decision support tools to enable decision makers to make decisions, or take action, under uncertainty with a datacentric bias. Traditional analytics falls under this domain. Often decision makers like linear solutions that provide simple, explainable, socializable decision making frameworks. That is, they are looking for a rationale. Data Scientists build machines to make decisions about largescale complex dynamical processes that are typically too fast (velocity, veracity, volume, etc.) for a human operator/manager. They typically don’t concern themselves with whether the algorithm is explainable or socializable, but are more concerned with whether it is functional, reliable, accurate, and robust. 
Decision Stream  Various modifications of decision trees have been extensively used during the past years due to their high efficiency and interpretability. Selection of relevant features for spitting the tree nodes is a key property of their architecture, at the same time being their major shortcoming: the recursive nodes partitioning leads to geometric reduction of data quantity in the leaf nodes, which causes an excessive model complexity and data overfitting. In this paper, we present a novel architecture – a Decision Stream, – aimed to overcome this problem. Instead of building an acyclic tree structure during the training process, we propose merging nodes from different branches based on their similarity that is estimated with twosample test statistics. To evaluate the proposed solution, we test it on several common machine learning problems~— credit scoring, twitter sentiment analysis, aircraft flight control, MNIST and CIFAR image classification, synthetic data classification and regression. Our experimental results reveal that the proposed approach significantly outperforms the standard decision tree method on both regression and classification tasks, yielding a prediction error decrease up to 35%. 
Decision Stump  A decision stump is a machine learning model consisting of a onelevel decision tree. That is, it is a decision tree with one internal node (the root) which is immediately connected to the terminal nodes (its leaves). A decision stump makes a prediction based on the value of just a single input feature. Sometimes they are also called 1rules. 
Decision Support (DS) 
The term Decision Support (DS) is used often and in a variety of contexts related to decision making. Recently, for example, it is often mentioned in connection with Data Warehouses and OnLine Analytical Processing (OLAP). Another recent trend is to associate DS with Data Mining. This is the case in the project SolEuNet , which attempts to exploit these two approaches in a complementary way in order to support difficult reallife problem solving. Unfortunately, although the term ‘Decision Support’ seems rather intuitive and simple, it is in fact very loosely defined. It means different things to different people and in different contexts. Also, its meaning has shifted during the recent history. Nowadays, DS is probably most often associated with Data Warehouses and OLAP. A decade ago, it was coupled with Decision Support Systems (DSS). Still before that, there was a close link with Operations Research (OR) and Decision Analysis (DA). This causes a lot of confusion and misunderstanding, and provokes requests for clarification. The confusion is further exemplified by the multitude of related terms and acronyms that are either equal to, or start with ‘DS’: Decision Support, Decision Sciences, Decision Systems, Decision Support Systems, etc. This paper attempts to clarify these issues. We take the viewpoint that Decision Support is a broad, generic term that encompasses all aspects related to supporting people in making decisions. First, we present the results of a survey of WWW documents related to DS. On this basis, and on the basis of relevant literature and our previous experience in the field of DS, we provide a classification of DS and related disciplines. DS itself is given a role within Decision Making and Decision Sciences. Some most prominent DS disciplines are briefly overviewed: Operations Research, Decision Analysis, Decision Support Systems, Data Warehousing and OLAP, and Group Decision Support. 
Decision Support System (DSS) 
A Decision Support System (DSS) is a computerbased information system that supports business or organizational decisionmaking activities. DSSs serve the management, operations, and planning levels of an organization (usually mid and higher management) and help to make decisions, which may be rapidly changing and not easily specified in advance (Unstructured and SemiStructured decision problems). Decision support systems can be either fully computerized, human or a combination of both. While academics have perceived DSS as a tool to support decision making process, DSS users see DSS as a tool to facilitate organizational processes. Some authors have extended the definition of DSS to include any system that might support decision making. Sprague (1980) defines DSS by its characteristics: 1. DSS tends to be aimed at the less well structured, underspecified problem that upper level managers typically face; 2. DSS attempts to combine the use of models or analytic techniques with traditional data access and retrieval functions; 3. DSS specifically focuses on features which make them easy to use by noncomputer people in an interactive mode; and 4. DSS emphasizes flexibility and adaptability to accommodate changes in the environment and the decision making approach of the user. DSSs include knowledgebased systems. A properly designed DSS is an interactive softwarebased system intended to help decision makers compile useful information from a combination of raw data, documents, and personal knowledge, or business models to identify and solve problems and make decisions. Typical information that a decision support application might gather and present includes: • inventories of information assets (including legacy and relational data sources, cubes, data warehouses, and data marts), • comparative sales figures between one period and the next, • projected revenue figures based on product sales assumptions. 
Decision Theory  Decision theory or theory of choice in economics, psychology, philosophy, mathematics, computer science, and statistics is concerned with identifying the values, uncertainties and other issues relevant in a given decision, its rationality, and the resulting optimal decision. It is closely related to the field of game theory; decision theory is concerned with the choices of individual agents whereas game theory is concerned with interactions of agents whose decisions affect each other. 
Decision Tree Based Missing Value Imputation Technique (DMI) 
Decision tree based Missing value Imputation technique’ (DMI) makes use of an EM algorithm and a decision tree (DT) algorithm. 
Decision Tree Learning / Classification and Regression Trees (CART) 
Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item’s target value. It is one of the predictive modelling approaches used in statistics, data mining and machine learning. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. 
Declarative Statistics  In this work we introduce declarative statistics, a suite of declarative modelling tools for statistical analysis. Statistical constraints represent the key building block of declarative statistics. First, we introduce a range of relevant counting and matrix constraints and associated decompositions, some of which novel, that are instrumental in the design of statistical constraints. Second, we introduce a selection of novel statistical constraints and associated decompositions, which constitute a selfcontained toolbox that can be used to tackle a wide range of problems typically encountered by statisticians. Finally, we deploy these statistical constraints to a wide range of application areas drawn from classical statistics and we contrast our framework against established practices. 
Deconvolutional Paragraph Representation Learning  Learning latent representations from long text sequences is an important first step in many natural language processing applications. Recurrent Neural Networks (RNNs) have become a cornerstone for this challenging task. However, the quality of sentences during RNNbased decoding (reconstruction) decreases with the length of the text. We propose a sequencetosequence, purely convolutional and deconvolutional autoencoding framework that is free of the above issue, while also being computationally efficient. The proposed method is simple, easy to implement and can be leveraged as a building block for many applications. We show empirically that compared to RNNs, our framework is better at reconstructing and correcting long paragraphs. Quantitative evaluation on semisupervised text classification and summarization tasks demonstrate the potential for better utilization of long unlabeled text data. 
Deducer  An R Graphical User Interface (GUI) for Everyone: Deducer is designed to be a free easy to use alternative to proprietary data analysis software such as SPSS, JMP, and Minitab. It has a menu system to do common data manipulation and analysis tasks, and an excellike spreadsheet in which to view and edit data frames. The goal of the project is two fold. 1. Provide an intuitive graphical user interface (GUI) for R, encouraging nontechnical users to learn and perform analyses without programming getting in their way. 2. Increase the efficiency of expert R users when performing common tasks by replacing hundreds of keystrokes with a few mouse clicks. Also, as much as possible the GUI should not get in their way if they just want to do some programming. Deducer is designed to be used with the Java based R console JGR, though it supports a number of other R environments (e.g. Windows RGUI and RTerm). 
Deduplication with Hadoop (Dedoop) 
Entity Matching for Big Data: Automatically matching entities (objects) and ontologies are key technologies to semantically integrate heterogeneous data. These match techniques are needed to identify equivalent data objects (duplicates) or semantically equivalent metadata elements (ontology concepts, schema attributes). The proposed techniques demand very high resources that limit their applicability to largescale (Big Data) problems unless a powerful cloud infrastructure can be utilized. This is because the (fuzzy) match approaches basically have a quadratic complexity to compare the all elements to be matched with each other. For sufficient match quality, multiple match algorithms need to be applied and combined within socalled match workflows adding further resource requirements as well as a significant optimization problem to select matchers and configure their combination. 
Deep Abstract QNetwork  We examine the problem of learning and planning on highdimensional domains with long horizons and sparse rewards. Recent approaches have shown great successes in many Atari 2600 domains. However, domains with long horizons and sparse rewards, such as Montezuma’s Revenge and Venture, remain challenging for existing methods. Methods using abstraction (Dietterich 2000; Sutton, Precup, and Singh 1999) have shown to be useful in tackling longhorizon problems. We combine recent techniques of deep reinforcement learning with existing modelbased approaches using an expertprovided state abstraction. We construct toy domains that elucidate the problem of long horizons, sparse rewards and highdimensional inputs, and show that our algorithm significantly outperforms previous methods on these domains. Our abstractionbased approach outperforms Deep QNetworks (Mnih et al. 2015) on Montezuma’s Revenge and Venture, and exhibits backtracking behavior that is absent from previous methods. 
Deep Alignment Network (DAN) 
In this paper, we propose Deep Alignment Network (DAN), a robust face alignment method based on a deep neural network architecture. DAN consists of multiple stages, where each stage improves the locations of the facial landmarks estimated by the previous stage. Our method uses entire face images at all stages, contrary to the recently proposed face alignment methods that rely on local patches. This is possible thanks to the use of landmark heatmaps which provide visual information about landmark locations estimated at the previous stages of the algorithm. The use of entire face images rather than patches allows DAN to handle face images with large variation in head pose and difficult initializations. An extensive evaluation on two publicly available datasets shows that DAN reduces the stateoftheart failure rate by up to 70%. Our method has also been submitted for evaluation as part of the Menpo challenge. 
Deep Approximately Orthogonal Nonnegative Matrix Factorization  Nonnegative Matrix Factorization (NMF) is a widely used technique for data representation. Inspired by the expressive power of deep learning, several NMF variants equipped with deep architectures have been proposed. However, these methods mostly use the only nonnegativity while ignoring taskspecific features of data. In this paper, we propose a novel deep approximately orthogonal nonnegative matrix factorization method where both nonnegativity and orthogonality are imposed with the aim to perform a hierarchical clustering by using different level of abstractions of data. Experiment on two face image datasets showed that the proposed method achieved better clustering performance than other deep matrix factorization methods and stateoftheart single layer NMF variants. 
Deep Asymmetric Multitask Feature Learning (DeepAMTFL) 
We propose Deep Asymmetric Multitask Feature Learning (DeepAMTFL) which can learn deep representations shared across multiple tasks while effectively preventing negative transfer that may happen in the feature sharing process. Specifically, we introduce an asymmetric autoencoder term that allows predictors for the confident tasks to have high contribution to the feature learning while suppressing the influences of less confident task predictors. This allows learning less noisy representations, and allows weak predictors to exploit knowledge from the strong predictors via the shared latent features. Such asymmetric knowledge transfer through shared features is also more scalable and efficient than intertask asymmetric transfer. We validate our DeepAMTFL model on multiple benchmark datasets for multitask learning and image classification, on which it significantly outperforms existing symmetric and asymmetric multitask learning models, by effectively preventing negative transfer in deep feature learning. 
Deep Belief Networks (DBN) 
In machine learning, a deep belief network (DBN) is a generative graphical model, or alternatively a type of deep neural network, composed of multiple layers of latent variables (“hidden units”), with connections between the layers but not between units within each layer. When trained on a set of examples in an unsupervised way, a DBN can learn to probabilistically reconstruct its inputs. The layers then act as feature detectors on inputs. After this learning step, a DBN can be further trained in a supervised way to perform classification. DBNs can be viewed as a composition of simple, unsupervised networks such as restricted Boltzmann machines (RBMs) or autoencoders, where each subnetwork’s hidden layer serves as the visible layer for the next. This also leads to a fast, layerbylayer unsupervised training procedure, where contrastive divergence is applied to each subnetwork in turn, starting from the “lowest” pair of layers (the lowest visible layer being a training set). The observation, due to Hinton’s student Teh, that DBNs can be trained greedily, one layer at a time, has been called a breakthrough in deep learning. 
Deep Broad Learning (DBL) 
Deep learning has demonstrated the power of detailed modeling of complex highorder (multivariate) interactions in data. For some learning tasks there is power in learning models that are not only Deep but also Broad. By Broad, we mean models that incorporate evidence from large numbers of features. This is of especial value in applications where many different features and combinations of features all carry small amounts of information about the class. The most accurate models will integrate all that information. In this paper, we propose an algorithm for Deep Broad Learning called DBL. The proposed algorithm has a tunable parameter $n$, that specifies the depth of the model. It provides straightforward paths towards outofcore learning for large data. We demonstrate that DBL learns models from large quantities of data with accuracy that is highly competitive with the stateoftheart. 
Deep Canonical Correlation Analysis (DCCA) 

Deep Coherence Model (DCM) 
In this paper, we propose a novel deep coherence model (DCM) using a convolutional neural network architecture to capture the text coherence. The text coherence problem is investigated with a new perspective of learning sentence distributional representation and text coherence modeling simultaneously. In particular, the model captures the interactions between sentences by computing the similarities of their distributional representations. Further, it can be easily trained in an endtoend fashion. The proposed model is evaluated on a standard Sentence Ordering task. The experimental results demonstrate its effectiveness and promise in coherence assessment showing a significant improvement over the stateoftheart by a wide margin. 
Deep Collaborative Autoencoder (DCAE) 
In recent years, deep neural networks have yielded stateoftheart performance on several tasks. Although some recent works have focused on combining deep learning with recommendation, we highlight three issues of existing works. First, most works perform deep content feature learning and resort to matrix factorization, which cannot effectively model the highly complex useritem interaction function. Second, due to the difficulty on training deep neural networks, existing models utilize a shallow architecture, and thus limit the expressiveness potential of deep learning. Third, neural network models are easy to overfit on the implicit setting, because negative interactions are not taken into account. To tackle these issues, we present a novel recommender framework called Deep Collaborative Autoencoder (DCAE) for both explicit feedback and implicit feedback, which can effectively capture the relationship between interactions via its nonlinear expressiveness. To optimize the deep architecture of DCAE, we develop a threestage pretraining mechanism that combines supervised and unsupervised feature learning. Moreover, we propose a popularitybased error reweighting module and a sparsityaware dataaugmentation strategy for DCAE to prevent overfitting on the implicit setting. Extensive experiments on three realworld datasets demonstrate that DCAE can significantly advance the stateoftheart. 
Deep Convolutional Decision Jungle (CDJ) 
We propose a novel method called deep convolutional decision jungle (CDJ) and its learning algorithm for image classification. The CDJ maintains the structure of standard convolutional neural networks (CNNs), i.e. multiple layers of multiple response maps fully connected. Each response mapor nodein both the convolutional and fullyconnected layers selectively respond to class labels s.t. each data sample travels via a specific soft route of those activated nodes. The proposed method CDJ automatically learns features, whereas decision forests and jungles require predefined feature sets. Compared to CNNs, the method embeds the benefits of using datadependent discriminative functions, which better handles multimodal/heterogeneous data; further,the method offers more diverse sparse network responses, which in turn can be used for costeffective learning/classification. The network is learnt by combining conventional softmax and proposed entropy losses in each layer. The entropy loss,as used in decision tree growing, measures the purity of data activation according to the class label distribution. The backpropagation rule for the proposed loss function is derived from stochastic gradient descent (SGD) optimization of CNNs. We show that our proposed method outperforms stateoftheart methods on three public image classification benchmarks and one face verification dataset. We also demonstrate the use of auxiliary data labels, when available, which helps our method to learn more discriminative routing and representations and leads to improved classification. 
Deep Convolutional Generative Adversarial Networks (DCGAN) 
In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. 
Deep Convolutional Neural Network (DCN) 
A grand challenge in machine learning is the development of computational algorithms that match or outperform humans in perceptual inference tasks such as visual object and speech recognition. The key factor complicating such tasks is the presence of numerous nuisance variables, for instance, the unknown object position, orientation, and scale in object recognition or the unknown voice pronunciation, pitch, and speed in speech recognition. Recently, a new breed of deep learning algorithms have emerged for highnuisance inference tasks; they are constructed from many layers of alternating linear and nonlinear processing units and are trained using largescale algorithms and massive amounts of training data. The recent success of deep learning systems is impressive — they now routinely yield pattern recognition systems with nearor superhuman capabilities — but a fundamental question remains: Why do they work? Intuitions abound, but a coherent framework for understanding, analyzing, and synthesizing deep learning architectures has remained elusive. We answer this question by developing a new probabilistic framework for deep learning based on a Bayesian generative probabilistic model that explicitly captures variation due to nuisance variables. The graphical structure of the model enables it to be learned from data using classical expectationmaximization techniques. Furthermore, by relaxing the generative model to a discriminative one, we can recover two of the current leading deep learning systems, deep convolutional neural networks (DCNs) and random decision forests (RDFs), providing insights into their successes and shortcomings as well as a principled route to their improvement. 
Deep CoSpace (DCS) 
Aiming at improving performance of visual classification in a costeffective manner, this paper proposes an incremental semisupervised learning paradigm called Deep CoSpace (DCS). Unlike many conventional semisupervised learning methods usually performing within a fixed feature space, our DCS gradually propagates information from labeled samples to unlabeled ones along with deep feature learning. We regard deep feature learning as a series of steps pursuing feature transformation, i.e., projecting the samples from a previous space into a new one, which tends to select the reliable unlabeled samples with respect to this setting. Specifically, for each unlabeled image instance, we measure its reliability by calculating the category variations of feature transformation from two different neighborhood variation perspectives, and merged them into an unified sample mining criterion deriving from Hellinger distance. Then, those samples keeping stable correlation to their neighboring samples (i.e., having small category variation in distribution) across the successive feature space transformation, are automatically received labels and incorporated into the model for incrementally training in terms of classification. Our extensive experiments on standard image classification benchmarks (e.g., Caltech256 and SUN397) demonstrate that the proposed framework is capable of effectively mining from largescale unlabeled images, which boosts image classification performance and achieves promising results compared to other semisupervised learning methods. 
Deep Data  What we call ‘deep data’ is a combination of experts’ domain knowledge of the area … combined with data science. 
Deep Density Networks (DDN) 
Building robust online content recommendation systems requires learning complex interactions between user preferences and content features. The field has evolved rapidly in recent years from traditional multiarm bandit and collaborative filtering techniques, with new methods integrating Deep Learning models that enable to capture nonlinear feature interactions. Despite progress, the dynamic nature of online recommendations still poses great challenges, such as finding the delicate balance between exploration and exploitation. In this paper we provide a novel method, Deep Density Networks (DDN) which deconvolves measurement and data uncertainties and predicts probability density of CTR (Click Through Rate), enabling us to perform more efficient exploration of the feature space. We show the usefulness of using DDN online in a real world content recommendation system that serves billions of recommendations per day, and present online and offline results to evaluate the benefit of using DDN. 
Deep Discrete Supervised Hashing (DDSH) 
Hashing has been widely used for largescale search due to its low storage cost and fast query speed. By using supervised information, supervised hashing can significantly outperform unsupervised hashing. Recently, discrete supervised hashing and deep hashing are two representative progresses in supervised hashing. On one hand, hashing is essentially a discrete optimization problem. Hence, utilizing supervised information to directly guide discrete (binary) coding procedure can avoid suboptimal solution and improve the accuracy. On the other hand, deep hashing, which integrates deep feature learning and hashcode learning into an endtoend architecture, can enhance the feedback between feature learning and hashcode learning. The key in discrete supervised hashing is to adopt supervised information to directly guide the discrete coding procedure in hashing. The key in deep hashing is to adopt the supervised information to directly guide the deep feature learning procedure. However, there have not existed works which can use the supervised information to directly guide both discrete coding procedure and deep feature learning procedure in the same framework. In this paper, we propose a novel deep hashing method, called deep discrete supervised hashing (DDSH), to address this problem. DDSH is the first deep hashing method which can utilize supervised information to directly guide both discrete coding procedure and deep feature learning procedure, and thus enhance the feedback between these two important procedures. Experiments on three real datasets show that DDSH can outperform other stateoftheart baselines, including both discrete hashing and deep hashing baselines, for image retrieval. 
Deep Echo State Network (deepESN) 
The study of deep recurrent neural networks (RNNs) and, in particular, of deep Reservoir Computing (RC) is gaining an increasing research attention in the neural networks community. The recently introduced deep Echo State Network (deepESN) model opened the way to an extremely efficient approach for designing deep neural networks for temporal data. At the same time, the study of deepESNs allowed to shed light on the intrinsic properties of state dynamics developed by hierarchical compositions of recurrent layers, i.e. on the bias of depth in RNNs architectural design. In this paper, we summarize the advancements in the development, analysis and applications of deepESNs. 
Deep Evolutionary Network Structured Representation (DENSER) 
Deep Evolutionary Network Structured Representation (DENSER) is a novel approach to automatically design Artificial Neural Networks (ANNs) using Evolutionary Computation (EC). The algorithm not only searches for the best network topology (e.g., number of layers, type of layers), but also tunes hyperparameters, such as, learning parameters or data augmentation parameters. The automatic design is achieved using a representation with two distinct levels, where the outer level encodes the general structure of the network, i.e., the sequence of layers, and the inner level encodes the parameters associated with each layer. The allowed layers and hyperparameter value ranges are defined by means of a humanreadable ContextFree Grammar. DENSER was used to evolve ANNs for two widely used image classification benchmarks obtaining an average accuracy result of up to 94.27% on the CIFAR10 dataset, and of 78.75% on the CIFAR100. To the best of our knowledge, our CIFAR100 results are the highest performing models generated by methods that aim at the automatic design of Convolutional Neural Networks (CNNs), and is amongst the best for manually designed and finetuned CNNs . 
Deep Expander Network (XNet) 
Deep Neural Networks, while being unreasonably effective for several vision tasks, have their usage limited by the computational and memory requirements, both during training and inference stages. Analyzing and improving the connectivity patterns between layers of a network has resulted in several compact architectures like GoogleNet, ResNet and DenseNetBC. In this work, we utilize results from graph theory to develop an efficient connection pattern between consecutive layers. Specifically, we use {\it expander graphs} that have excellent connectivity properties to develop a sparse network architecture, the deep expander network (XNet). The XNets are shown to have high connectivity for a given level of sparsity. We also develop highly efficient training and inference algorithms for such networks. Experimental results show that we can achieve the similar or better accuracy as DenseNetBC with twothirds the number of parameters and FLOPs on several image classification benchmarks. We hope that this work motivates other approaches to utilize results from graph theory to develop efficient network architectures. 
Deep Frame Interpolation  This work presents a supervised learning based approach to the computer vision problem of frame interpolation. The presented technique could also be used in the cartoon animations since drawing each individual frame consumes a noticeable amount of time. The most existing solutions to this problem use unsupervised methods and focus only on real life videos with already high frame rate. However, the experiments show that such methods do not work as well when the frame rate becomes low and object displacements between frames becomes large. This is due to the fact that interpolation of the large displacement motion requires knowledge of the motion structure thus the simple techniques such as frame averaging start to fail. In this work the deep convolutional neural network is used to solve the frame interpolation problem. In addition, it is shown that incorporating the prior information such as optical flow improves the interpolation quality significantly. 
Deep Gaussian Covariance Network (DGCP) 
The correlation lengthscale next to the noise variance are the most used hyperparameters for the Gaussian processes. Typically, stationary covariance functions are used, which are only dependent on the distances between input points and thus invariant to the translations in the input space. The optimization of the hyperparameters is commonly done by maximizing the log marginal likelihood. This works quite well, if the distances are uniform distributed. In the case of a locally adapted or even sparse input space, the prediction of a test point can be worse dependent of its position. A possible solution to this, is the usage of a nonstationary covariance function, where the hyperparameters are calculated by a deep neural network. So that the correlation length scales and possibly the noise variance are dependent on the test point. Furthermore, different types of covariance functions are trained simultaneously, so that the Gaussian process prediction is an additive overlay of different covariance matrices. The right covariance functions combination and its hyperparameters are learned by the deep neural network. Additional, the Gaussian process will be able to be trained by batches or online and so it can handle arbitrarily large data sets. We call this framework Deep Gaussian Covariance Network (DGCP). There are also further extensions to this framework possible, for example sequentially dependent problems like time series or the local mixture of experts. The basic framework and some extension possibilities will be presented in this work. Moreover, a comparison to some recent state of the art surrogate model methods will be performed, also for a time dependent problem. 
Deep Gaussian Mixture Model  Deep learning is a hierarchical inference method formed by subsequent multiple layers of learning able to more efficiently describe complex relationships. In this work, Deep Gaussian Mixture Models are introduced and discussed. A Deep Gaussian Mixture model (DGMM) is a network of multiple layers of latent variables, where, at each layer, the variables follow a mixture of Gaussian distributions. Thus, the deep mixture model consists of a set of nested mixtures of linear models, which globally provide a nonlinear model able to describe the data in a very flexible way. In order to avoid overparameterized solutions, dimension reduction by factor models can be applied at each layer of the architecture thus resulting in deep mixtures of factor analysers. 
Deep Generalized Canonical Correlation Analysis (DGCCA) 
We present Deep Generalized Canonical Correlation Analysis (DGCCA) — a method for learning nonlinear transformations of arbitrarily many views of data, such that the resulting transformations are maximally informative of each other. While methods for nonlinear twoview representation learning (Deep CCA, (Andrew et al., 2013)) and linear manyview representation learning (Generalized CCA (Horst, 1961)) exist, DGCCA is the first CCAstyle multiview representation learning technique that combines the flexibility of nonlinear (deep) representation learning with the statistical power of incorporating information from many independent sources, or views. We present the DGCCA formulation as well as an efficient stochastic optimization algorithm for solving it. We learn DGCCA repre sentations on two distinct datasets for three downstream tasks: phonetic transcrip tion from acoustic and articulatory measurements, and recommending hashtags and friends on a dataset of Twitter users. We find that DGCCA representations soundly beat existing methods at phonetic transcription and hashtag recommendation, and in general perform no worse than standard linear manyview techniques. 
Deep Gradient Compression (DGC) 
Largescale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multinode training, and requires expensive highbandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warmup training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus. On these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270x to 600x without losing accuracy, cutting the gradient size of ResNet50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. Deep gradient compression enables largescale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile. 
Deep Hashing Neural Network (HNN) 
In this paper we propose a synergistic melting of neural networks and decision trees into a deep hashing neural network (HNN) having a modeling capability exponential with respect to its number of neurons. We first derive a soft decision tree named neural decision tree allowing the optimization of arbitrary decision function at each split node. We then rewrite this soft space partitioning as a new kind of neural network layer, namely the hashing layer (HL), which can be seen as a generalization of the known softmax layer. This HL can easily replace the standard last layer of ANN in any known network topology and thus can be used after a convolutional or recurrent neural network for example. We present the modeling capacity of this deep hashing function on small datasets where one can reach at least equally good results as standard neural networks by diminishing the number of output neurons. Finally, we show that for the case where the number of output neurons is large, the neural network can mitigate the absence of linear decision boundaries by learning for each difficult class a collection of not necessarily connected subregions of the space leading to more flexible decision surfaces. Finally, the HNN can be seen as a deep locality sensitive hashing function which can be trained in a supervised or unsupervised setting as we will demonstrate for classification and regression problems. 
Deep Hyperalignment (DHA) 
This paper proposes Deep Hyperalignment (DHA) as a regularized, deep extension, scalable Hyperalignment (HA) method, which is wellsuited for applying functional alignment to fMRI datasets with nonlinearity, highdimensionality (broad ROI), and a large number of subjects. Unlink previous methods, DHA is not limited by a restricted fixed kernel function. Further, it uses a parametric approach, rank$m$ Singular Value Decomposition (SVD), and stochastic gradient descent for optimization. Therefore, DHA has a suitable time complexity for large datasets, and DHA does not require the training data when it computes the functional alignment for a new subject. Experimental studies on multisubject fMRI analysis confirm that the DHA method achieves superior performance to other stateoftheart HA algorithms. 
Deep Hyperspherical Learning  ➘ “Hyperspherical Convolution” 
Deep Incremental Boosting  This paper introduces Deep Incremental Boosting, a new technique derived from AdaBoost, specifically adapted to work with Deep Learning methods, that reduces the required training time and improves generalisation. We draw inspiration from Transfer of Learning approaches to reduce the startup time to training each incremental Ensemble member. We show a set of experiments that outlines some preliminary results on some common Deep Learning datasets and discuss the potential improvements Deep Incremental Boosting brings to traditional Ensemble methods in Deep Learning. 
Deep Kernelized Autoencoder  In this paper we introduce the deep kernelized autoencoder, a neural network model that allows an explicit approximation of (i) the mapping from an input space to an arbitrary, userspecified kernel space and (ii) the backprojection from such a kernel space to input space. The proposed method is based on traditional autoencoders and is trained through a new unsupervised loss function. During training, we optimize both the reconstruction accuracy of input samples and the alignment between a kernel matrix given as prior and the inner products of the hidden representations computed by the autoencoder. Kernel alignment provides control over the hidden representation learned by the autoencoder. Experiments have been performed to evaluate both reconstruction and kernel alignment performance. Additionally, we applied our method to emulate kPCA on a denoising task obtaining promising results. 
Deep Laplacian Pyramid SuperResolution Network  Convolutional neural networks have recently demonstrated highquality reconstruction for single image superresolution. However, existing methods often require a large number of network parameters and entail heavy computational loads at runtime for generating highaccuracy superresolution results. In this paper, we propose the deep Laplacian Pyramid SuperResolution Network for fast and accurate image superresolution. The proposed network progressively reconstructs the subband residuals of highresolution images at multiple pyramid levels. In contrast to existing methods that involve the bicubic interpolation for preprocessing (which results in large feature maps), the proposed method directly extracts features from the lowresolution input space and thereby entails low computational loads. We train the proposed network with deep supervision using the robust Charbonnier loss functions and achieve highquality image reconstruction. Furthermore, we utilize the recursive layers to share parameters across as well as within pyramid levels, and thus drastically reduce the number of parameters. Extensive quantitative and qualitative evaluations on benchmark datasets show that the proposed algorithm performs favorably against the stateoftheart methods in terms of runtime and image quality. 
Deep Layer Aggregation  Convolutional networks have had great success in image classification and other areas of computer vision. Recent efforts have designed deeper or wider networks to improve performance; as convolutional blocks are usually stacked together, blocks at different depths represent information at different scales. Recent models have explored `skip’ connections to aggregate information across layers, but heretofore such skip connections have themselves been `shallow’, projecting to a single fusion node. In this paper, we investigate new deepacrosslayer architectures to aggregate the information from multiple layers. We propose novel iterative and hierarchical structures for deep layer aggregation. The former can produce deep high resolution representations from a network whose final layers have low resolution, while the latter can effectively combine scale information from all blocks. Results show that the our proposed architectures can make use of network parameters and features more efficiently without dictating convolution module structure. We also show transfer of the learned networks to semantic segmentation tasks and achieve better results than alternative networks with baseline training settings. 
Deep Learning  Deep learning is a set of algorithms in machine learning that attempt to model highlevel abstractions in data by using architectures composed of multiple nonlinear transformations. Deep learning is part of a broader family of machine learning methods based on learning representations. An observation (e.g., an image) can be represented in many ways (e.g., a vector of pixels), but some representations make it easier to learn tasks of interest (e.g., is this the image of a human face?) from examples, and research in this area attempts to define what makes better representations and how to create models to learn these representations. Various deep learning architectures such as deep neural networks, convolutional deep neural networks, and deep belief networks have been applied to fields like computer vision, automatic speech recognition, natural language processing, and music/audio signal recognition where they have been shown to produce stateoftheart results on various tasks. 
Deep Learning Accelerator Unit (DLAU) 
As the emerging field of machine learning, deep learning shows excellent ability in solving complex learning problems. However, the size of the networks becomes increasingly large scale due to the demands of the practical applications, which poses significant challenge to construct a high performance implementations of deep learning neural networks. In order to improve the performance as well to maintain the low power cost, in this paper we design DLAU, which is a scalable accelerator architecture for largescale deep learning networks using FPGA as the hardware prototype. The DLAU accelerator employs three pipelined processing units to improve the throughput and utilizes tile techniques to explore locality for deep learning applications. Experimental results on the stateoftheart Xilinx FPGA board demonstrate that the DLAU accelerator is able to achieve up to 36.1x speedup comparing to the Intel Core2 processors, with the power consumption at 234mW. 
Deep Learning Impact (DLI) 
Deep Learning Impact (DLI) is a set of software tools to help users develop AI models with the leading open source deep learning frameworks, like TensorFlow and Caffe, for the deployment and prediction phases of deep learning. DLI enables users to run distributed deep learning workloads on x86 and Power, and complements the PowerAI deep learning software distribution. 
Deep Learning Virtual Machine (DLVM) 
Many current approaches to deep learning make use of highlevel toolkits such as TensorFlow, Torch, or Caffe. Toolkits such as Caffe have a layerbased programming framework with hardcoded gradients specified for each layer type, making research using novel layer types problematic. Toolkits such as Torch and TensorFlow define a computation graph in a host language such as Python, where each node represents a linear algebra operation parallelized as a compute kernel on GPU and stores the result of evaluation; some of these toolkits subsequently perform runtime interpretation over that graph, storing the results of forward calculations and reverseaccumulated gradients at each node. This approach is more flexible, but these toolkits take a very limited and adhoc approach to performing optimization. Also problematic are the facts that most toolkits lack type safety, and target only a single (usually GPU) architecture, limiting users’ abilities to make use of heterogeneous and emerging hardware architectures. We introduce a novel framework for highlevel programming that addresses all of the above shortcomings. 
Deep Linear Discriminant Analysis (DeepLDA) 
We introduce Deep Linear Discriminant Analysis (DeepLDA) which learns linearly separable latent representations in an endtoend fashion. Classic LDA extracts features which preserve class separability and is used for dimensionality reduction for many classification problems. The central idea of this paper is to put LDA on top of a deep neural network. This can be seen as a nonlinear extension of classic LDA. Instead of maximizing the likelihood of target labels for individual samples, we propose an objective function that pushes the network to produce feature distributions which: (a) have low variance within the same class and (b) high variance between different classes. Our objective is derived from the general LDA eigenvalue problem and still allows to train with stochastic gradient descent and backpropagation. 
Deep Local Binary Patterns (Deep LBP) 
Local Binary Pattern (LBP) is a traditional descriptor for texture analysis that gained attention in the last decade. Being robust to several properties such as invariance to illumination translation and scaling, LBPs achieved stateoftheart results in several applications. However, LBPs are not able to capture highlevel features from the image, merely encoding features with low abstraction levels. In this work, we propose Deep LBP, which borrow ideas from the deep learning community to improve LBP expressiveness. By using parametrized datadriven LBP, we enable successive applications of the LBP operators with increasing abstraction levels. We validate the relevance of the proposed idea in several datasets from a wide range of applications. Deep LBP improved the performance of traditional and multiscale LBP in all cases. 
Deep Matching and Validation Network (DMVN) 
Image splicing is a very common image manipulation technique that is sometimes used for malicious purposes. A splicing detection and localization algorithm usually takes an input image and produces a binary decision indicating whether the input image has been manipulated, and also a segmentation mask that corresponds to the spliced region. Most existing splicing detection and localization pipelines suffer from two main shortcomings: 1) they use handcrafted features that are not robust against subsequent processing (e.g., compression), and 2) each stage of the pipeline is usually optimized independently. In this paper we extend the formulation of the underlying splicing problem to consider two input images, a query image and a potential donor image. Here the task is to estimate the probability that the donor image has been used to splice the query image, and obtain the splicing masks for both the query and donor images. We introduce a novel deep convolutional neural network architecture, called Deep Matching and Validation Network (DMVN), which simultaneously localizes and detects image splicing. The proposed approach does not depend on handcrafted features and uses raw input images to create deep learned representations. Furthermore, the DMVN is endtoend op timized to produce the probability estimates and the segmentation masks. Our extensive experiments demonstrate that this approach outperforms stateoftheart splicing detection methods by a large margin in terms of both AUC score and speed. 
Deep Matching Autoencoder (DMAE) 
Increasingly many real world tasks involve data in multiple modalities or views. This has motivated the development of many effective algorithms for learning a common latent space to relate multiple domains. However, most existing crossview learning algorithms assume access to paired data for training. Their applicability is thus limited as the paired data assumption is often violated in practice: many tasks have only a small subset of data available with pairing annotation, or even no paired data at all. In this paper we introduce Deep Matching Autoencoders (DMAE), which learn a common latent space and pairing from unpaired multimodal data. Specifically we formulate this as a crossdomain representation learning and object matching problem. We simultaneously optimise parameters of representation learning autoencoders and the pairing of unpaired multimodal data. This framework elegantly spans the full regime from fully supervised, semisupervised, and unsupervised (no paired data) multimodal learning. We show promising results in image captioning, and on a new task that is uniquely enabled by our methodology: unsupervised classifier learning. 
Deep Mean Maps (DMM) 
The use of distributions and highlevel features from deep architecture has become commonplace in modern computer vision. Both of these methodologies have separately achieved a great deal of success in many computer vision tasks. However, there has been little work attempting to leverage the power of these to methodologies jointly. To this end, this paper presents the Deep Mean Maps (DMMs) framework, a novel family of methods to nonparametrically represent distributions of features in convolutional neural network models. DMMs are able to both classify images using the distribution of toplevel features, and to tune the toplevel features for performing this task. We show how to implement DMMs using a special mean map layer composed of typical CNN operations, making both forward and backward propagation simple. 
Deep Multimodal Attention Network (DMAN) 
Learning social media data embedding by deep models has attracted extensive research interest as well as boomed a lot of applications, such as link prediction, classification, and crossmodal search. However, for social images which contain both link information and multimodal contents (e.g., text description, and visual content), simply employing the embedding learnt from network structure or data content results in suboptimal social image representation. In this paper, we propose a novel social image embedding approach called Deep Multimodal Attention Networks (DMAN), which employs a deep model to jointly embed multimodal contents and link information. Specifically, to effectively capture the correlations between multimodal contents, we propose a multimodal attention network to encode the finegranularity relation between image regions and textual words. To leverage the network structure for embedding learning, a novel SiameseTriplet neural network is proposed to model the links among images. With the joint deep model, the learnt embedding can capture both the multimodal contents and the nonlinear network information. Extensive experiments are conducted to investigate the effectiveness of our approach in the applications of multilabel classification and crossmodal search. Compared to stateoftheart image embeddings, our proposed DMAN achieves significant improvement in the tasks of multilabel classification and crossmodal search. 
Deep Mutual Learning (DML) 
Model distillation is an effective and widely used technique to transfer knowledge from a teacher to a student network. The typical application is to transfer from a powerful large network or ensemble to a small network, that is better suited to lowmemory or fast execution requirements. In this paper, we present a deep mutual learning (DML) strategy where, rather than one way transfer between a static predefined teacher and a student, an ensemble of students learn collaboratively and teach each other throughout the training process. Our experiments show that a variety of network architectures benefit from mutual learning and achieve compelling results on CIFAR100 recognition and Market1501 person reidentification benchmarks. Surprisingly, it is revealed that no prior powerful teacher network is necessary — mutual learning of a collection of simple student networks works, and moreover outperforms distillation from a more powerful yet static teacher. 
Deep Nearest Neighbor Descent (DNND) 
Most densitybased clustering methods largely rely on how well the underlying density is estimated. However, density estimation itself is also a challenging problem, especially the determination of the kernel bandwidth. A large bandwidth could lead to the oversmoothed density estimation in which the number of density peaks could be less than the true clusters, while a small bandwidth could lead to the undersmoothed density estimation in which spurious density peaks, or called the ‘ripple noise’, would be generated in the estimated density. In this paper, we propose a densitybased hierarchical clustering method, called the Deep Nearest Neighbor Descent (DNND), which could learn the underlying density structure layer by layer and capture the cluster structure at the same time. The oversmoothed density estimation could be largely avoided and the negative effect of the underestimated cases could be also largely reduced. Overall, DNND presents not only the strong capability of discovering the underlying cluster structure but also the remarkable reliability due to its insensitivity to parameters. 
Deep Neural Decision Forests  We present Deep Neural Decision Forests – a novel approach that unifies classification trees with the representation learning functionality known from deep convolutional networks, by training them in an endtoend manner. To combine these two worlds, we introduce a stochastic and differentiable decision tree model, which steers the representation learning usually conducted in the initial layers of a (deep) convolutional network. Our model differs from conventional deep networks because a decision forest provides the final predictions and it differs from conventional decision forests since we propose a principled, joint and global optimization of split and leaf node parameters. We show experimental results on benchmark machine learning datasets like MNIST and ImageNet and find onpar or superior results when compared to stateoftheart deep models. Most remarkably, we obtain Top5Errors of only 7:84%=6:38% on ImageNet validation data when integrating our forests in a singlecrop, single/seven model GoogLeNet architecture, respectively. Thus, even without any form of training data set augmentation we are improving on the 6.67% error obtained by the best GoogLeNet architecture (7 models, 144 crops). 
Deep Optimistic Linear Support Learning (DOL) 
We propose Deep Optimistic Linear Support Learning (DOL) to solve highdimensional multiobjective decision problems where the relative importances of the objectives are not known a priori. Using features from the highdimensional inputs, DOL computes the convex coverage set containing all potential optimal solutions of the convex combinations of the objectives. To our knowledge, this is the first time that deep reinforcement learning has succeeded in learning multiobjective policies. In addition, we provide a testbed with two experiments to be used as a benchmark for deep multiobjective reinforcement learning. 
Deep Policy Inference QNetwork (DPIQN) 
We present DPIQN, a deep policy inference Qnetwork that targets multiagent systems composed of controllable agents, collaborators, and opponents that interact with each other. We focus on one challenging issue in such systems—modeling agents with varying strategies—and propose to employ ‘policy features’ learned from raw observations (e.g., raw images) of collaborators and opponents by inferring their policies. DPIQN incorporates the learned policy features as a hidden vector into its own deep Qnetwork (DQN), such that it is able to predict better Q values for the controllable agents than the stateoftheart deep reinforcement learning models. We further propose an enhanced version of DPIQN, called deep recurrent policy inference Qnetwork (DRPIQN), for handling partial observability. Both DPIQN and DRPIQN are trained by an adaptive training procedure, which adjusts the network’s attention to learn the policy features and its own Qvalues at different phases of the training process. We present a comprehensive analysis of DPIQN and DRPIQN, and highlight their effectiveness and generalizability in various multiagent settings. Our models are evaluated in a classic soccer game involving both competitive and collaborative scenarios. Experimental results performed on 1 vs. 1 and 2 vs. 2 games show that DPIQN and DRPIQN demonstrate superior performance to the baseline DQN and deep recurrent Qnetwork (DRQN) models. We also explore scenarios in which collaborators or opponents dynamically change their policies, and show that DPIQN and DRPIQN do lead to better overall performance in terms of stability and mean scores. 
Deep Product Quantization (DPQ) 
Despite their widespread adoption, Product Quantization techniques were recently shown to be inferior to other hashing techniques. In this work, we present an improved Deep Product Quantization (DPQ) technique that leads to more accurate retrieval and classification than the latest state of the art methods, while having similar computational complexity and memory footprint as the Product Quantization method. To our knowledge, this is the first work to introduce a representation that is inspired by Product Quantization and which is learned endtoend, and thus benefits from the supervised signal. DPQ explicitly learns soft and hard representations to enable an efficient and accurate asymmetric search, by using a straightthrough estimator. A novel loss function, Joint Central Loss, is introduced, which both improves the retrieval performance, and decreases the discrepancy between the soft and the hard representations. Finally, by using a normalization technique, we improve the results for crossdomain category retrieval. 
Deep QNetwork (DQN) 
The theory of reinforcement learning provides a normative account, deeply rooted in psychological and neuroscientific perspectives on animal behaviour, of how agents may optimize their control of an environment. To use reinforcement learning successfully in situations approaching realworld complexity, however, agents are confronted with a difficult task: they must derive efficient representations of the environment from highdimensional sensory inputs, and use these to generalize past experience to new situations. Remarkably, humans and other animals seem to solve this problem through a harmonious combination of reinforcement learning and hierarchical sensory processing systems, the former evidenced by a wealth of neural data revealing notable parallels between the phasic signals emitted by dopaminergic neurons and temporal difference reinforcement learning algorithms. While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, lowdimensional state spaces. Here we use recent advances in training deep neural networks to develop a novel artificial agent, termed a deep Qnetwork, that can learn successful policies directly from highdimensional sensory inputs using endtoend reinforcement learning. We tested this agent on the challenging domain of classic Atari 2600 games. We demonstrate that the deep Qnetwork agent, receiving only the pixels and the game score as inputs, was able to surpass the performance of all previous algorithms and achieve a level comparable to that of a professional human games tester across a set of 49 games, using the same algorithm, network architecture and hyperparameters. This work bridges the divide between highdimensional sensory inputs and actions, resulting in the first artificial agent that is capable of learning to excel at a diverse array of challenging tasks. 
Deep Quaternion Network  The field of deep learning has seen significant advancement in recent years. However, much of the existing work has been focused on realvalued numbers. Recent work has shown that a deep learning system using the complex numbers can be deeper for a set parameter budget compared to its realvalued counterpart. In this work, we explore the benefits of generalizing one step further into the hypercomplex numbers, quaternions specifically, and provide the architecture components needed to build deep quaternion networks. We go over quaternion convolutions, present a quaternion weight initialization scheme, and present algorithms for quaternion batchnormalization. These pieces are tested by endtoend training on the CIFAR10 and CIFAR100 data sets to show the improved convergence to a realvalued network. 
Deep Rendering Mixture Model (DRMM) 
A Probabilistic Framework for Deep Learning 
Deep Rendering Model (DRM) 
In this paper, we develop a new theoretical framework that provides insights into both the successes and shortcomings of deep learning systems, as well as a principled route to their design and improvement. Our framework is based on a generative probabilistic model that explicitly captures variation due to latent nuisance variables. The Rendering Model (RM) explicitly models nuisance variation through a rendering function that combines the taskspecific variables of interest (e.g., object class in an object recognition task) and the collection of nuisance variables. The Deep Rendering Model (DRM) extends the RM in a hierarchical fashion by rendering via a product of affine nuisance transformations across multiple levels of abstraction. The graphical structures of the RM and DRM enable inference via message passing, using, for example, the sumproduct or maxsum algorithms, and training via the expectationmaximization (EM) algorithm. A key element of the framework is the relaxation of the RM/DRM generative model to a discriminative one in order to optimize the biasvariance tradeoff. 
Deep Residual Hashing  In this paper, we define an extension of the supersymmetric hyperbolic nonlinear sigma model introduced by Zirnbauer. We show that it arises as a weak joint limit of a timechanged version introduced by Sabot and Tarr\`es of the vertexreinforced jump process. It describes the asymptotics of rescaled crossing numbers, rescaled fluctuations of local times, asymptotic local times on a logarithmic scale, endpoints of paths, and last exit trees. 
Deep Rewiring (DEEP R) 
Neuromorphic hardware tends to pose limits on the connectivity of deep networks that one can run on them. But also generic hardware and software implementations of deep learning run more efficiently on sparse networks. Several methods exist for pruning connections of a neural network after it was trained without connectivity constraints. We present an algorithm, DEEP R, that enables us to train directly a sparsely connected neural network. DEEP R automatically rewires the network during supervised training so that connections are there where they are most needed for the task, while its total number is all the time strictly bounded. We demonstrate that DEEP R can be used to train very sparse feedforward and recurrent neural networks on standard benchmark tasks with just a minor loss in performance. DEEP R is based on a rigorous theoretical foundation that views rewiring as stochastic sampling of network configurations from a posterior. 
Deep Ritz Method  We propose a deep learning based method, the Deep Ritz Method, for numerically solving variational problems, particularly the ones that arise from partial differential equations. The Deep Ritz method is naturally nonlinear, naturally adaptive and has the potential to work in rather high dimensions. The framework is quite simple and fits well with the stochastic gradient descent method used in deep learning. We illustrate the method on several problems including some eigenvalue problems. 
Deep Roots  We propose a new method for training computationally efficient and compact convolutional neural networks (CNNs) using a novel sparse connection structure that resembles a tree root. Our sparse connection structure facilitates a significant reduction in computational cost and number of parameters of stateoftheart deep CNNs without compromising accuracy. We validate our approach by using it to train more efficient variants of stateoftheart CNN architectures, evaluated on the CIFAR10 and ILSVRC datasets. Our results show similar or higher accuracy than the baseline architectures with much less compute, as measured by CPU and GPU timings. For example, for ResNet 50, our model has 40% fewer parameters, 45% fewer floating point operations, and is 31% (12%) faster on a CPU (GPU). For the deeper ResNet 200 our model has 25% fewer floating point operations and 44% fewer parameters, while maintaining stateoftheart accuracy. For GoogLeNet, our model has 7% fewer parameters and is 21% (16%) faster on a CPU (GPU). 
Deep Rotation Equivariant Network (DREN) 
Recently, learning equivariant representations has attracted considerable research attention. Dieleman et al. introduce four operations which can be inserted to CNN to learn deep representations equivariant to rotation. However, feature maps should be copied and rotated four times in each layer in their approach, which causes much running time and memory overhead. In order to address this problem, we propose Deep Rotation Equivariant Network(DREN) consisting of cycle layers, isotonic layers and decycle layers.Our proposed layers apply rotation transformation on filters rather than feature maps, achieving a speed up of more than 2 times with even less memory overhead. We evaluate DRENs on Rotated MNIST and CIFAR10 datasets and demonstrate that it can improve the performance of stateoftheart architectures. Our codes are released on GitHub. 
Deep Sparse Subspace Clustering  In this paper, we present a deep extension of Sparse Subspace Clustering, termed Deep Sparse Subspace Clustering (DSSC). Regularized by the unit sphere distribution assumption for the learned deep features, DSSC can infer a new data affinity matrix by simultaneously satisfying the sparsity principle of SSC and the nonlinearity given by neural networks. One of the appealing advantages brought by DSSC is: when original realworld data do not meet the classspecific linear subspace distribution assumption, DSSC can employ neural networks to make the assumption valid with its hierarchical nonlinear transformations. To the best of our knowledge, this is among the first deep learning based subspace clustering methods. Extensive experiments are conducted on four realworld datasets to show the proposed DSSC is significantly superior to 12 existing methods for subspace clustering. 
Deep Subspace Clustering Network  We present a novel deep neural network architecture for unsupervised subspace clustering. This architecture is built upon deep autoencoders, which nonlinearly map the input data into a latent space. Our key idea is to introduce a novel selfexpressive layer between the encoder and the decoder to mimic the ‘selfexpressiveness’ property that has proven effective in traditional subspace clustering. Being differentiable, our new selfexpressive layer provides a simple but effective way to learn pairwise affinities between all data points through a standard backpropagation procedure. Being nonlinear, our neuralnetwork based method is able to cluster data points having complex (often nonlinear) structures. We further propose pretraining and finetuning strategies that let us effectively learn the parameters of our subspace clustering networks. Our experiments show that the proposed method significantly outperforms the stateoftheart unsupervised subspace clustering methods. 
Deep Successor Reinforcement Learning (DSR) 
Learning robust value functions given raw observations and rewards is now possible with modelfree and modelbased deep reinforcement learning algorithms. There is a third alternative, called Successor Representations (SR), which decomposes the value function into two components — a reward predictor and a successor map. The successor map represents the expected future state occupancy from any given state and the reward predictor maps states to scalar rewards. The value function of a state can be computed as the inner product between the successor map and the reward weights. In this paper, we present DSR, which generalizes SR within an endtoend deep reinforcement learning framework. DSR has several appealing properties including: increased sensitivity to distal reward changes due to factorization of reward and world dynamics, and the ability to extract bottleneck states (subgoals) given successor maps trained under a random policy. We show the efficacy of our approach on two diverse environments given raw pixel observations — simple gridworld domains (MazeBase) and the Doom game engine. 
Deep Survival  Previous research has shown that neural networks can model survival data in situations in which some patients’ death times are unknown, e.g. rightcensored. However, neural networks have rarely been shown to outperform their linear counterparts such as the Cox proportional hazards model. In this paper, we run simulated experiments and use real survival data to build upon the riskregression architecture proposed by Faraggi and Simon. We demonstrate that our model, DeepSurv, not only works as well as the standard linear Cox proportional hazards model but actually outperforms it in predictive ability on survival data with linear and nonlinear risk functions. We then show that the neural network can also serve as a recommender system by including a categorical variable representing a treatment group. This can be used to provide personalized treatment recommendations based on an individual’s calculated risk. We provide an open source Python module that implements these methods in order to advance research on deep learning and survival analysis. 
Deep Symbolic Network (DSN) 
We introduce the Deep Symbolic Network (DSN) model, which aims at becoming the whitebox version of Deep Neural Networks (DNN). The DSN model provides a simple, universal yet powerful structure, similar to DNN, to represent any knowledge of the world, which is transparent to humans. The conjecture behind the DSN model is that any type of real world objects sharing enough common features are mapped into human brains as a symbol. Those symbols are connected by links, representing the composition, correlation, causality, or other relationships between them, forming a deep, hierarchical symbolic network structure. Powered by such a structure, the DSN model is expected to learn like humans, because of its unique characteristics. First, it is universal, using the same structure to store any knowledge. Second, it can learn symbols from the world and construct the deep symbolic networks automatically, by utilizing the fact that real world objects have been naturally separated by singularities. Third, it is symbolic, with the capacity of performing causal deduction and generalization. Fourth, the symbols and the links between them are transparent to us, and thus we will know what it has learned or not – which is the key for the security of an AI system. Fifth, its transparency enables it to learn with relatively small data. Sixth, its knowledge can be accumulated. Last but not least, it is more friendly to unsupervised learning than DNN. We present the details of the model, the algorithm powering its automatic learning ability, and describe its usefulness in different use cases. The purpose of this paper is to generate broad interest to develop it within an open source project centered on the Deep Symbolic Network (DSN) model towards the development of general AI. 
Deep Texture Encoding Network (Deep TEN) 
We propose a Deep Texture Encoding Network (DeepTEN) with a novel Encoding Layer integrated on top of convolutional layers, which ports the entire dictionary learning and encoding pipeline into a single model. Current methods build from distinct components, using standard encoders with separate offtheshelf features such as SIFT descriptors or pretrained CNN features for material recognition. Our new approach provides an endtoend learning framework, where the inherent visual vocabularies are learned directly from the loss function. The features, dictionaries and the encoding representation for the classifier are all learned simultaneously. The representation is orderless and therefore is particularly useful for material and texture recognition. The Encoding Layer generalizes robust residual encoders such as VLAD and Fisher Vectors, and has the property of discarding domain specific information which makes the learned convolutional features easier to transfer. Additionally, joint training using multiple datasets of varied sizes and class labels is supported resulting in increased recognition performance. The experimental results show superior performance as compared to stateoftheart methods using goldstandard databases such as MINC2500, Flickr Material Database, KTHTIPS2b, and two recent databases 4DLightFieldMaterial and GTOS. The source code for the complete system are publicly available. 
Deep Variational Canonical Correlation Analysis (VCCA) 
We present deep variational canonical correlation analysis (VCCA), a deep multiview learning model that extends the latent variable model interpretation of linear CCA~\citep{BachJordan05a} to nonlinear observation models parameterized by deep neural networks (DNNs). Marginal data likelihood as well as inference are intractable under this model. We derive a variational lower bound of the data likelihood by parameterizing the posterior density of the latent variables with another DNN, and approximate the lower bound via Monte Carlo sampling. Interestingly, the resulting model resembles that of multiview autoencoders~\citep{Ngiam_11b}, with the key distinction of an additional sampling procedure at the bottleneck layer. We also propose a variant of VCCA called VCCAprivate which can, in addition to the ‘common variables’ underlying both views, extract the ‘private variables’ within each view. We demonstrate that VCCAprivate is able to disentangle the shared and private information for multiview data without hard supervision. 
Deep Visual Explanation (DVE) 
The practical impact of deep learning on complex supervised learning problems has been significant, so much so that almost every Artificial Intelligence problem, or at least a portion thereof, has been somehow recast as a deep learning problem. The applications appeal is significant, but this appeal is increasingly challenged by what some call the challenge of explainability, or more generally the more traditional challenge of debuggability: if the outcomes of a deep learning process produce unexpected results (e.g., less than expected performance of a classifier), then there is little available in the way of theories or tools to help investigate the potential causes of such unexpected behavior, especially when this behavior could impact people’s lives. We describe a preliminary framework to help address this issue, which we call ‘deep visual explanation’ (DVE). ‘Deep,’ because it is the development and performance of deep neural network models that we want to understand. ‘Visual,’ because we believe that the most rapid insight into a complex multidimensional model is provided by appropriate visualization techniques, and ‘Explanation,’ because in the spectrum from instrumentation by inserting print statements to the abductive inference of explanatory hypotheses, we believe that the key to understanding deep learning relies on the identification and exposure of hypotheses about the performance behavior of a learned deep model. In the exposition of our preliminary framework, we use relatively straightforward image classification examples and a variety of choices on initial configuration of a deep model building scenario. By careful but not complicated instrumentation, we expose classification outcomes of deep models using visualization, and also show initial results for one potential application of interpretability. 
Deep VisualSemantic Embedding Model (DeViSE) 
Modern visual recognition systems are often limited in their ability to scale to large numbers of object categories. This limitation is in part due to the increasing difficulty of acquiring sufficient training data in the form of labeled images as the number of object categories grows. One remedy is to leverage data from other sources – such as text data – both to train visual models and to constrain their predictions. In this paper we present a new deep visualsemantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text. We demonstrate that this model matches stateoftheart performance on the 1000class ImageNet object recognition challenge while making more semantically reasonable errors, and also show that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training. Semantic knowledge improves such zeroshot predictions by up to 65%, achieving hit rates of up to 10% across thousands of novel labels never seen by the visual model. 
Deep Web  The Deep Web, Deep Net, Invisible Web, or Hidden Web, refers to the content on the World Wide Web that is not indexed by standard search engines. Computer scientist Mike Bergman is credited with coining the term in 2000. 
DeepAM  Computer programs written in one language are often required to be ported to other languages to support multiple devices and environments. When programs use language specific APIs (Application Programming Interfaces), it is very challenging to migrate these APIs to the corresponding APIs written in other languages. Existing approaches mine API mappings from projects that have corresponding versions in two languages. They rely on the sparse availability of bilingual projects, thus producing a limited number of API mappings. In this paper, we propose an intelligent system called DeepAM for automatically mining API mappings from a largescale code corpus without bilingual projects. The key component of DeepAM is based on the multimodal sequence to sequence learning architecture that aims to learn joint semantic representations of bilingual API sequences from big source code data. Experimental results indicate that DeepAM significantly increases the accuracy of API mappings as well as the number of API mappings, when compared with the stateoftheart approaches. 
DeepArchitect  In deep learning, performance is strongly affected by the choice of architecture and hyperparameters. While there has been extensive work on automatic hyperparameter optimization for simple spaces, complex spaces such as the space of deep architectures remain largely unexplored. As a result, the choice of architecture is done manually by the human expert through a slow trial and error process guided mainly by intuition. In this paper we describe a framework for automatically designing and training deep models. We propose an extensible and modular language that allows the human expert to compactly represent complex search spaces over architectures and their hyperparameters. The resulting search spaces are treestructured and therefore easy to traverse. Models can be automatically compiled to computational graphs once values for all hyperparameters have been chosen. We can leverage the structure of the search space to introduce different model search algorithms, such as random search, Monte Carlo tree search (MCTS), and sequential modelbased optimization (SMBO). We present experiments comparing the different algorithms on CIFAR10 and show that MCTS and SMBO outperform random search. In addition, these experiments show that our framework can be used effectively for model discovery, as it is possible to describe expressive search spaces and discover competitive models without much effort from the human expert. Code for our framework and experiments has been made publicly available. 
DeepBalance  Class imbalance problems manifest in domains such as financial fraud detection or network intrusion analysis, where the prevalence of one class is much higher than another. Typically, practitioners are more interested in predicting the minority class than the majority class as the minority class may carry a higher misclassification cost. However, classifier performance deteriorates in the face of class imbalance as oftentimes classifiers may predict every point as the majority class. Methods for dealing with class imbalance include costsensitive learning or resampling techniques. In this paper, we introduce DeepBalance, an ensemble of deep belief networks trained with balanced bootstraps and random feature selection. We demonstrate that our proposed method outperforms baseline resampling methods such as SMOTE and under and oversampling in metrics such as AUC and sensitivity when applied to a highly imbalanced financial transaction data. Additionally, we explore performance and training time implications of various model parameters. Furthermore, we show that our model is easily parallelizable, which can reduce training times. Finally, we present an implementation of DeepBalance in R. 
DeepBoost  We present a new ensemble learning algorithm, DeepBoost, which can use as base classifiers a hypothesis set containing deep decision trees, or members of other rich or complex families, and succeed in achieving high accuracy without overfitting the data. The key to the success of the algorithm is a capacityconscious criterion for the selection of the hypotheses. We give new datadependent learning bounds for convex ensembles expressed in terms of the Rademacher complexities of the subfamilies composing the base classifier set, and the mixture weight assigned to each subfamily. Our algorithm directly benefits from these guarantees since it seeks to minimize the corresponding learning bound. We give a full description of our algorithm, including the details of its derivation, and report the results of several experiments showing that its performance compares favorably to that of AdaBoost and Logistic Regression and their L1regularized variants. DeepBoost 
DeepDetect  DeepDetect is a deep learning API and server written in C++11. It makes state of the art deep learning easy to work with and integrate into existing applications. 
DeepDive  DeepDive is a new type of system that enables developers to analyze data on a deeper level than ever before. DeepDive is a trained system: it uses machine learning techniques to leverage on domainspecific knowledge and incorporates user feedback to improve the quality of its analysis. 
DeepDSL  In recent years, Deep Learning (DL) has found great success in domains such as multimedia understanding. However, the complex nature of multimedia data makes it difficult to develop DLbased software. The stateofthe art tools, such as Caffe, TensorFlow, Torch7, and CNTK, while are successful in their applicable domains, are programming libraries with fixed user interface, internal representation, and execution environment. This makes it difficult to implement portable and customized DL applications. In this paper, we present DeepDSL, a domain specific language (DSL) embedded in Scala, that compiles deep networks written in DeepDSL to Java source code. Deep DSL provides (1) intuitive constructs to support compact encoding of deep networks; (2) symbolic gradient derivation of the networks; (3) static analysis for memory consumption and error detection; and (4) DSLlevel optimization to improve memory and runtime efficiency. DeepDSL programs are compiled into compact, efficient, customizable, and portable Java source code, which operates the CUDA and CUDNN interfaces running on Nvidia GPU via a Java Native Interface (JNI) library. We evaluated DeepDSL with a number of popular DL networks. Our experiments show that the compiled programs have very competitive runtime performance and memory efficiency compared to the existing libraries. 
DeepER  Entity Resolution (ER) is a fundamental problem with many applications. Machine learning (ML)based and rulebased approaches have been widely studied for decades, with many efforts being geared towards which features/attributes to select, which similarity functions to employ, and which blocking function to use – complicating the deployment of an ER system as a turnkey system. In this paper, we present DeepER, a turnkey ER system powered by deep learning (DL) techniques. The central idea is that distributed representations and representation learning from DL can alleviate the above human efforts for tuning existing ER systems. DeepER makes several notable contributions: encoding a tuple as a distributed representation of attribute values, building classifiers using these representations and a semantic aware blocking based on LSH, and learning and tuning the distributed representations for ER. We evaluate our algorithms on multiple benchmark datasets and achieve competitive results while requiring minimal interaction with experts. 
DeepFeat  A deep feature based saliency model (DeepFeat) is developed to leverage the understanding of the prediction of human fixations. Traditional saliency models often predict the human visual attention relying on few level image cues. Although such models predict fixations on a variety of image complexities, their approaches are limited to the incorporated features. In this study, we aim to provide an intuitive interpretation of convolu tional neural network deep features by combining low and high level visual factors. We exploit four evaluation metrics to evaluate the correspondence between the proposed framework and the groundtruth fixations. The key findings of the results demon strate that the DeepFeat algorithm, incorporation of bottom up and top down saliency maps, outperforms the individual bottom up and top down approach. Moreover, in comparison to nine 9 stateoftheart saliency models, our proposed DeepFeat model achieves satisfactory performance based on all four evaluation metrics. 
DeepFuse  We present a novel deep learning architecture for fusing static multiexposure images. Current multiexposure fusion (MEF) approaches use handcrafted features to fuse input sequence. However, the weak handcrafted representations are not robust to varying input conditions. Moreover, they perform poorly for extreme exposure image pairs. Thus, it is highly desirable to have a method that is robust to varying input conditions and capable of handling extreme exposure without artifacts. Deep representations have known to be robust to input conditions and have shown phenomenal performance in a supervised setting. However, the stumbling block in using deep learning for MEF was the lack of sufficient training data and an oracle to provide the groundtruth for supervision. To address the above issues, we have gathered a large dataset of multiexposure image stacks for training and to circumvent the need for ground truth images, we propose an unsupervised deep learning framework for MEF utilizing a noreference quality metric as loss function. The proposed approach uses a novel CNN architecture trained to learn the fusion operation without reference ground truth image. The model fuses a set of common low level features extracted from each image to generate artifactfree perceptually pleasing results. We perform extensive quantitative and qualitative evaluation and show that the proposed technique outperforms existing stateoftheart approaches for a variety of natural images. 
DeepGraph  The topological (or graph) structures of realworld networks are known to be predictive of multiple dynamic properties of the networks. Conventionally, a graph structure is represented using an adjacency matrix or a set of handcrafted structural features. These representations either fail to highlight local and global properties of the graph or suffer from a severe loss of structural information. There lacks an effective graph representation, which hinges the realization of the predictive power of network structures. In this study, we propose to learn the represention of a graph, or the topological structure of a network, through a deep learning model. This endtoend prediction model, named DeepGraph, takes the input of the raw adjacency matrix of a realworld network and outputs a prediction of the growth of the network. The adjacency matrix is first represented using a graph descriptor based on the heat kernel signature, which is then passed through a multicolumn, multiresolution convolutional neural network. Extensive experiments on five large collections of realworld networks demonstrate that the proposed prediction model significantly improves the effectiveness of existing methods, including linear or nonlinear regressors that use handcrafted features, graph kernels, and competing deep learning methods. 
DeepLearningKit  In this paper we present DeepLearningKit – an open source framework that supports using pretrained deep learning models (convolutional neural networks) for iOS, OS X and tvOS. DeepLearningKit is developed in Metal in order to utilize the GPU efficiently and Swift for integration with applications, e.g. iOSbased mobile apps on iPhone/iPad, tvOSbased apps for the big screen, or OS X desktop applications. The goal is to support using deep learning models trained with popular frameworks such as Caffe, Torch, TensorFlow, Theano, Pylearn, Deeplearning4J and Mocha. Given the massive GPU resources and time required to train Deep Learning models we suggest an App Store like model to distribute and download pretrained and reusable Deep Learning models. 
Deeply Supervised Object Detector (DSOD) 
We present Deeply Supervised Object Detector (DSOD), a framework that can learn object detectors from scratch. Stateoftheart object objectors rely heavily on the offtheshelf networks pretrained on largescale classification datasets like ImageNet, which incurs learning bias due to the difference on both the loss functions and the category distributions between classification and detection tasks. Model finetuning for the detection task could alleviate this bias to some extent but not fundamentally. Besides, transferring pretrained models from classification to detection between discrepant domains is even more difficult (e.g. RGB to depth images). A better solution to tackle these two critical problems is to train object detectors from scratch, which motivates our proposed DSOD. Previous efforts in this direction mostly failed due to much more complicated loss functions and limited training data in object detection. In DSOD, we contribute a set of design principles for training object detectors from scratch. One of the key findings is that deep supervision, enabled by dense layerwise connections, plays a critical role in learning a good detector. Combining with several other principles, we develop DSOD following the singleshot detection (SSD) framework. Experiments on PASCAL VOC 2007, 2012 and MS COCO datasets demonstrate that DSOD can achieve better results than the stateoftheart solutions with much more compact models. For instance, DSOD outperforms SSD on all three benchmarks with realtime detection speed, while requires only 1/2 parameters to SSD and 1/10 parameters to Faster RCNN. Our code and models are available at: https://…/DSOD . 
DeeplySupervised Nets (DSN) 
Our proposed deeplysupervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent. We make an attempt to boost the classification performance by studying a new formulation in deep networks. Three aspects in convolutional neural networks (CNN) style architectures are being looked at: (1) transparency of the intermediate layers to the overall classification; (2) discriminativeness and robustness of learned features, especially in the early layers; (3) effectiveness in training due to the presence of the exploding and vanishing gradients. We introduce ‘companion objective’ to the individual hidden layers, in addition to the overall objective at the output layer (a different strategy to layerwise pretraining). We extend techniques from stochastic gradient methods to analyze our algorithm. The advantage of our method is evident and our experimental result on benchmark datasets shows significant performance gain over existing methods (e.g. all stateoftheart results on MNIST, CIFAR10, CIFAR100, and SVHN). 
DeepPath  We study the problem of learning to reason in large scale knowledge graphs (KGs). More specifically, we describe a novel reinforcement learning framework for learning multihop relational paths: we use a policybased agent with continuous states based on knowledge graph embeddings, which reasons in a KG vector space by sampling the most promising relation to extend its path. In contrast to prior work, our approach includes a reward function that takes the accuracy, diversity, and efficiency into consideration. Experimentally, we show that our proposed method outperforms a pathranking based algorithm and knowledge graph embedding methods on Freebase and NeverEnding Language Learning datasets. 
DeepProbe  Information extraction and user intention identification are central topics in modern query understanding and recommendation systems. In this paper, we propose DeepProbe, a generic informationdirected interaction framework which is built around an attentionbased sequence to sequence (seq2seq) recurrent neural network. DeepProbe can rephrase, evaluate, and even actively ask questions, leveraging the generative ability and likelihood estimation made possible by seq2seq models. DeepProbe makes decisions based on a derived uncertainty (entropy) measure conditioned on user inputs, possibly with multiple rounds of interactions. Three applications, namely a rewritter, a relevance scorer and a chatbot for ad recommendation, were built around DeepProbe, with the first two serving as precursory building blocks for the third. We first use the seq2seq model in DeepProbe to rewrite a user query into one of standard query form, which is submitted to an ordinary recommendation system. Secondly, we evaluate DeepProbe’s seq2seq modelbased relevance scoring. Finally, we build a chatbot prototype capable of making active user interactions, which can ask questions that maximize information gain, allowing for a more efficient user intention idenfication process. We evaluate first two applications by 1) comparing with baselines by BLEU and AUC, and 2) human judge evaluation. Both demonstrate significant improvements compared with current stateoftheart systems, proving their values as useful tools on their own, and at the same time laying a good foundation for the ongoing chatbot application. 
DeepRank  This paper concerns a deep learning approach to relevance ranking in information retrieval (IR). Existing deep IR models such as DSSM and CDSSM directly apply neural networks to generate ranking scores, without explicit understandings of the relevance. According to the human judgement process, a relevance label is generated by the following three steps: 1) relevant locations are detected, 2) local relevances are determined, 3) local relevances are aggregated to output the relevance label. In this paper we propose a new deep learning architecture, namely DeepRank, to simulate the above human judgment process. Firstly, a detection strategy is designed to extract the relevant contexts. Then, a measure network is applied to determine the local relevances by utilizing a convolutional neural network (CNN) or twodimensional gated recurrent units (2DGRU). Finally, an aggregation network with sequential integration and term gating mechanism is used to produce a global relevance score. DeepRank well captures important IR characteristics, including exact/semantic matching signals, proximity heuristics, query term importance, and diverse relevance requirement. Experiments on both benchmark LETOR dataset and a large scale clickthrough data show that DeepRank can significantly outperform learning to ranking methods, and existing deep learning methods. 
DeepSense  Mobile sensing applications usually require timeseries inputs from sensors. Some applications, such as tracking, can use sensed acceleration and rate of rotation to calculate displacement based on physical system models. Other applications, such as activity recognition, extract manually designed features from sensor inputs for classification. Such applications face two challenges. On one hand, ondevice sensor measurements are noisy. For many mobile applications, it is hard to find a distribution that exactly describes the noise in practice. Unfortunately, calculating target quantities based on physical system and noise models is only as accurate as the noise assumptions. Similarly, in classification applications, although manually designed features have proven to be effective, it is not always straightforward to find the most robust features to accommodate diverse sensor noise patterns and user behaviors. To this end, we propose DeepSense, a deep learning framework that directly addresses the aforementioned noise and feature customization challenges in a unified manner. DeepSense integrates convolutional and recurrent neural networks to exploit local interactions among similar mobile sensors, merge local interactions of different sensory modalities into global interactions, and extract temporal relationships to model signal dynamics. DeepSense thus provides a general signal estimation and classification framework that accommodates a wide range of applications. We demonstrate the effectiveness of DeepSense using three representative and challenging tasks: car tracking with motion sensors, heterogeneous human activity recognition, and user identification with biometric motion analysis. DeepSense significantly outperforms the stateoftheart methods for all three tasks. In addition, DeepSense is feasible to implement on smartphones due to its moderate energy consumption and low latency 
DeepWalk  We present DeepWalk, a novel approach for learning latent representations of vertices in a network. These latent representations encode social relations in a continuous vector space, which is easily exploited by statistical models. DeepWalk generalizes recent advancements in language modeling and unsupervised feature learning (or deep learning) from sequences of words to graphs. DeepWalk uses local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. We demonstrate DeepWalk’s latent representations on several multilabel network classification tasks for social networks such as BlogCatalog, Flickr, and YouTube. Our results show that DeepWalk outperforms challenging baselines which are allowed a global view of the network, especially in the presence of missing information. DeepWalk’s representations can provide F1 scores up to 10% higher than competing methods when labeled data is sparse. In some experiments, DeepWalk’s representations are able to outperform all baseline methods while using 60% less training data. DeepWalk is also scalable. It is an online learning algorithm which builds useful incremental results, and is trivially parallelizable. These qualities make it suitable for a broad class of real world applications such as network classification, and anomaly detection. 
DeepXplore  Deep learning (DL) systems are increasingly deployed in securitycritical domains including selfdriving cars and malware detection, where the correctness and predictability of a system’s behavior for cornercase inputs are of great importance. However, systematic testing of largescale DL systems with thousands of neurons and millions of parameters for all possible cornercases is a hard problem. Existing DL testing depends heavily on manually labeled data and therefore often fails to expose different erroneous behaviors for rare inputs. We present DeepXplore, the first whitebox framework for systematically testing realworld DL systems. We address two problems: (1) generating inputs that trigger different parts of a DL system’s logic and (2) identifying incorrect behaviors of DL systems without manual effort. First, we introduce neuron coverage for estimating the parts of DL system exercised by a set of test inputs. Next, we leverage multiple DL systems with similar functionality as crossreferencing oracles and thus avoid manual checking for erroneous behaviors. We demonstrate how finding inputs triggering differential behaviors while achieving high neuron coverage for DL algorithms can be represented as a joint optimization problem and solved efficiently using gradientbased optimization techniques. DeepXplore finds thousands of incorrect cornercase behaviors in stateoftheart DL models trained on five popular datasets. For all tested DL models, on average, DeepXplore generated one test input demonstrating incorrect behavior within one second while running on a commodity laptop. The inputs generated by DeepXplore achieved 33.2% higher neuron coverage on average than existing testing methods. We further show that the test inputs generated by DeepXplore can also be used to retrain the corresponding DL model to improve classification accuracy or identify polluted training data. 
Deferred Acceptance Algorithm (DAA) 
The Deferred Acceptance Algorithm (DAA) goes back to Gale and Shapley (1962). They introduce a rather simple algorithm that finds a stable matching for example for college admissions or in a marriage market. In a marriage market where M men have preferences over W women, and men take the role of the proposing party, the DAA produces what is called the Mstable matching: each man strictly prefers the Mstable matching to any other potential matching. “Stable” means that no couple of a man and a woman could break the matching by choosing another mate. This is quite a strong result. Variations of this algoritm are used in Hospital assignments in the USA, whereby recently graduated doctors submit preferences over hospitals, and hospitals submit preferences over graduates. Another application is the kidney exchange, where the algorithm is used to find the best match between a set of donors and a set of receivers. matchingMarkets 
Definition Extraction Tool (DefExt) 
We present DefExt, an easy to use semi supervised Definition Extraction Tool. DefExt is designed to extract from a target corpus those textual fragments where a term is explicitly mentioned together with its core features, i.e. its definition. It works on the back of a Conditional Random Fields based sequential labeling algorithm and a bootstrapping approach. Bootstrapping enables the model to gradually become more aware of the idiosyncrasies of the target corpus. In this paper we describe the main components of the toolkit as well as experimental results stemming from both automatic and manual evaluation. We release DefExt as open source along with the necessary files to run it in any Unix machine. We also provide access to training and test data for immediate use. 
Deflated Deterministic Parallel Analysis (DDPA) 
➘ “Deterministic Parallel Analysis” 
Deformable Convolution  ➘ “Deformable Convolutional Networks” 
Deformable Convolutional Networks  Convolutional neural networks (CNNs) are inherently limited to model geometric transformations due to the fixed geometric structures in its building modules. In this work, we introduce two new modules to enhance the transformation modeling capacity of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained endtoend by standard backpropagation, giving rise to deformable convolutional networks. Extensive experiments validate the effectiveness of our approach on sophisticated vision tasks of object detection and semantic segmentation. The code would be released. 
Deformable RoI Pooling  ➘ “Deformable Convolutional Networks” 
Degradation Data Analysis  Given that products are more frequently being designed with higher reliability and developed in a shorter amount of time, it is often not possible to test new designs to failure under normal operating conditions. In some cases, it is possible to infer the reliability behavior of unfailed test samples with only the accumulated test time information and assumptions about the distribution. However, this generally leads to a great deal of uncertainty in the results. Another option in this situation is the use of degradation analysis. Degradation analysis involves the measurement of performance data that can be directly related to the presumed failure of the product in question. Many failure mechanisms can be directly linked to the degradation of part of the product, and degradation analysis allows the analyst to extrapolate to an assumed failure time based on the measurements of degradation over time. 
Degree Penalty  Network embedding aims to learn the lowdimensional representations of vertexes in a network, while structure and inherent properties of the network is preserved. Existing network embedding works primarily focus on preserving the microscopic structure, such as the first and secondorder proximity of vertexes, while the macroscopic scalefree property is largely ignored. Scalefree property depicts the fact that vertex degrees follow a heavytailed distribution (i.e., only a few vertexes have high degrees) and is a critical property of realworld networks, such as social networks. In this paper, we study the problem of learning representations for scalefree networks. We first theoretically analyze the difficulty of embedding and reconstructing a scalefree network in the Euclidean space, by converting our problem to the sphere packing problem. Then, we propose the ‘degree penalty’ principle for designing scalefree property preserving network embedding algorithm: punishing the proximity between highdegree vertexes. We introduce two implementations of our principle by utilizing the spectral techniques and a skipgram model respectively. Extensive experiments on six datasets show that our algorithms are able to not only reconstruct heavytailed distributed degree distribution, but also outperform stateoftheart embedding models in various network mining tasks, such as vertex classification and link prediction. 
Degree Weighted Lasso  DWLasso 
DeGrootFriedkin Model (DF) 
The DeGrootFriedkin model in , contains two stages and studies the evolution of selfconfidence, i.e., how confident an individual is for her opinions on a sequence of issues. In the first stage, individuals update their opinions for a particular issue according to the classical DeGroot model, and in the second stage, the selfconfidence for the next issue is governed by the reflected appraisal mechanism studied in ,. Reflected appraisal mechanism, in simple words, describes the phenomenon that individuals’ selfappraisals on some dimension (e.g., selfconfidence, selfesteem) are influenced by the appraisals of other individuals on them. 
Delaunay Diagram  
Delta Epsilon Alpha Star  Delta Epsilon Alpha Star is a minimal coverage, realtime robotic search algorithm that yields a moderately aggressive search path with minimal backtracking. Search performance is bounded by a placing a combinatorial bound, epsilon and delta, on the maximum deviation from the theoretical shortest path and the probability at which further deviations can occur. Additionally, we formally define the notion of PACadmissibility — a relaxed admissibility criteria for algorithms, and show that PACadmissible algorithms are better suited to robotic search situations than epsilonadmissible or strict algorithms. 
DelugeNets  Human brains are adept at dealing with the deluge of information they continuously receive, by suppressing the nonessential inputs and focusing on the important ones. Inspired by such capability, we propose Deluge Networks (DelugeNets), a novel class of neural networks facilitating massive crosslayer information inflows from preceding layers to succeeding layers. The connections between layers in DelugeNets are efficiently established through crosslayer depthwise convolutional layers with learnable filters, acting as a flexible selection mechanism. By virtue of the massive crosslayer information inflows, DelugeNets can propagate information across many layers with greater flexibility and utilize network parameters more effectively, compared to existing ResNet models. Experiments show the superior performances of DelugeNets in terms of both classification accuracies and parameter efficiencies. Remarkably, a DelugeNet model with just 20.2M parameters achieve stateoftheart error of 19.02% on CIFAR100 dataset, outperforming DenseNet model with 27.2M parameters. Moreover, DelugeNet performs comparably to ResNet200 on ImageNet dataset with merely half of the computations needed by the latter. 
Demand Sensing  Demand Sensing is a next generation forecasting method that leverages new mathematical techniques and near realtime information to create an accurate forecast of demand, based on the current realities of the supply chain. The typical performance of demand sensing systems reduces nearterm forecast error by 30% or more compared to traditional timeseries forecasting techniques. The jump in forecast accuracy helps companies manage the effects of market volatility and gain the benefits of a demanddriven supply chain, including more efficient operations, increased service levels, and a range of financial benefits including higher revenue, better profit margins, less inventory, better perfect order performance and a shorter cashtocash cycle time. Gartner, Inc. insight on demand sensing can be found in its report, “Supply Chain Strategy for Manufacturing Leaders: The Handbook for Becoming Demand Driven.” 
Deming Regression  In statistics, Deming regression, named after W. Edwards Deming, is an errorsinvariables model which tries to find the line of best fit for a twodimensional dataset. It differs from the simple linear regression in that it accounts for errors in observations on both the x and the y axis. It is a special case of total least squares, which allows for any number of predictors and a more complicated error structure. deming 
Democratization  Democratization is defined as the action/development of making something accessible to everyone, to the “common masses.” History provides democratization lessons from the Industrial and Information Revolutions. Both of these moments in history were driven by the standardization of parts, tools, architectures, interfaces, designs and trainings that allowed for the creation of common platforms. Instead of being dependent upon a “high priesthood” of specialists to assemble your guns or cars or computer systems, organizations of all sizes where able to leverage common platforms to build their own sources of customer, business and financial differentiation. 
Dempster’s Rule of Combination  How to combine two independent sets of probability mass assignments in specific situations. In case different sources express their beliefs over the frame in terms of belief constraints such as in case of giving hints or in case of expressing preferences, then Dempster’s rule of combination is the appropriate fusion operator. dst 
Dempster–Shafer Theory (DST) 
The DempsterShafer theory (DST) is a mathematical theory of evidence. It allows one to combine evidence from different sources and arrive at a degree of belief (represented by a belief function) that takes into account all the available evidence. The theory was first developed by Arthur P. Dempster and Glenn Shafer. In a narrow sense, the term DempsterShafer theory refers to the original conception of the theory by Dempster and Shafer. However, it is more common to use the term in the wider sense of the same general approach, as adapted to specific kinds of situations. In particular, many authors have proposed different rules for combining evidence, often with a view to handling conflicts in evidence better. dst 
Dendrogram  A dendrogram (from Greek dendron “tree” and gramma “drawing”) is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. 
Denoising Autoencoder (dA) 
The idea behind denoising autoencoders is simple. In order to force the hidden layer to discover more robust features and prevent it from simply learning the identity, we train the autoencoder to reconstruct the input from a corrupted version of it. The denoising autoencoder is a stochastic version of the autoencoder. Intuitively, a denoising autoencoder does two things: try to encode the input (preserve the information about the input), and try to undo the effect of a corruption process stochastically applied to the input of the autoencoder. The latter can only be done by capturing the statistical dependencies between the inputs. The denoising autoencoder can be understood from different perspectives (the manifold learning perspective, stochastic operator perspective, bottomup – information theoretic perspective, topdown – generative model perspective), all of which are explained in. See also section 7.2 of for an overview of autoencoders. In , the stochastic corruption process randomly sets some of the inputs (as many as half of them) to zero. Hence the denoising autoencoder is trying to predict the corrupted (i.e. missing) values from the uncorrupted (i.e., nonmissing) values, for randomly selected subsets of missing patterns. Note how being able to predict any subset of variables from the rest is a sufficient condition for completely capturing the joint distribution between a set of variables (this is how Gibbs sampling works). To convert the autoencoder class into a denoising autoencoder class, all we need to do is to add a stochastic corruption step operating on the input. The input can be corrupted in many ways, but in this tutorial we will stick to the original corruption mechanism of randomly masking entries of the input by making them zero. 
Denoising Random Forest  This paper proposes a novel type of random forests called a denoising random forests that are robust against noises contained in test samples. Such noisecorrupted samples cause serious damage to the estimation performances of random forests, since unexpected child nodes are often selected and the leaf nodes that the input sample reaches are sometimes far from those for a clean sample. Our main idea for tackling this problem originates from a binary indicator vector that encodes a traversal path of a sample in the forest. Our proposed method effectively employs this vector by introducing denoising autoencoders into random forests. A denoising autoencoder can be trained with indicator vectors produced from clean and noisy input samples, and nonleaf nodes where incorrect decisions are made can be identified by comparing the input and output of the trained denoising autoencoder. Multiple traversal paths with respect to the nodes with incorrect decisions caused by the noises can then be considered for the estimation. 
Dense Convolutional Network (DenseNet) 
Recent work has shown that convolutional networks can be substantially deeper, more accurate and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper we embrace this observation and introduce the Dense Convolutional Network (DenseNet), where each layer is directly connected to every other layer in a feedforward fashion. Whereas traditional convolutional networks with L layers have L connections, one between each layer and its subsequent layer (treating the input as layer 0), our network has L(L+1)/2 direct connections. For each layer, the feature maps of all preceding layers are treated as separate inputs whereas its own feature maps are passed on as inputs to all subsequent layers. Our proposed connectivity pattern has several compelling advantages: it alleviates the vanishing gradient problem and strengthens feature propagation; despite the increase in connections, it encourages feature reuse and leads to a substantial reduction of parameters; its models tend to generalize surprisingly well. We evaluate our proposed architecture on five highly competitive object recognition benchmark tasks. The DenseNet obtains significant improvements over the stateoftheart on all five of them (e.g., yielding 3.74% test error on CIFAR10, 19.25% on CIFAR100 and 1.59% on SVHN). 
Dense Transformer Networks  The key idea of current deep learning methods for dense prediction is to apply a model on a regular patch centered on each pixel to make pixelwise predictions. These methods are limited in the sense that the patches are determined by network architecture instead of learned from data. In this work, we propose the dense transformer networks, which can learn the shapes and sizes of patches from data. The dense transformer networks employ an encoderdecoder architecture, and a pair of dense transformer modules are inserted into each of the encoder and decoder paths. The novelty of this work is that we provide technical solutions for learning the shapes and sizes of patches from data and efficiently restoring the spatial correspondence required for dense prediction. The proposed dense transformer modules are differentiable, thus the entire network can be trained. We apply the proposed networks on natural and biological image segmentation tasks and show superior performance is achieved in comparison to baseline methods. 
Densely Connected Convolutional Network (DenseNet) 
Classical approaches for estimating optical flow have achieved rapid progress in the last decade. However, most of them are too slow to be applied in realtime video analysis. Due to the great success of deep learning, recent work has focused on using CNNs to solve such dense prediction problems. In this paper, we investigate a new deep architecture, Densely Connected Convolutional Networks (DenseNet), to learn optical flow. This specific architecture is ideal for the problem at hand as it provides shortcut connections throughout the network, which leads to implicit deep supervision. We extend current DenseNet to a fully convolutional network to learn motion estimation in an unsupervised manner. Evaluation results on three standard benchmarks demonstrate that DenseNet is a better fit than other widely adopted CNN architectures for optical flow estimation. 
Densitybased spatial clustering of applications with noise (DBSCAN) 
Densitybased spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, HansPeter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a densitybased clustering algorithm because it finds a number of clusters starting from the estimated density distribution of corresponding nodes. DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature. OPTICS can be seen as a generalization of DBSCAN to multiple ranges, effectively replacing the e parameter with a maximum search radius. dbscan 
Dependence Modeling  ➚ “Copula” http://…/9781466583221 http://…/7699_chap01.pdf 
Dependency Network  The dependency network approach provides a new system level analysis of the activity and topology of directed networks. The approach extracts causal topological relations between the network’s nodes (when the network structure is analyzed), and provides an important step towards inference of causal activity relations between the network nodes (when analyzing the network activity). This methodology has originally been introduced for the study of financial data, it has been extended and applied to other systems, such as the immune system, and semantic networks. In the case of network activity, the analysis is based on partial correlations, which are becoming ever more widely used to investigate complex systems. In simple words, the partial (or residual) correlation is a measure of the effect (or contribution) of a given node, say j, on the correlations between another pair of nodes, say i and k. Using this concept, the dependency of one node on another node, is calculated for the entire network. This results in a directed weighted adjacency matrix, of a fully connected network. Once the adjacency matrix has been constructed, different algorithms can be used to construct the network, such as a threshold network, Minimal Spanning Tree (MST), Planar Maximally Filtered Graph (PMFG), and others. 
DeployR  DeployR is a serverbased framework that provides simple, secure R integration for application developers. It’s available in two editions: DeployR Open, which is free and opensource; and Revolution R Enterprise DeployR, which adds a scalable grid framework and enterprise authentication features for production applications integrated with R. If you’re looking for an overview of what DeployR is and how you can use it to access R from other applications, we’ve just released a new white paper, Using DeployR to Solve the R Integration Problem. DeployR Data I/O 
Depthfirst Search (DFS) 
Depthfirst search (DFS) is an algorithm for traversing or searching tree or graph data structures. One starts at the root (selecting some arbitrary node as the root in the case of a graph) and explores as far as possible along each branch before backtracking. A version of depthfirst search was investigated in the 19th century by French mathematician Charles Pierre Tremaux as a strategy for solving mazes. 
Descriptive Statistics  Descriptive statistics is the discipline of quantitatively describing the main features of a collection of information, or the quantitative description itself. Descriptive statistics are distinguished from inferential statistics (or inductive statistics), in that descriptive statistics aim to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, are not developed on the basis of probability theory. Even when a data analysis draws its main conclusions using inferential statistics, descriptive statistics are generally also presented. For example in a paper reporting on a study involving human subjects, there typically appears a table giving the overall sample size, sample sizes in important subgroups (e.g., for each treatment or exposure group), and demographic or clinical characteristics such as the average age, the proportion of subjects of each sex, and the proportion of subjects with related comorbidities. 
DesignExecuteExamineDeploy – Framework (DEED) 

Despeckling Residual Neural Network (DRNN) 
Unsupervised Despeckling 
Determinantal Point Process (DPP) 
In mathematics, a determinantal point process is a stochastic point process, the probability distribution of which is characterized as a determinant of some function. Such processes arise as important tools in random matrix theory, combinatorics, and physics. Improving the Diversity of TopN Recommendation via Determinantal Point Process 
Deterministic Parallel Analysis (DPA) 
Factor analysis is widely used in many application areas. The first step, choosing the number of factors, remains a serious challenge. One of the most popular methods is parallel analysis (PA), which compares the observed factor strengths to simulated ones under a noiseonly model. % Abstracts are commonly just one paragraph. This paper presents a deterministic version of PA (DPA), which is faster and more reproducible than PA. We show that DPA selects large factors and does not select small factors just like [Dobriban, 2017] shows for PA. Both PA and DPA are prone to a shadowing phenomenon in which a strong factor makes it hard to detect smaller but more interesting factors. We develop a deflated version of DPA (DDPA) that counters shadowing. By raising the decision threshold in DDPA, a new method (DDPA+) also improves estimation accuracy. We illustrate our methods on data from the Human Genome Diversity Project (HGDP). There PA and DPA select seemingly too many factors, while DDPA+ selects only a few. A Matlab implementation is available. 
Deterministic Statistical Machine  E.g. you input a data set and then specify the question you are asking (is variable Y related to variable X? can i predict Z from W?) then, depending on your question, it uses a deterministic set of methods to analyze the data. Say regression for inference, linear discriminant analysis for prediction, etc. But the method is fixed and deterministic for each question. It also performs a prespecified set of checks for outliers, confounders, missing data, maybe even data fudging. It generates a report with a markdown tool and then immediately publishes the result. 
Determinized Sparse Partially Observable Tree (DESPOT) 
The partially observable Markov decision process (POMDP) provides a principled general framework for planning under uncertainty, but solving POMDPs optimally is computationally intractable, due to the ‘curse of dimensionality’ and the ‘curse of history’. To overcome these challenges, we introduce the Determinized Sparse Partially Observable Tree (DESPOT), a sparse approximation of the standard belief tree, for online planning under uncertainty. A DESPOT focuses online planning on a set of randomly sampled scenarios and compactly captures the ‘execution’ of all policies under these scenarios. We show that the best policy obtained from a DESPOT is nearoptimal, with a regret bound that depends on the representation size of the optimal policy. Leveraging this result, we give an anytime online planning algorithm, which searches a DESPOT for a policy that optimizes a regularized objective function. Regularization balances the estimated value of a policy under the sampled scenarios and the policy size, thus avoiding overfitting. The algorithm demonstrates strong experimental results, compared with some of the best online POMDP algorithms available. It has also been incorporated into an autonomous driving system for realtime vehicle control. 
DetMCD Algorithm  Most algorithms for highly robust estimators of multivariate location and scatter start by drawing a large number of random subsets. For instance, the FASTMCD algorithm of Rousseeuw and Van Driessen starts in this way, and then takes socalled concentration steps to obtain a more accurate approximation to the MCD. The FASTMCD algorithm is affine equivariant but not permutation invariant. In this article,we present a deterministic algorithm, denoted as DetMCD, which does not use random subsets and is even faster. It computes a small number of deterministic initial estimators, followed by concentration steps. DetMCD is permutation invariant and very close to affine equivariant. DetMCD 
Detrended Fluctuation Analysis (DFA) 
In stochastic processes, chaos theory and time series analysis, detrended fluctuation analysis (DFA) is a method for determining the statistical selfaffinity of a signal. It is useful for analysing time series that appear to be longmemory processes (diverging correlation time, e.g. powerlaw decaying autocorrelation function) or 1/f noise. The obtained exponent is similar to the Hurst exponent, except that DFA may also be applied to signals whose underlying statistics (such as mean and variance) or dynamics are nonstationary (changing with time). It is related to measures based upon spectral techniques such as autocorrelation and Fourier transform. Peng et al. introduced DFA in 1994 in a paper that has been cited over 2000 times as of 2013 and represents an extension of the (ordinary) fluctuation analysis (FA), which is affected by nonstationarities. 
Deviance  In statistics, deviance is a quality of fit statistic for a model that is often used for statistical hypothesis testing. It is a generalization of the idea of using the sum of squares of residuals in ordinary least squares to cases where modelfitting is achieved by maximum likelihood. 
Deviance Information Criterion (DIC) 
The deviance information criterion (DIC) is a hierarchical modeling generalization of the AIC (Akaike information criterion) and BIC (Bayesian information criterion, also known as the Schwarz criterion). It is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. Like AIC and BIC it is an asymptotic approximation as the sample size becomes large. It is only valid when the posterior distribution is approximately multivariate normal. The idea is that models with smaller DIC should be preferred to models with larger DIC. 
Dex  This paper introduces Dex, a reinforcement learning environment toolkit specialized for training and evaluation of continual learning methods as well as general reinforcement learning problems. We also present the novel continual learning method of incremental learning, where a challenging environment is solved using optimal weight initialization learned from first solving a similar easier environment. We show that incremental learning can produce vastly superior results than standard methods by providing a strong baseline method across ten Dex environments. We finally develop a saliency method for qualitative analysis of reinforcement learning, which shows the impact incremental learning has on network attention. 
Diagram Generating Function (DGF) 
The recentlyintroduced selflearning Monte Carlo method is a generalpurpose numerical method that speeds up Monte Carlo simulations by training an effective model to propose uncorrelated configurations in the Markov chain. We implement this method in the framework of continuous time Monte Carlo method with auxiliary field in quantum impurity models. We introduce and train a diagram generating function (DGF) to model the probability distribution of auxiliary field configurations in continuous imaginary time, at all orders of diagrammatic expansion. By using DGF to propose global moves in configuration space, we show that the selflearning continuoustime Monte Carlo method can significantly reduce the computational complexity of the simulation. 
Diceware  Diceware is a method for creating passphrases, passwords, and other cryptographic variables using an ordinary die from a pair of dice as a hardware random number generator. For each word in the passphrase, five rolls of the dice are required. The numbers from 1 to 6 that come up in the rolls are assembled as a five digit number, e.g. 43146. That number is then used to look up a word in a word list. In the English list 43146 corresponds to munch. Lists have been compiled for several languages, including English, Finnish, German, Italian, Polish, Romanian, Russian, Spanish and Swedish. A Diceware word list is any list of 6^5 = 7,776 unique words, preferably ones the user will find easy to spell and to remember. The contents of the word list do not have to be protected or concealed in any way, as the security of a Diceware passphrase is in the number of words selected, and the number of words each selected word could be taken from. The level of unpredictability of a Diceware passphrase can be easily calculated: each word adds 12.9 bits of entropy to the passphrase (that is, \log_2( 6^5 ) bits). Originally, in 1995, Diceware creator Arnold Reinhold considered five words (64 bits) the minimum length needed by average users. However, starting in 2014, Reinhold recommends that at least six words (77 bits) should be used. This level of unpredictability assumes that a potential attacker knows both that Diceware has been used to generate the passphrase, the particular word list used, and exactly how many words make up the passphrase. If the attacker has less information, the entropy can be greater than 12.9 bits per word. If words were simply concatenated rather than separated by spaces, concatenating could form words that are already in the word list. For example, ‘in’ and ‘put’ form ‘input’; all three words can be found in the above mentioned word list. This could slightly decrease the entropy, when compared with the recommended method of using spaces to separate each word in the list. riceware 
dIconomy  “dIconomy” = “Digital Economy” 
Dictionary Learning (DL) 
Dictionary Learning Algorithms for Sparse Representation Dictionary Learning 
Dictionary Learning – Separating the Particularity and the Commonality (DLCOPAR) 
Empirically, we find that, despite the classspecific features owned by the objects appearing in the images, the objects from different categories usually share some common patterns, which do not contribute to the discrimination of them. Concentrating on this observation and under the general dictionary learning (DL) framework, we propose a novel method to explicitly learn a common pattern pool (the commonality) and classspecific dictionaries (the particularity) for classification. We call our method DLCOPAR, which can learn the most compact and most discriminative classspecific dictionaries used for classification. The proposed DLCOPAR is extensively evaluated both on synthetic data and on benchmark image databases in comparison with existing DLbased classification methods. The experimental results demonstrate that DLCOPAR achieves very promising performances in various applications, such as face recognition, handwritten digit recognition, scene classification and object recognition. 
DifferencesinDifferences (DID) 
Difference in differences (sometimes ‘DifferenceinDifferences’, ‘DID’, or ‘DD’) is a statistical technique used in econometrics and quantitative sociology, which attempts to mimic an experimental research design using observational study data. It calculates the effect of a treatment (i.e., an explanatory variable or an independent variable) on an outcome (i.e., a response variable or dependent variable) by comparing the average change over time in the outcome variable for the treatment group to the average change over time for the control group. This method may be subject to certain biases (mean reversion bias, etc.), although it is intended to eliminate some of the effect of selection bias. In contrast to a withinsubjects estimate of the treatment effect (which measures differences over time) or a betweensubjects estimate of the treatment effect (which measures the difference between the treatment and control groups), the DID measures the difference in the differences between the treatment and control group over time. 
Differentiable Lasso (dlasso) 
DLASSO 
Differential Evolution (DE) 
In evolutionary computation, differential evolution (DE) is a method that optimizes a problem by iteratively trying to improve a candidate solution with regard to a given measure of quality. Such methods are commonly known as metaheuristics as they make few or no assumptions about the problem being optimized and can search very large spaces of candidate solutions. However, metaheuristics such as DE do not guarantee an optimal solution is ever found. DE is used for multidimensional realvalued functions but does not use the gradient of the problem being optimized, which means DE does not require for the optimization problem to be differentiable as is required by classic optimization methods such as gradient descent and quasinewton methods. DE can therefore also be used on optimization problems that are not even continuous, are noisy, change over time, etc. DE optimizes a problem by maintaining a population of candidate solutions and creating new candidate solutions by combining existing ones according to its simple formulae, and then keeping whichever candidate solution has the best score or fitness on the optimization problem at hand. In this way the optimization problem is treated as a black box that merely provides a measure of quality given a candidate solution and the gradient is therefore not needed. DE is originally due to Storn and Price. Books have been published on theoretical and practical aspects of using DE in parallel computing, multiobjective optimization, constrained optimization, and the books also contain surveys of application areas. http://…/9783540209508 
Differential Generative Adversarial Network (DGAN) 
In facerelated applications with a public available dataset, synthesizing nonlinear facial variations (e.g., facial expression, headpose, illumination, etc.) through a generative model is helpful in addressing the lack of training data. In reality, however, there is insufficient data to even train the generative model for face synthesis. In this paper, we propose Differential Generative Adversarial Networks (DGAN) that can perform photorealistic face synthesis even when training data is small. Two adversarial networks are devised to ensure the generator to approximate a face manifold, which can express face changes as it wants. Experimental results demonstrate that the proposed method is robust to the amount of training data and synthesized images are useful to improve the performance of a face expression classifier. 
Differential Item Functioning (DIF) 
Differential item functioning (DIF), also referred to as measurement bias, occurs when people from different groups (commonly gender or ethnicity) with the same latent trait (ability/skill) have a different probability of giving a certain response on a questionnaire or test. DIF analysis provides an indication of unexpected behavior of items on a test. An item does not display DIF if people from different groups have a different probability to give a certain response; it displays DIF if and only if people from different groups with the same underlying true ability have a different probability of giving a certain response. Common procedures for assessing DIF are MantelHaenszel, item response theory (IRT) based methods, and logistic regression. difR 
Differential Message Importance Measure (DMIM) 
Information collection is a fundamental problem in big data, where the size of sampling sets plays a very important role. This work considers the information collection process by taking message importance into account. Similar to differential entropy, we define differential message importance measure (DMIM) as a measure of message importance for continuous random variable. It is proved that the change of DMIM can describe the gap between the distribution of a set of sample values and a theoretical distribution. In fact, the deviation of DMIM is equivalent to KolmogorovSmirnov statistic, but it offers a new way to characterize the distribution goodnessoffit. Numerical results show some basic properties of DMIM and the accuracy of the proposed approximate values. Furthermore, it is also obtained that the empirical distribution approaches the real distribution with decreasing of the DMIM deviation, which contributes to the selection of suitable sampling points in actual system. 
Differential Privacy  In cryptography, differential privacy aims to provide means to maximize the accuracy of queries from statistical databases while minimizing the chances of identifying its records. 
Differential Privacy Cleaning  Data cleaning, or the process of detecting and repairing inaccurate or corrupt records in the data, is inherently humandriven. State of the art systems assume cleaning experts can access the data (or a sample of it) to tune the cleaning process. However, in many cases, privacy constraints disallow unfettered access to the data. To address this challenge, we observe and provide empirical evidence that data cleaning can be achieved without access to the sensitive data, but with access to a (noisy) query interface that supports a small set of linear counting query primitives. Motivated by this, we present DPClean, a first of a kind system that allows engineers tune data cleaning workflows while ensuring differential privacy. In DPClean, a cleaning engineer can pose sequences of aggregate counting queries with error tolerances. A privacy engine translates each query into a differentially private mechanism that returns an answer with error matching the specified tolerance, and allows the data owner track the overall privacy loss. With extensive experiments using human and simulated cleaning engineers on blocking and matching tasks, we demonstrate that our approach is able to achieve high cleaning quality while ensuring a reasonable privacy loss. 
Differentially Private Regression for DiscreteTime Survival Analysis  In survival analysis, regression models are used to understand the effects of explanatory variables (e.g., age, sex, weight, etc.) to the survival probability. However, for sensitive survival data such as medical data, there are serious concerns about the privacy of individuals in the data set when medical data is used to fit the regression models. The closest work addressing such privacy concerns is the work on Cox regression which linearly projects the original data to a lower dimensional space. However, the weakness of this approach is that there is no formal privacy guarantee for such projection. In this work, we aim to propose solutions for the regression problem in survival analysis with the protection of differential privacy which is a golden standard of privacy protection in data privacy research. To this end, we extend the Output Perturbation and Objective Perturbation approaches which are originally proposed to protect differential privacy for the Empirical Risk Minimization (ERM) problems. In addition, we also propose a novel sampling approach based on the Markov Chain Monte Carlo (MCMC) method to practically guarantee differential privacy with better accuracy. We show that our proposed approaches achieve good accuracy as compared to the nonprivate results while guaranteeing differential privacy for individuals in the private data set. 
DiffSharp  DiffSharp is an algorithmic differentiation or automatic differentiation (AD) library for the .NET ecosystem, which is targeted by the C# and F# languages, among others. The library has been designed with machine learning applications in mind, allowing very succinct implementations of models and optimization routines. DiffSharp is implemented in F# and exposes forward and reverse AD operators as general nestable higherorder functions, usable by any .NET language. It provides highperformance linear algebra primitives—scalars, vectors, and matrices, with a generalization to tensors underway—that are fully supported by all the AD operators, and which use a BLAS/LAPACK backend via the highly optimized OpenBLAS library. DiffSharp currently uses operator overloading, but we are developing a transformationbased version of the library using F#’s ‘code quotation’ metaprogramming facility. Work on a CUDAbased GPU backend is also underway. 
Diffusion Map  Diffusion maps is a machine learning algorithm introduced by R. R. Coifman and S. Lafon. It computes a family of embeddings of a data set into Euclidean space (often lowdimensional) whose coordinates can be computed from the eigenvectors and eigenvalues of a diffusion operator on the data. The Euclidean distance between points in the embedded space is equal to the “diffusion distance” between probability distributions centered at those points. Different from other dimensionality reduction methods such as principal component analysis (PCA) and multidimensional scaling (MDS), diffusion maps is a nonlinear method that focuses on discovering the underlying manifold that the data has been sampled from. By integrating local similarities at different scales, diffusion maps gives a global description of the dataset. Compared with other methods, the diffusion maps algorithm is robust to noise perturbation and is computationally inexpensive. 
Digital Analytics  Digital analytics is the analysis of qualitative and quantitative data from your business and the competition to drive a continual improvement of the online experience that your customers and potential customers have which translates to your desired outcomes (both online and offline). One of the most important steps of digital analytics is determining what your ultimate business objectives or outcomes are and how you expect to measure those outcomes. In the online world, there are five common business objectives: • For ecommerce sites, an obvious objective is selling products or services. • For lead generation sites, the goal is to collect user information for sales teams to connect with potential leads. • For content publishers, the goal is to encourage engagement and frequent visitation. • For online informational or support sites, helping users find the information they need at the right time is of primary importance. • For branding, the main objective is to drive awareness, engagement and loyalty. There are key actions on any website or mobile application that tie back to a business’ objectives. The actions can indicate an objective, like a purchase on an ecommerce site, has been fully met. These are ‘macro’ conversions. Some of the actions on a site might also be behavioral indicators that a customer hasn’t fully reached your main objectives but is coming closer, like, in the ecommerce example, signing up to receive an email coupon or a new product notification. These are ‘micro’ conversions. It’s important to measure both micro and macro conversions so that you are equipped with more behavioral data to understand what experiences help drive the right outcomes for your site. 
Digital Analytics Association (DAA) 
The Digital Analytics Association makes analytics professionals more effective and valuable through professional development and community. 
Digital Asset Management (DAM) 
Digital asset management (DAM) consists of management tasks and decisions surrounding the ingestion, annotation, cataloguing, storage, retrieval and distribution of digital assets. Digital photographs, animations, videos and music exemplify the target areas of media asset management (a subcategory of DAM). Digital asset management systems (DAMS) include computer software and hardware systems that aid in the process of digital asset management. The term “digital asset management” (DAM) also refers to the protocol for downloading, renaming, backing up, rating, grouping, archiving, optimizing, maintaining, thinning, and exporting files. The “media asset management” (MAM) subcategory of digital asset management mainly addresses audio, video and other media content. The more recent concept of enterprise content management (ECM) often deals with solutions which address similar features but in a wider range of industries or applications. 
Digital Hoarding  Digital hoarding (also known as ehoarding) is excessive acquisition and reluctance to delete electronic material no longer valuable to the user. The behavior includes the mass storage of digital artifacts and the retainment of unnecessary or irrelevant electronic data. The term is increasingly common in pop culture, used to describe the habitual characteristics of compulsive hoarding, but in cyberspace. As with physical space in which excess items are described as ‘clutter’ or ‘junk,’ excess digital media is often referred to as ‘digital clutter.’ 
Digital Native (DN) 
The term Digital Native was coined and popularized by education consultant, Marc Prensky in his 2001 article entitled Digital Natives, Digital Immigrants, in which he relates the contemporaneous decline in American education to educators’ failure to understand the needs of modern students. His article posited that ‘the arrival and rapid dissemination of digital technology in the last decade of the 20th century’ had fundamentally changed the way students think and process information, making it impossible for them to excel academically using the outdated teaching methods of the day. In other words, children raised in the postdigital, media saturated world, require a mediarich learning environment to hold their attention. Contextually, his ideas were introduced after a decade of worry over increased diagnosis of children with ADD and ADHD, which itself turned out to be largely overblown. Prensky did not strictly define the Digital Native in his 2001 article, but it was later, somewhat arbitrarily, applied to children born after 1980, due to the fact that computer bulletin board systems, and Usenet were already in use at the time. The idea became popular among educators and parents, whose children fell within Prensky’s definition of a Digital Native, and has since been embraced as an effective marketing tool. 
Digital Twin  Digital twin refers to a digital replica of physical assets, processes and systems that can be used for various purposes. The digital representation provides both the elements and the dynamics of how an Internet of Things device operates and lives throughout its life cycle. Digital Twins integrate artificial intelligence, machine learning and software analytics with data to create living digital Simulation models that update and change as their physical counterparts’ change. A digital twin continuously learns and updates itself from multiple sources to represent their near realtime status, working condition or position. This learning system, learns from itself, using sensor data that conveys various aspects of its operating condition; from human experts, such as engineers with deep and relevant industry domain knowledge; from other similar machines; from other similar fleets of machines; and from the larger systems and environment in which it may be a part of. A digital twin also integrates historical data from past machine usage to factor into its digital model. 
Dijkstra Algorithm  Dijkstra’s algorithm, conceived by computer scientist Edsger Dijkstra in 1956 and published in 1959, is a graph search algorithm that solves the singlesource shortest path problem for a graph with nonnegative edge path costs, producing a shortest path tree. This algorithm is often used in routing and as a subroutine in other graph algorithms. 
Dilated Recurrent Neural Network (DILATEDRNN) 
Notoriously, learning with recurrent neural networks (RNNs) on long sequences is a difficult task. There are three major challenges: 1) extracting complex dependencies, 2) vanishing and exploding gradients, and 3) efficient parallelization. In this paper, we introduce a simple yet effective RNN connection structure, the DILATEDRNN, which simultaneously tackles all these challenges. The proposed architecture is characterized by multiresolution dilated recurrent skip connections and can be combined flexibly with different RNN cells. Moreover, the DILATEDRNN reduces the number of parameters and enhances training efficiency significantly, while matching stateoftheart performance (even with Vanilla RNN cells) in tasks involving very longterm dependencies. To provide a theorybased quantification of the architecture’s advantages, we introduce a memory capacity measure – the mean recurrent length, which is more suitable for RNNs with long skip connections than existing measures. We rigorously prove the advantages of the DILATEDRNN over other recurrent neural architectures. 
Dimensionality Reduction  In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction. 
DimmWitted  A storage abstraction that captures the access patterns of popular statistical analytics tasks and a prototype called DimmWitted. 
dimple  dimple is a simpletouse charting API powered by D3.js. The aim of dimple is to open up the power and flexibility of d3 to analysts. It aims to give a gentle learning curve and minimal code to achieve something productive. It also exposes the d3 objects so you can pick them up and run to create some really cool stuff. 
DiNoDB  As data sets grow in size, analytics applications struggle to get instant insight into large datasets. Modern applications involve heavy batch processing jobs over large volumes of data and at the same time require efficient adhoc interactive analytics on temporary data. Existing solutions, however, typically focus on one of these two aspects, largely ignoring the need for synergy between the two. Consequently, interactive queries need to reiterate costly passes through the entire dataset (e.g., data loading) that may provide meaningful return on investment only when data is queried over a long period of time. In this paper, we propose DiNoDB, an interactivespeed query engine for adhoc queries on temporary data. DiNoDB avoids the expensive loading and transformation phase that characterizes both traditional RDBMSs and current interactive analytics solutions. It is tailored to modern workflows found in machine learning and data exploration use cases, which often involve iterations of cycles of batch and interactive analytics on data that is typically useful for a narrow processing window. The key innovation of DiNoDB is to piggyback on the batch processing phase the creation of metadata that DiNoDB exploits to expedite the interactive queries. Our experimental analysis demonstrates that DiNoDB achieves very good performance for a wide range of adhoc queries compared to alternatives %such as Hive, Stado, SparkSQL and Impala. 
Dionysius  We address the following problem: How do we incorporate user item interaction signals as part of the relevance model in a largescale personalized recommendation system such that, (1) the ability to interpret the model and explain recommendations is retained, and (2) the existing infrastructure designed for the (user profile) contentbased model can be leveraged? We propose Dionysius, a hierarchical graphical model based framework and system for incorporating user interactions into recommender systems, with minimal change to the underlying infrastructure. We learn a hidden fields vector for each user by considering the hierarchy of interaction signals, and replace the user profilebased vector with this learned vector, thereby not expanding the feature space at all. Thus, our framework allows the use of existing recommendation infrastructure that supports content based features. We implemented and deployed this system as part of the recommendation platform at LinkedIn for more than one year. We validated the efficacy of our approach through extensive offline experiments with different model choices, as well as online A/B testing experiments. Our deployment of this system as part of the job recommendation engine resulted in significant improvement in the quality of retrieved results, thereby generating improved user experience and positive impact for millions of users. 
DiracNet  Deep neural networks with skipconnections, such as ResNet, show excellent performance in various image classification benchmarks. It is though observed that the initial motivation behind them – training deeper networks – does not actually hold true, and the benefits come from increased capacity, rather than from depth. Motivated by this, and inspired from ResNet, we propose a simple Dirac weight parameterization, which allows us to train very deep plain networks without skipconnections, and achieve nearly the same performance. This parameterization has a minor computational cost at training time and no cost at all at inference. We’re able to achieve 95.5% accuracy on CIFAR10 with 34layer deep plain network, surpassing 1001layer deep ResNet, and approaching Wide ResNet. Our parameterization also mostly eliminates the need of careful initialization in residual and nonresidual networks. The code and models for our experiments are available at https://…/diracnets 
Directed Acyclic Graph (DAG) 
In mathematics and computer science, a directed acyclic graph (DAG), is a directed graph with no directed cycles. That is, it is formed by a collection of vertices and directed edges, each edge connecting one vertex to another, such that there is no way to start at some vertex v and follow a sequence of edges that eventually loops back to v again. dimple 
Directed Acyclic Graph AutoRegressive (DAGAR) 

Directed Graph  In mathematics, and more specifically in graph theory, a directed graph (or digraph) is a graph, or set of nodes connected by edges, where the edges have a direction associated with them. In formal terms, a digraph is a pair G=(V,A) (sometimes G=(V,E)) of: • a set V, whose elements are called vertices or nodes, • a set A of ordered pairs of vertices, called arcs, directed edges, or arrows (and sometimes simply edges with the corresponding set named E instead of A). It differs from an ordinary or undirected graph, in that the latter is defined in terms of unordered pairs of vertices, which are usually called edges. A digraph is called ‘simple’ if it has no loops, and no multiple arcs (arcs with same starting and ending nodes). A directed multigraph, in which the arcs constitute a multiset, rather than a set, of ordered pairs of vertices may have loops (that is, ‘selfloops’ with same starting and ending node) and multiple arcs. Some, but not all, texts allow a digraph, without the qualification simple, to have self loops, multiple arcs, or both. 
Directional Statistics  Directional statistics is the subdiscipline of statistics that deals with directions (unit vectors in Rn), axes (lines through the origin in Rn) or rotations in Rn. More generally, directional statistics deals with observations on compact Riemannian manifolds. The fact that 0 degrees and 360 degrees are identical angles, so that for example 180 degrees is not a sensible mean of 2 degrees and 358 degrees, provides one illustration that special statistical methods are required for the analysis of some types of data (in this case, angular data). Other examples of data that may be regarded as directional include statistics involving temporal periods (e.g. time of day, week, month, year, etc.), compass directions, dihedral angles in molecules, orientations, rotations and so on. Directional 
Dirichlet Distribution  When it comes to recommendation systems and natural language processing, data that can be modeled as a multinomial or as a vector of counts is ubiquitous. For example if there are 2 possible usergenerated ratings (like and dislike), then each item is represented as a vector of 2 counts. In a higher dimensional case, each document may be expressed as a count of words, and the vector size is large enough to encompass all the important words in that corpus of documents. The Dirichlet distribution is one of the basic probability distributions for describing this type of data. 
Dirichlet Lasso (DLASSO) 
Selection of the most important predictor variables in regression analysis is one of the key problems statistical research has been concerned with for long time. In this article, we propose the methodology, Dirichlet Lasso (abbreviated as DLASSO) to address this issue in a Bayesian framework. In many modern regression settings, large set of predictor variables are grouped and the coefficients belonging to any one of these groups are either all redundant or all important in predicting the response; we say in those cases that the predictors exhibit a group structure. We show that DLASSO is particularly useful where the group structure is not fully known. We exploit the clustering property of Dirichlet Process priors to infer the possibly missing group information. The Dirichlet Process has the advantage of simultaneously clustering the variable coefficients and selecting the best set of predictor variables. We compare the predictive performance of DLASSO to Group Lasso and ordinary Lasso with real data and simulation studies. Our results demonstrate that the predictive performance of DLASSO is almost as good as that of Group Lasso when group label information is given; and superior to the ordinary Lasso for missing group information. For high dimensional data (e.g., genetic data) with missing group information, DLASSO will be a powerful approach of variable selection since it provides a superior predictive performance and higher statistical accuracy. 
Dirichlet Process (DP) 
In probability theory, a Dirichlet process is a way of assigning a probability distribution over probability distributions. That is, a Dirichlet process is a probability distribution whose domain is itself a set of probability distributions. The probability distributions in the domain are almost surely discrete and may be infinite dimensional. Assigning an arbitrary probability distribution over a domain of infinite dimensional probability distributions would require an infinite amount of computational resources. The main function of the Dirichlet process is that it allows the specification of a distribution over infinite dimensional distributions in a way that uses only finite resources. 
Dirichlet Process Mixture Model (DPMM) 
The Dirichlet process is a family of nonparametric Bayesian models which are commonly used for density estimation, semiparametric modelling and model selection/averaging. The Dirichlet processes are nonparametric in a sense that they have infinite number of parameters. Since they are treated in a Bayesian approach we are able to construct large models with infinite parameters which we integrate out to avoid overfitting. 
Disciplined Convex Optimization  An objectoriented modeling language for disciplined convex programming (DCP). It allows the user to formulate convex optimization problems in a natural way following mathematical convention and DCP rules. The system analyzes the problem, verifies its convexity, converts it into a canonical form, and hands it off to an appropriate solver to obtain the solution. ➘ “Disciplined Convex Programming” CVXR 
Disciplined Convex Programming (DCP) 
Convex programming is a subclass of nonlinear programming (NLP) that unifies and generalizes least squares (LS), linear programming (LP), and convex quadratic programming (QP). It has become quite popular recently for a number of reasons, including its attractive theoretical properties, efficient numerical algorithms, and practical applications. Nevertheless, there remains a significant impediment to the more widespread adoption of convex programming: the high level of expertise required to use it. We introduce a new modeling methodology called disciplined convex programming. As the term ‘disciplined’ suggests, the methodology imposes a set of conventions that one must follow when constructing convex programs. The conventions are simple and teachable, taken from basic principles of convex analysis, and inspired by the practices of those who regularly study and apply convex optimization today. The conventions do not limit generality; but they do allow much of the manipulation and transformation required to analyze and solve convex programs to be automated. ➘ “Disciplined Convex Optimization” 
Discounted Cumulative Gain (DCG) 
Discounted cumulative gain (DCG) is a measure of ranking quality. In information retrieval, it is often used to measure effectiveness of web search engine algorithms or related applications. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom with the gain of each result discounted at lower ranks. 
Discourse Analysis (DA) 
Discourse analysis (DA), or discourse studies, is a general term for a number of approaches to analyzing written, vocal, or sign language use or any significant semiotic event. The objects of discourse analysis – discourse, writing, conversation, communicative event – are variously defined in terms of coherent sequences of sentences, propositions, speech, or turnsattalk. Contrary to much of traditional linguistics, discourse analysts not only study language use ‘beyond the sentence boundary’, but also prefer to analyze ‘naturally occurring’ language use, and not invented examples. Text linguistics is related. The essential difference between discourse analysis and text linguistics is that it aims at revealing sociopsychological characteristics of a person/persons rather than text structure. Discourse analysis has been taken up in a variety of social science disciplines, including linguistics, education, sociology, anthropology, social work, cognitive psychology, social psychology, area studies, cultural studies, international relations, human geography, communication studies, and translation studies, each of which is subject to its own assumptions, dimensions of analysis, and methodologies. 
Discover, Access, Distill (DAD) 
DAD is comprised of: • Discover: Find, identify the sources of good data, and the metrics. Sometimes request the data to be created (work with data engineers and business analysts) • Access: Access the data. Sometimes via an API, a web crawler, an Internet download, a database access or sometimes inmemory within a database. • Distill: Extract essence from data, the stuff that leads to decisions, increased ROI, and actions (such as determining optimum bid prices in an automated bidding system). It involves • Exploring the data (creating a data dictionary and exploratory analysis) • Cleaning (removing impurities) • Refining (data summarization, sometimes multiple layers of summarization or hierarchical summarization) Analyzing: statistical analyses (sometimes including stuff like experimental design that can take place even before the Access stage), both automated and manual. Might or might not require statistical modeling • Presenting results or integrating results in some automated process 
Discrete Dantzig Selector  We propose a new highdimensional linear regression estimator: the Discrete Dantzig Selector, which minimizes the number of nonzero regression coefficients, subject to a budget on the maximal absolute correlation between the features and the residuals. We show that the estimator can be expressed as a solution to a Mixed Integer Linear Optimization (MILO) problem—a computationally tractable framework that enables the computation of provably optimal global solutions. Our approach has the appealing characteristic that even if we terminate the optimization problem at an early stage, it exits with a certificate of suboptimality on the quality of the solution. We develop new discrete first order methods, motivated by recent algorithmic developments in first order continuous convex optimization, to obtain high quality feasible solutions for the Discrete Dantzig Selector problem. Our proposal leads to advantages over the offtheshelf stateoftheart integer programming algorithms, which include superior upper bounds obtained for a given computational budget. When a solution obtained from the discrete first order methods is passed as a warmstart to a MILO solver, the performance of the latter improves significantly. Exploiting problem specific information, we propose enhanced MILO formulations that further improve the algorithmic performance of the MILO solvers. We demonstrate, both theoretically and empirically, that, in a wide range of regimes, the statistical properties of the Discrete Dantzig Selector are superior to those of popular $\ell_{1}$based approaches. For problem instances with $p \approx 2500$ features and $n \approx 900$ observations, our computational framework delivers optimal solutions in a few minutes and certifies optimality within an hour. 
Discrete Event Simulation (DES) 
In the field of simulation, a discreteevent simulation (DES), models the operation of a system as a discrete sequence of events in time. Each event occurs at a particular instant in time and marks a change of state in the system. Between consecutive events, no change in the system is assumed to occur; thus the simulation can directly jump in time from one event to the next. This contrasts with continuous simulation in which the simulation continuously tracks the system dynamics over time. Instead of being eventbased, this is called an activitybased simulation; time is broken up into small time slices and the system state is updated according to the set of activities happening in the time slice. Because discreteevent simulations do not have to simulate every time slice, they can typically run much faster than the corresponding continuous simulation. Another alternative to eventbased simulation is processbased simulation. In this approach, each activity in a system corresponds to a separate process, where a process is typically simulated by a thread in the simulation program. In this case, the discrete events, which are generated by threads, would cause other threads to sleep, wake, and update the system state. A more recent method is the threephased approach to discrete event simulation (Pidd, 1998). In this approach, the first phase is to jump to the next chronological event. The second phase is to execute all events that unconditionally occur at that time (these are called Bevents). The third phase is to execute all events that conditionally occur at that time (these are called Cevents). The three phase approach is a refinement of the eventbased approach in which simultaneous events are ordered so as to make the most efficient use of computer resources. The threephase approach is used by a number of commercial simulation software packages, but from the user’s point of view, the specifics of the underlying simulation method are generally hidden. simmer,DES 
Discrete Morse Theory  Discrete Morse theory is a tool for determining equivalences between topological spaces arising from discrete mathematical structures. This theory was developed by Robin Forman in the 1990s as a combinatorial analog to Morse theory, developed by Marston Morse in the 1920s. The original theory deals with analyzing such equivalences for general topological spaces, while discrete Morse theory provides similar methods of analysis for topological spaces endowed with additional, discrete structure. For these structures, applications of the discrete theory are often more natural, as well as simpler and more straightforward to apply. Discrete Morse theory has applications throughout many fields of pure and applied mathematics. Within pure mathematics, for example, the theory has been widely applied to problems in geometry, topology, and knot theory; and within computer science, the theory has been used to evaluate data compression algorithms and to bound the complexity of algorithms that determine whether graphs have certain properties – for example, whether all components of a graph are connected. If we wish to know whether a given property holds for a certain topological space, our question can often be reduced to the question of whether the space is equivalent to another space for which the property holds. For example, whether a simple algorithm exists for determining if a graph is connected depends on whether the structure that represents the space of notconnected graphs can be shrunken to a point. Alas, it cannot, so any algorithm for testing graph connectedness must, at least in some cases, conduct an exhaustive search. This result has realworld implications: for example, it means that if we want to test a communications system – say, immediately after a disaster – to determine whether it is still connected, there is no guaranteed way of finding the answer without testing every component individually. http://…/Discrete_Morse_theory http://…/Morse_theory http://…/s48forman.pdf TDAmapper 
Discrete Sparklines  
Discriminant Analysis  Discriminant analysis is used to distinguish distinct sets of observations and allocate new observations to previously defined groups. This method is commonly used in biological species classification, in medical classification of tumors, in facial recognition technologies, and in the credit card and insurance industries for determining risk. HiDimDA 
Discriminant Function Analysis  Discriminant function analysis is a statistical analysis to predict a categorical dependent variable (called a grouping variable) by one or more continuous or binary independent variables (called predictor variables). The original dichotomous discriminant analysis was developed by Sir Ronald Fisher in 1936. It is different from an ANOVA or MANOVA, which is used to predict one (ANOVA) or multiple (MANOVA) continuous dependent variables by one or more independent categorical variables. Discriminant function analysis is useful in determining whether a set of variables is effective in predicting category membership. Discriminant analysis is used when groups are known a priori (unlike in cluster analysis). Each case must have a score on one or more quantitative predictor measures, and a score on a group measure. In simple terms, discriminant function analysis is classification – the act of distributing things into groups, classes or categories of the same type. Moreover, it is a useful followup procedure to a MANOVA instead of doing a series of oneway ANOVAs, for ascertaining how the groups differ on the composite of dependent variables. In this case, a significant F test allows classification based on a linear combination of predictor variables. Terminology can get confusing here, as in MANOVA, the dependent variables are the predictor variables, and the independent variables are the grouping variables. 
Discriminative kshot learning  This paper introduces a probabilistic framework for kshot image classification. The goal is to generalise from an initial largescale classification task to a separate task comprising new classes and small numbers of examples. The new approach not only leverages the featurebased representation learned by a neural network from the initial task (representational transfer), but also information about the form of the classes (concept transfer). The concept information is encapsulated in a probabilistic model for the final layer weights of the neural network which then acts as a prior when probabilistic kshot learning is performed. Surprisingly, simple probabilistic models and inference schemes outperform many existing kshot learning approaches and compare favourably with the stateoftheart method in terms of errorrate. The new probabilistic methods are also able to accurately model uncertainty, leading to well calibrated classifiers, and they are easily extensible and flexible, unlike many recent approaches to kshot learning. 
Discriminative Model  Discriminative models, also called conditional models, are a class of models used in machine learning for modeling the dependence of an unobserved variable y on an observed variable x. Within a probabilistic framework, this is done by modeling the conditional probability distribution P(yx), which can be used for predicting y from x. Discriminative models, as opposed to generative models, do not allow one to generate samples from the joint distribution of x and y. However, for tasks such as classification and regression that do not require the joint distribution, discriminative models can yield superior performance. On the other hand, generative models are typically more flexible than discriminative models in expressing dependencies in complex learning tasks. In addition, most discriminative models are inherently supervised and cannot easily be extended to unsupervised learning. Application specific details ultimately dictate the suitability of selecting a discriminative versus generative model. 
Discriminative Optimization (DO) 
Many computer vision problems are formulated as the optimization of a cost function. This approach faces two main challenges: (i) designing a cost function with a local optimum at an acceptable solution, and (ii) developing an efficient numerical method to search for one (or multiple) of these local optima. While designing such functions is feasible in the noiseless case, the stability and location of local optima are mostly unknown under noise, occlusion, or missing data. In practice, this can result in undesirable local optima or not having a local optimum in the expected place. On the other hand, numerical optimization algorithms in highdimensional spaces are typically local and often rely on expensive first or second order information to guide the search. To overcome these limitations, this paper proposes Discriminative Optimization (DO), a method that learns search directions from data without the need of a cost function. Specifically, DO explicitly learns a sequence of updates in the search space that leads to stationary points that correspond to desired solutions. We provide a formal analysis of DO and illustrate its benefits in the problem of 3D point cloud registration, camera pose estimation, and image denoising. We show that DO performed comparably or outperformed stateoftheart algorithms in terms of accuracy, robustness to perturbations, and computational efficiency. 
DisentAngled Representation Learning Agent (DARLA) 
Domain adaptation is an important open problem in deep reinforcement learning (RL). In many scenarios of interest data is hard to obtain, so agents may learn a source policy in a setting where data is readily available, with the hope that it generalises well to the target domain. We propose a new multistage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act. DARLA’s vision is based on learning a disentangled representation of the observed environment. Once DARLA can see, it is able to acquire source policies that are robust to many domain shifts – even with no access to the target domain. DARLA significantly outperforms conventional baselines in zeroshot domain adaptation scenarios, an effect that holds across a variety of RL environments (Jaco arm, DeepMind Lab) and base RL algorithms (DQN, A3C and EC). 
DISsimilarity COefficient Networks (DISCO Nets) 
We present a new type of probabilistic model which we call DISsimilarity COefficient Networks (DISCO Nets). DISCO Nets allow us to efficiently sample from a posterior distribution parametrised by a neural network. During training, DISCO Nets are learned by minimising the dissimilarity coefficient between the true distribution and the estimated distribution. This allows us to tailor the training to the loss related to the task at hand. We empirically show that (i) by modeling uncertainty on the output value, DISCO Nets outperform equivalent nonprobabilistic predictive networks and (ii) DISCO Nets accurately model the uncertainty of the output, outperforming existing probabilistic models based on deep neural networks. 
Dissimilarity Measure  If features are given or can be defined they can be used to define a distance measure between objects. This can be understood as the euclidean distance in a properly scaled feature space. For good features this will also result in good dissimilarities. However, as long a dissimilarities are based on features their performance will be determined by the quality of these. As features do not describe the full objects it is possible that two objects that are different have a zero distance based on the available features. This is an essential cause of class overlap in feature spaces. Dissimilarities offer the possibility to overcome this. If the dissimilarity measure is defined in such a way that objects have a zero distance to itself or to entirely identical copies of themselves (which thereby should belong to the same class) there is no class overlap. 
Distance Based on Conditional Ordered List (DCOL) 
nlnet 
Distance Correlation  In statistics and in probability theory, distance correlation is a measure of statistical dependence between two random variables or two random vectors of arbitrary, not necessarily equal dimension. An important property is that this measure of dependence is zero if and only if the random variables are statistically independent. This measure is derived from a number of other quantities that are used in its specification, specifically: distance variance, distance standard deviation and distance covariance. These take the same roles as the ordinary moments with corresponding names in the specification of the Pearson productmoment correlation coefficient. These distancebased measures can be put into an indirect relationship to the ordinary moments by an alternative formulation (described below) using ideas related to Brownian motion, and this has led to the use of names such as Brownian covariance and Brownian distance covariance. cdcsis 
Distance Metric Learning (DML) 
Many machine learning algorithms, such as K Nearest Neighbor (KNN), heavily rely on the distance metric for the input data patterns. Distance Metric learning is to learn a distance metric for the input space of data from a given collection of pair of similar/dissimilar points that preserves the distance relation among the training data. In recent years, many studies have demonstrated, both empirically and theoretically, that a learned metric can significantly improve the performance in classification, clustering and retrieval tasks. 
Distance Multivariance  We introduce two new measures for the dependence of $n \ge 2$ random variables: `distance multivariance’ and `total distance multivariance’. Both measures are based on the weighted $L^2$distance of quantities related to the characteristic functions of the underlying random variables. They extend distance covariance (introduced by Szekely, Rizzo and Bakirov) and generalized distance covariance (introduced in part I) from pairs of random variables to $n$tuplets of random variables. We show that total distance multivariance can be used to detect the independence of $n$ random variables and has a simple finitesample representation in terms of distance matrices of the sample points, where distance is measured by a continuous negative definite function. Based on our theoretical results, we present a test for independence of multiple random vectors which is consistent against all alternatives. multivariance 
Distance Preservation to Local Mean (DPLM) 
In this paper, we propose a nonlinear dimensionality reduction algorithm for the manifold of Symmetric Positive Definite (SPD) matrices that considers the geometry of SPD matrices and provides a low dimensional representation of the manifold with high class discrimination. The proposed algorithm, tries to preserve the local structure of the data by preserving distance to local mean (DPLM) and also provides an implicit projection matrix. DPLM is linear in terms of the number of training samples and may use the label information when they are available in order to performance improvement in classification tasks. We performed several experiments on the multiclass dataset IIa from BCI competition IV. The results show that our approach as dimensionality reduction technique – leads to superior results in comparison with other competitor in the related literature because of its robustness against outliers. The experiments confirm that the combination of DPLM with FGMDM as the classifier leads to the state of the art performance on this dataset. 
Distance Weighted Discrimination (DWD) 
High Dimension Low Sample Size statistical analysis is becoming increasingly important in a wide range of applied contexts. In such situations, it is seen that the appealing discrimination method called the Support Vector Machine can be improved. The revealing concept is ‘data piling’ at the margin. This leads naturally to the development of ‘Distance Weighted Discrimination’, which also is based on modern computationally intensive optimization methods, and seems to give improved ‘generalizability’. Another Look at DWD: Thrifty Algorithm and Bayes Risk Consistency in RKHS sdwd 
Distill and Transfer Learning (Distral) 
Most deep reinforcement learning algorithms are data inefficient in complex and rich environments, limiting their applicability to many scenarios. One direction for improving data efficiency is multitask learning with shared neural network parameters, where efficiency may be improved through transfer across related tasks. In practice, however, this is not usually observed, because gradients from different tasks can interfere negatively, making learning unstable and sometimes even less data efficient. Another issue is the different reward schemes between tasks, which can easily lead to one task dominating the learning of a shared model. We propose a new approach for joint training of multiple tasks, which we refer to as Distral (Distill & transfer learning). Instead of sharing parameters between the different workers, we propose to share a ‘distilled’ policy that captures common behaviour across tasks. Each worker is trained to solve its own task while constrained to stay close to the shared policy, while the shared policy is trained by distillation to be the centroid of all task policies. Both aspects of the learning process are derived by optimizing a joint objective function. We show that our approach supports efficient transfer on complex 3D environments, outperforming several related methods. Moreover, the proposed learning process is more robust and more stable—attributes that are critical in deep reinforcement learning. 
DistMult  Knowledge Base Completion: Baselines Strike Back 
Distributed Computing  Distributed computing is a field of computer science that studies distributed systems. A distributed system is a software system in which components located on networked computers communicate and coordinate their actions by passing messages. The components interact with each other in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems vary from SOAbased systems to massively multiplayer online games to peertopeer applications. A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs. There are many alternatives for the message passing mechanism, including RPClike connectors and message queues. A goal and challenge pursued by some computer scientists and practitioners in distributed systems is location transparency; however, this goal has fallen out of favour in industry, as distributed systems are different from conventional nondistributed systems, and the differences, such as network partitions, partial system failures, and partial upgrades, cannot simply be ‘papered over’ by attempts at ‘transparency’ – see CAP theorem. Distributed computing also refers to the use of distributed systems to solve computational problems. In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers, which communicate with each other by message passing. 
Distributed Dynamic Dataintensive Science (D3 Science) 
A common feature across many science and engineering applications is the amount and diversity of data and computation that must be integrated to yield insights. Data sets are growing larger and becoming distributed; and their location, availability and properties are often timedependent. Collectively, these characteristics give rise to dynamic distributed dataintensive applications. While ‘static’ data applications have received significant attention, the characteristics, requirements, and software systems for the analysis of large volumes of dynamic, distributed data, and dataintensive applications have received relatively less attention. This paper surveys several representative dynamic distributed dataintensive application scenarios, provides a common conceptual framework to understand them, and examines the infrastructure used in support of applications. 
Distributed Lag Model  A distributedlag model is a dynamic model in which the effect of a regressor x on y occurs over time rather than all at once. In the simple case of one explanatory variable and a linear relationship. This form is very similar to the infinitemovingaverage representation of an ARMA process, except that the lag polynomial on the righthand side is applied to the explanatory variable x rather than to a whitenoise process e. The individual coefficients ßs are called lag weights and the collectively comprise the lag distribution. They define the pattern of how x affects y over time. 
Distributed LanceWilliam Clustering Algorithm  One important tool is the optimal clustering of data into useful categories. Dividing similar objects into a smaller number of clusters is of importance in many applications. These include search engines, monitoring of academic performance, biology and wireless networks. We first discuss a number of clustering methods. We present a parallel algorithm for the efficient clustering of objects into groups based on their similarity to each other. The input consists of an n by n distance matrix. This matrix would have a distance ranking for each pair of objects. The smaller the number, the more similar the two objects are to each other. We utilize parallel processors to calculate a hierarchal cluster of these n items based on this matrix. Another advantage of our method is distribution of the large n by n matrix. We have implemented our algorithm and have found it to be scalable both in terms of processing speed and storage. 
Distributed Machine Learning Toolkit (DMTK) 
Distributed machine learning has become more important than ever in this big data era. Especially in recent years, practices have demonstrated the trend that bigger models tend to generate better accuracies in various applications. However, it remains a challenge for common machine learning researchers and practitioners to learn big models, because the task usually requires a large number of computation resources. In order to enable the training of big models using just a modest cluster and in an efficient manner, we release the Microsoft Distributed Machine Learning Toolkit (DMTK), which contains both algorithmic and system innovations. These innovations make machine learning tasks on big data highly scalable, efficient and flexible. 
Distributed Matrix  A distributed matrix has longtyped row and column indices and doubletyped values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Three types of distributed matrices have been implemented so far. The basic type is called RowMatrix. A RowMatrix is a roworiented distributed matrix without meaningful row indices, e.g., a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. We assume that the number of columns is not huge for a RowMatrix so that a single local vector can be reasonably communicated to the driver and can also be stored / operated on using a single node. An IndexedRowMatrix is similar to a RowMatrix but with row indices, which can be used for identifying rows and executing joins. A CoordinateMatrix is a distributed matrix stored in coordinate list (COO) format, backed by an RDD of its entries. 
Distributed Robust Algorithm for Countbased Learning (Dracula) 

Distributed Stream Processing Engines (DSPE) 
Distributed stream processing engines (DSPEs) are a new emergent family of MapReduce inspired technologies that address this issue. These engines allow to express parallel computation on streams, and combine the scalability of distributed processing with the efficiency of streaming algorithms. Examples of these engines include Storm, S4, and Samza. 
Distribution Regression  Linear regression is a fundamental and popular statistical method. There are various kinds of linear regression, such as mean regression and quantile regression. In this paper, we propose a new one called distribution regression, which allows broadspectrum of the error distribution in the linear regression. Our method uses nonparametric technique to estimate regression parameters. Our studies indicate that our method provides a better alternative than mean regression and quantile regression under many settings, particularly for asymmetrical heavytailed distribution or multimodal distribution of the error term. Under some regular conditions, our estimator is $\sqrt n$consistent and possesses the asymptotically normal distribution. The proof of the asymptotic normality of our estimator is very challenging because our nonparametric likelihood function cannot be transformed into sum of independent and identically distributed random variables. Furthermore, penalized likelihood estimator is proposed and enjoys the socalled oracle property with diverging number of parameters. Numerical studies also demonstrate the effectiveness and the flexibility of the proposed method. 
Distribution Separation Method (DSM) 

Distributional Adversarial Networks  We propose a framework for adversarial training that relies on a sample rather than a single sample point as the fundamental unit of discrimination. Inspired by discrepancy measures and twosample tests between probability distributions, we propose two such distributional adversaries that operate and predict on samples, and show how they can be easily implemented on top of existing models. Various experimental results show that generators trained with our distributional adversaries are much more stable and are remarkably less prone to mode collapse than traditional models trained with pointwise prediction discriminators. The application of our framework to domain adaptation also results in considerable improvement over recent stateoftheart. 
Distributionally Robust Stochastic Optimization (DRSO) 
A central question in statistical learning is to design algorithms that not only perform well on training data, but also generalize to new and unseen data. In this paper, we tackle this question by formulating a distributionally robust stochastic optimization (DRSO) problem, which seeks a solution that minimizes the worstcase expected loss over a family of distributions that are close to the empirical distribution in Wasserstein distances. We establish a connection between such Wasserstein DRSO and regularization. More precisely, we identify a broad class of loss functions, for which the Wasserstein DRSO is asymptotically equivalent to a regularization problem with a gradientnorm penalty. Such relation provides new interpretations for problems involving regularization, including a great number of statistical learning problems and discrete choice models (e.g. multinomial logit). The connection suggests a principled way to regularize highdimensional, nonconvex problems. This is demonstrated through two applications: the training of Wasserstein generative adversarial networks (WGANs) in deep learning, and learning heterogeneous consumer preferences with mixed logit choice model. 
Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples (DECORATE) 
DECORATE (Diverse Ensemble Creation by Oppositional Relabeling of Artificial Training Examples) builds an ensemble of J48 trees by recursively adding artificial samples of the training data (‘Melville, P., & Mooney, R. J. (2005). Creating diversity in ensembles using artificial data. Information Fusion, 6(1), 99111. doi:10.1016/j.inffus.2004.04.001’). DecorateR 
Diversity Index  A diversity index is a quantitative measure that reflects how many different types (such as species) there are in a dataset, and simultaneously takes into account how evenly the basic entities (such as individuals) are distributed among those types. The value of a diversity index increases both when the number of types increases and when evenness increases. For a given number of types, the value of a diversity index is maximized when all types are equally abundant. 
Dixon’s Q Test  In statistics, Dixon’s Q test, or simply the Q test, is used for identification and rejection of outliers. This assumes normal distribution and per Dean and Dixon, and others, this test should be used sparingly and never more than once in a data set. To apply a Q test for bad data, arrange the data in order of increasing values and calculate Q as defined: Q = gap/range, Where gap is the absolute difference between the outlier in question and the closest number to it. If Q > Qtable, where Qtable is a reference value corresponding to the sample size and confidence level, then reject the questionable point. Note that only one point may be rejected from a data set using a Q test. 
Django  Django is a highlevel Python Web framework that encourages rapid development and clean, pragmatic design. Built by experienced developers, it takes care of much of the hassle of Web development, so you can focus on writing your app without needing to reinvent the wheel. It’s free and open source. 
DKPro Similarity  DKPro Similarity is an open source software package for developing text similarity algorithms. The framework is designed to complement DKPro Core, a collection of software components for natural language processing (NLP) based on the Apache UIMA framework. By leveraging the power of the tools available in DKPro Core, it allows for a rich set of similarity computation operations, including the design of fullfledged language processing pipelines and fully customizable processing steps. 
Dlib  Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real world problems. It is used in both industry and academia in a wide range of domains including robotics, embedded devices, mobile phones, and large high performance computing environments. Dlib’s open source licensing allows you to use it in any application, free of charge. dlib 
DLVHEX System  The DLVHEX system implements the HEXsemantics, which integrates answer set programming (ASP) with arbitrary external sources. Since its first release ten years ago, significant advancements were achieved. Most importantly, the exploitation of properties of external sources led to efficiency improvements and flexibility enhancements of the language, and technical improvements on the system side increased user’s convenience. In this paper, we present the current status of the system and point out the most important recent enhancements over early versions. While existing literature focuses on theoretical aspects and specific components, a bird’s eye view of the overall system is missing. In order to promote the system for realworld applications, we further present applications which were already successfully realized on top of DLVHEX. This paper is under consideration for acceptance in Theory and Practice of Logic Programming. 
DNAGAN  Disentangling factors of variation has always been a challenging problem in representation learning. Existing algorithms suffer from many limitations, such as unpredictable disentangling factors, bad quality of generated images from encodings, lack of identity information, etc. In this paper, we propose a supervised algorithm called DNAGAN trying to disentangle different attributes of images. The latent representations of images are DNAlike, in which each individual piece represents an independent factor of variation. By annihilating the recessive piece and swapping a certain piece of two latent representations, we obtain another two different representations which could be decoded into images. In order to obtain realistic images and also disentangled representations, we introduce the discriminator for adversarial training. Experiments on MultiPIE and CelebA datasets demonstrate the effectiveness of our method and the advantage of overcoming limitations existing in other methods. 
Docker  Build, Ship and RunAny App, Anywhere. Docker – An open platform for distributed applications for developers and sysadmins. Docker is a relatively new open source application and service, which is seeing interest across a number of areas. It uses recent Linux kernel features (containers, namespaces) to shield processes. While its use (superficially) resembles that of virtual machines, it is much more lightweight as it operates at the level of a single process (rather than an emulation of an entire OS layer). This also allows it to start almost instantly, require very little resources and hence permits an order of magnitude more deployments per host than a virtual machine. Docker offers a standard interface to creation, distribution and deployment. The shipping container analogy is apt: just how shipping containers (via their standard size and “interface”) allow global trade to prosper, Docker is aiming for nothing less for deployment. A Dockerfile provides a concise, extensible, and executable description of the computational environment. Docker software then builds a Docker image from the Dockerfile. Docker images are analogous to virtual machine images, but smaller and built in discrete, extensible and reuseable layers. Images can be distributed and run on any machine that has Docker software installed—including Windows, OS X and of course Linux. Running instances are called Docker containers. A single machine can run hundreds of such containers, including multiple containers running the same image. 
DocTag2Vec  Tagging news articles or blog posts with relevant tags from a collection of predefined ones is coined as document tagging in this work. Accurate tagging of articles can benefit several downstream applications such as recommendation and search. In this work, we propose a novel yet simple approach called DocTag2Vec to accomplish this task. We substantially extend Word2Vec and Doc2Vec—two popular models for learning distributed representation of words and documents. In DocTag2Vec, we simultaneously learn the representation of words, documents, and tags in a joint vector space during training, and employ the simple $k$nearest neighbor search to predict tags for unseen documents. In contrast to previous multilabel learning methods, DocTag2Vec directly deals with raw text instead of provided feature vector, and in addition, enjoys advantages like the learning of tag representation, and the ability of handling newly created tags. To demonstrate the effectiveness of our approach, we conduct experiments on several datasets and show promising results against stateoftheart methods. 
Document Classification  Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done ‘manually’ (or ‘intellectually’) or algorithmically. The intellectual classification of documents has mostly been the province of library science, while the algorithmic classification of documents is mainly in information science and computer science. The problems are overlapping, however, and there is therefore interdisciplinary research on document classification. The documents to be classified may be texts, images, music, etc. Each kind of document possesses its special classification problems. When not otherwise specified, text classification is implied. Documents may be classified according to their subjects or according to other attributes (such as document type, author, printing year etc.). In the rest of this article only subject classification is considered. There are two main philosophies of subject classification of documents: The content based approach and the request based approach. 
Document Term Matrix  ➘ “Term Document Matrix” 
DocumentContext Language Model (DCLM) 
Text documents are structured on multiple levels of detail: individual words are related by syntax, and larger units of text are related by discourse structure. Existing language models generally fail to account for discourse structure, but it is crucial if we are to have language models that reward coherence and generate coherent texts. We present and empirically evaluate a set of multilevel recurrent neural network language models, called DocumentContext Language Models (DCLMs), which incorporate contextual information both within and beyond the sentence. In comparison with wordlevel recurrent neural network language models, the DCLMs obtain slightly better predictive likelihoods, and considerably better assessments of document coherence. 
Dodgson Score  Dodgson’s method is a voting system proposed by the author, mathematician and logician Charles Dodgson, better known as Lewis Carroll. The method is to extend the Condorcet method by swapping candidates until a Condorcet winner is found. The winner is the candidate which requires the minimum number of swaps. Dodgson proposed this voting scheme in his 1876 work ‘A method of taking votes on more than two issues’. Given an integer k and an election, it is NPcomplete to determine whether or not a candidate can become a Condorcet winner with fewer than k swaps. In Dodgson’s method, each voter submits an ordered list of all candidates according to their own preference (from best to worst). The winner is defined to be the candidate for whom we need to perform the minimum number of pairwise swaps (added over all candidates) before they become a Condorcet winner. In particular, if there is already a Condorcet winner, they win the election. In short, we must find the voting profile with minimum Kendall tau distance from the input, such that it has a Condorcet winner; they are declared the victor. Computing the winner or even the Dodgson score of a candidate (the number of swaps needed to make him a winner) is a PNPcomplete problem. Efficient DodgsonScore Calculation Using Heuristics and Parallel Computing 
Domain Adaptive Low Rank  Deep Neural Networks trained on large datasets can be easily transferred to new domains with far fewer labeled examples by a process called finetuning. This has the advantage that representations learned in the large source domain can be exploited on smaller target domains. However, networks designed to be optimal for the source task are often prohibitively large for the target task. In this work we address the compression of networks after domain transfer. We focus on compression algorithms based on lowrank matrix decomposition. Existing methods base compression solely on learned network weights and ignore the statistics of network activations. We show that domain transfer leads to large shifts in network activations and that it is desirable to take this into account when compressing. We demonstrate that considering activation statistics when compressing weights leads to a rankconstrained regression problem with a closedform solution. Because our method takes into account the target domain, it can more optimally remove the redundancy in the weights. Experiments show that our Domain Adaptive Low Rank (DALR) method significantly outperforms existing lowrank compression techniques. With our approach, the fc6 layer of VGG19 can be compressed more than 4x more than using truncated SVD alone — with only a minor or no loss in accuracy. When applied to domaintransferred networks it allows for compression down to only 520% of the original number of parameters with only a minor drop in performance. 
Domain Generating Algorithm (DGA) 
Domain generation algorithm (DGA) are algorithms seen in various families of malware that are used to periodically generate a large number of domain names that can be used as rendezvous points with their controllers. The large number of potential rendezvous points makes it difficult for law enforcement to effectively shut down botnets since infected computers will attempt to contact some of these domain names every day to receive updates or commands. By using publickey cryptography, it is unfeasible for law enforcement and other actors to mimic commands from the malware controllers as some worms will automatically reject any updates not signed by the malware controllers. dga 
Domain Knowledgedriven Methodology (DoKnowMe) 
Software engineering considers performance evaluation to be one of the key portions of software quality assurance. Unfortunately, there seems to be a lack of standard methodologies for performance evaluation even in the scope of experimental computer science. Inspired by the concept of ‘instantiation’ in objectoriented programming, we distinguish the generic performance evaluation logic from the distributed and adhoc relevant studies, and develop an abstract evaluation methodology (by analogy of ‘class’) we name Domain Knowledgedriven Methodology (DoKnowMe). By replacing five predefined domainspecific knowledge artefacts, DoKnowMe could be instantiated into specific methodologies (by analogy of ‘object’) to guide evaluators in performance evaluation of different software and even computing systems. We also propose a generic validation framework with four indicators (i.e.~usefulness, feasibility, effectiveness and repeatability), and use it to validate DoKnowMe in the Cloud services evaluation domain. Given the positive and promising validation result, we plan to integrate more common evaluation strategies to improve DoKnowMe and further focus on the performance evaluation of Cloud autoscaler systems. 
Double Deep Machine Learning  ➘ “ReKopedia” 
DOuble Sparsity Kernel (DOSK) 
Learning with Reproducing Kernel Hilbert Spaces (RKHS) has been widely used in many scientific disciplines. Because a RKHS can be very flexible, it is common to impose a regularization term in the optimization to prevent overfitting. Standard RKHS learning employs the squared norm penalty of the learning function. Despite its success, many challenges remain. In particular, one cannot directly use the squared norm penalty for variable selection or data extraction. Therefore, when there exists noise predictors, or the underlying function has a sparse representation in the dual space, the performance of standard RKHS learning can be suboptimal. In the literature,work has been proposed on how to perform variable selection in RKHS learning, and a data sparsity constraint was considered for data extraction. However, how to learn in a RKHS with both variable selection and data extraction simultaneously remains unclear. In this paper, we propose a unified RKHS learning method, namely, DOuble Sparsity Kernel (DOSK) learning, to overcome this challenge. An efficient algorithm is provided to solve the corresponding optimization problem. We prove that under certain conditions, our new method can asymptotically achieve variable selection consistency. Simulated and real data results demonstrate that DOSK is highly competitive among existing approaches for RKHS learning. 
Double Wedge Plot  A graphical display which indicates outliers and potential level shifts in time series. 
DouglasRachford Algorithm  The DouglasRachford algorithm is a very popular splitting technique for finding a zero of the sum of two maximally monotone operators. However, the behaviour of the algorithm remains mysterious in the general inconsistent case, i.e., when the sum problem has no zeros. More than a decade ago, however, it was shown that in the (possibly inconsistent) convex feasibility setting, the shadow sequence remains bounded and it is weak cluster points solve a best approximation problem. In this paper, we advance the understanding of the inconsistent case significantly by providing a complete proof of the full weak convergence in the convex feasibility setting. In fact, a more general sufficient condition for the weak convergence in the general case is presented. Several examples illustrate the results. On the generalized DouglasRachford algorithm for feasibility problems 
Dpush  Herein this paper is presented a novel invention – called Dpush – that enables truly scalable spam resistant uncensorable automatically encrypted and inherently authenticated messaging; thus restoring our ability to exert our right to private communication, and thus a step forward in restoring an uncorrupted democracy. Using a novel combination of a distributed hash table (DHT) and a proof of work (POW), combined in a way that can only be called a synergy, the emergent property of a scalable and spam resistant unsolicited messaging protocol elegantly emerges. Notable is that the receiver does not need to be online at the time the message is sent. This invention is already implemented and operating within the package that is called MORPHiS – which is a Sybil resistant enhanced Kademlia DHT implementation combined with an already functioning implementation of Dpush, as well as a polished HTTP Dmail interface to send and receive such messages today. 
Drift Analysis  Drift analysis is one of the major tools for analysing evolutionary algorithms and natureinspired search heuristics. In this chapter we give an introduction to drift analysis and give some examples of how to use it for the analysis of evolutionary algorithms. 
Dropout  Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from coadapting too much. During training, dropout samples from an exponential number of different ‘thinned’ networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining stateoftheart results on many benchmark data sets. 
DropoutDAgger  While imitation learning is becoming common practice in robotics, this approach often suffers from data mismatch and compounding errors. DAgger is an iterative algorithm that addresses these issues by continually aggregating training data from both the expert and novice policies, but does not consider the impact of safety. We present a probabilistic extension to DAgger, which uses the distribution over actions provided by the novice policy, for a given observation. Our method, which we call DropoutDAgger, uses dropout to train the novice as a Bayesian neural network that provides insight to its confidence. Using the distribution over the novice’s actions, we estimate a probabilistic measure of safety with respect to the expert action, tuned to balance exploration and exploitation. The utility of this approach is evaluated on the MuJoCo HalfCheetah and in a simple driving experiment, demonstrating improved performance and safety compared to other DAgger variants and classic imitation learning. 
DSDP  The DSDP software is a free open source implementation of an interiorpoint method for semidefinite programming. It provides primal and dual solutions, exploits lowrank structure and sparsity in the data, and has relatively low memory requirements for an interiorpoint method. It allows feasible and infeasible starting points and provides approximate certificates of infeasibility when no feasible solution exists. The dualscaling algorithm implemented in this package has a convergence proof and worstcase polynomial complexity under mild assumptions on the data. The software can be used as a set of subroutines, through Matlab, or by reading and writing to data files. Furthermore, the solver offers scalable parallel performance for large problems and a well documented interface. Some of the most popular applications of semidefinite programming and linear matrix inequalities (LMI) are model control, truss topology design, and semidefinite relaxations of combinatorial and global optimization problems. Rdsdp 
DSLATS  Through the last decade, we have witnessed a surge of Internet of Things (IoT) devices, and with that a greater need to choreograph their actions across both time and space. Although these two problems, namely time synchronization and localization, share many aspects in common, they are traditionally treated separately or combined on centralized approaches that results in an ineffcient use of resources, or in solutions that are not scalable in terms of the number of IoT devices. Therefore, we propose DSLATS, a framework comprised of three different and independent algorithms to jointly solve time synchronization and localization problems in a distributed fashion. The First two algorithms are based mainly on the distributed Extended Kalman Filter (EKF) whereas the third one uses optimization techniques. No fusion center is required, and the devices only communicate with their neighbors. The proposed methods are evaluated on custom UltraWideband communication Testbed and a quadrotor, representing a network of both static and mobile nodes. Our algorithms achieve up to three microseconds time synchronization accuracy and 30 cm localization error. 
Dual Discriminator Generative Adversarial net (D2GAN) 
We propose in this paper a novel approach to tackle the problem of mode collapse encountered in generative adversarial network (GAN). Our idea is intuitive but proven to be very effective, especially in addressing some key limitations of GAN. In essence, it combines the KullbackLeibler (KL) and reverse KL divergences into a unified objective function, thus it exploits the complementary statistical properties from these divergences to effectively diversify the estimated density in capturing multimodes. We term our method dual discriminator generative adversarial nets (D2GAN) which, unlike GAN, has two discriminators; and together with a generator, it also has the analogy of a minimax game, wherein a discriminator rewards high scores for samples from data distribution whilst another discriminator, conversely, favoring data from the generator, and the generator produces data to fool both two discriminators. We develop theoretical analysis to show that, given the maximal discriminators, optimizing the generator of D2GAN reduces to minimizing both KL and reverse KL divergences between data distribution and the distribution induced from the data generated by the generator, hence effectively avoiding the mode collapsing problem. We conduct extensive experiments on synthetic and realworld largescale datasets (MNIST, CIFAR10, STL10, ImageNet), where we have made our best effort to compare our D2GAN with the latest stateoftheart GAN’s variants in comprehensive qualitative and quantitative evaluations. The experimental results demonstrate the competitive and superior performance of our approach in generating good quality and diverse samples over baselines, and the capability of our method to scale up to ImageNet database. 
Dual Lasso Selector  We consider the problem of model selection and estimation in sparse high dimensional linear regression models with strongly correlated variables. First, we study the theoretical properties of the dual Lasso solution, and we show that joint consideration of the Lasso primal and its dual solutions are useful for selecting correlated active variables. Second, we argue that correlations among active predictors are not problematic, and we derive a new weaker condition on the design matrix, called Pseudo Irrepresentable Condition (PIC). Third, we present a new variable selection procedure, Dual Lasso Selector, and we prove that the PIC is a necessary and sufficient condition for consistent variable selection for the proposed method. Finally, by combining the dual Lasso selector further with the Ridge estimation even better prediction performance is achieved. We call the combination (DLSelect+Ridge), it can be viewed as a new combined approach for inference in highdimensional regression models with correlated variables. We illustrate DLSelect+Ridge method and compare it with popular existing methods in terms of variable selection, prediction accuracy, estimation accuracy and computation speed by considering various simulated and real data examples. 
Dual Learning for Machine Translation (dualNMT) 
While neural machine translation (NMT) is making good progress in the past two years, tens of millions of bilingual sentence pairs are needed for its training. However, human labeling is very costly. To tackle this training data bottleneck, we develop a duallearning mechanism, which can enable an NMT system to automatically learn from unlabeled data through a duallearning game. This mechanism is inspired by the following observation: any machine translation task has a dual task, e.g., EnglishtoFrench translation (primal) versus FrenchtoEnglish translation (dual); the primal and dual tasks can form a closed loop, and generate informative feedback signals to train the translation models, even if without the involvement of a human labeler. In the duallearning mechanism, we use one agent to represent the model for the primal task and the other agent to represent the model for the dual task, then ask them to teach each other through a reinforcement learning process. Based on the feedback signals generated during this process (e.g., the languagemodel likelihood of the output of a model, and the reconstruction error of the original sentence after the primal and dual translations), we can iteratively update the two models until convergence (e.g., using the policy gradient methods). We call the corresponding approach to neural machine translation \emph{dualNMT}. Experiments show that dualNMT works very well on English$\leftrightarrow$French translation; especially, by learning from monolingual data (with 10% bilingual data for warm start), it achieves a comparable accuracy to NMT trained from the full bilingual data for the FrenchtoEnglish translation task. 
Dual Path Network (DPN) 
In this work, we present a simple, highly efficient and modularized Dual Path Network (DPN) for image classification which presents a new topology of connection paths internally. By revealing the equivalence of the stateoftheart Residual Network (ResNet) and Densely Convolutional Network (DenseNet) within the HORNN framework, we find that ResNet enables feature reusage while DenseNet enables new features exploration which are both important for learning good representations. To enjoy the benefits from both path topologies, our proposed Dual Path Network shares common features while maintaining the flexibility to explore new features through dual path architectures. Extensive experiments on three benchmark datasets, ImagNet1k, Places365 and PASCAL VOC, clearly demonstrate superior performance of the proposed DPN over stateofthearts. In particular, on the ImagNet1k dataset, a shallow DPN surpasses the best ResNeXt101(64x4d) with 26% smaller model size, 25% less computational cost and 8% lower memory consumption, and a deeper DPN (DPN131) further pushes the stateoftheart single model performance with more than 3 times faster training speed. Experiments on the Places365 largescale scene dataset, PASCAL VOC detection dataset, and PASCAL VOC segmentation dataset also demonstrate its consistently better performance than DenseNet, ResNet and the latest ResNeXt model over various applications. 
Dual Principal Component Pursuit (DPCP) 
We extend the theoretical analysis of a recently proposed single subspace learning algorithm, called Dual Principal Component Pursuit (DPCP), to the case where the data are drawn from of a union of hyperplanes. To gain insight into the properties of the $\ell_1$ nonconvex problem associated with DPCP, we develop a geometric analysis of a closely related continuous optimization problem. Then transferring this analysis to the discrete problem, our results state that as long as the hyperplanes are sufficiently separated, the dominant hyperplane is sufficiently dominant and the points are uniformly distributed inside the associated hyperplanes, then the nonconvex DPCP problem has a unique global solution, equal to the normal vector of the dominant hyperplane. This suggests the correctness of a sequential hyperplane learning algorithm based on DPCP. A thorough experimental evaluation reveals that hyperplane learning schemes based on DPCP dramatically improve over the stateoftheart methods for the case of synthetic data, while are competitive to the stateoftheart in the case of 3D plane clustering for Kinect data. 
Dual Rectified Linear Units (DReLU) 
Rectified Linear Units (ReLUs) are widely used in feedforward neural networks, and in convolutional neural networks in particular. However, they can be rarely found in recurrent neural networks due to the unboundedness and the positive image of the rectified linear activation function. In this paper, we introduce Dual Rectified Linear Units (DReLUs), a novel type of rectified unit that comes with a positive and negative image that is unbounded. We show that we can successfully replace the tanh activation function in the recurrent step of quasi recurrent neural networks. In addition, DReLUs are less prone to the vanishing gradient problem, they are noise robust, and they induce sparse activations. Therefore, we are able to stack up to eight quasi recurrent layers, making it possible to improve the current stateoftheart in characterlevel language modeling over architectures based on shallow Long ShortTerm Memory (LSTM). 
Dual Supervised Learning  Many supervised learning tasks are emerged in dual forms, e.g., EnglishtoFrench translation vs. FrenchtoEnglish translation, speech recognition vs. text to speech, and image classification vs. image generation. Two dual tasks have intrinsic connections with each other due to the probabilistic correlation between their models. This connection is, however, not effectively utilized today, since people usually train the models of two dual tasks separately and independently. In this work, we propose training the models of two dual tasks simultaneously, and explicitly exploiting the probabilistic correlation between them to regularize the training process. For ease of reference, we call the proposed approach \emph{dual supervised learning}. We demonstrate that dual supervised learning can improve the practical performances of both tasks, for various applications including machine translation, image processing, and sentiment analysis. 
DunningKruger Effect  The DunningKruger effect is a cognitive bias wherein unskilled individuals suffer from illusory superiority, mistakenly rating their ability much higher than is accurate. This bias is attributed to a metacognitive inability of the unskilled to recognize their ineptitude. Conversely, highly skilled individuals tend to underestimate their relative competence, erroneously assuming that tasks which are easy for them are also easy for others. 
Duration Analysis  ➘ “Survival Analysis” Duration Analysis and its Applications (Finance) spduration 
Duration and Interval Hidden Markov Model (DIHMM) 
Analysis of sequential event data has been recognized as one of the essential tools in data modeling and analysis field. In this paper, after the examination of its technical requirements and issues to model complex but practical situation, we propose a new sequential data model, dubbed Duration and Interval Hidden Markov Model (DIHMM), that efficiently represents ‘state duration’ and ‘state interval’ of data events. This has significant implications to play an important role in representing practical timeseries sequential data. This eventually provides an efficient and flexible sequential data retrieval. Numerical experiments on synthetic and real data demonstrate the efficiency and accuracy of the proposed DIHMM. 
Dyadic Data  Dyadic data refers to a domain with two nite sets of objects in which observations are made for dyads, i.e., pairs with one element from either set. This type of data arises naturally in many application ranging from computational linguistics and information retrieval to preference analysis and computer vision. In this paper, we present a systematic, domainindependent framework of learning from dyadic data by statistical mixture models. Our approach covers different models with flat and hierarchical latent class structures. We propose an annealed version of the standard EM algorithm for model fitting which is empirically evaluated on a variety of data sets from different domains. http://…/gonzalezgriffin2012dyadicch.pdf dmm 
Dyadic Network Analysis  dyads 
Dygraphs  dygraphs is a fast, flexible open source JavaScript charting library. It allows users to explore and interpret dense data sets. 
DYNAMIC  In this paper we present DYNAMIC, an opensource C++ library implementing dynamic compressed data structures for string manipulation. Our framework includes useful tools such as searchable partial sums, succinct/gapencoded bitvectors, and entropy/runlength compressed strings and FMindexes. We prove closetooptimal theoretical bounds for the resources used by our structures, and show that our theoretical predictions are empirically tightly verified in practice. To conclude, we turn our attention to applications. We compare the performance of four recentlypublished compression algorithms implemented using DYNAMIC with those of stateoftheart tools performing the same task. Our experiments show that algorithms making use of dynamic compressed data structures can be up to three orders of magnitude more spaceefficient (albeit slower) than classical ones performing the same tasks. 
Dynamic Adaptive Network Intelligence (DANI) 
Accurate representational learning of both the explicit and implicit relationships within data is critical to the ability of machines to perform more complex and abstract reasoning tasks. We describe the efficient weakly supervised learning of such inferences by our Dynamic Adaptive Network Intelligence (DANI) model. We report stateoftheart results for DANI over question answering tasks in the bAbI dataset that have proved difficult for contemporary approaches to learning representation (Weston et al., 2015). 
Dynamic Bayesian Network (DBN) 
A Dynamic Bayesian Network (DBN) is a Bayesian Network which relates variables to each other over adjacent time steps. This is often called a TwoTimeslice BN (2TBN) because it says that at any point in time T, the value of a variable can be calculated from the internal regressors and the immediate prior value (time T1). DBNs are common in robotics, and have shown potential for a wide range of data mining applications. For example, they have been used in speech recognition, digital forensics, protein sequencing, and bioinformatics. DBN is a generalization of hidden Markov models and Kalman filters. https://…/thesis.pdf http://…/0000006a.pdf 
Dynamic Capacity Network (DCN) 
We introduce the Dynamic Capacity Network (DCN), a neural network that can adaptively assign its capacity across different portions of the input data. This is achieved by combining modules of two types: lowcapacity subnetworks and highcapacity subnetworks. The lowcapacity subnetworks are applied across most of the input, but also provide a guide to select a few portions of the input on which to apply the highcapacity subnetworks. The selection is made using a novel gradientbased attention mechanism, that efficiently identifies the modules and input features that are most likely to impact the DCN’s output and to which we’d like to devote more capacity. We focus our empirical evaluation on the cluttered MNIST and SVHN image datasets. Our findings indicate that DCNs are able to drastically reduce the number of computations, compared to traditional convolutional neural networks, while maintaining similar performance. 
Dynamic Clustering (DC) 

Dynamic Continuous Indexing (DCI) 

Dynamic Correlation Analysis (DCA) 
In highthroughput data, dynamic correlation between genes, i.e. changing correlation patterns under different biological conditions, can reveal important regulatory mechanisms. Given the complex nature of dynamic correlation, and the underlying conditions for dynamic correlation may not manifest into clinical observations, it is difficult to recover such signal from the data. Current methods seek underlying conditions for dynamic correlation by using certain observed genes as surrogates, which may not faithfully represent true latent conditions. In this study we develop a new method that directly identifies strong latent signals that regulate the dynamic correlation of many pairs of genes, named DCA: Dynamic Correlation Analysis. At the center of the method is a new metric for the identification of gene pairs that are highly likely to be dynamically correlated, without knowing the underlying conditions of the dynamic correlation. We validate the performance of the method with extensive simulations. In real data analysis, the method reveals novel latent factors with clear biological meaning, bringing new insights into the data. 
Dynamic Decision Network (DDN) 
A fully observable dynamic decision network consists of: • a set of state features, each with a domain; • a set of possible actions forming a decision node A, with domain the set of actions; • a twostage belief network with an action node A, nodes F0 and F1 for each feature F (for the features at time 0 and time 1, respectively), and a conditional probability P(F1parents(F1)) such that the parents of F1 can include A and features at times 0 and 1 as long as the resulting network is acyclic; and • a reward function that can be a function of the action and any of the features at times 0 or 1. 
Dynamic Deep Neural Networks (D2NN) 
We introduce Dynamic Deep Neural Networks (D2NN), a new type of feedforward deep neural network that allow selective execution. Given an input, only a subset of D2NN neurons are executed, and the particular subset is determined by the D2NN itself. By pruning unnecessary computation depending on input, D2NNs provide a way to improve computational efficiency. To achieve dynamic selective execution, a D2NN augments a regular feedforward deep neural network (directed acyclic graph of differentiable modules) with one or more controller modules. Each controller module is a subnetwork whose output is a decision that controls whether other modules can execute. A D2NN is trained end to end. Both regular modules and controller modules in a D2NN are learnable and are jointly trained to optimize both accuracy and efficiency. Such training is achieved by integrating backpropagation with reinforcement learning. With extensive experiments of various D2NN architectures on image classification tasks, we demonstrate that D2NNs are general and flexible, and can effectively optimize accuracyefficiency tradeoffs. 
Dynamic Filter Network (DFN) 
In a traditional convolutional layer, the learned filters stay fixed after training. In contrast, we introduce a new framework, the Dynamic Filter Network, where filters are generated dynamically conditioned on an input. We show that this architecture is a powerful one, with increased flexibility thanks to its adaptive nature, yet without an excessive increase in the number of model parameters. A wide variety of filtering operations can be learned this way, including local spatial transformations, but also others like selective (de)blurring or adaptive feature extraction. Moreover, multiple such layers can be combined, e.g. in a recurrent architecture. We demonstrate the effectiveness of the dynamic filter network on the tasks of video and stereo prediction, and reach stateoftheart performance on the moving MNIST dataset with a much smaller model. By visualizing the learned filters, we illustrate that the network has picked up flow information by only looking at unlabelled training data. This suggests that the network can be used to pretrain networks for various supervised tasks in an unsupervised way, like optical flow and depth estimation. 
Dynamic Graph Convolutional Networks  Many different classification tasks need to manage structured data, which are usually modeled as graphs. Moreover, these graphs can be dynamic, meaning that the vertices/edges of each graph may change during time. Our goal is to jointly exploit structured data and temporal information through the use of a neural network model. To the best of our knowledge, this task has not been addressed using these kind of architectures. For this reason, we propose two novel approaches, which combine Long ShortTerm Memory networks and Graph Convolutional Networks to learn long shortterm dependencies together with graph structure. The quality of our methods is confirmed by the promising results achieved. 
Dynamic Linear Model (DLM) 
Dynamic Linear Models (DLMs) or State Space Models define a very general class of nonstationary time series models. DLMs may include terms to model trends, seasonality, covariates and autoregressive components. Other time series models like ARMA models are particular DLMs. The main goals are shortterm forecasting, intervention analysis and monitoring. dlm 
Dynamic Partition Models  We present a new approach for learning compact and intuitive distributed representations with binary encoding. Rather than summing up expert votes as in products of experts, we employ for each variable the opinion of the most reliable expert. Data points are hence explained through a partitioning of the variables into expert supports. The partitions are dynamically adapted based on which experts are active. During the learning phase we adopt a smoothed version of this model that uses separate mixtures for each data dimension. In our experiments we achieve accurate reconstructions of highdimensional data points with at most a dozen experts. 
Dynamic Principal Components  
Dynamic Programming  In mathematics, computer science, economics, and bioinformatics, dynamic programming is a method for solving a complex problem by breaking it down into a collection of simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems and optimal substructure (described below). When applicable, the method takes far less time than naive methods that don’t take advantage of the subproblem overlap (like depthfirst search). In order to solve a given problem, using a dynamic programming approach, we need to solve different parts of the problem (subproblems), then combine the solutions of the subproblems to reach an overall solution. Often when using a more naive method, many of the subproblems are generated and solved many times. The dynamic programming approach seeks to solve each subproblem only once, thus reducing the number of computations: once the solution to a given subproblem has been computed, it is stored or “memoized”: the next time the same solution is needed, it is simply looked up. This approach is especially useful when the number of repeating subproblems grows exponentially as a function of the size of the input. Dynamic programming algorithms are used for optimization (for example, finding the shortest path between two points, or the fastest way to multiply many matrices). A dynamic programming algorithm will examine the previously solved subproblems and will combine their solutions to give the best solution for the given problem. The alternatives are many, such as using a greedy algorithm, which picks the locally optimal choice at each branch in the road. The locally optimal choice may be a poor choice for the overall solution. While a greedy algorithm does not guarantee an optimal solution, it is often faster to calculate. Fortunately, some greedy algorithms (such as minimum spanning trees) are proven to lead to the optimal solution. 
Dynamic Regression in the Presence of Autocorrelated Residuals (DREGAR) 
DREGAR 
Dynamic Tensor Clustering  Dynamic tensor data are becoming prevalent in numerous applications. Existing tensor clustering methods either fail to account for the dynamic nature of the data, or are inapplicable to a generalorder tensor. Also there is often a gap between statistical guarantee and computational efficiency for existing tensor clustering solutions. In this article, we aim to bridge this gap by proposing a new dynamic tensor clustering method, which takes into account both sparsity and fusion structures, and enjoys strong statistical guarantees as well as high computational efficiency. Our proposal is based upon a new structured tensor factorization that encourages both sparsity and smoothness in parameters along the specified tensor modes. Computationally, we develop a highly efficient optimization algorithm that benefits from substantial dimension reduction. In theory, we first establish a nonasymptotic error bound for the estimator from the structured tensor factorization. Built upon this error bound, we then derive the rate of convergence of the estimated cluster centers, and show that the estimated clusters recover the true cluster structures with a high probability. Moreover, our proposed method can be naturally extended to coclustering of multiple modes of the tensor data. The efficacy of our approach is illustrated via simulations and a brain dynamic functional connectivity analysis from an Autism spectrum disorder study. 
Dynamic Time Warping (DTW) 
In time series analysis, dynamic time warping (DTW) is an algorithm for measuring similarity between two temporal sequences which may vary in time or speed. For instance, similarities in walking patterns could be detected using DTW, even if one person was walking faster than the other, or if there were accelerations and decelerations during the course of an observation. DTW has been applied to temporal sequences of video, audio, and graphics data – indeed, any data which can be turned into a linear sequence can be analyzed with DTW. A well known application has been automatic speech recognition, to cope with different speaking speeds. Other applications include speaker recognition and online signature recognition. Also it is seen that it can be used in partial shape matching application. In general, DTW is a method that calculates an optimal match between two given sequences (e.g. time series) with certain restrictions. The sequences are ‘warped’ nonlinearly in the time dimension to determine a measure of their similarity independent of certain nonlinear variations in the time dimension. This sequence alignment method is often used in time series classification. Although DTW measures a distancelike quantity between two given sequences, it does’t guarantee the triangle inequality to hold. Dynamic programming algorithm optimization for spoken word recognition dtwclust,dtwSat,IncDTW 
Dynamic Treatment Regimens (DTR) 
In medical research, a dynamic treatment regime (DTR), adaptive intervention, or adaptive treatment strategy is a set of rules for choosing effective treatments for individual patients. Historically, medical research and the practice of medicine tended to rely on an acute care model for the treatment of all medical problems, including chronic illness. Treatment choices made for a particular patient under a dynamic regime are based on that individual’s characteristics and history, with the goal of optimizing his or her longterm clinical outcome. A dynamic treatment regime is analogous to a policy in the field of reinforcement learning, and analogous to a controller in control theory. While most work on dynamic treatment regimes has been done in the context of medicine, the same ideas apply to timevarying policies in other fields, such as education, marketing, and economics. Dynamic treatment regimens (DTRs) are sequential decision rules tailored at each stage by potentially timevarying patient features and intermediate outcomes observed in previous stages. There are 3 main type methods, Olearning, Qlearning and Plearning to learn the optimal Dynamic Treatment Regimes with continuous variables. DTRlearn 
Dynamic Variable Effort Deep Neural Networks (DyVEDeep) 
Deep Neural Networks (DNNs) have advanced the stateoftheart in a variety of machine learning tasks and are deployed in increasing numbers of products and services. However, the computational requirements of training and evaluating largescale DNNs are growing at a much faster pace than the capabilities of the underlying hardware platforms that they are executed upon. In this work, we propose Dynamic Variable Effort Deep Neural Networks (DyVEDeep) to reduce the computational requirements of DNNs during inference. Previous efforts propose specialized hardware implementations for DNNs, statically prune the network, or compress the weights. Complementary to these approaches, DyVEDeep is a dynamic approach that exploits the heterogeneity in the inputs to DNNs to improve their compute efficiency with comparable classification accuracy. DyVEDeep equips DNNs with dynamic effort mechanisms that, in the course of processing an input, identify how critical a group of computations are to classify the input. DyVEDeep dynamically focuses its compute effort only on the critical computa tions, while skipping or approximating the rest. We propose 3 effort knobs that operate at different levels of granularity viz. neuron, feature and layer levels. We build DyVEDeep versions for 5 popular image recognition benchmarks – one for CIFAR10 and four for ImageNet (AlexNet, OverFeat and VGG16, weightcompressed AlexNet). Across all benchmarks, DyVEDeep achieves 2.1x2.6x reduction in the number of scalar operations, which translates to 1.8x2.3x performance improvement over a Caffebased implementation, with < 0.5% loss in accuracy. 
Dynamically Expandable Network (DEN) 
We propose a novel deep network architecture for lifelong learning which we refer to as Dynamically Expandable Network (DEN), that can dynamically decide its network capacity as it trains on a sequence of tasks, to learn a compact overlapping knowledge sharing structure among tasks. DEN is efficiently trained in an online manner by performing selective retraining, dynamically expands network capacity upon arrival of each task with only the necessary number of units, and effectively prevents semantic drift by splitting/duplicating units and timestamping them. We validate DEN on multiple public datasets in lifelong learning scenarios on multiple public datasets, on which it not only significantly outperforms existing lifelong learning methods for deep networks, but also achieves the same level of performance as the batch model with substantially fewer number of parameters. 
Dynamically Routed Network (SkipNet) 
Increasing depth and complexity in convolutional neural networks has enabled significant progress in visual perception tasks. However, incremental improvements in accuracy are often accompanied by exponentially deeper models that push the computational limits of modern hardware. These incremental improvements in accuracy imply that only a small fraction of the inputs require the additional model complexity. As a consequence, for any given image it is possible to bypass multiple stages of computation to reduce the cost of forward inference without affecting accuracy. We exploit this simple observation by learning to dynamically route computation through a convolutional network. We introduce dynamically routed networks (SkipNets) by adding gating layers that route images through existing convolutional networks and formulate the routing problem in the context of sequential decision making. We propose a hybrid learning algorithm which combines supervised learning and reinforcement learning to address the challenges of inherently nondifferentiable routing decisions. We show SkipNet reduces computation by 30 – 90% while preserving the accuracy of the original model on four benchmark datasets. We compare SkipNet with SACT and ACT to show SkipNet achieves better accuracy with lower computation. 
DyNet  We describe DyNet, a toolkit for implementing neural network models based on dynamic declaration of network structure. In the static declaration strategy that is used in toolkits like Theano, CNTK, and TensorFlow, the user first defines a computation graph (a symbolic representation of the computation), and then examples are fed into an engine that executes this computation and computes its derivatives. In DyNet’s dynamic declaration strategy, computation graph construction is mostly transparent, being implicitly constructed by executing procedural code that computes the network outputs, and the user is free to use different network structures for each input. Dynamic declaration thus facilitates the implementation of more complicated network architectures, and DyNet is specifically designed to allow users to implement their models in a way that is idiomatic in their preferred programming language (C++ or Python). One challenge with dynamic declaration is that because the symbolic computation graph is defined anew for every training example, its construction must have low overhead. To achieve this, DyNet has an optimized C++ backend and lightweight graph representation. Experiments show that DyNet’s speeds are faster than or comparable with static declaration toolkits, and significantly faster than Chainer, another dynamic declaration toolkit. DyNet is released opensource under the Apache 2.0 license and available at http://…/dynet. 
Advertisements