Consider a movie studio aiming to produce a set of new movies for summer release: What types of movies it should produce? Who would the movies appeal to? How many movies should it make? Similar issues are encountered by a variety of organizations, e.g., mobile-phone manufacturers and online magazines, who have to create new (non-existent) items to satisfy groups of users with different preferences. In this paper, we present a joint problem formalization of these interrelated issues, and propose generative methods that address these questions simultaneously. Specifically, we leverage the latent space obtained by training a deep generative model—the Variational Autoencoder (VAE)—via a loss function that incorporates both rating performance and item reconstruction terms. We then apply a greedy search algorithm that utilizes this learned latent space to jointly obtain K plausible new items, and user groups that would find the items appealing. An evaluation of our methods on a synthetic dataset indicates that our approach is able to generate novel items similar to highly-desirable unobserved items. As case studies on real-world data, we applied our method on the MART abstract art and Movielens Tag Genome dataset, which resulted in promising results: small and diverse sets of novel items.
Different search engines provide different outputs for the same keyword. This may be due to different definitions of relevance, and/or to different knowledge/anticipation of users’ preferences, but rankings are also suspected to be biased towards own content, which may prejudicial to other content providers. In this paper, we make some initial steps toward a rigorous comparison and analysis of search engines, by proposing a definition for a consensual relevance of a page with respect to a keyword, from a set of search engines. More specifically, we look at the results of several search engines for a sample of keywords, and define for each keyword the visibility of a page based on its ranking over all search engines. This allows to define a score of the search engine for a keyword, and then its average score over all keywords. Based on the pages visibility, we can also define the consensus search engine as the one showing the most visible results for each keyword. We have implemented this model and present an analysis of the results.
This dissertation focuses on two fundamental sorting problems: string sorting and suffix sorting. The first part considers parallel string sorting on shared-memory multi-core machines, the second part external memory suffix sorting using the induced sorting principle, and the third part distributed external memory suffix sorting with a new distributed algorithmic big data framework named Thrill.
Recently, in the area of big data, some popular applications such as web search engines and recommendation systems, face the problem to diversify results during query processing. In this sense, it is both significant and essential to propose methods to deal with big data in order to increase the diversity of the result set. In this paper, we firstly define a set’s diversity and an element’s ability to improve the set’s overall diversity. Based on these definitions, we propose a diversification framework which has good performance in terms of effectiveness and efficiency. Also, this framework has theoretical guarantee on probability of success. Secondly, we design implementation algorithms based on this framework for both numerical and string data. Thirdly, for numerical and string data respectively, we carry out extensive experiments on real data to verify the performance of our proposed framework, and also perform scalability experiments on synthetic data.
In the modern e-commerce, the behaviors of customers contain rich information, e.g., consumption habits, the dynamics of preferences. Recently, session-based recommendations are becoming popular to explore the temporal characteristics of customers’ interactive behaviors. However, existing works mainly exploit the short-term behaviors without fully taking the customers’ long-term stable preferences and evolutions into account. In this paper, we propose a novel Behavior-Intensive Neural Network (BINN) for next-item recommendation by incorporating both users’ historical stable preferences and present consumption motivations. Specifically, BINN contains two main components, i.e., Neural Item Embedding, and Discriminative Behaviors Learning. Firstly, a novel item embedding method based on user interactions is developed for obtaining an unified representation for each item. Then, with the embedded items and the interactive behaviors over item sequences, BINN discriminatively learns the historical preferences and present motivations of the target users. Thus, BINN could better perform recommendations of the next items for the target users. Finally, for evaluating the performances of BINN, we conduct extensive experiments on two real-world datasets, i.e., Tianchi and JD. The experimental results clearly demonstrate the effectiveness of BINN compared with several state-of-the-art methods.
Data application developers and data scientists spend an inordinate amount of time iterating on machine learning (ML) workflows — by modifying the data pre-processing, model training, and post-processing steps — via trial-and-error to achieve the desired model performance. Existing work on accelerating machine learning focuses on speeding up one-shot execution of workflows, failing to address the incremental and dynamic nature of typical ML development. We propose Helix, a declarative machine learning system that accelerates iterative development by optimizing workflow execution end-to-end and across iterations. Helix minimizes the runtime per iteration via program analysis and intelligent reuse of previous results, which are selectively materialized — trading off the cost of materialization for potential future benefits — to speed up future iterations. Additionally, Helix offers a graphical interface to visualize workflow DAGs and compare versions to facilitate iterative development. Through two ML applications, in classification and in structured prediction, attendees will experience the succinctness of Helix programming interface and the speed and ease of iterative development using Helix. In our evaluations, Helix achieved up to an order of magnitude reduction in cumulative run time compared to state-of-the-art machine learning tools.
Confidentiality of patient information is an essential part of Electronic Health Record System. Patient information, if exposed, can cause a serious damage to the privacy of individuals receiving healthcare. Hence it is important to remove such details from physician notes. A system is proposed which consists of a deep learning model where a de-convolutional neural network and bi-directional LSTM-CNN is used along with regular expressions to recognize and eliminate the individually identifiable information. This information is then removed from a medical practitioner’s data which further allows the fair usage of such information among researchers and in clinical trials.
Multi-Task Gaussian processes (MTGPs) have shown a significant progress both in expressiveness and interpretation of the relatedness between different tasks: from linear combinations of independent single-output Gaussian processes (GPs), through the direct modeling of the cross-covariances such as spectral mixture kernels with phase shift, to the design of multivariate covariance functions based on spectral mixture kernels which model delays among tasks in addition to phase differences, and which provide a parametric interpretation of the relatedness across tasks. In this paper we further extend expressiveness and interpretability of MTGPs models and introduce a new family of kernels capable to model nonlinear correlations between tasks as well as dependencies between spectral mixtures, including time and phase delay. Specifically, we use generalized convolution spectral mixture kernels for modeling dependencies at spectral mixture level, and coupling coregionalization for discovering task level correlations. The proposed kernels for MTGP are validated on artificial data and compared with existing MTGPs methods on three real-world experiments. Results indicate the benefits of our more expressive representation with respect to performance and interpretability.
Data center networks are an important infrastructure in various applications of modern information technologies. Note that each data center always has a finite lifetime, thus once a data center fails, then it will lose all its storage files and useful information. For this, it is necessary to replicate and copy each important file into other data centers such that this file can increase its lifetime of staying in a data center network. In this paper, we describe a large-scale data center network with a file d-threshold policy, which is to replicate each important file into at most d-1 other data centers such that this file can maintain in the data center network under a given level of data security in the long-term. To this end, we develop three relevant Markov processes to propose two effective methods for assessing the file lifetime and data security. By using the RG-factorizations, we show that the two methods are used to be able to more effectively evaluate the file lifetime of large-scale data center networks. We hope the methodology and results given in this paper are applicable in the file lifetime study of more general data center networks with replication mechanism.
Deep learning models have lately shown great performance in various fields such as computer vision, speech recognition, speech translation, and natural language processing. However, alongside their state-of-the-art performance, it is still generally unclear what is the source of their generalization ability. Thus, an important question is what makes deep neural networks able to generalize well from the training set to new data. In this article, we provide an overview of the existing theory and bounds for the characterization of the generalization error of deep neural networks, combining both classical and more recent theoretical and empirical results.
The explosion in the amount of news and journalistic content being generated across the globe, coupled with extended and instantaneous access to information through online media, makes it difficult and time-consuming to monitor news developments and opinion formation in real time. There is an increasing need for tools that can pre-process, analyse and classify raw text to extract interpretable content; specifically, identifying topics and content-driven groupings of articles. We present here such a methodology that brings together powerful vector embeddings from Natural Language Processing with tools from Graph Theory that exploit diffusive dynamics on graphs to reveal natural partitions across scales. Our framework uses a recent deep neural network text analysis methodology (Doc2vec) to represent text in vector form and then applies a multi-scale community detection method (Markov Stability) to partition a similarity graph of document vectors. The method allows us to obtain clusters of documents with similar content, at different levels of resolution, in an unsupervised manner. We showcase our approach with the analysis of a corpus of 9,000 news articles published by Vox Media over one year. Our results show consistent groupings of documents according to content without a priori assumptions about the number or type of clusters to be found. The multilevel clustering reveals a quasi-hierarchy of topics and subtopics with increased intelligibility and improved topic coherence as compared to external taxonomy services and standard topic detection methods.
We consider a setting, where the output of a linear dynamical system (LDS) is, with an unknown but fixed probability, replaced by noise. There, we present a robust method for the prediction of the outputs of the LDS and identification of the samples of noise, and prove guarantees on its statistical performance. One application lies in anomaly detection: the samples of noise, unlikely to have been generated by the dynamics, can be flagged to operators of the system for further study.
This article proposes a visualization method for multidimensional data based on: (i) Animated functional Hypothetical Outcome Plots (f-HOPs); (ii) 3-dimensional Kiviat plot; and (iii) data sonification. In an Uncertainty Quantification (UQ) framework, such analysis coupled with standard statistical analysis tools such as Probability Density Functions (PDF) can be used to augment the understanding of how the uncertainties in the numerical code inputs translate into uncertainties in the quantity of interest (QoI). In contrast with static representation of most advanced techniques such as functional Highest Density Region (HDR) boxplot or functional boxplot, f-HOPs is a dynamic visualization that enables the practitioners to infer the dynamics of the physics and enables to see functional correlations that may exist. While this technique only allows to represent the QoI, we propose a 3-dimensional version of the Kiviat plot to encode all input parameters. This new visualization takes advantage of information from f-HOPs through data sonification. All in all, this allows to analyse large datasets within a high-dimensional parameter space and a functional QoI in the same canvas. The proposed method is assessed and showed its benefits on two related environmental datasets.
We propose a Bayesian method to detect change points for functional data. We extract the features of a sequence of functional data by the discrete wavelet transform (DWT), and treat each sequence of feature independently. We believe there is potentially a change in each feature at possibly different time points. The functional data evolves through such changes throughout the sequences of observations. The change point for this sequence of functional data is the cumulative effect of changes in all features. We assign the features with priors which incorporate the characteristic of the wavelet coefficients. Then we compute the posterior distribution of change point for each sequence of feature, and define a matrix where each entry is a measure of similarity between two functional data in this sequence. We compute the ratio of the mean similarity between groups and within groups for all possible partitions, and the change point is where the ratio reaches the minimum. We demonstrate this method using a dataset on climate change.
This paper proposes a maximum-likelihood approach to jointly estimate marginal conditional quantiles of multivariate response variables in a linear regression framework. We consider a slight reparameterization of the Multivariate Asymmetric Laplace distribution proposed by Kotz et al (2001) and exploit its location-scale mixture representation to implement a new EM algorithm for estimating model parameters. The idea is to extend the link between the Asymmetric Laplace distribution and the well-known univariate quantile regression model to a multivariate context, i.e. when a multivariate dependent variable is concerned. The approach accounts for association among multiple responses and study how the relationship between responses and explanatory variables can vary across different quantiles of the marginal conditional distribution of the responses. A penalized version of the EM algorithm is also presented to tackle the problem of variable selection. The validity of our approach is analyzed in a simulation study, where we also provide evidence on the efficiency gain of the proposed method compared to estimation obtained by separate univariate quantile regressions. A real data application is finally proposed to study the main determinants of financial distress in a sample of Italian firms.
Spiking Neural Networks offer an event-driven and biologically realistic alternative to standard Artificial Neural Networks based on analog information processing which make them more suitable for energy-efficient hardware implementations of the functional units of the brain, namely, neurons and synapses. Despite extensive efforts to replicate the energy efficiency of the brain in standard von-Neumann architecture, the massive parallelism has remained elusive. This has led researchers to venture towards beyond von-Neumann computing or in-memory’ computing paradigms based on CMOS and post-CMOS technologies. However, implementations of such platforms in the electrical domain faces potential limitations of switching speed and interconnect losses. Integrated Photonics offers an welcome alternative to standard microelectronic’ platforms and have recently shown promise as a viable technology for spike-based neural processing systems. However, non-volatility has been identified as an important component of large-scale neuromorphic systems. Although the recent demonstrations of ultra-fast computing with phase-change materials (PCMs) show promise, the jump from standalone computing devices to parallel computing architecture is challenging. In this work, we utilize the optical properties of the PCM, Ge$_2$Sb$_2$Te$_5$ (GST), to propose an all-Photonic Spiking Neural Network, comprising of a non-volatile synaptic array integrated seamlessly with previously explored integrate-and-fire’ neurons to realize an in-memory’ computing platform leveraging the inherent parallelism of wavelength-division-multiplexing (WDM). The proposed design not only bridges the gap between isolated computing devices and parallel large-scale implementation, but also paves the way for ultra-fast computing and localized on-chip learning.