Part of Speech Based Term Weighting for Information Retrieval

Automatic language processing tools typically assign to terms so-called weights corresponding to the contribution of terms to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the POS contexts in which it generally occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval, by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF & BM25 as baselines, show that integrating our POS-based term weights to retrieval always leads to gains (up to +33.7% from the baseline). Additional experiments with a different retrieval model as baseline (Language Model with Dirichlet priors smoothing) and our best performing POS-based term weight, show retrieval gains always and consistently across the whole smoothing range of the baseline.

Streaming Pattern Matching with d Wildcards

In the pattern matching with d wildcards problem one is given a text T of length n and a pattern P of length m that contains d wildcard characters, each denoted by a special symbol '?'. A wildcard character matches any other character. The goal is to establish for each m-length substring of T whether it matches P. In the streaming model variant of the pattern matching with d wildcards problem the text T arrives one character at a time and the goal is to report, before the next character arrives, if the last m characters match P while using only o(m) words of space. In this paper we introduce two new algorithms for the d wildcard pattern matching problem in the streaming model. The first is a randomized Monte Carlo algorithm that is parameterized by a constant 0\leq \delta \leq 1. This algorithm uses \tilde{O}(d^{1-\delta}) amortized time per character and \tilde{O}(d^{1+\delta}) words of space. The second algorithm, which is used as a black box in the first algorithm, is a randomized Monte Carlo algorithm which uses O(d+\log m) worst-case time per character and O(d\log m) words of space.

Learning Combinatorial Optimization Algorithms over Graphs

Many combinatorial optimization problems over graphs are NP-hard, and require significant specialized knowledge and trial-and-error to design good heuristics or approximation algorithms. Can we automate this challenging and tedious process, and learn the algorithms instead? In many real world applications, it is typically the case that the same type of optimization problem is solved again and again on a regular basis, maintaining the same problem structure but differing in the data. This provides an opportunity for learning heuristic algorithms which can exploit the structure of such recurring problems. In this paper, we propose a unique combination of reinforcement learning and graph embedding to address this challenge. The learned greedy policy behaves like a meta-algorithm which incrementally constructs a solution, and the action is determined by the output of a graph embedding network capturing the current state of the solution. We show that our framework can be applied to a diverse range of optimization problems over graphs, and provide evidence that our learning approach can compete with or outperform specialized heuristics or approximation algorithms for the Minimum Vertex Cover, Maximum Cut and Traveling Salesman Problems.

Enabling Smart Data: Noise filtering in Big Data classification

In any knowledge discovery process the value of extracted knowledge is directly related to the quality of the data used. Big Data problems, generated by massive growth in the scale of data observed in recent years, also follow the same dictate. A common problem affecting data quality is the presence of noise, particularly in classification problems, where label noise refers to the incorrect labeling of training instances, and is known to be a very disruptive feature of data. However, in this Big Data era, the massive growth in the scale of the data poses a challenge to traditional proposals created to tackle noise, as they have difficulties coping with such a large amount of data. New algorithms need to be proposed to treat the noise in Big Data problems, providing high quality and clean data, also known as Smart Data. In this paper, two Big Data preprocessing approaches to remove noisy examples are proposed: an homogeneous ensemble and an heterogeneous ensemble filter, with special emphasis in their scalability and performance traits. The obtained results show that these proposals enable the practitioner to efficiently obtain a Smart Dataset from any Big Data classification problem.

Neural Question Generation from Text: A Preliminary Study

Automatic question generation aims to generate questions from a text passage where the generated questions can be answered by certain sub-spans of the given passage. Traditional methods mainly use rigid heuristic rules to transform a sentence into related questions. In this work, we propose to apply the neural encoder-decoder model to generate meaningful and diverse questions from natural language sentences. The encoder reads the input text and the answer position, to produce an answer-aware input representation, which is fed to the decoder to generate an answer focused question. We conduct a preliminary study on neural question generation from text with the SQuAD dataset, and the experiment results show that our method can produce fluent and diverse questions.

An Online Hierarchical Algorithm for Extreme Clustering

Many modern clustering methods scale well to a large number of data items, N, but not to a large number of clusters, K. This paper introduces PERCH, a new non-greedy algorithm for online hierarchical clustering that scales to both massive N and K–a problem setting we term extreme clustering. Our algorithm efficiently routes new data points to the leaves of an incrementally-built tree. Motivated by the desire for both accuracy and speed, our approach performs tree rotations for the sake of enhancing subtree purity and encouraging balancedness. We prove that, under a natural separability assumption, our non-greedy algorithm will produce trees with perfect dendrogram purity regardless of online data arrival order. Our experiments demonstrate that PERCH constructs more accurate trees than other tree-building clustering algorithms and scales well with both N and K, achieving a higher quality clustering than the strongest flat clustering competitor in nearly half the time.

Approximate Clustering with Same-Cluster Queries

Ashtiani et al. proposed a Semi-Supervised Active Clustering framework (SSAC), where the learner is allowed to make adaptive queries to a domain expert. The queries are of the kind ‘do two given points belong to the same optimal cluster?’ There are many clustering contexts where such same-cluster queries are feasible. Ashtiani et al. exhibited the power of such queries by showing that any instance of the k-means clustering problem, with additional margin assumption, can be solved efficiently if one is allowed O(k^2 \log{k} + k \log{n}) same-cluster queries. This is interesting since the k-means problem, even with the margin assumption, is \mathsf{NP}-hard. In this paper, we extend the work of Ashtiani et al. to the approximation setting showing that a few of such same-cluster queries enables one to get a polynomial-time (1 + \varepsilon)-approximation algorithm for the k-means problem without any margin assumption on the input dataset. Again, this is interesting since the k-means problem is \mathsf{NP}-hard to approximate within a factor (1 + c) for a fixed constant 0 < c < 1. The number of same-cluster queries used is \textrm{poly}(k/\varepsilon) which is independent of the size n of the dataset. Our algorithm is based on the D^2-sampling technique, also known as the k-means++ seeding algorithm. We also give a conditional lower bound on the number of same-cluster queries showing that if the Exponential Time Hypothesis (ETH) holds, then any such efficient query algorithm needs to make \Omega \left(\frac{k}{poly \log k} \right) same-cluster queries. Another result we show with respect to the k-means++ seeding algorithm is that a small modification to the k-means++ seeding algorithm within the SSAC framework converts it to a constant factor approximation algorithm instead of the well known O(\log{k})-approximation algorithm.

Massive Data Clustering in Moderate Dimensions from the Dual Spaces of Observation and Attribute Data Clouds

Cluster analysis of very high dimensional data can benefit from the properties of such high dimensionality. Informally expressed, in this work, our focus is on the analogous situation when the dimensionality is moderate to small, relative to a massively sized set of observations. Mathematically expressed, these are the dual spaces of observations and attributes. The point cloud of observations is in attribute space, and the point cloud of attributes is in observation space. In this paper, we begin by summarizing various perspectives related to methodologies that are used in multivariate analytics. We draw on these to establish an efficient clustering processing pipeline, both partitioning and hierarchical clustering.

Encoder Based Lifelong Learning

This paper introduces a new lifelong learning solution where a single model is trained for a sequence of tasks. The main challenge that vision systems face in this context is catastrophic forgetting: as they tend to adapt to the most recently seen task, they lose performance on the tasks that were learned previously. Our method aims at preserving the knowledge of the previous tasks while learning a new one by using autoencoders. For each task, an under-complete autoencoder is learned, capturing the features that are crucial for its achievement. When a new task is presented to the system, we prevent the reconstructions of the features with these autoencoders from changing, which has the effect of preserving the information on which the previous tasks are mainly relying. At the same time, the features are given space to adjust to the most recent environment as only their projection into a low dimension submanifold is controlled. The proposed system is evaluated on image classification tasks and shows a reduction of forgetting over the state-of-the-art

From Data to City Indicators: A Knowledge Graph for Supporting Automatic Generation of Dashboards

In the context of Smart Cities, indicator definitions have been used to calculate values that enable the comparison among different cities. The calculation of an indicator values has challenges as the calculation may need to combine some aspects of quality while addressing different levels of abstraction. Knowledge graphs (KGs) have been used successfully to support flexible representation, which can support improved understanding and data analysis in similar settings. This paper presents an operational description for a city KG, an indicator ontology that support indicator discovery and data visualization and an application capable of performing metadata analysis to automatically build and display dashboards according to discovered indicators. We describe our implementation in an urban mobility setting.

Probing many-body localization with neural networks

Bag-of-Words Method Applied to Accelerometer Measurements for the Purpose of Classification and Energy Estimation

Rhetorical relations for information retrieval

Preliminary Experiments using Subjective Logic for the Polyrepresentation of Information Needs

Nonnegative/binary matrix factorization with a D-Wave quantum annealer

A Subjective Logic Formalisation of the Principle of Polyrepresentation for Information Needs

Study on a Low Complexity ECG Compression Scheme with Multiple Sensors

Uniform deviation and moment inequalities for random polytopes with general densities in arbitrary convex bodies

Multitask Learning with Low-Level Auxiliary Tasks for Encoder-Decoder Based Speech Recognition

Positive Semidefiniteness of Matrices arising from Ramsey Theory

Generating large Ising models with Markov structure via simple linear relations

Greed is Good: Near-Optimal Submodular Maximization via Greedy Optimization

Automatic Measurement of Pre-aspiration

A Complexity Trichotomy for the Six-Vertex Model

Explicit Determination in ${\Bbb R}^{N}$ of $(N-1)$-Dimensional Area Minimizing Surfaces with Arbitrary Boundaries

The impact of random actions on opinion dynamics

No two starlikes have equal index

The Relative Performance of Ensemble Methods with Deep Convolutional Neural Networks for Image Classification

Optimal transport and integer partitions

Vico-Greengard-Ferrando quadratures in the tensor solver for integral equations

Joint Maximum a Posteriori State Path and Parameter Estimation in Stochastic Differential Equations

On Tests for Complete Independence of Normal Random Vectors

Multi-Personality Partitioning for Heterogeneous Systems

Lower bounds on the 2-adic complexity of modified Jacobi sequences

Multi-space Variational Encoder-Decoders for Semi-supervised Labeled Sequence Transduction

A Syntactic Neural Model for General-Purpose Code Generation

Accelerated Stochastic Quasi-Newton Optimization on Riemann Manifolds

Learning Certifiably Optimal Rule Lists for Categorical Data

Adequacy of the Gradient-Descent Method for Classifier Evasion Attacks

Generate To Adapt: Aligning Domains using Generative Adversarial Networks

On The Modified Newman-Watts Small World and Its Random Walk

On The Waiting Time for A M/M/1 Queue with Impatience

Action Representation Using Classifier Decision Boundaries

Beyond triplet loss: a deep quadruplet network for person re-identification

On the equivalence between multiclass PS-type scheduling policies

A stochastic molecular scheme for an artificial cell to infer its environment from partial observations

Object-Part Attention Driven Discriminative Localization for Fine-grained Image Classification

Transferrable Plausibility Model – A Probabilistic Interpretation of Mathematical Theory of Evidence

How to Make an Image More Memorable? A Deep Style Transfer Approach

MRA – Proof of Concept of a Multilingual Report Annotator Web Application

Enhance Feature Discrimination for Unsupervised Hashing

A Multi-view Context-aware Approach to Android Malware Detection and Malicious Code Localization

Joint Trajectory and Communication Design for UAV-Enabled Multiple Access

Noether resolutions in dimension $2$

Geometry of Policy Improvement

Prototyping and Experimentation of a Closed-Loop Wireless Power Transmission with Channel Acquisition and Waveform Optimization

Solution to dynamic economic dispatch with prohibited operating zones via MILP

Contextual Data Collection for Smart Cities

Human-Aware Sensor Network Ontology: Semantic Support for Empirical Data Collection

Higher-Order Minimum Cost Lifted Multicuts for Motion Segmentation

Incremental Transductive Learning Approaches to Schistosomiasis Vector Classification

Infinite rank surface cluster algebras

The Theta Number of Simplicial Complexes

Report on TBAS 2012: Workshop on Task-Based and Aggregated Search

Maximum a Posteriori Joint State Path and Parameter Estimation in Stochastic Differential Equations

Fixed versus Dynamic Co-Occurrence Windows in TextRank Term Weights for Information Retrieval

A Service-Oriented Architecture for Assisting the Authoring of Semantic Crowd Maps

Predictive Control for Energy Management in Ship Power Systems under High-power Ramp Rate Loads

Tackling Dynamic Vehicle Routing Problem with Time Windows by means of Ant Colony System

On the wildness of cambrian lattices

Robust Causal Estimation in the Large-Sample Limit without Strict Faithfulness

Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear Running Time

A Convolution Tree with Deconvolution Branches: Exploiting Geometric Relationships for Single Shot Keypoint Detection

Linear Optimal Power Flow Using Cycle Flows

Landmark Guided Probabilistic Roadmap Queries

Conformative Filtering for Implicit Feedback Data

On Multi-source Networks: Enumeration, Rate Region Computation, and Hierarchy

Disjointness graphs of segments

Improved Decoding and Error Floor Analysis of Staircase Codes

Statistical Efficiency of Compositional Nonparametric Prediction

Online Hashing

Local Behavior of Airy Processes

Geometry of Log-Concave Density Estimation

The Proof of CSP Dichotomy Conjecture

Downlink Power Optimization for Heterogeneous Networks with Time Reversal-based Transmission under Backhaul Limitation

Cooperative network localization using hybrid range and angle measurements

Sandwiches Missing Two Ingredients of Order Four

Power-and Rate-Adaptation Improves the Effective Capacity of C-RAN for Nakagami-$m$ Fading Channels

Automated Latent Fingerprint Recognition

Semantically-Guided Video Object Segmentation

Short Labeling Schemes for Topology Recognition in Wireless Tree Networks

Lyapunov criteria for uniform convergence of conditional distributions of absorbed Markov processes

The Matching Problem in General Graphs is in Quasi-NC

Promise Constraint Satisfaction: Algebraic Structure and a Symmetric Boolean Dichotomy

The Interplay of Semantics and Morphology in Word Embeddings

ActiVis: Visual Exploration of Industry-Scale Deep Neural Network Models

The quality of priority ratios estimation in relation to a selected prioritization procedure and consistency measure for a Pairwise Comparison Matrix

Limits of the boundary of random planar maps

Swap connectivity for two graph spaces between simple and pseudo graphs and disconnectivity for triangle constraints