R Packages worth a look

Model-Based Detection of Disease Clusters (DClusterm)
Model-based methods for the detection of disease clusters using GLMs, GLMMs and zero-inflated models.

Score Test Based on Saddlepoint Approximation (SPAtest)
Performs score test using saddlepoint approximation to estimate the null distribution.

Expected/Observed Fisher Information and Bias-Corrected Maximum Likelihood Estimate(s) (
Calculates the expected/observed Fisher information and the bias-corrected maximum likelihood estimate(s) via Cox-Snell Methodology.


Distilled News

Data Matching – Entity Identification, Resolution & Linkage

Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several source systems. The entities under consideration most commonly refer to people, places, publications or citations, consumer products, or businesses. Besides data matching, the names most prominently used are record or data linkage, entity resolution, object identification, or field matching. A major challenge in data matching is the lack of common entity identifiers across different source systems to be matched. As a result of this, the matching needs to be conducted using attributes that contain partially identifying information, such as names, addresses, or dates of birth. However, such identifying information is often of low quality: it suffers from frequently occurring typographical variations and human errors, it can change over time, or it is only partially available in the sources to be matched. In the past decade, significant advances have been achieved in many aspects of the data matching process, but especially on how to improve the accuracy of data matching, and how to scale data matching to very large systems that contain many millions of records. This work has been conducted by researchers in various fields, including applied statistics, health sciences, data mining, machine learning, artificial intelligence, information systems, information retrieval, knowledge engineering, the database and data warehousing communities, and researchers working in the field of digital libraries.
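
As a rough illustration of matching on partially identifying attributes, the sketch below compares two records by approximate string similarity using Python's standard difflib module. The field names, the plain averaging of scores, and the 0.8 threshold are assumptions made for this example, not something taken from the text above.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.8) -> bool:
    """Compare two records on name and address (hypothetical fields).

    The score is the plain average of the per-field similarities; the
    threshold is an illustrative value, not a recommended setting.
    """
    fields = ("name", "address")
    score = sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)
    return score >= threshold


# Made-up records: a and b differ only by a typo and an abbreviation.
a = {"name": "Jonathan Smith", "address": "12 Baker Street"}
b = {"name": "Jonathon Smith", "address": "12 Baker St."}
c = {"name": "Maria Garcia", "address": "48 Elm Avenue"}

print(is_match(a, b))
print(is_match(a, c))
```

With these toy records the first pair is treated as a match and the second is not. Real systems typically add blocking, field-specific comparators, and learned weights, none of which are shown here.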

Perceptron: the main component of neural networks

One of the hottest topics in artificial intelligence is neural networks. Neural networks are computational models based on the structure of the brain. These are information processing structures whose most significant property is their ability to learn from data. These techniques have achieved great success in domains ranging from marketing to engineering. There are many different types of neural networks, of which the multilayer perceptron is the most important one. The characteristic neuron model in the multilayer perceptron is the so-called perceptron. In this article we will explain the mathematics of this neuron model.
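
The teaser does not spell out the article's formulation, so the following is only a minimal sketch of the textbook perceptron neuron: a weighted sum of the inputs plus a bias, passed through an activation function. The tanh activation and the parameter values are assumptions chosen for illustration.

```python
import math


def perceptron(inputs, weights, bias, activation=math.tanh):
    """Single neuron: activation(bias + sum_i weights[i] * inputs[i])."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    combination = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(combination)


# Hypothetical parameter values chosen only for illustration.
output = perceptron(inputs=[1.0, -2.0], weights=[0.5, 0.25], bias=0.0)
print(output)
```

Here the weighted sum is 0.5 - 0.5 = 0, so the neuron outputs tanh(0) = 0.0. Learning would consist of adjusting the weights and bias, which this sketch leaves out.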

Gentlest Introduction to Tensorflow #1

Tensorflow (TF) is Google’s attempt to put the power of Deep Learning into the hands of developers around the world. It comes with a beginner & an advanced tutorial, as well as a course on Udacity. However, the materials attempt to introduce both ML and TF concurrently to solve a multi-feature problem (character recognition), which, albeit interesting, unnecessarily convolutes understanding. In this series of articles, we present the gentlest introduction to TF, one that starts off by showing how to do linear regression for a single-feature problem and expands from there.
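
To make the single-feature starting point concrete, here is a hedged sketch of linear regression fitted by gradient descent. It is written in plain Python rather than TensorFlow, so it shows only the underlying idea and not the series' actual code; the data, learning rate, and step count are made up.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

On this noise-free data the fitted slope and intercept end up very close to 2 and 1. A framework such as TF automates the gradient computation that is written out by hand here.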

Artificial intelligence: Cooperation vs. aggression

There’s been a lot of buzz about some experiments at DeepMind that study whether AI systems will be aggressive or collaborative when playing a game. Players gather virtual apples; they have the ability to temporarily incapacitate an opponent by ‘shooting’ a virtual ‘laser.’ And humans are surprised that AIs at times decide that it’s to their advantage to shoot their opponent, rather than peacefully gathering apples.

Data Transformation in R: The #Tidyverse-Approach of Organizing Data #rstats

Yesterday, I had the pleasure to give a talk at the 8th Hamburg R User-Group meeting. I talked about data wrangling and data transformation, and how the philosophy behind the tidyverse makes these tasks easier. If you like, you can download the slides. Feel free to add your comments to the slide here.

How to make a global map in R, step by step

Maps are great for practicing data visualization. First of all, there’s a lot of data available on places like Wikipedia that you can map. Moreover, creating maps typically requires several essential skills in combination. Specifically, you commonly need to be able to retrieve the data (e.g., scrape it), mold it into shape, perform a join, and visualize it. Because creating maps requires several skills from data manipulation and data visualization, creating them will be great practice for you. And if that’s not enough, a good map just looks great. They’re visually compelling. With that in mind, I want to walk you through the logic of building one step by step.

Document worth reading: “Learning to Hash for Indexing Big Data – A Survey”

The explosive growth in big data has attracted much attention in designing efficient indexing and search methods recently. In many critical applications such as large-scale search and pattern matching, finding the nearest neighbors to a query is a fundamental research problem. However, the straightforward solution using exhaustive comparison is infeasible due to the prohibitive computational complexity and memory requirement. In response, Approximate Nearest Neighbor (ANN) search based on hashing techniques has become popular due to its promising performance in both efficiency and accuracy. Prior randomized hashing methods, e.g., Locality-Sensitive Hashing (LSH), explore data-independent hash functions with random projections or permutations. Although having elegant theoretic guarantees on the search quality in certain metric spaces, performance of randomized hashing has been shown insufficient in many real-world applications. As a remedy, new approaches incorporating data-driven learning methods in development of advanced hash functions have emerged. Such learning to hash methods exploit information such as data distributions or class labels when optimizing the hash codes or functions. Importantly, the learned hash codes are able to preserve the proximity of neighboring data in the original feature spaces in the hash code spaces. The goal of this paper is to provide readers with systematic understanding of insights, pros and cons of the emerging techniques. We provide a comprehensive survey of the learning to hash framework and representative techniques of various types, including unsupervised, semi-supervised, and supervised. In addition, we also summarize recent hashing approaches utilizing the deep learning models. Finally, we discuss the future direction and trends of research in this area. Learning to Hash for Indexing Big Data – A Survey
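
For readers unfamiliar with the data-independent baseline mentioned in the abstract, the sketch below shows a sign-random-projection hash, one common LSH family. It is not one of the learned hashing methods the survey covers, and the function names, bit count, and seed are assumptions for this example.

```python
import random


def make_hash_function(dim, n_bits, seed=0):
    """Build a sign-random-projection hash: one random hyperplane per bit."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def hash_vector(vec):
        # Each bit records which side of a hyperplane the vector falls on.
        return tuple(
            1 if sum(p * v for p, v in zip(plane, vec)) >= 0.0 else 0
            for plane in planes
        )

    return hash_vector


def hamming(code_a, code_b):
    """Number of bit positions where two hash codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


h = make_hash_function(dim=3, n_bits=16)
v = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]
opposite = [-1.0, -2.0, -3.0]

print(hamming(h(v), h(same_direction)))
print(hamming(h(v), h(opposite)))
```

Vectors pointing in the same direction receive identical codes, while opposite vectors differ in every bit, so Hamming distance between codes tracks the angle between vectors. Learning-to-hash methods replace the random hyperplanes with projections fitted to the data.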

What’s new on arXiv

Developing a comprehensive framework for multimodal feature extraction

Feature extraction is a critical component of many applied data science workflows. In recent years, rapid advances in artificial intelligence and machine learning have led to an explosion of feature extraction tools and services that allow data scientists to cheaply and effectively annotate their data along a vast array of dimensions, ranging from detecting faces in images to analyzing the sentiment expressed in coherent text. Unfortunately, the proliferation of powerful feature extraction services has been mirrored by a corresponding expansion in the number of distinct interfaces to feature extraction services. In a world where nearly every new service has its own API, documentation, and/or client library, data scientists who need to combine diverse features obtained from multiple sources are often forced to write and maintain ever more elaborate feature extraction pipelines. To address this challenge, we introduce a new open-source framework for comprehensive multimodal feature extraction. Pliers is an open-source Python package that supports standardized annotation of diverse data types (video, images, audio, and text), and is expressly designed with both ease-of-use and extensibility in mind. Users can apply a wide range of pre-existing feature extraction tools to their data in just a few lines of Python code, and can also easily add their own custom extractors by writing modular classes. A graph-based API enables rapid development of complex feature extraction pipelines that output results in a single, standardized format. We describe the package’s architecture, detail its major advantages over previous feature extraction toolboxes, and use a sample application to a large functional MRI dataset to illustrate how pliers can significantly reduce the time and effort required to construct sophisticated feature extraction workflows while increasing code clarity and maintainability.

A Graphical Evolutionary Game Approach to Social Learning

In this work, we study the social learning problem, in which agents of a networked system collaborate to detect the state of nature based on their private signals. A novel distributed graphical evolutionary game theoretic learning method is proposed. In the proposed game-theoretic method, agents only need to communicate their binary decisions rather than the real-valued beliefs with their neighbors, which endows the method with low communication complexity. Under mean field approximations, we theoretically analyze the steady state equilibria of the game and show that the evolutionarily stable states (ESSs) coincide with the decisions of the benchmark centralized detector. Numerical experiments are implemented to confirm the effectiveness of the proposed game-theoretic learning method.

Reinforcement Learning Based Argument Component Detection

Argument component detection (ACD) is an important sub-task in argumentation mining. ACD aims at detecting and classifying different argument components in natural language texts. Historical annotations (HAs) are important features the human annotators consider when they manually perform the ACD task. However, HAs are largely ignored by existing automatic ACD techniques. Reinforcement learning (RL) has proven to be an effective method for using HAs in some natural language processing tasks. In this work, we propose an RL-based ACD technique, and evaluate its performance on two well-annotated corpora. Results suggest that, in terms of classification accuracy, HAs-augmented RL outperforms plain RL by at most 17.85%, and outperforms the state-of-the-art supervised learning algorithm by at most 11.94%.

SAR: A Semantic Analysis Approach for Recommendation

Recommendation systems are in common demand in daily life, and matrix completion is a widely adopted technique for this task. However, most matrix completion methods lack semantic interpretation and usually result in weak-semantic recommendations. To this end, this paper proposes a Semantic Analysis approach for Recommendation systems (SAR), which applies a two-level hierarchical generative process that assigns semantic properties and categories for user and item. SAR learns semantic representations of users/items merely from user ratings on items, which offers a new path to recommendation by semantic matching with the learned representations. Extensive experiments demonstrate SAR outperforms other state-of-the-art baselines substantially.

Demystifying Fog Computing: Characterizing Architectures, Applications and Abstractions

Internet of Things (IoT) has accelerated the deployment of millions of sensors at the edge of the network, through Smart City infrastructure and lifestyle devices. Cloud computing platforms are often tasked with handling these large volumes and fast streams of data from the edge. Recently, Fog computing has emerged as a concept for low-latency and resource-rich processing of these observation streams, to complement Edge and Cloud computing. In this paper, we review various dimensions of system architecture, application characteristics and platform abstractions that are manifest in this Edge, Fog and Cloud eco-system. We highlight novel capabilities of the Edge and Fog layers, such as physical and application mobility, privacy sensitivity, and a nascent runtime environment. IoT application case studies based on first-hand experiences across diverse domains drive this categorization. We also highlight the gap between the potential and the reality of Fog computing, and identify challenges that need to be overcome for the solution to be sustainable. Together, our article can help platform and application developers bridge the gap that remains in making Fog computing viable.

Interpreting Outliers: Localized Logistic Regression for Density Ratio Estimation

We propose an inlier-based outlier detection method capable of both identifying the outliers and explaining why they are outliers, by identifying the outlier-specific features. Specifically, we employ an inlier-based outlier detection criterion, which uses the ratio of inlier and test probability densities as a measure of plausibility of being an outlier. For estimating the density ratio function, we propose a localized logistic regression algorithm. Thanks to the locality of the model, variable selection can be outlier-specific, and will help interpret why points are outliers in a high-dimensional space. Through synthetic experiments, we show that the proposed algorithm can successfully detect the important features for outliers. Moreover, we show that the proposed algorithm tends to outperform existing algorithms in benchmark datasets.

Enabling Multi-Source Neural Machine Translation By Concatenating Source Sentences In Multiple Languages

An Inequality for the Correlation of Two Functions Operating on Symmetric Bivariate Normal Variables

Bijections for Dyck paths with all peak heights of the same parity

Throughput Optimal Beam Alignment in Millimeter Wave Networks

1-Fan-Bundle-Planar Drawings of Graphs

Bayesian Boolean Matrix Factorisation

A Rollback in the History of Communication-Induced Checkpointing

Structured signal recovery from quadratic measurements: Breaking sample complexity barriers via nonconvex optimization

MOLIERE: Automatic Biomedical Hypothesis Generation System

Coinfection in a stochastic model for bacteriophage systems

Stochastic epidemic SEIRS models with a constant latency period

Are tumor cell lineages solely shaped by mechanical forces?

Characterization of exponential distribution through bivariate regression of record values revisited

The Dialog State Tracking Challenge with Bayesian Approach

Spanning Trees and Spanning Eulerian Subgraphs with Small Degrees. II

Driving-induced many-body localization

Uniform Inference for High-dimensional Quantile Regression: Linear Functionals and Regression Rank Scores

Efficient Dense Labeling of Human Activity Sequences from Wearables using Fully Convolutional Networks

Implementation of a Distributed Coherent Quantum Observer

Filtering Tweets for Social Unrest

An Online Optimization Approach for Multi-Agent Tracking of Dynamic Parameters in the Presence of Adversarial Noise

Determination of hysteresis in finite-state random walks using Bayesian cross validation

Eigenvector spatial filtering for large data sets: fixed and random effects approaches

Noise Models in the Nonlinear Spectral Domain for Optical Fibre Communications

A $(5,5)$-coloring of $K_n$ with few colors

Learning to Generate Posters of Scientific Papers by Probabilistic Graphical Models

Beating the World’s Best at Super Smash Bros. with Deep Reinforcement Learning

Intrinsically Knotted and 4-Linked Directed Graphs

On a Class of First-order Primal-Dual Algorithms for Composite Convex Minimization Problems

Learning to generate one-sentence biographies from Wikidata

Exact tensor completion with sum-of-squares

Sample Efficient Policy Search for Optimal Stopping Domains

Best Linear Predictor with Missing Response: Locally Robust Approach

Player Skill Decomposition in Multiplayer Online Battle Arenas

Global linear convergent algorithm to compute the minimum volume enclosing ellipsoid

A $(1.4 + ε)$-approximation algorithm for the $2$-Max-Duo problem

The Power of Sparsity in Convolutional Neural Networks

Information-Theoretic Perspectives on Brascamp-Lieb Inequality and Its Reverse

Weighted Motion Averaging for the Registration of Multi-View Range Scans

Semilattices of infinite breadth: structure theory and instability of filters

Memory and Communication Efficient Distributed Stochastic Optimization with Minibatch Prox

Projection based advanced motion model for cubic mapping for 360-degree video

Column normalization of a random measurement matrix

On the (Statistical) Detection of Adversarial Examples

The numbers of edges of 5-polytopes

Quasimartingales associated to Markov processes

Review of Apriori Based Algorithms on MapReduce Framework

Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection

Visual Tracking by Reinforced Decision Making

Learning Compact Appearance Representation for Video-based Person Re-Identification

Convolution Aware Initialization

Computing Influence of a Product through Uncertain Reverse Skyline

Marginals with finite Coulomb cost

Matrix factorizations of correlation matrices and applications

Uniform weak attractivity and criteria for practical uniform asymptotic stability

Is Saki #delicious? The Food Perception Gap on Instagram and Its Relation to Health

Towards a Common Implementation of Reinforcement Learning for Multiple Robotic Tasks

Just DIAL: DomaIn Alignment Layers for Unsupervised Domain Adaptation

Synthesizing Imperative Programs for Introductory Programming Assignments

Edge-Fog Cloud: A Distributed Cloud for Internet of Things Computations

Hybrid Dialog State Tracker with ASR Features

Fast rates for online learning in Linearly Solvable Markov Decision Processes

The network concept of creativity and deep thinking. Applications to social opinion formation and talent support

Ultra-Reliable Short-Packet Communications with Wireless Energy Transfer

Positive-Unlabeled Demand-Aware Recommendation

Spectral radius of uniform hypergraphs and degree sequences

Object Detection in Videos with Tubelet Proposal Networks

Kalman filter tracking on parallel architectures

Quantum discord of states arising from graphs

Negative-Unlabeled Tensor Factorization for Location Category Inference from Inaccurate Mobility Data

Stochastic graph Voronoi tessellation reveals community structure

Linear-Time Tree Containment in Phylogenetic Networks

Automatic implementation of material laws: Jacobian calculation in a finite element code with TAPENADE

Answering Conjunctive Queries under Updates

Scalable computation for optimal control of cascade systems with constraints

Mimicking Ensemble Learning with Deep Branched Networks

Multi-task Learning with CTC and Segmental CRF for Speech Recognition

Compressive Channel Estimation and Multi-user Detection in C-RAN

Energy-Efficient Wireless Content Delivery with Proactive Caching

Deep Geometric Retrieval

Causal Inference on Multivariate Mixed-Type Data by Minimum Description Length

Axiomatic phylogenetics

On convergence for graphexes

Finite Horizon Energy-Efficient Scheduling with Energy Harvesting Transmitters over Fading Channels

On edge exchangeable random graphs

Delving Deeper into MOOC Student Dropout Prediction

Edge states in non-Fermi liquids

General Semiparametric Shared Frailty Model Estimation and Simulation with frailtySurv

A Discriminative Event Based Model for Alzheimer’s Disease Progression Modeling

Addition Theorems in Fp via the Polynomial Method

Stochastic Composite Least-Squares Regression with convergence rate O(1/n)

Phase Transitions of Spectral Initialization for High-Dimensional Nonconvex Estimation

Contract-Theoretic Resource Allocation for Critical Infrastructure Protection

The meet operation in the imbalance lattice of maximal instantaneous codes: alternative proof of existence

Admissibility in Concurrent Games

BrnoCompSpeed: Review of Traffic Camera Calibration and Comprehensive Dataset for Monocular Speed Measurement

Phaseless Sampling and Reconstruction of Real-Valued Signals in Shift-Invariant Spaces

Almost-sure asymptotic for the number of heaps inside a random sequence

Several Classes of Permutation Trinomials over $\mathbb F_{5^n}$ From Niho Exponents

Traffic Surveillance Camera Calibration by 3D Model Bounding Box Alignment for Accurate Vehicle Speed Measurement

3-Dimensional Optical Orthogonal Codes with Ideal Autocorrelation-Bounds and Optimal Constructions

Online Representation Learning with Multi-layer Hebbian Networks for Image Classification Tasks

A Unified Optimization View on Generalized Matching Pursuit and Frank-Wolfe

Crowd Sourcing Image Segmentation with iaSTAPLE

Predicting non-linear dynamics: a stable local learning scheme for recurrent spiking neural networks

Efficient Social Network Multilingual Classification using Character, POS n-grams and Dynamic Normalization

On distinguishing special trees by their chromatic symmetric functions

Systèmes du LIA à DEFT’13

Stochastic differential games for a multiclass M/M/1 queueing problem with model uncertainty

Distributed Estimation of Principal Eigenspaces

Timely CSI Acquisition Exploiting Full Duplex

Sweeping Processes Perturbed by Rough Signals

Total Forcing Sets in Trees

When can Graph Hyperbolicity be computed in Linear Time?

Iterative bidding in electricity markets: rationality and robustness

PixelNet: Representation of the pixels, by the pixels, and for the pixels

Algorithmes de classification et d’optimisation: participation du LIA/ADOC à DEFT’14

Semiparametric panel data models using neural networks

A data-driven basis for direct estimation of functionals of distributions

A New Approach to the $r$-Whitney Numbers by Using Combinatorial Differential Calculus

VidLoc: 6-DoF Video-Clip Relocalization

Singular SPDEs in domains with boundaries

A Nonconvex Free Lunch for Low-Rank plus Sparse Matrix Recovery

If you did not already know: “Parallel and Interacting Stochastic Approximation Annealing (PISAA)”

We present the parallel and interacting stochastic approximation annealing (PISAA) algorithm, a stochastic simulation procedure for global optimisation, that extends and improves the stochastic approximation annealing (SAA) by using population Monte Carlo ideas. The standard SAA algorithm guarantees convergence to the global minimum when a square-root cooling schedule is used; however, the efficiency of its performance depends crucially on its self-adjusting mechanism. Because its mechanism is based on information obtained from only a single chain, SAA may present slow convergence in complex optimisation problems. The proposed algorithm involves simulating a population of SAA chains that interact with each other in a manner that ensures significant improvement of the self-adjusting mechanism and better exploration of the sampling space. Central to the proposed algorithm are the ideas of (i) recycling information from the whole population of Markov chains to design a more accurate/stable self-adjusting mechanism and (ii) incorporating more advanced proposals, such as crossover operations, for the exploration of the sampling space. PISAA presents a significantly improved performance in terms of convergence. PISAA can be implemented in parallel computing environments if available. We demonstrate the good performance of the proposed algorithm on challenging applications including Bayesian network learning and protein folding. Our numerical comparisons suggest that PISAA outperforms the simulated annealing, stochastic approximation annealing, and annealing evolutionary stochastic approximation Monte Carlo especially in high dimensional or rugged scenarios. … Parallel and Interacting Stochastic Approximation Annealing (PISAA)

Book Memo: “Discrete Probability Models and Methods”

Probability on Graphs and Trees, Markov Chains and Random Fields, Entropy and Coding
The emphasis in this book is placed on general models (Markov chains, random fields, random graphs), universal methods (the probabilistic method, the coupling method, the Stein-Chen method, martingale methods, the method of types) and versatile tools (Chernoff’s bound, Hoeffding’s inequality, Holley’s inequality) whose domain of application extends far beyond the present text. Although the examples treated in the book relate to the possible applications, in the communication and computing sciences, in operations research and in physics, this book is in the first instance concerned with theory. The level of the book is that of a beginning graduate course. It is self-contained, the prerequisites consisting merely of basic calculus (series) and basic linear algebra (matrices). The reader is not assumed to be trained in probability since the first chapters give in considerable detail the background necessary to understand the rest of the book.

Distilled News

Bayesian Inference via Simulated Annealing

I recently finished a course on discrete optimization and am currently working through Richard McElreath’s excellent textbook Statistical Rethinking. Combining the two, and duly jazzed by this video on the Traveling Salesman Problem, I thought I’d build a toy Bayesian model and try to optimize it via simulated annealing. This work was brief, amusing and experimental. The result is a simple Shiny app that contrasts MCMC search via simulated annealing versus the (more standard) Metropolis algorithm. While far from groundbreaking, I did pick up the following few bits of intuition along the way.
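
The contrast between the two searches can be sketched in a few lines. This is not the author's Shiny app: the toy model (a normal mean with a flat prior), the data, the proposal width, and the 1/t cooling schedule are all assumptions made for illustration.

```python
import math
import random


def log_posterior(mu, data, sigma=1.0):
    """Unnormalized log posterior of a normal mean under a flat prior."""
    return -sum((x - mu) ** 2 for x in data) / (2.0 * sigma ** 2)


def search(data, steps=5000, step_size=0.5, anneal=False, seed=0):
    """Random-walk search over mu.

    anneal=False: Metropolis at fixed temperature 1 (samples the posterior).
    anneal=True: temperature decays toward 0, so the walk settles near the mode.
    """
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, data)
    trace = []
    for t in range(1, steps + 1):
        temperature = 1.0 / t if anneal else 1.0  # illustrative cooling schedule
        proposal = mu + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, data)
        delta = candidate - current
        # Accept uphill moves always, downhill moves with prob exp(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            mu, current = proposal, candidate
        trace.append(mu)
    return trace


data = [2.1, 1.9, 2.4, 1.6, 2.0]  # made-up observations with mean 2.0
metropolis = search(data, anneal=False)
annealed = search(data, anneal=True)
print(sum(metropolis[1000:]) / len(metropolis[1000:]))
print(annealed[-1])
```

The Metropolis trace keeps wandering around the posterior, so its average estimates the posterior mean, whereas the annealed trace freezes close to the mode. The only difference between the two is the temperature dividing the log-density difference.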

How is Deep Learning Changing Data Science Paradigms?

Deep learning is changing everything – and it’s here to stay. Just as electronics and computers transformed all economic activities, artificial intelligence will reshape retailing, transport, manufacturing, medicine, telecommunications, heavy industry…even data science itself. And that list of applications is still growing, as is the list of complex tasks where AI does better than humans. Here at Schibsted we see the opportunities deep learning offers, and we’re excited to contribute.

What Exactly The Heck Are Prescriptive Analytics?

Prescriptive analytics is about using data and analytics to improve decisions and therefore the effectiveness of actions. Isn’t that what all analytics should be about? A hearty “yes” to that because, if analytics does not lead to more informed decisions and more effective actions, then why do it at all? Many wrongly and incompletely define prescriptive analytics as what comes after predictive analytics. Our research indicates that prescriptive analytics is not a specific type of analytics, but rather an umbrella term for many types of analytics that can improve decisions. Think of the term “prescriptive” as the goal of all these analytics (to make more effective decisions) rather than a specific analytical technique.

Creativity is Crucial in Data Science

Data science might not be seen as the most creative of pursuits. You add a load of data into a repository, and you crunch it at the other end to draw your conclusions. Data in, data out, where is the scope for creativity there? It is not like you are working with a blank canvas. For me, the definition of creativity is when you are able to make something out of nothing. This requires an incredible amount of imagination, and seeing past the obvious headline statistics to reach a deeper conclusion is the hallmark of a great Big Data professional.

Bots: What you need to know

Bots are a new, AI-driven way to interact with users in a variety of environments. As AI improves and users turn away from single-purpose apps and toward messaging interfaces, they could revolutionize customer service, productivity, and communication. Getting started with bots is as simple as using any of a handful of new bot platforms that aim to make bot creation easy; sophisticated bots require an understanding of natural language processing (NLP) and other areas of artificial intelligence. Bots use artificial intelligence to converse in human terms, usually through a lightweight messaging interface like Slack or Facebook Messenger, or a voice interface like Amazon Echo or Google Assistant. Since late 2015, bots have been the subject of immense excitement in the belief that they might replace mobile apps for many tasks and provide a flexible and natural interface for sophisticated AI technology.

Sentiment Analysis in R

Current research in finance and the social sciences utilizes sentiment analysis to understand human decisions in response to textual materials. While sentiment analysis has received great traction lately, the available tools are not yet living up to the needs of researchers. R in particular does not yet have the capabilities that most research requires. Our package “SentimentAnalysis” performs a sentiment analysis of textual contents in R. This implementation utilizes various existing dictionaries, such as General Inquirer, Harvard IV or Loughran-McDonald. Furthermore, it can also create customized dictionaries. The latter uses LASSO regularization as a statistical approach to select relevant terms based on an exogenous response variable.
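
The dictionary-based idea can be illustrated with a small sketch. It is written in Python and does not use or reproduce the SentimentAnalysis package's API; the word lists are tiny made-up stand-ins rather than any of the dictionaries named above, and the scoring rule is one simple choice among several.

```python
import re

# Tiny made-up word lists for illustration; NOT the General Inquirer,
# Harvard IV, or Loughran-McDonald dictionaries.
POSITIVE = {"gain", "growth", "improve", "strong"}
NEGATIVE = {"loss", "decline", "weak", "risk"}


def sentiment_score(text: str) -> float:
    """Return (positive - negative) / total word count, or 0.0 for empty text."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)


print(sentiment_score("Strong growth"))
print(sentiment_score("Weak sales and a loss"))
```

Scores lie between -1 and 1, with the sign giving the direction of sentiment. A customized dictionary, as described above, would replace the hand-picked word lists with terms selected against a response variable.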

Yes, you can run R in the cloud securely

Once thought of as the ‘little programming language that could’, R has fundamentally transformed the way data scientists and organisations use their data. It gives businesses the power to leverage big data and develop predictive models that enable action, not just reaction. But R isn’t just another programming language. R is a rich ecosystem of more than 10,000 packages, test data and model evaluations that make powerful predictive analytics possible. This is good for data scientists in companies innovating on the edge of industries, but it can be bad news for enterprise security. Why? Because R packages contain executable code. And as with all software you download over the internet, you need to be aware of the security risks. That doesn’t mean you can’t run R in the cloud securely. You can, and you should.

PCA – hierarchical tree – partition: Why do we need to choose for visualizing data?

Principal component methods such as PCA (principal component analysis) or MCA (multiple correspondence analysis) can be used as a pre-processing step before clustering. But principal component methods also provide a framework for visualizing data. Thus, the clustering results can be represented on the map provided by the principal component method. In the figure below, the hierarchical tree is represented in 3D on the principal component map (using the first two components obtained with PCA). A partition has then been made, and individuals are coloured according to the cluster they belong to.

Beyond Deep Learning – 3rd Generation Neural Nets

If Deep Learning is powered by 2nd generation neural nets, what will the 3rd generation look like? What new capabilities does that imply and when will it get here?

Data Manipulation and Visualization with Pandas and Seaborn – A Practical Introduction

In this notebook, I’m going to demonstrate with practical examples various concepts and methods related to Pandas and Seaborn. I will rely on the data format I used for my Facebook Conversation Analyzer project. For seemingly obvious reasons I didn’t use a personal conversation but automatically generated a fake and nonsensical one. Projecting or imagining some conversation relevant to you will most likely help you to better understand and memorize the content of this notebook, even greater if you can play around with your actual data. The main topic is data manipulation with Pandas, for example function application, groupby, aggregation and multi-indexes. All along I’ll mention handy tricks that you can use for various tasks and demonstrate how we can plot results in different ways using Seaborn (based on matplotlib). Given the data format, special focus is put on time-series data manipulation.

The Downside of Converting Full-Text PDFs to XML for Text Mining

To get the best results from text mining projects, researchers need access to full-text articles. Abstracts often don’t include essential facts and relationships, access to secondary study findings, and adverse event data. However, when researchers obtain full-text articles through company subscriptions or document delivery, the documents are often provided as PDFs, a suboptimal format for use with text mining software. The burden is then on researchers to convert the PDFs – potentially thousands in a bulk delivery – to XML (Extensible Markup Language), the preferred format for use in text mining software. But tasking highly-skilled researchers with converting document formats for input into text mining tools creates a number of problems with the transformed content and is inefficient and costly.

Document worth reading: “Machine Learning and Cloud Computing: Survey of Distributed and SaaS Solutions”

Applying popular machine learning algorithms to large amounts of data has raised new challenges for ML practitioners. Traditional ML libraries do not support the processing of huge datasets well, so new approaches were needed. Parallelization using modern parallel computing frameworks, such as MapReduce, CUDA, or Dryad, gained in popularity and acceptance, resulting in new ML libraries developed on top of these frameworks. We will briefly introduce the most prominent industrial and academic outcomes, such as Apache Mahout, GraphLab or Jubatus. We will investigate how the cloud computing paradigm impacted the field of ML. The first direction is that of popular statistics tools and libraries (R system, Python) deployed in the cloud. A second line of products augments existing tools with plugins that allow users to create a Hadoop cluster in the cloud and run jobs on it. Next on the list are libraries of distributed implementations for ML algorithms, and on-premise deployments of complex systems for data analytics and data mining. The last approach on the radar of this survey is ML as Software-as-a-Service, with several BigData start-ups (and large companies as well) already opening their solutions to the market. Machine Learning and Cloud Computing: Survey of Distributed and SaaS Solutions