One-Pass Algorithm
In computing, a one-pass algorithm is one that reads its input exactly once, in order, without unbounded buffering. A one-pass algorithm generally requires O(n) time and less than O(n) storage (typically O(1)), where n is the size of the input. A typical one-pass clustering algorithm, for example, operates as follows (a code sketch follows the list):
(1) the object descriptions are processed serially;
(2) the first object becomes the cluster representative of the first cluster;
(3) each subsequent object is matched against all cluster representatives existing at its processing time;
(4) a given object is assigned to one cluster (or more if overlap is allowed) according to some condition on the matching function;
(5) when an object is assigned to a cluster the representative for that cluster is recomputed;
(6) if an object fails a certain test, it becomes the cluster representative of a new cluster.
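A minimal sketch of this single-pass clustering procedure, assuming vector-valued objects, cosine similarity as the matching function, a running centroid as the cluster representative, and an illustrative similarity threshold; all of these choices are stand-ins for whatever a real application uses:

import math

def one_pass_cluster(objects, threshold=0.8):
    """Single-pass clustering: each object is read exactly once, in order.

    `objects` is an iterable of equal-length numeric feature vectors;
    `threshold` is an illustrative similarity cutoff (an assumption).
    Returns a list of clusters, each holding a centroid and its members.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    clusters = []  # each: {"centroid": [...], "members": [...]}
    for obj in objects:                      # (1) process serially
        best, best_sim = None, -1.0
        for c in clusters:                   # (3) match against all representatives
            sim = cosine(obj, c["centroid"])
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= threshold:   # (4) assignment condition
            best["members"].append(obj)
            n = len(best["members"])         # (5) recompute the representative
            best["centroid"] = [
                ((n - 1) * c + x) / n for c, x in zip(best["centroid"], obj)
            ]
        else:                                # (2)/(6) start a new cluster
            clusters.append({"centroid": list(obj), "members": [obj]})
    return clusters

Each object is touched exactly once, and storage grows only with the number of clusters, not with the size of the input.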


DeepER
Entity Resolution (ER) is a fundamental problem with many applications. Machine learning (ML)-based and rule-based approaches have been widely studied for decades, with many efforts being geared towards which features/attributes to select, which similarity functions to employ, and which blocking function to use – complicating the deployment of an ER system as a turn-key system. In this paper, we present DeepER, a turn-key ER system powered by deep learning (DL) techniques. The central idea is that distributed representations and representation learning from DL can alleviate the above human efforts for tuning existing ER systems. DeepER makes several notable contributions: encoding a tuple as a distributed representation of attribute values, building classifiers using these representations and a semantic aware blocking based on LSH, and learning and tuning the distributed representations for ER. We evaluate our algorithms on multiple benchmark datasets and achieve competitive results while requiring minimal interaction with experts. …
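A toy illustration of two of these ideas: averaging pretrained word vectors to obtain a distributed representation of a tuple, and random-hyperplane LSH for semantic-aware blocking. The vocabulary, vector dimension, and signature width below are invented for the sketch; a real system would load pretrained embeddings (e.g. GloVe) and tune these choices:

import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pretrained word embeddings; the words and the dimension (50)
# are assumptions for this sketch.
VOCAB = {w: rng.standard_normal(50)
         for w in ["acme", "corp", "inc", "entity", "resolution"]}

def tuple_embedding(record):
    """Encode a tuple as the average of the word vectors of its attribute values."""
    words = [w for attr in record for w in attr.lower().split() if w in VOCAB]
    return np.mean([VOCAB[w] for w in words], axis=0) if words else np.zeros(50)

# Semantic-aware blocking via random-hyperplane LSH: tuples whose embeddings
# hash to the same signature become candidate pairs for the classifier.
PLANES = rng.standard_normal((8, 50))  # 8-bit signature; the width is an assumption

def lsh_bucket(vec):
    return tuple(bool(b) for b in (PLANES @ vec > 0))

records = [("Acme Corp", "entity resolution"), ("ACME Inc", "entity resolution")]
buckets = {}
for i, rec in enumerate(records):
    buckets.setdefault(lsh_bucket(tuple_embedding(rec)), []).append(i)
print(buckets)  # record indices sharing a key would go to the downstream classifier

Because similar embeddings agree on most hyperplane signs, near-duplicate tuples tend to share a bucket, which replaces hand-written blocking functions with a purely semantic criterion.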

Simpson’s Paradox
In probability and statistics, Simpson’s paradox, or the Yule-Simpson effect, is a paradox in which a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. This result is often encountered in social-science and medical-science statistics, and is particularly confounding when frequency data are unduly given causal interpretations. Simpson’s paradox disappears when causal relations are brought into consideration. Many statisticians believe that the mainstream public should be informed of counter-intuitive results in statistics such as Simpson’s paradox.
http://…/confounding.html
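A numeric illustration, using the figures from the well-known kidney-stone study often cited for this paradox: treatment A has the higher success rate within each stone-size group, yet treatment B has the higher rate on the pooled data, because the two treatments were applied to groups of very different sizes:

# Kidney-stone example of Simpson's paradox: A wins within each group,
# B wins on the pooled data.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},   # (successes, patients)
    "B": {"small": (234, 270), "large": (55, 80)},
}

for treatment, groups in data.items():
    for group, (s, n) in groups.items():
        print(f"{treatment} {group}: {s}/{n} = {s / n:.0%}")
    total_s = sum(s for s, _ in groups.values())
    total_n = sum(n for _, n in groups.values())
    print(f"{treatment} overall: {total_s}/{total_n} = {total_s / total_n:.0%}")

This prints 93% and 73% for A against 87% and 69% for B within the groups, but 78% for A against 83% for B overall. Conditioning on the confounder (stone size) reverses the naive aggregate comparison, which is why bringing causal relations into consideration dissolves the paradox.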
