Factoring Massive Numbers: Machine Learning Approach – Why and How

We are interested here in factoring numbers that are the product of two very large primes. Such numbers are used by encryption algorithms such as RSA, and the prime factors represent the keys (public and private) of the encryption code. Here you will also learn how data science techniques, including visualization, are applied to big data to derive insights. This article is good reading for the data scientist in training who might not have easy access to interesting data: here the dataset is the set of all real numbers, not just the integers, and it is readily available to anyone. Much of the analysis performed here is statistical in nature, and thus of particular interest to data scientists. Factoring numbers that are the product of two large primes lets you test the strength (or weakness) of these encryption keys. It is believed that if the primes in question are a few hundred binary digits long, factoring is nearly impossible: it would take years of computing power on distributed systems to factor just one such number.
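The article itself contains no code, but the gap between small and large semiprimes is easy to demonstrate. Below is a minimal Python sketch using Pollard's rho, a classic factoring algorithm (the toy semiprime 10403 = 101 × 103 is an illustrative choice, not from the article): it cracks small products of two primes instantly, whereas the same task for a few-hundred-digit semiprime is computationally infeasible.

```python
import math
import random

def pollard_rho(n):
    """Return a nontrivial factor of the composite number n (Pollard's rho)."""
    if n % 2 == 0:
        return 2
    while True:
        x = random.randrange(2, n)
        y, c, d = x, random.randrange(1, n), 1
        while d == 1:
            x = (x * x + c) % n        # tortoise: one step
            y = (y * y + c) % n        # hare: two steps
            y = (y * y + c) % n
            d = math.gcd(abs(x - y), n)
        if d != n:                     # d == n means a bad cycle; retry
            return d

# 10403 = 101 * 103, a tiny "RSA-like" semiprime
p = pollard_rho(10403)
print(p, 10403 // p)
```

For a 2048-bit RSA modulus the same approach (and every known classical algorithm) becomes hopeless, which is exactly the asymmetry the encryption relies on.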

DataOps – It’s a Secret

DataOps is a set of principles and practices that promises to reconcile the conflicting goals of the different data tribes in an organization: data science, BI, line of business, operations, and IT. What has been a growing body of best practices is now becoming the basis for a new category of data access, blending, and deployment platforms that may resolve the data conflicts in your organization.

Federated Learning: Collaborative Machine Learning without Centralized Training Data

Standard machine learning approaches require centralizing the training data on one machine or in a datacenter. And Google has built one of the most secure and robust cloud infrastructures for processing this data to make our services better. Now for models trained from user interaction with mobile devices, we’re introducing an additional approach: Federated Learning. Federated Learning enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device, decoupling the ability to do machine learning from the need to store the data in the cloud. This goes beyond the use of local models that make predictions on mobile devices (like the Mobile Vision API and On-Device Smart Reply) by bringing model training to the device as well.
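As a rough illustration of the idea (not Google's actual implementation), here is a toy federated-averaging sketch in Python: each simulated device takes a local gradient step on its own data, and only the resulting model parameters, weighted by sample count, are averaged by the "server"; the raw data never leaves the device. The 1-D linear model and all names are invented for illustration.

```python
# Each "device" trains locally (here: one gradient step on a 1-D linear
# model y ~ w * x) and only the model update leaves the device.

def local_update(w, data, lr=0.01):
    # one step of gradient descent on mean squared error
    g = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * g

def federated_round(w, devices):
    # FedAvg: average each device's new model, weighted by its sample count
    total = sum(len(d) for d in devices)
    return sum(len(d) * local_update(w, d) for d in devices) / total

# toy data with true slope 3, split unevenly across three devices
devices = [
    [(1, 3), (2, 6)],
    [(3, 9), (4, 12), (5, 15)],
    [(6, 18)],
]
w = 0.0
for _ in range(200):
    w = federated_round(w, devices)
print(round(w, 2))  # converges toward the true slope 3
```

The real system adds secure aggregation, compression, and device-scheduling on top, but the core loop (local training, then weighted averaging of updates) is the same shape.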

Datasets of the Week, March 2017

Every week at Kaggle, we learn something new about the world when our users publish datasets and analyses based on their research, niche hobbies, and portfolio projects. For example, did you know that one Kaggler measured crowdedness at their campus gym using a Wifi sensor to determine the best time to lift weights? And another Kaggler published a dataset that challenges you to generate novel recipes based on ingredient lists and ratings. In this blog post, the first of our Datasets of the Week series, you’ll hear the stories behind these datasets and others that each add something unique to the diverse resources you can find on Kaggle. Read on or follow the links below to jump to the dataset that most catches your eye.

Stuff Happens: A Statistical Guide to the “Impossible”

In summer 1972 Anthony Hopkins was chosen to play a leading role in a film based on George Feifer’s novel The Girl from Petrovka. Not having the book himself, he went to London to buy a copy but none of the main London bookstores had one. On his journey home, however, waiting for an underground train at Leicester Square station he saw a discarded book lying on the seat next to him. It was The Girl from Petrovka! The story gets even weirder. Hopkins later had a chance to meet Feifer and told him about finding the book. Feifer mentioned that in November 1971 he had lent a friend a copy of the book, one in which Feifer had made notes pertaining to the publication of an American edition, but his friend lost the book in Bayswater, London. A check of the annotations in the copy Hopkins found showed that it was the very same one Feifer’s friend had mislaid!

Microsoft R Open 3.3.3 now available

Microsoft R Open (MRO), Microsoft’s enhanced distribution of open source R, has been upgraded to version 3.3.3, and is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to R 3.3.3, upgrades the installer, and updates the bundled packages. R 3.3.3 makes just a few minor fixes compared to R 3.3.2 (see the full list of changes here), so you shouldn’t encounter any compatibility issues when upgrading from MRO 3.3.2. For CRAN packages, MRO 3.3.3 points to a CRAN snapshot taken on March 15, 2017, but as always, you can use the built-in checkpoint package to access packages from an earlier date (for compatibility) or a later date (to access new and updated packages).

Data Science for Operational Excellence (Part-1)

R has many powerful libraries for operations research. This exercise demonstrates some basic functionality of R for linear programming. Linear programming is a technique for optimizing a linear objective function subject to linear equality and inequality constraints. The lpSolve package in R provides a set of functions to create models from scratch or to use prebuilt ones, such as the assignment and transportation problems. Answers to the exercises are available here. If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page. Please install and load the packages lpSolve and igraph before starting the exercise.
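The exercise set uses R's lpSolve, but the assignment problem it mentions is easy to sketch without it. Here is a brute-force Python illustration (fine for tiny instances; lpSolve solves large ones efficiently as a linear program). The cost matrix is invented for illustration.

```python
from itertools import permutations

# cost[i][j] = cost of assigning worker i to task j (illustrative numbers)
cost = [
    [4, 2, 8],
    [4, 3, 7],
    [3, 1, 6],
]

def solve_assignment(cost):
    # exhaustive search over all one-to-one worker/task assignments
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm, best_cost

perm, total = solve_assignment(cost)
print(perm, total)  # minimum total cost is 12
```

Brute force is O(n!), which is exactly why LP formulations (and lpSolve's `lp.assign`-style helpers) matter once the problem grows beyond a handful of workers.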

Fitting a rational function in R using ordinary least-squares regression

Here we have used linear regression by ordinary least squares (with lm) to fit distinctly nonlinear rational functions. For simplicity, these examples focus on equations of second order (or less) in both numerator and denominator, but the idea extends to higher orders. On a cautionary note, this approach seems to have numerical stability issues with some inputs. For example, if you take one of the data-simulating equations above and make selected coefficients much larger, you can create datasets that are fitted poorly by this method. And if the data-simulating function does not have the correct form (for example, if the zeroth order term in the denominator is not 1), the fitted curves can be completely wrong. For practical purposes it might be preferable to use a nonlinear least squares approach (e.g., the nls function). Still, this approach works well in many examples, and it lets you fit some curves that cannot be represented by ordinary polynomials.
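The trick behind fitting a rational function with ordinary least squares is that multiplying through by the denominator makes the model linear in its coefficients. The article works in R with lm; the following is a stdlib-only Python sketch of the same idea for the first-order case y = (a0 + a1·x)/(1 + b1·x), with illustrative data and helper names.

```python
# Linearizing y = (a0 + a1*x) / (1 + b1*x): multiplying both sides by
# the denominator gives  y = a0 + a1*x - b1*(x*y),
# which is linear in (a0, a1, b1) with regressors [1, x, -x*y].

def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting for a small system
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_rational(xs, ys):
    # ordinary least squares via the normal equations X'X beta = X'y
    X = [[1.0, x, -x * y] for x, y in zip(xs, ys)]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(3)]
    return solve(XtX, Xty)

# noiseless data simulated from y = (1 + 2x) / (1 + 0.5x)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [(1 + 2 * x) / (1 + 0.5 * x) for x in xs]
a0, a1, b1 = fit_rational(xs, ys)
print(round(a0, 3), round(a1, 3), round(b1, 3))  # recovers 1.0 2.0 0.5
```

Note that the regressor -x·y contains the noisy response, which is one source of the instability the article warns about: with noisy data or large coefficients, this linearized fit can diverge badly from the true nonlinear least-squares solution.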

Weighted Linear Support Vector Machine

Consider the spam vs. ham data from this site. Let us do some basic analysis on the data with R version 3.3.3 (64-bit) on a Windows machine.
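The original post works in R; as a language-agnostic sketch of what "weighted" means here, the following toy Python example trains a linear SVM by subgradient descent in which each example's hinge loss carries its own cost, so one class (say, spam) can be penalized more heavily than the other. The data, weights, and hyperparameters are invented for illustration.

```python
# Weighted linear SVM by subgradient descent.
# Objective: lam/2 * ||w||^2 + sum_i c_i * max(0, 1 - y_i * (w.x_i + b)),
# where c_i is the per-example (or per-class) misclassification weight.

def train(X, y, c, lam=0.01, lr=0.1, epochs=200):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi, ci in zip(X, y, c):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj - lr * lam * wj for wj in w]   # regularizer subgradient
            if margin < 1:                          # hinge loss active
                w = [wj + lr * ci * yi * xj for wj, xj in zip(w, xi)]
                b += lr * ci * yi
    return w, b

def predict(w, b, xi):
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1

# toy 2-D data: a "+1" cluster vs a "-1" cluster, with +1 up-weighted
X = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-3, -2)]
y = [1, 1, 1, -1, -1, -1]
c = [2.0, 2.0, 2.0, 1.0, 1.0, 1.0]
w, b = train(X, y, c)
print([predict(w, b, xi) for xi in X])
```

On imbalanced spam/ham data, raising the weight of the minority class shifts the decision boundary away from it, trading some overall accuracy for better recall on the costly class.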

Some Lesser Known Machine Learning Libraries

As promised, we have come up with yet another list of some lesser known Machine Learning Libraries that you might find interesting.

Feature Engineering in the IoT Age – How to deal with IoT data and create features for machine learning?

If you ask any experienced analytics or data science professional what differentiates a good model from a bad model, chances are that you will hear a uniform answer. Whether you call it “characteristics generation” or “variable generation” (as it was known traditionally) or “feature engineering”, the importance of this step is unanimously agreed upon in the data science / analytics world. This step involves creating a large and diverse set of derived variables from the base data. The richer the set of variables you generate, the better your models will be. Most of our time and coding effort is usually spent on feature engineering. Therefore, understanding feature engineering for specific data sources is a key success factor for us. Unfortunately, most analytics courses and textbooks do not cover this aspect in great detail. This article is a humble effort in that direction.
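As a concrete example of the kind of derived variables the article has in mind, here is a small Python sketch that turns a raw IoT sensor stream into per-window summary features. The window size and the particular feature set are illustrative choices, not prescriptions from the article.

```python
from statistics import mean, stdev

# Turning a raw sensor stream into model-ready rows: for each sliding
# window, derive summary variables capturing level, spread, and trend.

def window_features(readings, size=4):
    feats = []
    for i in range(len(readings) - size + 1):
        w = readings[i:i + size]
        feats.append({
            "mean": mean(w),
            "std": stdev(w),
            "min": min(w),
            "max": max(w),
            "trend": w[-1] - w[0],   # crude slope over the window
        })
    return feats

# toy temperature stream from a single sensor
temps = [20.1, 20.3, 20.2, 20.6, 21.0, 21.5, 22.3]
rows = window_features(temps)
print(len(rows), rows[0])
```

In practice you would add domain-specific derivations on top (time-of-day effects, deltas between sensors, frequency-domain features), which is exactly the source-specific knowledge the article argues is under-taught.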

A Brief History of AI

In spite of all the current hype, AI is not a new field of study; it has its roots in the fifties. If we exclude the purely philosophical reasoning path that runs from the Ancient Greeks to Hobbes, Leibniz, and Pascal, AI as we know it officially started in 1956 at Dartmouth College, where the most eminent experts gathered to brainstorm on intelligence simulation. This happened only a few years after Asimov set down his three laws of robotics, and, more relevantly, after the famous paper published by Turing (1950), in which he proposed for the first time the idea of a thinking machine and the now-popular Turing test to assess whether such a machine shows, in fact, any intelligence. As soon as the research group at Dartmouth publicly released the contents and ideas arising from that summer meeting, a flow of government funding was reserved for the study of creating a nonbiological intelligence.