Data Science Workflow: Overview and Challenges

During my Ph.D., I created tools for people who write programs to obtain insights from data. Millions of professionals in fields ranging from science, engineering, business, finance, public policy, and journalism, as well as numerous students and computer hobbyists, all perform this sort of programming on a daily basis. Shortly after I wrote my dissertation in 2012, the term Data Science started appearing everywhere. Some industry pundits call data science the ‘sexiest job of the 21st century.’ And universities are putting tremendous funding into new Data Science Institutes. I now realize that data scientists are one main target audience for the tools that I created throughout my Ph.D. However, that job title was not as prominent back when I was in grad school, so I didn’t mention it explicitly in my dissertation. What do data scientists do at work, and what challenges do they face? This post provides an overview of the modern data science workflow, adapted from Chapter 2 of my Ph.D. dissertation, Software Tools to Facilitate Research Programming.

Hands on tutorial to perform Data Exploration using Elastic Search and Kibana (using Python)

Exploratory Data Analysis (EDA) helps us to uncover the underlying structure of data and its dynamics through which we can maximize the insights. EDA is also critical to extract important variables and detect outliers and anomalies. Even though there are many algorithms in Machine Learning, EDA is considered to be one of the most critical part to understand and drive the business. There are several ways to perform EDA on various platforms like Python (matplotlib, seaborn), R (ggplot2) and there are a lot of good resources on the web such as “Exploratory Data Analysis” by John W. Tukey, “Exploratory Data Analysis with R” by Roger D. Peng and so on.. In this article, I am going to talk about performing EDA using Kibana and Elastic Search.

Building a Medicare Shiny App – Part 1

Hello R community. if you’re up for some fun tinkering with a Shiny App please join me on a new project. I would love to see some collaboration in designing a Shiny Application which will help people make a decision about a healthcare provider. I have only just begun on this project but would to work with others. This is just a quick look at the data, the roughest shiny app you’ve ever seen can be located on my page The first goal is to help people find a provider based off of City and State (or perhaps zipcode and latitude/longitude). This can take the form of a list, map, etc. I would also like people to be able to glean some information about the place they are going in comparison to the surrounding locations. I was only able to put a an hour or so into this (and that was months ago) but have decided that it would be fun to start collaborating with anyone who is interested. Please make any pull requests and I’ll get to them!

Normal Distributions

I review — and provide derivations for — some basic properties of Normal distributions. Topics currently covered: (i) Their normalization, (ii) Samples from a univariate Normal, (iii) Multivariate Normal distributions, (iv) Central limit theorem.

An executive’s guide to machine learning

Machine learning is based on algorithms that can learn from data without relying on rules-based programming. It came into its own as a scientific discipline in the late 1990s as steady advances in digitization and cheap computing power enabled data scientists to stop building finished models and instead train computers to do so. The unmanageable volume and complexity of the big data that the world is now swimming in have increased the potential of machine learning—and the need for it. In 2007 Fei-Fei Li, the head of Stanford’s Artificial Intelligence Lab, gave up trying to program computers to recognize objects and began labeling the millions of raw images that a child might encounter by age three and feeding them to computers. By being shown thousands and thousands of labeled data sets with instances of, say, a cat, the machine could shape its own rules for deciding whether a particular set of digital pixels was, in fact, a cat. Last November, Li’s team unveiled a program that identifies the visual elements of any picture with a high degree of accuracy. IBM’s Watson machine relied on a similar self-generated scoring system among hundreds of potential answers to crush the world’s best Jeopardy! players in 2011. Dazzling as such feats are, machine learning is nothing like learning in the human sense (yet). But what it already does extraordinarily well—and will get better at—is relentlessly chewing through any amount of data and every combination of variables. Because machine learning’s emergence as a mainstream management tool is relatively recent, it often raises questions. In this article, we’ve posed some that we often hear and answered them in a way we hope will be useful for any executive. Now is the time to grapple with these issues, because the competitive significance of business models turbocharged by machine learning is poised to surge. Indeed, management author Ram Charan suggests that “any organization that is not a math house now or is unable to become one soon is already a legacy company.

Loan Prediction – Using PCA and Naive Bayes Classification with R

Nowadays, there are numerous risks related to bank loans both for the banks and the borrowers getting the loans. The risk analysis about bank loans needs understanding about the risk and the risk level. Banks need to analyze their customers for loan eligibility so that they can specifically target those customers. Banks wanted to automate the loan eligibility process (real time) based on customer details such as Gender, Marital Status, Age, Occupation, Income, debts, and others provided in their online application form. As the number of transactions in banking sector is rapidly growing and huge data volumes are available, the customers’ behavior can be easily analyzed and the risks around loan can be reduced. So, it is very important to predict the loan type and loan amount based on the banks’ data. In this blog post, we will discuss about how Naive Bayes Classification model using R can be used to predict the loans.

New Trends in Artificial Intelligence & Machine Learning

Artificial Intelligence has effectively convinced its necessity to the entire world by performing excellently in various industries. Almost all the industries including manufacturing, healthcare, construction, online retail, etc. are adapting to the reality of IoT to leverage its advantages. Machine learning technology is constantly evolving and the current trends in the field promise that every enterprise will be data driven and will have the capacity of using machine learning in the cloud to incorporate artificial intelligence apps. Yes, that’s right! Companies will be successful in analyzing large complex data and providing meticulous insights without spending a huge amount on installing and maintaining machine learning systems.