1. Install datmo
2. Initialize a datmo project
3. Start environment setup
4. Select System Drivers (CPU or GPU)
5. Select an environment
6. Select a language version (if applicable)
7. Launch your workspace
• One command environment setup (languages, frameworks, packages, etc)
• Tracking and logging for model config and results
• Project versioning (model state tracking)
• Experiment reproducibility (re-run tasks)
• Visualize + export experiment history
In the past, most of the focus was on rates such as the attrition rate and the retention rate. HR managers computed past rates and tried to predict future rates using data-warehousing tools. These rates capture the aggregate impact of churn, but they are only half the picture. Another approach is to focus on individual records in addition to the aggregate.
Lots of case studies on customer churn are available. In customer churn, you predict which customers will stop buying, and when. Employee churn is similar to customer churn; it simply focuses on the employee rather than the customer. Here, you predict who will leave the organization, and when. Employee churn is expensive, and incremental improvements yield significant results: they help in designing better retention plans and improving employee satisfaction. In this tutorial, you are going to cover the following topics:
• Employee Churn Analysis
• Data loading and understanding the features
• Exploratory data analysis and data visualization
• Cluster analysis
• Building a prediction model using a Gradient Boosting Tree
• Evaluating model performance
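The modeling step in the list above can be sketched in a few lines. This is a minimal illustration using scikit-learn's gradient boosting classifier on synthetic data; the real tutorial works with an HR dataset, so the generated features here are stand-ins, not the actual columns.

```python
# Sketch: a gradient-boosted churn classifier on synthetic data.
# The features are generated, standing in for HR columns such as
# satisfaction level or monthly hours.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# y = 1 plays the role of "employee left the company".
X, y = make_classification(n_samples=1000, n_features=6,
                           n_informative=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = GradientBoostingClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on held-out data, matching the last two topics in the list.
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

Swapping the synthetic matrix for a real HR dataframe keeps the rest of the code unchanged, which is the appeal of the scikit-learn estimator API.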
• Accessing Data
In large organizations, data sources are commonly re-shaped by corrective maintenance and adapted to application requirements, and applications are changed to meet new requirements. The result is that the data stored in different sources, and the processes operating over them, tend to be redundant, mutually inconsistent, and obscure to large classes of users. So accessing data means interacting with IT experts who know where the data are and what they mean in the various contexts, and who can therefore translate the information need expressed by the user into appropriate queries. This process can be both expensive and time-consuming.
• Data Quality
Data quality is often cited as a critical factor in delivering high-value information services. But how can we check data quality, and how can we decide whether it is good, if we do not have a clear understanding of the semantics the data should carry? Moreover, how can we judge the quality of external data from business partners, clients, or even public sources that we connect to? Data quality is also crucial when opening data to external organisations to create new business opportunities, or even to the public, as we increasingly see in the age of Open Data.
• Process Specification
Information systems are key assets for business organisations, which rely not only on data but also, for instance, on processes and services. Designing and managing processes is an important aspect of information systems, but deciding what a process should do is hard without a clear idea of which data the process will access and how it may change them. The difficulties come from various factors, including the lack of modelling languages and tools for describing processes and data holistically; the problems related to the semantics of the data make this task even harder.
• Three-Level Architecture
The key idea of OBDA is to provide users with access to the information in their data sources through a three-level architecture, constituted by the ontology, the sources, and the mapping between the two, where the ontology is a formal description of the domain of interest and is the heart of the system. Through this architecture, OBDA provides a semantic end-to-end connection between users and data sources, allowing users to directly query data spread across multiple distributed sources through the familiar vocabulary of the ontology: the user formulates SPARQL queries over the ontology, which are transformed, through the mapping layer, into SQL queries over the underlying relational databases.
• The Ontology layer in the architecture is the means for pursuing a declarative approach to information integration and, more generally, to data governance. The domain knowledge base of the organization is specified through a formal, high-level description of both its static and dynamic aspects, represented by the ontology. By making the representation of the domain explicit, we gain re-usability of the acquired knowledge, which is not achieved when the global schema is simply a unified description of the underlying data sources.
• The Mapping layer connects the Ontology layer with the Data Source layer by defining the relationships between the domain concepts on the one hand and the data sources on the other. These mappings are not only used for the operation of the information system, but can also be a significant asset for documentation purposes in cases where information about the data is spread across separate pieces of documentation that are often difficult to access and rarely conform to common standards.
• The Data Source layer is constituted by the existing data sources of the organization.
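The three layers can be caricatured in a few lines of plain Python. This is only an illustration of the idea, not a real OBDA system: the table, column, and concept names are invented, and the "mapping" is just a dictionary from an ontology concept to the SQL that produces its instances.

```python
# Toy illustration of OBDA's three levels. An ontology-level question
# ("give me all Employees") is answered by rewriting it, via the
# mapping layer, into SQL over the source schema.
import sqlite3

# Data Source layer: an existing relational source with its own naming.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_emp (emp_name TEXT, dept TEXT)")
conn.executemany("INSERT INTO t_emp VALUES (?, ?)",
                 [("alice", "hr"), ("bob", "it")])

# Mapping layer: each ontology concept points at a SQL query that
# materializes its instances from the underlying source.
mapping = {"Employee": "SELECT emp_name FROM t_emp ORDER BY emp_name"}

def instances_of(concept):
    """Ontology layer: answer a concept-level query through the mapping."""
    return [row[0] for row in conn.execute(mapping[concept])]

print(instances_of("Employee"))  # ['alice', 'bob']
```

The user never sees `t_emp` or `emp_name`; they only use the ontology vocabulary, which is exactly the end-to-end connection the architecture promises.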
• Create swimmable areas
• Clear the water deliberately
• Make sure there is a lifeguard on duty
• Always keep a current map
Subject – Predicate – Object
These describe a single fact. Generally, URIs are used for the subject and predicate. The object is either another URI or a literal such as a number or string. Literals can have a type (which is also a URI), and they can also have a language. Yes, this means a single triple can carry up to five pieces of data!
For example, a triple might describe the fact that Charles is Harry's father.
<http://…/harry> <http://…/1.0#hasFather> <http://…/charles> .
Triples are database normalization taken to a logical extreme. They have the advantage that you can load triples from many sources into one database with no reconfiguration.
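That "load from many sources with no reconfiguration" property is easy to see in miniature. Here is a toy triple store in plain Python (URIs shortened to bare names for readability): triples are tuples, merging sources is set union, and querying is pattern matching.

```python
# Minimal triple store: each fact is a (subject, predicate, object)
# tuple, and combining data from several sources is just set union.
triples = {
    ("harry", "hasFather", "charles"),
    ("charles", "hasFather", "philip"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# Who is Harry's father?
print(match(s="harry", p="hasFather"))
# [('harry', 'hasFather', 'charles')]
```

A real triple store adds indexes and SPARQL on top, but the data model is exactly this flat.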
• RDF and RDFS
The next layer is RDF – The Resource Description Framework. RDF defines some extra structure to triples. The most important thing RDF defines is a predicate called “rdf:type”. This is used to say that things are of certain types. Everyone uses rdf:type which makes it very useful.
RDFS (RDF Schema) defines some classes which represent the concepts of subjects, objects, predicates, etc. This means you can start making statements about classes of things and types of relationships. At the simplest level you can state things like: http://…/1.0#hasFather is a relationship between a person and a person. It also allows you to describe, in human-readable text, the meaning of a relationship or a class. This is a schema: it tells you the legal uses of various classes and relationships. It is also used to indicate that a class or property is a sub-type of a more general type. For example, "HumanParent" is a subclass of "Person", and "Loves" is a sub-property of "Knows".
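The subclass machinery implies extra facts. A sketch of that inference in plain Python (no RDF library; class names follow the examples above): if Charles has rdf:type HumanParent, and HumanParent is an rdfs:subClassOf Person, then Charles is also a Person.

```python
# Sketch of RDFS-style reasoning: rdf:type plus rdfs:subClassOf lets
# us derive types that were never stated directly.
subclass_of = {"HumanParent": "Person"}   # rdfs:subClassOf assertions
types = {"charles": {"HumanParent"}}      # rdf:type assertions

def infer_types(entity):
    """Close an entity's stated types under the subclass hierarchy."""
    result = set(types.get(entity, set()))
    frontier = list(result)
    while frontier:
        cls = frontier.pop()
        parent = subclass_of.get(cls)
        if parent and parent not in result:
            result.add(parent)
            frontier.append(parent)
    return result

print(infer_types("charles"))  # {'HumanParent', 'Person'} (set order varies)
```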
• RDF Serialisations
RDF can be exported in a number of file formats. The most common is RDF/XML, but this has some weaknesses.
N3 is a non-XML format which is easier to read, and there are some subsets of it (Turtle and N-Triples) which are stricter.
It’s important to know that RDF is a way of working with triples, NOT the file formats.
XSD is a namespace mostly used to describe property types, like dates, integers and so forth. It’s generally seen in RDF data identifying the specific type of a literal. It’s also used in XML schemas, which is a slightly different kettle of fish.
OWL adds semantics to the schema. It allows you to specify far more about the properties and classes, and it is also expressed in triples. For example, it can indicate that "A isMarriedTo B" implies "B isMarriedTo A", or that "C isAncestorOf D" and "D isAncestorOf E" together imply "C isAncestorOf E". Another useful thing OWL adds is the ability to say that two things are the same, which is very helpful for joining up data expressed in different schemas. You can say that the relationship "sired" in one schema is owl:sameAs "fathered" in another. You can also say that two individuals are the same, for example that the "Elvis Presley" on Wikipedia is the same one as on the BBC. This is very exciting, as it means you can start joining up data from multiple sites (this is "Linked Data").
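The two rules mentioned above, symmetry and transitivity, can be sketched without an OWL reasoner. This is plain Python applying the rules to a set of triples until no new facts appear; the entity names are placeholders from the text.

```python
# Sketch of two OWL-style inference rules: isMarriedTo is symmetric,
# and isAncestorOf is transitive. A real reasoner generalizes this.
facts = {
    ("A", "isMarriedTo", "B"),
    ("C", "isAncestorOf", "D"),
    ("D", "isAncestorOf", "E"),
}

def saturate(facts):
    """Apply symmetry and transitivity until nothing new is derived."""
    facts = set(facts)
    while True:
        new = set()
        for (s, p, o) in facts:
            if p == "isMarriedTo":
                new.add((o, p, s))                 # symmetry
            if p == "isAncestorOf":
                for (s2, p2, o2) in facts:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))        # transitivity
        if new <= facts:                           # fixpoint reached
            return facts
        facts |= new

closed = saturate(facts)
print(("B", "isMarriedTo", "A") in closed)   # True
print(("C", "isAncestorOf", "E") in closed)  # True
```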
• Experiment Driven: Machine Learning and Deep Learning
• Data Driven: Enterprise Platforms and Data
• Scale Driven: AI Pipeline and Scalability
• Talent Driven: AI Disruption and Stagnation
Working with text processing, the data analyst faces the following tasks:
• Keyphrase extraction;
• Sentiment analysis;
• Text analysis;
• Entity recognition;
• Language detection;
• Topic modeling.
There are several high-level APIs which may be used to perform these tasks. Among them:
• Amazon Comprehend;
• IBM Watson Natural Language Understanding;
• Microsoft Azure (Text analytics API);
• Google Cloud Natural Language;
• Microsoft Azure (Linguistic Analysis API) – beta;
• Google Translate API;
• IBM Watson Translator;
• Amazon Translate;
• Microsoft Azure Translator Text API.
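Before reaching for a hosted API, it can help to see how crude a local baseline is. Here is a toy keyphrase extractor: rank words by frequency after dropping a small stopword list. Services such as Amazon Comprehend do vastly more than this; the point is only to make the task concrete. The stopword list and sample sentence are invented for the example.

```python
# Toy keyphrase extraction: frequency ranking minus stopwords.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "in"}

def keyphrases(text, n=3):
    """Return the n most frequent non-stopword words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]

text = ("Churn analysis of employee churn data: "
        "churn drives retention analysis.")
print(keyphrases(text))  # ['churn', 'analysis', ...]
```

Real keyphrase extraction uses part-of-speech patterns and statistical scoring (e.g. TF-IDF), which is part of what the APIs above sell.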
#3 Expert Analyst
#5 Applied Machine Learning Engineer
#6 Data Scientist
#7 Analytics Manager / Data Science Leader
#8 Qualitative Expert / Social Scientist
#10+ Additional personnel
• Domain expert
• Software engineer
• Reliability engineer
• UX designer
• Interactive visualizer / graphic designer
• Data collection specialist
• Data product manager
• Project / program manager
1. A causal claim is a statement about what didn't happen.
2. There is a fundamental problem of causal inference.
3. You can estimate average causal effects even if you cannot observe any individual causal effects.
4. If you know that, on average, A causes B and B causes C, this does not mean that you know that A causes C.
5. The counterfactual model is all about contribution, not attribution.
6. X can cause Y even if there is no ‘causal path’ connecting X and Y.
7. Correlation is not causation.
8. X can cause Y even if X is not a necessary condition or a sufficient condition for Y.
9. Estimating average causal effects does not require that treatment and control groups are identical.
10. There is no causation without manipulation.
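Claim 3 is worth seeing with numbers. In this simulated sketch we generate both potential outcomes for every unit, then hide one of them via random assignment, just as nature does. No individual causal effect is observable, yet the difference in group means recovers the average effect. The simulation parameters are invented for the illustration.

```python
# Illustration of claim 3: the average causal effect E[Y1 - Y0] is
# estimable under randomization even though we never observe both
# potential outcomes for any single unit.
import random
import statistics

random.seed(0)
n = 10_000
y0 = [random.gauss(0, 1) for _ in range(n)]  # outcome without treatment
y1 = [v + 2 for v in y0]                     # true individual effect = 2

# Randomized assignment: each unit reveals only one potential outcome.
treated = [random.random() < 0.5 for _ in range(n)]
observed_t = [y1[i] for i in range(n) if treated[i]]
observed_c = [y0[i] for i in range(n) if not treated[i]]

ate_hat = statistics.mean(observed_t) - statistics.mean(observed_c)
print(round(ate_hat, 1))  # close to the true effect of 2
```

Note that this also illustrates claim 9: the treated and control groups here are not identical, only exchangeable on average.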
1. The 'data lake' is a standard design pattern in today's organizations for dealing with big data.
2. There are no silver bullets – data lakes must be governed like any other data platform.
3. Data lakes are quickly evolving in definition AND capabilities.
4. Organizations are choosing a new analytic/BI standard for their data lake.
1. Auto-Keras – This is an automated machine learning (AutoML) package
2. Finetune – Scikit-learn style model finetuning for NLP
3. GluonNLP – NLP made easy
4. animatplot – A Python package for animating plots, built on matplotlib
5. MLflow – Open source platform for the machine learning lifecycle
• data retrieval
• data cleaning
• data exploration and visualization
• statistical or predictive modeling
While these components are helpful for understanding the different phases, they don’t help us think about our programming workflow.
Often, the entire data science life cycle ends up as an arbitrary mess: cells in a Jupyter Notebook or lines in a single messy script. In addition, most data science problems require us to switch between data retrieval, data cleaning, data exploration, data visualization, and statistical / predictive modeling.
But there's a better way! In this post, I'll go over the two mindsets most people switch between when programming for data science: the prototype mindset and the production mindset.
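One concrete way to move from the prototype mindset toward the production mindset is to give each phase of the life cycle its own named function, so notebook experiments can later be replayed as one pipeline call. The function bodies below are deliberately trivial placeholders, not a real implementation.

```python
# Skeleton: one function per life-cycle phase, composed into a pipeline.
# Bodies are placeholders; in practice each would hold real logic.
def retrieve_data():
    """Data retrieval: pull raw records from the source of truth."""
    return [{"hours": 160, "left": 0}, {"hours": 280, "left": 1}]

def clean_data(rows):
    """Data cleaning: drop obviously invalid records."""
    return [r for r in rows if r["hours"] > 0]

def model(rows):
    """Modeling placeholder: a trivial churn-rate 'model'."""
    return sum(r["left"] for r in rows) / len(rows)

def pipeline():
    """The whole life cycle as a single reproducible call."""
    return model(clean_data(retrieve_data()))

print(pipeline())  # 0.5
```

In the prototype mindset you poke at each function interactively; in the production mindset you test and schedule `pipeline()` as a unit.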
2. Capacity Building
3. Data Understanding
4. Building a Knowledge Repository (Democratizing Data)
5. Focus on Small Wins
6. Repeat After Me: ROI
7. Data Science Roadmap
• Apache Spark MLlib & ML
• Summingbird