Many people have told me that when they get started with data science, they get stuck on the sheer variety of tools available to them. Although there are a handful of guides on the problem, such as “19 Data Science Tools for people who aren’t so good at Programming” or “A Complete Tutorial to Learn Data Science with Python from Scratch”, I would like to show which tools I generally prefer for my day-to-day data science needs. Read on if you are interested!
When you talk about blockchain in the context of Bitcoin, the connection to Big Data seems a little tenuous. What if, instead of Bitcoin, the blockchain was a ledger for other financial transactions? Or business contracts? Or stock trades? The financial services industry is starting to take a serious look at blockchain technology. Oliver Bussmann, CIO of UBS, says that blockchain technology could “pare transaction processing time from days to minutes.” The business imperative for blockchain in financial services is powerful. Imagine blockchains of that magnitude: huge data lakes of blocks that contain the full history of every financial transaction, all available for analysis. Blockchain provides for the integrity of the ledger, but not for the analysis. That’s where Big Data and accompanying analysis tools will come into play.
We’ve covered a few fundamentals and pitfalls of data analytics in our past blog posts. In this blog post, we focus on the four types of data analytics we encounter in data science: Descriptive, Diagnostic, Predictive and Prescriptive.
Anaconda is interested in scaling the scientific Python ecosystem. My current focus is on out-of-core, parallel, and distributed machine learning. This series of posts will introduce those concepts, explore what we have available today, and track the community’s efforts to push the boundaries.
Language models assign probability values to sequences of words. Those three words that appear right above your keyboard on your phone that try to predict the next word you’ll type are one of the uses of language modeling. In the case shown below, the language model is predicting that “from”, “on” and “it” have a high probability of being the next word in the given sentence. Internally, for each word in its vocabulary, the language model computes the probability that it will be the next word, but the user only gets to see the top three most probable words.
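To make the idea of scoring every vocabulary word and showing only the top three concrete, here is a minimal sketch in Python. It assumes a simple bigram model (the next word depends only on the previous one) trained on a tiny made-up corpus; the function names and the corpus are illustrative and are not taken from the original post, and real keyboard models are far more sophisticated.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each other word (hypothetical helper)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev_word, next_word in zip(words, words[1:]):
            counts[prev_word][next_word] += 1
    return counts


def next_word_probabilities(counts, prev_word):
    """Turn the raw follower counts for prev_word into probabilities."""
    followers = counts.get(prev_word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the word was never seen with a follower
    return {word: count / total for word, count in followers.items()}


def top_k_suggestions(counts, prev_word, k=3):
    """Return the k most probable next words, ties broken alphabetically."""
    probs = next_word_probabilities(counts, prev_word)
    ranked = sorted(probs.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Toy corpus, invented purely for illustration.
corpus = [
    "i heard it from a friend",
    "i heard it on the radio",
    "i heard it from the news",
    "i heard it was true",
]

model = train_bigram_model(corpus)
suggestions = top_k_suggestions(model, "it")
for word, prob in suggestions:
    print(f"{word}: {prob:.2f}")
```

The model assigns a probability to every word it has seen after “it”, but, like the keyboard, `top_k_suggestions` surfaces only the three most likely candidates.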
Machine learning datasets can be pretty big, since training something from scratch requires a lot of data. Sharing these datasets can range from just transferring them point-to-point to a colleague to making them publicly and widely available for anyone to download from anywhere at any time. Setting up dataset distribution can be cumbersome, and if you also add in requirements for availability, scalability, encryption, and freedom from censorship, it can be out of reach. Luckily, there are solutions available today that can help, and we’ll survey them below. In this post, we’ll talk about how to share machine learning datasets. We’re looking for a place where we can host the file, and anyone with a connection can download it relatively quickly. We assume the dataset is in the range of megabytes to tens of gigabytes, as most machine learning datasets seem to fall within that range. When you go beyond that, you usually need to craft unique solutions out of existing tools, which is out of scope for this post.
Numerical algorithms are computationally demanding, which makes performance an important consideration when using Python for machine learning. Rolling out Python algorithms from a desktop prototype environment to a production environment with many nodes and orders of magnitude more data is a challenge. In this webinar, we will provide tips for data scientists to speed up Python algorithms. First, we will discuss algorithm choice and how effective use of packages and tools can make large differences in performance. Then, we will demonstrate how Intel accelerates Python for numerical computing and machine learning by seeing how Intel performance libraries such as Intel MKL accelerate basic linear algebra operations and solvers, FFTs, and arithmetic and transcendental operations. You will also get a behind-the-scenes look at how Intel engineers have optimized Python to scale from Intel Atom or Intel Core based laptops to powerful Intel Xeon and Xeon Phi based clusters to achieve faster performance.
I just read two articles that claim that Python is overtaking R for data science and machine learning. From user comments, I learned that R is still strong in certain tasks. I will survey what these tasks are. The first article by Vincent Granville from DSC uses proxy metrics (as opposed to asking the users). He uses statistics from Google Trends, Indeed job search terms, and Analytic Talent (DSC job database) to conclude that Python has overtaken R.
TensorFlow has gathered quite a bit of attention as the new hot toolkit for building neural networks. To the beginner, it may seem the only thing that rivals this interest is the number of different APIs that you can use. In this article, we go over a few of them, building the same neural network each time. We start with low-level TensorFlow math, and then show how to simplify that code with TensorFlow’s layer API. We also discuss two libraries built on top of TensorFlow: TFLearn and Keras.
If you ask a child to draw a cat, you’ll learn more about the child than you will about cats. In the same way, asking neural networks to generate images helps us see how they reason about the information they’re given. It’s often difficult to interpret neural networks—that is, to relate their functioning to human intuition—and generative algorithms offer a way to make neural nets explain themselves.
This post explains how to use R to automatically write and send emails based on automatically computed analyses (yep, everything automated). This means that when the analysis changes or is updated, the email body text changes as well. The email can then be automatically sent to clients based on a trigger event (e.g., only when results are interesting) or periodically. All of this can be done by using R code in Displayr, as illustrated in this post.
In the last post, we focused on the preparation of a tidy dataset describing consumer perceptions of beverages. In this post, I’ll describe some analyses I’ve been doing of these data, in order to better understand how consumers perceive the beverage category. This type of analysis is often used in sensographics: companies that produce food products (chocolate, sauces, etc.) conduct research to understand the ‘product space,’ i.e., the way in which consumers understand the organization of a product category according to relevant perceptive dimensions, and the place that different products occupy within that space.
A common case when working with data is that your source is a remote database. The usual ways to cope with this when using R are either to load all the data into R or to perform the heaviest joins and aggregations with SQL before loading the data. Both have drawbacks: the former is limited by memory capacity and may be very slow, and the latter forces you to use two technologies and is thus more complicated and prone to errors. A solution to these problems is to use dplyr with dbplyr to communicate with the database backend. This allows the user to write dplyr code that is translated to SQL and executed on the database server. One can say that this combines the advantages of the two standard solutions and gets rid of their disadvantages.
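The two standard approaches contrasted above can be sketched without a remote server. The example below is not dplyr or dbplyr (those are R packages); it is a minimal Python illustration using the standard-library `sqlite3` module with an in-memory database standing in for the remote one. The `sales` table, its columns, and its rows are hypothetical and exist only to show the difference between pulling every row to the client and letting the database do the aggregation.

```python
import sqlite3
from collections import defaultdict

# In-memory SQLite database standing in for a remote database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("south", 5.0), ("north", 2.5), ("south", 7.5)],
)

# Approach 1: load all rows into client memory, then aggregate locally.
# Every row crosses the connection, so this is bounded by client memory.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
totals_client = defaultdict(float)
for region, amount in rows:
    totals_client[region] += amount

# Approach 2: push the aggregation into SQL and fetch only the summary.
# Only one row per region is transferred, but the logic now lives in SQL.
totals_db = dict(
    conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
)

conn.close()

print(dict(totals_client))
print(totals_db)
```

Both approaches produce the same totals; what differs is where the work happens and how much data is moved. Translating data-frame code to SQL automatically, as the post describes for dplyr with dbplyr, aims to give the second behaviour while keeping the code in a single language.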