I've been aware of Jupyter for a while now, but it is still a tool that can be quite new to some people. If you are doing any kind of data science, you might want to consider taking a look at it. Python may not be a language that was intended for data science, but with a powerful tool like this, it has become one of the easiest (having said that, apparently the new kid on the block is the Julia programming language which was intended for data science).
What is Jupyter?
To put it in simple terms, Jupyter lets you package up some data science in a way that makes it really portable. You can take your notebook, hand it over to someone else (who is running Jupyter and has the appropriate libraries installed - more on that in a moment) and they can see your data models and see visualizations of your data almost right away. Data can be included alongside the notebook (CSV files for example), in a database (as long as you have the database driver available and the connection details in the notebook), or even just pulled down from the internet.
You can also have other content in the notebook - images, text, etc. Stuff that isn't code. This is why they call it a notebook - you are literally just writing a document that isn't much different from a Word document that contains your resume. The difference is that there are blocks of code included that you can run interactively when viewing the notebook.
This makes Jupyter notebooks incredibly powerful teaching tools. Write instructions on the code alongside the actual code itself. The students can run the code as they are reading right away without leaving the instructions. Teachers can give students a notebook with some of the basic framework stuff in place but missing some key code that they need to write. They work on it, running the code interactively as they go, filling in the code and hopefully ending up with the right answer. That answer could be a number, a graph, a matrix of data, whatever. The point is that the homework instructions and the actual document they write their answers in are the same thing. They finish it up and just hand the notebook back. Simple.
You can install Jupyter in a variety of ways, but the easiest way is to just install Anaconda, which includes a wide variety of really useful Python tools including numerical libraries like NumPy and pandas, visualization tools like matplotlib, and code tools like Jupyter. Once you have it installed, you start a notebook by simply running:
$ jupyter notebook
We took that a step further in a morning tutorial that I attended. We installed a tool called JupyterHub onto a Kubernetes cluster using Helm (https://helm.sh). I would definitely recommend looking into these two technologies as well, as they are helping to drive the future of building out computer infrastructure quickly (serverless is also driving this, but the two approaches can be complementary). Here are some links from the session that I attended:
- Slides: https://goo.gl/cKji3G
- Mailing lists: http://groups.google.com/forum/#!forum/jupyter
- Deploy your own JupyterHub: http://z2jh.jupyter.org
- Deploy your own BinderHub: http://binderhub.readthedocs.io
The first part was setting up a Kubernetes cluster on Google Cloud Platform, which required turning on the Kubernetes API. Once we had done that, we created some nodes, installed Helm, and then set some permissions so that Helm could perform operations in our Kubernetes namespaces. This was done with just a couple of simple commands that we ran directly in the Google Cloud Shell.
With that done, installing JupyterHub is literally just a matter of providing a simple config file (in our case, just two lines) to a helm command. Helm uses documents called charts in order to determine what needs to be built, but it generally works similarly to most package tools that allow you to easily install, update, and remove packages.
We then started exploring changing the configuration, adding additional features such as SSL, authentication, etc. This only required changing the configuration file and then running a helm command to update our entire ecosystem to match. We also misconfigured the application just to demonstrate that we could easily fix the problem simply by patching the configuration and performing an update.
When it was all done, we simply destroyed the entire infrastructure with one simple Helm command. I chose to save my cluster, but in order to not incur charges on the Google Cloud Platform, I set my available number of Kubernetes nodes to 0. This essentially is like turning off a machine, where the data is still saved on disks, but there is no available computing power, so it doesn't do anything. This proved to be useful later when I needed a way to quickly look at a Jupyter notebook.
Visualizing Data with Matplotlib
After lunch, I attended a couple of advanced data science sessions. The first was a session on Matplotlib, a powerful tool for creating charts and other visual diagrams (notebooks and slides for the session are available here). Data scientists work with a lot of data, and the focus is usually on taking that data and running computations in order to make predictions or classify data in various ways. At some point, however, the results need to be presented in a way that is more meaningful to a wider audience.
The session started with a dive into the theory of how to present data visually. It was something I had not really thought too deeply about, and it turned out to be a much more interesting topic than I was expecting. There are many criteria to consider when building a visualization of data, such as the type of presentation (bar charts, pie charts, scatter plots, etc.), colors (what colors go together, which colors to avoid), and even shapes (some shapes imply action or effects).
Some methods are more effective at conveying meaning than others. For example, people aren't actually that good at interpreting pie charts - they just aren't able to easily see the relative difference in sizes of wedges, and the angle of a wedge doesn't translate as well to a number. We usually give them a secondary label, such as a percentage, to help them. People are much more adept at reading bar charts - they can easily see which values are greater than others and the size is logical in relation to the data being presented.
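To make the pie-versus-bar point concrete, here is a minimal sketch that draws the same values both ways. The category names and percentages are made up for illustration, chosen to be close together on purpose, and the `Agg` backend line is only there so the script runs outside a notebook (inside Jupyter you would normally leave it out and let the figure render inline).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Hypothetical shares, deliberately close together
labels = ["A", "B", "C", "D"]
shares = [27, 25, 25, 23]

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(9, 4))

# Pie chart: the percentage labels do most of the work of telling wedges apart
ax_pie.pie(shares, labels=labels, autopct="%1.0f%%")
ax_pie.set_title("Pie chart")

# Bar chart: the same values, compared by height along a shared axis
bars = ax_bar.bar(labels, shares)
ax_bar.set_ylabel("Share (%)")
ax_bar.set_title("Bar chart")

fig.tight_layout()
```

With the percentage labels removed from the pie, the four wedges are hard to rank by eye, while the bar heights can still be ordered at a glance.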
Color turns out to be a really fascinating topic. Normally I choose colors that suit my preferences - I like dark backgrounds and vibrant foregrounds. The problem is that not everyone sees colors the same way. Some people have color blindness, and some people have a harder time distinguishing colors that are close in shade. The brain also does certain things that cause colors to appear different when they are surrounded by other colors. There are websites and tools that you can use to help pick colors and avoid these potential pitfalls (and this is applicable to websites and other visual applications as well).
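As one possible way to act on this in Matplotlib (not necessarily what the session used), the sketch below switches to the bundled `tableau-colorblind10` style sheet and also varies line style and marker, so that color is not the only thing separating the series. The data is made up, and the `Agg` backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4]
line_styles = ["-", "--", "-.", ":"]
markers = ["o", "s", "^", "D"]

# Apply the color-blind-friendly palette only inside this block
with plt.style.context("tableau-colorblind10"):
    fig, ax = plt.subplots()
    for i in range(4):
        # Made-up series: parallel lines offset by i
        y = [value + i for value in x]
        ax.plot(x, y, linestyle=line_styles[i], marker=markers[i],
                label=f"series {i}")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
```

Using a second visual channel (dashes, markers) is a cheap safeguard: the chart stays readable even when printed in grayscale.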
When we started diving into the use of Matplotlib, I began to see the potential of the tool almost immediately, especially when combined with Jupyter. We were given a notebook that pulled in a variety of different data and presented it in different visualizations. Matplotlib gives you complete control over all the parts of a figure, including the types, colors, labels, range, etc. If you can think of some visualization you want to build, Matplotlib can do it. The details are a little too technical to dive into in this article, but in a future article I'll explore it in more detail.
Dealing with Missing Data
The second session for the afternoon was about missing data (the notebooks and slides for the session are available here). When I was first learning Machine Learning, I immediately recognized the issue of missing data. We can only draw conclusions from the data we have - if we try to fill in the missing data, we risk potentially distorting the results in undesirable ways. It can be done, but it needs to be done carefully.
It turns out, however, that there are many kinds of missing data. The first is called missing completely at random (MCAR). A good way of thinking of this is to imagine a lab where someone is working with a number of petri dish samples. As this person is working, they accidentally knock a few of the dishes onto the floor and contaminate them. The data that would have been collected from those dishes is now missing, but it is missing for a reason that is completely random and has nothing to do with the sources of data.
The second case is called missing at random (MAR). It is similar to the first case, except that the chance of a value being missing is related to something else you know about the source of the data. For example, imagine that you are collecting income data from a wide variety of people using surveys. You know things about the people such as age, sex, address, etc. You do not know their income, however. When you get the results back, you find that some of the people did not respond to certain questions, or did not respond at all. It turns out that women are more likely not to respond. Their data is now missing, but it isn't completely random like the previous case - the missing data is related to the fact that the respondents were women, something that you are aware of and can account for (you can't accurately fill in the missing data, but because you know they are female, you could use other female respondents as a basis for potentially estimating the data).
The third case is called missing not at random (MNAR) and is the hardest to deal with. Imagine we collect our income survey results, but unknown to us, people who have higher incomes are more likely not to respond. This is similar to the previous case except that you don't know which people have higher incomes without looking at the response data, and if they don't give it to you, you are left with missing data that is much more difficult to estimate (the cause of the missing data is tied to the data itself).
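The three cases can be sketched with a small simulation. Everything here is synthetic and assumed for illustration: the population size, the income figures, the response rates, and the use of an always-observed age-group flag (`older`) as the known attribute. The idea is simply to drop values under each mechanism and compare the mean of what is left against the true mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic population (all numbers are made up).
# "older" is an attribute we always observe; income is what may go missing.
people = []
for _ in range(20_000):
    older = random.random() < 0.5
    income = random.gauss(60_000 if older else 40_000, 10_000)
    people.append({"older": older, "income": income})


def observed_mean(people, p_missing):
    """Mean income over the people whose value was not dropped."""
    kept = [p["income"] for p in people if random.random() >= p_missing(p)]
    return statistics.mean(kept)


true_mean = statistics.mean(p["income"] for p in people)

# MCAR: every value has the same 30% chance of being lost.
mcar_mean = observed_mean(people, lambda p: 0.3)
# MAR: the chance of being lost depends only on the observed attribute.
mar_mean = observed_mean(people, lambda p: 0.6 if p["older"] else 0.1)
# MNAR: the chance of being lost depends on the unobserved income itself.
mnar_mean = observed_mean(
    people, lambda p: 0.6 if p["income"] > 50_000 else 0.1
)

for name, value in [("true", true_mean), ("MCAR", mcar_mean),
                    ("MAR", mar_mean), ("MNAR", mnar_mean)]:
    print(f"{name:>4}: {value:,.0f}")
```

In this setup the MCAR mean stays close to the true mean, while the MAR and MNAR means are pulled downward. The difference between the last two is that the MAR bias can be corrected using the observed attribute, whereas the MNAR bias cannot be seen from the observed data alone.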
There are a variety of techniques that you can use to address missing data, but the right choice depends on several factors. Are you missing some of the data for an individual respondent or are you missing all the data? Is it missing completely at random, missing at random, or missing not at random? Are there correlations between the different columns in each data row? Depending on the answers to these questions, you may be able to fill in some of the data with some accuracy, but it needs to be carefully evaluated in order to not skew results.
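As a rough illustration of why the answers matter, the sketch below compares two simple fill-in strategies on synthetic data that is missing at random. It is not a recommendation of either technique, and the population, incomes, response rates, and the always-observed `older` flag are all assumptions made for the example. One strategy fills every gap with the overall observed mean, the other fills each gap with the mean of respondents in the same observed group.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Synthetic survey rows (all numbers are made up). The age-group flag is
# always observed; income goes missing more often for the older group (MAR).
rows = []
for _ in range(20_000):
    older = random.random() < 0.5
    income = random.gauss(60_000 if older else 40_000, 10_000)
    missing = random.random() < (0.6 if older else 0.1)
    rows.append({
        "older": older,
        "income": None if missing else income,
        "truth": income,  # kept only so we can score the estimates
    })


def mean_after_overall_imputation(rows):
    """Fill every gap with the mean of all observed incomes."""
    fill = statistics.mean(r["income"] for r in rows if r["income"] is not None)
    return statistics.mean(
        fill if r["income"] is None else r["income"] for r in rows
    )


def mean_after_group_imputation(rows):
    """Fill each gap with the observed mean of the same age group."""
    fills = {
        group: statistics.mean(
            r["income"] for r in rows
            if r["older"] == group and r["income"] is not None
        )
        for group in (True, False)
    }
    return statistics.mean(
        fills[r["older"]] if r["income"] is None else r["income"] for r in rows
    )


true_mean = statistics.mean(r["truth"] for r in rows)
overall_estimate = mean_after_overall_imputation(rows)
group_estimate = mean_after_group_imputation(rows)

print(f"true mean:          {true_mean:,.0f}")
print(f"overall-mean fill:  {overall_estimate:,.0f}")
print(f"group-mean fill:    {group_estimate:,.0f}")
```

Because the gaps here depend on a known attribute, filling within groups recovers the average much better than a single overall fill value. Both approaches still understate the spread of the data, since every filled value sits exactly on a mean, which is one reason imputed results need careful evaluation.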
It has been an incredible conference so far. I built a JupyterHub cluster in the cloud, learned how to visualize my data using a really powerful graphing library, and learned about the types of missing data and how to account for them. It is already clear to me that Jupyter is a powerful tool that is applicable in almost any field that involves data and helps people easily tune their models and share them with others. There will be another writeup later today for Day 2, so stay tuned.