Day 2 and 3 at JupyterCon

Thoughts from the final 2 days of the conference and some reflections on Jupyter, the open source community, and the remarkable things they are doing.

I have reached the end of JupyterCon, and the last three days have flown by. The first day was hands-on tutorials that generally ran over 90 minutes each, while the final two days were shorter sessions of around 45 to 60 minutes. In this article I will try to capture some of the highlights of what I saw during those sessions and the conference as a whole.

Jupyter as a Power Tool

One thing that was clear from almost every session I attended was that Jupyter is a power tool that belongs in almost anyone's toolbox. It might seem like a tool focused on Data Science and Machine Learning, but in actuality it is far more than that. It can be a teaching tool, an IDE, a presentation tool, a collaboration tool, and much more. With tools like Jupyter Lab being so easily extensible, there is almost nothing you couldn't do directly in it.

I attended a session on writing extensions for Jupyter Lab, and one example was an extension that pulled in random XKCD cartoons. Another example allowed you to play mp4 movie files. A third extension let you browse GitHub repositories. Extensions are written in TypeScript or JavaScript using Node.js tooling and can use UI frameworks like React. If you can build it for a standalone website, you can build it to run inside a Jupyter Lab instance. This opens up possibilities for dashboards, applications, monitoring tools, and more. You could literally run your entire business through Jupyter Lab.

That raises the question: would you actually want to do that? Maybe not. Jupyter Lab does not yet account for factors such as high CPU performance requirements, large storage needs, and so on. It is still an evolving tool, however, and it may be able to accommodate applications with more demanding operational requirements in the future.

Sharing and Versioning Data Like Code

One really interesting session I attended talked about how we love our Jupyter notebooks, but there is a problem when we start sharing them around. The code in the notebook generally stays intact (unless someone modifies it along the way). The data, however, may not be contained in the notebook, and it can be much harder to ensure that the data stays consistent and that the notebook will continue to work over time.

That is something the folks at Quilt are working to improve. Instead of pulling data from a database or from flat files alongside the notebook, why not pull data from repositories, just as we currently pull code? When you need a library in a Python program, such as NumPy, you simply import it and start using it:


>>> import numpy as np
>>> a = np.arange(15).reshape(3, 5)
>>> a
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

Now imagine that you could import sets of data the same way:


# import quilt and install data from a repository
import quilt
quilt.install("uciml/iris", hash="da2b6f56f3")

# import iris data
from quilt.data.uciml import iris

# deserialize into a pandas dataframe
df = iris.tables.iris()

All you need to pull the data in is a name (uciml/iris) and, optionally, a version of the data to import (given by the hash). This means that even if the data changes, the notebook will continue to work because it uses a pinned version of the data, just as it would if it depended on a pinned version of a library. This makes notebooks more resilient to change and keeps them useful for much longer.
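
Sharing works in the other direction as well. If I followed the session correctly, publishing your own data package with the same quilt client looks roughly like this; treat it as a sketch, and note that the user, package, and file names below are hypothetical:


# build a data package from a local CSV file (hypothetical names)
import quilt
quilt.build("myuser/measurements", "measurements.csv")

# push the package to the registry so others can quilt.install() it
quilt.push("myuser/measurements", is_public=True)

From there, a colleague can install the package by name and hash, exactly like the iris example above.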

The Jupyter Open Source Community

Another thing I learned over the past few days is that Jupyter has a vibrant community of developers who are actively contributing to open source (and recruiting heavily to bring more people into it). This includes developers in academia, startups, and even large enterprises. They develop tools for their own projects and then bring those tools back to the open source world so that others can benefit from their work.

Many of the sessions I attended over the past two days were simply demonstrations of techniques and extensions that teams use in their Machine Learning or Data Science work. Several were from large financial institutions that apply these techniques not only to regular business but also to areas like security and to understanding their customers better for future product development.

A lot of the members of this community are researchers and scientists who don't have large budgets for professional tools, so access to such remarkable open source software makes it possible to do their work at a fraction of the usual cost. More than that, these tools are becoming so standard that researchers who would traditionally share their findings only through publications like Nature or Science can now share them electronically, along with the data.

Jupyter as a Bridge and a Teaching Tool

The other really remarkable thing about Jupyter is that it bridges the gap between technical and non-technical people. A lot of work goes into gathering data, constructing models, training them, compiling the results, and then presenting them in visualizations for the people who need that information to make important decisions. The fact that Jupyter can do all of these things in one place is an incredible achievement.

Microsoft was able to take over the enterprise by offering a suite of tools that most businesses use today. We write text documents in Word, create presentations in PowerPoint, read our email in Outlook, and organize numerical data in Excel. There are alternatives out there for these tools, but the integration between them and the fact that so many of us use them has pretty much made them the standard to follow. Technical and non-technical people alike can use these tools to share ideas and communicate.

Jupyter is still in the early stages of adoption for most companies, but the growth of Data Science and Machine Learning is exploding, and I expect to see it become another standard tool that people install when they set up their computer environment. They may even start using it for some of the things they would normally have done in Word, PowerPoint, or Excel. I don't expect people to start reading their email in it, but there is no reason it couldn't be extended to do that too.

Part of bridging that gap between technical and non-technical people is training, and Jupyter has enormous potential for revolutionizing the way we train people to do almost anything. Textbooks are now being written directly as Jupyter notebooks, so that as students read the material they can watch interactive videos, explore diagrams, run sample code, and even work through exercises that give immediate feedback. This kind of training could extend to places where people don't normally have easy access to such resources.
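
As a simple illustration of that immediate feedback (a generic sketch, not tied to any particular courseware or grading tool), an exercise cell in a notebook can check a student's answer the moment they run it:


# exercise: write a function that returns the mean of a list of numbers
def mean(values):
    return sum(values) / len(values)

# running this cell gives the student instant feedback
assert mean([1, 2, 3, 4]) == 2.5, "Not quite - check how you divide by the count"
print("Correct!")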

Conclusion

It's been a really wonderful conference, and I met a lot of folks who are excited about the remarkable things that Data Science and Machine Learning can tell us. Doctors, scientists, engineers, teachers, and people in many other fields are discovering how easy it is to start building models and drawing conclusions from them.

Tools like Jupyter are making this even easier and allowing them to share their findings with others, whether or not those people are technical. Tools like Quilt are enabling us to share our data in the same way we share code. The community is continuing to enhance and refine the tools for Data Science and Machine Learning. New ways of training both technical and non-technical people are being developed.

I've really enjoyed this conference, and I can't wait to see what the future holds for these remarkable tools.
