Today we’re going to talk to Matei Zaharia. As of this March, he’ll be an Assistant Professor at MIT. He’s currently the CTO of Databricks, a startup that uses Spark, a rapidly growing piece of “big data analytics” software that he created. Spark is powerful and important, but it runs behind the scenes for technology entrepreneurs and companies. It doesn’t have the sexiness of a new phone or wearable device that makes the world want to get one immediately. So how did he get it to take off?
Matei, tell us about Spark in simple terms: what it is and how it’s useful for people.
I began the Spark project when I was doing my PhD at UC Berkeley. Spark is a framework for writing distributed applications. Spark makes it easy for programmers as well as less technical people, such as data scientists and analysts, to work with large volumes of data.
What’s the background on how and why you decided to create Spark?
I was working pretty early on with Facebook, as well as some of the other companies that were starting to build large-scale data analysis on the Hadoop project. Hadoop is an open source project based on some of the infrastructure used at Google, and other types of companies eventually began using it as well. I was working with them on problems within Hadoop, like how to make job scheduling more efficient, or how to share access to a cluster evenly among multiple people. In doing that, we noticed that there were quite a few applications that the Hadoop model was not well suited for, and these applications would be very inefficient even if you were to add hundreds of machines. So I designed Spark initially to address one limitation, which was iterative algorithms such as machine learning applications, and we generalized it from there to cover other types of applications as well.
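To make the point about iterative algorithms concrete, here is a minimal sketch in plain Python (not Spark or Hadoop code). The interview does not spell out why the Hadoop model struggles here; the sketch assumes one commonly cited reason, namely that each pass of an iterative algorithm, when run as a separate job, re-reads its input from storage, while keeping the data in memory lets every pass reuse it. `DiskDataset`, `fit_reloading` and `fit_cached` are hypothetical names invented for this illustration.

```python
# Illustrative sketch only (plain Python, not Spark or Hadoop code).
# DiskDataset and both fit_* functions are hypothetical names.


class DiskDataset:
    """Stands in for a dataset on shared storage; counts full reads."""

    def __init__(self, points):
        self._points = list(points)
        self.loads = 0

    def load(self):
        self.loads += 1  # each call models one expensive full scan
        return list(self._points)


def gradient_step(points, w, lr=0.05):
    """One gradient-descent step for fitting y = w * x by least squares."""
    grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
    return w - lr * grad


def fit_reloading(dataset, iterations):
    """Every iteration re-reads the input, as a chain of independent jobs would."""
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(dataset.load(), w)
    return w


def fit_cached(dataset, iterations):
    """Read the input once, keep it in memory, and reuse it on every iteration."""
    points = dataset.load()
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(points, w)
    return w


data = [(x, 3.0 * x) for x in range(1, 6)]  # points on the line y = 3x
reloading, cached = DiskDataset(data), DiskDataset(data)
w_reloading = fit_reloading(reloading, 20)
w_cached = fit_cached(cached, 20)
print(reloading.loads, cached.loads)  # full scans performed by each approach
```

Both functions arrive at the same fitted value; the only difference is how many times the input is scanned, which is the cost that grows with every extra iteration.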
People often assume that if you create a great product, people will begin using it and it will spread. There’s a lot of truth to that, but for some products like Spark that are less sexy than consumer electronics like an iPhone, creative marketing helps to gain adoption. One of the things you did with Spark is to create a community – both online and offline – to adopt Spark and continue using it. This is something we do with our HOPE communities and we think is really powerful, so I’d love for you to tell everyone how you did that.
We open sourced Spark in 2010, put it out there and let people try it out. It took a few years for the community to really grow and become large, and there are a few reasons why it takes time:
First, the software has to be good. The product has to be easy to install and easy to use – which requires engineers and users trying it and providing feedback. Second, people have to hear about it. Third, people want to see someone else being successful with it. As soon as there were the first few companies that were successful and that started talking about it, we began to see many more starting to use it.
Initially, we started as a research project at Berkeley, and when we presented it, people were fairly excited about it. So we released the source code and started doing all the development online, although most people who were developing were from Berkeley. After a few months, however, we saw small patches coming from outside.
The second thing we did was go to a lot of conferences and give talks about it, showing some of our initial use cases. We also started working with a couple of companies that had specific use cases. They helped give us feedback as well. It was important to develop those use cases so others had a chance to hear about it.
Once we did that, we started to get more emails and questions about it, and one of the biggest things that helped was online discussion groups – Google groups – where we tried to be very responsive about answering all questions. This kept people participating, as they felt they could get help and that someone was there to listen to feedback. These things helped create the initial environment that paved the way for Spark to become a very good, easy to use and finished product.
Once we saw quite a few people interested in it, we also began hosting in-person events. One was the Meetup group in the Bay Area. It’s very common for engineers to participate in Meetups. One external community member suggested we do a Meetup, so we began hosting these events once a month. Some of the events were at UC Berkeley. New features were coming out on Spark, and many people came to these just because the field of Big Data was new and interesting to them. The second kind of in-person event was small training workshops and courses about Spark. UC Berkeley held an in-person bootcamp for Big Data in August 2012, and around 150 people came and saw and used Spark to learn big data concepts. Since then, this has continued annually, sometimes with multiple events each year, to give people a chance to learn [Big Data]. These things together helped to create a community.
In the past two years, we’ve also started having user conferences. Our first conference had 450 people in December 2013. Last summer, in July 2014, we had a conference with 1,100 people. This year, we have two – one on the East coast with 800 people and one on the West coast with closer to 2,000 [attendees]. As the community grows, you have these larger events as well.
There are several elements that cause a product to grow, but among the most important are being responsive and word of mouth (through the meetups and conferences). The online discussion group and the ability to see patches from anyone were also there from when we began. You kind of have to have that, so people have a way to ask questions and get help.
So it started off with people at UC Berkeley. What role did community members play, who weren’t necessarily involved in starting, but who helped spread the word?
Good question. We had a bunch of people who provided feedback or asked questions, but maybe never contributed code. They were still very valuable. We want an environment in which the product is improved in the direction users want. We had quite a few people who started contributing code, and by the middle of 2012 we had more people contributing from outside Berkeley than from inside. That was a nice milestone. One of the more important things is to get feedback, to have people who are willing to tell you how to improve. You also need to spend some time teaching others how to contribute, because in the long term that’s how you get a large community. We spent a lot of time reviewing external contributors’ patches and merging them. We tried to encourage people to keep contributing. Over time, it’s really paid off.
For someone else who is creating an open source technology and wants it to grow, be sustainable and keep gaining users over the long term, what advice would you give? What are the top three things you would advise on how to create a sustainable technology?
Regarding community building, you want to be VERY responsive to users and contributors. You also want to be inclusive: if you think the project has the chance to grow large, then you are going to need as many people as possible to help out with it. You need to nurture those people early on. Make sure they receive something from participating in it and that they feel welcome to do so. It can be difficult to do this if you are focusing on building things yourself, but this is the most important thing.
Regarding software in general, it’s much easier to build and create code when it’s small. It’s important to do the smallest things that solve problems for people but are still simple enough for a small team to build. Be prepared to spend quite a bit of time talking about software and helping the users and spending time on community building. I spent about a third of my time on community-building activities.
What’s the future for Spark? What are next steps?
Spark today is larger than ever before. In terms of open source projects for Big Data processing, it’s by far the most active community out there. We hope to continue having a large community and continue adding plenty of exciting features to Spark.
One of the main things we’ve been doing to support that and to allow fast growth is to design very standard kinds of interfaces and extension points. It’s similar to how in operating systems like Windows, there’s a standard way to create a printer driver or monitor driver or anything like that. Because of that, they have this huge ecosystem of companies with their own drivers. With Linux as well, or any large web server, there are similar standard extension points. So this is one way we are trying to enable more people to build libraries and applications. That’s the biggest thing happening right now in terms of project growth. We’ve seen quite a few applications in academic research as well.
Matei, thank you so much for your time and we’re excited to see Spark continue to have success!