I've talked about Spark in my books, and I predicted that it was a very promising big data platform back when it was still a novelty. With the enthusiastic support of Databricks (the main contributor to Spark’s codebase) and a variety of data developers (data scientists who specialize in coding and in building data science tools), Spark became a force to be reckoned with. In fact, even people who were using Hadoop as their go-to big data framework started to shift to Spark, reducing their relationship with Hadoop to merely using its file system. Yet, even with all these developments, something was lacking in the Spark framework: the ability to integrate with Julia, a rising star in the programming world and, lately, in data science.
I was one of the first data scientists who openly supported Julia (while other data scientists were choosing to play it safe and stick with what they knew). Even though I had acquired sufficient expertise in both R and Python to not need yet another programming language in my toolbox, I learned this new language with whatever information I could dig up from the web (there were no good books on the language at the time). Julia’s advantages over other programming languages were too many and too powerful to ignore, so I started using it for my data science projects, in parallel with Python. In fact, one of the first Julia scripts I wrote aimed to help me organize my various Python scripts, so that I could easily pinpoint a particular function among a bunch of .py files in a folder on my work computer. Yet, Julia was not widely adopted by the data science community, and it’s quite likely that its lack of integration with Spark was one of the main causes of this.
Lately, things have started to shift. A few developers have created Spark packages for Julia, the most active of which is Spark.jl. Although it is still in its development stage, the whole project seems promising, because it shows that it is possible to link Spark and Julia, even if the connection has to take place through Java (which is one of the languages Julia can bridge to, something unfathomable for other “promising” new languages, like Go).
Naturally, Julia is a self-sufficient tool, so it does not need Spark or any other similar framework to be useful for data science. However, with many people following one big data ecosystem or another with great zeal, it is unlikely that Julia will get the place it deserves in data science without becoming appealing to these people. For most data scientists, learning Julia is clearly a walk in the park (it has elements of both Python and R, while Scala is very similar to Julia in terms of logic). Yet knowing how to write a Julia script is one thing; actually using such a script in production, as part of a codebase built around Spark, is a somewhat different matter! Also, even though Spark doesn't need Julia either, it has a lot to gain by bringing in all the Julian data scientists, since there is no doubt that it’s easier to use Julia than it is to use Scala when it comes to data science applications.
Will Julia and Spark form a synergy that will transform data science for years to come? Possibly. How? Well, letting go of certain attachments to a particular programming paradigm would definitely help. After all, if you know how to program well, learning to use another programming language, even one from a different paradigm, is not as much of a challenge as some people think. Besides, isn't pushing the envelope in tech a big part of data science anyway?
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.