Is Spark Necessary in Data Science Today?

10/26/2020

As most of you probably know, Apache Spark is a big data platform (a data governance system), geared towards large-scale data. Employing HDFS (Hadoop Distributed File System), it handles data on a computer cluster (or a cloud), processing it and even building some basic predictive models with it (including Graph-based ones). Written in Scala, it is compatible with two more programming language: Java and Python. However, most people use it simply because it’s much faster than Hadoop, up to 100 times faster, while it offers a more intuitive and versatile toolbox to its users.

Spark fills a unique need in data science: handling big data quickly, in an integrated ecosystem. If you also know Scala, that’s a big plus because you can write your own scripts to execute in this ecosystem, without having to rely on other data professionals. Also, Spark has a large enough community and resources, making it a safe option for data science work, while there are companies like DataBricks which are committed to maintaining it and evolving it. The fact that there is even a conference on Spark (and AI) by the aforementioned company adds to that.

OK, but what about Spark alternatives? After all, Spark can’t be the only piece of software that does this set of tasks well, right? Well, there is also Apache Storm, a big data tool created by Twitter. One advantage it has over Apache Spark is that it can run on any programming language. Also, it’s fault-tolerant and fairly easy to use.

What’s more, there is IBM Infosphere Streams, which specializes in stream processing. Also, the fact that it has an IDE based on the Eclipse one, makes it easier to work with. Although Storm is fast, this one is even faster, while it can also fuse multiple streams together. This can aid your analysis in terms of insights made available. This big data tool works with the Streams Processing Language (SPL) and comes with several auxiliary programs for monitoring and more effective administration.

Additionally, there is Apache Flink, another high-speed and fault-tolerant alternative to Spark (like Apache Storm). Naturally, it’s compatible with a variety of similar tools, like Storm as well as Hadoop’s MapReduce. Because of its design, which focuses on a variety of optimizations and the use of iterative transformations, it’s a great low-latency alternative to Spark.

Finally, there is TIBCO StreamBase, which is designed with real-time data analytics in mind. The niche of this software is that it creates a live data mart communicating with the users through push notifications. Then, they can analyze and create visuals of the data interactively StreamBase appeals also to non-data scientist, through its LiteView Desktop version.

Beyond these four alternatives to Apache Spark there exist several ones which are beyond the scope of this article. The idea here is to show that there are other tools out there for this sort of data governance tasks, some of which can be even better than Spark. In any case, it's good to be aware of what's out there and develop a mindset that is not tied to specific software for the tasks at hand. More about all this in my book Data Science Mindset, Methodologies, and Misconceptions, which I authored a few years back, yet it remains relevant to this day. Check it out when you have a moment. Cheers!

0 Comments

FOXY DATA SCIENCE
unconventional insights about data science, A.I., cybersecurity, data analytics, and more