Cryptography has been a passion of mine since the early days of my science career. You may think, “what does this have to do with data science?” Well, cryptography is in a way the exact opposite of data science: it involves making a signal complicated enough (enveloping it in noise) that it becomes very hard for others to see what’s there. Since both fields deal with information, similar principles apply throughout their pipelines. For example, the ectropy metric (not to be confused with entropy), which was developed for data science tasks, directly applies to this new coding system. However, instead of aiming for high ectropy, as is often the case in data science, we go for very low (practically zero) ectropy when dealing with cryptography-related data.
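Ectropy itself is not formally defined here, but a closely related, standard quantity, the Shannon entropy of a file’s byte distribution, gives a feel for the idea of measuring how noise-like data is. The following is a minimal Python sketch (the function name is my own, not part of any framework mentioned here):

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8).

    Low values mean highly regular data; values near 8 mean the bytes
    look uniformly random, as good ciphertext should.
    """
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A run of identical bytes carries no surprise at all.
assert byte_entropy(b"aaaaaaaa") == 0.0
# Every byte value appearing exactly once maxes the measure out.
assert abs(byte_entropy(bytes(range(256))) - 8.0) < 1e-9
```

A well-encrypted file should score near the maximum on a measure like this, which is consistent with the very low (near-zero) ectropy targeted above, since the two concepts pull in opposite directions.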
The Thunderstorm system is named as such because it is very chaotic (like a storm) and very fast (like thunder). It is implemented in Julia for extra speed, though the algorithm is fast on its own. What it does is transform a plaintext file into complete gibberish (a ciphertext file) in a way that is very hard to reverse unless you are in possession of the key file used in the process. The idea is that if even a single byte is off in the key file you use, applying it to the ciphertext file will only yield more gibberish, instead of the original file.
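The Thunderstorm algorithm itself is not spelled out here, so as a stand-in, the sketch below illustrates the general idea of key-file encryption with the simplest possible scheme, a one-time-pad-style XOR in Python. Note one important difference: plain XOR only corrupts the bytes where the key is wrong, whereas Thunderstorm reportedly spreads any key error across the entire output, so treat this purely as an illustration of key dependence, not of the real system:

```python
def xor_with_key(data: bytes, key: bytes) -> bytes:
    """XOR each data byte with the corresponding key byte.

    The key file must be at least as long as the data, as in a one-time pad.
    Applying the same key twice recovers the original data.
    """
    if len(key) < len(data):
        raise ValueError("key file must be at least as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))

plaintext = b"attack at dawn"
key = bytes([17, 254, 3, 91, 44, 200, 7, 99, 150, 12, 77, 31, 5, 210])

ciphertext = xor_with_key(plaintext, key)
assert xor_with_key(ciphertext, key) == plaintext   # the right key restores the file

bad_key = bytes([key[0] ^ 1]) + key[1:]             # a single bit off in the key file
assert xor_with_key(ciphertext, bad_key) != plaintext  # yields gibberish, not the original
```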
The key parts of the Thunderstorm system are:
The whole framework that surrounds the Thunderstorm coding system includes a file comparison method that allows for a high-level similarity estimate of two given files. This metric is a number between 0 and 1, much like most similarity metrics used in data science. When the source and target files are compared, this similarity is very low (practically zero in most cases). The ectropy of the target file is similarly low. Needless to say, the ectropy of the key file is usually a flat zero, as this allows for a better quality of encryption.
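The framework’s actual comparison method is not specified here, but one simple way to get a 0-to-1 similarity between two files is to count matching bytes at matching positions. The following Python sketch (the function name is hypothetical) shows this approach:

```python
def byte_similarity(a: bytes, b: bytes) -> float:
    """Fraction of positions where the two byte sequences agree.

    Dividing by the longer length penalizes any length mismatch,
    keeping the result in the range [0, 1].
    """
    if not a and not b:
        return 1.0  # two empty files are trivially identical
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

assert byte_similarity(b"storm", b"storm") == 1.0   # identical files
assert byte_similarity(b"abc", b"xyz") == 0.0       # no positions agree
```

Compared byte-for-byte like this, a plaintext file and its well-made ciphertext should score near zero, which is in line with the near-zero similarity described above.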
It is not clear how cryptanalysis will evolve in the years to come, but brute-force attacks are definitely becoming more and more popular as computing power increases (especially now that quantum computing is slowly becoming a viable alternative). This system, however, is well positioned to withstand them, as the key file is usually very large, allowing for what is known in cryptography as “perfect secrecy.” On top of that, even if the key is partly known, a file decrypted with that key is bound to be very different from the original, due to the additional layers of security involved in the system, making cryptanalysis infeasible. Since there is no way to derive the key from partially decrypted data, brute force remains the only option for breaking this code, a virtually impossible task given the nature of the key (which is on the order of thousands of bits). Therefore, this system is bound to remain safe, at least if one takes care to use different key files every now and then.
This whole endeavor goes to show that the fox-like data science mindset applies to related fields too, without too much extra work. Wherever there is information, there are ways to transform it into something more useful for the application at hand, whether that involves generating insight or creating ambiguity.
A big part of being a fox-like scientist is being able to think outside the box and come up with novel ways of tackling difficult problems. This strategy has been applied in all fields of science, even the more down-to-earth disciplines like Physics and Electronics. Thinking outside the box is great, but is it enough? Well, Romke Jan Bernhard Sloot, a Dutch electronics technician, certainly didn't think so. Through his somewhat brief career he went on to not only envision but also implement a novel coding system for video data towards the end of the 90s. This system, also known as the SDCS (Sloot Digital Coding System), was so revolutionary that it never got to become a widely available technology, just like many of Tesla's ground-breaking inventions.
Contrary to what its name implies, the SDCS is not a compression system. Compression systems have been researched to death, and there is little if any room for improvement in their effectiveness without requiring a huge amount of computing resources (which generally translates into long waits for the compression and decompression processes). What Jan Sloot did was come up with an ingenious method for using resources shared between the transmitter of a video and the receiver, so that only the vital information of the video (the data that makes it unique) needs to be transferred, allowing the video to be recreated on the receiver's end without conveying every single frame of it. He allegedly demonstrated this by implementing his method on a small integrated chip which, when powered, could recreate 16 videos simultaneously. This basically showed that the system was:
Unfortunately, not only did the SDCS never become a publicly available technology, but it was also lost, making all this know-how now seem more like science fiction. This is because Jan Sloot mysteriously died a couple of days before he was going to hand over the source code of his invention to the people who had agreed to fund him. However, the ideas behind this innovative approach to digital coding still linger and are there for everyone to investigate. Also, the work of E. Laszlo, a systems theorist who was nominated for the Nobel Peace Prize (in 2004 and 2005), seems to shed light on this matter. Namely, he wrote about the In-formation concept and how it helps shape our universe (this is greatly different from information, which is more of an expression of it). This In-formation can supposedly be accessed and applied to create forms, given the right amount of matter and energy. Although it is very unlikely that Jan Sloot was aware of this fairly esoteric knowledge, which is more relevant to cosmology than anything else, the same principle of In-formation applies to other aspects of the world, including digital coding technology. So, it is not far-fetched to say that Sloot's technology directly applied this principle to convey the essential data of a video and then use the shared video datasets to assemble that video into a coherent structure that could be enjoyed by the recipient of the chip.
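Sloot's actual method was lost with him, but the shared-resources idea described above can be caricatured with a toy sketch: sender and receiver hold the same reference dictionary of building blocks, so only compact references travel over the wire. Everything below is hypothetical, a generic illustration of reference-based coding, not a reconstruction of the SDCS:

```python
# A hypothetical dictionary of building blocks, held identically by
# both the transmitter and the receiver ahead of time.
SHARED_BLOCKS = {0: b"frame-A", 1: b"frame-B", 2: b"frame-C"}

def encode(frames: list) -> list:
    """Replace each frame with its index in the shared dictionary.

    Only these small indices (the data that makes the video unique)
    need to be transmitted.
    """
    lookup = {block: idx for idx, block in SHARED_BLOCKS.items()}
    return [lookup[frame] for frame in frames]

def decode(refs: list) -> list:
    """Rebuild the frames on the receiver's end from the shared dictionary."""
    return [SHARED_BLOCKS[ref] for ref in refs]

video = [b"frame-A", b"frame-C", b"frame-A"]
refs = encode(video)           # only three small integers cross the wire
assert decode(refs) == video   # the receiver reconstructs the original
```

The payload shrinks because the bulky content lives on both ends already; only the arrangement is sent, which loosely mirrors the idea of conveying a video's essential data rather than its frames.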
It will probably take decades before we reach his level of know-how and reinvent this lost technology. In the meantime, however, we can learn from his example and apply it to our own data science endeavors. Specifically, we can adopt his curiosity, an essential trait for every data scientist, and focus on a more hands-on approach to data analytics, rather than the restrictive model-based one that makes the work mechanical and irksome. We can also adopt his essentialist approach to problem-solving, which involves finding economical ways of devising and implementing a solution to a problem. Deep learning networks may be great, but it is through clever data engineering and the use of conventional, albeit fast, methods that we can tackle data analytics challenges without having to outsource everything to an AI whose results we are unable to fully interpret afterwards. Finally, we can adopt his grounded approach to science: he did not just come up with an elegant method, he actually turned it into a technology that everyone could use. We can do something similar in our own data science endeavors by not only coming up with useful metrics and algorithms but also implementing and optimizing them, to the extent that our know-how allows. This will help us not only exercise our creativity but also gain a deeper, more insightful knowledge of data science, in a fox-like way.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.