Cryptography has been a passion of mine since the early days of my science career. You may think, “what does this have to do with data science?” Well, cryptography is in a way the exact opposite of data science: instead of extracting a signal from noise, it buries the signal in noise, so that it becomes very hard for others to see what’s there. Since both fields deal with information, similar principles apply throughout their pipelines. For example, the ectropy metric (not to be confused with entropy), which was developed for data science tasks, applies directly to this new coding system. However, instead of aiming for high ectropy, as is often the case in data science, we aim for very low (practically zero) ectropy when dealing with cryptography-related data.
The Thunderstorm system is named as such because it is very chaotic (like a storm) and very fast (like thunder). It is implemented in Julia, for extra speed, though the algorithm is fast in its own right. What it does is transform a plaintext file into complete gibberish (a ciphertext file) in a way that is very hard to reverse unless you are in possession of the key file used in the process. The idea is that if even a single byte is off in the key file you use, applying it to the ciphertext file will only yield more gibberish, instead of the original file.
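The article does not disclose Thunderstorm’s actual algorithm, and the real implementation is in Julia; still, the key-file idea can be illustrated with the classic one-time-pad XOR construction. The sketch below is purely hypothetical and only demonstrates the general property described above: the right key restores the plaintext, while a key that is off by even one byte yields more noise.

```python
import secrets

def xor_with_key(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the corresponding key byte.

    With a random key at least as long as the data, this is a
    one-time pad; applying the same key again restores the data.
    """
    if len(key) < len(data):
        raise ValueError("key file must be at least as long as the data")
    return bytes(d ^ k for d, k in zip(data, key))

plaintext = b"Meet me at noon."
key = secrets.token_bytes(len(plaintext))   # stands in for the key file

ciphertext = xor_with_key(plaintext, key)   # looks like random noise
recovered = xor_with_key(ciphertext, key)   # original file restored
assert recovered == plaintext

# A key that is off by a single byte only produces more gibberish:
bad_key = bytes([key[0] ^ 0xFF]) + key[1:]
garbled = xor_with_key(ciphertext, bad_key)
assert garbled != plaintext
```

Note that Thunderstorm is described as involving additional layers beyond any single transformation like this; the snippet only captures the all-or-nothing role of the key file.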
The key parts of the Thunderstorm system are:
The framework that surrounds the Thunderstorm coding system includes a file comparison method that provides a high-level similarity estimate for two given files. This metric is a number between 0 and 1, like most similarity metrics used in data science. When the source and target files are compared, this similarity is very low (practically zero in most cases). The ectropy of the target file is similarly close to zero. Needless to say, the ectropy of the key file is usually a flat zero, as this allows for better-quality encryption.
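The exact file comparison method is not specified in the text, so the sketch below is only a stand-in: a crude 0-to-1 similarity based on the overlap of two files’ byte histograms. The function name and approach are my assumptions, not Thunderstorm’s actual metric.

```python
import secrets
from collections import Counter

def byte_similarity(a: bytes, b: bytes) -> float:
    """A crude 0-to-1 similarity: overlap of the normalized byte histograms.

    Values near 1 mean near-identical byte distributions; values near 0
    mean the files share almost no byte-frequency structure. Purely
    illustrative -- not the metric used by the Thunderstorm framework.
    """
    if not a or not b:
        return 0.0
    ha, hb = Counter(a), Counter(b)
    return sum(min(ha[x] / len(a), hb[x] / len(b)) for x in range(256))

text = b"the quick brown fox jumps over the lazy dog " * 10
noise = secrets.token_bytes(len(text))  # stands in for a ciphertext

print(byte_similarity(text, text))   # close to 1.0 for identical files
print(byte_similarity(text, noise))  # much lower for text vs. noise
```

A histogram-based comparison like this only sees distributions, not byte order; a production metric would presumably be more discriminating, but it suffices to show why plaintext and ciphertext score near zero.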
It is not clear how cryptanalysis will evolve in the years to come, but the brute-force attack is definitely becoming more and more popular as computing power increases (especially now that quantum computing is slowly becoming a viable alternative). However, this system is bound to remain safe, as the key file is usually very large, allowing for what is known in cryptography as “perfect secrecy” (the property of a one-time pad whose key is as long as the message). On top of that, even if the key is partly known, a file decrypted with that key is bound to be very different from the original, due to the additional layers of security involved in the system, making cryptanalysis an infeasible task. As a result, since there is no way of deriving the key from partially decrypted data, a brute-force attack is the only option for breaking this code, and that is a virtually impossible task, due to the nature of the key (which is on the order of thousands of bits). All in all, the system should remain safe, at least if one takes care to use different key files every now and then.
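To put a key of that size in perspective, a back-of-the-envelope calculation shows why exhaustive search is hopeless. The attacker speed below is an assumed, deliberately generous figure, not a measured one.

```python
# Rough brute-force arithmetic for a 1000-bit key (illustrative numbers).
keyspace = 2 ** 1000                 # number of possible keys
guesses_per_second = 10 ** 18        # an extremely generous attacker
seconds_per_year = 3.15 * 10 ** 7    # roughly one year in seconds

# Expected time to sweep the whole keyspace, in years:
years = keyspace / (guesses_per_second * seconds_per_year)
print(f"{years:.2e} years")  # an astronomically large number
```

Even shaving hundreds of orders of magnitude off this estimate (say, via a quantum square-root speedup) leaves the search utterly out of reach, which is the point being made above.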
This whole endeavor goes to show that the fox-like data science mindset applies to related fields too, without too much work. Wherever there is information, there are ways to transform it into something else that is more useful for the application at hand, whether that involves generating insight or creating ambiguity.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.