Everyone can analyze data these days, given the right programming tool and some library of functions, to express practically the relevant know-how of that person. I've seen people who give away books for free (as it would be impossible for them to get others to buy them) analyze data. As data science becomes more widespread, data analysis becomes a given for a larger portion of the population. But what about data synthesis, however? What's up with that? Let's delve into this.
First of all, let's get some definitions down. Data synthesis is the creation of synthetic data that follows a given pattern. The latter can be given directly to the data generation program, or it can be derived (extrapolated) via data analytics. Synthetic data is ideally indistinguishable from conventional data, and you can use it to train a data model, for example. However, there is something that makes it extremely valuable.
The value of synthetic data lies in the fact that it's not tied to particular individuals, so using this data doesn't pose any PII-related issues. Because of this, it cannot be owned by any specific person, even if it can be leveraged in the data science pipeline, yielding value. Naturally, since there are no shortcuts to value-making, the value (information) of that synthetic data must come from somewhere. So, since it's not practical for someone to have a high-level mathematical representation of this value and give it to a program as a pattern, it's more likely that this value stems from the source data.
So, to have valuable synthetic data (that's also free of PII), we need to have some source data of value, for starters. That's why the only practical way to generate synthetic data that's worth its space on a hard disk is via analytics. Of course, there are ways to generate such data through analytics, as in the case of some specialized deep learning networks (Autoencoders). The catch is that these A.I. systems require lots of data to do their job. After all, analyzing multiple variables isn't easy, even for an A.I. What if there was a way to perform the same task without employing these more advanced data-hungry systems?
Enter the BROOM framework again! We've already described some of its functionality, but what if this was just a prelude for its more sophisticated aspects? Well, fortunately, data synthesis isn't all that different from sampling, if you know what you are doing? And if you can sample a dataset properly, it's not that much more challenging to create new data points aligned with its essence. Naturally, the synthetic data is generated in a stochastic manner since it makes more sense to leverage noise in this process. Otherwise, all the generated data would be the same every time. Oh, and did I mention that this data synthesis process is scalable to as many dimensions as you like? Because if you understand data in-depth, the cardinality of vectors in a dataset is just another number...
I've been trying to answer this question for years. Well, not many years, but still, at least since the second half of the previous decade. Why? Well, I've always liked to explore the boundaries between the continuous and the discrete and since I finally internalized the teaching that everything in this universe is discrete (see: Quantum Physics), I decided to explore that angle and see if there was indeed a way to turn a continuous variable into a discrete one, with minimal information loss.
Over the past few months, I've developed three distinct approaches, depending on how distinct the values of the target variable are (see what I did there?). Let's start with something simple: no target variable at all. So, how can we discretize a continuous variable x? Well, you have to binarize it until there is no more binarization possible! But how do you optimally binarize a variable? That's something that involves densities after you handle all outliers and inliers in x, of course. How do you do that? Well, that's a topic that can fill a whole book chapter, so I'll have to draw the line here, I'm afraid.
What about when there is a target variable? Let's start with a binary one as it's simpler this way. We can employ a robust similarity metric that can assess the similarity of two binary variables, regardless of their alignment or any similarities due to chance. Fortunately, I've developed one such metric, which I call holistic symmetric similarity (HSS), which also works with all sorts of discrete variables. So, by using this metric, we can optimize the split to maximize the HSS score between the binarized x and the target variable y. The same approach works if y is discrete but not binary since I've generalized HSS to handle nominal variables too.
Ok, but what about when y is continuous, though? Well, that takes a bit more creativity since it's not as simple a task as it may seem. Fortunately, it's doable and relatively light, computationally speaking. We can find the threshold that maximizes a custom correlation metric that becomes larger once any non-linearities are tackled. This process doesn't have to be rocket science since I'm sure you can come up with a metric like that if you've been mentored by someone worth his salt in data science. Of course, you could use a translinearity correlation metric, yet, I wouldn't recommend that since it would inevitably pick up signals you wouldn't want it to, plus it's bound to be more computationally heavy.
So, there you have it. You can binarize and therefore discretize any feature x you like, with or without a target variable. The latter can be binary, discrete, or even continuous, depending on the problem at hand. Such a process can help you preserve computational resources and perhaps even enable you to make better and more transparent models (after all, binary variables tend to be easier on the mind, not just on the computer). All this I've done in the OD.jl script, which I cannot share here, unfortunately, as it has dependencies on proprietary code (the BROOM framework), which I'd rather not give away. Still, if you wish to explore this topic further, we can do that in a one-on-one mentoring session or two, given that you have the required commitment to the craft and a genuine interest to learn more about it. Cheers!
I realize that I’ve done this topic before, but perhaps it needs some more attention, as it’s a very useful topic.
Diagrams are great, but they are also challenging. As for the other graphics (particularly those not generated by a plotting library), these can be tough too. But both diagrams and these unconventional graphics are often essential in our line of work, be it as data scientists or data analysts. Let's examine the hows and whys of all this.
First of all, diagrams and graphics, in general, are a means of conveying information more intuitively. When you look at a table filled with numbers and other kinds of data, you need to think about them, and sometimes you have to know something about the context of all this. With diagrams, you may get an idea of the underlying information even if you don't know much about the context. Of course, the latter can help bring about scope and perspective, helping you interpret the diagram better and make it more applicable to the task at hand.
Diagrams and unconventional graphics are paramount in presentations too. Imagine going to a client or a manager with just a code notebook at your disposal! Even if they may appreciate you having done all this work chances are that you'll need more than that to get them on your side and see the real value behind all these ones and zeros! Besides, the adage of "a picture is a thousand words" is valid, even in Analytics work. Data modelers have figured that out a long time ago, which is why diagrams are their bread and butter. Perhaps there is something to be learned from all this.
But how do you go about creating diagrams and unconventional graphics in general? After all, graphics design is a challenging discipline, and it's not realistic to try to do this kind of work without lots of studying and practicing. Also, it's doubtful we'll ever be as good as graphics professionals who often have the talent to drive their know-how. Still, we can learn some basics and create decent-looking diagrams and graphics, to facilitate our data science endeavors.
For starters, we can invest in learning a program like GIMP. This software is an open-source version of Photoshop, and it's well-established and documented. So, if you have a good image or graphic to work with, GIMP can make it shine. Also, programs like LibreOffice Draw can be practically essential for this sort of work, especially if you want to build something from scratch.
Contrary to what some people think, creating graphics is very detailed work, not some artistic endeavor. You need to use both your analytical and your imaginative faculties for such a task, even if the imagination part may seem dominant, at least in the beginning. So, for any graphics-related tasks, remember, zooming in is your friend! As for the properties box of any graphical object, that's your best friend!
Anyway, I could go on and talk about graphics in data science and data analytics work all day, but it’s not possible to do this topic justice in a single blog post. Besides, the best way to learn is by practicing, just like when it comes to building and refining data models for your Analytics work. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.