For over 2 decades there is a puzzle game I've played from time to time, usually to pass the time creatively or to challenge myself in algorithm development. This game, which I was taught by a friend, didn't have a name and I never managed to find it elsewhere so I call it Numgame (as it involves numbers and it's a game). Over the years, I managed to solve many of its levels though I never got an algorithm for it, until now.
The game involves a square grid, originally a 10-by-10 one. The simplest grid that's solvable is the 5-by-5 one. The object of the game is to fill the grid with numbers, starting from 1 and going all the way to n^2, where n is the size of the grid, which can be any number larger than 4 (grids of this size or lower are not solvable).
To fill the grid, you can "move" horizontally, vertically and diagonally, as long as the cell you go to is empty. When moving horizontally or vertically you need to skip 2 squares, while when you move diagonally you need to skip 1. Naturally, as you progress, getting to the remaining empty squares becomes increasingly hard. That's why you need to have a strategy if you are to finish the game successfully.
Naturally, not all starting positions yield a successful result. Although more often than not you'd start from a corner, you may choose to start from any other square in the grid. That's useful, considering that some grids are just not solvable if you start from a corner (see image below; empty cells are marked as zeros)
Before we look at the solution I've come across, try to solve a grid on your own and think about a potential algorithm to solve any grid. At the very least, you'll gain an appreciation of the solution afterward.
Anyway, the key to solving the Numgame levels is to use a heuristic that will help you assess each move. In other words, you'll need to figure out a score that discerns between good and bad positions. The latter result from the various moves. So, for each cell in the grid, you can count how many legitimate ways are there for accessing it (i.e. ways complying with the aforementioned rules). You can store these numbers in a matrix. Then, you can filter out the cells that have been occupied already, since we won't be revisiting them anyway. This leaves us with a list of numbers corresponding to the number of ways to reach the remaining empty cells.
Then we can take the harmonic mean of these numbers. I chose the harmonic mean because it is very sensitive to small numbers, something we want to avoid. So, the heuristic will take very low values if even a few cells start becoming inaccessible. Also, if even a single cell becomes completely inaccessible, the heuristic will take the value 0, which is also the worst possible score. Naturally, we aim to maximize this heuristic as we examine the various positions stemming from all the legitimate moves of each position. By repeating this process, we either end up with a full grid or one that doesn't progress because it's unsolvable.
This simple problem may seem obvious now, but it is a good example of how a simple heuristic can solve a problem that's otherwise tough (at least for someone who hasn't tackled it enough to figure out a viable strategy). Naturally, we could brute-force the whole thing, but it's doubtful that this approach would be scalable. After all, in the era of A.I. we are better off seeking intelligent solutions to problems, rather than just through computing resources at them!
(image by Arek Socha, available at pixabay)
Lately, I've been working on the final parts of my latest book, which is contracted for the end of Spring this year. As this is probably going to be my last technical book for the foreseeable future, I'd like to put my best into it, given the available resources of time and energy. This is one of the reasons I haven't been very active on this blog as of late. In this book (whose details I’m going to reveal when it’s in the printing press) I examine various aspects of data science in a quite hands-on way. One of these aspects, which I often talk about with my mentees, is that of scale.
Scaling is very important in data science projects, particularly those involving distance-based metrics. Although the latter may be a bit niche from a modern standpoint where A.I. based systems are often the go-to option, there is still a lot of value in distances as they are usually the prima materia of almost all similarity metrics. Similarity-based systems, aka transductive systems, are quite popular even in this era of A.I. based models. This is particularly the case in clustering problems, whereby both the clustering algorithms and the evaluation metrics (e.g. Silhouette score/width) are based on distances for evaluating cluster affinity. Also, certain dimensionality reduction methods like Principle Components Analysis (PCA) often require a certain kind of scaling to function optimally.
Scaling is not as simple as it may first seem. After all, it greatly depends on the application as well as the data itself (something not everyone is aware of since the way scaling/normalization is treated in data science educational material is somewhat superficial). For example, you can have a fixed range scaling process or a fixed center one. You can even have a fixed range and fixed center one at the same time if you wish, though it's not something you'd normally see anywhere. Fixed scaling is usually in the [0, 1] interval and it involves scaling the data so that its range is constant. The center point of that data (usually measured with the arithmetic mean/average), however, could be distorted. How much so depends on the structure of the data. As for the fixed center scaling, this ensures that the center of the scaled variable is a given value, usually 0. In many cases, the spread of the scaled data is fixed too, usually by setting the standard deviation to 1.
Programmatic methods for performing scaling vary, perhaps more than the Stats educators will have you think. For example, in the fixed range scaling, you could use the min-max normalization (aka 0-1 normalization, a term that shows both limited understanding of the topic and vagueness), or you could use a non-linear function that is also bound by these values. The advantage of the latter is that you can mitigate the effect of any outliers, without having to eradicate them, all through the use of good old-fashioned Math! Naturally, most Stats educators shy away at the mention of the word non-linear since they like to keep things simple (perhaps too simple) so don’t expect to learn about this kind of fixed-range scaling in a Stats book.
All in all, scaling is something worth keeping in mind when dealing with data, particularly when using a distance-based method or a dimensionality reduction process like PCA. Naturally, there is more to the topic than meets the eye, plus as a process, it's not as basic as it may seem through the lens of package documentation or a Stats book. Whatever the case, it's something worth utilizing, always in tandem with other data engineering tools to ensure a better quality data science project.
Hello everyone and happy new year! I hope you all had a good holiday break. I thought about it quite a bit and I've decided this year to go a different direction with the videos I make as I plan to focus more on courses. Stay tuned for more news on this matter in the weeks to come...
Just wanted to wish you all Happy Holidays! It's been a great year and I appreciate your support through this blog. I won't be posting anything new in the next couple of weeks as I'll b traveling. Feel free to check out some of my older posts, though.
I hope your holidays are insightful, inspirational, and intriguing!
More important than remembering facts and methods related to data science problems is the trinity of inspiration, intuition, and imagination, with intelligence binding them all together. However, without inspiration, none of the stuff we know about data science is bound to grow much as our knowledge and know-how gradually crystallize and start giving in to entropy. So, I'd like to take a moment and remind everyone (including myself) the value of inspiration, even in a fairly technical field such as data science (I don't mention A.I. here because A.I. is its own source of inspiration, especially when one considers the applications of it).
So, what's your data science inspiration like? Where does it come from? What does it incentivize you towards? These are questions we need to ask ourselves from time to time, in order to make our learning of the field a sustainable process. The input of other data scientists is important in helping that but they may not always inspire us, especially after we grow out of the initial stages of our learning. This beginner’s mind although powerful is also fleeting and once it gives way to a more pragmatic view of data science, it is easy to lose our original enthusiasm for the field. That’s where inspiration comes in.
For me, the source of inspiration in data science is two-fold: first of all, it is my own research on the field, unbound by an academic agenda or a particular ideology (e.g. futurism). Such research is still disciplined but at the same time somewhat free, as in freedom (you can’t have research void of cost, unfortunately, even if that cost is just the time you dedicate to it). The other source of inspiration is mentoring, particularly students who are committed to learning data science through a structured and disciplined manner, such as the Thinkful courses on the subject. Naturally, I’d be happy to mentor other data science aspirants but so far this hasn’t taken place, for various reasons.
Beyond these, the educational material I create as well as the conferences I participate in can be a great source of inspiration too. However, these are not things that happen frequently enough so as to consider them as primary sources of inspiration, no matter how impactful they can be at times. In practice, they often act as conduits of inspiration, to a certain extent, something that’s also valuable. After all, all these aspects of my data science presence are interconnected and feed off each other.
What about you? What’s your inspiration for data science like? Does it come from a particular application, methodology, or educational material? How do you ensure that inspiration is part of your data science life?
The reality of data is often taken for granted, just like many things in data science. However, there is more to it than meets the eye and it's only after talking with other data professionals (particularly data architects) that this hierarchy of realities becomes accessible. Of course, this is not something you'll see in a data science book or video, but if you think about it it makes good sense. I've been thinking about it quite a bit before putting it down in words; eventually, all this helped me put things into perspective. Hopefully, it will do the same for you.
First of all, as the basest and most accessible reality of data, we have the values of a dataset. This involves all the numeric and non-numeric data that lives in the data frames we process. Naturally, this is usually referred to as data and it's the most fundamental entity we work with in every data science project. However, there is much more to all this than that since this data comes from somewhere else, through a higher abstraction of it.
This abstraction is the variables of the dataset. These are much more than just containers of the data values since they often represent pieces of information that represent characteristics we can relate to in the problem we are tackling. Also, the variables themselves have an inherent structure representing a pattern, which goes beyond the data values themselves. This is why Statistics is so obsessed with various metrics describing individual variables; in a way, these metrics reflect the essence of a variable and they are usually more important than the data itself.
Moreover, the relationships among all these variables are another level of reality regarding the data. After all, these variables are rarely independent of each other and the relationships among them are crucial for analyzing the data involved. This is what makes data generation a bit tricky since it's not as simple as creating data that follows the distribution of each variable involved. The relationships among the variables play a role in all this. That's why things like correlation metrics are important and help us analyze the data on a deeper level.
Furthermore, there is the structure of the dataset based on the inherent patterns and the reference variable. The latter is usually the target variable we are trying to predict. Naturally, the structure of the dataset is also relevant to the previous realities, particularly the one related to the relationship of the variables, since it influences the densities of the data. However, a higher-order is introduced to the data through the target variable, making this structure even more prominent. Whatever the case, it is by understanding this structure (e.g. through clustering, feature evaluation, etc.) that we manage to gain a deeper understanding of the essence of the data.
Finally, there are the multidimensional patterns that generated the data in the first place. This is the most important reality of the data since it's the one that defines the whole dataset and in a way transcends it. After all, a dataset is but a sample of all the possible data points that stem from a certain population. The latter is usually beyond reach and it can be limitless as new data usually becomes available. So, knowing these multidimensional patterns is the closest we can get to that population and making use of them is what makes a data science project successful.
Naturally, A.I. is involved in each one of these realities, usually as a tool for analyzing the data. However, it’s particularly relevant in the last level whereby it figures out these multidimensional patterns and manages to create new data similar to the original. Also, understanding these patterns well enables it to make more accurate predictions, due to the generalization of the data that it accomplishes.
Nevertheless, this 5-fold hierarchy of the realities of the data is useful for understanding a dataset, with or without A.I. methods. As a bonus, it enables us to gain a better appreciation of the heuristics available and helps us use them more consciously.
As mentioned in a previous post, translinearity is a concept describing the fluidity of the linear and the non-linear, as they are combined in a unified framework. However, linear relationships are still valuable, particularly if you want to develop a robust model. It's just that the rigid classification between linear and non-linear is arbitrary and meaningless when it comes to such a model. To clarify this whole matter I started exploring it further and developed an interesting heuristic to measure the level of non-linearity on a scale that's intuitive and useful.
So, let's start with a single feature or variable. How does it fare by itself in terms of linearity and non-linearity? A statistician will probably tell you that this sort of question is meaningless since the indoctrination he/she has received would make it impossible to ask anything that's not within a Stats course's curriculum. However, the question is meaningful even though it's not as useful as the follow-up questions that can ensue. So, depending on the data in that feature, it can be linear, super-linear, or sub-linear, in various degrees. The Index of Non-Linearity (INL) metric gauges that and through the values it takes (ranging from -1 to 1, inclusive) we can assess what a feature is like on its own. Naturally, these scores can be easily shifted by a non-linear operator (e.g. sqrt(x) or exp(x)) while all linear operators (e.g. standard normalization methods) do not affect these scores. Also, at the current implementation of INL, the value of the heuristic is calculated using three reference points in the variable.
Having established that, we can proceed to explore how a feature fares in relation to another variable (e.g. the target variable in a predictive analytics setting). Usually, the feature is used as the independent variable and the other variable as the dependent one, though you can explore the reverse relationship too, using this same heuristic. Interestingly the problem is not as simple now because the two variables need to be viewed in tandem. That's why all the reference points used shift if we change the order of the variables (i.e. the heuristic is not symmetric). Whatever the case, it is still possible to calculate INL with the same idea but taking into account the reference values of both variables. In the current implementation of the heuristic, the values can go a bit off-limits, which is why they are bound artificially to the [-1, 1] range.
Naturally, metrics like INL are just the tip of the iceberg in this deep concept. However, the existence of INL illustrates that it is possible to devise heuristics for every concept in data science, as long as we are open to the possibilities the data world offers. Not everything has been analyzed through Stats, which despite its indisputable value as a data science tool, it is still just one framework, a singular way of looking at things. Fortunately, the data-scapes of data science can be viewed in many more ways leading to intriguing possibilities worth exploring.
Lately, I've been busy with preparations for my conference trips, hence my online absence. Nevertheless, I found time to write something for you all who keep an open mind to non-hyped data science and A.I related content. So, this time I'd like to share a few thoughts on programming for data science, from a somewhat different perspective.
First of all, it doesn't matter that much what language you use, if you have attained mastery of it. Even sub-Julia languages can be useful if you know how to use them well. However, in cases where you use a less powerful language, you need to know about lambda functions. I mastered this programming technique only recently because in Julia the performance improvement is negligible (unless your original code is inefficient to start with). However, as they make for more compact scripts, it seems like useful know-how to have. Besides, they have numerous uses in data science, particularly when it comes to:
Another thing that I’ve found incredibly useful, and which I mastered in the past few weeks, is the use of auxiliary functions for refactoring complex programs. A large program is bound to be difficult to comprehend and maintain, something that often falls into the workload of someone else you may not have a chance to help out. As comments in your script may also prove insufficient, it’s best to break things down to smaller and more versatile functions that are combined in your wrapper function. This modular approach, which is quite common in functional programming, makes for more useful code, which can be reused elsewhere, with minor modifications. Also, it’s the first step towards building a versatile programming library (package).
Moreover, I’ve rediscovered the value of pen and paper in a programming setting. Particularly when dealing with problems that are difficult to envision fully, this approach is very useful. It may seem rudimentary and not something that a "good data scientist" would do, but if you think about it, most programmers also make use of a whiteboard or some other analog writing equipment when designing a solution. It may seem like an excessive task that may slow you down, but in the long run, it will save you time. I've tried that for testing a new graph algorithm I've developed for figuring out if a given graph has cycles (cliques) in it or not. Since drawing graphs is fairly simple, it was a very useful auxiliary task that made it possible to come up with a working solution to the problem in a matter of minutes.
Finally, I discovered again the usefulness of in-depth pair-coding, particularly for data engineering tasks. Even if one's code is free of errors, there are always things that could use improvement, something that can be introduced through pair-coding. Fortunately, with tools like Zoom, this is easier than ever before as you don't need to be in the same physical room to perform this programming technique. This is something I do with all my data science mentees, once they reach a certain level of programming fluency and according to the feedback I've received, it is what benefits them the most.
Hopefully, all this can help you clarify the role of programming in data science a bit more. After all, you don't need to be a professional coder to make use of a programming language in fields like data science.
Everyone in data science (and even beyond data science to some extent) is familiar with the process of sampling. It’s such a fundamental method in data analytics that it’s hard to be unaware of it. The fact that’s so intuitive as well makes it even easier to comprehend and apply. Besides, in the world of Big Data, sampling seems to be not only useful but also necessary! What about data summarization though? How does that fit in data science and how does it differ from sampling?
Both data summarization and sampling aim to reduce the number of data points in the data set. However, they go about it in very different ways. For starters, sampling usually picks the data points randomly while in some cases, it takes into account an additional variable (usually the target variable). The latter is the case of stratified sampling, something essential if you want to perform proper K-fold cross-validation for a classification problem. Data summarization, on the other hand, creates new data points that aim to contain the same information as the original dataset, or at least retain as much of it as possible.
Another important difference between the two methodologies is that data summarization tends to be deterministic, while sampling is highly stochastic. This means that you cannot use data summarization instead of sampling, at least not repeatedly as in the case of K-fold cross-validation. Otherwise, you’ll end up with the same results every time, something that doesn’t help with the validation of the models at hand! Perhaps that’s one of the reasons why data summarization is not so widely known in the data science community, where model validation is a key focus of data science work.
What’s more, if sampling is done properly, it can maintain the relationships among the variables at hand (obviously this would entail the use of some heuristics since random sampling alone won’t cut it). Data summarization, on the other hand, doesn't do that so well, partly because it focuses on the most important aspects of the dataset, discarding everything else. This results in skewing the variable relationships a bit, much like a PCA method changes the data completely when it is applied. So, if you care about maintaining these variable correlations, data summarization is not the way to go.
Finally, due to the nature of the data involved, data summarization could be used for data anonymization and even data generation. Sampling, however, wouldn't work so well for these sorts of tasks, even though it could be used for data generation if the sampling is free of biases (something which can also be attained if certain heuristics are applied). All this illustrates the point that although these two methods are quite different, they are also applicable in different use cases so they don’t exactly compete with each other. It’s up to the discerning data scientist to figure out when to use which, adding value to the project at hand.
Lately, I've made some progress on a data science research project I've been working on for the past couple of years. I’ve hinted about it in previous posts, though due to the nature of this work I’ve abstained from going into any details. Besides, most people are not that open to new ideas, unless they are marketed by some established company or some renowned professor.
Anyway, the other day I made a breakthrough in this work, something that can have significant implications in how we deal with private data. What’s more, I've developed a new way of summarizing a dataset (which is innately different from sampling it), with minimal loss of information. This opens new avenues of research and the possibilities of new data science and A.I. methods are vast. Naturally, I'll need to look into this more, so any online writing I do will have to take second priority.
Parallel to that, I’ve been working on another project lately, something I plan to continue for the foreseeable future. However, an important part of it is completed, which I’ll make sure I’ll announce in the next few days.
As a result to all this, I’m now more open to hosting other people’s articles on data science and A.I. topics, given that they are not spammy in any way. Back-links are also acceptable, given that they are towards relevant sites to the articles. So, if you have something you’d like to contribute to the blog, now is a great opportunity to do so.
Whatever the case, I plan to continue writing on this blog albeit at a slower pace for the time being, so stay tuned!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.