Everyone in data science (and, to some extent, beyond it) is familiar with the process of sampling. It’s such a fundamental method in data analytics that it’s hard to be unaware of it. The fact that it’s so intuitive makes it even easier to comprehend and apply. Besides, in the world of Big Data, sampling seems to be not only useful but also necessary! What about data summarization though? How does it fit into data science, and how does it differ from sampling?
Both data summarization and sampling aim to reduce the number of data points in the dataset. However, they go about it in very different ways. For starters, sampling usually picks the data points randomly, though in some cases it also takes into account an additional variable (usually the target variable). The latter is the case of stratified sampling, which is essential if you want to perform proper K-fold cross-validation for a classification problem. Data summarization, on the other hand, creates new data points that aim to contain the same information as the original dataset, or at least retain as much of it as possible.
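To make the contrast concrete, here is a minimal Python sketch using only the standard library. The function names and the toy data are made up for illustration, and the per-class mean is just a crude stand-in for summarization, not the method discussed in this post.

```python
import random
from collections import defaultdict
from statistics import mean


def group_by_class(points, labels):
    """Collect the points belonging to each class label."""
    by_class = defaultdict(list)
    for point, label in zip(points, labels):
        by_class[label].append(point)
    return by_class


def stratified_sample(points, labels, fraction, seed=None):
    """Randomly keep about `fraction` of the points of each class.

    The selected points are existing rows of the dataset.
    """
    rng = random.Random(seed)
    sample = []
    for label, members in group_by_class(points, labels).items():
        k = max(1, round(fraction * len(members)))
        sample.extend((p, label) for p in rng.sample(members, k))
    return sample


def summarize_by_class(points, labels):
    """Replace each class with one NEW point: the per-class mean.

    The result generally contains points that never appeared in the data.
    """
    return {
        label: tuple(mean(column) for column in zip(*members))
        for label, members in group_by_class(points, labels).items()
    }


# Hypothetical toy dataset: four points of class "a", two of class "b"
points = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0), (4.0, 5.0), (10.0, 10.0), (12.0, 14.0)]
labels = ["a", "a", "a", "a", "b", "b"]

print(stratified_sample(points, labels, fraction=0.5, seed=1))
print(summarize_by_class(points, labels))
```

The sample keeps the class proportions but depends on the seed, whereas the summary is the same on every run.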
Another important difference between the two methodologies is that data summarization tends to be deterministic, while sampling is highly stochastic. This means that you cannot use data summarization instead of sampling, at least not repeatedly as in the case of K-fold cross-validation. Otherwise, you’ll end up with the same results every time, something that doesn’t help with the validation of the models at hand! Perhaps that’s one of the reasons why data summarization is not so widely known in the data science community, where model validation is a key focus.
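The stochastic side of this can be sketched in a few lines of Python. The helper below is a hypothetical, non-stratified K-fold splitter written only for illustration: each seed deals the indices into different folds, which is exactly the variation that a deterministic summary cannot provide.

```python
import random


def k_fold_indices(n, k, seed=None):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i
    return [indices[i::k] for i in range(k)]


# Two runs with different seeds: the folds will usually differ
print(k_fold_indices(10, 3, seed=1))
print(k_fold_indices(10, 3, seed=2))
```

A stratified version would shuffle and deal the indices of each class separately, so that every fold keeps the class proportions.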
What’s more, if sampling is done properly, it can maintain the relationships among the variables at hand (obviously this would entail the use of some heuristics since random sampling alone won’t cut it). Data summarization, on the other hand, doesn't do that so well, partly because it focuses on the most important aspects of the dataset, discarding everything else. This results in skewing the variable relationships a bit, much like a PCA method changes the data completely when it is applied. So, if you care about maintaining these variable correlations, data summarization is not the way to go.
Finally, due to the nature of the data involved, data summarization could be used for data anonymization and even data generation. Sampling, however, wouldn't work so well for these sorts of tasks, even though it could be used for data generation if the sampling is free of biases (something which can also be attained if certain heuristics are applied). All this illustrates the point that although these two methods are quite different, they are also applicable in different use cases so they don’t exactly compete with each other. It’s up to the discerning data scientist to figure out when to use which, adding value to the project at hand.
Lately, I've made some progress on a data science research project I've been working on for the past couple of years. I’ve hinted at it in previous posts, though due to the nature of this work I’ve abstained from going into any details. Besides, most people are not that open to new ideas, unless they are marketed by some established company or some renowned professor.
Anyway, the other day I made a breakthrough in this work, something that can have significant implications for how we deal with private data. What’s more, I've developed a new way of summarizing a dataset (which is innately different from sampling it), with minimal loss of information. This opens new avenues of research, and the possibilities for new data science and A.I. methods are vast. Naturally, I'll need to look into this more, so any online writing I do will have to take second priority.
Parallel to that, I’ve been working on another project lately, something I plan to continue for the foreseeable future. However, an important part of it is completed, which I’ll make sure to announce in the next few days.
As a result of all this, I’m now more open to hosting other people’s articles on data science and A.I. topics, provided that they are not spammy in any way. Back-links are also acceptable, provided that they point to sites relevant to the articles. So, if you have something you’d like to contribute to the blog, now is a great opportunity to do so.
Whatever the case, I plan to continue writing on this blog albeit at a slower pace for the time being, so stay tuned!
Rhythm in learning is something that most people don't think about, mostly because they take it for granted. If you were educated in a structure-oriented country, like most countries in the West, this would be instilled in you (contrary to countries like Greece where disorder and lack of any functional structure reign supreme). However, even then you may not value it so much because it is not something you're always conscious of. The need to be aware of it and make a conscious effort comes about when you are on your own, be it as a freelancer or a learner in a free-form kind of course (i.e. not a university course or a boot camp). And just like any other real need, this one needs to be fulfilled in one way or another.
The idea of this article came about from a real situation, namely a session with one of my mentees. Although she is a very conscientious learner and a very good mentee, she was struggling with rhythm, mostly due to external circumstances in her life. Having been there myself, I advised her accordingly. The distillation of this is what follows.
So, rhythm is not something you need to strive for, as it's built into you as an innate characteristic. In other words, it's natural, like breathing, and should come about on its own. If it doesn't, it's because you've put something in its way. So, you just need to remove this obstacle and rhythm will start flowing again on its own. This action of removal may take some effort but it's a one-time thing (unless you are in a very demanding situation in your life, in which case you need to re-set your boundaries). But how does rhythm manifest in practice? It's all about being able to do something consistently, even if it's only a small amount on certain days.
In my experience with writing (a truly challenging task in the long run, particularly when there is a deadline looming over you), I make a habit of writing a bit every day, even if it's just a single paragraph or the headings and subheadings structure of a new chapter. Sometimes I don't feel like working on a book at all, in which case I take the time to annotate the corresponding Jupyter notebooks or write an article on this blog. Whatever the case, I avoid idleness like the plague since it's the killer of rhythm.
When it comes to learning data science and A.I., rhythm manifests as follows. You cultivate the habit of reading/coding/writing something related to the topic of your study plan or course curriculum. Even a little bit can go a long way since it's not that bit that makes the difference but the maintenance of your momentum. It's generally harder to pick up something that has gone rusty in your mind, particularly coding. However, if you coded a bit the previous day, it's so much easier. If you get stuck somewhere, you can always work on another drill or project. The important thing is to never give up and go idle.
Frustration is oftentimes inevitable but if you leverage it properly, it can be a powerful force, as it has elements of willpower in it, willpower that doesn't have a proper outlet and is trapped. This is what can break your rhythm, but it is also what can restore it. You always have the energy to carry on, even at a slower pace sometimes. You just need to tap into it and apply yourself. That's when having a mentor can do wonders, yet even without one, you can still manage, but with a bit more effort. It's all up to you!
It may seem strange to have an article on this topic in this blog, but since hashgraph is a promising technology that I've already talked about in the past, it may be worthwhile to make an exception.
As you may have heard, the Hedera platform is a hashgraph-based network that promises high speeds, very low cost, and a high level of security. All this is through the use of a new technology that one of its founders, Dr. Leemon Baird, created over the years. The idea is to use a clever combination of the gossip protocol along with virtual voting to ensure consensus in a network of computers, keeping track of various transactions. Up until now, this network has been used with a series of other applications but as of this year, a financial application has also become available. This takes the form of a cryptocurrency called hbar, which promises to be a worthwhile alternative to the blockchain-based cryptos.
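For readers unfamiliar with gossip protocols, the following Python sketch simulates only the dissemination idea: in each round, every node that already knows a piece of information tells one randomly chosen node. It is a toy model with made-up names, and it does not represent hashgraph's virtual voting or Hedera's actual implementation.

```python
import random


def gossip_rounds(n_nodes, seed=None):
    """Count the rounds needed for information starting at node 0 to reach all nodes.

    In each round, every informed node contacts one node chosen uniformly at random.
    """
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n_nodes:
        # Each informed node picks a peer (it may pick itself or an informed node)
        contacted = {rng.randrange(n_nodes) for _ in informed}
        informed |= contacted
        rounds += 1
    return rounds


# The informed set can at most double per round, so 64 nodes need at least 6 rounds
print(gossip_rounds(64, seed=0))
```

Because the number of informed nodes can roughly double each round in the early stages, the information spreads through the network quickly, which is part of why gossip-based designs are attractive for speed.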
Whether hbar is going to make it or not remains to be seen, since Bitcoin, Ethereum, Dash, and some other cryptos have attracted a large enough community to establish themselves, even if they are based on technologies inferior to the one behind hbar. Don't get me wrong, I think blockchain tech is amazing and may continue bringing about benefits to its users. Hashgraph, however, is superior in many ways, plus it has a legitimate company behind it, something that inspires confidence in many of its users. Some of these users are established companies such as Boeing, so it's not some hyped tech that may or may not exist a year from now.
Hbar has been trading online since last week (September 17th, to be exact), after several months of beta-testing. It is currently available on major crypto exchange sites, such as Bittrex, at a very low price (around 0.036 USD per token), even lower than the ICO price (0.12 USD). You can monitor its price on the Hedera-based site www.hbarprice.com, where you can also find additional information about the company and the various services it offers.
Just like other innovative technologies, a hashgraph-based cryptocurrency seems a bit ahead of its time. In a way, it reminds me of the Julia language, which has been better in many ways than other data science programming platforms, yet has still to receive the recognition it deserves. Whether this is due to the inertia of the tech people or the excessive promotion that its competitors receive is unknown. Whatever the case, those who make use of such technologies benefit even if the majority of people never fully accept them as worthwhile alternatives. So, I don't expect hbar to dominate the crypto market any time soon, but I'd be interested in following its course.
Throughout this blog, I've talked about all sorts of problems and how solving them can aid one's data science acumen as well as the development of the data science mindset. Problem-solving skills rank high when it comes to the soft skills aspect of our craft, something I also mentioned in my latest video on O'Reilly. However, I haven't talked much about how you can hone this ability.
Enter Brilliant, a portal for all sorts of STEM-related courses and puzzles that can help you develop problem-solving, among other things. If you have even a vague interest in Math and the positive Sciences, Brilliant can help you grow this into a passion and even a skill-set in these disciplines. The most intriguing thing about all this is that it does so in a fun and engaging way.
Naturally, most of the stuff Brilliant offers comes with a price tag (if it didn't, I would be concerned!). However, the cost of using the resources this site offers is quite reasonable and overall good value for money. The best part is that by signing up there you can also help me cover some of the expenses of this blog, as long as you use this link here: www.brilliant.org/fds (FDS stands for Foxy Data Science, by the way). Also, if you are among the first 200 people to sign up you'll get a 20% discount, so time is definitely of the essence!
Note that I normally don't promote anything on this blog unless I'm certain about its quality standard. Also, out of respect for your time I refrain from posting any ads on the site. So, whenever I post something like this affiliate link here I do so after careful consideration, aiming to find the best way to raise some revenue for the site while providing you with something useful and relevant to it. I hope that you view this initiative the same way.
Translinearity is the super-set of what’s linear, so as to include what is not linear, in a meaningful manner. In data analytics, it includes all connections among data points and variables that make sense in order to maintain robustness (i.e. avoid any kind of over-fitting). Although fairly abstract, it is in essence what has brought about most modern fields of science, including Relativistic Physics. Naturally, when modeled appropriately, it can have an equally groundbreaking effect in all kinds of data analytics processes, including all the statistical ones as well as some machine learning processes. Effectively, a framework based on translinearity can bridge the different aspects of data science processes into a unified whole where everything can be sophisticated enough to be considered A.I. related while at the same time transparent enough, much like all statistical models.
Translinearity matters because we have reached the limits of what the linear approach has to offer through Statistics, Linear Algebra, etc. Also, non-linear methods, although effective and accessible, are black boxes, something that may remain so for the foreseeable future. The translinear approach, by contrast, can unveil aspects of the data that are inaccessible with the conventional methods at our disposal, while helping cultivate a more holistic and more intuitive mindset, benefiting the data scientists as much as the projects it is applied to.
So far, I have implemented Translinearity in the Julia ecosystem. This is something I've been working on for the past decade or so. I have reason to believe that it is more than just a novelty, as I have observed various artifacts concerning some of its methods, things that were previously considered impossible. Examples include optimal binning of multi-dimensional data, a metric that can assess the similarity of data points in high-dimensional space, and a new kind of normalization method that combines the benefits of the two existing ones (min-max and mean-std normalization, aka standardization).
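As a reminder of what the two existing normalization methods do, here is a minimal Python sketch using only the standard library. It covers min-max normalization and standardization only; the new combined method mentioned above is not described in this post, so it is not shown. The function names and sample values are illustrative.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant variable: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Mean-std normalization: subtract the mean, divide by the population std."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant variable: no spread to rescale
    return [(v - mu) / sigma for v in values]


# Hypothetical variable with one outlier
values = [1.0, 2.0, 3.0, 4.0, 100.0]
print(min_max_normalize(values))  # bounded in [0, 1], but the outlier squeezes the rest near 0
print(standardize(values))        # mean 0 and unit spread, but no fixed bounds
```

Min-max guarantees a fixed range at the cost of sensitivity to extreme values, while standardization gives a comparable spread without fixed bounds, which is the trade-off a combined method would presumably try to balance.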
Translinearity is made applicable through the systematic and meticulous development of a new data analytics framework, rooted in the principles and completely void of assumptions about the data. Everything in the data is discovered based on the data itself and is fully parametrized in the corresponding functions. Also, all the functions are optimized and build on each other. A bit more than 30 in total, the main methods of this model cover all the fundamentals of data analytics and open the way to the development of predictive analytics models too.
Translinearity opens new roads in data analytics, rendering conventional approaches more or less obsolete. However, the key outcome of this new paradigm of data analytics is the possibility of a new kind of A.I. that is transparent and comprehensible, not merely comprehensive in terms of application domains. Translinearity is employed in the more advanced deep learning systems but it’s so well hidden that it escapes the user. However, if an A.I. system is built from the ground up using translinear principles, it can maintain transparency and flexibility alongside high performance.
It's interesting that even though there are a zillion ways to assess the similarity between two vectors (each representing a single-dimensional data sample), when it comes to doing the same thing with matrices (each representing a whole sample of data), the metrics available are mediocre at best. It's really strange that when it comes to clustering, for example, where this is an important part of the whole process, we often revert to crude metrics like Silhouette Width to figure out if the clusters are similar enough or not. What if there was a way to assess similarity more scientifically, beyond such amateur heuristics?
Well, fortunately, there is a way, at least as of late. Enter the Congruency concept. This is basically the idea that you can explore the similarity of two n-dimensional samples through the systematic analysis of their components, given that the latter are orthogonal. If they are not orthogonal, it shouldn't be difficult to make them orthogonal, without any loss of information. Whatever the case, it's important to avoid any strong relationships among the variables involved as this can skew the whole process of assessing similarity.
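One standard way to make a set of variables mutually orthogonal is Gram-Schmidt orthogonalization on mean-centered columns. The Python sketch below illustrates that general technique only; whether the Congruency framework uses it (or something else entirely) is not stated here, and the function names and sample data are made up.

```python
def center(column):
    """Subtract the mean so that orthogonal columns are also uncorrelated."""
    m = sum(column) / len(column)
    return [x - m for x in column]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(columns):
    """Gram-Schmidt: remove from each centered column its projection on earlier ones."""
    result = []
    for column in columns:
        v = center(column)
        for u in result:
            norm_sq = dot(u, u)
            if norm_sq > 0:  # skip degenerate (all-zero) directions
                coef = dot(v, u) / norm_sq
                v = [a - coef * b for a, b in zip(v, u)]
        result.append(v)
    return result


# Hypothetical data: y is strongly related to x, z is not
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
z = [5.0, 3.0, 8.0, 1.0, 9.0]
ox, oy, oz = orthogonalize([x, y, z])
print(dot(ox, oy), dot(ox, oz), dot(oy, oz))  # all close to zero, up to rounding
```

The transformation is reversible in principle if the column means and projection coefficients are stored, though this sketch does not keep them.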
The Congruency concept is something I came up with a few months ago but it wasn't until recently that I managed to implement it in a reliable and scalable way, using a new framework I've developed in Julia. The metric takes as inputs the two matrices and yields a float number between 0 and 1 as its output. The larger this number is, the more similar (congruent) the two matrices are. Naturally, the metric was designed to be robust regardless of the data's dimensionality, though if there are a lot of noisy variables, they are bound to distort the result. That's why it performs some preprocessing first to ensure that the variables are independent and as useful as possible.
Applications of the Congruency metric (which I named dcon) go beyond clustering, however. Namely, it can be used in assessing sampling algorithms too (usually yielding values of 0.95+ for a reliable sample) as well as synthetic data generation. Since the metric doesn't make any assumptions about the data, it can be used with all kinds of data, not just those following a particular set of distributions. Also, as it doesn't make use of all dimensions simultaneously, it is possible to avoid the curse of dimensionality altogether.
Things like Congruency may seem like ambitious heuristics and few people would trust them when the more established statistical heuristics exist as an option. However, there comes a time when a data scientist starts to question whether a statistic's / metric's age is sufficient for establishing its usefulness. After all, what is now old and established was once new and experimental, let's not forget that...
Being a data science author is not a simple matter. With the bookshelves brimming with data science books these days, one may come to think of this as something easy and accessible to everyone. Perhaps the latter is true, since nowadays anyone can publish a data science book through some publisher with very low standards, or self-publish it thanks to Amazon and other sites that are happy to make your book available to everyone. Some people stoop so low as to give away their book for free, something that says more about the quality of their book than about their generosity (of course there are exceptions to this, as many academics prefer this approach because academic publishers make their books inaccessible to most of their students through the high price tags they impose). Whatever the case, being a data science author involves more than just putting a book out there for the world to view and perhaps read.
In my experience for the past 10 years or so, authoring a book is quite different to just writing one and making it accessible to the public. Authoring a book is all about providing a certain level of quality and going through the oftentimes exhausting process of revisions and edits, once the first draft is completed. Fortunately, the first book I authored was on something I had spent 5 years working on, namely my PhD project. The book was my PhD thesis, which is much like a normal technical book, though geared towards a more limited audience.
Other books I've authored were mostly through a publisher, except for some ebooks and a novel ("I, AGI: the adventures of an advanced Artificial Intelligence"). Every time it was a challenge of sorts, though one through which I could grow as a writer. Here is a list of the things I learned that are necessary to author a book:
Beyond these, several other things are necessary for authoring a book, perhaps too many to list in a blog article. However, for anyone serious about writing, these are a good place to start. Cheers!
These days I was on vacation (this image should give you a hint!), so no post this week unfortunately... However, as of next week (or even later this week, depending on my workload), I should have something for you. In the meantime, you can check out some of my older posts. Until next time!
These days I'm working feverishly on a book project so there is no time for any new data science / A.I. related post here. If you want something else to read, feel free to check my articles on beBee, such as the latest one, available here. Parallel to all this, I'm preparing another educational project, something I'll talk more about later on. Stay tuned!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.