Nowadays there are many options out there for getting something in print. Some of them are for digital publications only, while others cover various media (including video). As someone builds a reputation, more options become available to them, and the possibility of getting published under one of the more recognizable names becomes more tangible. After all, what better way to promote one’s work than an established channel, right?
Well, things may be more complicated than they appear on the surface. The publication process involves four main stages: getting a contract signed, producing the book, promoting the book, and then collecting royalties based on the sales. A bigger publisher is bound to be adept at the third stage, since it has well-established channels for promoting its authors, but it can screw you over in the other three. A smaller publisher may require some help promoting your book, but nothing excessive (as in the case of self-publishing). Let’s look at the three stages where the big publishers fall behind, in more detail.
The first stage takes a while. While a small publisher may be fairly straightforward about it, a larger publishing house may take its sweet time to settle on an agreement about a title. You are basically expected to do all the work for them, including market research, and prove to them beyond a shadow of a doubt that the book you are going to author will be successful. When I was experimenting with a big publisher, this stage alone took about a month.
The second stage is torture. A bigger publisher has a big reputation to uphold, so it’s not going to take any risks. Every milestone of the book-writing project is meticulously monitored. Regular meetings with a member of the production team are not uncommon. Although these may be helpful in some cases, the whole thing feels more like a full-time job. Eventually, you may come to see the whole project as a drag. A small publisher involves far less bureaucracy, while the project remains a creative endeavor to a large extent. Also, changes to the outline are permissible.
The fourth stage is also a bit strange. Although I haven’t reached that stage with a big publisher, it is quite clear from the contract that the royalties are not going to amount to much (a typical commission is about 12%), while the book may be bundled with other titles for marketing purposes. Being with a smaller publisher usually guarantees a higher royalty percentage, while you may have a say in how the book is sold (you may even create events to promote and sell your book, with the publisher’s support).
Also, a larger publisher may choose to discontinue the book production process at any time, effectively breaching your contract. Although you could theoretically take legal action against them, it is usually not worth the effort, since the costs involved make the whole process pointless. Besides, would you be willing to finish a project with a company that has openly tried to stop it? Smaller publishers are more honorable in that respect and develop wholesome relationships with their authors.
Finally, with the data science field changing so rapidly, publishers may be quite cautious about what book-writing projects they undertake. So, a larger publisher is bound to go with the safer options, producing books that are on more or less mainstream topics. If you have a different topic in mind, or a different angle for a topic, then too bad. Smaller publishers are more willing to take risks in that respect.
Of course, that’s not to say that all small publishers are great. There are small publishers that are a total waste of time. However, if you do your research, you can find a small publisher that makes sense for your book project. For me that publisher was (and still is) Technics Publications. What would yours be?
So, my latest video is now available online at the Safari portal. I didn’t post about this yesterday, as I had already published an article for the blog. As I have been writing more articles than I can get published on DSP, I had to resort to this blog again. Also, I am not currently working on a book, so I have more time for writing for other channels (e.g. this blog, beBee, etc.).
Anyway, if you have a subscription to Safari, check out my video. I’m certain it will be worth your time. As always, I’m open to feedback via the “contact” page of this blog.
Being one of the first supporters of this programming language, at least for data science, it saddens me to talk to people in the tech industry and find out that they have never heard of it. It’s even worse when this ignorance comes from people involved in data science and A.I., people who should have at least tried it. With Julia becoming more and more relevant among programmers (see the latest newsletter to get an idea), it seems like a paradox that it is still relatively obscure in people’s minds, particularly those involved in data science.
One reason why this happens is that the institutions that deliver data science know-how (I wouldn't call it education yet, since they don’t cover soft skills or the whole mindset aspect of the craft) are ignorant of Julia. This makes sense in a way since their all-knowing instructors are experts in the mainstream languages of data science, namely Python, Scala, and R. So, if the people you trust to teach you about data science are rookies in Julia, they’ll probably not even mention it, just like people in computer science don’t talk much about Quantum Computing, or security experts about how conventional security systems are sitting ducks when it comes to code-breaking that could be performed by QC systems.
Another reason is that people learning data science are overwhelmed with the numerous technologies and tools of the field. As a result, they pick and choose what to learn, and they tend to gravitate towards the tools that have the most literature around them. Also, these tools tend to have been tried and tested the longest, so they are lower-risk. Since there are enough risks in getting into a new field, people tend to minimize additional ones if they can help it, so they give Julia a pass.
Moreover, Julia has gained a lot of traction in academia, since many researchers are open-minded enough to give it a try, while they are also fed up with the (oftentimes proprietary) systems they use. Matlab may be great if someone else pays for the license to use it, but it’s doubtful that you’d pay for it yourself after your studies are over, especially if you end up with a measly salary as a junior data science researcher. Because of all this, Julia may have started to appear as an academic programming language to some people, something that is good for researchers but not for people in the real world.
Of course, all these ideas about the Julia language are nothing but misconceptions. After all, it doesn’t try to replace any other language, since it interoperates well with many other programming languages, such as Python, R, C/C++, and even Java. So, if you feel like you have to choose between Julia and the language you are comfortable with, then you are probably gravely misinformed about the language. There is a reason why the company that develops and supports it is doing so well. There is also a reason why many companies are using it (though they don’t always talk about it, for obvious reasons).
So, if you still find Julia an obscure programming language for data science, you may want to divert your skepticism towards those who try to ignore it, for their own reasons. Maybe those people have formed views about it based on ignorance, rather than experience with it. Perhaps if you take the time to learn it and use it a bit, you’ll change your mind about it.
Short answer: yes. Longer answer: definitely, as long as they make a conscious effort to cultivate the necessary parts of this mindset and integrate them into a functional whole. Easier said than done, right? Perhaps. Maybe that’s why some companies ask for someone who has 15+ years of experience in the field, even if the field didn’t exist 15 years ago! What they may really be asking for is someone who knows what this field entails and knows how to make things happen, using the corresponding methodologies. So, the question that naturally arises is “how can someone get this understanding of the field without having to spend a large part of their career in it?”
There are several strategies to accomplish that, none of which are easy or something you can learn in a bootcamp. Even really good data science courses may not be sufficient for this purpose. The reason is that the mindset of a data scientist is very diverse and not something you can put into a syllabus. There is a reason why the brightest data science practitioners seek a mentor, or some kind of personal learning experience, in order to gain some mastery of the craft. Yet, as I’ve explained in the Mentoring in Data Science video, the mentor is not there to answer all your questions, even if they could answer most of them. The role of the mentor is to help you become your own mentor eventually. Of course, there are exceptional people out there who don’t require a mentor, since they know everything they need to know, or they have the resources and resourcefulness to obtain this knowledge on their own. When I meet one such person, I’ll be sure to blog about them!
Apart from being part of a mentorship, you can learn about the mindset of the data scientist by practicing science in a data analytics setting. This is quite different from taking one tool or another, applying it, and then creating some insightful visuals from the results. Practicing science also involves conducting experiments, asking deep questions, and challenging yourself and what you know. It’s realizing that all scientific theories are falsifiable and not taking anything as gospel, since you are secure in the knowledge that everything in science is in flux. The only thing that’s perhaps immune to this constant change is the mindset, the essence of the role of the data scientist. One robust way to attain this understanding is to strip away all the transient aspects of the role, one by one, through scientific research. In other words, you need to become the craft, rather than merely practice it like a technician of sorts.
In my latest book I underline several aspects of the data science craft that I’ve identified through both experience and research. They are relevant and useful for bringing about the data science mindset in someone. Of course, it is next to impossible to cover all the angles in a single book, but it is a good start. Applicable to all levels of data science practitioners, this book can at the very least make you fascinated with data science and motivate you to learn more about it, without getting consumed by the techniques or the aspects of it that are more in vogue these days (e.g. artificial intelligence). After all, just like everything else in science, data science is more of a process than anything else. It’s up to you to make it an insightful and intriguing one...
A few years back, working remotely was a sci-fi idea, something people would look forward to in that futuristic society they would one day inhabit. Over the years, due to certain technological advances, this remote possibility has turned into a tangible option for many jobs, including those related to data science.
Working remotely is also something that many companies (especially start-ups) offer as part of their package deal to new hires. After all, who doesn’t want to avoid the commute a few days a month and focus on what’s important in their work? Besides, unless you are a PM or something similar, you probably don’t need to be in the office all the time to attend meetings and other responsibilities that require your physical presence. In fact, with modern collaboration systems (e.g. Slack) and real-time communication platforms (e.g. Zoom), even meetings can take place virtually. So, what’s stopping people from embracing the remote work possibility full-time? Well, there are several factors, but they all boil down to two things: trust and efficiency. Most companies don’t trust their employees enough to grant them the remote working option for every day of the week. Also, the belief that people work better (more efficiently) when they are in physical proximity is one that’s hard to shake from people’s minds. These ideas are valid to some extent, so before blaming the companies for not allowing you to do your data science work in the comfort of your home (or local coffee shop), it’s worth taking a look at the other side of this partnership.
Most information workers today may have the technical skills they need for their work, but they may be lacking when it comes to other skills that are essential for working remotely in an efficient manner. Namely, things like self-discipline, good communication, and adaptability are not as common as you would expect. Also, not everyone is able to organize their work on their own. Still, there are plenty of people who have all these qualities (I’ve worked with quite a few myself), so they are not something unfathomable. If you think about it, every PhD student cultivates these skills during their project. Meetings with advisors/supervisors may take place in a physical location, but most of the time you are on your own, oftentimes during periods when others are resting. So, being able to work remotely basically boils down to having a strong sense of responsibility and self-leadership.
Having the option of working remotely in a data science setting makes sense for other reasons too. Typically, a data scientist liaises with a small number of people in the organization and unless they are new to the role, they know how to carry a conversation quite well and convey all the relevant information succinctly and effectively. Not every data scientist is an orator, but even the less social ones know how to communicate well, to various kinds of audiences. So, physical presence is not really a requirement for this.
Also, with most of the data science work taking place in a remote location anyway (e.g. the cloud), a data scientist can manage even from home, as long as they have a secure connection set up (e.g. a VPN). So, physical presence at the company is again more of an optional thing than a necessity.
Finally, day-to-day data science doesn’t need a lot of resources, other than access to a computer cluster (usually in the cloud) and a fairly decent computer, so being inside a company building doesn’t always make things easier. In fact, all the distractions and space limitations may make the whole matter more difficult for the employee, not to mention yield an additional cost for the company.
Perhaps sharing the same physical location with your co-workers has advantages that no VoIP system can offer. However, making physical presence a requirement, rather than just an option, is an antiquated practice that is bound to give way to more practical and preferable alternatives in the future, such as full-time remote work.
Over the years I have been a bit harsh on statistics, especially ever since I got exposed to the propaganda that stats is the way to go in data science. This idea that statistical analyses alone would be able to help us tackle big data problems didn’t (and still doesn’t) make any sense, especially if you have run experiments on various data analytics methods, statistical and otherwise, for several years. Although statistical methods have merit, in my experience they have proven less effective and less efficient than machine learning methods, especially A.I. methods like ANNs. Yet, statistics remains useful in several ways.
Data analytics involves more than just building models. Before we reach the stage where we have a dataset that we are ready to use for predicting or analyzing something, to build a data product or derive some useful insights, we need to build that dataset. To do that, we often need to get our hands dirty by running a lot of experiments with the data itself, using a variety of methods. Some of these methods derive from statistics. For example, we may need to explore the relationship between two variables (or all the pairs of variables available). This is made possible with various measures, such as correlation and covariance. Even if these tools are suboptimal, they are a good starting point and in many cases they may even suffice. Also, PCA (along with the closely related SVD) remains a very popular dimensionality reduction method that falls under the statistics umbrella.
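As a quick sketch of the kind of exploration described above (a minimal Python example using NumPy; the two variables are invented for illustration):

```python
import numpy as np

# Two hypothetical variables from a dataset (invented for this example);
# y is roughly twice x, so we expect a strong positive relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = np.cov(x, y)[0, 1]        # sample covariance between x and y
corr_xy = np.corrcoef(x, y)[0, 1]  # Pearson correlation, ranges in [-1, 1]

print(round(cov_xy, 2))   # positive: the variables move together
print(round(corr_xy, 2))  # close to 1: a near-perfect linear relationship
```

Covariance tells you the direction of the relationship, while correlation normalizes it to a scale-free value, which is why the latter is usually the first thing to inspect across all variable pairs.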
Another example where statistics comes in handy is when you need to check the validity of a hypothesis. Although there are some simulation-based methods that can do that, statistics has a variety of tests covering many combinations of variable types and distributions, enabling us to examine our hypotheses in a methodical and rigorous manner. Of course, we may still need to do some analysis beyond that, to establish the stability of our results, but there is no doubt that statistical tests can be useful as a first step.
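A minimal illustration of such a test, assuming SciPy is available (the two groups here are simulated for the example, with deliberately different true means):

```python
import numpy as np
from scipy import stats

# Simulated data (invented for the example): two groups drawn from
# normal distributions whose true means differ by 0.8
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=0.8, scale=1.0, size=200)

# Two-sample t-test: is the observed difference in means significant?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(p_value < 0.05)  # a 0.8 shift with n=200 per group is easy to detect
```

In practice you would follow this up with checks on the test's assumptions (e.g. normality, equal variances) and perhaps a robustness analysis, as noted above.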
Finally, when it comes to sampling, statistics is usually our go-to framework. This set of simple techniques for obtaining a subset of a larger dataset may seem, well, simplistic, but it’s essential. After all, even the most sophisticated machine learning models are bound to fail (overfit), if sampling isn’t done right. There is a reason why statistics became a popular data analytics framework, and it’s quite likely that sampling played an important role in this (though I’ll need to run some tests to establish an exact measure of the likelihood!).
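To make the sampling point concrete, here is a minimal Python sketch of a random train/test split; the `train_test_split` helper is written from scratch for illustration (not taken from any particular library), and the data is invented:

```python
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=0):
    """Shuffle the indices before splitting, so the held-out test set
    is a simple random sample of the whole dataset."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_ratio)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy dataset: 50 samples with 2 features each
X = np.arange(100).reshape(50, 2)
y = np.arange(50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y)
print(len(X_tr), len(X_te))  # 40 10
```

Skipping the shuffle (i.e. taking the last 20% of rows as-is) is exactly the kind of sampling mistake that leads to models that overfit or evaluate misleadingly well, especially when the data is ordered.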
So, even if A.I. and machine learning are the foxy way to go when it comes to data science, statistics has a place in the data scientist’s toolbox too. Plus, with so many people in data science focusing on the newer tools that are in vogue these days, maybe a differentiator of a competent data scientist in the future will be how well she can handle statistical concepts and carry out basic tasks in a methodical manner. Besides, if there’s one thing that statistics can teach us, it’s to be methodical and scientific in how we conduct our analyses, qualities that are timeless in the data science field and foxy in their own way.
I've talked about Spark in my books, and I’d made the prediction that it was a very promising big data platform back when it was still a novelty. With the enthusiastic support of Databricks (the main contributor to Spark’s codebase) and a variety of data developers (data scientists who specialize in coding and in building data science tools), Spark became a force to be reckoned with. In fact, even people who were using Hadoop as their go-to big data framework started to shift to Spark, while their relationship with Hadoop was reduced to merely using its file system. Yet, even with all these developments, there was something lacking in the Spark framework: integration with Julia, a rising star in the programming world, and in data science lately.
I was one of the first data scientists to openly support Julia (while other data scientists were choosing to play it safe and stick with what they knew). Even though I had acquired sufficient expertise in both R and Python to not need yet another programming language in my toolbox, I learned this new language with whatever information I could dig up from the web (there were no good books on the language at the time). Julia’s advantages over other programming languages were too many and too powerful to ignore, so I learned it and started using it for my data science projects, in parallel with Python. In fact, one of the first Julia scripts I wrote aimed to help me organize my various Python scripts, so that I could easily pinpoint a particular function among a bunch of .py files in a folder on my work computer. Yet, Julia was not widely adopted by the data science community, and it’s quite likely that its lack of integration with Spark was one of the main causes of this.
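The original script was in Julia, but the idea behind it can be sketched in Python along these lines (the `find_function` helper and its exact behavior are my reconstruction for illustration, not the actual script):

```python
import os
import re

def find_function(name, folder="."):
    """Scan all .py files under `folder` and report where a function
    with the given name is defined, as (file path, line number) pairs."""
    pattern = re.compile(r"^\s*def\s+" + re.escape(name) + r"\s*\(")
    hits = []
    for root, _dirs, files in os.walk(folder):
        for fname in files:
            if fname.endswith(".py"):
                path = os.path.join(root, fname)
                with open(path, encoding="utf-8") as f:
                    for lineno, line in enumerate(f, start=1):
                        if pattern.match(line):
                            hits.append((path, lineno))
    return hits
```

Calling, say, `find_function("clean_data", "~/scripts")` would then list every .py file (and line) where a `def clean_data(...)` appears, which is all the organizing the original script needed to do.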
Lately, things have started to shift. A couple of people created a couple of Spark packages for Julia, the most active of which is Spark.jl. Although still in its development stage, the whole project seems promising, because it shows that it is possible to link Spark and Julia, even if the connection has to take place through Java (which is one of the languages Julia can bridge to, something unfathomable for other “promising” new languages, like Go).
Naturally, Julia is a self-sufficient tool, so it does not need Spark or any similar framework to be useful for data science. However, with many people following one big data ecosystem or another with great zeal, it is unlikely that Julia will get the place it deserves in data science without being made appealing to these people. Admittedly, learning Julia is a walk in the park for most data scientists (it has elements of both Python and R, while Scala is very similar to Julia in terms of logic). Yet, knowing how to write a Julia script and actually using such a script in production, as part of a codebase built around Spark, is a somewhat different matter! Also, even though Spark doesn’t need Julia either, it has a lot to gain by bringing in all the Julian data scientists, since there is no doubt that Julia is easier to use than Scala when it comes to data science applications.
Will Julia and Spark form a synergy that will transform data science for years to come? Possibly. How? Well, letting go of certain attachments to a particular programming paradigm would definitely help. After all, if you know how to program well, learning to use another programming language, even one from a different paradigm, is not as much of a challenge as some people think. Besides, isn’t pushing the envelope in tech a big part of data science anyway?
Recently, the tech news bubble featured a very interesting phenomenon that caught Facebook’s A.I. experts by surprise. Namely, two AIs developed by the company’s AI team were found to communicate with each other in unexpected ways during the testing phase of the project. Although they were using what would qualify as “bad English,” the people monitoring these communications were unable to understand what information was being conveyed, and they argued that these AIs may have invented their own language!
Although this claim would require a considerable amount of data before it is proven or disproven, the possibility of these young AIs having overstepped their boundaries is quite real. Although this is perhaps not something to lose sleep over, since it’s hardly a catastrophic event, it may still be a good proxy for a situation that no one would like to experience: that of AIs getting out of control. Because, if they can communicate independently now, who knows what they could do in the future, if they are treated with the same recklessness that FB has demonstrated? The outcomes of this may not be as obvious as those portrayed in sci-fi films. On the contrary, they are bound to be very subtle, so much so that they would be very hard to detect, at least at first. Some would classify them as system bugs, but these would not be the kind of bugs that cause a system failure and make some coders want to smash their computer screens. These bugs would linger in the code until they manifest in some unforeseen (and possibly unforeseeable) imbalance. Best case scenario, they could cause the users of the popular social medium to get frustrated or even complain about the new system. I don’t want to think about what the worst case scenario would be...
Of course, the A.I. fanboys are bound to dismiss this matter as a fluke or an unfortunate novelty that people turn into an issue when it isn’t one. They would argue that small hiccups like this are inevitable and that we should just power through. Although there is something admirable about the optimism of these people, the fact is that this is a highly complex matter that technology experts like Elon Musk and Bill Gates have repeatedly warned us about. This is not like a problem with a car that may cause a single road accident if left unattended. This is the equivalent of a problem with the control tower of an airport that could cause lots of planes to crash all over the place. Fortunately, there are contingencies that prevent such catastrophes when it comes to airports, but can we say the same about the A.I. field?
There are different ways to respond to this kind of situation, and I’m not saying that we should panic or start fearing A.I. That wouldn't help much, if at all. A more prudent response would be to see this as a learning experience and an opportunity to implement fail-safes that keep this sort of A.I. behavior under wraps. After all, I doubt that the absence of helpful AIs on FB or any other social medium is going to drive people away from social networks, while the presence of an unpredictable AI probably would...
Geometry is probably one of the most undervalued areas of mathematics. So much so that people consider it relevant mainly to those pursuing that particular discipline, as in their minds geometry is divorced from other, more practical fields, such as data analytics. However, geometry has always been an applied discipline, intertwined with engineering. As data science, and data analytics in general, are closely linked to engineering, at least in certain principles, it makes sense to at least consider the relationship between geometry and data analytics.
Geometry involves the study and use of visual mathematical concepts, such as the line, the circle, and other curves, to solve various problems or prove relationships that may be used to solve other, more complex problems. The latter are referred to as theorems and form the core of the scientific literature of geometry. So, unlike other, more theoretical parts of mathematics, geometry is practical at its core, since it endeavors to solve real-world problems. Although the latter have become increasingly sophisticated since geometry’s glory days (antiquity), many problems today still rely on geometry for their solution (e.g. the field of optics, the calculation of rocket trajectories, and more). Besides, since the time of Descartes, the famous philosopher-mathematician, geometry has become more quantifiable, particularly with his invention of analytic geometry.
Data analytics is in essence a field of applied mathematics, with an emphasis on numeric data, the kind that features heavily in geometry. Although direct connections between the shapes and proportions of geometry and the concepts of data analytics are few and far between, the mindset is very similar. After all, both disciplines require the practitioner to find some unknown quantity using known data, in a methodical and logical manner. In geometry, these correspond to a particular point, shape, or mathematical relationship. In data analytics, these are variables that take the form of features (through refinement, selection, and processing in general) and target variables. Of course, data analytics (especially data science) has a variety of tools available to facilitate all this, while in geometry it’s just the practitioner’s imagination, a pencil, some paper, and a couple of utensils. However, the mental discipline behind both fields is of the same caliber, while creativity plays an important role in both.
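One small, concrete overlap between the two fields (a minimal Python sketch, with points invented for the example): geometric notions like distance and angle underpin everyday data analytics measures such as Euclidean distance and cosine similarity, which in turn drive methods like k-nearest neighbors and clustering.

```python
import math

def euclidean(a, b):
    """Straight-line (geometric) distance between two points / feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = perpendicular."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

p, q = [0.0, 0.0], [3.0, 4.0]
print(euclidean(p, q))                    # 5.0 (the classic 3-4-5 right triangle)
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (perpendicular vectors)
```

In other words, when a data scientist measures how "close" two data points are, she is doing geometry in a feature space, whether she thinks of it that way or not.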
I’m not saying that geometry alone will make someone a good data analytics professional, or that you should give up your data science courses to take up geometry. However, if you have the time and can see something elegant in geometry problems, then it can be a very useful pastime, much more useful than other, strictly analytical endeavors. After all, imagination hasn't gone out of fashion, at least not in the applied sciences, so anything that can foster this faculty, while at the same time encouraging mental discipline, is bound to be helpful. As a bonus, spending time with geometry is bound to help your visualization skills and enable you to view certain data analytics problems from a different angle (no pun intended). Besides, the same mindset that helped people build pyramids and accomplish several other architectural feats is what forged many modern machine learning algorithms, for example: turning some abstract idea or question into something concrete and measurable, be it a design or a process. Isn't that one of the key attributes of a data analytics project?
Why the Role of A.I. in the Job Market Is Very Much a Business Decision Technical Professionals Can Contribute to
Lately there has been a lot of talk about AIs potentially taking people’s jobs in the future and how this is either catastrophic, or some kind of utopia (or, less often, some other stance in between). Although we as data science and A.I. professionals have little to do with the high-level decisions that influence this future, perhaps we are not so detached from the reality of the situation. I’m not talking about the A.I. choir that is happy to recite its fantasies about an A.I.-based future akin to the sci-fi films that monetize this idea. I’m talking about grounded professionals who have some experience in the development of A.I. systems, be it for data science or other fields of application.
The problem with business decisions is that they are by nature related to quite complex problems. As such, it is practically impossible to solve them in a clear-cut manner that doesn't invite reactions, or at least some debate. That’s why those individuals who have the courage to make these decisions are paid so handsomely. It’s not the time they put in, but the responsibility they undertake, that makes their role valuable. However, it is important to make these decisions as future-proof as possible, something that these individuals may not be able to do on their own. That’s why they have advisors and consultants, after all. Besides, even if some of the decision-makers are technical and can understand A.I. matters, they may lack the granularity of comprehension that an A.I. professional has.
People who make business decisions often see A.I. as a valuable resource that can help their organization in many ways (particularly cutting down on costs, via automation or increased efficiency in time-consuming or expensive processes). However, they may not always see the implications of these moves and the shortcomings of this still-maturing technology. A.I. systems are neither objective nor immune to errors. After all, most of them are black boxes, so whatever processes produce their outputs are usually beyond our reach, and oftentimes beyond our comprehension. Just as it is impossible to be sure which processes drive our decisions based on our brain patterns, it is perhaps equally challenging to pinpoint how exactly the decisions of an A.I. are forged. That’s something that is probably not properly communicated to the decision-makers on A.I. matters, along with the fact that AIs cannot take responsibility for these decisions, no matter how sophisticated these marvels of computing are.
Perhaps some more education and investigation into the nature of A.I. and its limitations is essential for everyone who has a say in this matter. It would be irresponsible to expect one set of people to navigate through this on their own and then blame them if their decisions are not good enough or able to withstand the test of time. This is a matter that concerns us all and as such we all need to think about it and find ways to contribute to the corresponding decisions. A.I. can be a great technology and integrate well in the job market, if we approach it responsibly and with views based on facts rather than wishful thinking.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.