There is a huge amount of hype around data science right now, on the back of the previous hype around big data, and this has led to a number of misconceptions, myths and untruths, which we discuss here:
Myth No 1: Data Science is Now Largely Automated
Given the rapid advances in, and availability of, data science frameworks and data warehousing solutions, and because these tools are becoming so much simpler to use, there is a perception that data science is now easy and largely automated: just take all your data, push it to the cloud, run some analysis, and miraculously insights and solutions pop out.
The reality is that data science requires skilled human oversight at multiple stages: defining the problem at the outset, critically assessing data sources and data quality, preparing data, selecting and tuning algorithms, interpreting the analysis, and prioritising which results are most valuable and actionable. To succeed, good data science requires a sound appreciation of the properties and quality of the data, domain expertise relevant to the problem, a solid understanding of the pros and cons of the algorithms used, and critical analytical skills in interpreting results.
If the problem is framed badly, the data is poorly selected or prepared, or poor choices are made in algorithm selection, then the results may look reasonable yet be invalid. Data science often involves trial and error: data may need to be re-prepared several times, and different algorithms may need to be tried. It is a time-consuming process that requires skilled practitioners. This is a fast-moving field, there remains a high proportion of 'art' in the science, and perhaps the biggest challenge today is finding and retaining suitably skilled people.
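To make that trial-and-error loop concrete, here is a minimal sketch (using scikit-learn on a synthetic dataset, both chosen purely for illustration) of cross-validating a couple of candidate algorithms before committing to one; in a real project the data preparation feeding into this step would be iterated on as well.

```python
# A minimal sketch of comparing candidate algorithms by cross-validation.
# The dataset is synthetic and the two models are arbitrary examples.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```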
Myth No 2: You need a Lot of Data
Given that data science is so often associated with big data, there is a presumption that you need lots of data to benefit from it. It's important to remember that data science is not just about machine learning, but also about well-tried statistical techniques (such as clustering and regression). Whilst prediction and pattern-identification tasks using machine learning often do benefit from large amounts of data, many data science initiatives do not require petabytes of data to provide useful results. The important feature of data is its quality, not always its size. Consequently, you do not necessarily need huge compute clusters to perform useful analysis either.
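As an illustration, the sketch below (synthetic data, scikit-learn assumed) fits a regression and a clustering on just a few dozen rows; nothing about it requires big-data infrastructure.

```python
# A minimal sketch showing classical techniques on a small dataset.
# The data is synthetic and purely illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Regression on ~50 observations
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=50)
reg = LinearRegression().fit(X, y)
print("estimated slope:", reg.coef_[0])

# Clustering the same small sample into two groups
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```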
Myth No 3: You Can Use Just Any Old Data
There are a lot of misconceptions about big data, and in particular unstructured data, the main one being that data no longer has to have a schema, and that it really does not matter what the data is: data science and machine learning will discover remarkable insights anyway.
By 'schema' we mean some meaningful mapping of data elements to a descriptive label (for example 'Customer Name'). In a structured representation (such as a SQL database) this mapping would typically be defined by the attributes in a column header. SQL tables also employ techniques to ensure data quality and consistency (for example primary keys, and constraints that elements are not 'null').
The majority of data science projects today rely on unstructured data, increasingly streamed in real time. Even though the data is unstructured, you will still need to impose some kind of schema on it to make it useful to most data science algorithms. The main difference is that the schema may need to be more dynamic: it can vary between feeds, it may need to be more tolerant of missing elements, and it is essentially applied late (think of it as a form of 'late binding').
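The sketch below illustrates this 'schema-on-read' idea using pandas; the field names and records are hypothetical, and the point is simply that the schema is applied at read time and tolerates missing elements rather than being enforced up front as in a SQL table.

```python
# A minimal sketch of applying a late-bound schema to semi-structured records.
# The records and field names are hypothetical examples.
import pandas as pd

raw_events = [
    {"customer_name": "Alice", "amount": 12.50, "country": "UK"},
    {"customer_name": "Bob", "amount": 7.00},        # no country supplied
    {"customer_name": "Carol", "country": "DE"},     # no amount supplied
]

# The schema we impose at read time: expected fields and their types.
schema = {"customer_name": "string", "amount": "float64", "country": "string"}

df = pd.DataFrame(raw_events).reindex(columns=schema.keys()).astype(schema)
print(df.dtypes)
print(df)  # missing elements surface as NaN / <NA> rather than breaking the load
```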
Myth No 4: You Need to Use Deep Learning
Deep learning has rightfully received a huge amount of positive press, mainly due to the impressive advances in image and text recognition in recent years. It is not applicable to every scenario, however. Again, it's important to understand the breadth of data science and the importance of well-tried statistical techniques, as well as machine learning and deep learning. These are all very different tools for different purposes.
Deep learning typically requires a lot of data and significant compute resources, and the network architecture and associated training techniques are often carefully designed and optimised around the specific use case. Whilst deep learning has proved spectacularly successful at solving certain problems, especially in computer vision, it is neither suitable nor efficient for every situation.
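A pragmatic consequence is to establish a simple baseline first. The sketch below (a logistic regression on scikit-learn's small built-in digits dataset, used purely as an illustration) shows the kind of baseline worth measuring before reaching for a deep network; if it already meets the accuracy requirement, the extra data, compute and tuning effort may not be justified.

```python
# A minimal sketch of a simple baseline model to try before deep learning.
# The digits dataset and logistic regression are illustrative choices.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
```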
Myth No 5: You Will See Immediate Benefits with a Quick ROI
The overhyped claims for big data and machine learning, and the availability of excellent cloud toolsets, may give the impression that an organisation is going to reap significant benefits from data science in a short space of time. In some cases this may be true, but for the vast majority of use cases data science should be considered an important, strategic part of the business that will take time to embed and to demonstrate useful insights.
Any data science initiative is going to require investment and time, and in terms of payback we need to consider a number of factors: the specific use case and the scale of the problem, investment in infrastructure, the cost of setting up a data science team, the cost of integrating and preparing data, and whether any insights gained are actionable. The place to start is formulating the question(s) that you wish to answer, and much of data science then involves exploration, manipulation and preparation of the data, which can be extremely time-consuming.
Unless you are going to outsource the project, you need to hire skilled data science practitioners. You will need to set up a data warehouse, computing resources, and tools for integrating key data feeds. As we highlighted in Myth No 1, data science projects are typically highly iterative and in practice involve much trial and error. A lot of time may be spent identifying, preparing and integrating data sources, and the selection and tuning of appropriate algorithms can also be time-consuming. Consequently there is often a lot of work to do before we can even think about visualising results and drawing conclusions. And once we have those results, it may not always be feasible to act on the insights gained (for example, analysis might reveal that a significant restructure of a logistics company's supply chain could offer major savings, but there may be insufficient budget to carry out those changes).
Myth No 6: It's OK to Publish if the Data Has Been Anonymised
This very much depends on the use case, the data, the regulatory concerns, and what we mean by 'anonymised', but as a general rule you should not assume that removing obvious artefacts such as user names and personal details is good enough to de-identify the data. If the data is based on human subjects (even if de-identified) then you should gain informed consent, and seek all necessary approvals.
Even if the data has been anonymised, it may be feasible to correlate a number of attributes and re-identify individuals or groups of individuals through careful querying, for example by using geolocation information, or by associating facts around medical conditions. It is not mathematically possible to guarantee anonymity, especially where naive anonymisation has been performed. Increasingly, techniques such as differential privacy and homomorphic encryption are required to provide more sophisticated de-identification. In an increasingly regulated world, data privacy laws such as the European General Data Protection Regulation (GDPR) mean that we need to be much more vigilant about the storage and movement of data, and about allowing third parties to run analytics on that data.
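As a flavour of what differential privacy involves, here is a minimal sketch of the Laplace mechanism applied to a simple count query; the records and the privacy budget (epsilon) are illustrative only.

```python
# A minimal sketch of the Laplace mechanism from differential privacy.
# A count query has sensitivity 1, so adding Laplace noise with scale
# 1/epsilon yields an epsilon-differentially-private answer.
import numpy as np

rng = np.random.default_rng(0)

ages = np.array([34, 45, 29, 62, 51, 38, 47, 55])   # hypothetical records
true_count = int((ages > 40).sum())                  # how many people are over 40?

epsilon = 0.5                                        # privacy budget (smaller = more private)
sensitivity = 1.0                                    # a count changes by at most 1 per individual
noisy_count = true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

print("true count:", true_count)
print("privacy-preserving count:", round(noisy_count, 2))
```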
Summary
Data science is revolutionising how organisations think about strategy, operational efficiency, and identifying new opportunities. It provides insights in areas that may even be counter-intuitive. This becomes increasingly important when we consider how fast technology and business paradigms are changing. Organisations need to be more agile, and data science is becoming a critical tool for many businesses as a way to optimise costs and remain competitive. Going forward, it is likely that the commoditisation and democratisation of data science and machine learning frameworks will make them easier to consume and embed within business processes. Continued advances in the field will lead to better algorithms, improved automation in algorithm selection and tuning, and more autonomous learning techniques. However, in the wrong hands, there has never been a better time to do bad data science.
Further Reading
For further reading see 'Data Science' by John Kelleher and Brendan Tierney, and 'The Data Science Design Manual' by Steven Skiena.