6 Myths About DATA SCIENCE Dispelled
![](https://www.thetechbizz.com/wp-content/uploads/2022/02/artificial-intelligence-g42c9fc49f_640.jpg)
Data Science is the latest buzzword in the industry. Everyone wants to be known as a Data Scientist (or at least hire one), but what does it really mean? The reality is that Data Science encompasses several disciplines, and knowing which one you are learning or teaching can help determine whether you need a Master’s degree, an undergraduate degree or just some books on machine learning.
This article goes through six claims I’ve seen people post about data science and tries to dispel the myths around them.
1) Machine Learning
Myth – You don’t need to know statistics or linear algebra for machine learning. Just use sklearn! Yay!! Can’t believe how simple this was…No more maths required! Hire me!
Reality – Sklearn is a great library, and it ships ready-made implementations of most classical algorithms, including k-Nearest Neighbors and Naive Bayes. But calling fit() is the easy part: choosing between algorithms, tuning them and diagnosing why they fail all require understanding how they work. Anything outside the library’s scope, such as deep learning, has to come from another library or be implemented by yourself. Moreover, linear algebra (eigenfaces anyone?) is used heavily in many data science pipelines, whether it’s recommender systems or text classification. It may look like magic when computers do these calculations, but it’s because they’re using linear algebra under the hood. Please don’t make statements that you don’t understand!
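To make the "linear algebra under the hood" point concrete, here is a minimal sketch of principal component analysis (the technique behind eigenfaces) written directly with NumPy's singular value decomposition. The data is random and purely illustrative, and the variable names are my own; this is not a copy of how any particular library implements it.

```python
import numpy as np

# Tiny synthetic dataset: 6 samples, 3 features (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))

# Step 1: center each feature, as PCA requires.
X_centered = X - X.mean(axis=0)

# Step 2: the singular value decomposition does the heavy lifting.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions;
# S**2 / (n - 1) is the variance captured along each of them.
components = Vt
explained_variance = S**2 / (X.shape[0] - 1)

# Step 3: project onto the top two directions to go from 3 features to 2.
X_reduced = X_centered @ components[:2].T
```

Once you have seen the three steps, a library PCA call stops looking like magic: it is centering, a matrix factorization and a projection.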
2) Hadoop/Spark/Big Data technologies
Myth – You can use Hive/Pig/MapReduce on Hadoop for everything!
Reality – These tools are definitely amazing, but it’s important to know when you should use them. If your data is small enough to fit in the memory of a single machine and can be processed there in a reasonable time, then it may be faster not to use these technologies, since distributing the work adds its own overhead. An example might be feature extraction where the main bottleneck is computation time, rather than disk I/O or network latency. Having said that, if your dataset is larger than what fits in memory (or even on most hard drives), then yes, please go ahead and experiment with these tools!
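A back-of-the-envelope check is often enough to decide. The sketch below assumes a dense table of 8-byte floats; the function names and the 50% headroom figure are my own arbitrary choices for illustration, not a rule from any tool.

```python
def estimated_size_gb(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Rough size of a dense numeric table, assuming 8-byte floats by default."""
    return n_rows * n_cols * bytes_per_value / 1e9


def fits_on_one_machine(n_rows: int, n_cols: int, ram_gb: float,
                        headroom: float = 0.5) -> bool:
    """True if the table uses at most `headroom` of the available RAM.

    The 0.5 default is an arbitrary safety margin that leaves room for
    copies made during processing.
    """
    return estimated_size_gb(n_rows, n_cols) <= ram_gb * headroom


# 10 million rows x 20 columns of floats is about 1.6 GB.
small = fits_on_one_machine(10_000_000, 20, ram_gb=16)

# 5 billion rows x 20 columns is about 800 GB.
large = fits_on_one_machine(5_000_000_000, 20, ram_gb=16)
```

With 16 GB of RAM the first table fits comfortably on a laptop, while the second is the kind of dataset where a cluster starts to make sense.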
3) Machine Learning vs Data Science
Myth – Machine learning and data science are the same thing. You don’t need statistics to be a data scientist.
Reality – I agree that you can train a machine learning model without much statistics, but why limit yourself? Statistics is a core part of data science, and knowing how to apply statistical tests as well as choose the right model will make you an even more attractive candidate. Machine learning may let you get by with little knowledge of statistics, but data science does not if you want to call yourself a true professional.
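As an example of the kind of statistical test that comes up in practice, suppose two models were scored on the same cross-validation folds and you want to know whether the gap between them could be luck. The sketch below runs an exact paired sign-flip (permutation) test using only the standard library. The accuracy values are hypothetical, and because cross-validation folds are not fully independent, treat the result as a rough guide rather than a rigorous verdict.

```python
from itertools import product
from statistics import mean

# Hypothetical per-fold accuracies for two models on the same 8 folds.
model_a = [0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80]
model_b = [0.79, 0.78, 0.80, 0.79, 0.80, 0.77, 0.81, 0.78]

diffs = [a - b for a, b in zip(model_a, model_b)]
observed = abs(mean(diffs))

# Under the null hypothesis that the models are equivalent, each
# difference is equally likely to be positive or negative. Enumerate
# all 2**8 sign patterns and count how often the mean difference is
# at least as extreme as the one observed.
extreme = 0
total = 0
for signs in product((1, -1), repeat=len(diffs)):
    total += 1
    flipped_mean = mean(s * d for s, d in zip(signs, diffs))
    if abs(flipped_mean) >= observed - 1e-12:
        extreme += 1

p_value = extreme / total
```

Here model A wins on every fold, so only the all-positive and all-negative sign patterns are as extreme as the observed result, giving a p-value of 2/256, or about 0.008.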
4) Python vs R
Myth – Learning Python is enough for doing Data Science and companies don’t care about R anymore.
Reality – As with all languages, it depends on what you’re trying to accomplish. Nowadays bridge libraries like rpy2 (calling R from Python) and reticulate (calling Python from R) allow Python and R users to work together, and R has its own advantages, like easy package development and integration with existing C++ code. Knowing other languages is definitely an advantage, but it doesn’t replace the need to learn R if your goal is Data Science/Machine Learning.
5) 80/20 Rule (or Pareto Principle)
Myth – You don’t need advanced machine learning techniques, 80% accuracy is good enough.
Reality – Although you might be able to get away with this rule on a small dataset, settling for the first model that reaches 80% on a larger dataset leaves better results on the table, and you may not even understand what’s going on under the hood. The 80/20 principle can be used as a heuristic at times, but I’d highly recommend understanding the algorithms available in sklearn and their relative merits and drawbacks.
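One reason a headline accuracy figure can mislead is class imbalance. The sketch below uses made-up labels (80 negatives, 20 positives) and a "model" that always predicts the majority class. It is an illustration of the pitfall, not a claim about any real dataset.

```python
# Hypothetical labels: 80 negatives (0) and 20 positives (1).
y_true = [0] * 80 + [1] * 20

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: fraction of actual positives the model managed to find.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# prints "accuracy=0.80 recall=0.00"
```

The do-nothing model hits 80% accuracy while missing every positive case, which is why "good enough" depends on which metric matters for the problem.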
6) Algorithm is everything
Myth – I don’t need to worry about feature engineering, the model will figure it out for me!
Reality – Before you can train a machine learning model, you need to prepare your data (feature engineering). Linear regression has no idea what an interaction term is unless you add one as a feature yourself. If you feed in some non-linear data with many interactions, chances are your results won’t make sense. Feature engineering requires domain knowledge, which only comes through experience, but at least knowing what types of questions can be asked will help guide your intuition when designing features/algorithms. As always, there’s more than one way to skin a cat, so creativity is encouraged!
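The interaction-term point is easy to demonstrate. In the sketch below the target is, by construction, a pure product of two inputs; the data is synthetic and the helper `r_squared` is my own illustrative function built on NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(-1, 1, size=200)
x2 = rng.uniform(-1, 1, size=200)
y = 3.0 * x1 * x2  # the target depends only on the interaction


def r_squared(features: np.ndarray, target: np.ndarray) -> float:
    """Fit ordinary least squares with an intercept and return R^2."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    return 1.0 - residuals.var() / target.var()


# Raw features only: the linear model cannot represent x1 * x2.
r2_raw = r_squared(np.column_stack([x1, x2]), y)

# Same model, plus one hand-engineered interaction feature.
r2_engineered = r_squared(np.column_stack([x1, x2, x1 * x2]), y)
```

With only the raw features the fit explains almost none of the variance, while adding the single engineered column lets the same linear model fit the target essentially perfectly.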
Conclusion:
As with any software engineering project, there are definitely some optimal ways of doing things. It’s important to know both the theory and how it works under the hood, as well as when you can get away without putting too much effort into understanding every little detail.