Data Science Comparison

Data Science

Data science is a relatively new discipline.

Job titles include data scientist, chief scientist, senior analyst, and director of analytics. Data science covers all industries and fields, but especially digital analytics, search technology, marketing, fraud detection, astronomy, energy, healthcare, social networks, finance, forensics, security, mobile, telecommunications, and weather forecasting. Unlike most other analytic professions, data scientists are assumed to have great business acumen and domain expertise.

The diagram below shows the overlapping relationships among Data Science (Data Mining, Machine Learning, Big Data) and the traditional analytic disciplines (Statistics, Business Intelligence, Operations Research, etc.).


Diagram 1: The degree, functionality and direction of overlap among different analytic disciplines

Data Science overlaps with

*Computer science: computational complexity, Internet topology and graph theory, distributed architectures such as Hadoop, data plumbing (optimization of data flows and in-memory analytics), data compression, computer programming (Python, Perl, R), and processing of sensor and streaming data (for example, to design self-driving cars).

*Big Data: data science fully encompasses this domain, which covers any collection of data sets so large or complex that it becomes difficult to process using traditional data processing applications.

*Statistics: design of experiments including multivariate testing, cross-validation, stochastic processes, sampling, and model-free confidence intervals, but not p-values or obscure tests of hypotheses that are subject to the curse of big data.

*Machine Learning and Data Mining: data science indeed fully encompasses these two domains.

*Pattern Recognition: data science includes the recognition of patterns and regularities in data, although pattern recognition is in some cases considered nearly synonymous with machine learning.

*Operations Research: data science encompasses most of operations research, as well as any technique aimed at optimizing decisions based on analyzing data.

*Business Intelligence: every BI aspect of designing, creating or identifying great metrics and KPIs, creating database schemas, designing dashboards and visuals, and building data-driven strategies to optimize decisions and ROI is data science.
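Cross-validation, listed under Statistics above, is easy to describe but benefits from a concrete example. The following is a minimal sketch using only the Python standard library. The function names (`k_fold_splits`, `cross_val_mse`) and the toy "predict the training mean" model are assumptions made for illustration; they do not come from any particular library.

```python
import random


def k_fold_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) pairs that partition range(n) into k folds."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i, test in enumerate(folds):
        # Training set: every index that is not in the held-out fold.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def cross_val_mse(ys, k=5, seed=0):
    """Cross-validated mean squared error of a toy model that predicts the training mean."""
    squared_errors = []
    for train, test in k_fold_splits(len(ys), k, seed):
        prediction = sum(ys[j] for j in train) / len(train)  # "fit" on the training folds
        squared_errors.extend((ys[j] - prediction) ** 2 for j in test)  # score on held-out fold
    return sum(squared_errors) / len(squared_errors)


if __name__ == "__main__":
    observations = [2.1, 3.9, 6.2, 8.1, 9.8, 4.4, 5.0, 7.3, 1.2, 6.6]
    print("cross-validated MSE:", cross_val_mse(observations, k=5))
```

Each observation is held out exactly once, so the error estimate never scores a prediction on data that was used to produce it. A real model would replace the training-mean line with its own fit and predict steps.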

Comparison with Traditional Analytic Disciplines

*Machine learning: Machine learning is the common term for supervised learning methods and originates from artificial intelligence. It is a very popular, data-intensive computer science discipline, part of data science and closely related to data mining. Machine learning is about designing algorithms (like data mining), but the emphasis is on prototyping algorithms for production mode, and on designing automated systems (bidding algorithms, ad targeting algorithms) that automatically update themselves, constantly train, retrain, update training sets and cross-validate, and refine or discover new rules (fraud detection) on a daily basis. Python is now a popular language for ML development. Core algorithms include clustering and supervised classification, rule systems, and scoring techniques. A sub-domain close to artificial intelligence is deep learning.

*Pattern Recognition: It is a branch of machine learning that focuses on the recognition of patterns and regularities in data, although it is in some cases considered nearly synonymous with machine learning. It has its origins in engineering, and the term is popular in the context of computer vision. Pattern recognition algorithms generally aim to provide a reasonable answer for all possible inputs and to perform "most likely" matching of the inputs, taking into account their statistical variation. Generally speaking, pattern recognition may put more effort into formalizing, explaining and visualizing the pattern, while machine learning traditionally focuses on maximizing recognition rates.

*Big Data: It consists of the structured, unstructured or semi-structured data that large enterprises produce. At center stage of data science is the explosion of new data generated from smart devices, the web, mobile and social media. Data science, for its part, requires a versatile skill set: someone who can work at every stage of an analysis and tackle problems that involve data manipulation, advanced statistical analysis (particularly those that require custom computational or algorithmic techniques), and interpretive and expository skills.

*Data mining: This discipline is about designing algorithms to extract insights from rather large and potentially unstructured data (text mining), sometimes called nugget discovery, for instance unearthing a massive botnet after looking at 50 million rows of data. Techniques include pattern recognition, feature selection, clustering and supervised classification, and encompass a few statistical techniques (though without the p-values or confidence intervals attached to most statistical methods). Instead, the emphasis is on robust, data-driven, scalable techniques, without much interest in discovering causes or in interpretability. Data mining thus has some intersection with statistics, and it is a subset of data science. Data mining is applied computer engineering rather than a mathematical science. Data miners use open source tools and software such as RapidMiner.

*Statistics: Currently, statistics is mostly about surveys (typically performed with SPSS software), theoretical academic research, analytics for financial institutions and insurers (marketing mix optimization, cross-selling, fraud detection, usually with SAS and R), statistical programming, social sciences, global warming research (and space weather modeling), economic research, clinical trials (pharmaceutical industry), medical statistics, epidemiology, biostatistics, and government statistics. Because of the big influence of the conservative, risk-averse pharmaceutical industry, statistics has become a narrow field that is not adapting to new data and not innovating. It is losing ground to data science, industrial statistics, operations research, data mining and machine learning, where the same clustering, cross-validation and statistical training techniques are used, albeit in a more automated way and on bigger data. Many professionals who were called statisticians 10 years ago have seen their job title changed to data scientist or analyst in the last few years. Modern sub-domains include statistical computing, statistical learning (closer to machine learning), computational statistics (closer to data science), data-driven (model-free) inference, sport statistics, and Bayesian statistics (MCMC, Bayesian networks and hierarchical Bayesian models being popular, modern techniques). Other new techniques and applications include SVM, structural equation modeling, predicting election results, and ensemble models.

*Industrial statistics: Statistics frequently performed by non-statisticians (engineers with good statistical training) working on engineering projects such as yield optimization or load balancing (system analysts). They use very applied statistics, and their framework is closer to Six Sigma, quality control and operations research than to traditional statistics. It is also found in the oil and manufacturing industries. Techniques used include time series, ANOVA, experimental design, survival analysis, signal processing (filtering, noise removal, deconvolution), spatial models, simulation, Markov chains, and risk and reliability models.

*Mathematical optimization: Solves business optimization problems with techniques such as the simplex algorithm, Fourier transforms (signal processing), differential equations, and software such as Matlab. These applied mathematicians are found in big companies such as IBM, in research labs, at the NSA (cryptography) and in the finance industry. These professionals sometimes solve the exact same problems as statisticians do, using the exact same techniques, though under different names. Mathematicians use least squares optimization for interpolation or extrapolation; statisticians use linear regression for predictions and model fitting. Both concepts are identical and rely on the exact same mathematical machinery: they are just two names for the same thing. Mathematical optimization is, however, closer to operations research than to statistics.

*Actuarial sciences: Just a subset of statistics focusing on insurance (car, health, etc.) using survival models: predicting when you will die, or what your health expenditures will be based on your health status (smoker, gender, previous diseases), to determine your insurance premiums. It also predicts extreme floods and weather events to determine premiums. These latter models have recently proved notoriously erroneous and have resulted in far bigger payouts than expected. Actuarial science is indeed a sub-domain of data science.

*Operations research: Abbreviated as OR. It is about decision science and optimizing traditional business projects: inventory management, supply chain, pricing. Practitioners make heavy use of Markov chain models, Monte Carlo simulations, queuing and graph theory, and software such as AIMS, Matlab or Informatica. Big, traditional old companies use OR, while new and small ones (start-ups) use data science to handle pricing, inventory management or supply chain problems. Many operations research analysts are becoming data scientists, as there is far more innovation, and thus growth prospects, in data science than in OR. Also, OR problems can be solved by data science. OR has a significant overlap with Six Sigma, also solves econometric problems, and has many practitioners and applications in the army and defense sectors. Car traffic optimization is a modern example of an OR problem, solved with simulations, commuter surveys, sensor data and statistical modeling.

*Quant: Quants are just data scientists working for Wall Street on problems such as high frequency trading or stock market arbitrage. They use C++ and Matlab, come from prestigious universities, and earn big bucks, but lose their jobs right away when ROI goes south too quickly. They can also be employed in energy trading. Quants have backgrounds in statistics, mathematical optimization, and industrial statistics.

*Artificial intelligence: The intersection with data science is pattern recognition (image analysis) and the design of automated systems to perform various tasks in machine-to-machine communication mode, such as identifying the right keywords (and the right bid) on Google AdWords (pay-per-click campaigns involving millions of keywords per day). An old AI technique is neural networks, but it is now losing popularity. By contrast, neuroscience is gaining popularity.

*Computer science: Data science has some overlap with computer science: Hadoop and MapReduce implementations, algorithmic and computational complexity to design fast, scalable algorithms, data plumbing, and problems such as Internet topology mapping, random number generation, encryption, data compression, and steganography (though these problems overlap with statistical science and mathematical optimization as well).

*Econometrics: It seems to have separated from statistics, as have many other branches that became less generic and started developing their own ad-hoc tools. In short, though, econometrics is heavily statistical in nature, using time series models such as auto-regressive processes. It also overlaps with operations research (itself overlapping with statistics!) and mathematical optimization (simplex algorithm). Econometricians like ROC and efficiency curves (so do Six Sigma practitioners). Many do not have a strong statistical background, and Excel is their main or only tool.

*Data engineering: Performed by software engineers (developers) or architects (designers) in large organizations (sometimes by data scientists in tiny companies), this is the applied part of computer science that powers systems allowing all sorts of data to be easily processed in-memory or near-memory, and to flow nicely to (and between) end-users, including heavy data consumers such as data scientists. A sub-domain currently under attack is data warehousing, as this term is associated with static, conventional databases, data architectures, and data flows, threatened by the rise of NoSQL, NewSQL and graph databases. Transforming these old architectures into new ones (only when needed), or making them compatible with new ones, is a lucrative business.

*Business intelligence: Abbreviated as BI. Focuses on dashboard creation, metric selection, producing and scheduling data reports (statistical summaries) sent by email or presented to executives, competitive intelligence (analyzing third party data), as well as involvement in database schema design (working with data architects) to collect useful, actionable business data efficiently. The typical job title is business analyst, but some practitioners are more involved with marketing, product or finance (forecasting sales and revenue). They typically have an MBA degree. Some have learned advanced statistics such as time series, but most only use (and need) basic statistics and light analytics, relying on IT to maintain databases and harvest data. They use tools such as Excel (including cubes and pivot tables, but not advanced analytics), Brio (Oracle browser client), Birt, MicroStrategy or Business Objects (as end-users to run queries), though some of these tools are increasingly equipped with better analytic capabilities. Unless they learn how to code, they are competing with versatile data scientists who excel in decision science, insights extraction and presentation (visualization), KPI design, business consulting, and ROI, yield, business and process optimization. BI and market research (but not competitive intelligence) are currently experiencing a decline, while AI is experiencing a comeback. This could be cyclical. Part of the decline is due to not adapting to new types of data (e.g. unstructured text) that require engineering or data science techniques to process and extract value.

*Data analysis: This has been the new term for business statistics since at least 1995, and it covers a large spectrum of applications including fraud detection, advertising mix modeling, attribution modeling, sales forecasts, cross-selling optimization (retail), user segmentation, churn analysis, computing the long-term value of a customer and the cost of acquisition, and so on. Except in big companies, data analyst is a junior role; these practitioners have much narrower knowledge and experience than data scientists, and they lack (and don't need) business vision. They are detail-oriented and report to managers such as data scientists or directors of analytics. In big companies, someone with a job title such as data analyst III might be very senior, yet such analysts are usually specialized and lack the broad knowledge gained by data scientists working in a variety of companies large and small.

*Business analytics: Same as data analysis, but restricted to business problems only. Tends to have a bit more of a financial, marketing or ROI flavor. Popular job titles include data analyst and data scientist, but not business analyst (see the business intelligence entry, which describes a different domain).
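The mathematical optimization entry above states that least squares optimization and linear regression are two names for the same thing. As a rough illustration, the sketch below fits a straight line twice using only the Python standard library: once by solving the normal equations of the least squares problem, and once with the textbook regression formula (slope equals covariance over variance). The function names and the sample data are hypothetical, chosen only for this example.

```python
from statistics import mean


def fit_normal_equations(xs, ys):
    """Least squares view: minimize sum((a*x + b - y)^2) via the 2x2 normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # determinant of the normal-equation matrix
    if det == 0:
        raise ValueError("x values must not all be identical")
    slope = (n * sxy - sx * sy) / det  # Cramer's rule
    intercept = (sxx * sy - sx * sxy) / det
    return slope, intercept


def fit_regression(xs, ys):
    """Regression view: slope = cov(x, y) / var(x); the line passes through the means."""
    mx, my = mean(xs), mean(ys)
    var_x = sum((x - mx) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("x values must not all be identical")
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, my - slope * mx


if __name__ == "__main__":
    # Made-up data points, for illustration only.
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.1, 3.9, 6.2, 8.1, 9.8]
    print("normal equations:", fit_normal_equations(xs, ys))
    print("regression formula:", fit_regression(xs, ys))
```

Both functions return the same slope and intercept up to floating-point rounding, because the regression formula is simply the closed-form solution of the same minimization problem.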
