This is the first article in a series of tutorials on data science. We will cover the following topics in this article:
 Types of data
 Mean
 Median
 Impact of outliers on mean
 Mode
Without delving too deep into the coding aspect, we will see what mean, median, and mode are, and how to derive them in Python. We will discuss the code in more detail in subsequent articles that focus on Python libraries. Let us begin by discussing the three different types of data:
 Numerical Data
 Categorical Data
 Ordinal Data
1. Numerical Data
It’s probably the most common type of data. Basically, it represents some quantifiable thing that you can measure. Some examples are heights of people, page load times, and stock prices. Numerical data can be subdivided into two types:
1.1) Discrete data
Discrete data refers to the measure of things in whole numbers (integers). For example, the number of purchases made by a customer in a year. Since the number of things that a person buys cannot be three and a half, or four and a third (it must be a whole number like four or five things), this kind of data falls under the discrete category.
1.2) Continuous data
In contrast to discrete data, continuous data includes all numbers possible between any two integers or whole numbers. For example, the height of something. It could be 9.2345 inches or 9.7219 inches, or any other fraction between the two whole numbers nine and ten. Another example could be the amount of rainfall recorded in a day. Again, the amount does not necessarily have to be a whole number. It could be 6.5 mm or 23.1 mm of rainfall, depending on the shower God’s fancy.
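As a rough illustration of the distinction, the sketch below stores discrete values as Python integers and continuous values as floats. The variable names and sample values are made up for this example (the rainfall figures echo the ones mentioned above).

```python
# Hypothetical sample values, used only to illustrate the two kinds of numerical data.
purchases_per_year = [4, 5, 0, 12]   # discrete: counts are always whole numbers
rainfall_mm = [6.5, 23.1, 0.0]       # continuous: any value in a range is possible

# Every discrete observation is an integer.
all_discrete = all(isinstance(v, int) for v in purchases_per_year)

# At least one continuous observation has a fractional part.
has_fraction = any(v != int(v) for v in rainfall_mm)

print(all_discrete, has_fraction)
```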
2. Categorical Data
This type of data is non-numeric. We use it to classify things into categories like gender, ethnicity, nationality, political party, etc. We can assign numbers to the categories, but the numbers would not, in that case, represent a value per se. They only separate one type from another: type one from type two or three. For example, while calculating India’s population, Bangalore could be city number one, Mumbai number two, and so on. The data collected, however, would still represent the number of people in Bangalore and Mumbai, and not the population of one and two. These numbers have no value of their own in this context.
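The city example can be sketched in a few lines of Python. The records below are invented for illustration, and the code for Delhi is an assumption added to make the sample slightly richer. The point is that the integer codes are only labels: counting per category is meaningful, while arithmetic on the codes is not.

```python
from collections import Counter

# Hypothetical records: the city each respondent lives in.
residents = ["Bangalore", "Mumbai", "Bangalore", "Delhi", "Mumbai", "Bangalore"]

# Arbitrary integer labels for each category (Delhi's code is an assumption).
city_codes = {"Bangalore": 1, "Mumbai": 2, "Delhi": 3}

# Encoding replaces each name with its label; the labels carry no magnitude.
encoded = [city_codes[city] for city in residents]

# What is meaningful for categorical data: how many records fall in each category.
counts = Counter(residents)

print(encoded)
print(counts)
```

Averaging the values in `encoded` would produce a number, but that number would say nothing about the cities, which is why counts are the usual summary for categorical data.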
3. Ordinal Data
Ordinal data is an amalgamation of numerical and categorical data. Simply put, this data type consists of categories that are in order. The intervals between categories are not known. Good examples of this data type are movie or music ratings that use stars to denote quality. Numbers simply represent the good and bad categories. A movie with a 5-star rating is obviously very good as opposed to a movie with only 1 star, which, very likely, is terrible. Note that the numbers in this example do denote value. Mathematically speaking, 5 is greater than 1. This difference in value is used to differentiate good films from bad. Good films receive a higher rating of 4 or 5, while bad films only get a lower rating of 1 or 2.
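A minimal sketch of working with ordinal data, assuming a made-up list of star ratings: because the categories are ordered, sorting and comparing them is meaningful, even though the gap between two ratings is not a measured quantity.

```python
# Hypothetical star ratings (1 to 5) for a handful of films.
ratings = [5, 1, 4, 2, 5, 3, 4]

# Ordering is meaningful for ordinal data, so the ratings can be ranked.
ranked = sorted(ratings)

# Split into the "good" (4 or 5 stars) and "bad" (1 or 2 stars) groups described above.
good = [r for r in ratings if r >= 4]
bad = [r for r in ratings if r <= 2]

print(ranked)
print(len(good), len(bad))
```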
Mean
Mean is simply another name for average. To calculate the mean of a data set, divide the sum of all values by the number of values. Consider the following set of numbers: {5,2,2,7}. The mean is (5 + 2 + 2 + 7) / 4 = 16 / 4 = 4. We use the symbol x̄ (“x-bar”) to represent the mean of a sample. The formula to compute the mean for a set of n values is:
x̄ = (x1 + x2 + ... + xn) / n
To see this in Python, the code below draws 10,000 expenditure values from a normal distribution with a mean of 25,000 and a standard deviation of 15,000, and then computes their mean. We will explain terms like standard deviation and normal distribution in subsequent blogs. For now, all we need to keep in mind is the sample size (10,000), and the mean (25,000). Don’t worry about other components like numpy for code, or the criteria for calculation.
Code:

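import numpy as np
# Draw 10,000 values from a normal distribution (mean 25,000, standard deviation 15,000)
expenditure = np.random.normal(25000, 15000, 10000)
# Sample mean; it should land close to 25,000
np.mean(expenditure)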
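The hand calculation for {5,2,2,7} can also be checked directly. This is a small sketch that uses only plain Python and the standard-library `statistics` module rather than numpy; the variable names are chosen for this example.

```python
from statistics import mean

values = [5, 2, 2, 7]

# Mean by definition: the sum of all values divided by the number of values.
manual_mean = sum(values) / len(values)

# The same result from the standard library.
library_mean = mean(values)

print(manual_mean, library_mean)
```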

Median
Median, in simple words, is the number that lies in the middle of a list of ordered numbers. The numbers may be in ascending or descending order. Let us consider the following data set:
0,2,3,4,5,1,2,0,6
After sorting these numbers in ascending order, we get the following list:
0,0,1,2,2,3,4,5,6
The number in the center (fifth from either side) is 2, so 2 is the median in this example.
The median is easy to find when there is an odd number of elements in the data set. When there is an even number of elements, you need to take the average of the two numbers that fall in the center of the ordered list. So, if we consider the following data set:
0,0,1,4,2,3
After sorting the numbers, we get the following list:
0,0,1,2,3,4
The average of 1 and 2, in this case, is the median.
Median = (1 + 2) / 2 = 1.5
Let us now see how to find the median in Python. To get the median of a data set, run np.median(expenditure) in a Jupyter notebook. The median of expenditures from the previous example is 25,179.05. In this case, it is not very far from the mean, which is 25,120.24. Before we discuss mode, let us understand what outliers are, and how they impact the mean of a data set.
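The two small median examples worked through above can be reproduced with the standard-library `statistics` module. This is only a sketch with illustrative variable names; `np.median` would give the same results on these lists.

```python
from statistics import median

odd_sample = [0, 2, 3, 4, 5, 1, 2, 0, 6]   # nine values: a single middle element
even_sample = [0, 0, 1, 4, 2, 3]           # six values: average the two middle elements

# median() sorts the data internally, so the input order does not matter.
odd_median = median(odd_sample)
even_median = median(even_sample)

print(odd_median, even_median)
```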
 Any value in a data set that lies at an abnormal distance from the other values can be termed an outlier. Outliers tend to skew the mean radically.
 An outlier may be either a very high value or a very low value.
Let us see how by manually appending a large value (1000000000) to the expenditure data and then recalculating the median and the mean. Code:

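# Append a single extreme value (the outlier) to the existing sample
expenditure = np.append(expenditure, [1000000000])
# Print both results; a notebook cell would otherwise display only the last expression
print(np.median(expenditure))
print(np.mean(expenditure))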

What we find is that the large value, or the outlier, changes the median only to a small extent (from 25,179.05 to 24,932.93), and the mean to a great extent (from 25,120.24 to 124,822.14). An outlier is a problem precisely because it can skew the mean of a data set radically, thereby misrepresenting the data set altogether.
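The same effect is easy to verify on a data set small enough to check by hand. The numbers below are hypothetical and unrelated to the random expenditure sample; the sketch uses the standard-library `statistics` module.

```python
from statistics import mean, median

# Hypothetical small sample of spending values.
spend = [20, 22, 25, 27, 30]
mean_before = mean(spend)       # 124 / 5
median_before = median(spend)   # middle value of the sorted list

# Add one extreme value and recompute both measures.
spend_with_outlier = spend + [1000]
mean_after = mean(spend_with_outlier)       # 1124 / 6
median_after = median(spend_with_outlier)   # average of 25 and 27

print(mean_before, median_before)
print(mean_after, median_after)
```

The median moves by one unit, while the mean becomes several times larger, which is why the median is often preferred as a summary when outliers are expected.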
Mode
Mode is not used as often as mean or median. It is the value that appears most often in a data set. For example, in the following data set, 0 appears most often (four times). Therefore, it is the mode.
0,0,1,2,3,0,4,5,0
Mode in Python: Let’s generate a random expenditure data set using the script below.
expenditure = np.random.randint(15, high=50, size=200)
expenditure
The mode can then be found with SciPy:

from scipy import stats
stats.mode(expenditure)

In this run, 35 is the most frequently occurring value in the random data set. Therefore, it is the mode of the data set. Because the data is random, a different run may produce a different mode.
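For the small hand-worked data set (0,0,1,2,3,0,4,5,0), the mode can be confirmed without SciPy. The sketch below is one possible approach using the standard library, with variable names chosen for this example; it counts occurrences explicitly so that the frequency is visible as well.

```python
from collections import Counter
from statistics import mode

data = [0, 0, 1, 2, 3, 0, 4, 5, 0]

# Count how often each value occurs and take the most frequent one.
counts = Counter(data)
most_common_value, frequency = counts.most_common(1)[0]

# statistics.mode returns the most frequent value directly.
library_mode = mode(data)

print(most_common_value, frequency, library_mode)
```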