  • Mean, median, and mode are fundamental topics of statistics. You can easily calculate them in Python, with and without the use of external libraries.

    These three are the main measures of central tendency. The central tendency lets us know the “normal” or “average” values of a dataset. If you’re just starting with data science, this is the right tutorial for you.

    Mean, median, and mode: the three measures of central tendency

    By the end of this tutorial you’ll:

    • Understand the concept of mean, median, and mode
    • Be able to create your own mean, median, and mode functions in Python
    • Make use of Python’s statistics module to quickstart the use of these measurements

    If you want a downloadable version of the following exercises, feel free to check out the GitHub repository.

    Let’s get into the different ways to calculate mean, median, and mode.

    Calculating the Mean in Python

    The mean or arithmetic average is the most used measure of central tendency.

    Remember that central tendency is a typical value of a set of data.

    A dataset is a collection of data, therefore a dataset in Python can be any of the following built-in data structures:

    • Lists, tuples, and sets: a collection of objects
    • Strings: a collection of characters
    • Dictionary: a collection of key-value pairs

    Note: Although there are other data structures in Python, like queues or stacks, we’ll be using only the built-in ones.
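    As a rough illustration of the list above, the sketch below stores the same ages in a list, a tuple, a set, and a dictionary. The ages are the ones used for the basketball team in this tutorial, while the player names in the dictionary are hypothetical and exist only for this example. Note that a set stores each value once, which changes any statistic computed from it.

```python
# Ages of the basketball team used in this tutorial.
ages_list = [19, 22, 34, 26, 32, 30, 24, 24]
ages_tuple = tuple(ages_list)
ages_set = set(ages_list)  # the duplicate 24 is stored only once

# Hypothetical player names, used only for illustration.
ages_by_player = {"ana": 19, "ben": 22, "cleo": 34}

print(len(ages_list), len(ages_tuple))  # 8 8
print(len(ages_set))                    # 7
print(sum(ages_by_player.values()))     # 75
print(sum(ages_list) / len(ages_list))  # 26.375
print(sum(ages_set) / len(ages_set))    # about 26.71, not 26.375
```

    If duplicates matter for your calculation, as they do for the mean, keep the data in a list or a tuple rather than a set.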

    We can calculate the mean by adding all the values of a dataset and dividing the result by the number of values. For example, if we have the following list of numbers:

    [1, 2, 3, 4, 5, 6]

    The mean or average would be 3.5 because the sum of the list is 21 and its length is 6. Twenty-one divided by six is 3.5. You can verify this with the following calculation:

    (1 + 2 + 3 + 4 + 5 + 6) / 6 = 21 / 6 = 3.5

    In this tutorial, we’ll be using the players of a basketball team as our sample data.

    Creating a Custom Mean Function

    Let’s start by calculating the average (mean) age of the players in a basketball team. The team’s name will be “Pythonic Machines”.

    pythonic_machine_ages = [19, 22, 34, 26, 32, 30, 24, 24]
    
    def mean(dataset):
        return sum(dataset) / len(dataset)
    
    print(mean(pythonic_machine_ages))

    Breaking down this code:

    • The “pythonic_machine_ages” is a list with the ages of basketball players
    • We define a mean() function which returns the sum of the given dataset divided by its length
      • The sum() function returns the total of the values of an iterable, in this case a list. If you pass the dataset to it, it returns 211
      • The len() function returns the number of items in a collection. If you pass the dataset to it, you’ll get 8
    • We pass the basketball team ages to the mean() function and print the result.

    If you check the output, you’ll get:

    26.375
    # Because 211 / 8 = 26.375

    This output represents the average age of the basketball team players. Note that the number doesn’t appear in the dataset, yet it still gives a typical age for the team.
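    One case the function above does not handle is an empty dataset: len() returns 0 and the division raises a ZeroDivisionError. Here is a minimal sketch of a guarded variant. The name safe_mean and the choice of raising ValueError are assumptions made for this example, not part of the original function.

```python
def safe_mean(dataset):
    """Return the arithmetic mean, rejecting empty datasets explicitly."""
    values = list(dataset)  # also accepts generators, which have no len()
    if not values:
        raise ValueError("mean requires at least one value")
    return sum(values) / len(values)


pythonic_machine_ages = [19, 22, 34, 26, 32, 30, 24, 24]
print(safe_mean(pythonic_machine_ages))  # 26.375

try:
    safe_mean([])
except ValueError as error:
    print(error)  # mean requires at least one value
```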

    Using mean() from the Python Statistics Module

    Calculating measures of central tendency is a common operation for most developers. That’s why Python’s statistics module provides several functions to calculate them, along with other basic statistics.

    Since it’s part of the Python standard library, you won’t need to install any external package with pip.

    Here’s how you use this module:

    from statistics import mean
    
    pythonic_machine_ages = [19, 22, 34, 26, 32, 30, 24, 24]
    
    print(mean(pythonic_machine_ages))

    In the above code, you just need to import the mean() function from the statistics module and pass the dataset to it as an argument. This will return the same result as the custom function we defined in the previous section:

    26.375
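    Two details of the statistics module are worth knowing, sketched below. First, mean() raises a StatisticsError when the dataset is empty. Second, the module also offers fmean() (Python 3.8 or later), which always returns a float and usually runs faster.

```python
from statistics import StatisticsError, fmean, mean

pythonic_machine_ages = [19, 22, 34, 26, 32, 30, 24, 24]

print(mean(pythonic_machine_ages))   # 26.375
print(fmean(pythonic_machine_ages))  # 26.375

# When integers average to a whole number, mean() returns an int
# while fmean() still returns a float.
print(mean([1, 2, 3]))   # 2
print(fmean([1, 2, 3]))  # 2.0

try:
    mean([])
except StatisticsError as error:
    print("empty dataset:", error)
```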

    Now that the concept of mean is crystal clear, let’s continue with the median.

    Finding the Median in Python

    The median is the middle value of a sorted dataset. Like the mean, it is used to provide a “typical” value of a given population.

    In programming, we can define the median as the value that separates a sorted sequence into two parts: the lower half and the higher half.

    To calculate the median, we first need to sort the dataset. We could do this with a sorting algorithm or with the built-in function sorted(). The second step is to determine whether the dataset length is odd or even. Depending on the result, we follow one of these rules:

    • Odd: The median is the middle value of the dataset
    • Even: The median is the sum of the two middle values divided by two

    Continuing with our basketball team dataset, let’s calculate the players’ median height in centimeters:

    [181, 187, 196, 196, 198, 203, 207, 211, 215]
    # Since the dataset length is odd, we select the middle value
    median = 198

    As you can see, since the dataset length is odd, we can take the middle value as the median. However, what would happen if a player retired?

    We would need to calculate the median from the two middle values of the dataset:

    [181, 187, 196, 198, 203, 207, 211, 215]
    # We add the two middle values and divide the sum by 2
    median = (198 + 203) / 2
    median = 200.5

    Creating a Custom Median Function

    Let’s implement the above concept into a Python function.

    Remember the three steps we need to follow to get the median of a dataset:

    • Sort the dataset: We can do this with the sorted() function
    • Determine if it’s odd or even: We can do this by getting the length of the dataset and using the modulo operator (%)
    • Return the median based on each case:
      • Odd: Return the middle value
      • Even: Return the average of the two middle values

    That would result in the following function:

    pythonic_machines_heights = [181, 187, 196, 196, 198, 203, 207, 211, 215]
    after_retirement = [181, 187, 196, 198, 203, 207, 211, 215]
    
    def median(dataset):
        data = sorted(dataset)
        index = len(data) // 2
        
        # If the dataset is odd  
        if len(dataset) % 2 != 0:
            return data[index]
        
        # If the dataset is even
        return (data[index - 1] + data[index]) / 2

    Printing the result of our datasets:

    print(median(pythonic_machines_heights))
    print(median(after_retirement))

    Output:

    198
    200.5

    Note how we create a data variable that points to the sorted dataset at the start of the function. Although the lists above are already sorted, we want a reusable function, so we sort the dataset each time the function is called.

    The index variable stores the middle index of the dataset (or the upper-middle index when the length is even), obtained with the integer division operator (//). For instance, if we pass the “pythonic_machines_heights” list, index has the value 4.

    Remember that sequence indexes in Python start at zero. That’s why integer division of the length by two gives us the middle index of a list.

    Then we check whether the length of the dataset is odd by testing that the result of the modulo operation isn’t zero. If the condition is true, we return the middle element. For instance, with the “pythonic_machines_heights” list:

    >>> pythonic_machines_heights[4]
    # 198

    On the other hand, if the dataset length is even, we return the sum of the two middle values divided by two. Note that data[index - 1] gives us the lower midpoint of the dataset, while data[index] gives us the upper midpoint.
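    To check this reasoning, the sketch below repeats the function and calls it with the same heights in a shuffled order. The particular ordering is an arbitrary choice for illustration. Because sorted() returns a new list, the caller’s data is left untouched, and the function works with tuples as well as lists.

```python
def median(dataset):
    data = sorted(dataset)  # new sorted list; the argument is not modified
    index = len(data) // 2  # middle index (upper-middle when the length is even)

    if len(data) % 2 != 0:
        return data[index]
    return (data[index - 1] + data[index]) / 2


# Same heights as in the tutorial, deliberately out of order.
shuffled_heights = [207, 181, 215, 196, 198, 187, 211, 196, 203]
original_order = list(shuffled_heights)

print(median(shuffled_heights))                                # 198
print(median((215, 181, 203, 187, 211, 196, 207, 198)))        # 200.5
print(shuffled_heights == original_order)                      # True
```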

    Using median() from the Python Statistics Module

    This way is much simpler because we’re using an existing function from the statistics module.

    Personally, if something is already defined for me, I use it because of the DRY (Don’t Repeat Yourself) principle. In this case, that means not rewriting others’ code.

    You can calculate the median of the previous datasets with the following code:

    from statistics import median
    
    pythonic_machines_heights = [181, 187, 196, 196, 198, 203, 207, 211, 215]
    after_retirement = [181, 187, 196, 198, 203, 207, 211, 215]
    
    print(median(pythonic_machines_heights))
    print(median(after_retirement))

    Output:

    198
    200.5
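    When the dataset has an even length and you need a result that actually occurs in the data, the statistics module also provides median_low() and median_high(). A short sketch with the same even-length dataset:

```python
from statistics import median, median_high, median_low

after_retirement = [181, 187, 196, 198, 203, 207, 211, 215]

print(median_low(after_retirement))   # 198, the lower of the two middle values
print(median_high(after_retirement))  # 203, the upper of the two middle values
print(median(after_retirement))       # 200.5, their average
```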

    Computing the Mode in Python

    The mode is the most frequent value in the dataset. We can think of it as the “popular” group of a school, which may represent a standard for all the students.

    An example of mode could be the daily sales of a tech store. The mode of that dataset would be the most sold product of a specific day.

    ['laptop', 'desktop', 'smartphone', 'laptop', 'laptop', 'headphones']

    As you can see, the mode of the above dataset is “laptop” because it is the most frequent value in the list.

    The cool thing about the mode is that the dataset doesn’t have to be numeric. For instance, we can work with strings.

    Let’s analyze the sales of another day:

    ['mouse', 'camera', 'headphones', 'usb', 'headphones', 'mouse']

    The dataset above has two modes: “mouse” and “headphones” because both have a frequency of two. This means it’s a multimodal dataset.

    What if we can’t find the mode in a dataset, like the one below?

    ['usb', 'camera', 'smartphone', 'laptop', 'TV']

    This is called a uniform distribution: every value appears with the same frequency, so no single value stands out as the mode.

    Now that you have a quick grasp of the concept of mode, let’s calculate it in Python.
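    To see the frequencies behind these three cases, here is a minimal sketch that uses collections.Counter from the standard library. The variable names are chosen for this example only.

```python
from collections import Counter

one_mode = ['laptop', 'desktop', 'smartphone', 'laptop', 'laptop', 'headphones']
two_modes = ['mouse', 'camera', 'headphones', 'usb', 'headphones', 'mouse']
no_mode = ['usb', 'camera', 'smartphone', 'laptop', 'TV']

for sales in (one_mode, two_modes, no_mode):
    counts = Counter(sales)
    # most_common() lists (value, frequency) pairs, highest frequency first
    print(counts.most_common())
```

    In the first list “laptop” has a count of 3, in the second “mouse” and “headphones” share a count of 2, and in the third every product has a count of 1.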

    Creating a Custom Mode Function

    We can think of the frequency of a value as a key-value pair, in other words, a Python dictionary.

    Returning to the basketball example, we can work with two datasets: the points per game and the sneaker sponsorships of some players.

    To find the mode, we first create a frequency dictionary with each of the values present in the dataset, then get the maximum frequency, and finally return all the elements with that frequency.

    Let’s translate this into code:

    points_per_game = [3, 15, 23, 42, 30, 10, 10, 12]
    sponsorship = ['nike', 'adidas', 'nike', 'jordan',
                   'jordan', 'rebook', 'under-armour', 'adidas']
    
    def mode(dataset):
        frequency = {}
    
        for value in dataset:
            frequency[value] = frequency.get(value, 0) + 1
    
        most_frequent = max(frequency.values())
    
        modes = [key for key, value in frequency.items()
                 if value == most_frequent]
    
        return modes

    Checking the result passing the two lists as arguments:

    print(mode(points_per_game))
    print(mode(sponsorship))

    Output:

    [10]
    ['nike', 'adidas', 'jordan']

    As you can see, the first print statement gave us a single mode, while the second returned multiple modes.

    Breaking down the code above in more detail:

    • We declare a frequency dictionary
    • We iterate over the dataset to create a histogram, the statistical term for a set of counters (or frequencies)
      • If the key is already in the dictionary, we add one to its value
      • If it’s not found, we create a key-value pair with a value of one
    • The most_frequent variable stores the biggest value (not key) of the frequency dictionary
    • We return the modes variable, which consists of all the keys in the frequency dictionary with the highest frequency.

    Note how important variable naming is for writing readable code.
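    The counting step is the heart of the function, so here is a small sketch of it in isolation for the points_per_game list. It shows what frequency and most_frequent hold before the modes are collected.

```python
points_per_game = [3, 15, 23, 42, 30, 10, 10, 12]

frequency = {}
for value in points_per_game:
    # get() returns 0 the first time a value is seen, so no KeyError is raised
    frequency[value] = frequency.get(value, 0) + 1

print(frequency)
# {3: 1, 15: 1, 23: 1, 42: 1, 30: 1, 10: 2, 12: 1}

most_frequent = max(frequency.values())
print(most_frequent)  # 2
```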

    Using mode() and multimode() from the Python Statistics Module

    Once again, the statistics module provides a quick way to perform basic statistics operations.

    We can use two functions: mode() and multimode().

    from statistics import mode, multimode
    
    points_per_game = [3, 15, 23, 42, 30, 10, 10, 12]
    sponsorship = ['nike', 'adidas', 'nike', 'jordan',
                   'jordan', 'rebook', 'under-armour', 'adidas']

    The code above imports both functions and defines the datasets we’ve been working with.

    Here comes the little difference: The mode() function returns the first mode it encounters, while multimode() returns a list with the most frequent values in the dataset.

    Consequently, we can say the custom function we defined is actually a multimode() function.

    Using the mode() function:

    print(mode(points_per_game))
    print(mode(sponsorship))

    Output:

    10
    nike

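    Note: In Python 3.8 or later, mode() returns the first mode it encounters when the dataset has several modes. In earlier versions, it raises a StatisticsError in that case. The multimode() function is also only available from Python 3.8 onward.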

    Using the multimode() function:

    print(multimode(points_per_game))
    print(multimode(sponsorship))

    Output:

    [10]
    ['nike', 'adidas', 'jordan']
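    Here is a short sketch of how the two functions treat edge cases, assuming Python 3.8 or later. When no value repeats, multimode() returns every value and mode() returns the first one. With an empty dataset, multimode() returns an empty list while mode() raises a StatisticsError.

```python
from statistics import StatisticsError, mode, multimode

no_repeats = ['usb', 'camera', 'smartphone', 'laptop', 'TV']

print(multimode(no_repeats))  # all five values, since each appears once
print(mode(no_repeats))       # usb, the first value encountered
print(multimode([]))          # []

try:
    mode([])
except StatisticsError as error:
    print("empty dataset:", error)
```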

    To Sum Up

    Congratulations! If you’ve followed along so far, you’ve learned how to calculate the mean, median, and mode, the main measures of central tendency.

    Although you can define your own functions to find the mean, median, and mode, it’s recommended to use the statistics module, since it’s part of the standard library and you don’t need to install anything to start using it.

    Next, read a friendly introduction to data analysis in Python.