When a developer is first learning to code, the focus is, understandably, on getting the code to work. For a while, that is the only metric that matters. As you gain experience and work on larger, more complex projects, however, it becomes crucial to learn about code quality, because two snippets of code can accomplish the same thing while one is far better than the other. The more you know about data structures and algorithms, the better you'll be at writing high-quality code.
First, let me define data structures and algorithms and explain why they matter. A data structure is simply a way of organizing data. Algorithm can be an intimidating word, but an algorithm is simply a set of instructions for completing a specific task. Together, these two things have a huge impact on the quality and efficiency of code.
Two important ways to measure code quality are maintainability and efficiency. Maintainability involves things like readability, organization, and modularity, and it matters most when you start working with a larger team on a larger project. Efficiency involves reliability and speed, and it is where most of our focus on data structures and algorithms will be: two snippets of code can achieve the same task while one is much faster and more reliable than the other.
To illustrate this with a very simple example, let's look at these two functions whose task is to print all even numbers from 2 to 100.
def print_numbers_version_one():
    number = 2
    while number <= 100:
        # check every number, but only print the even ones
        if number % 2 == 0:
            print(number)
        number += 1

def print_numbers_version_two():
    number = 2
    while number <= 100:
        # step by 2, so every number we visit is already even
        print(number)
        number += 2
In this simple example, it's easy to see that Version 1 takes roughly twice as many steps as Version 2 to accomplish the same task. Version 1 loops 99 times (checking every number from 2 through 100), while Version 2 loops only 50 times. As we dive deeper into data structures and algorithms, we'll learn how code interacts with data structures, how that impacts efficiency and how to measure it, and what to look out for, because there can be differences in efficiency far greater than this that are much harder to spot.
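If you want to verify those counts yourself, one approach is to add a step counter to each loop. This is just a demonstration sketch; the steps counter and the renamed functions are my own additions, not part of the original code.

def count_steps_version_one():
    steps = 0
    number = 2
    while number <= 100:
        steps += 1  # count each pass through the loop
        if number % 2 == 0:
            print(number)
        number += 1
    return steps  # 99 iterations

def count_steps_version_two():
    steps = 0
    number = 2
    while number <= 100:
        steps += 1  # count each pass through the loop
        print(number)
        number += 2
    return steps  # 50 iterations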
Data Structure Operations
In order to understand the efficiency of any code, we first have to gain some understanding of how code interacts with data structures. For these examples we'll be referencing arrays. If you're reading this, you've probably worked with arrays before, but it's still worth calling out that arrays are a basic data structure with two important characteristics: the length of an array is how many data elements it holds, and the index of an array is a number that identifies where in the array a piece of data lives, which begins at 0 in most programming languages.
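As a quick illustration in Python (the list here is just a made-up example):

colors = ["red", "green", "blue"]
print(len(colors))  # length: this array holds 3 elements
print(colors[0])    # index 0 is where the first element, "red", lives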
Many data structures have four basic operations, or ways they're used: Read, Search, Insert, and Delete.
- Reading refers to looking up something at a specific spot within the data structure.
- Searching refers to looking for a particular value in a data structure.
- Insertion refers to adding an additional value to a data structure.
- Deletion refers to removing a value from our data structure.
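Before we examine the cost of each operation, here's a quick preview of what all four look like on a Python list, which is a dynamic array (the variable names below are my own examples):

numbers = [5, 8, 2, 9]
value = numbers[2]           # read: what is at index 2? -> 2
position = numbers.index(9)  # search: at what index does 9 live? -> 3
numbers.insert(1, 7)         # insert: place 7 at index 1 -> [5, 7, 8, 2, 9]
numbers.pop(0)               # delete: remove the value at index 0 -> [7, 8, 2, 9]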
Measuring Speed
It is not very useful to measure the efficiency or speed of a given algorithm or snippet of code by the time it takes to complete, because that varies based on the hardware running the code. Instead, we measure the number of steps it takes to complete, as we did with the code snippets above. Measuring the number of steps an algorithm takes is useful because that count is consistent across all machines. The terms speed, time complexity, efficiency, performance, and runtime can be used interchangeably in this context; they all refer to the number of steps an algorithm takes. With this in mind, let's look at the four operations on an example array and note how many steps each takes.
cities = ["Springfield", "Athens", "Sparta", "Dublin", "Rome"]
Read
When a computer allocates an array, it also makes a note of the memory address at which the array begins. Because of this, the computer can jump to any particular index in the array and see what is there in just one step. Therefore, reading from an array is a very efficient operation. If we ask the computer what value is at index 3 of our array, it can tell us in just one computational step that the value is "Dublin".

What if, instead, we asked the computer whether the value "Dublin" is located in the array at all, or what index it is at? We would have to perform a search operation on the array.
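In Python, that one-step read is just an index lookup:

cities = ["Springfield", "Athens", "Sparta", "Dublin", "Rome"]  # our example array
print(cities[3])  # one step: jump straight to index 3 -> Dublin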
Search
An important fact to remember about computers is that while a computer has immediate access to all of its memory addresses (which is why we can read so quickly), it does not know what values those addresses contain at all times. Because of this, searching an array can be a tedious process. To find out if, and where, a value is contained within an array, the computer has to check each index of the array until it finds the value we are searching for. If we search our array for "Dublin", the computer starts at the beginning of the array and checks what is at cities[0], then cities[1], then cities[2], then cities[3] before finding "Dublin".
In our array, it took four steps to find the value we were looking for. For an array of N length, searching can take up to N steps. However, if we were to search our array for "Springfield", it would take only one step, as that value is at index 0 of the array. There are other ways to search arrays, which I will share in another post, but this basic one is called linear search. In any case, it's important to know that searching is less efficient than reading, because searching can take many steps while reading always takes just one.
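Here's a minimal sketch of linear search in Python; the function name is my own:

def linear_search(array, target):
    # check one index at a time until we find the target
    for index in range(len(array)):
        if array[index] == target:
            return index  # found it; report where it lives
    return None  # checked every slot without finding the target

cities = ["Springfield", "Athens", "Sparta", "Dublin", "Rome"]
print(linear_search(cities, "Dublin"))  # 3 (took four checks: indexes 0 through 3)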
Insertion
When allocating memory addresses to an array, the computer always keeps track of the size of the array, and as we already know, it also knows the memory address at which the array begins. Because of these two facts, inserting a piece of data at the end of an array is a piece of cake: the computer can do it in one step. However, the number of steps can vary depending on where we want to insert the new piece of data, because in order to insert anywhere before the end of the array, the computer needs to shift each piece of data that comes after that point to make room. Say we want to insert the value "Chicago" into our array. We know that inserting it at the end would take only one step, but say we want to insert it right after "Springfield", at index 1, for some reason. Maybe we want the American cities to come before the international ones. In order to do that, the computer would have to move cities[1], cities[2], cities[3], and cities[4] each one memory address to the right before it could insert "Chicago" at cities[1]. This insert operation would take a total of 5 steps: 4 shifts plus the insertion itself. The worst-case scenario for insertion is inserting a value at the beginning of an array, as that requires the computer to shift every value in the array before inserting. For an array of N length, the most steps an insertion can take is N + 1.
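To make the shifting concrete, here's a sketch of what happens under the hood. Python's built-in list.insert does this for you; the manual version below (with a function name of my own) is just to show the steps:

def insert_at(array, index, value):
    array.append(None)  # grow the array by one slot
    # shift everything from the end down to the insertion point one slot right
    for i in range(len(array) - 1, index, -1):
        array[i] = array[i - 1]
    array[index] = value  # finally, drop the new value into place

cities = ["Springfield", "Athens", "Sparta", "Dublin", "Rome"]
insert_at(cities, 1, "Chicago")  # 4 shifts + 1 insertion = 5 steps
print(cities)  # ['Springfield', 'Chicago', 'Athens', 'Sparta', 'Dublin', 'Rome']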
Deletion
Knowing how a computer deals with arrays during insertion, we can think of deletion as roughly the opposite of insertion. When we delete a piece of data from an array, we then shift each piece of data that comes after it down one index. If we were to delete "Rome" from the end of our array, it would take only one step to do so. However, if we delete "Athens" from index 1, the first step is deleting that piece of data. Then we shift "Sparta" from cities[2] into the empty slot at cities[1], "Dublin" from cities[3] to cities[2], and "Rome" from cities[4] to cities[3]. This means deleting index 1 from our array requires 4 steps: 1 deletion plus 3 shifts. For an array of N length, the most steps a deletion could possibly take is N.
Now that we have a basic understanding of the efficiency of the different operations on standard arrays, I think this is a good spot to wrap up. In my next blog post I will introduce a couple more data structures and see how these operations and algorithms perform with them.