Sorted Arrays, Binary Search, & Big O

by Kenny Porterfield | Wed, Apr 21 2021

Category: Data Structures & Algorithms

To write code that runs efficiently, a programmer needs a solid understanding of data structures and algorithms. A data structure is simply the way data is organized, and an algorithm is a set of instructions for completing a task. Oftentimes with code there are multiple ways to accomplish a goal, but there can be a big difference in efficiency between code written by someone simply trying to get a job done and code written by someone with a solid understanding of data structures and algorithms. That is why I'm here today to write some more about data structures & algorithms.

 

In my last blog post (and first on data structures & algorithms), I explored the array data structure and talked about four basic operations performed on data structures: reading, searching, inserting, and deleting. I'm going to start today by introducing another data structure: the set.

 

Sets

A set is a data structure that stores only unique values. There are different types of sets, but for now we'll stick with array-based sets. So for our purposes, a set is an array that does not allow duplicate values in it. Let's examine how the different operations perform on a set vs a normal array. 

 

Reading from a set is going to be the same as reading from an array. The set doesn't change the way the computer works: it still takes one step to look up what's at a particular index.

 

Searching and deleting from a set will also be the same as from an array. The computer will perform the same steps in the same order to do those operations.

 

Inserting is where we see a difference. Because sets do not allow duplicates, in order to insert into a set the computer first has to check each index in the array to make sure the value we're trying to insert doesn't already exist there; the computer doesn't know off the top of its head what values the set contains. So every insertion first requires a search.

 

cities = ["Springfield", "Athens", "Sparta", "Dublin", "Rome"]

 

Looking back at the array we used previously, if this were an array-based set and we wanted to insert the value "Chicago", the computer would need to check each index to determine whether "Chicago" already exists in the set. Searching each index of a set of N length takes N steps. Then, if the value we're trying to insert isn't already in the set, the computer has to perform the insertion itself. Depending on where in the set we want to insert the value, this can take up to N+1 steps (inserting at the very beginning means shifting all N existing values, then inserting the new one). In total, the maximum number of steps involved in inserting a value into a set is 2N+1.
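
As a rough sketch of what that looks like in code (the insertIntoSet function name and its signature are just mine for illustration, not from any library), inserting into an array-based set could work like this:

let insertIntoSet = function (set, value, index) {

    // Step 1: check every existing element so we don't insert a duplicate (up to N steps)
    for (let i = 0; i < set.length; i++) {
        if (set[i] === value) return false; // already in the set, insert nothing
    }

    // Step 2: insert at the requested position; inserting at the front means
    // shifting all N existing values, then dropping the new one in (up to N + 1 steps)
    set.splice(index, 0, value);
    return true;
}

Inserting "Chicago" at the front of our five-city set would take 5 steps to rule out a duplicate plus 6 more to shift and insert, for 2N+1 = 11 steps total.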

 

Sets are a useful data structure when you need to ensure there is no duplicate data, but that extra search can slow down insertions. It's important to consider the needs of your application and weigh which type of data structure is preferable.

 

Sorted Arrays

The second data structure we're going to take a look at is the sorted array. A sorted array is simply an array in which the values are always kept in a specific order. Reading and deleting from a sorted array still take the same number of steps as with a standard array. Inserting into a sorted array, however, requires more steps than inserting into a standard array, because the computer first has to search the array to determine where the new value should go. The sorted array has an advantage when it comes to searching, though.
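
A minimal sketch of that insertion, assuming a hypothetical helper called insertIntoSortedArray, might look like this:

let insertIntoSortedArray = function (arr, value) {

    // Search from the left until we pass the spot where the value belongs
    let i = 0;
    while (i < arr.length && arr[i] < value) {
        i++;
    }

    // Insert at that position, shifting the remaining values to the right
    arr.splice(i, 0, value);
}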

 

With linear search on a standard array, the computer checks each index one by one until it finds the value. If the value isn't in the array at all, the computer has to check every index, meaning a search of an array of N length could take N steps. With a sorted array, the computer only has to search until it passes the point where the value would be in order to determine whether the value exists. Taking our array from earlier and sorting it alphabetically, our sorted array would look like this:

 

cities = ["Athens", "Dublin", "Rome", "Sparta", "Springfield"]

 

If we wanted to search this sorted array for "Chicago", the operation would only take the computer 2 steps. First, it checks cities[0] and finds "Athens", which is not "Chicago" but comes before it alphabetically, so the search continues. Second, it checks cities[1] and finds "Dublin", which comes after "Chicago" alphabetically. At that point the computer can already determine that "Chicago" is not in the sorted array and doesn't need to perform any more steps searching for it.
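
Here's a small sketch of that idea, with a hypothetical searchSortedArray function that stops as soon as it passes the spot where the value would be:

let searchSortedArray = function (arr, value) {
    for (let i = 0; i < arr.length; i++) {
        if (arr[i] === value) return true;  // found it
        if (arr[i] > value) return false;   // we've passed where it would be, so it's not here
    }
    return false; // checked every element without finding it
}

Calling searchSortedArray(cities, "Chicago") returns false after checking only "Athens" and "Dublin".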

 

Binary Search

So far, the only search algorithm we've used with our data structures is linear search. When we have a sorted array, however, we can improve our search operation with a different, much faster algorithm called binary search.

 

Binary search uses a divide-and-conquer strategy to drastically improve our search operation. Consider this number guessing game: if I asked you to guess a number between 1 and 100, and after each guess I told you whether my number was higher or lower, what would be the best strategy for finding it? First, you'd guess 50, because guessing halfway eliminates half of the possible options no matter what the number is, as soon as I say "higher" or "lower". If I said "lower", the next best guess would be 25, eliminating half of the remaining numbers again.

 

Using this method, the maximum number of guesses you'd have to make before finding the number is 7. If you were to start at 1 and keep guessing the next number, akin to linear search, it could take you up to 100 guesses. Furthermore, each time we double the number of values we have to check, the maximum number of steps for binary search only increases by 1, which makes sense when you think about it: because each guess eliminates half of the remaining possibilities, we need to double the number of possibilities to add a single step. Binary search on 200 values would take at most 8 steps, while linear search could take 200.

 

Here is a JavaScript implementation of binary search in action:

 

let binarySearch = function (arr, x) {

    let start = 0, end = arr.length - 1;

    // Iterate while the search range is not empty
    while (start <= end) {

        // Find the middle index
        let mid = Math.floor((start + end) / 2);

        // If the element is present at mid, return true
        if (arr[mid] === x) return true;

        // Otherwise, continue in the left or right half accordingly
        else if (arr[mid] < x)
            start = mid + 1;
        else
            end = mid - 1;
    }

    return false;
}
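
To see it in action, here's how it could be called on our sorted cities array from earlier (the comments note how many elements each call ends up checking):

binarySearch(cities, "Sparta");  // true, after checking 2 elements ("Rome", then "Sparta")
binarySearch(cities, "Chicago"); // false, after checking 3 elements ("Rome", "Athens", "Dublin")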

  

We already know the best way to measure an algorithm's efficiency is by determining the number of steps it takes. Next we'll learn a system called Big O Notation that we can use to quantify the efficiency of an algorithm concisely and consistently.

 

Big O Notation

Big O Notation comes from the math world, and we won't dive too deeply into its mathematical jargon or complexities. We'll instead cover, at a high level, some of the ways we're going to use it in our context as it relates to programming.

 

O(1)

What about the read operation, which so far has always taken the computer just one step? We refer to that as O(1). An interesting thing about O(1) is that it doesn't necessarily mean the operation takes only one step; it means that no matter how many elements a data structure holds, the number of steps to complete the operation stays the same. Whether an algorithm takes 1, 4, or 100 steps to complete a task, as long as it takes that number of steps every time regardless of the size of the data structure, it is considered O(1). O(1) is considered the fastest kind of algorithm because as the data grows, the performance of the algorithm stays the same, which is generally a huge advantage. O(1) is also referred to as constant time, because the number of steps taken stays constant regardless of the size of the data.
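
As a tiny sketch (the readIndex name is just mine), an array read is O(1) because it takes the same single step whether the array holds 5 elements or 5 million:

let readIndex = function (arr, i) {
    return arr[i]; // one step, regardless of arr.length
}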

 

O(N)

Several times already, when analyzing the number of steps a computer takes to perform an operation, I've said "for an array of N length, an operation could take N steps." This is what O(N) refers to. O(N) is also known as linear time, because as the size of the data increases, the number of steps the algorithm takes grows with it at the same rate.
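
Linear search on a standard array, which we've leaned on throughout this post, is a classic O(N) operation. A quick sketch (with an illustrative linearSearch name):

let linearSearch = function (arr, x) {
    // Worst case: every one of the N elements gets checked
    for (let i = 0; i < arr.length; i++) {
        if (arr[i] === x) return true;
    }
    return false;
}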

 

O(log N)

What about binary search? The number of steps the algorithm takes grows as the size of the data grows, so it's not constant. But it doesn't grow in a linear fashion either. Algorithms like this are referred to as O(log N) and are said to run in log time. O(log N) describes an algorithm that, like binary search, increases by one step each time the data is doubled in size. "Log" in O(log N) is short for logarithm. Logarithms are the inverse of exponents: for example, log₁₀(100) = 2, whereas 10² = 100. In binary search's case the base is 2, since each step halves the data: 2⁷ = 128 ≥ 100, which is why guessing a number between 1 and 100 takes at most 7 guesses.
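
One quick way to see this relationship in code (a small illustrative snippet, not part of the original example): the worst-case number of binary search steps is roughly log₂ of the array's size, so doubling the size only adds one step.

// Worst-case binary search steps for arrays of doubling sizes
for (let n of [100, 200, 400, 800, 1600]) {
    console.log(n, Math.ceil(Math.log2(n))); // 7, 8, 9, 10, 11
}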

 

There is more to Big O than what we have here, and we will continue building onto our understanding of Big O as we learn more data structures and algorithms. These notations cover what we have encountered so far, so it is a good stopping point for this post. If any of this stuff didn't make sense to you, check out my first post on Data Structures & Algorithms (101). If it still doesn't make sense, or you have anything else to add or talk about, let me know in the comments below. Keep learning, and if this stuff interests you I'd recommend picking up and reading A Common-Sense Guide to Data Structures and Algorithms, by Jay Wengrow.



