Beginner Exercism • python

Hamming

Challenge Overview

Calculate the Hamming distance between two DNA strands.

Introduction

Your body is made up of cells that contain DNA. Those cells regularly wear out and need replacing, which they achieve by dividing into daughter cells. In fact, the average human body experiences about 10 quadrillion cell divisions in a lifetime!

When cells divide, their DNA replicates too. Sometimes during this process mistakes happen and single pieces of DNA get encoded with the incorrect information. If we compare two strands of DNA and count the differences between them, we can see how many mistakes occurred. This is known as the “Hamming distance”.

The Hamming distance is useful in many areas of science, not just biology, so it’s a nice phrase to be familiar with :)

Instructions

We read DNA using the letters C, A, G and T. Two strands might look like this:

GAGCCTACTAACGGGAT
CATCGTAATGACGGCCT
^ ^ ^  ^ ^    ^^

They have 7 differences, and therefore the Hamming distance is 7.
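To check the count by hand, the following sketch lists the positions where the two example strands differ. The variable names are illustrative and not part of the exercise.

```python
strand_a = "GAGCCTACTAACGGGAT"
strand_b = "CATCGTAATGACGGCCT"

# Collect every index at which the two strands hold different letters.
mismatches = [
    index
    for index in range(len(strand_a))
    if strand_a[index] != strand_b[index]
]

print(mismatches)       # [0, 2, 4, 7, 9, 14, 15]
print(len(mismatches))  # 7
```

The indexes line up with the caret markers shown under the two strands.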

Implementation notes

The Hamming distance is only defined for sequences of equal length, so an attempt to calculate it between sequences of different lengths should fail with an error rather than return a number. The Python solutions below raise a ValueError in that case.

Dig Deeper

Range

def distance(strand_a, strand_b):
    if len(strand_a) != len(strand_b):
        raise ValueError("Strands must be of equal length.")
    count = 0
    for index in range(len(strand_a)):
        if strand_a[index] != strand_b[index]:
            count += 1
    return count

This approach starts by checking if the two strands are of equal length. If not, a ValueError is raised.

After that check, a count variable is initialized to 0. It keeps track of the number of differences between the two strands.

range is a Python built-in that produces an immutable sequence of integers. It always needs a stop argument and is never infinite: the sequence runs up to, but does not include, stop. range can also take a start argument and a step argument, in the form range(start, stop, step). When only stop is provided, start defaults to 0 and step defaults to 1.
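A few quick calls show how the start, stop and step arguments shape the sequence. This is a small illustration only; the strand value is made up.

```python
# With only stop, counting starts at 0 and stop itself is excluded.
print(list(range(5)))         # [0, 1, 2, 3, 4]

# start and stop together.
print(list(range(2, 6)))      # [2, 3, 4, 5]

# start, stop and step.
print(list(range(0, 10, 3)))  # [0, 3, 6, 9]

# The indexes of a four-letter strand.
strand = "GATC"
indexes = list(range(len(strand)))
print(indexes)                # [0, 1, 2, 3]
```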

We use range to iterate over the indexes of the strand_a string. We do that by passing the length of the string to range by using the built-in function len. Iterating over range gives us an index number we can use to access a character in the string.

We can then compare the character at the index in strand_a to the character at the same index in strand_b. If the two values are not equal, we increment the count variable by 1.

After the loop completes, we return the count variable.
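As a rough usage sketch, the range-based function can be exercised with one pair of equal-length strands and one pair of unequal length. The input strings here are arbitrary examples, and the function is repeated only so the snippet runs on its own.

```python
def distance(strand_a, strand_b):
    # Same range-based approach as described above.
    if len(strand_a) != len(strand_b):
        raise ValueError("Strands must be of equal length.")
    count = 0
    for index in range(len(strand_a)):
        if strand_a[index] != strand_b[index]:
            count += 1
    return count


same_length = distance("GGACGG", "AGGACG")
print(same_length)  # 4

try:
    distance("AATG", "AAA")
except ValueError as error:
    message = str(error)
    print(message)  # Strands must be of equal length.
```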

Zip

def distance(strand_a, strand_b):
    if len(strand_a) != len(strand_b):
        raise ValueError("Strands must be of equal length.")
    count = 0
    for nucleotide_a, nucleotide_b in zip(strand_a, strand_b):
        if nucleotide_a != nucleotide_b:
            count += 1
    return count

This approach starts by checking if the two strands are of equal length by using len. If not, a ValueError is raised.

After that check, a count variable is initialized to 0. It keeps track of the number of differences between the two strands.

We use zip to iterate over the characters in strand_a and strand_b simultaneously. zip is a built-in function that takes any number of iterables and returns an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument iterables. For example, the first tuple contains the first element from each iterable, the second tuple contains the second element from each iterable, and so on until the shortest iterable is exhausted.

In Python, strings are iterable.

Here is an example of using zip to iterate over two strings:

>>> zipped = zip("GGACGG", "AGGACG")
>>> list(zipped)
[('G', 'A'), ('G', 'G'), ('A', 'G'), ('C', 'A'), ('G', 'C'), ('G', 'G')]
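Because zip stops at the shortest iterable, it would silently ignore the extra characters of a longer strand. The sketch below, with made-up strands, suggests why the length check has to come before the loop.

```python
longer = "GGACGGTT"
shorter = "AGGACG"

pairs = list(zip(longer, shorter))

# Only six pairs are produced; the trailing "TT" never appears.
print(len(pairs))  # 6
print(pairs[-1])   # ('G', 'G')
```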

We then loop over the iterator returned by zip and unpack each tuple into two variables, nucleotide_a and nucleotide_b. You can read more about unpacking in the unpacking and multiple assignment concept (concept:python/unpacking-and-multiple-assignment).

We then compare the characters nucleotide_a and nucleotide_b. If they are not equal, we increment the count variable by 1.

After the loop is finished, we return the count variable.

Sum

The benefit of using sum is that we can pass it a generator expression that yields one boolean per pair of nucleotides, without building a list first. sum iterates through the generator and adds up the booleans, treating True as 1 and False as 0, and returns the total.

This can make the code a bit more concise.

Here is an example using sum with zip:

def distance(strand_a, strand_b):
    if len(strand_a) != len(strand_b):
        raise ValueError("Strands must be of equal length.")
    return sum(nucleotide_a != nucleotide_b for nucleotide_a, nucleotide_b in zip(strand_a, strand_b))

This approach starts by checking if the two strands are of equal length by using len. If not, a ValueError is raised.

Unlike the range and zip approaches, this one needs no count variable: sum does the accumulating.

This approach uses the zip function to iterate over two strings. You can read more about how to solve this exercise with zip in the zip approach.

What differs in this approach is that we use a generator expression to produce booleans. The generator expression iterates over the tuples returned by zip and unpacks each one into two variables, nucleotide_a and nucleotide_b. It then compares nucleotide_a and nucleotide_b: if they are not equal, True is produced; if they are equal, False is produced. The generator expression is then passed to the sum function.
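To see what the generator expression yields before sum gets involved, the following sketch consumes it with list. The names differences and flags are illustrative only.

```python
strand_a = "GGACGG"
strand_b = "AGGACG"

# The generator expression is lazy: nothing is compared until it is consumed.
differences = (
    nucleotide_a != nucleotide_b
    for nucleotide_a, nucleotide_b in zip(strand_a, strand_b)
)

# list() consumes the generator so the individual booleans can be inspected.
flags = list(differences)
print(flags)  # [True, False, True, True, True, False]
```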

sum then iterates over the generator expression and adds up all the booleans, treating True as 1 and False as 0. You can read more about this behavior in Boolean as numbers. Finally, the total is returned.
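The treatment of booleans as numbers can be checked directly. A brief illustration, independent of the exercise code:

```python
# bool is a subclass of int, so booleans take part in arithmetic.
print(issubclass(bool, int))  # True
print(True + True)            # 2
print(True + False)           # 1

# sum adds the booleans as 1s and 0s.
total = sum([True, False, True, True])
print(total)                  # 3
```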

This approach is also doable with range but it is a bit more verbose:

def distance(strand_a, strand_b):
    if len(strand_a) != len(strand_b):
        raise ValueError("Strands must be of equal length.")
    return sum(strand_a[index] != strand_b[index] for index in range(len(strand_a)))

Source: Exercism python/hamming

Solution

def distance(strand_a, strand_b):
    if len(strand_a) != len(strand_b):
        raise ValueError("Strands must be of equal length")
    return sum(a != b for a, b in zip(strand_a, strand_b))