Menengah Exercism • rust

Isogram

Ringkasan Tantangan

Determine if a word or phrase is an isogram.

Dig Deeper

Filter with All on a HashSet

`filter` and `map` with `all` on a `HashSet`

use std::collections::HashSet;

pub fn check(candidate: &str) -> bool {
    let mut hs = HashSet::new();
    candidate
        .bytes()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_lowercase())
        .all(|c| hs.insert(c))
}

With this approach you will instantiate and update a HashSet to keep track of the used letters.

A use declaration allows directly calling HashSet instead of calling it with its entire namespace. Without the use declaration, the HashSet would be instantiated like so

let mut hs = std::collections::HashSet::new();

After the HashSet is instantiated, a series of functions are chained from the candidate &str.

Since all of the characters are ASCII, they can be iterated with the bytes method. Each byte is iterated as a u8, which is an unsigned 8-bit integer.
The filter method borrows each byte as a reference to a u8 (&u8). Inside of its closure it tests each byte to see if it is_ascii_alphabetic. Only bytes which are ASCII letters will survive the filter to be passed on to the map method.
The map method calls to_ascii_lowercase on each byte.
Each lowercased byte is then tested by the all method by using the insert method of HashSet. all will return true if every call to insert returns true. If a call to insert returns false then all will “short-circuit” and immediately return false. The insert method returns whether the value is newly inserted. So, for the word "alpha", insert will return true when the first a is inserted, but will return false when the second a is inserted.

Refactoring

using the `str` method to_ascii_lowercase and no `map`

You might want to to call the str method to_ascii_lowercase and save calling map, like so

candidate
    .to_ascii_lowercase()
    .bytes()
    .filter(|c| c.is_ascii_alphabetic())
    .all(|c| hs.insert(c))

However, changing the case of all characters in a str raised the average benchmark a few nanoseconds. It is a bit faster to filter out non-ASCII letters and to change the case of each surviving byte. Since the performance is fairly close, either may be prefered.

using `filter_map`

Since filter and map are used, this approach could be refactored using the filter_map method.

use std::collections::HashSet;

pub fn check(candidate: &str) -> bool {
    let mut hs = HashSet::new();
    candidate
        .bytes()
        .filter_map(|c| c.is_ascii_alphabetic().then(|| c.to_ascii_lowercase()))
        .all(|c| hs.insert(c))
}

By chaining the then method to the result of is_ascii_alphabetic, and calling to_ascii_lowercase in the closure for then, the filter map passes only lowercased ASCII letter bytes to the all method. In benchmarking, this approach was slightly slower, but its style may be prefered.

supporting Unicode

By substituting the chars method for the bytes method, and by using the char methods is_alphabetic and to_lowercase, this approach can support Unicode characters.

use std::collections::HashSet;

pub fn check(candidate: &str) -> bool {
    let mut hs = std::collections::HashSet::new();
    candidate
        .chars()
        .filter(|c| c.is_alphabetic())
        .map(|c| c.to_lowercase().to_string())
        .all(|c| hs.insert(c))
}

Usually an approach that supports Unicode will be slower than one that supports only bytes. However the benchmark for this approach was significantly slower, taking more than twice as long as the bytes approach. It can be further refactored to use the str to_lowercase method and remove the map method to cut the benchmark down closer to the byte approach.

use std::collections::HashSet;

pub fn check(candidate: &str) -> bool {
    let mut hs = std::collections::HashSet::new();
    candidate
        .to_lowercase()
        .chars()
        .filter(|c| c.is_alphabetic())
        .all(|c| hs.insert(c))
}

To more completely support Unicode, an external crate, such as unicode-segmentation, could be used. This is becasue the std::char can not fully handle things such as grapheme clusters.

Bit field using a for loop

Bit field using a `for` loop

const A_LCASE: u8 = 97;
const Z_LCASE: u8 = 122;
const A_UCASE: u8 = 65;
const Z_UCASE: u8 = 90;

pub fn check(candidate: &str) -> bool {
    let mut letter_flags: u32 = 0;

    for letter in candidate.bytes() {
        if letter >= A_LCASE && letter <= Z_LCASE {
            if letter_flags & (1 << (letter - A_LCASE)) != 0 {
                return false;
            } else {
                letter_flags |= 1 << (letter - A_LCASE);
            }
        } else if letter >= A_UCASE && letter <= Z_UCASE {
            if letter_flags & (1 << (letter - A_UCASE)) != 0 {
                return false;
            } else {
                letter_flags |= 1 << (letter - A_UCASE);
            }
        }
    }
    return true;
}

This solution uses the ASCII value of the letter to set the corresponding bit position.

First, some const values are set. These values will be used for readability in the body of the check function.

An unsigned 32-bit integer (u32) will be used to hold the twenty-six bits needed to keep track of the letters in the English alphabet.

The for loop loops through the bytes of candidate. Each letter is a u8 which is tested for being a through z or A through Z. The ASCII values defined as the const values are used for that. The ASCII value for a is 97, and for z is 122. The ASCII value for A is 65, and for Z is 90.

If the lower-cased letter is subtracted by a, then a will result in 0, because 97 minus 97 equals 0. z would result in 25, because 122 minus 97 equals 25. So a would have 1 shifted left 0 places (so not shifted at all) and z would have 1 shifted left 25 places.
If the upper-cased letter is subtracted by A, then A will result in 0, because 65 minus 65 equals 0. Z would result in 25, because 90 minus 65 equals 25. So A would have 1 shifted left 0 places (so not shifted at all) and Z would have 1 shifted left 25 places.

In that way, both a lower-cased z and an upper-cased Z can share the same position in the bit field.

So, for an unsigned thirty-two bit integer, if the values for a and Z were both set, the bits would look like

      zyxwvutsrqponmlkjihgfedcba
00000010000000000000000000000001

You can use the bitwise AND operator to check if a bit has already been set. If it has been set, you know the letter is duplicated and you can immediately return false. If it has not been set, you can use the bitwise OR operator to set the bit. If the loop completes without finding a duplicate letter (and returning false), the function returns true.

Bit field used functionally

const A_LCASE: u8 = 97;

pub fn check(candidate: &str) -> bool {
    candidate
        .bytes()
        .filter_map(|c| {
            c.is_ascii_alphabetic()
                .then(|| 1u32 << (c.to_ascii_lowercase() - A_LCASE))
        })
        .try_fold(0u32, |ltr_flags, ltr| {
            (ltr_flags & ltr == 0).then(|| ltr_flags | ltr)
        })
        .is_some()
}

This solution uses the ASCII value of the letter to set the corresponding bit position.

First, a const value is set with the ASCII value for a.

Since all of the characters are ASCII, they can be iterated with the bytes method. Each byte is iterated as a u8, which is an unsigned 8-bit integer, and is passed to the filter_map method.
The closure inside filter_map first tests if the byte is_ascii_alphabetic. If so, the byte is passed to the then method, where the byte is set to_ascii_lowercase in its closure. To understand what else is happening in then, consider the following:
If the lower-cased letter is subtracted by a, then a will result in 0, because 97 minus 97 equals 0. z would result in 25, because 122 minus 97 equals 25. So a would have 1u32 shifted left 0 places (so not shifted at all) and z would have 1 shifted left 25 places.

So, for the unsigned thirty-two bit integer (1u32), the value for a would look like

      zyxwvutsrqponmlkjihgfedcba
00000000000000000000000000000001

and the value for z would look like

      zyxwvutsrqponmlkjihgfedcba
00000010000000000000000000000000

The filter map passes only lowercased ASCII letter bytes converted to u32 values to the try_fold method.

        // code snipped
        .try_fold(0u32, |ltr_flags, ltr| {
            (ltr_flags & ltr == 0).then(|| ltr_flags | ltr)
        })
        .is_some()
}

The try_fold has its accumulator set to a u32 initialized to 0. The closure inside try_fold uses the bitwise AND operator to check if the bit for the letter position has not already been set.
If it has been set, you know the letter is duplicated and try_fold will “short circuit” and immediately pass None to the is_some method, which willl return false.
If it has not been set, the bitwise OR operator is used in the then method to set the bit. If all of the iterations of try_fold complete without finding a duplicate letter (and returning None), the function returns true from the is_some method.

Refactoring

Since a filter_map is used, this approach could be refactored to use filter and a map.

pub fn check_bits(candidate: &str) -> bool {
    candidate
        .bytes()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| 1u32 << (c.to_ascii_lowercase() - A_LCASE))
        .try_fold(0u32, |ltr_flags, ltr| {
            (ltr_flags & ltr == 0).then(|| ltr_flags | ltr)
        })
        .is_some()
}

In benchmarking, this approach was slightly slower, but its style may be preferred.

Source: Exercism rust/isogram