Text Search to Widen the Gaze

Mar 21, 2021

A simple self-review habit that converts some "unknown unknowns" into "known unknowns": use text search to find related code in a codebase too large to memorize.

The Habit

Before I open a pull request and ask colleagues for feedback, I like to take a step back and review it myself, using a so-called self-review checklist.

I've come to like one remarkably simple item on that checklist so much that I now tend to do it a couple of times during development, not just at the end.

I call it "Text Search to Widen the Gaze".

Just look at the git diff, choose a couple of key parts of the code, and text search for them in the whole code base of your company.

Surprisingly often this leads to further insights, such as:

There is duplicate code you should update as well.
The bug you are fixing exists in another few places.
A colleague already coded something similar and you should likely talk with them.

Example

You are a Java backend developer currently collaborating closely with the React frontend team, building the endpoints they need for that new frontend app.

One of the endpoints returns a list of images to show in the frontend and the images have names:

[
    {
        "url": "https://storage.googleapis.com/bucket/path/to/image1.jpg",
        "name": "Image 1"
    },
    {
        "url": "https://storage.googleapis.com/bucket/path/to/image10.jpg",
        "name": "Image 10"
    },
    {
        "url": "https://storage.googleapis.com/bucket/path/to/image11.jpg",
        "name": "Image 11"
    },
    {
        "url": "https://storage.googleapis.com/bucket/path/to/image2.jpg",
        "name": "Image 2"
    },
    ...
    {
        "url": "https://storage.googleapis.com/bucket/path/to/image9.jpg",
        "name": "Image 9"
    }
]

The frontend team are asking you to alter the sorting in the endpoint. They want the images sorted by name but in such a way that "Image 10" and "Image 11" appear at the end and not after "Image 1".
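To see the difference between the two orderings quickly, here is a small sketch using the `sort` command. The `-V` (version sort) flag is only a stand-in for the idea: it compares runs of digits as numbers, but it is not the algorithm used by any Java library mentioned in this post. I'm assuming a `sort` that supports `-V`, such as the one in GNU coreutils.

```shell
#!/usr/bin/env bash

# The image names from the example, in no particular order.
names=$'Image 2\nImage 10\nImage 1\nImage 9\nImage 11'

# Plain character-by-character comparison, similar to what
# Comparator.comparing(Image::getName) does for these ASCII names.
plain="$(printf '%s\n' "${names}" | LC_ALL=C sort)"

# Version sort: runs of digits are compared as numbers.
natural="$(printf '%s\n' "${names}" | LC_ALL=C sort -V)"

echo "${plain}"
# Image 1
# Image 10
# Image 11
# Image 2
# Image 9

echo "${natural}"
# Image 1
# Image 2
# Image 9
# Image 10
# Image 11
```

The second ordering is the one the frontend team is asking for.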

After some Googling you find a couple of articles on the subject and an open source library implementing this so-called "alphanum" or "human natural" sort: https://github.com/gpanther/java-nat-sort

You add the dependency to the pom.xml

<dependency>
    <groupId>net.grey-panther</groupId>
    <artifactId>natural-comparator</artifactId>
    <version>1.1</version>
</dependency>

and rewrite the following old code

domainObject.getImages().stream()
    .sorted(Comparator.comparing(Image::getName))
    .collect(Collectors.toList());

into this new code

domainObject.getImages().stream()
    .sorted(Comparator.comparing(Image::getName, CaseInsensitiveSimpleNaturalComparator.getInstance()))
    .collect(Collectors.toList());

Before opening the pull request you choose to widen the gaze using text search.

You first choose some interesting text to search for based on exploratory questions:

Question1: Is this Maven artifact already used elsewhere?
Text: <artifactId>natural-comparator</artifactId>

Question2: Is some other implementation of the algorithm used elsewhere?
Text: NaturalComparator
Text: AlphanumComparator

Question3: Are there other sorted streams that would be better off sorted with this comparator?
Text: .sorted(Comparator.comparing(
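If you prefer the terminal over an IDE, the searches above could be sketched like this. The function name `search_code` and the `~/code` directory are my own assumptions for illustration; the only tool used is `grep`. Note the `-F` flag: the search texts contain characters like `.`, `(` and `<`, and you want them matched literally rather than as a regular expression.

```shell
#!/usr/bin/env bash

# List every file below a directory that contains the exact text.
#   -r  search directories recursively
#   -l  print only the names of matching files
#   -F  treat the text as a fixed string, not a regular expression
#   -e  marks the next argument as the pattern, even if it starts with "-"
# The .git directories are skipped since they only contain git internals.
function search_code {
    local text="${1}"
    local root="${2:-.}"
    grep -rlF --exclude-dir=".git" -e "${text}" "${root}"
}

# Example usage, assuming all repos are checked out below ~/code:
#   search_code '<artifactId>natural-comparator</artifactId>' ~/code
#   search_code 'NaturalComparator' ~/code
#   search_code '.sorted(Comparator.comparing(' ~/code
```

Listing only file names keeps the output short enough to skim. Drop the `-l` and add `-n` if you would rather see each matching line with its line number.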

Finding1: It turns out another team at the company already uses this Maven artifact extensively. You go talk with them for a bit about the pros and cons of this library, and the result is that they will bring up this kind of sorting at the next guild meeting, because it's useful and, oddly enough, not included in the JDK or Guava.

Finding2: It also turns out that another team is using another implementation of the algorithm. That one does not support case-insensitive sorting though, and they copied the Java class into their code. The result is that they will switch over to the Maven library you found, because they think it seems better.

Finding3: Finally you find another three places where sorting should likely be done using this comparator. You ask the frontend developers if it would be good to sort these too and they say that yes, that would make a lot of sense. You scope up the ticket slightly to include sorting in these other few places.

To Onboard Yourself

This habit can be really helpful when onboarding yourself at a new job. You get to see other parts of the codebase and you will have seen the code for a reason, making it easier to remember. You might also find good reasons to talk to colleagues, perhaps even in other teams than your own should you wish for it.

I suppose it can also be useful for real veterans at a company sometimes, especially if the company is large and the code evolves fast.

To Go that Extra Mile

While this habit might seem like something that will just create more work for you, I think that's a short-term effect. Long term it's a time saver if done right.

If you did just what the ticket said, then likely another three similar tickets will be written over the coming months, and you could have predicted that and saved your teammates some time.

The idea is not to dive deep into the rabbit hole or conduct endless yak shaving. The idea is to scope up the ticket just a little bit so that it's not just narrowly the ticket at hand. This will help you feel ownership of the company codebase and come across as a more responsible developer.

It also opens up these interesting conversations with colleagues about best practices where knowledge sharing happens.

In Practice: How to do it?

To search the whole company code base I find it convenient to check out all of the source code and have a project with all of it in IntelliJ.

Checking out all git repos manually can be tedious.

Here's how to automate it for GitHub:

function __tools_clone_pipe {
    eval "urls=( $(cat) )"
    local urls_size="${#urls[@]}"
    for (( index=0; index<${urls_size}; index++ )); do
        local url="${urls[${index}]}"
        git clone "${url}"
    done
}

function __tools_clone_github {
    # https://github.com/settings/tokens
    # The token only needs the repo permissions.
    local TOKEN="asdf"
    local ORG="asdf"
    curl --silent --header "Authorization: bearer ${TOKEN}" "https://api.github.com/orgs/${ORG}/repos?per_page=100&page=1" \
        | jq -r ".[] | select(.archived == false) | .ssh_url" \
        | __tools_clone_pipe
}
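Note that the function above only asks for `page=1`, so an organization with more than 100 repositories would need more requests. Here is a minimal sketch of one way to loop over the pages. The names `fetch_all_pages`, `github_repo_page` and `MAX_PAGES` are my own and hypothetical; the curl call just mirrors the one above with the page number made variable. The loop stops at the first page that prints nothing, which also means it would stop early if a whole page happened to contain only archived repos.

```shell
#!/usr/bin/env bash

# Print the output of every page until a page comes back empty.
# $1 is the name of a command or function that takes a page number
# and prints one result per line.
# MAX_PAGES is a safety cap of my own, not something GitHub requires.
function fetch_all_pages {
    local fetch_page="${1}"
    local max_pages="${MAX_PAGES:-50}"
    local page output
    for (( page=1; page<=max_pages; page++ )); do
        output="$("${fetch_page}" "${page}")"
        if [[ -z "${output}" ]]; then
            break
        fi
        echo "${output}"
    done
}

# Hypothetical page fetcher for GitHub.
# TOKEN and ORG are expected to be set by the caller.
function github_repo_page {
    local page="${1}"
    curl --silent --header "Authorization: bearer ${TOKEN}" \
        "https://api.github.com/orgs/${ORG}/repos?per_page=100&page=${page}" \
        | jq -r '.[] | select(.archived == false) | .ssh_url'
}

# Example usage:
#   fetch_all_pages github_repo_page | __tools_clone_pipe
```

Passing the page fetcher in by name keeps the loop independent of GitHub, so you can try it out with a fake fetcher before pointing it at the real API.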

Here's how to automate it for GitLab:

#!/usr/bin/env bash

# This script will clone all GitLab repositories within a specific group.
# The repositories will be cloned into a corresponding directory structure.
# Archived repositories will be excluded.

# Make sure you have the "jq" command installed:
# $ brew install jq

# Make sure you have added your SSH key to GitLab:
# https://gitlab.com/-/profile/keys

# Create a personal access token
# https://gitlab.com/-/profile/personal_access_tokens
GITLAB_TOKEN="TODO"

# Which group should we clone?
GITLAB_GROUP="TODO"

# Change the base url if you run self hosted.
GITLAB_BASE_URL="https://gitlab.com"

function repos {
    local GROUP="${1}"
    repos_projects "${GROUP}"
    for SUBGROUP in $(repos_subgroupids "${GROUP}"); do
        repos "${SUBGROUP}"
    done
}

function repos_projects {
    local GROUP="${1}"
    # GitLab caps per_page at 100, and only the first page is fetched here.
    curl --silent --header "Authorization: Bearer ${GITLAB_TOKEN}" "${GITLAB_BASE_URL}/api/v4/groups/${GROUP}/projects/?private=true&per_page=100&page=1" | jq '.[] | select(.archived == false) | .ssh_url_to_repo' | tr -d '"'
}

function repos_subgroupids {
    local GROUP="${1}"
    curl --silent --header "Authorization: Bearer ${GITLAB_TOKEN}" "${GITLAB_BASE_URL}/api/v4/groups/${GROUP}/subgroups" \
        | jq '.[] | .id'
}

function clone {
    local URL="${1}"

    local DIR="${URL}"
    # git@gitlab.com:group/subgroup/repo.git

    local DIR="${DIR#*/}"
    # subgroup/repo.git

    local DIR="${DIR%.*}"
    # subgroup/repo

    git clone "${URL}" "${DIR}"
}

function cloneeach {
    eval "local urls=( $(cat) )"

    local CL='\033[92m' # lime
    local CP='\033[95m' # pink
    local CA='\033[96m' # aqua
    local CD='\033[39m' # default

    urls_size="${#urls[@]}"
    for (( index=0; index<${urls_size}; index++ )); do
        url="${urls[${index}]}"
        index_human=$((index+1))
        echo -e "${CL}=======[ REPOSITORY${CP} ${index_human}/${urls_size}${CL} :${CA} ${url}${CL} ]=======${CD}"
        clone "${url}"
    done
}

echo "Finding all the repositories to be cloned..."
repos "${GITLAB_GROUP}" | cloneeach

Create a directory “code” and check out all repos into that directory. Next, if you use Maven, you can create a meta pom.xml for all of it by running this script inside that directory:

#!/usr/bin/env bash

# Generate meta pom.xml for all sub folders.

if [[ -f "pom.xml" ]]; then
    echo "A pom.xml already exists. Please delete or move it before running this command."
    exit 1 # False
fi

content='<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>se.oloflarsson</groupId>
    <artifactId>code</artifactId>
    <version>1.0.0-CODE-SNAPSHOT</version>

    <packaging>pom</packaging>

    <modules>
'

# Create entry for each file system child ...
entries=()
for directory in *; do
    # ... that is a directory ...
    if [[ ! -d "${directory}" ]]; then
        continue
    fi

    # ... and contains a pom.
    if [[ -f "${directory}/pom.xml" ]]; then
        entries+=("${directory}")
        continue
    fi
done

for entry in "${entries[@]}"; do
    content="${content}        <module>${entry}</module>
"
done

content="${content}    </modules>
</project>
"

echo "${content}" > "pom.xml"
echo "Successfully created pom.xml"

Open this pom.xml as a project in IntelliJ. Indexing all of it takes a while, but once it is complete you can search really fast.