Finding keywords-in-context
Metin Yazici
2021-09-14
5 min read

Last month I published my first TypeScript package on npm, kwic-ts, along with an example application that demonstrates its usage.

I thought it'd be nice to scribble down some words about this journey: why and how I started, what I learned, and what difficulties I faced along the way.

Roughly speaking, I spent about 4 months on this project, working on it for 1 to 1.5 hours per week on average.

Why build it?

There is another KWIC package in the npm registry. I didn't try or run it myself, but I read the source code and realized that the package does more than it is supposed to, such as stopword handling and punctuation removal.

As I strongly believe in the UNIX philosophy that each program should do one thing and do it well, I decided to give it a shot and create a new package that does one thing: finding keywords-in-context.
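For readers new to the idea, here is a minimal sketch of what "finding keywords-in-context" means: for each occurrence of a keyword, return a few words to its left and right. The names `KwicHit`, `findKwic`, and `window` are made up for this illustration and are not the kwic-ts API. The sketch also assumes a naive split on whitespace and a case-insensitive exact match.

```typescript
// Hypothetical result shape for one hit; not taken from kwic-ts.
interface KwicHit {
  left: string[]; // up to `window` tokens before the keyword
  keyword: string; // the matched token as it appears in the text
  right: string[]; // up to `window` tokens after the keyword
}

// Naive keyword-in-context search over whitespace-separated tokens.
function findKwic(text: string, keyword: string, window = 3): KwicHit[] {
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const needle = keyword.toLowerCase();
  const hits: KwicHit[] = [];

  tokens.forEach((token, i) => {
    if (token.toLowerCase() === needle) {
      hits.push({
        left: tokens.slice(Math.max(0, i - window), i),
        keyword: token,
        right: tokens.slice(i + 1, i + 1 + window),
      });
    }
  });

  return hits;
}

// Two hits for "the", each with up to two words of context per side.
const hits = findKwic("the quick brown fox jumps over the lazy dog", "the", 2);
console.log(hits);
```

Note that this version only knows about words, not about where they sit in the original string.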

The beginning of the project

The first step was to read about how to start a project in TypeScript. I was then ready to set up the fundamentals, namely git, npm, and typescript.

As I've internalized the wisdom of taking a moment to think and design before wildly starting the implementation, I started off with a simple file where I sketched what the API would look like. Therefore, my first commit after the initial one didn't contain any serious code, just a concise file with some design sketches.

Although I'm not a maximalist of Test Driven Development, I definitely appreciated having tests in such a relatively small project, where the output can mostly be predicted. My next step was setting up a test suite with jest and ts-jest. As the project moved forward, having tests was such a relief.

First unplanned feature

After some progress, I thought it could be a good idea to create an example app along with the package to showcase its capabilities.

While developing the example app, I realized that I could highlight the matched words inside the <textarea>, so that users can edit the text and see the matches in the same place. Although it was definitely possible to implement this on my own, I decided to go for an existing solution. I came across a React package called react-highlight-within-textarea that highlights text inside a <textarea> easily.

At this point, I realized that I had to find the character range of each matched word; simple concordance matching wouldn't be enough. I decided to reconsider my plan and tackle this next.

Solving the first bug

I quickly implemented an algorithm that calculates the range of the words from all the matches. Although it seemed to work in the beginning, after some testing with the app, I realized that it was broken by design.

The cause was an assumption in my tokenization: it stripped all extra whitespace away and tokenized only the words, so the positions I computed no longer lined up with the original text. I should have tokenized not only the words but also the extra spaces. Even though I thought that was the right direction, I wanted to be extra sure. I then checked similar packages in both Python and R, and found that they take a similar approach to tokenizing text.
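To make the whitespace point concrete, here is a small sketch of a tokenizer that keeps whitespace runs as tokens, so every token carries its exact offsets in the original string. This is only an illustration of the idea under my own assumptions, not the actual kwic-ts implementation; `Token`, `tokenize`, and `keywordRanges` are hypothetical names.

```typescript
// Hypothetical token shape; names are illustrative, not taken from kwic-ts.
interface Token {
  text: string;
  start: number; // inclusive character offset in the original string
  end: number; // exclusive character offset
  isWord: boolean;
}

// Split into alternating runs of whitespace and non-whitespace, keeping both,
// so that concatenating all tokens reproduces the input exactly.
function tokenize(input: string): Token[] {
  const tokens: Token[] = [];
  const pattern = /\s+|\S+/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(input)) !== null) {
    const text = match[0];
    tokens.push({
      text,
      start: match.index,
      end: match.index + text.length,
      isWord: !/^\s/.test(text),
    });
  }
  return tokens;
}

// Character ranges of every word token equal to `keyword` (case-insensitive).
function keywordRanges(input: string, keyword: string): Array<[number, number]> {
  const needle = keyword.toLowerCase();
  return tokenize(input)
    .filter((t) => t.isWord && t.text.toLowerCase() === needle)
    .map((t): [number, number] => [t.start, t.end]);
}

// Irregular spacing and a newline do not shift the reported offsets.
const sample = "to  be or\nnot   to be";
const ranges = keywordRanges(sample, "to");
console.log(ranges); // [[0, 2], [16, 18]]
```

Because no characters are dropped, `sample.slice(start, end)` returns the matched word for every range, which is the property a highlighter working on the raw textarea content needs.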

As a rule, I review existing work to learn more about the subject and to validate my initial ideas; this time, I admit, I was a bit late in doing so.

Later, I changed how tokenization works (#3), and I am glad that it worked well.

About TypeScript

I can say that my first serious experience with TypeScript packaging was quite good. I hadn't had a good chance before to create a finished project in a typed language. These are the main things I really liked about TypeScript:

  1. It's still JavaScript in essence
  2. Static typing changes the mentality of how you program
  3. TypeScript LSP shines for auto-suggestions and linting

The tech stack I used: typescript as the language; jest, babel-jest, and ts-jest for testing; tsconfig-paths and tsc-alias for absolute path imports; eslint for linting; and prettier for formatting.

Although I'm fairly content, I was a bit surprised that some of the TypeScript tooling still relies on external packages for things that I think TypeScript itself should have handled a long time ago, e.g. absolute path resolution.

Finishing up

After the ranges feature and a working example application, I was ready to publish the package to the npm registry. I had to unpublish the first few versions, which I had pushed while testing; that went fine, since npm allows unpublishing a package within 72 hours of publishing it.

Probably due to the notorious left-pad incident, npm has a strict policy when it comes to unpublishing packages. I also couldn't publish the package again under the same version number, even though that version had been deleted from npm; I had to bump the patch version. This matches npm's unpublish policy: once a name and version combination has been used, it cannot be reused, even after it is unpublished.
