This post is written by Viraj Anchan, Full Stack Engineer at Haptik.
There’s an old saying that goes, “You only get one chance to make a first impression.” And more often than not, that impression lasts. For a messaging app like Haptik, that first impression comes in the form of a text message from an assistant. That is why spelling, punctuation, and grammatical mistakes can ruin the user experience.
While Autocorrect can be useful in preventing embarrassing mistakes, it is not always right. On Wednesday, February 29, 2012, a high school student in Gainesville, Georgia, tried to send a friend a text with the message “gunna be at west hall today.” The Autocorrect feature on this student’s iPhone changed “gunna” to “gunman.” To make matters worse, the text was sent to the wrong number, and that put two Georgia schools on lockdown. At Haptik, we acknowledge the challenges and consequences that arise from the use of Autocorrect technologies.
In an ideal scenario, Autocorrect would consistently and correctly distinguish between what we actually type and what we intend to type with little to no user intervention. However, this utopian world is not the one we live in. Assistants at Haptik handle millions of queries across multiple domains. In today’s fast-paced social world, simple mistakes spread like wildfire, especially if they are typos. To prevent any such disaster, we use a combination of Autocorrect techniques.
We have created our own dictionary of commonly misspelled words. Our internal chat tool autocorrects misspelled words in real time, before the message is sent. The front-end corrects the misspelled words and then sends the message to the back-end for further correction. In the back-end, we use LanguageTool to autocorrect punctuation and grammatical mistakes.
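As a rough illustration of the dictionary step, here is a minimal sketch in Python. It is not our front-end code: the entries in COMMON_MISSPELLINGS and the function name autocorrect_words are hypothetical, chosen only to show how a lookup table of known misspellings can be applied word by word.

```python
import re

# Hypothetical sample entries; the real dictionary is not shown in this post.
COMMON_MISSPELLINGS = {
    "recieve": "receive",
    "definately": "definitely",
    "adress": "address",
}

# A "word" here is a run of letters and apostrophes.
WORD_RE = re.compile(r"[A-Za-z']+")


def autocorrect_words(text, corrections=COMMON_MISSPELLINGS):
    """Replace known misspellings, keeping a leading capital letter."""

    def fix(match):
        word = match.group(0)
        replacement = corrections.get(word.lower())
        if replacement is None:
            return word  # not in the dictionary: leave untouched
        if word[0].isupper():
            replacement = replacement[0].upper() + replacement[1:]
        return replacement

    return WORD_RE.sub(fix, text)


print(autocorrect_words("Definately, you will recieve it at your adress."))
# prints: Definitely, you will receive it at your address.
```

Because only words found in the dictionary are changed, a lookup like this cannot turn a correct word into a wrong one, which is the kind of failure described in the Georgia incident above.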
LanguageTool is an open-source, Java-based grammar checker for English, French, German, Polish, Romanian, and more than 20 other languages. It takes a text and returns a list of possible errors. To detect errors, each word of the text is assigned its part-of-speech tag and each sentence is split into chunks, e.g. noun phrases. Then the text is matched against all of the checker’s pre-defined error rules. If a rule matches, the text is assumed to contain an error at the position of the match. The rules describe errors as patterns of words, part-of-speech tags, and chunks. Each rule also includes an explanation of the error.
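To make the matching idea concrete, here is a toy sketch in Python. It is not LanguageTool’s implementation: the tokens are tagged by hand, chunking is left out, and the rule format (a list of small dictionaries) and the names token_matches and find_matches are invented for this illustration.

```python
# A hypothetical rule: the word "this" followed by a plural noun (tag NNS).
RULE = {
    "pattern": [{"word": "this"}, {"postag": "NNS"}],
    "message": "Did you mean 'these'?",
}


def token_matches(token, spec):
    """Check one (word, tag) token against one pattern element."""
    word, tag = token
    if "word" in spec and word.lower() != spec["word"]:
        return False
    if "postag" in spec and tag != spec["postag"]:
        return False
    return True


def find_matches(tokens, rule):
    """Slide the rule's pattern over the tokens and report each hit."""
    pattern = rule["pattern"]
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(token_matches(t, s) for t, s in zip(window, pattern)):
            hits.append((start, rule["message"]))
    return hits


# Hand-tagged input; a real checker would run a part-of-speech tagger here.
tagged = [("This", "DT"), ("errors", "NNS"), ("are", "VBP"), ("easy", "JJ")]
print(find_matches(tagged, RULE))
# prints: [(0, "Did you mean 'these'?")]
```

The match position (token 0 here) is what lets a checker point at the exact place in the text where the suspected error sits.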
To improve accuracy, we have disabled the spell check rules (HUNSPELL_RULE, HUNSPELL_NO_SUGGEST_RULE, and MORFOLOGIK_RULE) in LanguageTool. We have also created our own rules, customized for our use cases. We use a Python wrapper for LanguageTool, which we had to optimize so that autocorrection takes less time. Earlier, the Java server would restart whenever the function was called. We changed it so that the Java server starts only when the function is called for the first time. The Java server then keeps running for as long as our API server is running. Our version is available as a pip package (grammar-check).
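The start-once behaviour can be sketched as lazy initialization. The snippet below is an illustration only, not the code inside grammar-check: FakeServer is a hypothetical stand-in for the Java process, and get_server and autocorrect are names made up for this example.

```python
import threading


class FakeServer:
    """Hypothetical stand-in for the LanguageTool Java process."""

    starts = 0  # counts how many times a server has been started

    def __init__(self):
        FakeServer.starts += 1  # in real life: launch the Java server here

    def check(self, text):
        return []  # a real server would return a list of matches


_server = None
_lock = threading.Lock()


def get_server():
    """Start the server on first use, then reuse the same instance."""
    global _server
    if _server is None:
        with _lock:
            # Re-check inside the lock so two threads cannot both start one.
            if _server is None:
                _server = FakeServer()
    return _server


def autocorrect(text):
    server = get_server()  # no restart on later calls
    server.check(text)
    return text


for message in ["first", "second", "third"]:
    autocorrect(message)
print(FakeServer.starts)
# prints: 1
```

Only the first call pays the startup cost; every later call goes straight to the already running server.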
Here are the basics of how we do Autocorrect using the grammar-check package.
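import grammar_check

tool = grammar_check.LanguageTool('en-GB')    # load the British English rules
matches = tool.check(text)                    # text is the message to correct
text = grammar_check.correct(text, matches)   # apply the suggested corrections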
You can write your own grammar rules in 3 ways:
1) Java – You can extend LanguageTool’s Rule class and implement the match(AnalyzedSentence) method. If your rule doesn’t work on the sentence level, implement TextLevelRule instead.
2) XML – Most LanguageTool rules are contained in rules/xx/grammar.xml, where xx is a language code like en or de. In the source code, this folder is found under languagetool-language-modules/xx/src/main/resources/org/languagetool/; the standalone GUI version contains the rules under org/languagetool/.
Here’s an example of a rule, showing its token patterns, the message shown for a match, and example sentences.
<token postag="NNS"><exception postag="VBZ|NN|JJ.*" postag_regexp="yes"></exception><exception>data</exception></token>
<token><exception postag="NN|NNP|NN:.*" postag_regexp="yes"></exception></token>
<message>Did you mean <suggestion>these</suggestion>?</message>
<example type="correct">These errors are easy to fix.</example>
<example correction="These" type="incorrect"><marker>This</marker> errors are easy to fix.</example>
<example type="correct">This forms a sharp contract with...</example>
3) Python – Create your own rules in Python and apply them before the correct() method of grammar_check is called.
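For the Python option, one simple approach is a list of regular-expression rules applied to the text before it reaches LanguageTool. The sketch below shows the idea under that assumption. The three rules and the name apply_custom_rules are hypothetical examples, not the rules we run in production.

```python
import re

# Hypothetical example rules, applied in order: (compiled pattern, replacement).
CUSTOM_RULES = [
    (re.compile(r"\s+([,.!?])"), r"\1"),  # drop spaces before punctuation
    (re.compile(r"\bi\b"), "I"),          # capitalise a standalone "i"
    (re.compile(r" {2,}"), " "),          # collapse repeated spaces
]


def apply_custom_rules(text, rules=CUSTOM_RULES):
    """Run every custom rule over the text, one after another."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


print(apply_custom_rules("Sure , i will  book it ."))
# prints: Sure, I will book it.
```

The cleaned-up text can then be passed to tool.check() and grammar_check.correct() as in the snippet earlier in this post, so the custom rules and LanguageTool’s own rules run one after the other.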
It is good to see that the LanguageTool community is actively contributing rules in more than 20 languages. To create your own pip package, read this documentation.
“One thing about open source is that even the failures contribute to the next thing that comes up. Unlike a company that could spend a million dollars in two years and fail and there’s nothing really to show for it, if you spend a million dollars on open source, you probably have something amazing that other people can build on.”
~ Matt Mullenweg