With the ever-growing demand for automatic question answering, Haptik built a document answering feature that lets our Intelligent Virtual Assistant (IVA) consume data and answer questions directly from a customer’s product brochures and documents. Our Haptik bots reply with answers from the document segments and provide a link that opens the document at the exact page and highlights the answer. Highlighting the text and linking to the exact answer are important for a good user experience. While this might seem like an easy task, it involves many challenges and edge cases. In this post, we detail how we built a PDF text highlighter programmatically.
Solutions explored
Below are the two solutions we explored, with their pros and cons:
1. Using Text Fragments: We can link to a specific portion of a page using a text fragment by adding #:~:text=text_to_highlight to the URL. The browser highlights the text when the page is loaded and scrolls the fragment into view.
- Pros: easy to implement
- Cons: browser-specific, and the highlight is lost when the page is refreshed
2. Highlight in HTML after converting PDF to HTML: Convert the PDF document to HTML files while preserving the exact layout. On page load, execute a JavaScript function that takes the required parameters, such as the text to be highlighted, loads the HTML document and highlights the right DOM elements.
- Pros: gives total flexibility to highlight any text we want
- Cons: PDF to HTML converters modify the order and nesting of HTML elements to achieve an exact layout, making highlighting on those HTML documents a challenging task with several edge cases.
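For reference, the text fragment link from option 1 can be built with a few lines of JavaScript. This is a minimal sketch: the function name buildTextFragmentUrl and the example URL are hypothetical, and it covers only the simple #:~:text=text_to_highlight form. The characters "-", "," and "&" have special meaning inside a text fragment, so they are percent-encoded here.

```javascript
// Hypothetical helper: builds a link that asks the browser to highlight
// `textToHighlight` when the page loads (in browsers that support text fragments).
function buildTextFragmentUrl(pageUrl, textToHighlight) {
  // encodeURIComponent handles spaces, "," and "&"; "-" must be encoded by hand
  // because it separates the optional prefix/suffix parts of a text fragment.
  const encoded = encodeURIComponent(textToHighlight).replace(/-/g, "%2D");
  // Drop any fragment that is already on the URL before appending ours.
  const base = pageUrl.split("#")[0];
  return `${base}#:~:text=${encoded}`;
}

const link = buildTextFragmentUrl("https://example.com/doc.html", "well-known, text");
console.log(link);
```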
We chose to convert the PDF to an HTML document, despite the challenges, to provide the best user experience. For the conversion we used klokoy/pdf2htmlex, which relies on a few creative hacks with CSS, JavaScript and background elements to achieve the exact layout. Below is the command you can use to convert a PDF file to an HTML file.
docker run -it --rm -v path_to_folder/:/folder klokoy/pdf2htmlex pdf2htmlEX folder/input_filename.pdf folder/output_filename.html
Highlighting the exact text
Once the PDF is converted to HTML using the above command, we need to find the start and end DOM elements containing the first and last words of the text to be highlighted, so that we can highlight the content between them. Highlighting the exact text in the HTML is challenging for many reasons, some of which are listed below.
- The start and end DOM elements may contain some extra text along with the content we are searching for.
- The words we search for might be split across multiple DOM elements.
- The HTML content need not have its DOM elements in the same order as they appear in the PDF, because the converter uses CSS tricks to preserve the layout.
- The HTML content might have some special Latin characters.
- New lines and multiple spaces may be represented differently in the HTML than in the text we search for.
Below we share how these challenges can be handled:
1. Finding the start and end DOM elements
Because the content we are searching for might be concatenated or split across multiple DOM elements, a direct search on the first and last words will not work. First, get the start and end indexes of the text to highlight from an easily accessible parent div, such as the page.
Once we have the start and end indexes, traverse the line DOM elements, accumulating the text length of each line, until we reach the parent (line) DOM element containing the start index. Identify the child start DOM element within that line using the remaining index to be traversed. Identify the end DOM element using the same procedure. Below is the code snippet implementing the discussed logic.
// Get the start and end indexes of the text within the page
var pageElement = document.getElementsByClassName("pf")[page_no];
var pageContent = pageElement.textContent;
var startIndex = pageContent.indexOf(text_to_highlight);
var endIndex = startIndex + text_to_highlight.length;

// Assumption: each line of text is an element with class "t", as in pdf2htmlEX output
var lineElements = pageElement.getElementsByClassName("t");
var currLineIndex = 0;
var startElement = null;
var endElement = null;
for (var i = 0; i < lineElements.length; i++) {
  var lineElement = lineElements[i];
  // Get the line content
  var lineContent = lineElement.textContent;
  lineContent = preprocessForMultiSpaceAndLatinCharacters(lineContent);
  // Increment the current line index by the processed line length
  var lineSize = lineContent.length;
  currLineIndex += lineSize;
  // If the start index falls inside the current line, find the start DOM element
  if (startElement == null && startIndex < currLineIndex) {
    var remainingIndex = startIndex - (currLineIndex - lineSize);
    startElement = get_matching_dom_from_line_element(lineElement, remainingIndex);
  }
  // If the end index falls inside the current line, find the end DOM element
  if (endElement == null && endIndex < currLineIndex) {
    var remainingIndex = endIndex - (currLineIndex - lineSize);
    endElement = get_matching_dom_from_line_element(lineElement, remainingIndex);
    break;
  }
}
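The index arithmetic in this traversal can be tried out without a DOM by working on plain strings. The sketch below is an illustration under stated assumptions, not the production code: normalize and locateInLines are hypothetical names, each array entry stands in for the text of one line element, and the end index is treated as exclusive (hence the <= comparison for the end line).

```javascript
// Collapse runs of whitespace so that offsets are computed on the same text everywhere.
function normalize(text) {
  return text.replace(/\s{2,}/g, " ");
}

// Returns the line number and in-line offset of the start and end of the match,
// or null when the text is not found on the page.
function locateInLines(lineTexts, textToHighlight) {
  const lines = lineTexts.map(normalize);
  const needle = normalize(textToHighlight);
  const startIndex = lines.join("").indexOf(needle);
  if (startIndex === -1) return null;
  const endIndex = startIndex + needle.length; // exclusive

  let consumed = 0; // characters covered by the lines visited so far
  let start = null;
  let end = null;
  for (let i = 0; i < lines.length; i++) {
    const lineStart = consumed;
    consumed += lines[i].length;
    if (start === null && startIndex < consumed) {
      start = { line: i, offset: startIndex - lineStart };
    }
    if (end === null && endIndex <= consumed) {
      end = { line: i, offset: endIndex - lineStart };
      break;
    }
  }
  return { start, end };
}

const located = locateInLines(["Hello wor", "ld, this ", "is  a test"], "world, this is");
console.log(located);
```

The offsets returned here play the role of remainingIndex in the snippet above: they say how far into the matched line the start or end of the highlight lies.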
2. Handle Latin characters
Another common issue is that the converted HTML file may contain special Latin characters, such as accented letters, whereas the search text may use plain English characters. The challenge in replacing Latin characters is maintaining a mapping for all of them, as there is no straightforward translator available. We used this latin_map dictionary and the code snippet below to convert them into plain English characters before searching for the text or computing its indexes.
latin_string.replaceAll(/[^A-Za-z0-9\[\] ]/g, function(a){ return latin_map[a] || a; })
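To make the replacement concrete, here is a small self-contained sketch. The mapping below is hypothetical and deliberately tiny; it is not the latin_map dictionary referred to above, and a real mapping needs many more entries. The function name toPlainEnglish is also an assumption made for illustration.

```javascript
// Hypothetical, deliberately small mapping from Latin characters to plain English ones.
const sampleLatinMap = {
  "\u00e9": "e", // é
  "\u00e8": "e", // è
  "\u00fc": "u", // ü
  "\u00f1": "n", // ñ
  "\u00e7": "c", // ç
  "\u00c5": "A", // Å
};

function toPlainEnglish(text) {
  // Normalise to precomposed characters first so that each accented letter
  // is a single character that can be looked up in the map.
  return text
    .normalize("NFC")
    .replace(/[^A-Za-z0-9\[\] ]/g, (ch) => sampleLatinMap[ch] || ch);
}

console.log(toPlainEnglish("Caf\u00e9 se\u00f1or, gar\u00e7on!"));
```

Characters that are not in the map, such as punctuation, are left unchanged. Note that a mapping which replaces one character with several would change string lengths, so the same conversion should be applied to both the page text and the search text before computing indexes.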
3. Multi-space issues
To preserve the exact layout, the converter adds extra spaces to HTML DOM elements where needed. This leads to issues such as the text match not being found, or the text being only partially highlighted. To handle this, always replace multiple spaces with a single space in the text extracted from the DOM elements while traversing them, using the code snippet below.
text_seperated_by_single_spaces = text_with_multispace_seperation.replaceAll(/[\s]{2,}/g, ' ')
New line characters are also represented in the HTML as single or multiple spaces. If the text you are searching for contains new line characters, replace them using a regex before traversing.
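One way to cover both cases in a single step is to collapse every run of whitespace, including new lines, into one space. This is a sketch under that assumption rather than the exact preprocessing we run; normalizeWhitespace is a hypothetical name.

```javascript
// Replace any run of whitespace (spaces, tabs, new lines) with a single space.
function normalizeWhitespace(text) {
  return text.replace(/\s+/g, " ");
}

const searchText = normalizeWhitespace("Total   price\nper  unit");
console.log(searchText);
```

Applying the same function to the search text and to each line's text keeps the two comparable.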
4. Highlight the answer given the start and end DOM elements
Once we have the start and end DOM elements, we create a selection instance and set its range using them. We can then highlight the text, scroll it into view and zoom in on it.
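A minimal sketch of this step, assuming the standard browser Selection and Range APIs, might look like the following. The function name highlightBetween and its optional win parameter are hypothetical, and the sketch covers only highlighting and scrolling, not zooming.

```javascript
// Hypothetical helper: selects everything from startElement to endElement
// and scrolls the selection into view. `win` defaults to the browser window.
function highlightBetween(startElement, endElement, win = window) {
  const range = win.document.createRange();
  range.setStartBefore(startElement); // the range begins just before the start element
  range.setEndAfter(endElement);      // and ends just after the end element

  const selection = win.getSelection();
  selection.removeAllRanges();        // clear any previous selection
  selection.addRange(range);          // the browser renders the selection as a highlight

  startElement.scrollIntoView({ block: "center" });
  return range;
}
```

This version selects whole elements. When the start or end element contains extra text, the range boundaries can instead be placed inside a text node with range.setStart(textNode, offset) and range.setEnd(textNode, offset).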
Conclusion
We decided to convert the PDF to HTML documents before highlighting so that the solution is not browser-specific. This came with challenges, because the converter has its own way of dealing with special characters and rendering text in DOM elements. Hence, we tailored our own way of identifying the start and end DOM elements in order to highlight the content between them. We have shared some code snippets showing how to handle special characters and multiple spaces, and how to scroll, zoom and highlight the right text.