Searching for Knowledge in Guru

Guru goes beyond basic keyword search by applying AI and ML to understand the meaning of search terms (semantic search). Your team's past search activity helps Guru refine the ranking of search results so that the most relevant content (either Guru Cards or externally linked sources) appears as high in the results list as possible. Guru's model is constantly learning and improving.

Using Guru's Search to Locate Information

After you type a search term or keyword in the web app or browser extension search bar, several things happen:

  • If you are on a plan that includes Answers and you end your search with a question mark ("?"), Guru will use generative AI to answer your question directly, so you don't need to review search results.
  • Guru will use information that has been added to Guru via Cards, or connected to Guru via sources, to surface relevant search results.
    • Clicking a result takes you to the Card or the external source. You will only see results that you have access to in the original source.

Filtering

  • You can filter your results by:
    • Source
    • Guru Attribute (Collection, Author, Tag, Verification Status, Favorites)

How Guru Search Works

Below are more technical details about how Guru's search functions:

Interpreting Search Terms

  • Guru checks to see if there are alternate forms of the search term(s) and decides what to do with any stop words.
    • For example, if you entered "ran", Guru will include "run", "running", and "runs".
    • Stop words include "a," "the," "is," "are," "but," "for," and more.
  • Guru does a spell check (also known as fuzzy matching).
  • Guru evaluates punctuation marks and spacing.
    • For example, "self service" versus "self-service".
  • Guru considers potential synonyms.
    • For example, if a searcher provides "vacation" in the search bar, Guru may also consider Cards with the word "holiday" in them.
  • Guru recognizes double quotation marks, which you can use to search for a specific word combination or an exact phrase.
    • For example, "engineering onboarding" will find Cards that have that exact phrase only.
    • If the phrase inside the double quotation marks contains stop words, the stop words will be excluded from the exact match search. 
  • Guru uses machine learning (“ML”) to form a representation of the meaning of the search terms. This is part of Guru’s semantic search capabilities.
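The interpretation steps above can be pictured with a small sketch. This is an illustration only, not Guru's implementation: the stop-word list, the alternate-form and synonym tables, the vocabulary, and the function names `correct` and `interpret` are all hypothetical, and the standard library's `difflib` stands in for whatever spell check Guru actually uses. The ML-based meaning representation is not modeled here.

```python
import difflib
import re

# Hypothetical data; the article only names a few stop words and examples.
STOP_WORDS = {"a", "the", "is", "are", "but", "for"}
ALTERNATE_FORMS = {"ran": {"run", "running", "runs"}}
SYNONYMS = {"vacation": {"holiday"}}
VOCABULARY = {"ran", "vacation", "holiday", "engineering",
              "onboarding", "self", "service"}


def correct(word: str) -> str:
    """Return a close vocabulary match for a misspelled word, else the word itself."""
    if word in VOCABULARY:
        return word
    matches = difflib.get_close_matches(word, sorted(VOCABULARY), n=1, cutoff=0.8)
    return matches[0] if matches else word


def tokenize(text: str) -> list:
    """Lowercase and split on anything that is not a letter or digit.

    Splitting on punctuation makes "self-service" and "self service" equivalent.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def interpret(query: str) -> dict:
    """Split a query into exact phrases and expanded free terms."""
    # Pull out double-quoted phrases first so they are not expanded.
    phrases = re.findall(r'"([^"]+)"', query)
    remainder = re.sub(r'"[^"]+"', " ", query)

    terms = {}
    for word in tokenize(remainder):
        if word in STOP_WORDS:
            continue
        word = correct(word)
        expansions = {word}
        expansions |= ALTERNATE_FORMS.get(word, set())
        expansions |= SYNONYMS.get(word, set())
        terms[word] = sorted(expansions)

    # Stop words are excluded from exact phrases as well.
    exact = [[w for w in tokenize(p) if w not in STOP_WORDS] for p in phrases]
    return {"terms": terms, "exact_phrases": exact}
```

For instance, under these assumed tables, `interpret('ran for vacation "engineering onboarding"')` drops "for", expands "ran" and "vacation", and keeps "engineering onboarding" as a single exact phrase.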

Gathering Relevant Results

After analyzing the search term(s) provided, Guru will find Cards that are relevant to those terms. Relevance is based on many factors, including:

  • Where matches are found in a Card.
    • For example, a match could be found in the Card's title, Tags, attachment, content, etc.
  • How many “matches” to the search term(s) there are in the Card.
  • How well the meaning of the Card matches the meaning of the search terms as determined by an ML process.
  • How much interaction a Card has received and if that interaction happened recently or a long time ago.
    • Interactions include favoriting, viewing, and copying.
    • Recent interactions carry slightly more weight than older ones.
  • How recently a Card was created.

We use an ML process to determine the best weight for all of these different factors combined. These values are updated on a regular cadence based on historical search activity.

In addition to the inputs derived from Card content and actions on Cards, Guru also uses data about which search terms have successfully led other members of your team to Cards as a way to ensure that the most relevant Cards are as high in the results list as possible.

The process for finding relevant Cards in search is a complex combination of several sub-processes that are constantly being evaluated, tested, and adjusted. The factors mentioned above are a simplified representation of this process.
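To make the idea of combining weighted factors concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the weight values, the 30-day half-life, the `CardSignals` fields, and the linear formula are hypothetical and are not Guru's actual model, which the article describes as learned by an ML process and regularly updated.

```python
import math
from dataclasses import dataclass


@dataclass
class CardSignals:
    title_matches: int                  # matches found in the Card title
    body_matches: int                   # matches found in the Card body
    semantic_similarity: float          # 0.0 to 1.0, from an ML model
    interaction_count: int              # favorites, views, copies
    days_since_last_interaction: float
    days_since_created: float


# Hypothetical weights; a real system would learn these from search activity.
WEIGHTS = {
    "title_match": 3.0,
    "body_match": 1.0,
    "semantic": 2.0,
    "interaction": 1.5,
    "freshness": 0.5,
}


def recency(days: float, half_life_days: float = 30.0) -> float:
    """Exponential decay: 1.0 today, 0.5 after one half-life."""
    return 0.5 ** (days / half_life_days)


def relevance(s: CardSignals) -> float:
    """Combine the signals into one score as a weighted sum."""
    return (
        WEIGHTS["title_match"] * s.title_matches
        + WEIGHTS["body_match"] * s.body_matches
        + WEIGHTS["semantic"] * s.semantic_similarity
        # log1p dampens very large interaction counts; decay favors recent ones.
        + WEIGHTS["interaction"] * math.log1p(s.interaction_count)
        * recency(s.days_since_last_interaction)
        + WEIGHTS["freshness"] * recency(s.days_since_created)
    )
```

In this sketch, a match in the title outweighs a match in the body, and two otherwise identical Cards are separated by how recently each was last interacted with.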

Title Search

This feature is designed to help you quickly find content that you are already familiar with or that is highly relevant based on the title of the Card. The results that appear in the dropdown under the search bar in the extension and web app are based only on the content of Card titles; no other parts of the Card are considered for this search. This search generates up to three results as you type, and they change as you add to or edit what you’ve typed.

Several of the same processes that contribute to regular search results also influence title search:

  • Alternate forms of the term(s) are considered (e.g., "run" vs. "runs").
  • How closely the Card title aligns with the search terms provided. For example, in a query that includes multiple words, finding more of those words in the title of the Card is better.
  • The interactions a Card has received and how recently they occurred.
  • How recently a Card was created.
  • Exact match queries are treated the same as in normal search: Guru will only find results that match the terms provided very closely.

Notably, spellchecking and ML processes for interpreting the meaning of search terms and Card content are not part of the process of returning results for this type of search.
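A title-only lookup of this kind might look like the sketch below. It is a simplified assumption, not Guru's code: the function name `title_suggestions` is hypothetical, ranking uses only the count of query words found in the title, and alternate forms, interactions, and Card age are left out for brevity.

```python
import re


def title_suggestions(query: str, titles: list, limit: int = 3) -> list:
    """Return up to `limit` titles, ranked by how many query words they contain."""
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    if not query_words:
        return []

    scored = []
    for title in titles:
        title_words = set(re.findall(r"[a-z0-9]+", title.lower()))
        overlap = len(query_words & title_words)
        if overlap:
            scored.append((overlap, title))

    # sort() is stable, so titles with equal overlap keep their input order.
    scored.sort(key=lambda pair: -pair[0])
    return [title for _, title in scored[:limit]]
```

Calling this function again on every keystroke would reproduce the "results change as you type" behavior in a rough way.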

Building the Search Index

Every time a Card is created or updated, Guru runs a process to pull information from several components in the Card. Normally, this process completes within a few seconds of publishing a Card. Attachment content usually finishes processing within a minute or two, following the Card content.

These are the parts of a Card that are added to the search index:

  • Card title
  • Card body
  • Tags
  • Any text that can be extracted from attachments or uploaded files (like PDFs)
  • Attachment file names
  • A representation of the meaning of the Card as determined by a machine learning model. This is part of Guru’s semantic search capabilities.
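One way to picture an index entry holding these parts is the sketch below. The class and field names are hypothetical, and `toy_vector` is only a stand-in for the machine learning model; it does not produce a real semantic representation. Requires Python 3.9 or later.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attachment:
    file_name: str
    extracted_text: str = ""  # empty until text extraction has finished


@dataclass
class IndexEntry:
    title: str
    body: str
    tags: list[str]
    attachment_names: list[str]
    attachment_text: list[str]
    meaning_vector: list[float]


def toy_vector(text: str) -> list[float]:
    """Placeholder for an ML embedding: word count and unique word count."""
    words = text.lower().split()
    return [float(len(words)), float(len(set(words)))]


def build_index_entry(
    title: str,
    body: str,
    tags: list[str],
    attachments: list[Attachment],
    embed: Callable[[str], list[float]] = toy_vector,
) -> IndexEntry:
    """Collect the searchable parts of a Card into one entry."""
    texts = [a.extracted_text for a in attachments if a.extracted_text]
    return IndexEntry(
        title=title,
        body=body,
        tags=list(tags),
        attachment_names=[a.file_name for a in attachments],
        attachment_text=texts,
        meaning_vector=embed(" ".join([title, body, *texts])),
    )
```

Because attachment text arrives later than Card content, an entry like this could be built once at publish time and rebuilt when extraction completes.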

How Text Is Extracted from Attachments

When an Author uploads an attachment to a Card, Guru runs a process to discover if there is text in the file. This text-identifying process is based on Optical Character Recognition (OCR) and a machine learning model that has been taught to recognize text visually. This OCR-based process works with handwritten and printed characters.

File types Guru indexes for search:

  • PDFs
  • Word (and open source equivalents)
  • PowerPoint (and open source equivalents)
  • Excel (and open source equivalents)
  • Plain text files (.txt)
  • PNGs
  • Photoshop
  • Illustrator
  • Postscript

There are some limitations (per file, not per Card) to this OCR process to be aware of:

  • 500MB file size limit
  • 10MB file size limit for PNGs
  • Maximum number of pages is 3,000
  • Maximum height and width is 40 inches (2,880 points)
  • PDFs cannot be password protected
  • PDFs cannot contain JPEG 2000 formatted images
  • Text must be horizontal; vertical text won’t be picked up
  • Text must be at least 15 pixels tall; at 150 DPI this works out to about 8-point font
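The per-file limits above can be expressed as a simple pre-upload check. This is a hypothetical sketch: `FileInfo`, `extraction_problems`, and `pixels_to_points` are illustrative names, 1MB is assumed to mean 1024 × 1024 bytes, and the conversion helper assumes the 15-unit minimum text height is measured in pixels, since that is the reading under which the 150 DPI figure works out.

```python
from dataclasses import dataclass

MB = 1024 * 1024                 # assumption: binary megabytes
MAX_FILE_BYTES = 500 * MB
MAX_PNG_BYTES = 10 * MB
MAX_PAGES = 3000
MAX_DIMENSION_POINTS = 40 * 72   # 40 inches at 72 points per inch = 2880


@dataclass
class FileInfo:
    name: str
    size_bytes: int
    pages: int = 1
    width_points: float = 612.0   # US Letter by default
    height_points: float = 792.0
    password_protected: bool = False


def extraction_problems(f: FileInfo) -> list:
    """Return the reasons a file would fall outside the documented limits."""
    problems = []
    if f.size_bytes > MAX_FILE_BYTES:
        problems.append("file exceeds 500MB")
    if f.name.lower().endswith(".png") and f.size_bytes > MAX_PNG_BYTES:
        problems.append("PNG exceeds 10MB")
    if f.pages > MAX_PAGES:
        problems.append("more than 3,000 pages")
    if max(f.width_points, f.height_points) > MAX_DIMENSION_POINTS:
        problems.append("page larger than 40 inches")
    if f.password_protected:
        problems.append("file is password protected")
    return problems


def pixels_to_points(pixels: float, dpi: float) -> float:
    """Convert a pixel height to points (72 points per inch)."""
    return pixels * 72 / dpi
```

With these definitions, `pixels_to_points(15, 150)` gives roughly 7.2 points, which is consistent with the "about 8 point font" guidance.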

Since matches in attachments don't receive the same emphasis as matches found in Card titles, tags, and body content, we recommend adding some descriptive text to the body of Cards that contain attachments. A description not only improves search performance but also helps anyone who comes across the Card understand whether the information in the attachment will be useful to them.

The text extraction process works with content in these languages:

  • English
  • French
  • German
  • Italian
  • Portuguese
  • Spanish