A search feature is only useful if its index holds all of the content that users can search for. For Guru, the search index is built from the information in Cards. Storing this information in the index allows Guru to quickly find relevant content when a user submits a search query.
If you are interested in learning about the other parts of Guru’s search technology, see Guru's Search: Definition.
Building the search index
Every time a Card is created or updated, Guru runs a process to pull information from several components in the Card. We complete this process within a few seconds of a user publishing the Card.
These are the parts of a Card that Guru uses in its search index:
Any text that can be extracted from attachments or uploaded files (like PDFs)
Attachment file names
Guru does not index iframed content. Content must be stored (“hosted”) in Guru for it to be searchable.
How text is extracted from attachments
When an author uploads an attachment to a Card, Guru runs a process to determine whether the file contains text. This process is based on Optical Character Recognition (OCR) and a machine learning model that has been trained to recognize text visually. It works with both handwritten and printed characters.
File types Guru indexes for search:
Word (and open source equivalents)
PowerPoint (and open source equivalents)
Excel (and open source equivalents)
This OCR process has some limitations to be aware of. They apply per file, not per Card:
500MB file size limit
10MB file size limit for PNGs
Maximum number of pages is 3,000
Maximum height and width is 40 inches (2,880 points)
PDFs cannot be password protected
PDFs cannot contain JPEG 2000 formatted images
Text must be horizontal; vertical text won’t be picked up
Text must be at least 15 pixels tall; at 150 DPI this works out to about 8-point font
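The limits above can be checked before uploading a file. The following is a minimal sketch, not Guru's implementation: the Attachment class, the limit_problems function, and the pixels_to_points helper are hypothetical names invented for this example, and it assumes that 1MB means 1024 × 1024 bytes. The helper shows the arithmetic behind the minimum text size, assuming that size is measured in pixels on the scanned image.

```python
from dataclasses import dataclass

# Limits copied from the list above; all apply per file, not per Card.
MAX_FILE_BYTES = 500 * 1024 * 1024  # 500MB (assumes 1MB = 1024 * 1024 bytes)
MAX_PNG_BYTES = 10 * 1024 * 1024    # 10MB for PNGs
MAX_PAGES = 3000
MAX_SIDE_INCHES = 40.0
POINTS_PER_INCH = 72                # 40 inches * 72 = 2,880 points


@dataclass
class Attachment:
    """Hypothetical description of a file, for illustration only."""
    name: str
    size_bytes: int
    pages: int = 1
    width_inches: float = 8.5
    height_inches: float = 11.0
    password_protected: bool = False


def limit_problems(att: Attachment) -> list[str]:
    """Return the limits this file would exceed (empty list if none)."""
    problems = []
    if att.size_bytes > MAX_FILE_BYTES:
        problems.append("file is larger than 500MB")
    if att.name.lower().endswith(".png") and att.size_bytes > MAX_PNG_BYTES:
        problems.append("PNG is larger than 10MB")
    if att.pages > MAX_PAGES:
        problems.append("more than 3,000 pages")
    if max(att.width_inches, att.height_inches) > MAX_SIDE_INCHES:
        problems.append("page is larger than 40 inches on one side")
    if att.password_protected:
        problems.append("file is password protected")
    return problems


def pixels_to_points(pixels: float, dpi: float) -> float:
    """Convert a height in pixels at a given scan resolution to points."""
    return pixels / dpi * POINTS_PER_INCH
```

For example, pixels_to_points(15, 150) gives roughly 7.2, which rounds up to the 8-point font mentioned above. The sketch does not try to detect JPEG 2000 images or vertical text, since both would require inspecting the file contents.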
Since matches in attachments don't receive the same emphasis as matches found in Card titles, tags, and body content, we recommend adding some descriptive text to the body of Cards that contain attachments. Not only is a description helpful for improving search performance, but it will help anyone who comes across the Card better understand if the information in the attachment will be useful to them.
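To see why a short description helps, here is a toy illustration of field-weighted matching. It is only a sketch under stated assumptions: Guru's actual ranking is not described here, the field names and the score function are hypothetical, and the weights are made-up numbers chosen only so that attachment matches count for less than title, tag, and body matches.

```python
# Hypothetical weights: the only thing taken from the text above is that
# attachment matches get less emphasis than title, tag, and body matches.
FIELD_WEIGHTS = {
    "title": 2.0,
    "tags": 2.0,
    "body": 2.0,
    "attachment_text": 1.0,
}


def score(card: dict[str, str], query: str) -> float:
    """Sum weighted counts of exact word matches across the Card's fields."""
    term = query.lower()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = card.get(field, "").lower().split()
        total += weight * words.count(term)
    return total


# Same attachment in both Cards; only the second has a description in the body.
without_description = {
    "title": "Q3 deck",
    "body": "",
    "attachment_text": "onboarding checklist for new hires",
}
with_description = {
    "title": "Q3 deck",
    "body": "Slides covering the onboarding checklist",
    "attachment_text": "onboarding checklist for new hires",
}

print(score(without_description, "onboarding"))  # 1.0
print(score(with_description, "onboarding"))     # 3.0
```

Under these assumed weights, the Card with a description scores higher for the same query, because the match in the body counts for more than the match in the attachment.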
The text extraction process works with content in these languages: