A search is only as useful as its index, which must hold all of the content that users can search for. In Guru, the search index is built from the information in Guru Cards.

If you are interested in learning about the other parts of Guru’s search technology, see how Guru's search works.

Building the search index

Every time a Card is created or updated, Guru runs a process to pull information from several components of the Card. Normally, this process completes within a few seconds of publishing a Card. Attachment content is processed after the Card content and usually finishes within a minute or two.

These are the parts of a Card that are added to the search index:

  • Card title

  • Card body

  • Tags

  • Any text that can be extracted from attachments or uploaded files (like PDFs)

  • Attachment file names

  • A representation of the meaning of the Card as determined by a machine learning model. This is part of Guru’s semantic search capabilities.
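
To make the list above concrete, here is a minimal sketch of what an index entry for a single Card might contain. It is illustrative only: `Card`, `Attachment`, `build_index_entry`, and every field name are hypothetical and do not reflect Guru's actual data model or API. The semantic representation is left as a placeholder because the model that produces it is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Attachment:
    file_name: str
    extracted_text: str = ""  # text pulled from the file; empty if none was found


@dataclass
class Card:
    title: str
    body: str
    tags: list[str] = field(default_factory=list)
    attachments: list[Attachment] = field(default_factory=list)


def build_index_entry(card: Card) -> dict:
    """Collect the Card components listed above into one searchable record."""
    return {
        "title": card.title,
        "body": card.body,
        "tags": list(card.tags),
        # Only attachments that yielded text contribute searchable content.
        "attachment_text": [a.extracted_text for a in card.attachments if a.extracted_text],
        # File names are indexed even when no text could be extracted.
        "attachment_file_names": [a.file_name for a in card.attachments],
        # Placeholder: the semantic representation comes from a machine
        # learning model that this sketch does not attempt to model.
        "semantic_representation": None,
    }


card = Card(
    title="Expense policy",
    body="How to file expenses.",
    tags=["finance"],
    attachments=[
        Attachment("policy.pdf", "Submit receipts within 30 days."),
        Attachment("logo.png"),
    ],
)
entry = build_index_entry(card)
```

In this sketch, `logo.png` is still findable by its file name even though it contributed no extracted text.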

✍️ Note
Guru does not index iFramed content. Content must be stored (“hosted”) in Guru for it to be searchable.

How text is extracted from attachments

When an Author uploads an attachment to a Card, Guru runs a process to detect whether the file contains text. This process is based on Optical Character Recognition (OCR) and a machine learning model trained to recognize text visually. It works with both handwritten and printed characters.

File types Guru indexes for search:

  • PDFs

  • Word (and open source equivalents)

  • PowerPoint (and open source equivalents)

  • Excel (and open source equivalents)

  • Plain text files (.txt)

  • PNGs

  • Photoshop

  • Illustrator

  • Postscript

The OCR process has the following limits, which apply per file, not per Card:

  • 500MB file size limit

  • 10MB file size limit for PNGs

  • Maximum number of pages is 3,000

  • Maximum height and width is 40 inches (2,880 points)

  • PDFs cannot be password protected

  • PDFs cannot contain JPEG 2000 formatted images

  • Text must be horizontal; vertical text won’t be picked up

  • Text must be at least 15 pixels tall; at 150 DPI, this works out to about 8-point font
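
As a rough pre-upload check, several of the limits above can be expressed in code. The sketch below is illustrative only: `FileInfo`, `limit_problems`, and `pixels_to_points` are hypothetical names, not part of any Guru API. It assumes 1 MB means 1,024 × 1,024 bytes and reads the minimum text height as 15 pixels. The JPEG 2000 and text-orientation limits are not modeled because they require inspecting the file contents.

```python
from dataclasses import dataclass

MB = 1024 * 1024            # assumption: binary megabytes
MAX_FILE_BYTES = 500 * MB   # general per-file size limit
MAX_PNG_BYTES = 10 * MB     # stricter limit for PNGs
MAX_PAGES = 3000
MAX_SIDE_INCHES = 40.0
POINTS_PER_INCH = 72        # standard typographic conversion
MIN_TEXT_HEIGHT_PX = 15


@dataclass
class FileInfo:
    name: str
    size_bytes: int
    pages: int = 1
    width_inches: float = 8.5
    height_inches: float = 11.0
    password_protected: bool = False


def limit_problems(f: FileInfo) -> list[str]:
    """Return a description of each limit the file exceeds (empty if none)."""
    problems = []
    is_png = f.name.lower().endswith(".png")
    size_limit = MAX_PNG_BYTES if is_png else MAX_FILE_BYTES
    if f.size_bytes > size_limit:
        problems.append(f"file is larger than {size_limit // MB} MB")
    if f.pages > MAX_PAGES:
        problems.append(f"more than {MAX_PAGES} pages")
    if max(f.width_inches, f.height_inches) > MAX_SIDE_INCHES:
        problems.append(f"page is larger than {MAX_SIDE_INCHES:g} inches on one side")
    if f.password_protected:
        problems.append("file is password protected")
    return problems


def pixels_to_points(pixels: float, dpi: float) -> float:
    """Convert a height in pixels to typographic points at a given resolution."""
    return pixels / dpi * POINTS_PER_INCH


ok_file = FileInfo("handbook.pdf", 20 * MB, pages=120)
big_png = FileInfo("scan.png", 12 * MB)
min_height_points = pixels_to_points(MIN_TEXT_HEIGHT_PX, 150)
```

Here `limit_problems(ok_file)` returns an empty list, while `big_png` is flagged for exceeding the 10 MB PNG limit. The conversion helper shows where the font-size figure comes from: 15 pixels at 150 DPI is 15 / 150 × 72 = 7.2 points, which rounds up to roughly 8-point font. The same factor gives 40 inches × 72 = 2,880 points for the maximum page dimension.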

Because matches in attachments don't receive the same emphasis as matches found in Card titles, tags, and body content, we recommend adding some descriptive text to the body of Cards that contain attachments. A description not only improves search performance, it also helps anyone who comes across the Card judge whether the information in the attachment will be useful to them.

The text extraction process works with content in these languages:

  • English

  • French

  • German

  • Italian

  • Portuguese

  • Spanish
