Retrievers concepts

Retrievers can be used for semantic search or for retrieval augmented generation (RAG). A configured retriever can compute embeddings (vectors) for your input data. After the computing is done, the retriever offers interfaces to return data based on a query.

Data sources and data types

A data source is where the data to be embedded/retrieved comes from. The aidb extension supports two types of data sources:

  • Table A column in a table in the PG database.
  • Volume A PGFS volume, which is a wrapper for accessing an S3 object store or local file system.

Data types are:

  • Text
  • Image

You can combine any data source with any data type:

  • Table and text: The text, in this case, is just a column with TEXT/VARCHAR type in a Postgres table. An example is the description for a list of products.
  • Table and image: The image, in this case, would be stored in a BYTEA column in the table, which would hold the bytes of the image. Pipelines doesn't currently support large objects (also known as BLOB) types.
  • Volume and text: An example is text files in an S3 bucket.
  • Volume and image: The images can be stored as objects in an S3 bucket.

Embedding

Retrievers always need a vector table to store the vector embeddings that belong to the source data. Pipelines uses a model with encoder capabilities to generate the vectors for the source data.

Each model has a different dimensionality, that is, the number of dimensions of the vector. (The vector [1, 2, 2] has 3 dimensions.) Pipelines uses PGvector vector column types to store the embeddings. Those have a fixed number of dimensions. For this reason, the encoder interface on the models returns the number of dimensions the vectors will have.

Pipelines creates the vector table and also supports using an existing one. The vector column dimensionality must fit the model.

Consistency with source data

To retrieve correct and consistent data, embeddings must be in sync with the source data. In the case of the table data source, Pipelines uses triggers in Postgres to be notified about changes on the source table.

If the source table already has data/rows when the retriever is created, then a manual bulk embedding call must be made.

To keep the embeddings in sync going forward, you can activate auto-embedding.

Bulk embedding and auto-embedding

Bulk embedding generates the embeddings for all the existing data in a source table. Use it to initialize the retriever if the source table isn't empty.

Note

Bulk embedding doesn't sync the data. We recommend running it on an empty vector table.

Auto-embedding can be manually enabled. It creates triggers in Postgres to keep the embeddings up to data if source data is added, updated, or removed.

Note

Auto-embedding currently works only with the table data source, not the volume source.

Synchronous vs. Asynchronous embedding computation

  • Bulk embedding is a blocking call that can’t be paused.
  • Auto-embedding runs along with other queries during insert/update calls, while waiting on the embedding computation. This means insert/update queries on the source table will be slowed down.

Could this page be better? Report a problem or suggest an addition!