Embedding similarity-based search returns the wrong adaptor docs #396

@hanna-paasivirta

Description

We currently allow the job code chat to search both the general documentation and the adaptor documentation. The search uses embedding similarity to fetch several documentation snippets. Because embedding similarity compares full chunks without any reasoning, it can fetch sections from the wrong adaptor's documentation.

It seems like the information about the correct adaptor just gets buried under the retrieved docs, but we should double-check that nothing else is failing (e.g. adaptor function signature fetching). There could be another issue as well, since the model should be able to reason its way past the irrelevant docs.

Solutions

Solving the RAG issue:

  • Simplest option: remove adaptor search from the RAG step (and keep the general docs search), as we now have the key adaptor info in the prompt through the database
  • Ideally, we wouldn't rely on embedding similarity alone, but would also add better filtering and/or agentic search that searches or selects from documentation section titles (Docsite search: Improving vanilla RAG #180)
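
As a rough illustration of the "better filtering" idea, the sketch below drops adaptor-doc results whose `doc_title` does not match the adaptor sent from the front-end. It is not the existing implementation: the function names and the `always_allowed` parameter are hypothetical, and it assumes that `metadata.doc_title` equals the adaptor's short name (the part after `@openfn/language-`), which the logs below suggest but do not guarantee.

```python
import re
from typing import Any

# Assumption: adaptor specifiers look like "@openfn/language-<name>@<version>",
# as in the "job_adaptor" value seen in the Metabase log.
_ADAPTOR_RE = re.compile(r"^@openfn/language-([^@]+)(?:@.+)?$")


def adaptor_short_name(job_adaptor: str) -> str:
    """Return the short adaptor name, e.g. 'memento' for
    '@openfn/language-memento@1.0.5'."""
    match = _ADAPTOR_RE.match(job_adaptor)
    if match is None:
        raise ValueError(f"Unrecognised adaptor specifier: {job_adaptor!r}")
    return match.group(1)


def filter_adaptor_results(
    results: list[dict[str, Any]],
    job_adaptor: str,
    always_allowed: frozenset[str] = frozenset(),
) -> list[dict[str, Any]]:
    """Keep non-adaptor results untouched; keep adaptor-doc results only if
    their doc_title matches the job's adaptor or is explicitly allowed."""
    allowed = {adaptor_short_name(job_adaptor)} | always_allowed
    kept = []
    for result in results:
        metadata = result.get("metadata", {})
        is_adaptor_doc = metadata.get("docs_type") == "adaptor_docs"
        if not is_adaptor_doc or metadata.get("doc_title") in allowed:
            kept.append(result)
    return kept


if __name__ == "__main__":
    # Trimmed-down results in the same shape as the log in this issue.
    sample = [
        {"text": "...", "score": 0.766, "metadata": {"doc_title": "mongodb", "docs_type": "adaptor_docs"}},
        {"text": "...", "score": 0.742, "metadata": {"doc_title": "mysql", "docs_type": "adaptor_docs"}},
    ]
    print(filter_adaptor_results(sample, "@openfn/language-memento@1.0.5"))
```

Applied to the results logged in this issue, a filter like this would return nothing, because none of the fetched chunks belong to `memento`. That would at least stop the wrong adaptor docs from reaching the prompt. Since the Collections adaptor is described as running alongside other adaptors, passing `always_allowed=frozenset({"collections"})` might be a reasonable default, though that is a design choice to confirm.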

Screenshot and logs

Laura's bug:
Image

Log in Metabase shows the right adaptor sent from the front-end: "job_adaptor": "@openfn/language-memento@1.0.5"

    "search_queries": [
      {
        "query": "memento db adaptor retrieve records library",
        "doc_type": "adaptor_docs"
      },
      {
        "query": "memento",
        "doc_type": "adaptor_docs"
      }
    ],
    "search_results": [
      {
        "text": "---\ntitle: MongoDB Adaptor\n---## About MongoDB\n\n[MongoDB](https://www.mongodb.com/) is a NoSQL, document-oriented database that stores data in BSON (Binary JSON) format, enabling easy storage and retrieval of complex and hierarchical data structures## Integration Options\n\nThe `mongodb` adaptor provides direct database connections for accessing data and executing SQL and standard database operations. See [functions](/adaptors/packages/mongodb-docs) for more on how to use this adaptor.## Authentication\n\nSee the [MongoDB docs](https://www.mongodb.com/docs/) for the latest on supported authentication methods. When integrating with a MongoDB database via OpenFn, you authenticate via SSH using authorized database credentials. See this adaptor's [configuration docs](/adaptors/packages/mongodb-configuration-schema) for more on the required authentication parameters.\n\nSee platform docs on [managing credentials](/documentation/manage-projects/manage-credentials) for how to configure a credential in OpenFn. If working locally or if using a Raw JSON credential type, then your configuration will look something like this:",
        "score": 0.766113281,
        "metadata": {
          "doc_title": "mongodb",
          "docs_type": "adaptor_docs"
        }
      },
      {
        "text": "---\ntitle: MySQL Adaptor\n---## About MySQL\n\nMySQL is a free and open-source relational database management system. It can be accessed and manipulated using SQL to extract or load data.## Integration Options\n\nThe `mysql` adaptor provides direct database connections for accessing data and executing SQL and standard database operations. See [functions](/adaptors/packages/mysql-docs) for more on how to use this adaptor.## Authentication\n\nSee the [MySQL docs](https://dev.mysql.com/doc/) for the latest on supported authentication methods. When integrating with a MySQL database via OpenFn, you authenticate via SSH using authorized database credentials. See this adaptor's [configuration docs](/adaptors/packages/mysql-configuration-schema) for more on the required authentication parameters.\n\nSee platform docs on [managing credentials](/documentation/manage-projects/manage-credentials) for how to configure a credential in OpenFn. If working locally or if using a Raw JSON credential type, then your configuration will look something like this:",
        "score": 0.742736816,
        "metadata": {
          "doc_title": "mysql",
          "docs_type": "adaptor_docs"
        }
      },
      {
        "text": "In the following sections, special systems will be described.#### Example user stories\n\n1 Logistics Management(LMIS)\n\n- LMIS is an area where a multitude of parallel, overlapping or competing\n  software solutions can be found in a single country\n- Although a basic LMIS configuration based on aggregate data can take you very\n  far, in some cases a transactional LMIS is necessary if you need to track such\n  detailed operations as returns, transfer between facilities, barcode reading,\n  batch and expiry managemen\n- In such a situation...\n\n2 Data Sharing for Health and Nutrition, Water Sanitation and Hygiene Projects\n\n- Case management sytsems such as CommCare are widely preffered in collecting\n  case data(or patient level data) due to its dominance in the sector and easy\n  of adoption. In such scenarios, ...\n\n3 DHIS2 Instance Synchronization\n\n- Different DHIS2 instances in a given organisation or government ministry may\n  be deployed on separate servers which places the need for synchronization in",
        "score": 0.706970215,
        "metadata": {
          "doc_title": "dhis2",
          "docs_type": "adaptor_docs"
        }
      },
      {
        "text": "You can use Javascript template literals to easily generate key values which\ninclude a mixture of static and dynamic values:```js\ncollections.set(\n  'my-favourite-footballer',\n  value => `${value.createdDate}-${value.region}-${value.name}`\n  $.data\n),```\n\nIn this example, the `createdDate`, `region` and `name` properties will be read\nfrom each value and assembled into a key-string, separated by dashes. This\ntechnique creates keys that are easily sorted by date.### Getting data from a Collection\n\nTo retrieve multiple items from a Collection, we generally recommend using the\n`each()` function.\n\n`each()` will stream each value individually, greatly reducing the memory\noverhead of downloading a large amount of data to the client.```js\ncollections.each('my-collection', '2024*', (state, value, key) => {\n  console.log(value);\n  // No need to return state here\n});",
        "score": 0.703674257,
        "metadata": {
          "doc_title": "collections",
          "docs_type": "adaptor_docs"
        }
      },
      {
        "text": "---\ntitle: Primero\n---## Overview\n\n[Primero](https://www.primero.org/) is an open source software platform that\nhelps social services, humanitarian and development workers manage\nprotection-related data.### Data Model\n\nPrimero data is primarily stored in **cases**, **services** and **referrals**.\n\n- **Cases** - used to track data on people\n- **Referrals** - Referring a record is a way of giving a user limited access to\n  a record without transferring it completely\n- **Services** - Cases are referred for specific _services_ such as\n  `Alternative care` and `Family Reunification`\n\nLearn more about Primero records using the user guides at the Primero\ndocumentation site: https://support.primero.org/documentation\n\n**[See Primero admin guide](https://support.primero.org/assets/books-v2/1sP6VhT70WHhi5ZPbio6EszX-i4jZsBkO/#h.r1lefowgvf0n) for guidance on unique identifiers.**",
        "score": 0.715576112,
        "metadata": {
          "doc_title": "primero",
          "docs_type": "adaptor_docs"
        }
      },
      {
        "text": "---\nid: library-intro\ntitle: The Community Job Library\nsidebar_label: Library Examples\nslug: library\n---## A growing knowledge base\n\nThere have been **over 3,000 distinct jobs written** for the OpenFn platform,\ndriving efficiency across the social sector. These jobs have run millions of\ntimes, connecting and automating critical technologies like CommCare, DHIS2,\nSalesforce, Kobo Toolbox, and more.\n\n**90% of at-scale OpenFn customers make their jobs publicly available** so that\nother organizations can learn from their implementations, yet we’re still\nconstantly asked to find examples of jobs that “create DHIS2 entities from\nCommCare cases” or “initiate payments when Kobo submissions arrive”.\n\nThe Job Library is our attempt to make this vast pool of real-world experience\navailable to everyone, automatically collecting and organizing non-sensitive job\nscripts from this diverse community of social sector integration experts to make\ndata integration safer, faster, and more scalable than ever before.",
        "score": 0.750793517,
        "metadata": {
          "doc_title": "library-intro",
          "docs_type": "adaptor_docs"
        }
      },
      {
        "text": "Data was taken from the[Satusehate Postman Collection](https://www.postman.com/satusehat/satusehat-public/request/56uan96/encounter-create)```js\npost('Encounter', {\n  resourceType: 'Encounter',\n  status: 'arrived',\n  class: {\n    system: 'http://terminology.hl7.org/CodeSystem/v3-ActCode',\n    code: 'AMB',\n    display: 'ambulatory',\n  },\n  subject: {\n    reference: 'Patient/100000030009',\n    display: 'Budi Santoso',\n  },\n  participant: [\n    {\n      type: [\n        {\n          coding: [\n            {\n              system:\n                'http://terminology.hl7.org/CodeSystem/v3-ParticipationType',\n              code: 'ATND',\n              display: 'attender',\n            },\n          ],\n        },\n      ],\n      individual: {\n        reference: 'Practitioner/N10000001',\n        display: 'Dokter Bronsig',\n      },\n    },\n  ],\n  period: {\n    start: '2022-06-14T07:00:00+07:00',\n  },\n  location: [\n    {\n      location: {\n        reference: 'Location/b017aa54-f1df-4ec2-9d84-8823815d7228',\n        display:\n          'Ruang 1A, Poliklinik Bedah Rawat Jalan Terpadu, Lantai 2, Gedung G',\n      },\n    },\n  ],\n  statusHistory: [",
        "score": 0.711608887,
        "metadata": {
          "doc_title": "satusehat",
          "docs_type": "adaptor_docs"
        }
      },
      {
        "text": "Data was taken from the[Satusehate Postman Collection](https://www.postman.com/satusehat/satusehat-public/request/56uan96/encounter-create)```js\npost('Encounter', {\n  resourceType: 'Encounter',\n  status: 'arrived',\n  class: {\n    system: 'http://terminology.hl7.org/CodeSystem/v3-ActCode',\n    code: 'AMB',\n    display: 'ambulatory',\n  },\n  subject: {\n    reference: 'Patient/100000030009',\n    display: 'Budi Santoso',\n  },\n  participant: [\n    {\n      type: [\n        {\n          coding: [\n            {\n              system:\n                'http://terminology.hl7.org/CodeSystem/v3-ParticipationType',\n              code: 'ATND',\n              display: 'attender',\n            },\n          ],\n        },\n      ],\n      individual: {\n        reference: 'Practitioner/N10000001',\n        display: 'Dokter Bronsig',\n      },\n    },\n  ],\n  period: {\n    start: '2022-06-14T07:00:00+07:00',\n  },\n  location: [\n    {\n      location: {\n        reference: 'Location/b017aa54-f1df-4ec2-9d84-8823815d7228',\n        display:\n          'Ruang 1A, Poliklinik Bedah Rawat Jalan Terpadu, Lantai 2, Gedung G',\n      },\n    },\n  ],\n  statusHistory: [    {\n      status: 'arrived',\n      period: {\n        start: '2022-06-14T07:00:00+07:00',\n      },\n    },\n  ],\n  serviceProvider: {\n    reference: 'Organization/{{Org_id}}',\n  },\n  identifier: [\n    {\n      system: 'http://sys-ids.kemkes.go.id/encounter/{{Org_id}}',\n      value: 'P20240001',\n    },\n  ],\n});",
        "score": 0.703979552,
        "metadata": {
          "doc_title": "satusehat",
          "docs_type": "adaptor_docs"
        }
      },
      {
        "text": "---\ntitle: MSSQL Adaptor\n---## About MSSQL\n\n[Microsoft SQL Server](https://learn.microsoft.com/en-us/sql/?view=sql-server-ver16) (MSSQL) is a relational database management system (RDBMS) developed by Microsoft. It supports a wide variety of applications, including data warehousing, transaction processing, and business intelligence. It can be accessed and manipulated using SQL to extract or load data.## Integration Options\n\nThe `mssql` adaptor provides direct database connections for accessing data and executing SQL and standard database operations. See [functions](/adaptors/packages/mssql-docs) for more on how to use this adaptor.## Authentication\n\nSee [MSSQL docs](https://learn.microsoft.com/en-us/sql/?view=sql-server-ver16) for the latest on supported authentication methods. When integrating with an MSSQL database via OpenFn, you authenticate via SSH using authorized database credentials. See this adaptor's [Configuration docs](/adaptors/packages/mssql-configuration-schema) for more on the required authentication parameters.\n\nSee platform docs on [managing credentials](/documentation/manage-projects/manage-credentials) for how to configure a credential in OpenFn. If working locally or if using a Raw JSON credential type, then your configuration will look something like this:",
        "score": 0.739196777,
        "metadata": {
          "doc_title": "mssql",
          "docs_type": "adaptor_docs"
        }
      },
      {
        "text": ":::## The Collections Adaptor\n\nThe Collections API is inserted into all each Step of a Workflow through a\nspecial kind of adaptor.\n\nUniquely, the Collections adaptor it is designed to be run _alongside_ other\nadaptors, not by itself. It is automatically injected into the runtime\nenvironment making the Collections API available to every Step in a Workflow,\nregardless of which adaptor it is using.\n\nIf using the CLI the use Collections locally, refer to the\n[CLI Usage](#cli-usage) guide below.## Usage Guide\n\nAll values in a Collection are stored under a string key. Values are stored as\nStrings, but the Collections API will automatically serialized and de-serialize\nJSON objects to strings for you (so, in effect, you can treat keys as strings\nand value as objects).\n\nCollections can be manipulated using a single key a pattern - where a pattern is\na string with a wildcard. So the key-pattern `mr-benn` will only match a single\nvalue under the key `mr-benn`, but the pattern `2024*` will match all keys which\nstart with `2024` but have any other characters afterwards. The pattern\n`2024*mr-benn*` will match keys starting with 2024, then having some values plus\nthe string `mr-benn`, plus any other sequence of characters (in other words,\nfetch all keys which relate to Mr Benn in 2024).\n\nThe Collections API gives you four functions to read, write and remove data from\na collection.\n\n- Use [`collections.get()`](adaptors/packages/collections-docs#collections_get)",
        "score": 0.741760254,
        "metadata": {
          "doc_title": "collections",
          "docs_type": "adaptor_docs"
        }
      }
