
Lineage

Overview

Lineage represents the dependencies between data products and shows how data flows through the organization.

For data product producers, defining lineage correctly is crucial for:

  • Providing transparency to data consumers about where data comes from and where it goes.
  • Supporting impact analysis when making changes to a data product.
  • Enabling governance and observability features in the Marketplace.

In the data product descriptor, lineage is defined through two types of relationships:

  • readsFrom: represents strong, operational dependencies tied to real, physical data flows.
  • logicallyReadsFrom: represents design-level or high-level dependencies that are either planned but not yet implemented, or not documented in full detail.

These relations are visualized in the Marketplace Lineage Graph as solid lines (readsFrom connections) and dashed lines (logicallyReadsFrom connections).
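
As an illustration only, the two relationship types and their line styles in the graph can be modeled in a few lines of Python. The names Relation, LineageEdge, and LINE_STYLE, as well as the identifier strings, are hypothetical and do not reflect the actual descriptor schema or any Marketplace API.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    """The two lineage relationship types named in the descriptor."""
    READS_FROM = "readsFrom"
    LOGICALLY_READS_FROM = "logicallyReadsFrom"


# Line styles as described for the Marketplace Lineage Graph.
LINE_STYLE = {
    Relation.READS_FROM: "solid",
    Relation.LOGICALLY_READS_FROM: "dashed",
}


@dataclass(frozen=True)
class LineageEdge:
    source: str          # consumer identifier (hypothetical format)
    target: str          # producer identifier (hypothetical format)
    relation: Relation

    @property
    def line_style(self) -> str:
        return LINE_STYLE[self.relation]


edge = LineageEdge("sales.ingest-job", "orders.output-port", Relation.READS_FROM)
print(edge.relation.value, edge.line_style)
```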

readsFrom — Strong, Physical Dependency

readsFrom is used to describe real, operational data flows between two data products. It indicates that the consuming product actively reads data from a specific published output port, through one of its components (usually a workload).

Constraints

  • Source (Consumer):
    • Must always be a component or subcomponent, typically a workload (e.g., a service or pipeline).
    • This represents the element actively consuming data.
  • Target (Producer):
    • Must always be a consumable component or subcomponent, such as a published output port.
    • Represents the element exposing data for consumption.
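
The constraints above can be expressed as a small validation check. This is a minimal sketch, assuming a simplified model in which every element has a level and a consumable flag; Node, Level, and reads_from_errors are hypothetical names and are not part of the platform.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    SYSTEM = "system"
    COMPONENT = "component"
    SUBCOMPONENT = "subcomponent"


@dataclass(frozen=True)
class Node:
    name: str
    level: Level
    consumable: bool = False  # True for elements such as a published output port


def reads_from_errors(source: Node, target: Node) -> list[str]:
    """Return the readsFrom constraints violated by a source/target pair."""
    allowed = (Level.COMPONENT, Level.SUBCOMPONENT)
    errors = []
    # Source (consumer) must be a component or subcomponent.
    if source.level not in allowed:
        errors.append(f"source '{source.name}' must be a component or subcomponent")
    # Target (producer) must be a consumable component or subcomponent.
    if target.level not in allowed or not target.consumable:
        errors.append(f"target '{target.name}' must be a consumable component or subcomponent")
    return errors


workload = Node("ingest-job", Level.COMPONENT)
output_port = Node("orders-output-port", Level.COMPONENT, consumable=True)
print(reads_from_errors(workload, output_port))  # no violations for this pair
```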

Best Practices

  • Use readsFrom only when the physical flow is established and operational.
  • Be as specific as possible, linking directly to the exact consumable interface (output port).
  • Avoid using readsFrom for future or conceptual dependencies: those should be modeled with logicallyReadsFrom.

logicallyReadsFrom — Logical or High-Level Dependency

logicallyReadsFrom represents a weaker relationship, used for design-level or partially documented dependencies. It is a way to describe intent or higher-level flows without requiring a fully defined physical connection.

When to Use

  • Future dependencies: the data flow does not exist yet but is planned.
  • Simplified documentation: a real data flow exists, but you don't want to model every detail (for example, when there are many intermediate components).
  • Group-level relationships: the dependency is on a whole product or a group of outputs (i.e., a parent component of consumable subcomponents) rather than on a single output port.

Constraints

  • Source (Consumer):
    • Can be a system, component, or subcomponent.
    • Represents the consumer at any level of granularity.
  • Target (Producer):
    • Can be:
      • A group of consumables, such as a whole data product or parent component containing multiple outputs.
      • A specific single consumable, like an output port.
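
The looser logicallyReadsFrom constraints can be sketched in the same spirit. The example assumes a hypothetical Element type with nested children; a target is accepted if it is itself consumable or contains at least one consumable element. None of these names come from the descriptor specification.

```python
from dataclasses import dataclass, field

LEVELS = ("system", "component", "subcomponent")


@dataclass
class Element:
    name: str
    level: str                      # one of LEVELS
    consumable: bool = False
    children: list["Element"] = field(default_factory=list)


def exposes_consumables(element: Element) -> bool:
    """True for a single consumable or for a group that contains one."""
    return element.consumable or any(exposes_consumables(c) for c in element.children)


def logically_reads_from_errors(source: Element, target: Element) -> list[str]:
    errors = []
    # Source (consumer) may sit at any level of granularity.
    if source.level not in LEVELS:
        errors.append(f"source '{source.name}' has unknown level '{source.level}'")
    # Target (producer) is a single consumable or a group of consumables.
    if not exposes_consumables(target):
        errors.append(f"target '{target.name}' exposes no consumable element")
    return errors


port = Element("orders-output-port", "component", consumable=True)
product = Element("orders", "system", children=[port])
consumer = Element("sales", "system")
print(logically_reads_from_errors(consumer, product))  # no violations for this pair
```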

Tip: Prefer readsFrom over logicallyReadsFrom when defining a relationship from a component to a consumable component or subcomponent. logicallyReadsFrom is technically allowed in that case, but readsFrom is recommended because it more accurately represents an actual operational data flow.
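
One way to summarize the guidance on this page is a small decision helper. It is a sketch of a default recommendation under simplifying assumptions, not a rule enforced by the platform: recommend_relation and its parameters are hypothetical, and a producer may still choose logicallyReadsFrom deliberately, for instance to keep documentation simple.

```python
def recommend_relation(flow_is_operational: bool,
                       source_level: str,
                       target_is_consumable: bool) -> str:
    """Suggest a relationship type for a consumer/producer pair.

    source_level is "system", "component", or "subcomponent";
    target_is_consumable is True for a single consumable such as an output port.
    """
    source_ok = source_level in ("component", "subcomponent")
    if flow_is_operational and source_ok and target_is_consumable:
        return "readsFrom"
    # Planned flows, system-level consumers, and group-level targets
    # fall back to the weaker relationship.
    return "logicallyReadsFrom"


print(recommend_relation(True, "component", True))
print(recommend_relation(False, "component", True))
```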