Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fact to Fact JOINS where Querying more than 1 Fact table #9209

Open
ElisabethPA opened this issue Feb 11, 2025 · 2 comments
Open

Fact to Fact JOINS where Querying more than 1 Fact table #9209

ElisabethPA opened this issue Feb 11, 2025 · 2 comments
Assignees
Labels
question The issue is a question. Please use Stack Overflow for questions.

Comments

@ElisabethPA
Copy link

ElisabethPA commented Feb 11, 2025

TWO Fact tables are joined in the same query indirectly, while there is no Fact-to-Fact JOIN defined in example .js file .

We are migrating from Microsoft multidimensional OLAP to CubeDev. Our backend data is in star schema format. We have many facts and dimensions and we are not able to merge fact tables together. They have different granularities. In OLAP cubes we are able to query two measures from different facts in the same MDX query without any problem, while in CubeDev we have a major stopper.

The structure of SQL query generated by CubeDev is slowing down the performance, because we have 1 Fact table join with common dimension + 2nd Fact table joined with common dimension. Both of these queries are combined using JOIN as well. In practice this is generating a cross join between tables, that potentially for us have millions of rows.

Is there any work-around to avoid having LEFT join used in a simple queries where more than one FACT table is involved?
Here is our example:

Tables used:

  1. FactSales
  2. DimProduct
  3. FactTargets

Query we tried from our side:

{
  "limit": 5000,
  "measures": [
    "FactSales.totalSales",
    "FactTargets.totalTargets"
  ],
  "dimensions": [
    "DimProduct.productName"
  ]
}

SQL Query that is generated involves LEFT JOIN on Fact tables, which slows down the performance when tables will have more than millions rows.

The problem for us is the inner query, which shows that two fact tables in example above are cross-joined. Is there any way to avoid it?

Example of .js file here:

cube(`FactSales`, {
  sql: `
    SELECT * FROM (VALUES
      (1, 1001, 500, '2024-02-01'),
      (2, 1002, 700, '2024-02-02'),
      (3, 1001, 800, '2024-02-03'),
      (4, 1003, 600, '2024-02-04')
    ) AS t(sale_id, product_id, sales_value, sale_date)
  `,

  joins: {
    DimProduct: {
      sql: `${CUBE}.product_id = ${DimProduct}.product_id`,
      relationship: `belongsTo`
    }
  },

  measures: {
    totalSales: {
      type: `sum`,
      sql: `sales_value`
    }
  },

  dimensions: {
    saleId: {
      sql: `sale_id`,
      type: `number`,
      primaryKey: true
    },
    productId: {
      sql: `product_id`,
      type: `string`
    },
    saleDate: {
      sql: `sale_date`,
      type: `time`
    }
  }
});

cube(`FactTargets`, {
  sql: `
    SELECT * FROM (VALUES
      (1, 1001, 1500, '2024-02'),
      (2, 1002, 1200, '2024-02'),
      (3, 1003, 1000, '2024-02')
    ) AS t(target_id, product_id, target_value, target_month)
  `,

  joins: {
    DimProduct: {
      sql: `${CUBE}.product_id = ${DimProduct}.product_id`,
      relationship: `belongsTo`
    }
  },

  measures: {
    totalTargets: {
      type: `sum`,
      sql: `target_value`
    }
  },

  dimensions: {
    targetId: {
      sql: `target_id`,
      type: `number`,
      primaryKey: true
    },
    productId: {
      sql: `product_id`,
      type: `string`
    },
    targetMonth: {
      sql: `target_month`,
      type: `string`
    }
  }
});

cube(`DimProduct`, {
  sql: `
    SELECT * FROM (VALUES
      (1001, 'Laptop', 'Electronics'),
      (1002, 'Phone', 'Electronics'),
      (1003, 'Desk', 'Furniture')
    ) AS t(product_id, product_name, category)
  `,

  joins: {
    FactSales: {
      sql: `${CUBE}.product_id = ${FactSales}.product_id`,
      relationship: `hasMany`
    }
  },

  dimensions: {
    productId: {
      sql: `product_id`,
      type: `string`,
      primaryKey: true
    },
    productName: {
      sql: `product_name`,
      type: `string`
    },
    category: {
      sql: `category`,
      type: `string`
    }
  }
});
@ElisabethPA ElisabethPA added the question The issue is a question. Please use Stack Overflow for questions. label Feb 11, 2025
@igorlukanin
Copy link
Member

Hi @ElisabethPA 👋 Great question! Also, thanks for the data model example and the query, it really helped to reproduce.

So, I think the issue here is the direction of joins.

If you:

  1. remove all joins defined in fact tables
  2. add the following joins to your dimension table
  joins: {
    FactSales: {
      sql: `${CUBE}.product_id = ${FactSales}.product_id`,
      relationship: `hasMany`
    },
    FactTargets: {
      sql: `${CUBE}.product_id = ${FactTargets}.product_id`,
      relationship: `hasMany`
    }
  },

For the same query, you'll get the following generated SQL where all fact tables are left joined to the dim table:

SELECT
  q_0."dim_product__product_name",
  "fact_sales__total_sales" "fact_sales__total_sales",
  "fact_targets__total_targets" "fact_targets__total_targets"
FROM
  (
    SELECT
      "keys"."dim_product__product_name",
      sum("fact_sales_key__fact_sales".sales_value) "fact_sales__total_sales"
    FROM
      (
        SELECT
          DISTINCT "fact_sales_key__dim_product".product_name "dim_product__product_name",
          "fact_sales_key__fact_sales".sale_id "fact_sales__sale_id"
        FROM
          (
            SELECT
              *
            FROM
              (
                VALUES
                  (1001, 'Laptop', 'Electronics'),
                  (1002, 'Phone', 'Electronics'),
                  (1003, 'Desk', 'Furniture')
              ) AS t(product_id, product_name, category)
          ) AS "fact_sales_key__dim_product"
          LEFT JOIN (
            SELECT
              *
            FROM
              (
                VALUES
                  (1, 1001, 500, '2024-02-01'),
                  (2, 1002, 700, '2024-02-02'),
                  (3, 1001, 800, '2024-02-03'),
                  (4, 1003, 600, '2024-02-04')
              ) AS t(sale_id, product_id, sales_value, sale_date)
          ) AS "fact_sales_key__fact_sales" ON "fact_sales_key__dim_product".product_id = "fact_sales_key__fact_sales".product_id
          LEFT JOIN (
            SELECT
              *
            FROM
              (
                VALUES
                  (1, 1001, 1500, '2024-02'),
                  (2, 1002, 1200, '2024-02'),
                  (3, 1003, 1000, '2024-02')
              ) AS t(
                target_id,
                product_id,
                target_value,
                target_month
              )
          ) AS "fact_sales_key__fact_targets" ON "fact_sales_key__dim_product".product_id = "fact_sales_key__fact_targets".product_id
      ) AS "keys"
      LEFT JOIN (
        SELECT
          *
        FROM
          (
            VALUES
              (1, 1001, 500, '2024-02-01'),
              (2, 1002, 700, '2024-02-02'),
              (3, 1001, 800, '2024-02-03'),
              (4, 1003, 600, '2024-02-04')
          ) AS t(sale_id, product_id, sales_value, sale_date)
      ) AS "fact_sales_key__fact_sales" ON "keys"."fact_sales__sale_id" = "fact_sales_key__fact_sales".sale_id
    GROUP BY
      1
  ) as q_0
  INNER JOIN (
    SELECT
      "keys"."dim_product__product_name",
      sum("fact_targets_key__fact_targets".target_value) "fact_targets__total_targets"
    FROM
      (
        SELECT
          DISTINCT "fact_targets_key__dim_product".product_name "dim_product__product_name",
          "fact_targets_key__fact_targets".target_id "fact_targets__target_id"
        FROM
          (
            SELECT
              *
            FROM
              (
                VALUES
                  (1001, 'Laptop', 'Electronics'),
                  (1002, 'Phone', 'Electronics'),
                  (1003, 'Desk', 'Furniture')
              ) AS t(product_id, product_name, category)
          ) AS "fact_targets_key__dim_product"
          LEFT JOIN (
            SELECT
              *
            FROM
              (
                VALUES
                  (1, 1001, 500, '2024-02-01'),
                  (2, 1002, 700, '2024-02-02'),
                  (3, 1001, 800, '2024-02-03'),
                  (4, 1003, 600, '2024-02-04')
              ) AS t(sale_id, product_id, sales_value, sale_date)
          ) AS "fact_targets_key__fact_sales" ON "fact_targets_key__dim_product".product_id = "fact_targets_key__fact_sales".product_id
          LEFT JOIN (
            SELECT
              *
            FROM
              (
                VALUES
                  (1, 1001, 1500, '2024-02'),
                  (2, 1002, 1200, '2024-02'),
                  (3, 1003, 1000, '2024-02')
              ) AS t(
                target_id,
                product_id,
                target_value,
                target_month
              )
          ) AS "fact_targets_key__fact_targets" ON "fact_targets_key__dim_product".product_id = "fact_targets_key__fact_targets".product_id
      ) AS "keys"
      LEFT JOIN (
        SELECT
          *
        FROM
          (
            VALUES
              (1, 1001, 1500, '2024-02'),
              (2, 1002, 1200, '2024-02'),
              (3, 1003, 1000, '2024-02')
          ) AS t(
            target_id,
            product_id,
            target_value,
            target_month
          )
      ) AS "fact_targets_key__fact_targets" ON "keys"."fact_targets__target_id" = "fact_targets_key__fact_targets".target_id
    GROUP BY
      1
  ) as q_1 ON (
    q_0."dim_product__product_name" = q_1."dim_product__product_name"
    OR (
      q_0."dim_product__product_name" IS NULL
      AND q_1."dim_product__product_name" IS NULL
    )
  )
ORDER BY
  2 DESC
LIMIT
  5000

Does this help? Also, which database do you use with Cube? Is it MS SQL?

@igorlukanin igorlukanin self-assigned this Feb 13, 2025
@ElisabethPA
Copy link
Author

Thank you for a quick response on your side.
The results of the query you have suggested is not optimized well from a SQL- point.
We are using Microsoft SQL server as a backend.

What we expect to be a correct SQL query is the following. As you can notice, we reduced the usage of Fact tables to be only one-time per sub query. According to execution plan we tested both with you example and our suggested example below, the performance will be much better with our suggestion. This is important to us since our Fact tables contain millions of rows.

We tried the same model in Power BI and queries in Power BI are aligned with our query below. So, is there a possibility to get same results in Cubedev?

SELECT
q_0."dim_product__product_name",
"fact_sales__total_sales" "fact_sales__total_sales",
"fact_targets__total_targets" "fact_targets__total_targets"
FROM
(
SELECT
"fact_sales_key__dim_product".product_name as "dim_product__product_name",
sum("fact_sales_key__fact_sales".sales_value) "fact_sales__total_sales"
FROM
(
SELECT
*
FROM
(
VALUES
(1001, 'Laptop', 'Electronics'),
(1002, 'Phone', 'Electronics'),
(1003, 'Desk', 'Furniture')
) as t(product_id, product_name, category)
) AS "fact_sales_key__dim_product"
LEFT JOIN (
SELECT
*
FROM
(
VALUES
(1, 1001, 500, '2024-02-01'),
(2, 1002, 700, '2024-02-02'),
(3, 1001, 800, '2024-02-03'),
(4, 1003, 600, '2024-02-04')
) AS t(sale_id, product_id, sales_value, sale_date)
) AS "fact_sales_key__fact_sales" ON "fact_sales_key__dim_product".product_id = "fact_sales_key__fact_sales".product_id
GROUP BY
"fact_sales_key__dim_product".product_name
) as q_0
INNER JOIN (
SELECT
"fact_targets_key__dim_product".product_name as "dim_product__product_name",
sum("fact_targets_key__fact_targets".target_value) "fact_targets__total_targets"
FROM
(
SELECT
*
FROM
(
VALUES
(1001, 'Laptop', 'Electronics'),
(1002, 'Phone', 'Electronics'),
(1003, 'Desk', 'Furniture')
) as t(product_id, product_name, category)
) AS "fact_targets_key__dim_product"
LEFT JOIN (
SELECT
*
FROM
(
VALUES
(1, 1001, 1500, '2024-02'),
(2, 1002, 1200, '2024-02'),
(3, 1003, 1000, '2024-02')
) AS t(target_id,product_id,target_value,target_month)
) AS "fact_targets_key__fact_targets" ON "fact_targets_key__dim_product".product_id = "fact_targets_key__fact_targets".product_id
GROUP BY
"fact_targets_key__dim_product".product_name
) as q_1 ON (
q_0."dim_product__product_name" = q_1."dim_product__product_name"
OR (
q_0."dim_product__product_name" IS NULL
AND q_1."dim_product__product_name" IS NULL
)
)
ORDER BY
2 DESC

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question The issue is a question. Please use Stack Overflow for questions.
Projects
None yet
Development

No branches or pull requests

2 participants