Implementing Many-to-many Relationships in Data Warehousing

Posted in: Technical Track

This article will discuss how to make many-to-many relationships in data warehousing easily queried by novice SQL users using point-and-click query tools.

This is a big problem with Oracle Discoverer-like tools where the metadata layer is basically a set of pre-joined tables from which the user simply clicks on columns and hits the run button. You can create custom complex queries that they can run, but then every query is custom, which defeats the purpose of the tool in the first place.

The design goal is to create a structure that is simple for the end user and which normally translates to something as flat as possible. This article will go through the different methods of implementing many-to-many relationships, and look at their effect on query complexity, especially for someone who use a tool that hides the SQL.

The typical data warehouse data model is a fact table surrounded by many dimension tables.

In the world of data warehousing, many-to-many relationships are sometimes unavoidable, and we all know that you do not want to normalize your fact tables. One example would something like sports interests or job categories that you’re qualified to work for. The model for that in a normalized OLTP environment would be something like:

The data warehouse world does not function with normalized fact tables since you need to start doing lots of select count(distinct) resulting in facts columns not being additive. I won’t get into that discussion here, but search for articles or literature by Ralph Kimball, the guru of data warehouse design.

Fortunately, there are a few ways of implementing many-to-many relationships for data warehousing. Typically, most DBAs pick either the Boolean Column, the Multiple Column, or the Bridge Table method.

The boolean column method consists of creating a column for each possible value, like so:

Each boolean column can be a numeric (1 or 0), or character (Y or N).

The multiple column method consists of having columns for the number match the number of choices one can make. This has its limitations since it is tightly coupled to the application; but is easily transformed. The typical data model would look like:

In this case, you could only have five sports even though you could have several dozen choices.

The last method is the Bridge Table method, which you may find in data warehouse literature. It is basically a many-to-many relationship attached to a non-normalized fact table. It would typically look like:

Note that the weighing factor is there mainly to distribute the dollar amounts evenly across the categories, where the sum of the parts cannot be greater than the total — for example, if one user has interests in Football and Baseball and spends $10, and the other has interests in Football and Soccer and spends $5. A revenue by breakdown cannot be…

Football: $15
Baseball: $10
Soccer:$5
Total: $30

…when in reality I only made $15. The weighing factor here would be equal to the number of interests for that user (i.e. it would be 2 for each user in this instance).

The result would then be:
Football: $7.5
Baseball: $5
Soccer: $2.5
Total: $15

Back to the main topic. Now that we have our methods, let’s see how easily we can write queries against them. I will have 4 queries for each method:

  1. Breakdown count by sport
  2. an “or” condition: count of basketball or football
  3. an “and” condition: count of basketball and football
  4. complex boolean condition: count of (basketball and football) or (soccer and hockey)

The Boolean Method

Breakdown count by sport:

select 'football' sports, count(*) total from fact f, sports_interest_dm s 
    where f.sports_dm_id = s.id and football_ind = 1
union
select 'basketball' sports, count(*) total from fact f, sports_interest_dm s
    where f.sports_dm_id = s.id and basketball_ind = 1
union
select 'hockey' sports, count(*) total from fact f, sports_interest_dm s
    where f.sports_dm_id = s.id and hockey_ind = 1
union
select 'baseball' sports, count(*) total from fact f, sports_interest_dm s
    where f.sports_dm_id = s.id and baseball_ind = 1
union
select 'soccer' sports, count(*) total from fact f, sports_interest_dm s
    where f.sports_dm_id = s.id and soccer_ind = 1
;

You can also do it with decodes and have your query tool transpose the results.

select sum(decode(football_ind,1,1,0)) football,
       sum(decode(basketball_ind,1,1,0)) basketball,
       sum(decode(hockey_ind,1,1,0)) hockey,
       sum(decode(baseball_ind,1,1,0)) baseball,
       sum(decode(soccer_ind,1,1,0)) soccer
from fact f, sports_interest_dm s where f.sports_dm_id = s.id;

An “or” condition: count of basketball or football:

select count(*) from fact f, sports_interest_dm s where f.sports_dm_id = s.id and (football_ind = 1 or baseball_ind =1);

An “and” condition: count of basketball and football:

select count(*) from fact f, sports_interest_dm s where f.sports_dm_id = s.id and (football_ind = 1 and baseball_ind =1);

Complex boolean condition: count of (basketball and football) or (soccer and hockey):

select count(*) from fact f, sports_interest_dm s
where f.sports_dm_id = s.id
and ((football_ind = 1 and baseball_ind =1) or (soccer_ind=1 and hockey_ind=1));

The Multiple Column Method

Breakdown count by sport query:

select sport, sum(total) from (
    select sports_1 sport, count(*) total from sports_dm s, fact f 
        where s.id= f.sports_dm_id group by sports_1
    union all
    select sports_2 sport, count(*) total from sports_dm s, fact f 
        where s.id= f.sports_dm_id group by sports_2
    union all
    select sports_3 sport, count(*) total from sports_dm s, fact f
        where s.id= f.sports_dm_id group by sports_3
    union all
    select sports_4 sport, count(*) total from sports_dm s, fact f
        where s.id= f.sports_dm_id group by sports_4
    union all
    select sports_5 sport, count(*) total from sports_dm s, fact f
        where s.id= f.sports_dm_id group by sports_5
)
group by sport;

An “or” condition: count of basketball or football:

select count(*) from sports_dm s, fact f where s.id= f.sports_dm_id
and ( sports_1 in ('football','basketball')
    or sports_2 in ('football','basketball')
    or sports_3 in ('football','basketball')
    or sports_4 in ('football','basketball')
    or sports_5 in ('football','basketball')
) ;

An “and” condition: count of basketball or football:

select count(*) from fact f
where sports_dm_id in  ( select id from sports_dm
        where sports_1 in ('football')
        or sports_2 in ('football')
        or sports_3 in ('football')
        or sports_4 in ('football')
        or sports_5 in ('football')
    intersect
    select id from sports_dm
        where sports_1 in ('basketball')
        or sports_2 in ('basketball')
        or sports_3 in ('basketball')
        or sports_4 in ('basketball')
        or sports_5 in ('basketball')
) ;

Complex boolean condition: count of (basketball and football) or (soccer and hockey):

select count(*) from fact f
where sports_dm_id in  ( (select id from sports_dm
        where sports_1 in ('football')
        or sports_2 in ('football')
        or sports_3 in ('football')
        or sports_4 in ('football')
        or sports_5 in ('football')
    intersect
    select id from sports_dm
        where sports_1 in ('basketball')
        or sports_2 in ('basketball')
        or sports_3 in ('basketball')
        or sports_4 in ('basketball')
        or sports_5 in ('basketball')
        )
    union
    (select id from sports_dm
        where sports_1 in ('soccer')
        or sports_2 in ('soccer')
        or sports_3 in ('soccer')
        or sports_4 in ('soccer')
        or sports_5 in ('soccer')
    intersect
    select id from sports_dm
        where sports_1 in ('hockey')
        or sports_2 in ('hockey')
        or sports_3 in ('hockey')
        or sports_4 in ('hockey')
        or sports_5 in ('hockey')
        )
) ;

Just imagine a few dimensions like that, and you would end up exceeding the SQL query text limit very quickly.

The Bridge Method

Breakdown count by sport query:

select s.description sports, count(*) from fact f, sports_ref s, sports_bridge_details sbd
    where f.sports_bridge_id(+) = sbd.sports_bridge_id
    and  sbd.sports_id(+) = s.id
group by s.description;

An “or” condition: count of basketball or football:

select  count(*) from fact f, sports_bridge sb
    where f.sports_bridge_id = sb.id
and sb.id in ( select sbd.sports_id from sports_bridge_details sbd, sports_ref s
    where sbd.sports_id = s.id
    and s.description in ('basketball','football'));

An “and” condition: count of basketball or football:

select  count(*) from fact f, sports_bridge sb
    where f.sports_bridge_id = sb.id
and sb.id in ( select sbd.sports_id from sports_bridge_details sbd, sports_ref s
    where sbd.sports_id = s.id
    and s.description in ('basketball','football')
    group by sbd.sports_id having count(*) >1);

This works, but you could also have an intersect in the subquery.

Complex boolean condition: count of (basketball and football) or (soccer and hockey):

select  count(*) from fact f, sports_bridge sb
where f.sports_bridge_id = sb.id
and sb.id in (  select sbd.sports_id from sports_bridge_details sbd, sports_ref s
        where sbd.sports_id = s.id
        and s.description in ('basketball','football')
        group by sbd.sports_id having count(*) >1
    union
    select sbd.sports_id from sports_bridge_details sbd, sports_ref s
        where sbd.sports_id = s.id
        and s.description in ('soccer','hockey')
        group by sbd.sports_id having count(*) >1);

Again, intersects can be substituted for group by having count(*) >1.

The following table summarizes query complexity vs. method for ease of query writing in an ad hoc query tool such as Oracle Discoverer.

Boolean method Multiple column method Bridge table method
Breakdown count by sport query Complex Complex Easy
An “or” condition: count of basketball or football Easy Complex Easy
An “and” condition: count of basketball or football Easy Complex Medium
Complex boolean condition: count of (basketball and football) or (soccer and hockey) Easy Complex Complex

As you can see, the Boolean Method is far superior in every category except the breakdown, and the Bridge is a close second. You have to use the Bridge method if the number of potential values in the dimension exceeds 100 columns or so. You can do it, but it looks ugly. The best of both worlds would be to merge the Boolean and Bridge column methods. Its fairly straight–forward: add the boolean columns to the bridge table as follows:

I hope this was informative, and I look forward to your feedback.

email
Want to talk with an expert? Schedule a call with our team to get the conversation started.

11 Comments. Leave new

bridge-4.png is used twice, in the Multiple Column Method and in the Bridge Table method. If the methods really make the model look the same, maybe you can comment on it. If not, then replacing one or the other picture would be good.

Reply

The article was really informative and it gave me a good idea on how to resolve N:N relationships. Thank you for the same.

Reply

Nothing against the author or the Python Group as this is a great article. But, anyone worth their salt when it comes to databases should see this as another example of how the dimensional model fails. Trust me, using the dimensional model is doing a disservice to yourself and the company you work for. Even Kimball cannot prove this model has any correctness to it. Where as, the relational model as 2500+ years of math behind it. Don’t even get me started on trying to get the correct temporal relations in there. In short: don’t be fooled by the marketing!!

Reply

good article, but i have a question. in the the Bridge Table method graphic, the weighting factor is on the bridge table. in the kimball dwh books, the weighting factor is on what you refer to here as sports_bridge_details. can you clarify this discrepancy? why do you place the weighting factor on the group entity? thanks

Reply

It was very informative.
Could you please send me with some example how to load the bridge table and hierarchy table in DW

Reply

Could you please send me the logic how to load bridge table in DW.

Reply

@Ryan: please don’t take this ill-given advice as a counter-example for Kimball and dimensional modeling. Seeing this article made me my poor BI heart bleed to say the least….

The thing to remember when handling relationships (including N:N) in a dimensional modeling exercise is to express them as fact tables. In this specific case a fact-less fact table. Of-course it is often needed to create analytical or aggregate views (facts) on top of that to make reporting easier. That is especially the case if you are going to let the users report themselves on data structures like these. Your solution is a nightmare in that respect. You will get Cartesian products simply based on the fact that you slammed a N:N relationship into a dimensional model. In a self-service reporting setting this is deadly. A users drops in another column from an extra table and the number of returned rows changes instantly.

Reply

In a star the fact is a bridge. It attempts to ‘bridge’ the many-many relationship amongst the dimensional context. In other words without any measure [ factless facts ] it is in BCNF; whereupon the introduction of a metric, the fact is in 3NF.

Reply

Man, that was so helpfull, many thanks.

Reply
Daniel Adeniji
December 13, 2013 8:54 pm

Nice one – Very nice assist.

I will like to ask a couple of after class sessions; just to prove that I tried my best to follow your thoughts.

1) Per The Boolean Method

a) As Decode does not appear to work in MS SQL Server, can you please extend your post to include how to cover using case statement/etc

2) The Bridge Method

a) Awful ““or” condition” showing up in Microsoft IE and Google Chrome

b) (+) does not work in Microsoft – SQL Server – Transact SQL – should be *= or better still left outer join

c) Lost me @


select count(*) from fact f, sports_bridge sb
where f.sports_bridge_id = sb.id
and sb.id in ( select sbd.sports_id from sports_bridge_details sbd, sports_ref s
where sbd.sports_id = s.id
and s.description in (‘basketball’,’football’)
group by sbd.sports_id having count(*) >1
union
select sbd.sports_id from sports_bridge_details sbd, sports_ref s
where sbd.sports_id = s.id
and s.description in (‘soccer’,’hockey’)
group by sbd.sports_id having count(*) >1);

i] How can you union result of count(*) and sbd.sports_id

3) I think you can also say a lot more as to the benefits of sports_bridge and sports_bridge_detail and how one decides which attributes goes in which dimension table

Again, thanks much for working really hard and commiting so much to the Public Domain.

Like fine wine, your posting stood the test of time (Jan 19, 2007). It is over 6 years later.

Heavenly Blessings,

Daniel Adeniji
Dec-13-2013

Reply
Antonio Astudillo
February 8, 2016 7:43 am

Hello everyone,
The article was really informative and it gave me a good idea on how to resolve N:N relationships.
I have to model a star of Bugs Tracking. One of dimensions is named BUGS. We have differents status for one bug as accepted , new , closed etc .
One status relate to several bugs and one bug relate to several status.
You write :
“The best of both worlds would be to merge the Boolean and Bridge column methods”
Could you please me help to design this merge model (Boolean and Bridge column ) for my star ?
Must I include into the BUGS dimension boolean fields as ? :
accepted , new , closed (9 status fields )
I do not understand the logic of merge Boolean and Bridge column in this case.
Is it for reduce the number of rows if the BUGS dimension?
Thanks in advance.

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *