So you have designed your beautiful data mart and you want your load procedure to run as fast as it can, but maybe not load too much at once? I’ve tried my luck with running sums and I’m happy with the result.
What exactly do we want to achieve? The simple answer is: no matter how much data we have to load, we want reasonable runtimes. Most of the time the data you load will not differ much in structure; it’s just the amount that varies. That’s good news, because it means you can determine a reasonable number of rows per load by testing a bit. Let’s say one million is perfect in our scenario. I know it’s obvious, but I’ll say it anyway: new data usually flows into our data warehouse as time goes by. As such, we will usually load using a timestamp, be it the natural date when the data was created or the date we loaded it into our DWH.
What we want to do now is create date batches that contain as many rows as possible, but no more than one million. In my example I will use the load date and a very simple design with a stage table as the base and a data mart table as the target.
Disclaimer: I’m skipping indexes, log messages, transactions and complex checks here, but don’t forget that these exist in the real world 😉
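For reference, here is a minimal sketch of the two tables the snippets below assume. The column names col1, col2, col3 and their types are made up for illustration; only the two load date columns matter for the batching logic.

create table stages.mystage (
      col1            int
    , col2            varchar(100)
    , col3            decimal(18, 2)
    , stage_load_date datetime2   --when the row arrived in the stage
);

create table data_marts.mytable (
      col1            int
    , col2            varchar(100)
    , col3            decimal(18, 2)
    , table_load_date datetime2   --the stage_load_date the row was loaded with
);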
--A regular delta load, using the load time stamps, nothing special here.
declare @max_table_load_date datetime2 = (select max(table_load_date) from data_marts.mytable);

select
      col1
    , col2
    , col3
    , ...
    , stage_load_date
into #delta
from stages.mystage
where stage_load_date > @max_table_load_date
;
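Without batching, the load would simply finish by pushing everything in #delta into the data mart in one statement. A minimal sketch, using the placeholder columns from above:

insert into data_marts.mytable (col1, col2, col3, table_load_date)
select col1, col2, col3, stage_load_date
from #delta
;

That is perfectly fine, until a backlog or an initial load hands you far more than a million rows at once.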
But what if a single stage_load_date contains so few rows that looping over the dates one at a time would generate more overhead than necessary? Then we want to pack several dates into one batch, and a running sum over the per-date row counts does exactly that.
--First we want to know how many rows we have per stage_load_date. We will use this information later.
select
      stage_load_date
    , count(*) as cnt
into #load_dates
from #delta
group by stage_load_date
;

--Now we are going to select a date that fits our row requirement. The running sum is the key to the selection.
declare @max_batch_date datetime2;

while exists (select 1 from #load_dates)
begin
    select @max_batch_date = max(sub.stage_load_date)
    from (
        --#load_dates already has one row per stage_load_date, so no further grouping is needed here
        select
              stage_load_date
            , sum(cnt) over (order by stage_load_date) as runsum
        from #load_dates
    ) sub
    where sub.runsum <= 1000000 --that's one million
    ;

    --There is still data to be loaded, but what if there is one date with more than a million rows?
    if (@max_batch_date is null)
        set @max_batch_date = (select min(stage_load_date) from #load_dates);

    select *
    from #delta
    where stage_load_date <= @max_batch_date
    ;
    --Or do whatever you want to do with your delta. I'm sure there are plenty of different
    --things you can do before inserting it into the data mart.

    --The last part should be the following:
    delete ld
    from #load_dates ld
    where ld.stage_load_date <= @max_batch_date
    ;
end;
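To make the running-sum selection concrete, here is one pass over a hypothetical #load_dates with the one-million-row limit (all numbers are made up):

stage_load_date   cnt       runsum
2024-01-01        400000     400000
2024-01-02        350000     750000
2024-01-03        300000    1050000
2024-01-04        500000    1550000

The largest date whose runsum is still less than or equal to one million is 2024-01-02, so the first batch covers 2024-01-01 and 2024-01-02 with 750000 rows. After those two dates are deleted from #load_dates, the running sum starts from zero again, so the second pass sees 2024-01-03 with runsum 300000 and 2024-01-04 with runsum 800000 and loads both in a single batch of 800000 rows. Inside the loop, the select from #delta is where the actual work happens; in a real procedure it would typically be replaced by the insert sketched above, restricted to stage_load_date <= @max_batch_date.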
I really like this solution because it’s easy to implement and very light on resources. I’m using it on MSSQL 2019 with columnstore indexes, and the loop overhead is barely noticeable.