The Importance of Sort Order

There’s a bug in Tivoli Storage Manager (TSM). Given a set of tapes to process in order to reclaim the space occupied by expired data, it can fail to complete the task because it runs out of scratch tapes, and manual intervention is then required to finish the job. Let me describe the situation and let you describe the bug. It’s an elementary problem.

You have two empty containers, and five partially full containers, as shown below. You must move the contents of the five partially full containers to the empty containers, leaving as many free containers as possible when you are done. Place the containers in the order in which you will move the contents.

% full: 0,0,10,50,66,33,25

TSM’s operations are coordinated by an internal database, which has an SQL interface. Say you tell TSM to reclaim volumes whose reclaimable percentage is greater than 33%. It will issue a query something like this and work with the volumes returned.

SELECT volume_name FROM volumes WHERE pct_reclaim > 33

That statement returns the following.

VOLUME_NAME
------------------
001004
001010
001077
001095
001121
001141
001146
001155

But let’s see how much of each of those volumes is reclaimable.

SELECT volume_name,pct_reclaim FROM volumes WHERE pct_reclaim > 33

VOLUME_NAME            PCT_RECLAIM
------------------     -----------
001004                        33.2
001010                        55.6
001077                        33.2
001095                        33.3
001121                        43.2
001141                        35.6
001146                        36.7
001155                        74.8

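Remember the shuffling exercise from earlier? The volumes with the most reclaimable space hold the least live data, so emptying them first frees tapes soonest. That means sorting the list.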

SELECT volume_name,pct_reclaim FROM volumes WHERE pct_reclaim > 33 ORDER BY pct_reclaim DESC

VOLUME_NAME            PCT_RECLAIM
------------------     -----------
001155                        74.8
001010                        55.6
001121                        43.2
001146                        36.7
001141                        35.6
001095                        33.3
001004                        33.2
001077                        33.2
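
To make the shuffling exercise concrete, here is a minimal Python sketch of a toy model. It is not TSM's algorithm. It assumes containers are emptied strictly one at a time, and that the job stops after a fixed amount of data has been moved. The function name containers_freed and the budget of 100 (one container's worth of data) are my own illustrative choices.

```python
def containers_freed(fill_levels, budget):
    """Count containers fully emptied before the data-movement budget runs out.

    fill_levels: percent full of each source container, in processing order.
    budget: total data (in percent-of-a-container units) that can be moved
            before the job stops. This is a hypothetical limit for the toy model.
    """
    freed = 0
    for level in fill_levels:
        if level > budget:
            break  # this container could only be partially emptied
        budget -= level
        freed += 1
    return freed


# The five partially full containers from the exercise (percent full).
containers = [10, 50, 66, 33, 25]

as_listed = containers_freed(containers, budget=100)
emptiest_first = containers_freed(sorted(containers), budget=100)
fullest_first = containers_freed(sorted(containers, reverse=True), budget=100)

print(as_listed, emptiest_first, fullest_first)  # prints: 2 3 1
```

In this model, the same amount of work frees three containers when the emptiest ones go first, two in the order listed, and only one when the fullest go first. Emptiest first is the same thing as ORDER BY pct_reclaim DESC.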

I simplified a bit above. TSM attempts to optimize tape operations. The problem, however, is still how it sorts the list. Here’s an example from actual operations last night. Before space reclamation began, we had the following volumes available for reclamation.

VOLUME_NAME            PCT_RECLAIM
------------------     -----------
001001                        47.2
001093                        43.2
001078                        40.7
001198                        38.4
001163                        37.1
001067                        36.9
001016                        35.5
001004                        35.3

After space reclamation ran and used all four of the available scratch tapes, we were left with this.

VOLUME_NAME            PCT_RECLAIM
------------------     -----------
001078                        97.2
001001                        88.9
001198                        82.5
001016                        74.2
001093                        71.5
001163                        68.5
001004                        59.2
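
The arithmetic behind those two tables can be sketched in Python. This rests on assumptions that are mine, not TSM's: all volumes have equal capacity, the live data on a volume is 100 minus its PCT_RECLAIM, and a volume missing from the second list was fully emptied. The helper names are illustrative only.

```python
# PCT_RECLAIM before and after the reclamation run, copied from the tables above.
before = {"001001": 47.2, "001093": 43.2, "001078": 40.7, "001198": 38.4,
          "001163": 37.1, "001067": 36.9, "001016": 35.5, "001004": 35.3}
after = {"001078": 97.2, "001001": 88.9, "001198": 82.5, "001016": 74.2,
         "001093": 71.5, "001163": 68.5, "001004": 59.2}


def live_data(volumes):
    """Total live data, in percent-of-a-tape units (assumes 100 - pct_reclaim)."""
    return sum(100.0 - pct for pct in volumes.values())


# Volumes that no longer appear after the run (assumed fully emptied).
emptied = sorted(set(before) - set(after))

# Data moved during the run, under the assumptions above.
moved = live_data(before) - live_data(after)

# How many volumes the same amount of movement could have emptied if they
# had been processed one at a time, most reclaimable first.
could_have_freed = 0
remaining = moved
for pct in sorted(before.values(), reverse=True):
    data = 100.0 - pct
    if data > remaining:
        break
    remaining -= data
    could_have_freed += 1

print(emptied)            # ['001067']
print(round(moved, 1))    # 327.7
print(could_have_freed)   # 5
```

Under those assumptions, roughly 3.3 tapes' worth of data was moved and one volume was emptied, where working down a list sorted by pct_reclaim DESC could have emptied five.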