Time Slider is a feature in Solaris that allows you to open past versions of your files.
It is implemented via a service which creates automatic ZFS snapshots every 15 minutes (frequent), hourly, daily, weekly and monthly. By default it retains only 3 frequent, 23 hourly, 6 daily, 3 weekly and 12 monthly snapshots.
I am using Time Slider on several Solaris 11.x servers and I found the same problem on all of them – it doesn’t create any automatic snapshots for some datasets.
For example, it doesn’t create any snapshots for the Solaris root dataset rpool/ROOT/solaris. However, it creates snapshots for the leaf dataset rpool/ROOT/solaris/var. The rpool/ROOT dataset also doesn’t have any automatic snapshots, but rpool itself has snapshots, so it’s not easy to understand what is happening.
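To see which datasets are affected on a given server, a small helper along the following lines can compare the dataset list against the snapshot list. This is only a sketch: the function names are made up for illustration, and it assumes that Time Slider snapshot names start with the prefix zfs-auto-snap, which should be verified against the actual snapshot names on the system.

```python
import subprocess


def datasets_without_auto_snapshots(datasets, snapshots, prefix="zfs-auto-snap"):
    """Return the datasets that have no snapshot whose name starts with prefix.

    datasets  -- dataset names, e.g. "rpool/ROOT/solaris"
    snapshots -- full snapshot names, e.g. "rpool@zfs-auto-snap_hourly-..."
    prefix    -- assumed naming prefix of Time Slider snapshots
    """
    covered = set()
    for snap in snapshots:
        dataset, _, snapname = snap.partition("@")
        if snapname.startswith(prefix):
            covered.add(dataset)
    return [d for d in datasets if d not in covered]


def zfs_names(*args):
    """Run 'zfs list -H -o name <args>' and return the names as a list."""
    out = subprocess.check_output(("zfs", "list", "-H", "-o", "name") + args)
    return out.decode().split()


# Typical use on a live system (hypothetical pool name "rpool"):
#   datasets = zfs_names("-r", "-t", "filesystem,volume", "rpool")
#   snapshots = zfs_names("-r", "-t", "snapshot", "rpool")
#   print(datasets_without_auto_snapshots(datasets, snapshots))
```

Datasets that are deliberately excluded with com.sun:auto-snapshot=false will show up in the result as well, so the output has to be read together with the property values.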
I searched for this problem and found that other people have noticed it as well.
There is a thread about it in the Oracle Community:
Solaris 11, 11.1, 11.2 and 11.3: Possible Bug in Time-Slider/ZFS Auto-Snapshot
The last message mentions the following bug in My Oracle Support:
Bug 15775093 : SUNBT7148449 time-slider only taking snapshots of leaf datasets
This bug was created on 24-Feb-2012 but still isn’t fixed.
After more searching, I found this issue in the bug tracker of the illumos project:
Bug #1013 : time-slider fails to take snapshots on complex dataset/snapshot configuration
The original poster (OP) ran into a problem with a complex pool configuration in which many nested datasets had different values for the com.sun:auto-snapshot* properties.
He dug into the Time Slider Python code and proposed a change, which was blindly accepted without proper testing and ended up in Solaris 11.
Unfortunately, this change introduced a serious bug that broke the logic for creating recursive snapshots where they are possible.
Let me quickly explain how this is supposed to work.
If a pool has com.sun:auto-snapshot=true for the main dataset and all child datasets inherit this property, Time Slider can create a recursive snapshot for the main dataset and skip all child datasets, because they should already have the same snapshot.
However, if any child dataset has com.sun:auto-snapshot=false, Time Slider can no longer do this.
In this case the intended logic is to create recursive snapshots for all sub-trees which don’t have any excluded children and then create non-recursive snapshots for the remaining datasets which also have com.sun:auto-snapshot=true.
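The intended logic described above can be sketched in a few lines of Python. This is a simplified illustration, not the Time Slider implementation: the function names are hypothetical, and the input is assumed to be a mapping from dataset name to the effective value of com.sun:auto-snapshot.

```python
def is_descendant(name, ancestor):
    """True if dataset 'name' lives somewhere below dataset 'ancestor'."""
    return name.startswith(ancestor + "/")


def plan_snapshots(props):
    """Split the enabled datasets into (recursive, single) snapshot lists.

    props maps dataset name -> bool, the effective com.sun:auto-snapshot value.
    """
    included = sorted(name for name, enabled in props.items() if enabled)
    excluded = [name for name, enabled in props.items() if not enabled]

    recursive, single = [], []
    for name in included:
        if any(is_descendant(x, name) for x in excluded):
            # An excluded descendant rules out a recursive snapshot here.
            single.append(name)
        else:
            recursive.append(name)

    # A dataset below another recursive dataset is already covered by it.
    final_recursive = [
        name for name in recursive
        if not any(is_descendant(name, other) for other in recursive)
    ]
    return final_recursive, single


props = {
    "rpool": True,
    "rpool/ROOT": True,
    "rpool/ROOT/solaris": True,
    "rpool/ROOT/solaris/var": True,
    "rpool/dump": False,
    "rpool/swap": False,
    "rpool/export": True,
    "rpool/export/home": True,
}
print(plan_snapshots(props))
```

For this layout the plan is a recursive snapshot of rpool/ROOT and rpool/export, plus a single (non-recursive) snapshot of rpool, because rpool has excluded children.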
The algorithm builds separate lists of datasets for recursive snapshots and for single snapshots.
Here is an excerpt from /usr/share/time-slider/lib/time_slider/zfs.py:
    # Now figure out what can be recursively snapshotted and what
    # must be singly snapshotted. Single snapshot restrictions apply
    # to those datasets who have a child in the excluded list.
    # 'included' is sorted in reverse alphabetical order.
    for datasetname in included:
        excludedchild = False
        idx = bisect_right(everything, datasetname)
        children = [name for name in everything[idx:] if \
                    name.find(datasetname) == 0]
        for child in children:
            idx = bisect_left(excluded, child)
            if idx < len(excluded) and excluded[idx] == child:
                excludedchild = True
                single.append(datasetname)
                break
        if excludedchild == False:
            # We want recursive list sorted in alphabetical order
            # so insert instead of append to the list.
            recursive.append (datasetname)
This part is the same in all versions of Solaris 11 (from 11 11/11 to 11.3, which is currently the latest).
If we look at the comment above the last line, it says that it should do “insert instead of append to the list”.
This is because the included list is sorted in reverse alphabetical order when it is built.
And this is the exact line that was modified by the OP. When append is used instead of insert, the recursive list becomes sorted in reverse alphabetical order as well.
The next part of the code traverses the recursive list and tries to skip all child datasets whose parent is already marked for a recursive snapshot:
    for datasetname in recursive:
        parts = datasetname.rsplit('/', 1)
        parent = parts[0]
        if parent == datasetname:
            # Root filesystem of the Zpool, so
            # this can't be inherited and must be
            # set locally.
            finalrecursive.append(datasetname)
            continue
        idx = bisect_right(recursive, parent)
        if len(recursive) > 0 and \
           recursive[idx-1] == parent:
            # Parent already marked for recursive snapshot: so skip
            continue
        else:
            finalrecursive.append(datasetname)
This code heavily relies on the sort order and fails to do its job when the list is sorted in reverse order.
What happens is that all datasets remain in the list with child datasets being placed before their parents.
Then the code tries to create a recursive snapshot for each of these datasets.
The operation is successful for the leaf datasets, but fails for the parent datasets because their children already have a snapshot with the same name.
The snapshots also succeed for the datasets in the single list (the ones that have excluded children).
The rpool/dump and rpool/swap volumes have com.sun:auto-snapshot=false, so rpool ends up in the single list and gets a non-recursive snapshot. That’s why rpool has snapshots.
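The effect can be reproduced outside Time Slider with a small stand-alone model. The sketch below wraps the two excerpts in functions (the function names, the use_insert switch and the simulate step are additions for illustration only) and runs them on a cut-down rpool layout. The simulate function is a modelling assumption based on the behaviour described above: a recursive snapshot is treated as failing completely if any dataset in the sub-tree already has a snapshot with that name.

```python
from bisect import bisect_left, bisect_right


def build_lists(everything, included, excluded, use_insert):
    """First excerpt: split 'included' into recursive and single lists."""
    recursive, single = [], []
    for datasetname in included:  # reverse alphabetical order
        excludedchild = False
        idx = bisect_right(everything, datasetname)
        children = [n for n in everything[idx:] if n.find(datasetname) == 0]
        for child in children:
            i = bisect_left(excluded, child)
            if i < len(excluded) and excluded[i] == child:
                excludedchild = True
                single.append(datasetname)
                break
        if not excludedchild:
            if use_insert:
                recursive.insert(0, datasetname)  # keeps alphabetical order
            else:
                recursive.append(datasetname)     # reverse alphabetical order
    return recursive, single


def prune(recursive):
    """Second excerpt: drop datasets whose parent is already in the list."""
    finalrecursive = []
    for datasetname in recursive:
        parent = datasetname.rsplit('/', 1)[0]
        if parent == datasetname:
            finalrecursive.append(datasetname)
            continue
        idx = bisect_right(recursive, parent)
        if len(recursive) > 0 and recursive[idx - 1] == parent:
            continue
        finalrecursive.append(datasetname)
    return finalrecursive


def simulate(everything, finalrecursive):
    """Model recursive snapshots as all-or-nothing (assumption, see above)."""
    have, failed = set(), []
    for ds in finalrecursive:
        tree = [n for n in everything if n == ds or n.startswith(ds + "/")]
        if any(n in have for n in tree):
            failed.append(ds)
        else:
            have.update(tree)
    return have, failed


EVERYTHING = ["rpool", "rpool/ROOT", "rpool/ROOT/solaris",
              "rpool/ROOT/solaris/var", "rpool/dump", "rpool/swap"]
EXCLUDED = ["rpool/dump", "rpool/swap"]
INCLUDED = sorted(set(EVERYTHING) - set(EXCLUDED), reverse=True)


def run(use_insert):
    recursive, single = build_lists(EVERYTHING, INCLUDED, EXCLUDED, use_insert)
    final = prune(recursive)
    have, failed = simulate(EVERYTHING, final)
    return final, single, sorted(have), failed


for label, use_insert in (("append", False), ("insert", True)):
    final, single, have, failed = run(use_insert)
    print(label, "recursive:", final, "single:", single,
          "snapshotted:", have, "failed:", failed)
```

In this model the append variant leaves rpool/ROOT/solaris/var ahead of rpool/ROOT in the final list, so the leaf gets its snapshot and the recursive snapshot of rpool/ROOT then fails, which matches the symptoms on the real servers. With insert, only rpool/ROOT remains and a single recursive snapshot covers the whole sub-tree. In both cases rpool is handled through the single list.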
Luckily, the original code was posted in the same thread so I just reverted the change:
    if excludedchild == False:
        # We want recursive list sorted in alphabetical order
        # so insert instead of append to the list.
        recursive.insert(0, datasetname)
After doing this, Time Slider immediately started creating snapshots for all datasets that have com.sun:auto-snapshot=true, including rpool/ROOT and rpool/ROOT/solaris.
So far I haven’t found any issue and snapshots work as expected.
There may be some issues with a very complex structure like the one the OP had, but his change completely destroyed the clever algorithm for doing recursive snapshots where they are possible.
Final Thoughts.
It is very strange that Oracle hasn’t paid attention to this bug and has left it hanging for more than 4 years. Maybe they consider Time Slider an unimportant desktop feature. However, I think that it’s fairly useful for servers as well.
The solution is simple, a single-line change, but it would be much better if this were resolved in a future Solaris 11.3 SRU. Until then I hope that my blog post will be useful for anyone who is trying to figure out why the automatic snapshots are not working as intended.