The Distributed Replay Utility: Replaying for Real (Where Could It Possibly Go Wrong?)

The Short Answer? (In at least a couple of places)

When I first learned about the DRU, I was very excited about the possibilities. I began to play with the utility in our test and stage environments, which are on a different SAN than our production environment. I experimented with the architecture, putting the controller and client on one server. I played with putting the client and target on one server. I tried the default architecture of one controller, one client, and one target. I modified the configuration files and used different switch options. I ran the utility tracing myself running horrible queries that didn't physically change a thing, and running traces where I was actually modifying data (usually just adding a database, then adding a table to that database) to assure myself that it was really physically modifying the target server.
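
To give a flavor of that switch experimentation, the replay side of the command line looked something like the sketch below. The server names and paths here are placeholders, and the exact switch list is worth confirming against the dreplay documentation for your build:

    REM A minimal sketch, with hypothetical machine names and paths.
    REM -m = controller, -d = controller working directory,
    REM -s = target SQL Server instance, -w = comma-separated client machines,
    REM -o = capture the replay output for review.
    dreplay replay -m DRUController01 -d "D:\DRU\WorkingDir" -s TargetSQL01 -w DRUClient01 -o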

After doing a number of smaller, limited practice runs to assure myself that I had a basic understanding of the utility and how it worked, I learned that our Infrastructure team wanted to use the DRU as part of performance testing for the new SAN they wanted to buy.  Once we were down to a couple of different vendors, it was decided that we would use the DRU to test several different workloads:  one that was more CPU intensive, one that was heavy on IO, one critical system processing run, and one run to test the processing of a tabular cube. 

This went well beyond anything I had ever used the utility for. Additionally, because the test was being done in what was essentially an isolated environment (i.e., the existing SAN could talk to the new test SAN, but they wouldn't during workload testing), it presented a unique challenge. To further complicate matters, both the IO-heavy and critical system processing runs used transactional replication in production, so the performance of replication would need to be assessed.

I began by prepping the environment on our existing SAN. I set up the DRU controller in our stage environment and ran the needed traces on client test instances there. Then, the question of how to set up the isolated test SAN environment for the replay had to be resolved. Another DRU controller was set up on a SQL Server instance on the new SAN; the new controller also served as the client and the target. This time, only the databases necessary to run the needed processes were restored. Logins were copied over to the new test SAN environment, and permissions were confirmed. One of my teammates then set up a distribution server and a server for the replicated data on the test SAN. Replication was set up and ready to go. Lastly, the traces were copied over. Infrastructure would vMotion the storage from one SAN candidate to another, making that part of the process transparent to us. Between tests, we just needed to tear down replication, do database restores, fire replication back up, and we'd be ready to go again.
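
That between-test reset boiled down to a handful of scripted steps. A rough sketch of what it looked like, with made-up server, database, and share names, and assuming the replication definitions had been scripted out ahead of time:

    REM 1. Tear down replication on the database being refreshed.
    sqlcmd -S NewSanSQL01 -E -b -Q "EXEC sp_removedbreplication 'HeavyIODB';"
    REM 2. Restore the database needed for the workload.
    sqlcmd -S NewSanSQL01 -E -b -Q "RESTORE DATABASE HeavyIODB FROM DISK = N'\\BackupShare\HeavyIODB.bak' WITH REPLACE, RECOVERY;"
    REM 3. Fire replication back up from the saved publication/subscription scripts.
    sqlcmd -S NewSanSQL01 -E -b -i "C:\DRUTest\CreatePublication.sql"
    sqlcmd -S NewSanDist01 -E -b -i "C:\DRUTest\CreateSubscription.sql"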

I used one of my last test runs from when I was first playing with the utility on the existing SAN as a control, naming it controlRun. It had used some processes such as nested cursors and CHECKDBs to generate load on the server, but modified no data. It clocked in as a 5-minute trace. I hoped it would give me some kind of idea what to expect in terms of preprocessing and playback times. I had the controller set to all the default settings for the replay. Next, I began preprocessing data. I had two of our production workloads to work with, the CriticalSystemProcessingRun and the HeavyIORun. The CriticalSystemProcessingRun, while generating IO at the beginning and end of its process, was generally more CPU intensive and a much longer process, usually taking about 2 hours. The HeavyIORun, however, lived up to its name: it typically processed millions of rows, but took only about 20 minutes or so to run.
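
For reference, kicking off preprocessing for each workload is a single call to the controller; a sketch, again with hypothetical names and paths:

    REM -i = the captured trace file, -d = the controller working directory.
    dreplay preprocess -m DRUController01 -i "D:\Traces\HeavyIORun.trc" -d "D:\DRU\WorkingDir"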

Comparing preprocessing on the new test SAN vs. the controlRun, which was done on the existing SAN:

Clearly, both the CriticalSystemProcessingRun and the HeavyIORun are much larger workloads than the very limited runs I had previously done. It also seems clear that the test SAN is much faster. I saw the difference reviewing the first step of preprocessing, where the utility is sucking up the data to prepare for replay. Our existing SAN took a full 30 seconds to preprocess the first step of the controlRun, whereas the test SAN took a minute to process over a million events on the HeavyIORun.

It’s also interesting to note how consistent the second step of preprocessing is across workloads and SAN environments (Figure 1):

Figure 1) Metrics for test SAN preprocessing. Step 2 info taken from the DRU replay screen.
Figure 2) Looking at event division in step 2 of preprocessing. Since I stopped the CriticalSystemProcessingRun when it was only 30% finished, I took that into consideration in the first column by multiplying the total events (55,105,119) by 0.30 and then dividing by the minutes.

Each ten percent interval on the CriticalSystemProcessingRun took exactly 73 minutes. Likewise, the HeavyIORun took exactly four minutes to process each 20 percent interval. Meanwhile, the controlRun had taken two minutes to process all 143,443 events. I decided to look a little deeper at the average number of events processed per minute (Figure 2). There was some difference there between workloads. This leads me to believe that while this step may be influenced by the type of events it processes, it is also a matter of how the DRU itself functions: the utility seems to take a look at the total number of events and divide them up as evenly as possible for processing. This consistency held through all three runs despite the fact that the three workloads in question were very different in duration, character, and what they did on the server.
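
To put the Figure 2 numbers in concrete terms: 30% of the CriticalSystemProcessingRun's 55,105,119 events is roughly 16.5 million, and at 73 minutes per ten percent interval that 30% took about 219 minutes, which works out to roughly 75,000 events per minute; the controlRun's 143,443 events in two minutes averages out to roughly 72,000 per minute.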

It may be apparent that I stopped looking at the preprocessing on the CriticalSystemProcessingRun at 30%.  That is because it had already taken well over 3 hours to get that far, and I could see that the events being processed weren’t going to go any faster as time went on. That clearly wasn’t going to work for performance testing on a deadline!

Trying the HeavyIORun instead was more productive – its preprocessing finished in 20 minutes. I thought we were back on track until the replay began, and the DRU told me that the HeavyIORun replay was going to take over 12 hours (for a 20-minute trace)! Contrasting that with my previous testing, the controlRun replay had also been slow (10 minutes for a trace that took about 5 minutes), but 12+ hours of replay for a 20-minute trace was a deal breaker for server testing, and disproportionately long compared to what I had come to expect. I didn't have time to troubleshoot the issue; our Infrastructure team was nearly at its time limit to finish testing on the new storage candidates.

Hoping against hope that the DRU’s estimate was inaccurate (although its stats for the preprocessing had been pretty accurate), I gave it 3 hours before I gave up and recommended doing actual test runs with the applications instead.
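
For what it's worth, watching the replay crawl along (and finally pulling the plug on it) is done from the controller as well; a sketch, with a placeholder controller name and status interval:

    REM Poll the controller every 60 seconds for replay progress.
    dreplay status -m DRUController01 -f 60
    REM Give up: cancel the current operation without the confirmation prompt.
    dreplay cancel -m DRUController01 -q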

How disappointing! Where did it go wrong?

I quickly found that I wasn’t alone.  Questions of the same nature were easily found on the internet; unfortunately, answers were not.  It quickly became clear that if I wanted to know why the DRU wasn’t performing as expected, I would need to try to find out myself.
