Impacts of using vSphere Replication with Windows VSS

Windows Volume Shadow Copy Service (VSS) can optionally be used by vSphere Replication to take file system level (or application level) consistent checkpoints. This is typically attractive for workloads like Microsoft SQL Server, so that you can do better than traditional “crash consistent” checkpoints.

While VSS has its advantages, there are some drawbacks. Please be aware that leveraging Windows VSS changes how the replication process works, especially its performance (in terms of RPO) as well as the underlying traffic patterns and demands.

Specifically, if you use VSS to quiesce the virtual machine, replication traffic cannot be spread out in small sets of bundles throughout the RPO period.

Instead, vSphere Replication transfers all the changed blocks as one set, as soon as the lightweight delta (checkpoint) is taken. This means that vSphere Replication will orchestrate the VSS quiesce, take a lightweight delta (which works in conjunction with the VMkernel to track vSCSI traffic), and transfer any delta blocks to the target site immediately.

This leads to (potentially) large spikes in replication traffic and datastore demand that might not occur without VSS enabled. The impact grows quickly when you use VSS in high change rate environments with a large number of VMs (especially if you are trending towards the 500 VM maximum for 5.5).

Let’s compare this to the default behavior.

Without VSS, vSphere Replication can transfer smaller bundles of changed blocks on an ongoing basis (in an elastic/opportunistic manner), spreading the traffic throughout the RPO period. This basically means it will strive to meet RPO, above all else, but attempt to do it in the least disruptive, most efficient way possible.

This doesn’t always happen, as it depends on a number of factors (RPO, analytics on previous delta sync durations, src/dst datastore performance, network bandwidth, etc); HOWEVER, this opportunistic replication is very cool, as it allows the local scheduler to adapt to changing conditions in the datacenter while reducing the impact of replication on production performance.
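To put rough numbers on the difference, here is a back-of-the-envelope sketch in Python. It is a toy model, not how vSphere Replication actually schedules anything: the function names, the amount of changed data, and the link speed are all assumptions made up for illustration. It simply compares sending the same changed blocks evenly across the RPO period against sending them as one set at full link speed.

```python
def spread_rate_mbps(changed_mb, rpo_minutes):
    """Average rate (Mbit/s) if changed blocks trickle out evenly over the RPO period."""
    return changed_mb * 8 / (rpo_minutes * 60)


def burst_seconds(changed_mb, link_mbps):
    """Seconds the link stays saturated if the same blocks are sent as one set."""
    return changed_mb * 8 / link_mbps


# Hypothetical workload: 50 VMs, each with 200 MB of changed blocks per 15 minute RPO.
total_changed_mb = 50 * 200

print(round(spread_rate_mbps(total_changed_mb, 15), 1))  # average Mbit/s when spread out
print(burst_seconds(total_changed_mb, 1000))              # seconds of saturation on a 1 Gbit/s link
```

In this made-up case the spread transfer averages just under 90 Mbit/s, while the burst pins a 1 Gbit/s link for 80 seconds. Same data, very different shape on the wire and on the datastore.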

What’s the bottom line here? Do whatever makes the most sense for your data protection strategy, but be cognizant that the “one little checkbox” for VSS can have an unforeseen impact on your daily operations.

vSphere Replication RPO vs Replication Scheduling

Hey all, I have been asked several times how RPOs actually work in vSphere Replication, and I think one of the best explanations is actually in the admin guide. I grabbed the following out of the guide, as I feel it is key enough to deserve some additional attention. Stay tuned below the break for additional commentary.

When you set a Recovery Point Objective (RPO) value during replication configuration, you determine the maximum data loss that you can tolerate. The RPO value affects replication scheduling, but vSphere Replication does not adhere to a strict replication schedule.

For example, when you set the RPO to 15 minutes, you instruct vSphere Replication that you can tolerate losing the data for up to 15 minutes. This does not mean that data is replicated every 15 minutes.

If you set an RPO of x minutes, the latest available replication instance can never reflect a state that is older than x minutes. A replication instance reflects the state of a virtual machine at the time the replication starts.

Assume that during replication configuration you set the RPO to 15 minutes. If the replication starts at 12:00 and it takes five minutes to transfer to the target site, the instance becomes available on the target site at 12:05, but it reflects the state of the virtual machine at 12:00. The next replication can start no later than 12:10. This replication instance is then available at 12:15 when the first replication instance that started at 12:00 expires.

If you set the RPO to 15 minutes and the replication takes 7.5 minutes to transfer an instance, vSphere Replication transfers an instance all the time. If the replication takes more than 7.5 minutes, the replication encounters periodic RPO violations.

For example, if the replication starts at 12:00 and takes 10 minutes to transfer an instance, the replication finishes at 12:10. You can start another replication immediately, but it finishes at 12:20. During the time interval 12:15-12:20, an RPO violation occurs because the latest available instance started at 12:00 and is too old.

The replication scheduler tries to satisfy these constraints by overlapping replications to optimize bandwidth use and might start replications for some virtual machines earlier than expected.

To determine the replication transfer time, the replication scheduler uses the duration of the last few instances to estimate the next one.

/End Admin Guide Snip/
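The timing examples in the guide boil down to simple arithmetic, which can be sketched in a few lines of Python. This is only my reading of the examples, not the scheduler's real logic: the function names are hypothetical, times are plain minutes, and it assumes every transfer takes the same amount of time and that replications run back to back.

```python
def latest_next_start(start, rpo, transfer):
    """Latest minute the next replication can start and still finish
    before the instance that started at `start` exceeds the RPO."""
    return start + rpo - transfer


def violation_per_cycle(rpo, transfer):
    """Minutes per cycle that the newest available instance is older than
    the RPO, assuming back-to-back replications of equal duration."""
    return max(0, 2 * transfer - rpo)


# Guide example 1: RPO 15, start at 12:00 (minute 0), 5 minute transfer.
print(latest_next_start(0, 15, 5))    # minute 10, i.e. 12:10

# Guide example 2: RPO 15, 10 minute transfers, back to back.
print(violation_per_cycle(15, 10))    # 5 minutes, i.e. 12:15-12:20

# The break-even point is a transfer time of half the RPO.
print(violation_per_cycle(15, 7.5))   # 0
```

The half-the-RPO threshold falls out of the fact that an instance reflects the state at the time its replication starts, so by the time the following instance lands, the current one is already two transfer times old.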

That all said, I would also like to point out that the vSphere Replication Scheduler works on a host level, not a global level, meaning that scheduling is done by the local VR Agent, and not by the managing vSphere Replication Management Server (VRMS). You will need to keep that in mind when evaluating bandwidth and replication windows.

vCenter Server Appliance 5.5 Syslog Collector Rotation

VMware vCenter Server has been shipping with an optional syslog collector for quite some time now, and it has proven to be an easy way to retain non-persistent logs from ESXi hosts.

The Windows Server variant of vCenter Server ships with a basic syslog collector that can optionally be integrated into vCenter as a service, allowing you to do the following:

  • Pull syslog data into support bundles for better troubleshooting.
  • See what hosts are actively logging data to the collector.
  • See active configurations for the service, such as maximum log size, rotation, and log directory.

The Appliance variant of vCenter Server, based on SLES 11, DOES ship with a built-in syslog collector, which is pre-installed and enabled; however, this variant does NOT have feature parity with the Windows version.

Instead, the Appliance ships with a variant of syslog-ng, which cannot be integrated into vCenter as a service, meaning you lose the plugin visibility and functionality described above.

To pile on to this, it also does not come with rotation pre-configured. This presents a large problem for those depending on vCenter Syslog to maintain copies of their non-persistent logs.

This is especially nasty, as the partition, /dev/sdb2, is in the middle of a non-dedicated disk, making it difficult to expand, and ships only at 20 GB.

To work around this limitation, you will need to configure rotation, which is quite easy, as logrotate already runs as an active cron job.

To enable basic log rotation, SSH to the vCenter Server Appliance, navigate to /etc/logrotate.d, and modify “syslog” to include the following statements:

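# Rotate every remote host log collected under /var/log/remote
/var/log/remote/*/*/* {
    # rotate once a day
    daily
    # compress rotated logs, but leave the newest rotated copy uncompressed
    compress
    delaycompress
    # keep 14 rotated copies per log file
    rotate 14
    postrotate
        /etc/init.d/syslog-collector reload > /dev/null
    endscript
}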
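Since the partition is only 20 GB, it is worth sanity checking how much space this rotation policy will actually hold on to. Here is a rough sizing sketch in Python. The daily log volume and compression ratio are pure assumptions (measure your own), and the function name is hypothetical. It assumes daily rotation with delaycompress, so at the worst point you are holding the active logs, the newest rotated copy uncompressed, and the remaining rotated copies compressed.

```python
def retained_log_gb(daily_gb, rotate=14, compression_ratio=10.0):
    """Approximate peak disk usage in GB for daily rotation with delaycompress.

    daily_gb          -- total log volume per day across all hosts (assumed)
    rotate            -- number of rotated copies kept (14 in the config above)
    compression_ratio -- assumed compression ratio for rotated text logs
    """
    uncompressed = 2 * daily_gb                       # active logs + newest rotated copy
    compressed = (rotate - 1) * daily_gb / compression_ratio
    return uncompressed + compressed


# Hypothetical environment producing 1 GB of host logs per day.
print(round(retained_log_gb(1.0), 2))   # roughly 3.3 GB, well under 20 GB

# The same policy at 8 GB per day no longer fits.
print(round(retained_log_gb(8.0), 2))   # roughly 26.4 GB
```

If the estimate lands anywhere near the partition size, lower the rotate count rather than hoping the compression ratio holds.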

Recommended Syslog Configurations for ESXi

Syslog is an important service in any enterprise architecture, and can even be business critical in some applications (think security, time sensitive logging, etc). To that end, VMware has recently made a play in the big data / log collection game with Log Insight (which is awesome, and if you haven’t tried it, GO DOWNLOAD THE EVAL NOW).

Regardless of what Syslog collector you are using (vCenter Integrated, Log Insight, Splunk, etc), there are a few non-default things that you will need to do to ensure persistent and reliable logging from ESXi 5.x hosts.

First off, note that depending on your patch level, if any of the following happen, the syslog service may not reconnect to your syslog collector, and logs may be missed (sounds important, eh?):

  • The network connection has been interrupted.
  • The remote host has closed the connection.
  • A firewall is preventing the logs from being sent.
  • The remote syslog server is not available.

To remediate these issues, check out the two articles below, and patch accordingly.

vSphere ESXi 5 and Remote Syslog: Make Sure You Patch/Update

VMware ESXi 5.x host stops sending syslogs to remote server

Lastly, I want to highlight VMware’s recommendation in the first article:

Once you have updated all your hosts to the versions listed below, we recommend using TCP or SSL. Without TCP, log message loss due to buffer overflows in network devices and network stacks may happen without detection.

This also sounds important, and it is, as the default log transport in ESXi is UDP (i.e. if you just type the IP or host name, logging will default to udp://loghost.company.com:portNumber).

To remediate this, simply add the tcp:// (or SSL) prefix before your log FQDN or IP Address, so that your syslog.global.loghost entry will be as follows:

tcp://loghost.company.com:514

OR

ssl://loghost.company.com:1514

Final Note: You can configure multiple syslog servers by making syslog.global.loghost comma-delimited.
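If you manage a lot of hosts, it can be handy to check loghost values before pushing them out. Below is a minimal Python sketch that splits a comma-delimited syslog.global.loghost string and flags any entry that would fall back to UDP. The function names are hypothetical, this is not part of any VMware tooling, and it only handles the simple host or host:port forms shown above (no IPv6 literals).

```python
def parse_loghosts(value):
    """Split a comma-delimited loghost value into (protocol, host, port) tuples."""
    targets = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if "://" in entry:
            protocol, rest = entry.split("://", 1)
        else:
            # No prefix: logging defaults to UDP, as described above.
            protocol, rest = "udp", entry
        host, sep, port = rest.rpartition(":")
        if sep:
            port = int(port)
        else:
            host, port = rest, None   # no explicit port in the entry
        targets.append((protocol.lower(), host, port))
    return targets


def udp_targets(value):
    """Return the hosts that would still receive logs over UDP."""
    return [host for protocol, host, _ in parse_loghosts(value) if protocol == "udp"]


loghost = "tcp://loghost.company.com:514,ssl://loghost.company.com:1514"
print(parse_loghosts(loghost))
print(udp_targets("loghost.company.com," + loghost))   # ['loghost.company.com']
```

An empty list from udp_targets means every configured collector is using TCP or SSL.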