
Title3.png

As a part of the Cisco Desktop Virtualization Solutions team, I spend a good portion of my time conducting performance tests to size EUC software configurations on Cisco UCS hardware and products from our partners. For example, I recently released a large-scale FlexPod Datacenter solution comprising 5,000 seats of Citrix XenApp/XenDesktop on UCS B200 M4 blade servers and NetApp All-Flash storage. Learn more about that solution here.

 

Since Microsoft Office 2016 was released, I have wondered how this new application stack might impact EUC user densities in comparison to densities under a Microsoft Office 2010 workload. To find out, I set up a series of performance tests using Citrix XenDesktop 7.7. I decided to combine two testing objectives: characterizing the performance of the Cisco UCS B200 M4 blade server running Windows 10 VDI workloads and comparing densities under the two Microsoft Office versions.

 

To eliminate as many variables in the environment as I could and keep the comparison of the Office 2010 and Office 2016 workloads as “vanilla” as possible, I configured the tests to use non-persistent desktops with Citrix Provisioning Services running Windows 10 and turned off Windows Defender.

 

The user density results I saw for the various scenarios were certainly quantifiable. Microsoft Office 2016 proved to be more CPU-intensive than Microsoft Office 2010, resulting in a notable 9% decrease in user density. (Note that the systems under test had no GPU installed, so DirectX-enabled applications had to rely on the CPU for graphics rasterization.) After running tests to gauge host scalability and Login VSI results, I ran single session tests to collect additional performance data. The single session testing confirmed that overall CPU usage was higher for much of the Microsoft Office 2016 workload; other VDA metrics were analyzed as well. I also generated a video sequence that demonstrates side-by-side performance for the single session test, which helps correlate the qualitative differences seen from the synthetic workload perspective with the quantitative performance differences measured on the VDA system. The video also helps visually illustrate the functional nature of the Login VSI 4.1 Knowledge Worker workload.

 

Testing Environment

The components used for the study comprised Cisco® Unified Computing System (UCS) B200 M4 blade servers, Cisco Nexus 9000 Series switches, a NetApp® AFF8080EX-A storage system, VMware vSphere ESXi 6.0 Update 1, Citrix Provisioning Services 7.7, and Citrix XenApp and XenDesktop 7.7 software.

 

Testing Components

    1.png

1b.png

VDA Configuration

    2.png

 

 

Testing Methodology

To generate load in the environment, Login VSI 4.1.4 software from Login Consultants (www.loginvsi.com) was used to generate desktop connections, simulate application workloads, and track application responsiveness. In this testing, the official Knowledge Worker workload (in benchmark mode) was used to simulate productivity tasks (Microsoft Office, Internet Explorer with HTML5 video, printing, and PDF viewing) for a typical knowledge worker.

 

Login VSI records response times during workload operations. Increased latencies in response times indicated when the system configuration was saturated and had reached maximum user capacity. During the testing, comprehensive metrics were captured during the full virtual desktop lifecycle: user login and desktop acquisition (login/ramp-up), user workload execution (steady state), and user logoff. Performance monitoring scripts tracked resource consumption for infrastructure components.
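To make that saturation signal concrete, here is a minimal sketch (in Python, not Login VSI's actual implementation) of how rising response times can be turned into a capacity estimate. The 1000 ms headroom follows the commonly cited VSImax v4.1 convention of baseline response time plus a fixed threshold; the baseline calculation here is deliberately simplified, so treat the whole thing as illustrative.

```python
# Simplified VSImax-style capacity estimate; illustrative only.
def vsimax_estimate(samples, headroom_ms=1000):
    """samples: time-ordered list of (active_sessions, avg_response_ms)."""
    # Crude baseline: average response time of the earliest, lightly loaded samples.
    early = samples[:15] if len(samples) >= 15 else samples
    baseline = sum(rt for _, rt in early) / len(early)
    threshold = baseline + headroom_ms
    for sessions, rt in samples:
        if rt > threshold:
            return sessions  # saturation: maximum user capacity reached
    return None  # threshold never crossed; VSImax not reached

# Example: responsiveness starts degrading past ~150 sessions.
samples = [(s, 800 + max(0, s - 150) * 25) for s in range(10, 201, 10)]
print(vsimax_estimate(samples))  # -> 200 with these made-up numbers
```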

 

Each test run was started from a fresh state after restarting the blade servers. To begin the testing, we took all desktops out of maintenance mode, started the virtual machines, and waited for them to register. The Login VSI launchers initiated desktop sessions and began user logins (the login/ramp-up phase). Once all users were logged in, the steady state portion of the test began in which Login VSI executed the application workload.

 

Five different test cases were conducted:

  1. Testing single server scalability under a maximum recommended VDI load using Windows 10 and Office 2016. The maximum recommended single server user density occurred when CPU utilization reached a maximum of 90-95%.
  2. Testing single server scalability under a VDI load using Windows 10 and Office 2010 with the same number of sessions as the first test case. The objective was to determine the host resource variance between Office 2010 and Office 2016.
  3. Testing a single desktop session using Windows 10 and Office 2016 for a granular-level examination.
  4. Testing a single desktop session using Windows 10 and Office 2010 for a granular-level examination.
  5. Multiple server scalability testing using a 28-blade server configuration under a 4,320-user workload.

  

Single Server Testing

To produce the Login VSI graphs that show user densities, I processed the Login VSI logs using VSI Analyzer. With the Windows 10 and Office 2016 workload, the recommended maximum load was reached at 180 users. The same number of sessions tested against Windows 10 with Office 2010 resulted in marginally lower VSI operation response times. Under the same test conditions, host CPU usage was about 9% lower under the Office 2010 workload than under the Office 2016 workload.

 

Login VSI results for 180x Windows 10 sessions with Microsoft Office 2016 workload 

    3.png

Login VSI results for 180x Windows 10 sessions with Microsoft Office 2010 workload 

    4.png

Login VSI results for two Windows 10 test runs with Microsoft Office 2016 compared to Office 2010

    5.png

 

As Login VSI launches virtual desktop sessions and starts to execute the defined application workload, the percentage of CPU usage rises, as shown in the graph below. As the number of desktops reaches the recommended maximum load threshold, CPU usage tops out for the Office 2016 test run until Login VSI starts logging off the VMs. The graph shows how the test environment with an Office 2010 workload sustains the same user density while using 9% less host CPU resources (which roughly equates to an additional 15 sessions per server).

    6.png
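As a quick sanity check on that parenthetical claim, the proportional math works out. This is a back-of-the-envelope estimate, not part of the formal test methodology:

```python
# At the same 180-session load, the Office 2010 run used ~9% less host CPU.
# Treating CPU as the gating resource, that headroom converts to roughly:
sessions, cpu_delta = 180, 0.09
print(round(sessions * cpu_delta))  # -> 16, in line with the ~15 quoted above
```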

 

Single Session Testing

I really wanted to understand what was behind such a dramatic difference in user density and to identify which resources were limiting performance for the Office 2016 applications. To do this, I ran single session tests, recording PerfMon data for each of the Microsoft Office applications—Word, Outlook, PowerPoint, and Excel—as they executed in the Login VSI workload. By recording performance data for each application process, I was able to get a more comprehensive view of resource usage, both overall and by application.
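For readers who want to reproduce this kind of analysis, below is a minimal sketch of the post-processing involved: reading a PerfMon CSV export (the shape produced by the relog or typeperf utilities) and averaging "% Processor Time" per Office process. The column-matching logic is an assumption about the export layout, and the same approach extends to the memory, network, and disk counters averaged in the sections that follow.

```python
# Sketch: average per-process CPU from a PerfMon CSV export. Column names are
# assumptions; real counters look like '\\HOST\Process(WINWORD)\% Processor Time'.
import csv
from collections import defaultdict

def per_process_cpu_averages(csv_path,
                             processes=("WINWORD", "EXCEL", "POWERPNT", "OUTLOOK")):
    sums, counts = defaultdict(float), defaultdict(int)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            for col, value in row.items():
                for proc in processes:
                    if f"Process({proc})" in col and "% Processor Time" in col:
                        try:
                            sums[proc] += float(value)
                            counts[proc] += 1
                        except (TypeError, ValueError):
                            pass  # skip blank or malformed samples
    return {p: sums[p] / counts[p] for p in processes if counts[p]}

# Usage: per_process_cpu_averages("office2016_session.csv")
```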

 

CPU Usage

The following table averages the processor usage over the entire 48-minute test run, and the graphs below show the percentage of processor time used by the Office 2010 and Office 2016 applications, respectively. Although all Office 2016 applications are generally more CPU-intensive than their 2010 counterparts, some use noticeably more than others. In particular, Word (depicted in blue) produced the most intense and frequent spikes in CPU use, consuming over four times the resources of its 2010 counterpart. A distant second to Word was Excel (depicted in orange), with 61% higher CPU usage.

 

    7.png

8.png

9.png

 

Memory Usage

Although CPU usage explains the differences in session density, I wanted to make sure that adequate system resources were otherwise available to both sets of applications. The graph below charts memory use for the two test cases, showing that on average the tested applications used about the same amount of RAM. The following table averages the memory usage over the entire 48-minute test run.

      10.png

11.png

 

Network Usage

The graphs below show the network bandwidth consumed. The Office 2016 test case consumed marginally more network bandwidth. The following table averages the network usage over the entire 48-minute test run.

      12.png

13.png

 

Disk Usage

The graphs below show disk/storage usage. The Office 2016 test case drove 11-17% more disk bandwidth. The following table averages the disk usage over the entire 48-minute test run.

      14.png

15.png

 

Multiple Server Scalability Testing

In this multi-server test case, 28 blade servers were used to support a non-persistent Windows 10 workload in which N+1 infrastructure and workload servers were configured for fault tolerance. The 28-server configuration supported 4,320 total sessions. The graphs below show the VSI results for the workload along with total storage IOPS, throughput, and latencies.

 

Workload distribution for 4320 seats of Windows 10 with Microsoft Office 2016

  16.png

   

Login VSI results for 4320 seats of Windows 10 with Microsoft Office 2016

    17.png

NetApp 8080EX-A storage usage for 4320 seats of Windows 10 with Microsoft Office 2016

    18.png

 

Comparing the User Experience (Video Results)

I recorded a side-by-side video sequence of the single session tests of Office 2010 and Office 2016 to show qualitative application performance differences from a user experience perspective. In the video, I compressed a full 48-minute single test run into a short 8½-minute movie. While the difference doesn’t seem that significant at first, the impact is more far-reaching when you consider performance beyond the resource consumption of this single session. When sites try to scale EUC workloads based on Office 2016, they will need to take the greater CPU demand and the user density differences into account.

 

Aside from the intended purposes of this content, the video also visually illustrates the VSI Knowledge Worker workload that is used across many EUC-based Cisco solutions.

 

Here's a second recorded video of Windows 10 and Office 2016 performance testing that illustrates CPU, memory, network, and storage usage from the viewpoint of a single session.

Summary

My testing showed a measurable performance difference between the Microsoft Office 2010 and 2016 applications. The results here don’t particularly reflect the Citrix software used in the test rig; rather, they represent the performance impact customers may experience in EUC environments with the Microsoft Office 2016 application stack. Another factor to consider: when applications that use DirectX (like Microsoft Office 2013/2016 and Internet Explorer 10/11/Edge) don’t detect the presence of a physical GPU, they resort to a software rasterizer, which consumes additional CPU cycles. For the purposes of this study, GPU hardware acceleration was disabled.

 

 

— Frank Anderson, Senior Solutions Architect, Cisco Systems, Inc. (@FrankCAnderson)

 

Title2.png

The documentation is available for download in the following formats:

  • PDF – Adobe Reader
  • ePub – Available in various apps on iPhone, iPad, Android, Sony Reader, or Windows Phone
  • Mobi – Kindle device or Kindle app
  • HTML – Internet browser

 

Welcome to what I believe is easily the best EUC high-scale solution and publication ever released. In terms of best-of-breed components, architecture, validation, and guidance, this one has got it all.

FlexPod Datacenter with Cisco UCS B-Series along with NetApp’s All-Flash FAS is the underlying building block for this solution that simplifies deployments while supporting outstanding density and proven mixes of XenApp and XenDesktop workloads. While FlexPod provides a cookie-cutter solution, this CVD demonstrates how a turnkey solution can also be quite scalable. Deployments share a common architecture, component design, configuration procedures, and management. At the same time, configurations can scale and expand to accommodate greater user densities for hosted shared desktops (RDS) or hosted pooled virtual desktops (VDI).

The CVD describes a base 28-blade FlexPod with Cisco UCS B-Series configuration supporting 5,000 users (2,600 RDS, 1,200 Non-Persistent VDI, and 1,200 Persistent VDI users). Cisco UCS B200 M4 Blade Servers were added to the base configuration to support workload expansion and larger densities. All configurations followed a fault-tolerant N+1 design for infrastructure and RDS/VDI clusters. To size and validate workload combinations, we conducted single and multiple blade server scalability tests using Login VSI software. The complete FlexPod CVD documents the step-by-step procedures used to create the test environment and includes all of the test results, which are highlighted here.

  

Figure 1: Reference architecture components in the FlexPod EUC solution

Figure1_bluebk.png

Figure1b_bluebk.png

   

Key Solution Advantages

 

The CVD architecture offers significant benefits to enterprise business deployments:

 

  • Scalable desktop virtualization for the enterprise. Powerful Cisco UCS blade servers enable high user densities with the very best user experience. Twenty-eight blades provide support for 5,000 XenApp and XenDesktop users. The NetApp AFF8080EX-A features exceptionally low-latency flash storage over an end-to-end Ethernet fabric. 
  • Self-contained solution. The FlexPod architecture defines an entirely self-contained solution with the infrastructure needed to support a mix of up to 5,000 Citrix XenApp and XenDesktop users. The solution consumes one full-size data center rack: the four blade server chassis require 24 RU, NetApp storage with shelves takes 10 RU, and the switches occupy 4 RU. The entire design fits in a single data center rack, conserving valuable rack space and simplifying deployments.
  • Fault-tolerant design. The architecture defines redundant infrastructure and workload VMs across multiple physical Cisco UCS blade servers, optimizing availability to keep users productive.
  • Easy to deploy and manage. UCS Manager can manage Cisco UCS servers in the FlexPod solution along with other Cisco UCS blade and rack servers in the management domain. Cisco UCS Performance Manager provides monitoring insights throughout the environment. Cisco UCS Central can extend management across Cisco UCS Manager domains and centralize management across multiple deployments.
  • Fully validated and proven solution. The CVD defines a reference architecture that has been tested under aggressive usage scenarios, including boot and login storms. Test requirements mandated that each configuration boot within 15 minutes and complete logins in 48 minutes at the peak user densities tested. All test scenarios were repeated three times to ensure consistency and reliability.

 

Testing Methodology

To generate load in the environment, Login VSI 4.1.4 software from Login Consultants (www.loginvsi.com) was used to generate desktop connections, simulate application workloads, and track application responsiveness. In this testing, the official Knowledge Worker workload (in benchmark mode) was used to simulate office productivity tasks (Microsoft Office, Internet Explorer with Flash, printing, and PDF viewing) for a typical knowledge worker.

Login VSI records response times during workload operations. Increased latencies in response times indicated when the system configuration was saturated and had reached maximum user capacity. During the testing, comprehensive metrics were captured during the full virtual desktop lifecycle: desktop boot-up, user login and desktop acquisition (login/ramp-up), user workload execution (steady state), and user logoff. Performance monitoring scripts tracked resource consumption for infrastructure components.

Login VSI 4.1.4 features updated workloads that reflect more realistic user workload patterns. In addition, the analyzer functionality has changed—a new Login VSI index, VSImax v4.1, uses a new calculation algorithm and scale optimized for high densities, so that it is no longer necessary to launch more sessions than VSImax.

Each test run was started from a fresh state after restarting the blade servers. To begin the testing, we took all desktops out of maintenance mode, started the virtual machines, and waited for them to register. The Login VSI launchers initiated desktop sessions and began user logins (the login/ramp-up phase). Once all users were logged in, the steady state portion of the test began in which Login VSI executed the application workload.  

Test metrics were gathered from the hypervisor, virtual desktop, storage, and load generation software to assess the overall success of an individual test cycle. Each test cycle was not considered passing unless all of the planned test users completed the ramp-up and steady state phases within the required timeframes and all metrics were within permissible thresholds. Three test runs were conducted for each test case to confirm that the results were relatively consistent.
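A minimal sketch of that pass/fail gate is shown below. The boot and login limits come from the CVD requirements above; the storage latency bounds are illustrative placeholders, not the CVD's formal criteria:

```python
# Sketch of the pass/fail gate applied to each test cycle; thresholds are
# partly illustrative (latency bounds are placeholders, not CVD criteria).
from dataclasses import dataclass

@dataclass
class RunResult:
    boot_minutes: float          # time for all desktops to boot and register
    login_minutes: float         # login/ramp-up (login storm) duration
    steady_cpu_pct: float        # peak host CPU during steady state
    read_latency_ms: float       # average storage read latency
    write_latency_ms: float      # average storage write latency
    sessions_completed: int
    sessions_planned: int

def run_passes(r: RunResult) -> bool:
    return all([
        r.boot_minutes <= 15,                        # boot storm requirement
        r.login_minutes <= 48,                       # login storm requirement
        r.steady_cpu_pct <= 95,                      # top of the 90-95% band
        r.read_latency_ms < 1.0,                     # placeholder bound
        r.write_latency_ms < 1.0,                    # placeholder bound
        r.sessions_completed == r.sessions_planned,  # all users finished
    ])
```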

Seven different test cases were conducted:

  1. Testing single server scalability under a maximum recommended RDS load. The maximum recommended single server user density occurred when CPU utilization reached a maximum of 90-95%.
  2. Testing single server scalability under a maximum recommended VDI Non-Persistent load. The maximum recommended density occurred when processor utilization reached a maximum of 90-95%.
  3. Testing single server scalability under a maximum recommended VDI Persistent load. Again, the maximum recommended density occurred when processor utilization reached a maximum of 90-95%.
  4. Cluster testing multiple server scalability using a 12-blade server configuration under an RDS 2600-user workload.
  5. Cluster testing multiple server scalability using a 7-blade server configuration under a VDI Non-Persistent 1200-user workload.
  6. Cluster testing multiple server scalability using a 7-blade server configuration under a VDI Persistent 1200-user workload.
  7. Full scale testing multiple server scalability using a 28-blade server configuration under a mixed 5000-user workload. 

Test configurations

 

Figure 2 shows the infrastructure and workload distribution, and Figure 3 shows the VM configurations for each of the seven test cases. In addition to RDS and VDI workload VMs, infrastructure servers were defined to host Cisco Nexus 1000V VSM, Cisco UCS Performance Manager, XenDesktop Delivery Controllers, Studio, StoreFront, Licensing, Director, Citrix Provisioning Services (PVS), SQL, Active Directory, DNS, DHCP, vCenter, and NetApp Virtual Storage Console.

Figure 2: Infrastructure and workload distribution.

Figure2_bluebk.png

Figure 3: VM configurations for the seven test cases.

Figure3_bluebk.png

 

For the cluster and full scale tests, multiple workload and infrastructure VMs were hosted across more than one physical blade. To validate the design as production ready, N+1 server high availability was incorporated. 

 

Table 1 shows the VM definitions used for the RDS and VDI workload servers. Different virtual CPU (vCPU) configurations were tested first, and the best performance was achieved when not overcommitting CPU resources. Optimal performance was observed when each RDS VM was configured with six vCPUs and 24GB RAM, and each VDI VM was configured with 2 vCPUs and 1.7GB RAM.

  Table 1: RDS and VDI VM configurations

Table1.png
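The arithmetic behind that no-overcommit finding is worth making explicit. This is my reading of the configuration, assuming the dual E5-2680 v3's 12 cores per socket with Hyper-Threading enabled:

```python
# vCPU-to-logical-processor ratio for the RDS configuration above.
sockets, cores_per_socket, threads_per_core = 2, 12, 2
logical_processors = sockets * cores_per_socket * threads_per_core  # 48

rds_vms, vcpus_per_vm = 8, 6  # eight XenApp VMs at six vCPUs each
total_vcpus = rds_vms * vcpus_per_vm  # 48

print(total_vcpus / logical_processors)  # 1.0 -> vCPUs map 1:1 to logical threads
```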

This CVD used Citrix Provisioning Server 7.7. When planning a PVS deployment, design decisions need to be made regarding PVS vDisk and PVS write cache placement. In this CVD, the PVS write cache was located on RAM with overflow to disk (NFS storage volumes), reducing the number of IOPS to storage. PVS vDisks were hosted using CIFS/SMB3 via NetApp, allowing the same vDisk to be shared among multiple PVS servers while providing resilience in the event of storage node failover.

 

Main Findings

Figure 4 summarizes the seven test cases and the maximum RDS and VDI user densities achieved in each. The first three test cases examined single server scalability for RDS, non-persistent VDI, and persistent VDI respectively, determining the recommended maximum density for each workload type on a Cisco UCS B200 M4 blade with dual Intel® E5-2680 v3 processors and 384GB of RAM. The other four tests analyzed the performance of cluster and full-scale workloads on multiple blade servers. Multiple blade testing showed that the configurations could support the target workload densities under simulated stress conditions (cold-start boot and simulated login storms).

 

Figure 4: Seven test cases were run to examine single server and multiple server scalability.

Figure4.png

Test Case 1: Single Server Scalability, RDS

 

We started by testing single server scalability for XenApp hosted shared desktop sessions (RDS) running the Login VSI 4.1.4 Knowledge Worker workload. A dedicated blade server ran eight VMs hosting Windows Server 2012 R2 with XenApp 7.7 sessions. This test determined that the recommended maximum density was 240 RDS sessions. The graphs below show the 240-user VSI results along with resource utilization metrics for the single server RDS Knowledge Worker workload. Note that all metrics, especially CPU utilization during steady state, stayed within the required thresholds at this capacity. The full CVD contains additional performance metrics, including performance metrics for various infrastructure VMs.

Figure 5: Single Server Scalability, XenApp 7.7 RDS, VSIMax v4.1 Density

Figure5.png

Figure 6: Single Server Scalability, XenApp 7.7 RDS, CPU Utilization

Figure6.png

Figure 7:  Single Server Scalability, XenApp 7.7 RDS, Memory Utilization

Figure7.png

Figure 8: Single Server Scalability, XenApp 7.7 RDS, Network Utilization

Figure8.png

Test Case 2: Single Server Scalability, VDI Non-Persistent

In the second test case, we tested single server scalability for XenDesktop VDI Non-Persistent running the Login VSI 4.1 Knowledge Worker workload. The recommended maximum density for a single blade server was 195 desktops hosting Microsoft Windows 7 (32-bit). The graphs below show the 195-seat VSI results for single server VDI along with resource utilization metrics. These metrics, especially CPU utilization during steady state, were well within permissible bounds.

Figure 9: Single Server Scalability, XenDesktop 7.7 VDI Non-Persistent, VSIMax v4.1 Density

Figure9.png

Figure 10: Single Server Scalability, XenDesktop 7.7 VDI Non-Persistent, CPU Utilization

Figure10.png

Figure 11: Single Server Scalability, XenDesktop 7.7 VDI Non-Persistent, Memory Utilization

Figure11.png

Figure 12: Single Server Scalability, XenDesktop 7.7 VDI Non-Persistent, Network Utilization

Figure12.png

 

Test Case 3: Single Server Scalability, VDI Persistent

In the third test case, we tested single server scalability for XenDesktop VDI Persistent running the Login VSI 4.1 Knowledge Worker workload. The recommended maximum density for a single blade server was 195 desktops hosting Microsoft Windows 7 (32-bit). The graphs below show the 195-seat VSI results for single server VDI along with resource utilization metrics. These metrics, especially CPU utilization during steady state, were well within permissible bounds. From the compute scalability perspective, both persistent and non-persistent workloads can achieve the same level of scalability, with host CPU being the gating factor.

Figure 13: Single Server Scalability, XenDesktop 7.7 VDI Persistent, VSIMax v4.1 Density

Figure13.png

Figure 14: Single Server Scalability, XenDesktop 7.7 VDI Persistent, CPU Utilization

Figure14.png

Figure 15: Single Server Scalability, XenDesktop 7.7 VDI Persistent, Memory Utilization

Figure15.png

Figure 16: Single Server Scalability, XenDesktop 7.7 VDI Persistent, Network Utilization

Figure16.png

Test Case 4: Cluster Scalability, RDS

In this cluster test case, 12 blade servers were used to support an RDS workload in which N+1 infrastructure and workload servers were configured for fault tolerance. The 12-server configuration supported 2600 RDS sessions on Windows Server 2012 R2. The graphs below show the VSI results for the RDS cluster workload and total storage IOPS and latencies.

Figure 17: RDS Cluster Scalability, 2600 Users, VSIMax v4.1 Density

Figure17.png

Figure 18: RDS Cluster Scalability, 2600 Users, Total IOPS and Latency

Figure18.png

Test Case 5: Cluster Scalability, VDI Non-Persistent

In this cluster test case, 7 blade servers were used to support a VDI Non-Persistent workload in which N+1 infrastructure and workload servers were configured for fault tolerance. The 7-server configuration supported 1200 VDI sessions on Windows 7. The graphs below show the VSI results for the VDI Non-Persistent cluster workload and total storage IOPS and latencies.

Figure 19: VDI Non-Persistent Cluster Scalability, 1200 Users, VSIMax v4.1 Density

Figure19.png

Figure 20: VDI Non-Persistent Cluster Scalability, 1200 Users, Total IOPS and Latency

Figure20.png

Test Case 6: Cluster Scalability, VDI Persistent

In this cluster test case, 7 blade servers were used to support a VDI Persistent workload in which N+1 infrastructure and workload servers were configured for fault tolerance. The 7-server configuration supported 1200 VDI sessions on Windows 7. The graphs below show the VSI results for the VDI Persistent cluster workload and total storage IOPS and latencies.

Figure 21: VDI Persistent Cluster Scalability, 1200 Users, VSIMax v4.1 Density

Figure21.png

Figure 22: VDI Persistent Cluster Scalability, 1200 Users, Total IOPS and Latency

Figure22.png

Test Case 7: Full Scale Mixed Scalability

In this full-scale test case, 28 blade servers were used to support RDS, VDI Non-Persistent, and VDI Persistent workloads in which N+1 infrastructure and workload servers were configured for fault tolerance. The 28-server configuration supported 5000 mixed sessions. The graphs below show the VSI results for the mixed workload and total storage IOPS and latencies.

Figure 23: Full Scale Mixed Scalability, 5000 Users, VSIMax v4.1 Density

Figure23.png

Figure 24: Full Scale Mixed Scalability, 5000 Users, Total IOPS and Latency per Volume Type

Figure24.png

Figure 25: Full Scale Mixed Scalability, 5000 Users, Total Read and Write IOPS

Figure25.png

Desktop Virtualization at High Scale

The test results show how easily FlexPod EUC configurations can accommodate very high densities while maintaining an excellent user experience, allowing enterprise deployments to support larger RDS and VDI workloads. The Cisco UCS B200 M4 blade servers offer high performance to support high RDS/VDI densities, while the NetApp all-flash storage system and 10GbE connectivity optimize storage-related operations.

 

In the full scale test case, the NetApp storage easily handled the load requirements with average read and write latencies of less than 1ms. The SSD drives in the AFF8080 significantly decreased response time latencies during the boot and login phases. During steady state, the storage experienced low IOPS for the RDS and VDI non-persistent datastores. The PVS write cache was configured to use RAM with overflow to disk, which complemented the storage system by decreasing IOPS.

 

To see the full set of test results and learn more, you can access the full CVD here.

 

 

— Frank Anderson, Senior Solutions Architect, Cisco Systems, Inc. (@FrankCAnderson)

MoreEUC2.png

 

This study focused on measuring the EUC session density differences between the Intel Xeon E5-2680 v3 CPU and the Intel Xeon E5-2680 v4 CPU using Cisco UCS B200 M4 blade servers. To cut to the chase: the observed delta for both XenApp (Hosted Shared Desktops) and XenDesktop (Hosted Virtual Desktops) amounted to slightly more than 20%, without any increase in power consumption.

Table.png

Good News! Upgrading to the latest generation processors does not require the purchase of new servers. The Cisco UCS B200 M4 blades support both the Haswell and Broadwell architectures. Be sure to upgrade to UCS firmware 3.1(1g) or later before making hardware changes.

 

Testing Environment

The components used for the study comprised Cisco® Unified Computing System (UCS) B200 M4 blade servers, Cisco Nexus 9000 Series switches, a NetApp® AFF8080EX-A storage system, VMware vSphere ESXi 6.0 Update 1, Citrix Provisioning Services 7.7, and Citrix XenApp and XenDesktop 7.7 software.

 

Testing Components

Components.png

VDA Configuration

Table_VDAconfig.png

 

Testing Methodology

To generate load in the environment, Login VSI 4.1.4 software from Login Consultants (www.loginvsi.com) was used to generate desktop connections, simulate application workloads, and track application responsiveness. In this testing, the official Knowledge Worker workload (in benchmark mode) was used to simulate productivity tasks (Microsoft Office, Internet Explorer with HTML5 video, printing, and PDF viewing) for a typical knowledge worker.

Login VSI records response times during workload operations. Increased latencies in response times indicated when the system configuration was saturated and had reached maximum user capacity. During the testing, comprehensive metrics were captured during the full virtual desktop lifecycle: user login and desktop acquisition (login/ramp-up), user workload execution (steady state), and user logoff. Performance monitoring scripts tracked resource consumption for infrastructure components.

Each test run was started from a fresh state after restarting the blade servers. To begin the testing, we took all desktops out of maintenance mode, started the virtual machines, and waited for them to register. The Login VSI launchers initiated desktop sessions and began user logins (the login/ramp-up phase). Once all users were logged in, the steady state portion of the test began in which Login VSI executed the application workload.

Test cases conducted:

  • Testing single server scalability under a maximum recommended RDS load. The maximum recommended single server user density occurred when CPU utilization reached a maximum of 90-95%.
  • Testing single server scalability under a maximum recommended VDI Non-Persistent load. The maximum recommended single server user density occurred when CPU utilization reached a maximum of 90-95%.

 

Main Findings

The following test data summarizes the two test cases and the maximum RDS and VDI user densities achieved in each.

XenApp 7.7 (RDS) test case with Windows Server 2012 R2

I started by testing single server scalability for XenApp hosted shared desktop sessions (RDS) running the Login VSI 4.1.4 Knowledge Worker workload. A dedicated blade server ran eight VMs hosting Windows Server 2012 R2 with XenApp 7.7 sessions. This test determined that the recommended maximum density was 240 RDS sessions with the E5-2680 v3 processors and 290 RDS sessions with the E5-2680 v4 processors. The graphs below show the VSI results along with resource utilization metrics for the single server RDS Knowledge Worker workload.

 

A twenty percent gain was measured from Haswell to Broadwell, with only a marginal difference in power consumption.
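Recomputing that figure from the measured densities (the RDS numbers above and the VDI numbers in the next section):

```python
# Haswell (v3) vs. Broadwell (v4) single-server density gains.
v3_rds, v4_rds = 240, 290
v3_vdi, v4_vdi = 195, 235
print(f"RDS: {(v4_rds - v3_rds) / v3_rds:.1%}")  # RDS: 20.8%
print(f"VDI: {(v4_vdi - v3_vdi) / v3_vdi:.1%}")  # VDI: 20.5%
```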

 

Figure 1: Single Server Scalability, XenApp 7.7 RDS, CPU Utilization with 2680v4

RDSv4-CPUusage.png

Figure 2: Single Server Scalability, XenApp 7.7 RDS, Power Utilization with 2680v4

RDSv4-POWERusage.png

Figure 3: Single Server Scalability, XenApp 7.7 RDS, CPU Utilization with 2680v3

RDSv3-CPUusage.png

Figure 4: Single Server Scalability, XenApp 7.7 RDS, Power Utilization with 2680v3

RDSv3-POWERusage.png

Figure 5: Single Server Scalability, XenApp 7.7 RDS, VSI v4.1 Density & Response Times

RDS-VSIscore.png

Figure 6: Single Server Scalability, XenApp 7.7 RDS, VSI Comparison Chart

RDS-Comparison.png

 

XenDesktop 7.7 (VDI) test case with Windows 7 32-bit SP1

Next, I tested single server scalability for XenDesktop hosted virtual desktop sessions (VDI) running the Login VSI 4.1.4 Knowledge Worker workload. A dedicated blade server ran VMs hosting Windows 7 with XenDesktop 7.7 sessions. This test determined that the recommended maximum density was 195 VDI sessions with the E5-2680 v3 processors and 235 VDI sessions with the E5-2680 v4 processors. The graphs below show the VSI results along with resource utilization metrics for the single server VDI Knowledge Worker workload.

 

The user density increase from Haswell to Broadwell tracks similarly for the VDI workload as it does for RDS. And, much like RDS, VDI testing yielded a negligible difference in power consumption between the two processors.

 

Figure 7: Single Server Scalability, XenDesktop 7.7 VDI, CPU Utilization with 2680v4

VDIv4-CPUusage.png

Figure 8: Single Server Scalability, XenDesktop 7.7 VDI, Power Utilization with 2680v4

VDIv4-POWERusage.png

Figure 9: Single Server Scalability, XenDesktop 7.7 VDI, CPU Utilization with 2680v3

VDIv3-CPUusage.png

Figure 10: Single Server Scalability, XenDesktop 7.7 VDI, Power Utilization with 2680v3

VDIv3-POWERusage.png

Figure 11: Single Server Scalability, XenDesktop 7.7 VDI, VSI v4.1 Density & Response Times

VDI-VSIscore.png

Figure 12: Single Server Scalability, XenDesktop 7.7 VDI, VSI Comparison Chart

VDI-Comparison.png

Intel E5-2600 v4 Processors

The powerful new Intel® Xeon® processor E5-2600 v4 product family offers versatility across diverse workloads. These processors are designed for architecting next-generation data centers running on software-defined infrastructure, supercharged for efficiency, performance, and agile services delivery across cloud-native and traditional applications. They support workloads for cloud, high-performance computing, networking, and storage.

 

Broadwell is Intel's fifth generation of Core-series processor and represents the leading edge of what today’s CPU platforms can deliver. How much denser is Broadwell? Haswell uses 22-nanometer transistors, whereas Broadwell's are 14nm. The first Core processors back in 2006 had, by comparison, huge 65nm transistors. A lot of progress has been made in those eight years.

 

A significant characteristic of Broadwell is efficiency: its chips use 30% less power than Haswell's while providing better performance at the same relative clock speed. Certainly, the v4 processors have a positive impact on VDI and RDS workloads, increasing user scalability without additional power consumption, as discovered in this study.

 

The following table provides a feature comparison between the two tested processors. 

Table_v3-vs-v4.png

 

Conclusion

The test results show how the latest generation of Intel CPUs can expand and flex, allowing deployments to grow and support greater RDS and VDI workloads. The Cisco UCS B200 M4 blade servers offer high performance to support extraordinary RDS/VDI densities while holding datacenter OpEx costs steady.

Look for v4 processors to be the new standard in the EUC CVD solutions moving forward.

 

References

http://www.cisco.com/c/dam/en/us/products/collateral/servers-unified-computing/ucs-b-series-blade-servers/b200m4-specsheet.pdf

http://www.intel.com/content/www/us/en/processors/xeon/xeon-processor-e5-family.html

https://communities.cisco.com/people/fraander/blog

— Frank Anderson, Senior Solutions Architect, Cisco Systems, Inc. (@FrankCAnderson)

Summit.png

 

Mountain climbing is a challenge that yields a high reward once the objective is achieved: reaching the summit. As any mountaineer will tell you, the secret to a good ascent is proper preparation and guidance. Similarly, implementing a large-scale EUC solution is not without its challenges. The difference between a successful deployment and a problematic one comes down to planning and skilled direction. Integration mistakes can be costly, resulting in downtime and further delays in provisioning.

 

To eliminate integration challenges while providing linear scalability, Cisco and NetApp have collaborated on the FlexPod® Data Center, a predesigned architecture that combines Cisco® Unified Computing System (UCS) servers, Cisco Nexus fabric, and NetApp® storage systems. A soon-to-be-released Cisco Validated Design (CVD) makes it easy to deploy Citrix desktop and application virtualization solutions with confidence. This CVD describes a reference architecture that combines the FlexPod Data Center with Cisco UCS, NetApp AFF8080EX flash storage, VMware vSphere ESXi 6.0 Update 1, and Citrix XenApp and XenDesktop 7.7 software.

  Components1.png

Overview1.png

 

The CVD creates a flexible building block for Citrix desktop virtualization that’s ideal for enterprise-sized companies. IT administrators can use Cisco UCS Manager along with UCS Central to centralize ongoing management across multiple domains and locations.

 

The CVD documents best practices, step-by-step deployment instructions, and scalability test results that can help customers rapidly deploy an effective Citrix desktop and application virtualization solution. The Cisco UCS platform enables compute density that provides impressive scalability for hosted shared desktops, non-persistent/pooled hosted virtual desktops, and persistent hosted virtual desktops in several mixed workload scenarios.

 

The pinnacle of all Citrix-based FlexPods is just over the summit.

UPDATE: Now released: CVD FlexPod Datacenter with Citrix XenDesktop/XenApp 7.7 and VMware vSphere 6.0 for 5000 Seats. Related blog: Announcing a New EUC CVD: FlexPod Datacenter with Citrix XenApp/XenDesktop 7.7 and VMware vSphere 6.0 for 5000 Seats.

Find me at Citrix Synergy 2016 in Las Vegas May 24-26.

 

— Frank Anderson, Senior Solutions Architect, Cisco Systems, Inc. (@FrankCAnderson)

 

XDFlexPodMini_Title.png

For companies that have geographically dispersed offices such as remote offices/branch offices (ROBOs) or other satellite locations, a turnkey solution for desktop virtualization can speed deployment at the enterprise edge. A new Cisco Validated Design (CVD) for Citrix XenApp and XenDesktop uses a predesigned, self-contained platform—the FlexPod Express with Cisco UCS Mini—that makes it easy to provision reliable desktops and applications for 350 to 700 users. The FlexPod Express, co-designed by Cisco and NetApp, integrates compute servers, networking fabric, and hybrid storage components, creating a standalone, drop-in VDI/SBC solution that can be installed at remote sites yet managed centrally.

 

A Flexible Turnkey Solution

 

FlexPod Express with Cisco UCS Mini is the underlying building block that simplifies ROBO deployments while supporting density growth and proven mixes of XenApp and XenDesktop workloads. While FlexPod provides a cookie-cutter solution, this CVD demonstrates how a turnkey solution can also be quite versatile. Each deployment shares a common architecture, component design, configuration procedures, and management. At the same time, configurations can scale and expand to accommodate greater user densities for hosted shared desktops (RDS) or hosted pooled virtual desktops (VDI). Every ROBO deployment follows the same architectural design but scales to facilitate site-specific RDS/VDI combinations. This excellent scalability can also benefit small and mid-sized businesses—they can start small and grow from 350 to as many as 700 users.

 

The CVD describes a base 4-blade FlexPod with Cisco UCS Mini configuration supporting 350 users (150 RDS and 200 VDI users). Cisco UCS B200 M4 Blade Servers were added to the base configuration to support workload expansion and larger densities. All configurations followed a fault-tolerant N+1 design for infrastructure and RDS/VDI VMs. To size and validate workload combinations, we conducted single and multiple blade server scalability tests using Login VSI software. The complete FlexPod CVD documents the step-by-step procedures used to create the test environment and includes all of the test results, which are highlighted here.

 

Figure 1: Reference architecture components in the FlexPod with Cisco UCS Mini solution

 

XDFlexPodMini_1.png

 

Solution Overview

 

Figure 1 shows the key components in the CVD reference architecture, including:

 

  • Citrix XenApp and XenDesktop 7.6 software. Because Citrix XenDesktop 7.6 unifies the functionality of earlier XenApp and XenDesktop releases, the same software and same PVS Setup Wizard can provision both RDS sessions (on Windows Server 2012 R2) and pooled hosted VDI desktops (running Microsoft Windows 7 or Windows 8). In the CVD all infrastructure and RDS/VDI workload servers were 100% virtualized on VMware vSphere ESXi 5.5 Update 2.
  • Cisco UCS Mini.  The Cisco UCS Mini combines servers, storage, and a 10 Gigabit networking fabric in an easy-to-deploy, compact form factor. The chassis can support up to eight half-width Cisco UCS B200 M4 Blade Servers, each featuring dual 10-core 2.6 GHz Intel Xeon (E5-2660v3) processors and 256GB. In this CVD, between four and seven blade servers were configured for the various test cases. Two Cisco UCS 6324 Fabric Interconnects provide redundant, high bandwidth LAN and storage connectivity for the blade servers and the chassis, and can optionally connect to rack servers as well. Cisco UCS Manager manages all Cisco UCS Mini software and hardware components, and Cisco UCS Central can aggregate multiple UCS Manager domains for comprehensive policy control and centralized management.
  • Cisco Nexus 9372 Switches. To support 10 GbE connectivity for the FlexPod solution, these Layer 2/Layer 3 access switches each feature 48 1/10-Gbps Small Form Pluggable Plus (SFP+) ports and 6 Quad SFP+ (QSFP+) uplink ports. In addition, the Nexus 9372 is Cisco ACI capable.
  • NetApp FAS2552 hybrid storage. The NetApp FAS2552 is a dual controller storage system that combines low-latency SSDs for caching and cost-effective SAS drives for capacity. The array configuration used in the testing included four 200GB SSDs and twenty 900GB SAS drives. The array controllers feature 10GbE ports to support blade server boot over iSCSI and NFS/CIFS connectivity for file system access.

     

Key Solution Advantages

The CVD architecture offers significant benefits to enterprise-edge or small business deployments:

  

  • Self-contained and compact solution. The FlexPod with UCS Mini architecture defines an entirely self-contained “all-in-one” solution with the infrastructure needed to support a mix of up to 700 Citrix XenApp and XenDesktop users. The solution consumes only 10 rack units—the blade server chassis requires 6RU, NetApp storage takes 2 RU, and the switches occupy 2 RU. The entire “in-a-box” design fits in less than a single data center rack, conserving valuable data center rack space and simplifying deployments, especially in small business or standalone branch office environments.
  • Cost-effective and scalable desktop virtualization for the enterprise edge. Powerful Cisco UCS blade servers enable high user densities at a low cost per seat. By adding additional blade servers to the chassis, a basic 4-server configuration supporting 350 users scales easily to support another 350 additional XenApp and XenDesktop users. The NetApp storage array features a combination of low-latency flash devices and a tray of less expensive SAS drives, for economical I/O over an end-to-end Ethernet fabric.
  • Fault-tolerant design. The architecture defines redundant infrastructure and workload VMs across multiple physical Cisco UCS blade servers, optimizing availability to keep users productive.
  • Easy to deploy and manage. UCS Manager can monitor and manage Cisco UCS servers in the FlexPod solution along with other Cisco UCS blade and rack servers in the management domain. Cisco UCS Central can extend management across Cisco UCS Manager domains and centralize management across multiple remote sites.
  • Fully validated and proven solution. The CVD defines a reference architecture that has been tested under aggressive usage scenarios, including boot and login storms. Test requirements mandated that each configuration boot within 15 minutes and complete logins in 48 minutes at the peak user densities tested.

     

Testing Methodology

To generate load in the environment, Login VSI 4.1.4 software from Login Consultants (www.loginvsi.com) was used to generate desktop connections, simulate application workloads, and track application responsiveness. In this testing, the default Login VSI 4.1 Office Worker workload was used to simulate office productivity tasks (Microsoft Office, Internet Explorer with Flash, printing, and PDF viewing) for a typical office worker.

Login VSI records response times during workload operations. Increased latencies in response times indicated when the system configuration was saturated and had reached maximum user capacity. During the testing, comprehensive metrics were captured during the full virtual desktop lifecycle: desktop boot-up, user login and desktop acquisition (login/ramp-up), user workload execution (steady state), and user logoff. Performance monitoring scripts tracked resource consumption for infrastructure components.

Login VSI 4.1.4 features updated workloads that reflect more realistic user workload patterns. In addition, the analyzer functionality has changed—a new Login VSI index, VSImax v4.1, uses a new calculation algorithm and scale optimized for high densities, so that it is no longer necessary to launch more sessions than VSImax.

Each test run was started from a fresh state after restarting the blade servers. To begin the testing, we took all desktops out of maintenance mode, started the virtual machines, and waited for them to register. The Login VSI launchers initiated desktop sessions and began user logins (the login/ramp-up phase). Once all users were logged in, the steady state portion of the test began in which Login VSI executed the application workload.

  

Test metrics were gathered from the hypervisor, virtual desktop, storage, and load generation software to assess the overall success of an individual test cycle. Each test cycle was not considered passing unless all of the planned test users completed the ramp-up and steady state phases within the required timeframes and all metrics were within permissible thresholds. Three test runs were conducted for each test case to confirm that the results were relatively consistent.

 

We conducted five different test cases:

 

  1. Testing single server scalability under a maximum recommended RDS load. The maximum recommended single server user density occurred when CPU utilization reached a maximum of 90-95%.
  2. Testing single server scalability under a maximum recommended VDI load. Again, the maximum recommended density occurred when processor utilization reached a maximum of 90-95%.
  3. Testing multiple server scalability using a 4-blade server base configuration under a mixed 350-user workload.
  4. Extending the base configuration to 700 users with an RDS focus.
  5. Extending the base configuration to 700 users with a VDI focus.

  

Test configurations

 

Figure 2 shows the VM configurations for each of the five test cases. In addition to RDS and VDI workload VMs, infrastructure servers were defined to host XenDesktop Delivery Controllers, Studio, StoreFront, Licensing, Director, Citrix Provisioning Services (PVS), SQL, Active Directory, DNS, DHCP, vCenter, and NetApp Virtual Storage Console.

 

Figure 2: VM configurations for the five test cases.

XDFlexPodMini_2.png

 

For the multiple server scalability tests, multiple workload and infrastructure VMs were hosted across more than one physical blade. Configuring N+1 servers as shown enables a highly available yet cost-effective configuration for smaller deployment sites.
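As a rough illustration of that N+1 sizing rule, here is a sketch with an assumed per-blade capacity; it is not the CVD's actual sizing method:

```python
import math

# Provision enough blades that the offered load still fits if one blade fails.
def blades_needed(total_sessions: int, sessions_per_blade: int) -> int:
    return math.ceil(total_sessions / sessions_per_blade) + 1  # +1 spare

# e.g. 350 mixed users at an assumed ~120 sessions per blade -> 4 blades,
# matching the 4-server configuration in Test Case 3 below.
print(blades_needed(350, 120))
```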

 

Table 1 shows the VM definitions used for the RDS and VDI workload servers. We first tested different virtual CPU (vCPU) configurations, finding that the best performance was achieved when not overcommitting CPU resources. Optimal performance was observed when each RDS VM was configured with five vCPUs and 24GB RAM, and each VDI VM was configured with 2 vCPUs and 1.5GB RAM.

 

Table 1: RDS and VDI VM configurations

XDFlexPodMini_3.png

 

 

This CVD used Citrix Provisioning Server 7.6. When planning a PVS deployment, design decisions need to be made regarding PVS vDisk and PVS write cache placement. In this CVD, the PVS write cache was located on RAM with overflow to disk (NFS storage volumes), reducing the number of IOPS to storage. PVS vDisks were hosted using CIFS/SMB3 via NetApp, allowing the same vDisk to be shared among multiple PVS servers while providing resilience in the event of storage node failover.

 

Main Findings

 

Figure 3 summarizes the five test cases and the maximum RDS and VDI user densities achieved in each.  The first two test cases examined single server scalability for RDS and VDI respectively, determining the recommended maximum density for each workload type on a Cisco UCS B200 M4 blade with dual Intel® E5-2660 v3 processors and 256GB of RAM. The other three tests analyzed the performance of mixed workloads on multiple blade servers. Multiple blade testing showed that the configurations could support mixed workload densities under simulated stress conditions (cold-start boot and simulated login storms).

Figure 3: Five test cases were run to examine single server and multiple server scalability.

  XDFlexPodMini_4.png

 

Test Case 1: Single Server Scalability, RDS

 

We started by testing single server scalability for XenApp hosted shared desktop sessions (RDS) running the Login VSI 4.1 Office Worker workload. A dedicated blade server ran eight VMs hosting Windows Server 2012 sessions. This test determined that the recommended maximum density was 210 RDS sessions. The graphs below show the 210-user VSIMax v4.1 results along with resource utilization metrics for the single server RDS Office Worker workload. Note that all metrics, especially CPU utilization during steady state, stayed within the required thresholds at this capacity. The full CVD contains additional performance metrics, including performance metrics for various infrastructure VMs.

 

Figure 4: Single Server Scalability, XenApp 7.6 RDS, VSIMax v4.1 Density

XDFlexPodMini_5.png

 

Figure 5: Single Server Scalability, XenApp 7.6 RDS, CPU Utilization

XDFlexPodMini_6.png

 

Figure 6:  Single Server Scalability, XenApp 7.6 RDS, Memory Utilization

XDFlexPodMini_7.png

 

Figure 7: Single Server Scalability, XenApp 7.6 RDS, Network Utilization

XDFlexPodMini_8.png

 

Test Case 2: Single Server Scalability, VDI

 

In the second test case, we tested single server scalability for XenDesktop hosted virtual desktops (VDI) running the Login VSI 4.1 Office Worker workload. The recommended maximum density for a single blade server was 160 desktops hosting Microsoft Windows 7 (32-bit). The graphs below show the 160-seat VSIMax v4.1 results for single server VDI along with resource utilization metrics. Again, these metrics, especially CPU utilization during steady state, were well within permissible bounds.

 

Figure 8: Single Server Scalability, XenDesktop 7.6 VDI, VSIMax v4.1 Density

    XDFlexPodMini_9.png

Figure 9: Single Server Scalability, XenDesktop 7.6 VDI, CPU Utilization

    XDFlexPodMini_10.png

Figure 10: Single Server Scalability, XenDesktop 7.6 VDI, Memory Utilization

    XDFlexPodMini_11.png

Figure 11: Single Server Scalability, XenDesktop 7.6 VDI, Network Utilization

  XDFlexPodMini_12.png

 

Test Case 3: 4-Blade, Mixed Workload, 350 Users

 

In this test case, four blade servers were used to support a mixed RDS/VDI workload in which N+1 infrastructure and workload servers were configured for fault tolerance. The 4-server configuration supported 150 RDS sessions on Windows Server 2012 and 200 Windows 7 VDI users. The graphs below show the VSIMax v4.1 results for the 4-server mixed workload and total storage IOPS and latencies.

 

Figure 12: 4-Server Scalability, 350 Mixed Users, VSIMax v4.1 Density

XDFlexPodMini_13.png

Figure 13: 4-Server Scalability, 350 Mixed Users, Total IOPS and Latency

XDFlexPodMini_14.png

 

Test Case 4: 6-Blade, Mixed Workload - RDS Expansion, 700 Users

 

To validate an environment that requires a larger number of RDS seats, six Cisco UCS B200 M4 blade servers were used to support a 700-seat workload: 460 RDS sessions and 240 VDI users. Again, N+1 infrastructure and workload servers were configured for fault tolerance. The graphs below show the VSIMax v4.1 results for the 6-server mixed workload and the total storage IOPS and latencies.   

 

Figure 14: 6-Server Scalability, 700 Mixed Users, VSIMax v4.1 Density

XDFlexPodMini_15.png

Figure 15: 6-Server Scalability, 700 Mixed Users, Total IOPS and Latency

XDFlexPodMini_16.png

 

Test Case 5: 7-Blade, Mixed Workload - VDI Expansion, 700 Users

 

By expanding the base configuration to seven blade servers, the final test validated a mixed 700-seat deployment that was predominantly VDI. In this test case, the N+1 configuration supported 150 RDS sessions and 550 VDI users. The graphs below show the VSIMax v4.1 results for the VDI expansion workload and the total storage IOPS and latencies during the test window.

 

 

Figure 16: 7-Server Scalability, 700 Mixed Users, VSIMax v4.1 Density

XDFlexPodMini_17.png

Figure 17: 7-Server Scalability, 700 Mixed Users, Total IOPS and Latency

XDFlexPodMini_18.png

 

Scalable Desktop Virtualization at the Enterprise Edge

 

The test results show how easily FlexPod with UCS Mini configurations can expand and flex, allowing deployments at the enterprise edge to grow and support greater RDS and VDI workloads. The Cisco UCS B200 M4 blade servers offer high performance to support high RDS/VDI densities, while the NetApp hybrid storage array and 10GbE connectivity optimizes IOPS and storage-related costs.

 

In the multiple blade test cases, the NetApp storage easily handled IOPS requirements for the 350- and 700-user workloads with average read and write latencies of less than 5ms. Flash drives in the hybrid storage configuration helped to decrease latencies during the boot and login phases. During steady state, the storage experienced low IOPS, especially since the PVS write cache was configured to use RAM with overflow to storage, which decreased IOPS demands.

 

To see the full set of test results and learn more, you can access the full CVD here.

 

— Frank Anderson, Senior Solutions Architect, Cisco Systems, Inc. (@FrankCAnderson)
Rob Briggs, Principal Solutions Architect, Citrix Systems, Inc. (@briggs_rob)