The Importance of Being Earnest in Monitoring Your Virtual Desktop User Experience

Did you recently complete a long-awaited project to upgrade your network and virtualize your PCs, data centers, and infrastructure?

I’m guessing you might be facing some challenges with monitoring how it went and how users are enjoying (or NOT enjoying) their virtual desktop experience.

While your IT Director does glow a bit more radiantly walking down the hallway and whistle a bit more frequently in the elevator now that bulky physical desktops are gone, you still need to troubleshoot problems and optimize performance.

Plus, the new CIO wants a report validating that your infrastructure changes were, and will continue to be, a sound investment, and the executive team wants advance warning of any performance bottlenecks.

They ultimately want to snapshot, quantify, and track changes in the user experience for all users, on all devices, 24/7!

 

Monitoring the Virtual Desktop User Experience

In the past, physical machines offered IT shops the opportunity to customize the user experience (UX). Christine in marketing had more RAM than Bill in accounting, and Ramesh in services had access to more network storage than either of them.

But with virtual machines, many shops do not monitor user experience and instead use a policy where all 20,000 employees get exactly the same virtual desktop: same processor, same RAM, same configuration, and same access to resources.

As you might imagine, Ramesh would be cursing your IT staff through a support chat app, and Bill would have far more desktop than he needs.

Christine just walked out.

In other words, without monitoring the user experience, this failed policy would:

  • Upgrade low-demand users who did not require access to advanced resources.
  • Downgrade high-demand power users who previously enjoyed a superior level of service.

Therefore, remember this important rule of virtualization—because you can dynamically allocate and throttle resources, monitoring the user experience is even more important than it was in the physical environment.

Four Reasons Why You Should Monitor the User Experience

  • Constant adjustments require usage data for maximum optimization. Monitoring helps you discover areas of improvement.
  • Users would otherwise experience issues and wrongly assign blame to virtualization.
  • Opportunities for automation, enhanced collection, and dynamic real-time reallocation of resources.
  • Monitoring is now much easier to do than it was in a physical environment.

 

How to Measure the Virtual User Experience

A rule of thumb in this business is that the virtual user experience must be at least as good as, if not better than, the physical experience. We can’t declare victory until a clear majority of users agree.

Naturally, you may be wondering, how do we measure that objectively?

Let me address three primary methods below.

1. Delays and Crashes

First, establish a rubric or benchmark based on a standard set of factors. Track the following three parameters and chart trends over time:

  • App Load Delay
  • Login Delay
  • App Not Responding (ANR) and Similar Crashes

 

Delays and crashes are strong indicators of user frustration level. In any given range of time (3 weeks, 3 days, or 3 hours), these numbers are going to point to the issue. Remember, lower numbers are better when it comes to measuring load times and crashes. Four crashes are better than 40 and a three-second load time is better than 30 seconds.

Like the indicators economists use to describe trends in the business market, these are lagging indicators, much like existing home sales, jobless claims, and the previous month’s new jobs figures. Lagging indicators reliably report on events that have already occurred.
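If you want a rough look at crash and hang counts before a dedicated monitoring tool is in place, Windows already records them in the Application event log. Here is a minimal PowerShell sketch, assuming Windows virtual desktops and permission to read the event log (IDs 1000 and 1002 are the standard application crash and hang events):

# Count application crashes (ID 1000) and hangs (ID 1002) over the last 3 days
$since = (Get-Date).AddDays(-3)
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; Id = 1000, 1002; StartTime = $since } -ErrorAction SilentlyContinue |
    Group-Object Id |
    Select-Object Name, Count

App load and login delays are harder to approximate this way, which is one reason the agent-based tools discussed later earn their keep.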

 

2. Technical Metrics

Second, track the following four major technical metrics:

  • RAM
  • CPU
  • Disk Storage
  • Network Traffic

Trending data from those four metrics adds up and empirically points to general environment issues that contribute to user frustration.

To continue our economic metaphor, these are leading indicators, such as bond yields or new housing starts. They are based on conditions that offer insight into what might occur if we can quickly assess the data and make accurate predictions. For example, don’t cut over to a new enterprise app that uses a lot of RAM if two-thirds of desktops are reporting out-of-memory issues.
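To illustrate how simple the raw collection can be, here is a minimal PowerShell sketch, assuming Windows guests with the standard performance counters, that samples all four metrics for five minutes and averages them (the commercial tools discussed below do this continuously and at scale):

# Sample RAM, CPU, disk, and network counters every 15 seconds for 5 minutes
$counters = '\Memory\Available MBytes',
            '\Processor(_Total)\% Processor Time',
            '\LogicalDisk(_Total)\% Free Space',
            '\Network Interface(*)\Bytes Total/sec'
Get-Counter -Counter $counters -SampleInterval 15 -MaxSamples 20 |
    Select-Object -ExpandProperty CounterSamples |
    Group-Object Path |
    Select-Object Name, @{ Name = 'Average'; Expression = { ($_.Group.CookedValue | Measure-Object -Average).Average } }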

3. User Experience Feedback Surveys

Third, conduct user experience feedback surveys. Because the results are swayed by each user’s current mood and are highly subjective, you’ll need responses from enough users for the results to be statistically meaningful and representative of the whole population.

You might include the following survey questions:

  • How would you rate the speed of your virtual desktop?
  • Would you consider any of the applications you use to be slow?
  • If YES, please list which apps are slow and the time of day when they are slow.
  • Which applications that you used in the past three months crashed?
  • How often did each application crash?

 

Consult with your data scientist or marketing team to carefully construct the questions in your survey. For best results, you want to invest up front in getting the first survey as accurate as possible, and consistently track future results.
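If you don’t have a data scientist handy, a back-of-the-envelope sample-size calculation is still better than guessing. Here is a minimal sketch using the standard formula n = z^2 * p(1 - p) / e^2; the confidence level and margin of error below are illustrative assumptions, not recommendations:

# Estimate how many survey responses you need
$z = 1.96    # z-score for a 95 percent confidence level
$p = 0.5     # assumed response proportion; 0.5 is the most conservative choice
$e = 0.05    # plus-or-minus 5 percent margin of error
[math]::Ceiling(($z * $z * $p * (1 - $p)) / ($e * $e))    # about 385 responses

For a large employee population, roughly 385 responses at these settings is a reasonable target; a finite-population correction lowers that figure only slightly.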

Monitoring Software

Skip the attempt to build a custom solution in-house. A few commercial tools are available to help you collect user experience data. Most solutions provide views with metrics that track architecture specs, infrastructure changes, desktops, laptops, workstations, kiosks, terminals, other devices, users, and apps.

Market tools include:

  • Liquidware Stratusphere UX: The reliable established market leader in this segment.
  • Lakeside SysTrack: A good tool for automated reports and dashboards.
  • ControlUp: Their real-time product includes a responsive dashboard that helps you resolve issues quickly.
  • Nexthink: Another real-time product with historical usage and IT service performance records, visualizations, actionable dashboards, reporting, and feedback surveys integrated.

These solutions also include built-in root cause analysis and problem identification.

They all tend to be strong at monitoring crashes, delays, and metrics; however, they typically lack an end-user survey feedback function. Nexthink is an exception. It delivers on all three points I made in the previous section, including surveys, but has some other disadvantages such as configuration requirements and cost.

When it comes to evaluating the costs and features of these competitors, I invite you to compare and decide for yourself. I will suggest that you can likely conduct the surveys yourself using SurveyMonkey, SurveyGizmo, GetFeedback, or another popular online survey tool.

Data Collection Tips:

  • Collect metrics and feedback data for as large a user pool as possible with a consistent number of users. For example, if you cannot survey all 15,000 employees, poll 1,000 every quarter. If you can do it every 60 days or monthly, that’s even better. You also want to have data before a change to serve as a baseline, and after a change to make comparisons. For example, immediately before and immediately after a shift from physical to virtual desktops.
  • Run the delay, crash, and technical metrics tools as often as possible. You want them capturing data almost constantly. Compare the data every month, examine reports, and look for trends.
  • It’s also important to note that all the tools I mention are strictly for monitoring. They don’t perform any corrective actions. You could script your own, but most organizations today are cautious about building yet another in-house custom solution when the cloud promises so much, from automation to updates.
  • Corrective automation tools are available for servers, but not for virtual desktops. Some real-time server resource allocation features exist in Turbonomic and VMware vRealize Operations Manager/Automation.

Evaluate the Trends

After collecting the data, examine any trends. If you see an increase in crashes, delays, helpdesk tickets, and other common issues, the overall user experience at your organization is in trouble. Like a crime drama or forensics TV show, go into analysis mode to determine why.

Use the feedback surveys to substantiate the trends. It works both ways; you can also use the metrics to support a trend in user feedback results.

For users that report poor performance, your survey should also ask them to specify when it occurs. If you can, try to pinpoint a two-hour window. Then, focus on that time and try to determine a root cause. You also have the names and machine IDs to go on.

Other forensic analysis tips:

  • Analyze just two or three users: They will reveal findings representative of a larger audience. Troubleshooting forensics for dozens of users will yield too many data points and too much variability.
  • Focus on snippets of user experience feedback: For example, three users reported crashes while using the same streaming app at the same time.
  • Look for patterns: For example, every 30 days you notice a block of days with high disk utilization metrics. Run another report for just that week and look for trends and sustained peaks. Within those peaks focus on just three hours, then one hour.
  • Filter out false positives: When you upgraded to a new application, everyone’s RAM suddenly looked insufficient in the metrics; however, a patch the following week fixes a known memory leak.
  • Memory is critical: The most common issues center around insufficient resources. Users often need more RAM. It’s typically more important than processor speed or flash storage.

 

Next Steps

After running monthly reports and tracking the trends, narrow your analysis window and draw your conclusions. Then prioritize the corrective actions you want to take.

For example, after identifying a storage bottleneck or memory issue that impacts 500 users, you might choose to allocate more memory to the top 50 and monitor that change for a few days.

A perception issue also plays a role. Studies show that users do not notice an improvement unless it signifies at least a 20 percent increase over the previous state. In other words, don’t spread resource allocation adjustments so thin that each user is given a two percent incremental bump-up every six months. They won’t even notice the change. Better to boldly introduce a 20 percent increase today. Your users will definitely notice the improvement.

Monitor changes and look for new patterns for at least two full weeks after a significant change. Compare data before, during, and after the change. Look at variances expressed in units and as percentages. Make sure your audience, staff, and customers are aware of the changes. User engagement is helpful.
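For example, here is a minimal sketch of that before-and-after comparison; the login-delay numbers are hypothetical:

# Express a change in both units and percent (hypothetical average login delays, in seconds)
$before = 28.0
$after  = 21.5
$deltaUnits   = $after - $before
$deltaPercent = [math]::Round(($deltaUnits / $before) * 100, 1)
"Login delay changed by $deltaUnits seconds ($deltaPercent percent)"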

Finally, quantify the cost of slow performance in terms of its financial and political impacts:

Financial Impact:

When 500 users experience slow applications every day for a week, the lost productivity is significant. On a recent CDI engagement, we found an anti-virus process that dragged performance during peak work hours. There was no need to impact users like this when the process could run after midnight.

Another financial example involves a hospital billing department. The accounts receivable team would face a severe challenge if slow network speeds prevented new billings from going out on time.

A critical medical procedure might require MRI images in the next 20 minutes while the patient remains under anesthesia. Now is not the time for performance delays.

Slow physical or virtualized environments also carry legal risks. A firm might be sued for losses involving delays for thousands of users.

Political Impact

Slow performance and a poor user experience do not reflect well on the brand. Company executives and account managers want to look their best when showcasing new product demos. In these situations, some of your IT staff may receive phone calls from frustrated callers demanding a fix or your resignation.

Performance is no joke, especially when you factor in contractual service agreements and the competitive dynamics of the cloud economy. A sub-standard user experience impacts your bottom line and perception in the news and social media.

In the long run, prevention pays for itself, so fund your performance fixes and attack the next set of bugs early and often. Equipping your staff with faster performance is essential for business.

The Final Word

People expect robust, fast, responsive computing devices. They want to leverage powerful networks, platforms, and applications to increase their productivity. When a weak link in the system arises, it can snowball and user productivity can dramatically decline or drop-off altogether.

In the physical realm, you can still go buy a better laptop.

But in the virtual realm, monitoring the user experience is essential to identify pain points and make the right adjustments.

8 Essential Tips for Virtual Desktop Security

 

  • #1 Do Not Use Persistent Virtual Desktops

 

Always use non-persistent virtual desktops. They are more secure because they are refreshed from their original image. Persistent virtual desktops behave like physical desktop PCs and are more susceptible to malware, virus infections, and corruption. Non-persistent desktops may be more difficult to implement and manage, with more requirements, but they are the safer bet in the long run.

 

Some users may be inconvenienced when personal files they saved, such as Microsoft Word documents, no longer appear after a desktop refresh. However, as an administrator, you can address this problem by configuring the environment to save personal files and other auxiliary settings and restore them from the user’s network profile after they log in again.

 

Even though more time is required for managing a non-persistent, refresh-ready virtual desktop environment, the investment is well worth the effort. As a case in point, a public school made a smart decision to virtualize about half of its nearly 1,000 desktops. When a virus attack was detected, the school simply advised users to log off. That action alone was all that was required to remove the virus from every user-accessible VDI desktop, and in only about five minutes. Half the network was spared; only the physical desktops and a few servers needed attention. The non-virtualized PCs and persistent desktops, by contrast, required considerable time for remediation. Therefore, it is advisable to virtualize the vast majority of your computing resources. For example, imagine the security you would enjoy if fully 90% of your desktops were virtual and only 10% of resources (typically servers) remained as physical hardware devices.

 

  • #2 Maintain Agentless Anti-Virus

 

Most PCs are running a standard anti-virus package. Don’t scale back on dedicated anti-virus. But if you want to optimize performance, you’ll need an agentless anti-virus solution. In tests, typical anti-virus software decreased storage IOPS performance by as much as 30 percent.

 

Consider an agentless option at the hypervisor level, where a light agent is built into VMware Tools on every virtual machine. Since the agent is so small, the solution is considered agentless. VMware’s NSX or vShield also provide a framework for agentless antivirus, and you can put a product like Trend Micro Deep Security or McAfee MOVE on your infrastructure servers. You’ll achieve fully agentless antivirus scanning on virtual desktops.

 

When a user logs on, they get a fresh virtual machine with no virus. While using the desktop, real-time scans prevent a virus. And when the user logs off, the desktop is refreshed from a clean image. Again, no viruses.

 

Some customers (schools, municipalities, or small businesses looking to save money) might skip agentless anti-virus, or even skip licensing a standard anti-virus package on virtualized machines entirely. This is a poor decision. In these environments, a virus will be introduced, persist, and spread. Even a refresh on a virtual desktop won’t eradicate a virus that persists elsewhere in the environment, so it will keep coming back. Even if all users log off, which reduces infection risk dramatically, the potential threat continues to exist. You must maintain real-time anti-virus protection, and agentless options are preferred to eliminate the 30 percent performance hit.

 

  • #3 Disable Multiple Virtual Desktop Logins

 

Do not allow the same user to log on to multiple virtual desktops at the same time. As an administrator, you need to disable that setting.

 

The following example illustrates a potential problem scenario that you want to avoid:

 

A user logs on to their PC. Later that day, that user logs into a virtual machine (VM1). Without logging off of either machine, they go home and decide to use a remote connection to the same machine (VM1) or even a different one (VM2). The security concern is that the session on VM1 is still open and vulnerable while that user is not present. Anyone walking by the PC can assume control of that virtual session.

 

As a precaution, institute the following network security policy:

 

Whenever the same user logs into another virtual desktop, automatically log them off the previous machine or virtual desktop.
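How you enforce this policy depends on your connection broker; Horizon, Citrix, and others each have their own session-handling settings, so check your vendor’s documentation first. As a partial sketch for the Windows guest itself, the registry value below restricts each user to a single session on a given machine (it does not cover the cross-desktop case, which belongs to the broker):

# Restrict each user to one session on this machine (the registry equivalent of the Group Policy
# setting "Restrict Remote Desktop Services users to a single Remote Desktop Services session")
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Control\Terminal Server' `
                 -Name 'fSingleSessionPerUser' -Value 1 -Type DWord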

 

  • #4 Use Two-Factor Authentication

 

Strength and options depend on the vendor technology, but generally speaking, we’re talking about a strong password plus a second form of physical or biometric authentication. Authentication providers include Okta, Imprivata, RSA, Duo, Yubico and others.

 

You want to enable and maintain an effective two-factor authentication arrangement to help prevent cyber-attacks, data breaches, security intrusions, viruses, malware, and hacks from home or remote PCs.

 

  • #5 Use Single-Sign-On (SSO) Tools

 

Network policies typically enforce strong passwords and force users to change the main desktop password that establishes SSO to network applications every 90 days. Strong SSO password policies typically enforce rules for minimum length and the number of special characters, letters, and numbers, as well as preventing common strings or recycled passwords.
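If Active Directory backs your SSO, those rules typically live in the default domain password policy. Here is a minimal sketch, assuming the ActiveDirectory PowerShell module, domain admin rights, and a hypothetical domain name; the specific values are examples only:

Import-Module ActiveDirectory
Set-ADDefaultDomainPasswordPolicy -Identity corp.example.com `
    -MinPasswordLength 12 `
    -ComplexityEnabled $true `
    -MaxPasswordAge (New-TimeSpan -Days 90) `
    -PasswordHistoryCount 24    # blocks recently recycled passwords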

 

With SSO, instead of multiple passwords, users only have to remember one. They are automatically logged into their network applications based on their desktop ID in the corporate LDAP directory, Active Directory, or user store. Even remote cloud-hosted applications such as Salesforce.com and Office 365 can authenticate users with SSO. That one SSO password is more convenient for both backend administrators and users. The administrators don’t have to maintain separate user stores with their own password policies, and users can typically remember their password without writing it down or copying it from an unprotected Excel file.

 

Security is also improved because there is a 1:1 mapping of unique, identifiable usernames to real human employees, as opposed to an environment without SSO where a single person might have 10, 20, or more different usernames that obscure any notion of an authentic identity. The trade-off comes in the event of a breach: separate, distributed passwords limit the damage, whereas hacking or phishing a set of SSO credentials can let the attacker reach far more data.

 

Biometrics, once confined to science fiction, Hollywood, and television, are common today, including fingerprint, thumbprint, and retina pattern scanning. Thus, you can combine biometric two-factor authentication with SSO for an easy-to-use yet secure solution. For the next 10-15 years, dual authentication consisting of a thumbprint or retina scan paired with a traditional memorized password seems likely to remain the de facto two-factor gold standard for government security. Financial institutions are likely to continue one tier below that with a silver standard that consists of a password and a dynamically generated temporary code.

 

  • #6 Restrict Access by Device Type

 

You can and should restrict access by device type. This involves establishing policies on your Windows or Mac servers. These restrictions help you respond to the bring-your-own-device (BYOD) mania that took over corporate wireless networks over the past 10-15 years. More secure variations on this theme include restricting access to pre-configured Windows-based thin clients (good), Linux-based thin clients (better), and even more secure zero clients (best).

 

Thin clients are typically Windows or Linux workstations. As such, they can contain viruses. For example, malware on a thin client could include keystroke-capture spyware that compromises virtual desktop credentials. Linux and Mac clients are considered more secure than Windows devices because of Windows’ large market share and the correspondingly larger number of existing Windows viruses.

 

A more secure alternative is to procure zero-client hardware right from the start. These are dedicated hardware devices with no OS and only a standard BIOS architecture. Zero clients are available from 10ZiG, Dell, HP, Samsung, and other popular vendors. A zero client has no function other than providing a secure connection to the virtual desktop. Because they have no OS or other local apps, they are very secure. Windows-based thin clients are not as secure and remain susceptible to viruses.

 

For example, one innovative hospital recently rolled out patient care carts with diagnostic equipment, each cart with its own Apple iPad for opening a virtual patient chart. The administrators established a policy allowing those iPads exclusive access to a patient care app on a virtual desktop infrastructure locked down beyond the reach of other devices.

 

The following access restriction strategies are common:

 

You can prohibit connections from certain unwanted devices. For example, you can allow or deny access to users with a PC, Mac, a specific OS, a specific set of login credentials to a virtual desktop, an iPad, an iPhone, a tablet, an Android device, a Windows phone, a Chromebook, or a specific mobile OS. (Hint: Based on recent history, Apple iOS devices are more secure than Android devices.)

 

You can use management tools to establish policies that secure your own preset thin-clients or zero-clients. For example, Apple utilities and third-party management tools can turn the iPad into a zero-client.

 

For maximum security, you can reduce the number of access points to your network by enabling client security certificates. Essentially, you enable a tool for handling certificates and then use management software to push a policy to all approved thin or zero-clients to verify a certificate before allowing login.

 

  • #7 Configure VDI Servers, Desktops, and Devices on Separate VLANs

 

Do not use the same VLAN for all network components. For optimum performance and security, you want your virtual desktops, access devices, and infrastructure servers on their own separate VLANs. When on the same VLAN, a weak access point such as a PC with an older OS might become infected with a virus that would easily spread to other virtual desktop clients on the same VLAN. Even servers are not immune when on the same shared VLAN.

 

Separate VLANs with discrete gateways also add variation to IP addresses, which makes device hacking more difficult. Another benefit is that more DHCP addresses are available because you are splitting access across VLANs; a single class C (/24) VLAN is limited to 254 usable host addresses behind one gateway.

 

  • #8 Use Network Micro-Segmentation

 

Gaining in popularity, especially among big government, banking, finance, and pharmaceutical organizations, a micro-segmentation security strategy integrates directly into the VDI without a hardware firewall. Your network policies are synchronized with a virtual network, virtual machine, OS, or other virtual security target to create a security bubble. Access control capabilities in virtual switches replace existing firewall functions for segregation and controlled access across data center tenants.

 

Micro-segmentation is ideal for today’s software-defined networks with virtual desktops and pools of users on multiple smaller devices. For example, let’s say you want to protect a pool of desktops for the accounting business unit. That department stores very sensitive information, and you must maintain a secure environment. With micro-segmentation, you allocate virtual desktops in that specific zone so they can only communicate with the Internet and the VDI servers, and are blocked from seeing any other desktops. Blocking IP traffic to sibling desktops is extremely effective at containing the spread of malware or viruses.

10 Security Best Practices for Mobile Device Owners

Don’t be alarmed by these statistics. More importantly, don’t become a statistic yourself. I’m sharing a few factoids here to help protect you, as one of the nearly 4.6 billion mobile device users out there (Gartner).

  • Cybercrimes including hacking and theft cost American businesses over $55 million per year (Ponemon Institute)
  • Every month, one in four mobile devices succumbs to some type of cyber threat (Skycure)
  • Last year in the United States alone, over five million smartphones were stolen or lost (Consumer Reports)

Who is responsible for such mayhem? Hackers, of course, and online thieves all over the world.

But who is responsible for protecting your device? You are.

As IT and Networking professionals, we can manage mobile device security around the clock, seven days a week, 365 days a year, but it is you, the mobile device owner or user, who ultimately determines the relative health of your smartphone or tablet and the level of security you want to experience.

To protect your mobile device, follow these recommended best practices:

  1. Lock your device with a passcode: One of the most common ways your identity can be stolen is when your phone is stolen. Lock your device with a password, but do not use common combinations like 1234 or 1111. On Android phones you can establish a swipe security pattern. Always set the device to auto-lock when not in use.
  2. Choose the Right Mobile OS for Your Risk Tolerance: Open source integrations, price, and app selection might guide you toward Android or Windows phones; however, Apple devices running iOS are generally more secure. A recent NBC Cybersecurity News article revealed that Google’s Android operating system has become a primary target for hackers because “app marketplaces for Android tend to be less regulated.” Hackers can more easily deploy malicious apps that can be downloaded by anyone. As an example, the article reported that over 180 different types of ransomware were designed to attack Android devices in 2015. If you’re an Android owner, fear not. Consumers who choose Android can still remain safe by being aware of the vulnerabilities and actively applying the other tips in this article.
  3. Monitor Links and Websites Carefully: Take a moment to monitor the links you tap and the websites you open. Links in emails, tweets, and ads are often how cybercriminals compromise your device. If it looks suspicious, it’s best to delete it, especially if you are not familiar with the source of the link. When in doubt, throw it out.

    If you have Android and your friend has an iOS device, and you both have a link you are not sure about opening, open the link on iOS first. This practice allows you to check out the link while lowering your exposure to risks including malware.

  4. Regularly Update Your Mobile OS: Take advantage of fixes in the latest OS patches and versions of apps. These updates include fixes for known vulnerabilities. (To avoid data plan charges, download these updates when connected to a trusted wireless network.) Every few days, and especially whenever you hear news about a new virus, take the time to check for OS updates or app patches.

    In 2016, an iOS 9.x flaw resulted in a vulnerability for iPhone users where simply receiving a certain image could leave the device susceptible to infection. Apple pushed out a patch. A year ago a similar flaw was detected on Android devices; however, the risk to users was significantly greater, impacting 95 percent of nearly one billion Android devices. An expected 90-day patch was late. Meanwhile, the flaw allowed hacking to the maximum extent possible including gaining complete control of the phone, wiping the device, and even accessing apps or secretly turning on the camera. Don’t ignore those prompts to update!

    At this point you may be asking, “Do I need a separate anti-virus app, especially if I use an Android device?” To answer that question, balance your need for security against how much risk you plan on taking with your device. Do you often use public wireless networks and make poor choices with the links you open? For now, you may not need an anti-virus app; however, some early industry trends are showing more anti-virus apps on the horizon.

  5. Do Not Jailbreak Your Smartphone: Reverse engineering and unauthorized modification of your phone (jailbreaking) leaves your phone vulnerable to malware. Even jailbreaking an iOS device leaves it open to infections. If your cousin already customized your device for you, it’s not too late. Restore the OS through the update process or check with an authorized reseller.

For the rest of the tips please read my work blog:

http://www.cdillc.com/newsroom/blog/

Exchange 2010-2016 Database Availability Group (DAG) cluster timeout settings for VMware

Symptom:

An Exchange 2010-2016 Database Availability Group (DAG) active database moves between DAG nodes for no apparent reason when the DAG nodes are VMware virtual machines. This may be due to a DAG node being vMotioned by the vSphere DRS cluster.

Solution:

The settings below allow you to vMotion DAG nodes without active databases flipping between nodes for no reason.

Although the tip below is mainly useful for multi-site DAG clusters, I have seen this flipping happen even within the same site. So, the recommendation is to run these commands on ANY DAG cluster that is running on VMware.

Instructions:

Substitute your DAG name for an example DAG name below, yourDAGname or rpsdag01.

On any Mailbox Role DAG cluster node, open Windows PowerShell with modules loaded.


Type the following command to check current settings:

cluster /cluster:yourDAGname /prop

Note the following Values:

SameSubnetDelay

SameSubnetThreshold

CrossSubnetDelay

CrossSubnetThreshold


Type the following commands to change the timeout settings.

cluster /cluster:yourDAGname /prop SameSubnetDelay=2000

cluster /cluster:yourDAGname /prop SameSubnetThreshold=10

cluster /cluster:yourDAGname /prop CrossSubnetDelay=4000

cluster /cluster:yourDAGname /prop CrossSubnetThreshold=10


Type the following command to check that the settings took effect:

cluster /cluster:yourDAGname /prop


You ONLY need to run this on ONE DAG node. It will be replicated to ALL the other DAG nodes.
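On newer Windows Server versions, where cluster.exe is deprecated, the FailoverClusters PowerShell module can make the same changes. A sketch, assuming the module is installed on the DAG node:

Import-Module FailoverClusters
$cluster = Get-Cluster -Name yourDAGname
$cluster.SameSubnetDelay      = 2000
$cluster.SameSubnetThreshold  = 10
$cluster.CrossSubnetDelay     = 4000
$cluster.CrossSubnetThreshold = 10
Get-Cluster -Name yourDAGname | Format-List *SubnetDelay, *SubnetThreshold    # verify the new values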

More Information:

See the article below:

http://technet.microsoft.com/en-us/library/dd197562(v=ws.10).aspx

Renaming Virtual Disks (VMDK) in VMware ESXi

Symptom:

You have just cloned a VM, and would like to rename your VMDKs to match the new name of the clone.

When you try to rename a VMDK in the GUI Datastore Browser in vSphere client, you get a message:

“At the moment, vSphere Client does not support the renaming of virtual disks”

How do you get around this message?

Instructions:

  1. Lookup the name of your Datastore and your VM in the GUI.
  2. Start SSH service.
  3. Login as root to your ESXi host.
  4. In an SSH session, type the following commands. Substitute the name of your Datastore for STORAGENAME and your VM for VMNAME.
    1. cd /vmfs/volumes/STORAGENAME/VMNAME
  5. Substitute the name of your old VMDK for OLDNAME and your new VMDK for NEWNAME. Remember – everything is case sensitive.
    1. vmkfstools -E ./OLDNAME.vmdk ./NEWNAME.vmdk 
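For example, with hypothetical names (datastore "datastore1", clone folder "Web02", old disk "Web01.vmdk"), the two commands look like this:

cd /vmfs/volumes/datastore1/Web02
vmkfstools -E ./Web01.vmdk ./Web02.vmdk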

VMware vSphere misidentifies local or SAN-attached SSD drives as non-SSD

Symptom:

You are trying to configure Host Cache Configuration feature in VMware vSphere. The Host Cache feature will swap memory to a local SSD drive, if vSphere encounters memory constraints. It is similar to the famous Windows ReadyBoost.

Host Cache requires an SSD drive, and ESXi must detect the drive type as SSD. If the drive type is NOT SSD, Host Cache Configuration will not be allowed.

However, even though you put in some local SSD drives on the ESXi host, and also have an SSD drive on your storage array coming through, ESXi refuses to recognize the drives as SSD type, and thus refuses to let you use Host Cache.

Solution:

Apply some CLI commands to force ESXi into understanding that your drive is really SSD. Then reconfigure your Host Cache.

Instructions:

Look up the name of the disk and its naa.xxxxxx number in VMware GUI. In our example, we found that the disks that are not properly showing as SSD are:

  • Dell Serial Attached SCSI Disk (naa.600508e0000000002edc6d0e4e3bae0e)  — local SSD
  • DGC Fibre Channel Disk (naa.60060160a89128005a6304b3d121e111) — SAN-attached SSD

Check in the GUI that both show up as non-SSD type.

SSH to ESXi host. Each ESXi host will require you to look up the unique disk names and perform the commands below separately, once per host.

Type the following commands, and find the NAA numbers of your disks.

In the output examples below, the commands you type are on the lines beginning with "~ #".

———————————————————————————————-

~ # esxcli storage nmp device list

naa.600508e0000000002edc6d0e4e3bae0e

Device Display Name: Dell Serial Attached SCSI Disk (naa.600508e0000000002edc6d0e4e3bae0e)

Storage Array Type: VMW_SATP_LOCAL

Storage Array Type Device Config: SATP VMW_SATP_LOCAL does not support device configuration.

Path Selection Policy: VMW_PSP_FIXED

Path Selection Policy Device Config: {preferred=vmhba0:C1:T0:L0;current=vmhba0:C1:T0:L0}

Path Selection Policy Device Custom Config:

Working Paths: vmhba0:C1:T0:L0

naa.60060160a89128005a6304b3d121e111

Device Display Name: DGC Fibre Channel Disk (naa.60060160a89128005a6304b3d121e111)

Storage Array Type: VMW_SATP_ALUA_CX

Storage Array Type Device Config: {navireg=on, ipfilter=on}{implicit_support=on;explicit_support=on; explicit_allow=on;alua_followover=on;{TPG_id=1,TPG_state=ANO}{TPG_id=2,TPG_state=AO}}

Path Selection Policy: VMW_PSP_RR

Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0;lastPathIndex=1: NumIOsPending=0,numBytesPending=0}

Path Selection Policy Device Custom Config:

Working Paths: vmhba2:C0:T1:L0

naa.60060160a891280066fa0275d221e111

Device Display Name: DGC Fibre Channel Disk (naa.60060160a891280066fa0275d221e111)

Storage Array Type: VMW_SATP_ALUA_CX

Storage Array Type Device Config: {navireg=on, ipfilter=on}{implicit_support=on;explicit_support=on; explicit_allow=on;alua_followover=on;{TPG_id=1,TPG_state=ANO}{TPG_id=2,TPG_state=AO}}

Path Selection Policy: VMW_PSP_RR

Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0;lastPathIndex=1: NumIOsPending=0,numBytesPending=0}

Path Selection Policy Device Custom Config:

Working Paths: vmhba2:C0:T1:L3

———————————————————————————————-

Note that the Storage Array Type is VMW_SATP_LOCAL for the local SSD drive and VMW_SATP_ALUA_CX for the SAN-attached SSD drive.

Now we will check to see if in CLI, ESXi reports the disks as SSD or non-SSD for both disks. Make sure to specify your own NAA number when typing the command.

———————————————————————————————-

~ # esxcli storage core device list --device=naa.600508e0000000002edc6d0e4e3bae0e

naa.600508e0000000002edc6d0e4e3bae0e

Display Name: Dell Serial Attached SCSI Disk (naa.600508e0000000002edc6d0e4e3bae0e)

Has Settable Display Name: true

Size: 94848

Device Type: Direct-Access

Multipath Plugin: NMP

Devfs Path: /vmfs/devices/disks/naa.600508e0000000002edc6d0e4e3bae0e

Vendor: Dell

Model: Virtual Disk

Revision: 1028

SCSI Level: 6

Is Pseudo: false

Status: degraded

Is RDM Capable: true

Is Local: false

Is Removable: false

Is SSD: false

Is Offline: false

Is Perennially Reserved: false

Thin Provisioning Status: unknown

Attached Filters:

VAAI Status: unknown

Other UIDs: vml.0200000000600508e0000000002edc6d0e4e3bae0e566972747561

~ # esxcli storage core device list --device=naa.60060160a89128005a6304b3d121e111

naa.60060160a89128005a6304b3d121e111

Display Name: DGC Fibre Channel Disk (naa.60060160a89128005a6304b3d121e111)

Has Settable Display Name: true

Size: 435200

Device Type: Direct-Access

Multipath Plugin: NMP

Devfs Path: /vmfs/devices/disks/naa.60060160a89128005a6304b3d121e111

Vendor: DGC

Model: VRAID

Revision: 0430

SCSI Level: 4

Is Pseudo: false

Status: on

Is RDM Capable: true

Is Local: false

Is Removable: false

Is SSD: false

Is Offline: false

Is Perennially Reserved: false

Thin Provisioning Status: yes

Attached Filters: VAAI_FILTER

VAAI Status: supported

Other UIDs: vml.020000000060060160a89128005a6304b3d121e111565241494420

———————————————————————————————-

Now we will add a rule to enable SSD on those 2 disks. Make sure to specify your own NAA number when typing the commands.

———————————————————————————————-

~ # esxcli storage nmp satp rule add --satp VMW_SATP_LOCAL --device naa.600508e0000000002edc6d0e4e3bae0e --option=enable_ssd

~ # esxcli storage nmp satp rule add --satp VMW_SATP_ALUA_CX --device naa.60060160a89128005a6304b3d121e111 --option=enable_ssd

———————————————————————————————-

Next, we will check to see that the commands took effect for the 2 disks.

———————————————————————————————-

~ # esxcli storage nmp satp rule list | grep enable_ssd

VMW_SATP_ALUA_CX     naa.60060160a89128005a6304b3d121e111                                                enable_ssd                  user

VMW_SATP_LOCAL       naa.600508e0000000002edc6d0e4e3bae0e                                                enable_ssd                  user

———————————————————————————————-

Then, we will run storage reclaim commands on those 2 disks. Make sure to specify your own NAA number when typing the commands.

———————————————————————————————-

~ # esxcli storage core claiming reclaim -d naa.60060160a89128005a6304b3d121e111

~ # esxcli storage core claiming reclaim -d naa.600508e0000000002edc6d0e4e3bae0e

Unable to unclaim path vmhba0:C1:T0:L0 on device naa.600508e0000000002edc6d0e4e3bae0e. Some paths may be left in an unclaimed state. You will need to claim them manually using the appropriate commands or wait for periodic path claiming to reclaim them automatically.

———————————————————————————————-

If you get the error message above, that’s OK. It takes time for the reclaim command to work.

You can check in the CLI by running the command below and looking at the "Is SSD" field; in the example output it still shows false because the change has not taken effect yet.

———————————————————————————————-

~ # esxcli storage core device list --device=naa.600508e0000000002edc6d0e4e3bae0e

naa.600508e0000000002edc6d0e4e3bae0e

Display Name: Dell Serial Attached SCSI Disk (naa.600508e0000000002edc6d0e4e3bae0e)

Has Settable Display Name: true

Size: 94848

Device Type: Direct-Access

Multipath Plugin: NMP

Devfs Path: /vmfs/devices/disks/naa.600508e0000000002edc6d0e4e3bae0e

Vendor: Dell

Model: Virtual Disk

Revision: 1028

SCSI Level: 6

Is Pseudo: false

Status: degraded

Is RDM Capable: true

Is Local: false

Is Removable: false

Is SSD: false

Is Offline: false

Is Perennially Reserved: false

Thin Provisioning Status: unknown

Attached Filters:

VAAI Status: unknown

Other UIDs: vml.0200000000600508e0000000002edc6d0e4e3bae0e566972747561

———————————————————————————————-

Check in the vSphere Client GUI. Rescan storage.

If it still does NOT say SSD, reboot the ESXi host. 

Then look in the GUI and rerun the command below.

———————————————————————————————-

~ # esxcli storage core device list --device=naa.60060160a89128005a6304b3d121e111

naa.60060160a89128005a6304b3d121e111

Display Name: DGC Fibre Channel Disk (naa.60060160a89128005a6304b3d121e111)

Has Settable Display Name: true

Size: 435200

Device Type: Direct-Access

Multipath Plugin: NMP

Devfs Path: /vmfs/devices/disks/naa.60060160a89128005a6304b3d121e111

Vendor: DGC

Model: VRAID

Revision: 0430

SCSI Level: 4

Is Pseudo: false

Status: on

Is RDM Capable: true

Is Local: false

Is Removable: false

Is SSD: true

Is Offline: false

Is Perennially Reserved: false

Thin Provisioning Status: yes

Attached Filters: VAAI_FILTER

VAAI Status: supported

Other UIDs: vml.020000000060060160a89128005a6304b3d121e111565241494420

———————————————————————————————-

If it still does NOT say SSD, you need to wait. Eventually the change takes effect and the disk displays as SSD in both the CLI and the GUI.

More Information:

See the article below:

Swap to host cache aka swap to SSD?

Enable Microsoft Exchange 2010-2016 DAC mode

Description:

Datacenter Activation Coordination (DAC) mode is a property setting for a database availability group (DAG).

DAC mode is disabled by default and should be enabled for all DAGs with 2 or more members that use continuous replication.

That means the majority of Exchange DAG deployments need the DAC mode.

DAC mode is most useful in a multi-datacenter configuration to prevent split-brain syndrome, a condition that occurs when all networks fail and DAG members can’t receive heartbeat signals from each other.

However, I suggest you always enable the DAC.

If you enable DAC mode and later need to recover, the recovery takes fewer commands on the command line. Only the following cmdlets will be necessary:

Stop-DatabaseAvailabilityGroup

Restore-DatabaseAvailabilityGroup

Start-DatabaseAvailabilityGroup

Also, if you did NOT enable DAC mode beforehand, you cannot do so once you have a failure. It must be enabled ahead of time.

Instructions:

Here is how to enable the DAC mode:

1. Go to Exchange Management Shell

2. Type the following, where DAG2 is your DAG name:

Set-DatabaseAvailabilityGroup -Identity DAG2 -DatacenterActivationMode DagOnly
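3. To confirm the change took effect, check the DAG properties (again, DAG2 is your DAG name):

Get-DatabaseAvailabilityGroup -Identity DAG2 | Format-List Name, DatacenterActivationMode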

More Information:

For more information, read this article:

http://technet.microsoft.com/en-us/library/dd979790.aspx

What’s next for Virtual Desktop Infrastructure?

Greetings CIOs, IT Managers, VM-ers, Cisco-ites, Microsoftians, and all other End-Users out there… Yury here. Yury Magalif. Inviting you now to take another virtual trip with me to the cloud, or at least to your data center. As Practice Manager at CDI, I lead the team of seven (plus or minus a consultant or two) that your company depends on to manage the implementation of virtualized computing: hardware, software, equipment, service optimization, monitoring, provisioning, etc. And you thought we were sitting behind the helpdesk, concerned only with front-end connectivity. Haha (still laughing) that’s a good one!

VDI: OUR JOURNEY BEGINS HERE
Allow me to paint a simple picture and add a splash of math to illustrate why your CIO expects so much from me and my team. Your company posted double-digit revenue growth for three years running and somehow, now, in Q2 of year four, finds itself in a long fourth-down-and-20 situation. (What? You don’t understand American football analogies? Okay, in the international language of auto racing, we are 20 laps behind and just lost a wheel.) One thousand employees need new laptops, docking stations, flat panel displays, and related hardware. Complicating the matter are annual software licensing fees for a group of 200 users, of whom only five are concurrent users worldwide. At $1,500 per user times 1,000, plus the $100 fee, your CIO has to decide how to explain to the board a plan to spend another $1.5 million on IT just after Q1 closed down 40 percent and Q2 is looking even worse.

To read the rest of this blog, where I try something different, please go to my work blog page:

http://www.cdillc.com/whats-next-virtual-desktop-infrastructure/

Assessing your Infrastructure for VDI with real data – Part 2 of 2 – Analysis

For VSI, we established that using analysis tools was a necessity, and VMware provided the wonderful Capacity Planner tool. However, it soon became evident that for VDI, using analysis tools is even more important. That is because for VDI, when you buy hardware and software, the investment is generally higher. You need a lot more storage, and faster storage. You need many servers and a fast network. So the margin for error is smaller.

Consequently, using Liquidware FIT or Lakeside SysTrack is essential. There are now a few more tools on the market, like ControlUp or Login PI. However, the new entrants have not been battle-tested yet.

So how do you analyze your physical desktops for VDI?

First, buy a license for the Liquidware FIT tool (per user, inexpensive), or buy an engagement from your friendly Value Added Reseller or Integrator who is a Liquidware partner. If you buy a service from a partner, a license for up to 250 desktops will usually be included with the service.

Here, I will talk about partner services because that is what I do. However, if you are doing this yourself, just apply the same steps.

You will need to provide your partner’s engineer with space for 2 small Liquidware virtual appliances. The only gotcha is that you want them on the fastest storage you have (SSD preferable). That is because on slower storage, it takes much longer to process any analysis or reports.

The engineer will come and install the two appliances into your vSphere environment. Then, the engineer will give you an EXE or MSI with an agent. Usually, you can use the same mechanism you already use to install software on your desktops to distribute the agent; distribution tools like Microsoft SCCM, Symantec Altiris, LANDesk, and even Microsoft Group Policy will all work. If you don’t have a mechanism for software distribution, your engineer can use a script to install the agents on all PCs.
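For example, a minimal silent-install command might look like the line below; the share path and installer file name are hypothetical, so check the agent’s documentation for its actual name and switches:

msiexec /i "\\fileserver\software\LiquidwareAgent.msi" /qn /norestart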

Make sure to choose a subset of your PCs, including at least some from each group of similar users (Accounting, Sales, IT, etc.). Your sample size could be about 10-25% of the total user count. Obviously, the higher the percentage, the more accuracy you get. But the goal here is not 100% accuracy; that is impossible to achieve. Assessment and performance analysis is an art as much as a science. Thus, you need just enough users to get a ballpark estimate of what hardware you need to buy. Also, run the assessment for one month preferably, or at a bare minimum two weeks. The data collection clock should start from the time you deploy the Liquidware agent to your last user.

Your partner engineer will need remote access, if possible, to check on the progress of the installation. First, the engineer will check if the agents are reporting successfully back to the Liquidware appliances. During the month, the engineer will make sure agents are reporting and data can be extracted from the appliance.

In the middle of the assessment, the engineer will do a so-called “normalization” of the data to make sure the results are consistent with analysis rules of thumb. If necessary, the engineer will readjust thresholds and recalculate the data back to the beginning.

At the end of 30 days, the engineer will generate a machine-made report on the overall performance metrics, and will present the report to you.

For an extra service fee, some partners will go further and analyze the report to determine the amount and performance characteristics of the hardware you need. In addition, the engineer will create a written report and present all the data to you.

In either case, you will know which desktops have the best scores for virtualization and which ones you should not virtualize. If you go with the more advanced report services from your partner, you will also understand how to map the results to hardware, along with further insights.

One way of mitigating a bad VDI sizing is to also use a load simulation tool like Login VSI. However, Login VSI is only useful for clients who can afford to buy lab equipment similar to what they will buy for production. Using Login VSI, you can simulate robotic (fake) users doing the tasks that real users will do in VDI. Login VSI gives you a good ballpark hardware number; however, that number does not reflect real user experience data. For that, you need tools like Liquidware FIT and the associated analysis to determine the proper VDI strategy.

Understanding your current user experience, and how that experience can be reproduced on virtual desktops, is essential to VDI. You should do this assessment before buying your hardware. Doing an assessment ensures that your users get the same experience or better on the virtual desktop as they had on the physical desktop (the holy grail of VDI).

Assessing your Infrastructure for VDI with real data – Part 1 of 2 – History

It is now a common rule of thumb that when you are building Virtual Server Infrastructure (VSI), you must assess your physical environment with analysis tools. The analysis tools show you how to fit your physical workloads onto virtual machines and hosts.

The gold standard in analysis tools is VMware’s Capacity Planner. Capacity Planner was originally made by a company called AOG. AOG analyzed not just physical-to-virtual migrations but also the overall performance of different aspects of a system, and it was one of the first agentless data collection tools. Agentless was better because you did not have to touch each system in a drastic way, so there was less chance of a driver going bad or of a performance impact on the target system.

Thus, AOG partnered with HP and other manufacturers, and was doing free assessments for their customers, while getting paid by the manufacturer on the backend. AOG tried to sell itself to HP, but HP, stupidly, did not buy AOG. Suddenly, VMware came from nowhere and snapped up AOG. VMware at the time needed an analysis tool to help customers migrate to the virtual infrastructure faster.

When VMware bought AOG, VMware dropped AOG’s other analysis business and made the product a free tool for partners to analyze migrations to the virtual infrastructure. It was a shame, because AOG’s tool, renamed Capacity Planner, was really good. Capacity Planner relies solely on Windows Management Instrumentation (WMI), which is already built into Windows and collects information all the time. Normally, WMI discards data such as performance counters unless something is configured to collect it. Capacity Planner simply enabled that collection and gathered WMI performance and configuration data from each physical machine.

When VMware entered the Virtual Desktop Infrastructure (VDI) business with Horizon View, it lacked major pieces of the VDI ecosystem: profile management, planning and analysis, and monitoring. Numerous companies immediately sprang to life to help VMware fill the need. Liquidware Labs (whose founder had worked for VMware) was the first to come up with a robust planning and analysis tool, Stratusphere FIT, followed by a monitoring tool, Stratusphere UX. Lakeside SysTrack also came on the scene. VMware used both internally, although the preference was for Liquidware.

Finally, VMware realized that the lack of a VMware-made analysis tool for VDI was hindering it. What VMware failed to realize was that such a tool had already existed in-house for years: Capacity Planner. The Capacity Planner team was neglected, so updates to the tool were rare. However, since Capacity Planner could already analyze physical machines for performance, it was easy to modify the code to collect information for virtualizing physical desktops in addition to servers.

The Capacity Planner code was eventually updated with desktop analysis. All VMware partners were jumping with joy: we now had a great tool and did not have to learn any new software. I remember eagerly collecting my first data set and beginning the analysis. The tool told me I needed something like twenty physical servers to hold 400 virtual desktops. Twenty desktops per server? That sounded wasteful. I was a beginner VDI specialist then, so I trusted the tool but still had doubts. Then I did a few more passes at the analysis and kept getting wildly different numbers. Trusting my gut instinct, I decided to redo one analysis with Liquidware FIT.

Liquidware FIT does use agents, and while I used it, I had always thought it would be nice not to need agents, so VMware’s addition of desktop analysis to the agentless Capacity Planner was very welcome. Back to my analysis: after running Liquidware FIT, I came up with completely different numbers. I don’t remember exactly what they were, perhaps 60 desktops per physical server, but I do remember that Liquidware’s analysis made sense where Capacity Planner’s did not. My suspicions about Capacity Planner were confirmed by VMware’s own VDI staff who, when asked whether they use Capacity Planner to size VDI, said, “For VDI, avoid Cap Planner like the plague, and keep using Liquidware FIT.”

As a result, I have kept using Liquidware FIT ever since and never looked back. While FIT does require agents, I now understand that metrics like application load times and user login delay cannot be gathered without agents, because Windows does not expose such metrics through WMI. A rich agent can pick up many more user experience items and thus do much better modeling.