In many organizations the skill sets that administer Exchange, Active Directory (AD), Domain Naming System (DNS), and the network infrastructure are often segregated. Since Exchange is heavily dependent on communication with AD and that communication must traverse the network, the ability for the Exchange Administrator to have a basic understanding of how each of these components can be troubleshot is useful. Furthermore, often times the DNS, network, and AD administrators are not familiar with how Exchange is dependent on their services and thus are not fully prepared to help an Exchange administrator isolate production issues. As such, by the end of this it is hoped that there will be enough information to allow the Exchange, DNS, network, and AD administrators to each collect the data necessary to identify the majority of the issues.
There is some required reading that is necessary in order to understand much of the content that follows. Rather than reiterate all of the pre-requisite knowledge here, reviewing the following content might be helpful:
To start with, an understanding of what the performance counters "LDAP Read Time" and "LDAP Search Time" track is very useful (as identified in Ruling Out Active Directory-Bound Problems and Monitoring Common Counters: Exchange 2007 Help). One knows that once these counters approach or exceed the recommended thresholds, troubleshooting outside of Exchange needs to begin. Therefore, the question, "What are the potential contributors to high values for these counters?" is thus raised. Since these counters are installed with Exchange and are tracked within the DSAccess /ADAccess components, depending on the version of Exchange, there are a number of dependencies that Exchange does not have visibility into that the troubleshooter needs to be cognizant of when evaluating where the source of the delay is incurred. In order to provide a clearer picture of what needs to be considered, below is a walkthrough of a somewhat simplified list of events that happens after the clock starts ticking (reference How Active Directory Searches Work for greater detail):
1. WLDAP32.DLL receives the request from one of the Exchange processes. It has to locate a Global Catalog (GC) first.
2. DNS query traverses the network. Unresponsiveness or degraded responsiveness from DNS servers will degrade the overall performance of the query.
3. WLDAP32.DLL submits the query to the GC.
4. If not already started, a Transmission Control Protocol (TCP) session is established, the Lightweight Directory Access Protocol (LDAP) query traverses the wire.
Note: since TCP requires a 3 stage handshake before the session and in turn windowing can work, multiply network latencies by 3 to figure out how much time it takes just to establish the session (thus that 10 ms latency becomes 30 ms delay even before the Exchange server submits the query).
5. The TCP data transmitted up the networking stack to LSASS.exe listening on the LDAP ports.
6. The query is processed by the GC and searches through the database to return the results.
7. The data is sent out the Network Interface Card (NIC) on the GC.
8. The query is received via WLDAP32.DLL from the GC.
a. If there are multiple pages all pages need to be returned and stored in a data structure within DSAccess / ADAccess. Each of these requires traversal of the network and resubmission of the query.
b. If there are a large quantity of values in the requested attribute (think group memberships), these all need to be retrieved and returned. This results in multiple queries to the GC. This may require steps 1 - 5 to recur.
As can be seen from the above set of steps a lot of the performance delays are heavily dependent on network latency and network performance. In fact the only portion of that number that is actually impacted by the GC performance itself is step 5. As such, in addition to troubleshooting any potential performance issues on the GC, it is highly advantageous to collect network trace data when issues do occur, preferably from both sides of the conversation.
Summary of Data needed to troubleshoot
Exchange Server - Having this data from both sides is important for comparison purposes (i.e. is the processor spike on the DC correlating with the degraded performance in Exchange)
- Performance Counters specified in the referenced articles:
- Network Trace from the Exchange server side
Note: To ease collection of the Exchange performance data reference "Mike Lagase : Perfwiz replacement for Exchange 2007"
AD Server -
- Performance Data -
- Network Trace from the affected DC side.
DNS Server - not all organizations use Microsoft DNS servers, but for the purposes of this article the assumption that a Microsoft DNS server is in use.
- Network Trace from the DNS server
- Performance Data
- Windows 2008/2008 R2 - start the built in Data Collector Set "Reliability and Performance\Reports\System\System Performance" and use the data collector set.
- Windows 2003 - SPA Select the System Overview report
- Performance counters - All DNS related counters.
How to triage where troubleshooting needs to happen
- Network Infrastructure issues -
- Use the network trace to determine if the DNS server is responding in a timely fashion. Keeping in mind that Microsoft recommends the entire conversation happen in less than 50 ms in general, if the DNS servers are taking more of that 50 ms than the actual LDAP search, this could be problematic. Timely is also subjective to whatever is normal for the environment and base lining is helpful here. In general if the DNS server is on the same segment and performing properly, 1 to 2 ms or less is reasonable. Of course as always, it is recommended that the environment be base lined so that the expected performance is known.
- Use the network trace to determine if the TCP session with the GC is being established in a timely fashion.
Again, keeping in mind that there is a 50 ms performance threshold, if the TCP session establishment is taking more time than the actual LDAP search takes to complete, this could be problematic. Also, timely is also subjective to whatever is normal for the environment and base lining is helpful hear. In general, if the GCs are on the same segments and both the network and the GC are performing properly, sessions established in 1 to 2 ms or less is reasonable. Of course as always, it is recommended that the environment be base lined so that the expected performance is known.
- AD Issues - Often AD infrastructures are large and the AD admins cannot determine which of the many GCs is in use by the affected Exchange server, thus the responsibility of pointing the AD support team to the right box to troubleshoot falls on the Exchange admin.
- Use the network trace to find out which GC is having a large delay between the LDAP Request and LDAP Response packets.
- Use the performance counters to narrow down to the box that is performing poorly.
- Exchange 2007/2010 - MSExchange ADAccess Domain Controllers\LDAP Search Time and MSExchange ADAccess Domain Controllers\LDAP Read Time check the specific instances for values exceeding the recommended thresholds. This will report the specific GC (and underlying network path) that is not responding to Exchange as quickly as desired.
Note: On Exchange 2007 SP2 and newer systems, the "MSExchange ADAccess Local Site Domain Controllers" can also be used in the same manner.
- Exchange 2003 - MSExchangeDSAccess Domain Controllers\LDAP Search Time and MSExchangeDSAccess Domain Controllers\LDAP Read Time, check the specific instances for values exceeding the recommended thresholds. This will report the specific GC (and underlying network path) that is not responding to Exchange as quickly as desired.
- Exchange 2000 - performance counters similar to the above did not exist. The only way to determine which GC is responding slowly is to analyze a network trace as described previously.
Next, identify bottlenecks within the GC (note: for the most part these basics apply to troubleshooting DNS performance as well):
While there are a large variety of potential issues, there are some basic things that can be eliminated before advanced troubleshooting need take place. The below guidance are the most basic remediation suggestions for the largest variety of these types of issues. Essentially this comes down to, "Ok, so the issue is identified, now what do I do?"
Since there is plenty of published information on how to isolate disk, processor, and/or network issues, please reference the excellent articles linked at the beginning for the recommended thresholds. Also, check out Performance Analysis of Logs (PAL), which automates the analysis of the performance data collected to help determine where the system is bottlenecking.
Disk bound - Once the issue is isolated to being disk bound, there are essentially two options, add memory or allow for increased disk IO (i.e. add spindles for direct attached storage).
- Adding memory is not quite as cut and dry as just throwing in some sticks of RAM.
- 32-bit systems - It makes sense to avoid rehashing the entire discussion of 32-bit OS architecture especially since 32-bit architecture is on its way out. Reference this support article for some useful info. Memory usage by the Lsass.exe process on domain controllers that are running Windows Server 2003 or Windows 2000 Server
- 64-bit systems - Generally the easiest thing to do is to put enough RAM in the box to load the entire database (minus white space) in memory. As of this writing, even for the largest enterprises (16 GB to 32 GB databases), this is becoming quite inexpensive or easy to justify. If the box has enough memory to load the entire database already and disk IO is the issue, troubleshoot why the database is not being loaded. A couple tip:
- Measure Process\Virtual Bytes\lsass to see if it is about (within 10%/20%) of the size of the database (minus white space) or larger. If it isn't, troubleshoot why lsass isn't growing.
Ensure that other applications aren't consuming large amounts of memory and preventing LSASS from growing like it wants to.
- Increasing disk-throughput.
- If RAM cannot be added (or additional RAM will provide no benefit) due to the limits of the 32-bit OS architecture migrate to 64-bit. If migrating to 64-bit is not an option, the only option left is to add more disk throughput. Assuming that the storage is local and not SAN attached, it essentially means add more spindles (talk to the SAN guys if the AD database is stored on the SAN).
- Reducing disk IO to the database volume is another option. Look at options such as turning off backups and AV scans during production hours, move SYSVOL and/or the database logs to separate spindles. Just follow standard storage performance tuning recommendations. Also, IOs to the disk can be reduced by reducing the demand on the boxes either by adding more DCs to the environment or reducing client demand. SPA and Performance Logs and Alerts will help track down what else is using the volume.
- Ensure that LSASS is the process using the CPU, if not eliminate the problematic process.
- If LSASS is running hot, then use SPA (Windows 2003) or Performance Logs and Alerts (Windows 2008 and up) to identify where the CPU is being consumed within LSASS.
- If everything looks normal then the only option remaining is to add more processors to the box or more boxes to the infrastructure if there is high processor load across multiple systems.
Note: In large infrastructures with multiple DCs and multiple Exchange servers, adding more hardware only makes sense if all of the boxes are experiencing high CPU utilization (close to or exceeding the thresholds in Ruling Out Active Directory-Bound Problems). If most or all of the boxes in the Exchange site are not experiencing high CPU utilization then there is probably a system specific reason for high CPU time. Look for hardcoded applications, improperly balanced load due to tuning LDAP Weights and Priorities (Check out Why you need Active Directory for Exchange Server 2007 for details on how this works), or anything else which may indicate why on GC is being singled out compared to the rest. Fix that issue before adding more hardware.
- If the outbound queue length is > 0 sustained the easy fixes are:
- Update the NIC drivers to the latest versions if possible
- Disable the Scalable Networking pack features (SNP) on Windows 2003 base servers- An update to turn off default SNP features is available for Windows Server 2003-based and Small Business Server 2003-based computers. Check the network card settings to see if your driver can enable or disable certain SNP offloading features. Depending on the driver, if you disable these settings at the OS level, there is a possibility that they can still be enabled directly on the network card settings.
- If the DCs are running Windows 2003 and the Exchange Servers are running on Windows 2008, then apply the Scalable Networking Pack rollup on the DCs to ensure that known compatibility issues do not occur. Also be aware of a known Windows 2008 TCP/IP setting issue in http://support.microsoft.com/kb/967224 that can affect overall network connectivity and performance.
- Ensure that the NIC speed/duplex settings are properly set. MS recommendations, as of this writing, recommend both the GC and switch be set to 100/Full for 100 Mb segments and Auto/Auto for GigE segments. Check the Network Interface\Current Bandwidth performance counter to ensure that network card auto-negotiation is not changing the speeds the network card. This value should remain consistent at one value throughout the lifetime of the server being online.
- Disable the load balancing/fault tolerance features of the NIC driver suite.
- Ensure that Network Interface\Packets Outbound Errors performance counters remains at 0 all the time. Any increases in this value can indicate intermittent connectivity issues on the server.
- If the network connection is using more 60% to 80% of the maximum bandwidth, the only solutions are to upgrade the available bandwidth (100Mb > GigE) or distribute the load across multiple servers/network connections.
Identifying the source of the GC load
The possibility exists that, despite the DC being either disk or processor bound, that the load is unnecessary. To determine this, the load on the box must be analyzed in greater detail than just the performance can provide. To that end, there are two main strategies for identifying the load on the box, both of which can be used concurrently. How to analyze this data is somewhat out of scope of this article, but the below bullets contain links to useful articles on how to use this data:
- Enable diagnostic logging on the DC by setting the "15 Field Engineering" registry value in order to identify expensive queries that return greater than 1,000 objects.
- For more details on how to use this reference:
- Shows the expensive or inefficient LDAP queries
- The data is easy to get to since it is stored in the DS log
- Data collection and alerting can be automated using SCOM.
- Can overwrite the logs quickly, but the irony here is that if it is flushing the logs quickly, there is probably a lot of expensive queries that need to be remediated.
- Doesn't actually track the processor impact of the query.
- Only reports on LDAP queries.
- Use SPA (note this functionality exists natively in Windows 2008 and is entitled Performance Logs and Alerts which is part of Windows Reliability and Performance Monitor)
- More details
- Shows everything that is going on, whether it is authentication, LDAP queries, network traffic, or disk IO.
- Data collection can be triggered using SCOM when certain thresholds are hit.
- On busy DCs the logging can quickly grow to unmanageable levels (busy being hundreds of LDAP queries and authentications per second). Think on the order of 100+ MB per minute. But this isn't too bad as usually a 5 to 10 minute data collection during a production issue will get the data necessary to identify what is occurring.
- Requires the installation of a tool set (Windows 2003 only)
Data collection is key! Getting the right data when the problem occurs is absolutely critical to the restoring the health of the environment. Furthermore, as Exchange is the product suffering when its dependencies are not responding adequately, it falls on the Exchange administrator to do the leg work collecting the right data from the perspective of Exchange, and then isolating where the network, DNS, and AD administrators might need to look and start collecting data. Without this information, the network, DNS, and AD admins will be looking for a needle in a haystack and it is highly unlikely that they will find anything of use. Thus, once the Exchange admin has isolated the dependencies of concern, using the information provided here, it will be much easier to collect the data necessary from those systems to remediate any issues.