“Brandon, what the hell?” There is something about a conversation that starts that way which pretty much tells you it won’t be pleasant. The lead technical architect was livid, and I mean “super mad,” at me. I had been brought in for a Dynamics AX performance review, and he was reading the report of my findings. Now, you have to put things in context. Imagine a very good employee who is outstanding at AX and development. That person usually becomes the owner of a particular organization, and it becomes like his or her baby. He or she stays up late at night, puts in obscene hours, and learns all the special nuances of that particular organization required for day-to-day troubleshooting. This person doesn’t do it for the money, but out of pride and genuine care for the quality of the work. So when you, an external person, come out to that site and flag a problem that slipped past their watch, you had better be prepared for at least some anger. Different people handle it in different ways, and this was the more confrontational way.
Of course, the situation got even worse
When I looked at my calendar, I had instantly received 5 meeting requests, two of them from the partner. What’s really sad is that this partner and I had worked together multiple times, knew each other, and respected one another. But the title of one of the meeting requests was “Ensure that Brandon does not stop Go-Live”. Ouch. That is the problem with go-lives and emotions. So much money and effort has been invested in getting a company live that people get angry fast. Sadly, I had been through this before, and I didn’t enjoy it. I never enjoy writing a report stating that parts of an implementation are not suitable for production.
So, what Dynamics AX performance issue did I see that was so serious that it caused me to tell a multi-billion-dollar implementation to fix it before go-live?
Reviewing the code, I saw an issue that I deemed serious enough to need addressing immediately. I determined that it could basically destabilize the inventory process. I saw 42,000 round trips to update a very important inventory form (I’m protecting identities here by not naming the form, since many of my clients have presented me with issues like this in the past).
So here was my scorecard. Below a 5, I consider a component unsuitable for production. At 8 or above, I consider it good. Between 5 and 8, I believe a company can go live, but I highlight issues that need to be addressed as soon as possible.
|Overall Configuration of Software||8/10|
When tuning the AOS tier, it is all about Network Latency
Previously, I have blogged that size is your number one enemy. I’ve even gone on video to talk about logical reads, our biggest indicator of size.
But the art of tuning an AOS tier follows much the same logic you use when deciding whether driving a car to a certain place is worth the trip. In formal terms, it’s all about controlling network latency through the number of trips. For those of you brand new to this, in Dynamics terms, think of latency as the time it takes to get from the client to the server and back, aka a round trip. If you take an average latency of 100 ms and have a form that makes 42,000 round trips at 100 ms apiece, the network wait alone adds up to 4,200 seconds, or 70 minutes, all while locking up data in SQL so no one else can retrieve anything. The results are disastrous. A number of hardware counters would shoot through the roof, errors would pile up, and a lot of things on the server would degrade. If SQL has to wait even three minutes before getting a response back from the network, not many people can use that system at one time.
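The arithmetic behind a chatty form like that can be sketched in a few lines of Python (this is not AX code). The function name and the 42-trip comparison are illustrative assumptions, and the model counts only time spent waiting on the network, not server or SQL work.

```python
def network_wait_seconds(round_trips: int, latency_ms: float) -> float:
    """Time spent purely waiting on the network for a given number of round trips."""
    return round_trips * latency_ms / 1000.0


# The form from the review: 42,000 round trips at 100 ms apiece.
chatty = network_wait_seconds(42_000, 100)

# Hypothetical comparison: the same work batched into 42 round trips.
batched = network_wait_seconds(42, 100)

print(f"42,000 trips: {chatty:.0f} s ({chatty / 60:.0f} min)")
print(f"42 trips:     {batched:.1f} s")
```

Latency is the same in both cases. Only the trip count changes, which is why the number of round trips is the lever to reach for first.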
In development for AX 2012 (it is different in D365), you may remember a term for this: “RPCs”. At its root, the term gives us an idea of a common (not the only) cause of round trips, and you may also remember that all the coding guidance emphasizes reducing round trips.
90% of the settings for tuning the application tier are concerned with round trips
Think back to your AX 2012 configuration settings and look at them again. What they really do is let us store data from SQL in a format that usually isn’t quite as fast as direct SQL access, but is faster than the cumulative effect of bunches of round trips. After all, cache is just data stored in various formats and used without going directly to the SQL tables. Likewise, in code development, most of the performance enhancements have centered on adding cache to certain processes to stop RPCs.
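Here is a minimal sketch of how a cache trades local storage for fewer round trips. It is plain Python, not the AX caching mechanism: `FakeServer`, `CachedClient`, and the item IDs are hypothetical names made up for illustration.

```python
class FakeServer:
    """Stand-in for the database tier; counts every round trip it receives."""

    def __init__(self, rows):
        self.rows = rows
        self.round_trips = 0

    def fetch(self, key):
        self.round_trips += 1  # each fetch represents one trip over the network
        return self.rows[key]


class CachedClient:
    """Keeps a local copy of rows already fetched so repeat lookups skip the network."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.server.fetch(key)
        return self.cache[key]


server = FakeServer({"ITEM-001": "Widget", "ITEM-002": "Gadget"})
client = CachedClient(server)

# 1,000 lookups spread over only two distinct items.
for i in range(1000):
    client.get("ITEM-001" if i % 2 == 0 else "ITEM-002")

print(server.round_trips)  # 2 round trips instead of 1,000
```

The sketch leaves out invalidation and size limits on purpose, and those are exactly the parts that make real cache settings hard to get right.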
But why doesn’t cache always work?
Have you ever wondered why a fair number of performance hotfixes have issues? For one thing, cache is only faster than SQL when there are a few records. With a large amount of data, it isn’t always faster unless the cumulative savings from the reduced RPCs are big enough to offset what you lose by not using the database directly. Thus, we are always in a battle with our caching settings. Does increasing the cache hurt us or help us?
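That trade-off can be shown with a toy cost model. Everything in it is an assumption made for illustration: the function names, the timings, and above all the idea that a cache lookup costs time proportional to the number of cached records. Real AX cache behavior is more involved. The point is only that the winner flips depending on cache size and latency.

```python
def time_without_cache_ms(lookups, latency_ms, sql_ms):
    """Every lookup pays one round trip plus the SQL work itself."""
    return lookups * (latency_ms + sql_ms)


def time_with_cache_ms(lookups, cached_records, scan_us_per_record):
    """Toy assumption: each cache lookup costs time proportional to the cache size."""
    return lookups * cached_records * scan_us_per_record / 1000.0


LOOKUPS = 1000

small_cache = time_with_cache_ms(LOOKUPS, cached_records=100, scan_us_per_record=0.5)
big_cache = time_with_cache_ms(LOOKUPS, cached_records=100_000, scan_us_per_record=0.5)
low_latency_sql = time_without_cache_ms(LOOKUPS, latency_ms=1, sql_ms=0.5)
high_latency_sql = time_without_cache_ms(LOOKUPS, latency_ms=100, sql_ms=0.5)

print("small cache beats SQL at 1 ms latency: ", small_cache < low_latency_sql)
print("big cache beats SQL at 1 ms latency:   ", big_cache < low_latency_sql)
print("big cache beats SQL at 100 ms latency: ", big_cache < high_latency_sql)
```

Under these made-up numbers, the bloated cache loses to SQL on a fast network and wins on a slow one, which is why the same setting can help one implementation and hurt another.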
This is where one of the most complicated battles in performance tuning takes place.
Cache misuse is the most common cause of AOS slowness or too many AOS servers
Adding to this, in my experience the most common reason for adding too many AOS servers is inefficient use of the cache rather than the actual hardware. When cache gets bloated, the entire AOS slows down. Many times, it takes a restart, or waiting for the midnight cleaning routines, to get that particular AOS running again. This is usually why you see some places that can only handle 25 people on each AOS with super hardware and others that can handle 200 with less hardware. In many ways, I’ve found that cache requirements dictate hardware more than the other way around.
Brandon, if you are saying that there is no way to standardize cache settings for every implementation, what do we do?
In my experience, no single hard-and-fast number works for every implementation on the various settings. However, by learning a methodology and understanding why and how we use cache, you can adjust your settings on your own and track the effects (a blog for another day). Good latency, for example, allows you to get away with a lot of practices that would not work otherwise.
In my experience, AX runs 50% faster on a well-tuned AOS tier than on a system without AOS tuning.
How does our story end with the client?
Well, I never did hear back from the client after submitting my report and going through a couple of meetings. However, about a year later, I was at a conference and saw the lead technical architect. He made his way over to me and shook my hand. He told me that they had ignored my recommendations and gone live. But within 3 months (as soon as their network latency increased from adding another plant and taking on more load), they were having severe performance problems. They remembered my report, went through all the code issues that I had highlighted, and fixed them. Performance improved dramatically. He had been meaning to reach out to me via email and apologize but had been too busy. I told him apology accepted, and he even said that they would be calling me when they do their next upgrade. So, I suppose that things worked out in the end.
I hope this post helps anyone else who is trying to get their head around one of the most fundamental concepts for Dynamics AX Performance tuning and understanding how our thinking changes when we must speed up the AOS tier. It is a different animal than the database, but essential.