Delphi Profiling tools [closed]
I am having some performance problems with my Delphi 2006 app. Can you suggest any profiling tools that will help me find the bottleneck?
e.g. a tool like Turbo Profiler
I asked the same question not too long ago.
I've downloaded and tried AQtime. It does seem comprehensive, but it is not an easy-to-use tool and is VERY expensive for an individual programmer (i.e. $600 US). I loved the fact that it was non-invasive (it did not change your code), and that it could do line-by-line profiling, until I found that because it is an instrumenting profiler, it can lead to improper optimizations, as in: Why is CharInSet faster than Case statement?
I tried a demo of ProDelphi, which is much less expensive (about $80, I think), but it was much too clunky for me: I didn't like the user interface at all, and it is invasive, changing your code to add the instrumenting, which you have to be careful about.
I used GpProfile with Delphi 4 for many years. I loved it. It also was invasive, but it worked so well that I learned to trust it, and it never gave me a problem in 10 years. But when I upgraded to Delphi 2009, I didn't think it wise to try using it, since it hasn't been upgraded and, by GP's own admission, won't work without modifications. I expect you won't be able to use it with Delphi 2006 either.
ProDelphi and GpProfile will only profile at the procedure level. If you want timings for individual lines (which I sometimes needed), you have to call PROC1, PROC2, PROC3 for each line and put the one line in each PROC, as sketched below. It was a bit of an annoyance to have to do that, but it gave me good results (at least I was happy with the results GpProfile gave me that way).
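To make that concrete, here is a minimal sketch of the refactoring (the type and helper names are made up for illustration): each line of the hot routine is moved into its own tiny procedure, so a procedure-level profiler reports a separate time and call count per original line.

    type
      TItem = record
        Key: string;
        Hash, Index, Value: Integer;
      end;

    // Hypothetical stand-ins for whatever the real lines call.
    function ComputeHash(const Key: string): Integer;
    begin
      Result := Length(Key);
    end;

    function LookupIndex(Hash: Integer): Integer;
    begin
      Result := Hash mod 16;
    end;

    function FetchValue(Index: Integer): Integer;
    begin
      Result := Index * 2;
    end;

    // One wrapper per source line of the original routine.
    procedure Proc1(var Item: TItem);
    begin
      Item.Hash := ComputeHash(Item.Key);
    end;

    procedure Proc2(var Item: TItem);
    begin
      Item.Index := LookupIndex(Item.Hash);
    end;

    procedure Proc3(var Item: TItem);
    begin
      Item.Value := FetchValue(Item.Index);
    end;

    // The hot routine now contains only calls, so the profiler's
    // per-procedure report doubles as a per-line report.
    procedure ProcessItem(var Item: TItem);
    begin
      Proc1(Item);
      Proc2(Item);
      Proc3(Item);
    end;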
The answer I accepted in my CharInSet question said that "Sampling profilers, which periodically check the location of the CPU, are usually better for measuring code time," and a later answer pointed to Eric Grange's free sampling profiler for Delphi, which now supports Delphi 2009. I haven't tried it yet, but I've heard good things about it, and it is the next one I'm going to try.
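For the curious, here is a toy illustration of what "periodically check the location of the CPU" means. This is a sketch only, not a real profiler: the worker loop, sampling rate, and bookkeeping are all made up. It uses real Win32 calls (SuspendThread, GetThreadContext, ResumeThread) on 32-bit Delphi, where the sampled program counter is Ctx.Eip:

    program SamplingSketch;
    {$APPTYPE CONSOLE}
    uses Windows;

    var
      WorkerHandle: THandle;
      Samples: array of Pointer;
      Done: Boolean = False;   // unsynchronized flag is fine for a toy demo

    // The "program under test": busy work for samples to land in.
    function Worker(Param: Pointer): DWORD; stdcall;
    var
      I: Integer;
      X: Double;
    begin
      X := 0;
      while not Done do
        for I := 1 to 100000 do
          X := X + Sqrt(I);
      Result := 0;
    end;

    // One sample: freeze the worker, record where its CPU was, resume it.
    procedure TakeSample;
    var
      Ctx: TContext;
    begin
      SuspendThread(WorkerHandle);
      Ctx.ContextFlags := CONTEXT_CONTROL;   // we only need EIP/ESP/EBP
      if GetThreadContext(WorkerHandle, Ctx) then
      begin
        SetLength(Samples, Length(Samples) + 1);
        Samples[High(Samples)] := Pointer(Ctx.Eip);
      end;
      ResumeThread(WorkerHandle);
    end;

    var
      ThreadId: DWORD;
      I: Integer;
    begin
      WorkerHandle := CreateThread(nil, 0, @Worker, nil, 0, ThreadId);
      for I := 1 to 500 do
      begin
        Sleep(2);        // roughly 500 samples per second
        TakeSample;
      end;
      Done := True;
      WaitForSingleObject(WorkerHandle, INFINITE);
      WriteLn('collected ', Length(Samples), ' EIP samples');
      // A real profiler maps each sampled address back to a source line
      // via the debug info, then reports the lines with the most samples.
    end.

Because it only interrupts the program a few hundred times per second, this approach barely perturbs the timing it is trying to measure, which is its advantage over instrumenting.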
By the way, you might be best off saving your $600 by NOT buying AQtime, and instead using it to upgrade your Delphi 2006 to Delphi 2009. The stability, speed and extra features (especially Unicode) will be worth your while. See: What are major incentives to upgrade to D2009 (Unicode excluded)?
Also AQtime does not integrate into Delphi 2009 yet.
One other free one, with source, that I found out about but haven't tried yet is TProfiler. If anyone has tried it, I'd like to know what they think.
Note: The addendum I added afterwards to question 291631 seems like it may be the answer. See Andre's open source program: asmprofiler.
Feb 2010 follow-up: I bit the bullet and purchased AQtime. A few months ago they finally integrated it into Delphi 2009, which is what I use (but they still have to do Delphi 2010). The viewing of source lines and their individual times and counts is invaluable to me, and AQtime does a superb job of it.
I have just found a very nice free sampling profiler, and it supports Delphi 2009.
I've used ProDelphi, mostly to determine which routines are eating the most time. It's an instrumenting profiler, meaning it adds a bit of code to the beginning and end of each routine. You control which routines it profiles with directives inside comments. You can also profile sections of a routine, but a section must start and stop at the same block level, with no entry into or exit out of the section. Optimization must be off where ProDelphi inserts its code (where you put the directives), but you can turn it on anywhere else.
The interface is kinda clunky, but very fast once you get the hang of it. You can do useful work with the free version (limited to 10 routines or sections). ProDelphi can quickly tell you which routines you should examine, but not why, or which lines.
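For a feel of what "instrumenting" means, here's a hand-rolled equivalent of what such a profiler effectively inserts around a routine. This is a minimal sketch using the Windows QueryPerformanceCounter API, not ProDelphi's actual inserted code or directive syntax, and HotRoutine is just a placeholder workload:

    program InstrumentDemo;
    {$APPTYPE CONSOLE}
    uses Windows;

    var
      Freq, T0, T1: Int64;

    procedure HotRoutine;
    var
      I: Integer;
      Sum: Int64;
    begin
      Sum := 0;
      for I := 1 to 10000000 do
        Inc(Sum, I);
    end;

    begin
      QueryPerformanceFrequency(Freq);

      // An instrumenting profiler automatically wraps each profiled
      // routine in timing calls like these:
      QueryPerformanceCounter(T0);
      HotRoutine;
      QueryPerformanceCounter(T1);

      WriteLn('HotRoutine: ', (T1 - T0) / Freq * 1000:0:3, ' ms');
    end.

The wrapper calls themselves cost time, which is why instrumenting can skew results for very short, very hot routines -- the same distortion mentioned in the CharInSet answer above.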
Recently, I've started using Intel's VTune Performance Analyzer. 'WOW' doesn't begin to sum it up. I am impressed. I simply had no idea all this was built into modern Intel processors. Did you know it can tell you exactly how often a single instruction needed to wait for the L1 Data Cache to look sideways at another core before reloading a word from a higher cache? If I keep writing, I'll just sound like a breathless advert for the product.
Go to Intel and download the fully working timed demo. Dig around the net and find a couple of videos on how to get started (otherwise you run the risk of being stymied by all the options). It works with any compiler: just point it at a .exe. It'll show you source lines if your .exe includes debug info and you point it to the source code.
I was stuck trying to optimize an inner loop that called a function I wrote. There were no external calls except length(str). This inner loop ran billions of times per run, and ate up about half the CPU time -- a perfect candidate for optimization. I tried all sorts of standard optimizations, with little to no effect. VTune shows hot spots. I just drilled down until it showed me the ASM my code generated, and how much time each instruction took.
Here's what VTune told me:
    line nnnn  [line of delphi code] ...
    addr hhhh  cmp byte ptr [edx+ecx],0x14    ;     3 cycles
    addr hhhh  ja  label_x                    ; 10302 cycles
The absolute values mean nothing. (I think I was measuring cycles per instruction retired.) The relative values make it kinda clear where all the time went. The great thing was the Advice Window. It told me the code stalled waiting for data to load into the L1 data cache, and actually gave me good advice on how to avoid stalls.
My mistake was in thinking of the Core2 Quad as just a really fast 8086 CPU. No^3. The code was spending 99% of its time waiting for data to load from memory because I was jumping around too much. My algorithm assumed that memory was RAM (Random Access). That's not how modern CPUs work. Data in L1 cache might be accessed in 1 or 2 cycles, but accessing the L2 or L3 cache costs tens to hundreds of cycles, and going to RAM costs thousands. However, all that latency is avoided when you access your data sequentially -- because the processor will pre-load the cache with the data following the first byte you ask for.
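To see the effect in isolation, here's a minimal, self-contained sketch (the array size and stride are made up, and the timings will vary by machine): it sums the same array twice, once sequentially and once in a large-stride order that defeats the prefetcher. The strided pass does exactly the same additions, yet typically runs several times slower:

    program CacheDemo;
    {$APPTYPE CONSOLE}
    uses Windows;

    const
      N = 16 * 1024 * 1024;  // 16M integers (~64 MB): far bigger than any cache
      Stride = 4096;         // jump 16 KB between accesses, defeating the prefetcher

    var
      Data: array of Integer;
      I, J: Integer;
      Sum: Int64;
      Freq, T0, T1: Int64;

    begin
      SetLength(Data, N);
      for I := 0 to N - 1 do
        Data[I] := I;
      QueryPerformanceFrequency(Freq);

      // Sequential pass: the CPU streams in each cache line just before it's needed.
      Sum := 0;
      QueryPerformanceCounter(T0);
      for I := 0 to N - 1 do
        Inc(Sum, Data[I]);
      QueryPerformanceCounter(T1);
      WriteLn('sequential: ', (T1 - T0) / Freq * 1000:0:1, ' ms (sum=', Sum, ')');

      // Strided pass: the same N additions, but nearly every access misses cache.
      Sum := 0;
      QueryPerformanceCounter(T0);
      for J := 0 to Stride - 1 do
      begin
        I := J;
        while I < N do
        begin
          Inc(Sum, Data[I]);
          Inc(I, Stride);
        end;
      end;
      QueryPerformanceCounter(T1);
      WriteLn('strided:    ', (T1 - T0) / Freq * 1000:0:1, ' ms (sum=', Sum, ')');
    end.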
Net result is that I rewrote the algorithm to access the data more sequentially, and got a 10x speedup, which was good enough. When I have the time, I'm certain I can get another 10x out of it. But that's just the Geek in me. Good Enough is good enough.
I already knew that you get the most bang by optimizing your algorithm, not your code. I thought I only needed the profiler to tell me what needed optimizing. But I also needed it to find the reason for the bottleneck so I could design a faster algorithm.
The new algorithm isn't radically different from the old; it just stores the data so that it can be accessed sequentially. For example, in one place I moved a field from an array of records into its own array of integers, because the inner loop didn't need the rest of the data in each record (see the sketch below). I also had a rectangular matrix stored as a dynamic array of dynamic arrays. The code used this to randomly access megabytes of data (and the poor L1 data cache is only 64 KB). I figured out how to store it in a linear array as the diagonals of the matrix, which is the order in which I use the data. (OK, maybe that part is radical.)
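A sketch of that record-splitting refactor, with made-up field names rather than the original code. The point is that the hot loop then strides through tightly packed 4-byte integers instead of dragging whole records through the cache:

    type
      TNode = record
        Name: string[32];
        Payload: array[0..15] of Byte;
        Weight: Integer;            // the only field the inner loop needs
      end;

    var
      Nodes: array of TNode;        // before: Weight buried among cold fields
      Weights: array of Integer;    // after: the hot field packed contiguously

    function TotalBefore: Int64;
    var
      I: Integer;
    begin
      Result := 0;
      for I := 0 to High(Nodes) do
        Inc(Result, Nodes[I].Weight);  // each access drags a whole record's cache lines in
    end;

    function TotalAfter: Int64;
    var
      I: Integer;
    begin
      Result := 0;
      for I := 0 to High(Weights) do
        Inc(Result, Weights[I]);       // 16 useful integers per 64-byte cache line
    end;

The diagonal layout for the matrix is the same idea one level up: put the elements in memory in the exact order the inner loop will visit them.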
Anyway, I'm sold on VTune.
I have used http://www.prodelphi.de with success on a Delphi 7 project in the past. Cheap, and it works. Don't let the bush-league web site scare you off.