String.Substring() seems to bottleneck this code
Introduction
I have a favorite algorithm that I wrote quite some time ago and that I keep rewriting in new programming languages, platforms, etc. as a sort of benchmark. Although my main programming language is C#, I quite literally copy-pasted the code, changed the syntax slightly, built it in Java, and found that it runs 1000x faster.
The Code
There is quite a bit of code, but I'm only going to present this snippet, which seems to be the main issue:
    for (int i = 0; i <= s1.Length; i++)
    {
        for (int j = i + 1; j <= s1.Length - i; j++)
        {
            string _s1 = s1.Substring(i, j);
            if (tree.hasLeaf(_s1))
            ...
The Data
It is important to point out that the string s1 in this particular test is 1 million characters long (1 MB).
Measurements
I profiled my code in Visual Studio because I thought the way I construct or traverse my tree wasn't optimal. After examining the results, it appears that the line string _s1 = s1.Substring(i, j); accounts for over 90% of the execution time!
Additional Observations
Another difference I've noticed is that, although my code is single-threaded, Java manages to execute it using all 8 cores (100% CPU utilization), while even with Parallel.For() and multithreading techniques my C# code utilizes 35-40% at most. Since the algorithm scales linearly with the number of cores (and frequency), I have compensated for this, and the Java snippet still executes 100-1000x faster.
Reasoning
I presume this happens because strings in C# are immutable, so String.Substring() has to create a copy. Since the call sits inside a nested for loop with many iterations, I presume a lot of copying and garbage collecting is going on. However, I don't know how Substring is implemented in Java.
Question
What are my options at this point? There is no way around the number and length of substrings (this is already optimized maximally). Is there a method that I don't know of (or data structure perhaps) that could solve this issue for me?
Requested Minimal Implementation (from comments)
I have left out the implementation of the suffix tree, which is O(n) in construction and O(log(n)) in traversal.
    public static double compute(string s1, string s2)
    {
        double score = 0.00;
        suffixTree stree = new suffixTree(s2);
        for (int i = 0; i <= s1.Length; i++)
        {
            int longest = 0;
            for (int j = i + 1; j <= s1.Length - i; j++)
            {
                string _s1 = s1.Substring(i, j);
                if (stree.has(_s1))
                {
                    score += j - i;
                    longest = j - i;
                }
                else break;
            }
            i += longest;
        }
        return score;
    }
Screenshot snippet of the profiler
Note that this was tested with a string s1 of 300,000 characters. For some reason, 1 million characters just never finishes in C#, while in Java it takes only 0.75 seconds. The memory consumed and the number of garbage collections don't seem to indicate a memory issue. The peak was about 400 MB, but considering the huge suffix tree this appears to be normal. No weird garbage collection patterns were spotted either.
Solution 1:
Issue Origin
After having a glorious battle that lasted two days and three nights (and amazing ideas and thoughts from the comments) I've finally managed to fix this issue!
I'd like to post an answer for anybody running into similar issues, where string.Substring(i, j) is not an acceptable way to get a substring. That can happen because the string is too large and you can't afford the copying done by string.Substring(i, j) (it has to make a copy because C# strings are immutable, no way around it), or because string.Substring(i, j) is called a huge number of times over the same string (like in my nested for loops), giving the garbage collector a hard time, or, as in my case, both!
Attempts
I tried many of the suggested approaches, such as StringBuilder, streams, unmanaged memory allocation using IntPtr and Marshal within an unsafe{} block, and even creating an IEnumerable that yield returns the characters by reference within the given positions. All of these attempts ultimately failed because some form of joining of the data had to be done, as there was no easy way for me to traverse my tree character by character without jeopardizing performance. If only there were a way to span over multiple memory addresses within an array at once, like you can in C++ with some pointer arithmetic... except there is.
(credits to @Ivan Stoev's comment)
The Solution
The solution was to use System.ReadOnlySpan<T> (it couldn't be System.Span<T> because strings are immutable), which, among other things, lets us read a sub-range of an existing array's memory without creating a copy.
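To make the difference concrete, here is a minimal sketch contrasting a copying Substring call with a non-copying AsSpan slice. The class name SpanSliceDemo and its methods are hypothetical names chosen for illustration; only Substring, AsSpan, SequenceEqual and ToString are real library calls (the span ones come from System.Memory).

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class SpanSliceDemo
{
    // Copying approach: Substring allocates a new string of 'length' characters.
    public static string CopySlice(string text, int start, int length)
    {
        return text.Substring(start, length);
    }

    // Non-copying approach: the span is only a view over the characters of 'text'.
    // True when text[start .. start + length) equals 'expected'.
    public static bool SliceEquals(string text, int start, int length, string expected)
    {
        ReadOnlySpan<char> slice = text.AsSpan(start, length);
        return slice.SequenceEqual(expected.AsSpan());
    }

    // Materializes the view into a string. This does allocate, so it is
    // only meant for debugging or for the final result, not for hot loops.
    public static string SliceToString(string text, int start, int length)
    {
        return text.AsSpan(start, length).ToString();
    }
}
```

For example, SpanSliceDemo.SliceEquals("hello world", 6, 5, "world") returns true without creating a new string. Note that AsSpan(start, length), like Substring(start, length), takes a length as its second argument, whereas Java's substring(beginIndex, endIndex) takes an end index, which is worth double-checking when porting loops between the two languages.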
This piece of the code posted:
    string _s1 = s1.Substring(i, j);
    if (stree.has(_s1))
    {
        score += j - i;
        longest = j - i;
    }
Was changed to the following:
    if (stree.has(i, j))
    {
        score += j - i;
        longest = j - i;
    }
Where stree.has() now takes two integers (position and length of the substring) and does:
    ReadOnlySpan<char> substr = s1.AsSpan(i, j);
Notice that the substr variable is literally a reference to a subset of the characters of the initial s1 array and not a copy! (The s1 variable had been made accessible from this function.)
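Since the suffix tree itself was not posted, the following is only a sketch of what a span-based lookup could look like. NaiveSuffixTrie is a hypothetical stand-in (a plain suffix trie with O(n^2) construction, suitable for small inputs only), not the O(n) tree used above; the point is just that Has can walk the structure character by character straight from the span.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the omitted suffix tree: a naive suffix trie.
// Construction is O(n^2) in time and memory, so use it on small inputs only.
public sealed class NaiveSuffixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
    }

    private readonly Node _root = new Node();

    public NaiveSuffixTrie(string text)
    {
        // Insert every suffix of 'text', one character at a time.
        for (int start = 0; start < text.Length; start++)
        {
            Node node = _root;
            for (int k = start; k < text.Length; k++)
            {
                if (!node.Children.TryGetValue(text[k], out Node next))
                {
                    next = new Node();
                    node.Children.Add(text[k], next);
                }
                node = next;
            }
        }
    }

    // True when 'query' occurs somewhere in the indexed text.
    // The span is read in place; no substring is allocated.
    public bool Has(ReadOnlySpan<char> query)
    {
        Node node = _root;
        foreach (char c in query)
        {
            if (!node.Children.TryGetValue(c, out node))
            {
                return false;
            }
        }
        return true;
    }

    // Convenience overload taking a position and a length, as described above.
    public bool Has(string source, int start, int length)
    {
        return Has(source.AsSpan(start, length));
    }
}
```

With a trie built over "banana", Has("bandana", 0, 3) checks "ban" and returns true, while Has("bandana", 0, 4) checks "band" and returns false; neither call allocates a new string.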
Note that at the time of writing I am using C# 7.2 and .NET Framework 4.6.1, which means that to get the Span feature I had to go to Project > Manage NuGet Packages, tick the "Include prerelease" checkbox, then browse for System.Memory and install it.
Re-running the initial test (on strings of 1 million characters, i.e. 1 MB), the run time dropped from 2+ minutes (I gave up waiting after 2 minutes) to ~86 milliseconds!
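For anyone who wants to compare the two approaches on their own machine, here is a rough sketch of a timing harness. SliceBenchmark and its methods are hypothetical names, the workload (summing the last character of every fixed-size window) is an arbitrary stand-in for the tree lookup, and the timings will vary with hardware and runtime, so no expected numbers are given.

```csharp
using System;
using System.Diagnostics;

// Hypothetical micro-benchmark, for illustration only.
public static class SliceBenchmark
{
    // Sums the last character of every window, allocating one string per window.
    // Assumes window >= 1.
    public static long SumWithSubstring(string text, int window)
    {
        long sum = 0;
        for (int i = 0; i + window <= text.Length; i++)
        {
            string piece = text.Substring(i, window);
            sum += piece[piece.Length - 1];
        }
        return sum;
    }

    // Same computation, but each window is a view over 'text' (no allocation).
    // Assumes window >= 1.
    public static long SumWithSpan(string text, int window)
    {
        long sum = 0;
        for (int i = 0; i + window <= text.Length; i++)
        {
            ReadOnlySpan<char> piece = text.AsSpan(i, window);
            sum += piece[piece.Length - 1];
        }
        return sum;
    }

    // Runs 'work' once and returns the elapsed wall-clock time.
    public static TimeSpan Time(Func<long> work)
    {
        Stopwatch watch = Stopwatch.StartNew();
        work();
        watch.Stop();
        return watch.Elapsed;
    }
}
```

A typical use would be to build a long test string, then compare SliceBenchmark.Time(() => SliceBenchmark.SumWithSubstring(text, 1000)) against SliceBenchmark.Time(() => SliceBenchmark.SumWithSpan(text, 1000)); both methods return the same sum, so the results can be checked against each other.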