## Superstrings

Just decided to revive this blog after a long hibernation! First of all, congratulations to the team of Ain Shams for ranking 64th in their first participation in the ACM ICPC World Finals this year. AFAIK, this year’s world finals was supposed to be in Sharm El-Sheikh in Egypt, but because of the revolution the contest was moved to the United States.

This post is about the solution to one of the three problems that I wrote for the local ACM ICPC at Ain Shams this year. The problem statement can be summarized as follows: Given an alphabet A, a string S that is constructed from the alphabet A, and an integer value L, calculate N[A,L,S], which is the number of possible strings of length L that can be constructed from the alphabet A and are superstrings of S. To make the problem easier, it is assumed that L and |A| are small integers, and the string S consists of distinct characters (i.e., no character exists twice within S). The complete problem statement can be found here.

The problem can be solved using dynamic programming as follows. First note that, given a superstring S’, S may occur multiple times within S’, but the occurrences of S within S’ cannot overlap with each other, since overlapping would require the existence of a sequence of characters that is both a prefix and a postfix of S, which is impossible because we assumed that S consists of distinct characters.

Let S* be the left-most occurrence of S within S’. The start position of S* within S’ (i.e., the position of the first character of S* within S’) can be any index from 1 through L-len(S)+1. For example, if S=xyz and L=6, there are 4 possible start positions of S*; either 1, 2, 3, or 4.

If L-len(S)+1 < 1, then the solution to the problem is N[A,L,S]=0. Otherwise, let N[A,L,S,i] denote the number of possible superstrings of S in which the start position of S* is i. The value of N[A,L,S] is just the summation of N[A,L,S,i] for all i.

To compute N[A,L,S,i] let us divide S’ into three parts: a prefix that comprises the characters before S*, S* itself, and finally a postfix that comprises the characters after S*. Given that the start position of S* is i, the length of the prefix is i-1, and the length of the postfix is L-len(S)-i+1 characters.

Let P_i be the number of possible prefixes, and Q_i be the number of possible postfixes, given that S* starts at position i. Then N[A,L,S,i] = P_i · Q_i. Using simple counting, the value of Q_i can be computed as Q_i = |A|^(L-len(S)-i+1), where |A| is the size of the alphabet. To compute P_i, note that S* by definition is the first occurrence of S within S’, thus a prefix should never contain S as a substring. Consequently, P_i = |A|^(i-1) - N[A,i-1,S]. That is, the prefixes are all possible strings of length i-1 excluding the superstrings of S. In conclusion,

N[A,L,S] = Σ_{i=1}^{L-len(S)+1} ( |A|^(i-1) - N[A,i-1,S] ) · |A|^(L-len(S)-i+1)

The last equation defines N[A,L,S] recursively as a function in N[A,i-1,S]. Obviously, i-1 is always less than L. Hence, the problem can be solved using dynamic programming by saving the values of N[A,L,S], for all L, in a table indexed by L.
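The recurrence can be sketched directly in code. Since S consists of distinct characters, only the alphabet size a and the pattern length m = len(S) matter; the function name and signature below are my own, a minimal sketch rather than the contest's reference solution:

```python
def count_superstrings(a, m, L):
    """Number of strings of length L over an alphabet of size a that
    contain a fixed pattern of m distinct characters as a substring."""
    # N[l] = number of length-l strings that contain the pattern
    N = [0] * (L + 1)
    for l in range(m, L + 1):
        total = 0
        for i in range(1, l - m + 2):            # start of the first occurrence
            prefixes = a ** (i - 1) - N[i - 1]   # prefixes that avoid the pattern
            postfixes = a ** (l - m - i + 1)     # postfix is unconstrained
            total += prefixes * postfixes
        N[l] = total
    return N[L]
```

For example, over the alphabet {x, y} there are 3 strings of length 2 containing "x" (xx, xy, yx), and 4 strings of length 3 containing "xy" (xxy, xyx, xyy, yxy), which the sketch reproduces.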

P.S. I wrote this post while sitting in Istanbul’s international airport, waiting for a flight. Twelve hours of transit is really too much!

## Maximum number of crossed matchings between two sequences

Back to OJ problems …

**Problem**

Given two sequences of integers, A = a_1 a_2 … a_m and B = b_1 b_2 … b_n, we are allowed to *match* an integer a_i from the first sequence with an integer b_j from the second sequence only if a_i = b_j. No integer in either sequence can be matched with more than one integer in the other sequence.

Each matching must *cross* with one and only one *different* matching. Two matchings (a_{i1}, b_{j1}) and (a_{i2}, b_{j2}) are considered *crossed* iff i1 < i2 and j1 > j2, or if i1 > i2 and j1 < j2. The two matchings are considered *different* iff a_{i1} ≠ a_{i2}.

We are required to know the maximum number of crossed matchings that can be done. For example, in the following figure, the maximum is 4.

**Solution**

The key to solve this problem is to notice its similarity to the well-known longest common subsequence (LCS) problem. However, while the LCS problem asks for the maximum number of uncrossed matchings, this problem asks for the maximum number of crossed matchings.

Let A_i be a sequence that contains the first i elements of A in order. Similarly, B_j is a sequence that contains the first j elements of B in order. Let F(i, j) be the maximum number of crossed matchings between the two sequences A_i and B_j. We define the predicate C(i, j) to be true if and only if there is a crossed matching between A_i and B_j that involves the last element of A_i and the last element of B_j; that is, a_i ≠ b_j, b_j occurs somewhere in A_{i-1}, and a_i occurs somewhere in B_{j-1}.

The importance of this predicate is that it determines whether adding/dropping the last element in A_i or B_j will add/drop a crossed matching. If the predicate C(i, j) is satisfied, then we are interested to know which elements are involved in this boundary crossed matching. Let x denote the index of the last occurrence of b_j in A_i; that is, the largest x with a_x = b_j. Similarly, y is the index of the last occurrence of a_i in B_j. If we are going to count this boundary crossed matching in our maximal set of crossed matchings, then we will have to uncount any other crossed matchings that involve elements between a_x and a_i inclusive, or elements between b_y and b_j inclusive. This effectively means going back to F(x-1, y-1). Therefore we shouldn’t count a boundary crossed matching unless no other crossed matchings will be uncounted for this purpose. We define F recursively as follows:

**Function** F(i, j)

**If** i = 0 **or** j = 0 **Then Return** 0

**If** C(i, j) **Then Return** max( F(i-1, j), F(i, j-1), F(x-1, y-1) + 2 )

**Return** max( F(i-1, j), F(i, j-1) )

which is quite similar to the recursive solution of the LCS problem. The problem can then be solved efficiently using dynamic programming.
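The recursion translates into a short memoized sketch (the function names are my own; the last-occurrence indices x and y are found by a linear scan for clarity rather than speed):

```python
from functools import lru_cache

def max_crossed_matchings(A, B):
    """Maximum number of matchings where every matching crosses exactly
    one different matching, per the recursion above (1-indexed on paper,
    0-indexed lists here)."""
    @lru_cache(maxsize=None)
    def F(i, j):
        if i == 0 or j == 0:
            return 0
        best = max(F(i - 1, j), F(i, j - 1))
        if A[i - 1] != B[j - 1]:
            # x: last index (1-based) in A_i equal to b_j; 0 if absent
            x = next((k for k in range(i - 1, 0, -1) if A[k - 1] == B[j - 1]), 0)
            # y: last index (1-based) in B_j equal to a_i; 0 if absent
            y = next((k for k in range(j - 1, 0, -1) if B[k - 1] == A[i - 1]), 0)
            if x and y:
                # count the boundary crossed matching (two matchings)
                best = max(best, F(x - 1, y - 1) + 2)
        return best
    return F(len(A), len(B))
```

For instance, A = [1, 2, 1, 2] against B = [2, 1, 2, 1] admits two disjoint crossed pairs, giving 4.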

## Two problems from the local ACM ICPC in FCIS Ain Shams

It has been a long time since I last posted here. Since this is the Eid and I have enough time to waste, it seems appropriate to add one more post. This post contains my solutions to the two problems that I wrote for the local ACM ICPC contest of FCIS Ain Shams University that was held in October 2009.

The first problem is about PQ trees. It is a simple counting problem where the input is a PQ tree, written in parenthesized form, and the output is the number of possible permutations that the tree represents. The problem statement can be accessed here. The solution is as follows. To parse the input, a simple recursive descent parser is sufficient. The number of permutations c is initialized to 1. For each P node p, the number of permutations is multiplied by the factorial of the number of children of p; that is, c ← c · children(p)!. And for each Q node q, if children(q) > 1, then the number of permutations is doubled; that is, c ← 2c. Otherwise, if children(q) = 1, then c is not affected.
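A possible sketch of the counting, assuming a hypothetical input format in which P nodes are written with parentheses, Q nodes with square brackets, and leaves as single characters (the actual problem's format may differ):

```python
from math import factorial

def pq_permutations(tree):
    """Count the permutations represented by a PQ tree given in a
    parenthesized form (assumed: P nodes as (...), Q nodes as [...],
    leaves as single characters)."""
    pos = 0
    count = 1

    def parse():
        nonlocal pos, count
        ch = tree[pos]
        if ch in "([":
            close = ")" if ch == "(" else "]"
            pos += 1
            children = 0
            while tree[pos] != close:
                parse()
                children += 1
            pos += 1                      # skip the closing bracket
            if ch == "(":                 # P node: children permute freely
                count *= factorial(children)
            elif children > 1:            # Q node: order can only be reversed
                count *= 2
        else:                             # leaf
            pos += 1

    parse()
    return count
```

A P node with three leaves, "(abc)", yields 3! = 6 permutations, while the Q node "[abc]" yields only 2 (the order and its reversal).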

The second problem is about infinite geometric series. The problem statement can be accessed here. The problem can be summarized as follows. We have an amount of data that is initially zero. Each day D kilobytes are added. All the data are compressed at the end of each day by a compression ratio R > 1, where the compression ratio is the size before compression divided by the size after compression. Note that the data are compressed multiple times: the data that are i days old have been compressed i times. Given D and R, the problem asks for the minimum storage capacity needed so that we may never run out of storage; in other words, the maximum size that our data might reach. The solution of the problem is as follows. After n days we have n chunks of data added, each originally D kilobytes. The chunk of data that was added i days ago has been compressed i times and thus its size has been divided by R for i times. Therefore, the amount of data that we have after the nth day is:

T(n) = D/R + D/R² + … + D/Rⁿ = D · Σ_{i=1}^{n} (1/R)^i

Since we seek the maximum amount of data that we will ever have, we compute the amount of data after an infinite number of days; that is, the limit of T(n) as n tends to infinity:

lim_{n→∞} T(n) = D · Σ_{i=1}^{∞} (1/R)^i = D · (1/R) / (1 − 1/R) = D / (R − 1)
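As a sanity check, the partial sums converge quickly to the closed form (function names here are my own):

```python
def total_data(d, r, days):
    """Amount of data after `days` days: sum_{i=1..days} d / r**i."""
    return sum(d / r ** i for i in range(1, days + 1))

def capacity_limit(d, r):
    """Closed form of the limit as the number of days tends to infinity."""
    return d / (r - 1)
```

With D = 10 and R = 2, the data never exceed 10 kilobytes after compression, matching 10 / (2 − 1).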

## The number of trailing zeros in the factorial of a large integer

**Problem**

Given a (large) integer K, find the number of zeros on the right of the factorial of K. For example, 15! = 1,307,674,368,000, so the number of zeros on the right of the factorial (let’s denote it Z(1,307,674,368,000)) equals 3.

The input integer K can be as large as 1,000,000,000.

**Solution**

Given an integer N, let the prime factorization of N be N = 2^x · 5^y · p₁^{e₁} · p₂^{e₂} · … · p_k^{e_k}, where the p's are primes other than 2 and 5.

The number of zeros on the right of N, that is Z(N), is simply the number of times N can be divided by 10 without getting a fraction. It is obvious that Z(N) = min(x, y).

Since N, in our case, is the factorial of another integer, K, the prime factorization of N is just the product of the prime factorizations of all the integers from 1 to K inclusive. Writing each integer i as i = 2^{x_i} · 5^{y_i} · (other primes), this can be written as

K! = Π_{i=1}^{K} i = 2^{Σ_{i=1}^{K} x_i} · 5^{Σ_{i=1}^{K} y_i} · (other primes)

So, x = Σ_{i=1}^{K} x_i and y = Σ_{i=1}^{K} y_i.

Consequently, Z(K!) = min( Σ_{i=1}^{K} x_i, Σ_{i=1}^{K} y_i ).

In other words, that is the minimum between the sum of the number of times 2 has appeared as a factor over all integers in the range 1 through K and the sum of the number of times 5 has appeared as a factor over the same range. Obviously the sum for 5 is smaller, because a factor of 5 appears once every 5 integers, versus once every 2 integers in the case of 2. So actually we are interested in Z(K!) = Σ_{i=1}^{K} y_i.

We might go through all the integers in the range 1 to K and find, for each integer in that range, how many times 5 appears as a factor. But since we are dealing with large integers, that won’t be feasible. Alternatively, we may solve it as follows: the integer division K/5 gives us the number of integers in the range 1 through K for which 5 is a factor at least once, K/25 gives the number of integers in the same range for which 5 is a factor at least twice, … and so on.

Consequently, Z(K!) = K/5 + K/25 + K/125 + …, where the division is integer division.
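The formula translates directly into a short loop (a minimal sketch):

```python
def trailing_zeros_of_factorial(k):
    """Z(K!) = K/5 + K/25 + K/125 + ... using integer division."""
    count = 0
    power = 5
    while power <= k:
        count += k // power
        power *= 5
    return count
```

For example, Z(15!) = 15/5 = 3 and Z(100!) = 100/5 + 100/25 = 24.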

The formula is logarithmic in the value of K, so it is linear in the number of bits. That is efficient enough to handle the required input.

## Second shortest path

**Problem**

Given a simple weighted graph (having no loops and no multi-edges), we want to find for every two vertices a path between those vertices whose length is strictly greater than the length of the shortest path between them but still less than or equal to the length of any other path between them. Loops are allowed in our target path.

**Solution**

First use Floyd-Warshall to obtain shortest paths between all pairs. Let D[u][v] denote the length of the shortest path between the vertices u and v. The shortest path between a vertex and itself is zero. If there is no path between u and v then D[u][v] = ∞.

Now use the following algorithm to obtain all-pairs second shortest paths. We will denote the length of the second shortest path between a pair of vertices u and v as D2[u][v]. Initially D2[u][v] = ∞ for any pair of vertices u and v.

**For each** edge (a, b) **in** E

**For each** pair of vertices (u, v) **in** V × V

**Let** the path p **be** the concatenation: the shortest path from u to a, the edge (a, b), the shortest path from b to v

**If** D[u][v] < length(p) < D2[u][v] **Then** D2[u][v] ← length(p)

The rationale behind that algorithm is as follows: moving from a source vertex to a target vertex, if - at each vertex along our path - we move from our current vertex to an adjacent vertex that lies on the shortest path between the current vertex and the target vertex, we will eventually reach the target vertex through the shortest path. If, at one step, we make a mistake by moving to an adjacent vertex that doesn’t lie on the shortest path between the current vertex and the target vertex, we won’t reach our target vertex through the shortest path. We need to know which wrong step will make us arrive through a path that is worse than the shortest path but still better than any other possible path. We also know that we have to make no more than one mistake, since two mistakes would give us a path that is even worse than the second shortest path. So, for each pair of vertices u and v, we try to find all paths between u and v that are composed of a sequence of right steps (the shortest path between u and some reachable vertex a), followed by an arbitrary step that could be our mistake (the edge (a, b)), then another sequence of right steps (the shortest path between b and v). We search all such possible paths for a path that is longer than the shortest path (obtained through Floyd-Warshall) but still shorter than the rest.
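The procedure (Floyd-Warshall followed by the edge loop, with each undirected edge tried in both directions) can be sketched as follows; the representation with 0-based vertices and a list of weighted edges is my own:

```python
INF = float("inf")

def second_shortest_paths(n, edges):
    """All-pairs shortest (D) and second shortest (D2) path lengths.
    edges: list of (u, v, w) for an undirected simple graph on 0..n-1."""
    D = [[INF] * n for _ in range(n)]
    for u in range(n):
        D[u][u] = 0
    for u, v, w in edges:
        D[u][v] = D[v][u] = w
    for k in range(n):                      # Floyd-Warshall
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    D2 = [[INF] * n for _ in range(n)]
    for a, b, w in edges:                   # the single "mistake" edge
        for u in range(n):
            for v in range(n):
                for x, y in ((a, b), (b, a)):
                    length = D[u][x] + w + D[y][v]
                    if D[u][v] < length < D2[u][v]:
                        D2[u][v] = length
    return D, D2
```

On the triangle 0-1 (weight 1), 1-2 (weight 1), 0-2 (weight 3), the shortest path from 0 to 2 has length 2 and the second shortest is the direct edge of length 3.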

## Minimum-maximum integral rectangle partitioning

**Problem:**

We are given a rectangle with integral dimensions w × h. We need to partition this rectangle into k “sub-rectangles” with integral dimensions (not necessarily equal) such that the area of the largest sub-rectangle is minimized. Partitioning into sub-rectangles is achieved by splitting the rectangle either vertically or horizontally to get two smaller rectangles, then splitting the smaller rectangles to get even smaller rectangles, and so on.

**Solution:**

The problem looks NP-hard at first glance, though I didn’t try to prove that formally. For small input values, the problem lends itself directly to a pseudo-polynomial dynamic programming solution. Let F(w, h, k) be the minimum maximum integral subrectangle area that results from partitioning a rectangle of dimensions w × h into k subrectangles. Let V(w, h, k) and H(w, h, k), the best values achievable when the first cut is vertical or horizontal respectively, be defined as follows:

V(w, h, k) = min over 1 ≤ x < w and 1 ≤ j < k of max( F(x, h, j), F(w − x, h, k − j) )

H(w, h, k) = min over 1 ≤ y < h and 1 ≤ j < k of max( F(w, y, j), F(w, h − y, k − j) )

Then, for k > 1, the following recursive formula holds:

F(w, h, k) = min( V(w, h, k), H(w, h, k) )

And the base case of the recursion is F(w, h, 1) = w · h.

This is a recursive formula where the value of F given the arguments (w, h, k) is a function in the values of F given smaller arguments; and thus comes dynamic programming.

The recursive formula can be explained as follows: we need to perform k − 1 successive cuts on the given rectangular area to get k subrectangles. For a w × h rectangular area, we have (w − 1) + (h − 1) possible first cuts to be performed: one for each integral step along the width and one for each integral step along the height. If the first cut was done vertically at position x, then we get two sub-rectangles: R1 with dimensions x × h and R2 with dimensions (w − x) × h. Similarly, a horizontal cut at position y will give us two sub-rectangles w × y and w × (h − y). In either case, the required k subrectangles will be divided among R1 and R2. That is, for some j, R1 will be partitioned into j subrectangles while R2 will be partitioned into the remaining k − j subrectangles. So we divided the original problem into two subproblems with smaller inputs. The process will go on recursively as long as the value of k is greater than 1. Otherwise, if k = 1, the only possible partition is the original rectangle itself.

We still need to decide whether to cut vertically or horizontally and, in either case, we need to pick the values of x and j or the values of y and j so as to minimize the maximum subrectangle area we might get. That is achieved by checking all possible values of x, y, and j before deciding where to cut and how to divide k. Dynamic programming makes that feasible as the range of the input is relatively small.

Note that we can still get a more optimized recursive formula by utilizing the symmetric nature of rectangles, but the formula mentioned above is sufficient to get the solution accepted on the online judge.
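A memoized sketch of the recurrence (without the symmetry optimization; an impossible partition, with more pieces than unit cells, evaluates to infinity):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def F(w, h, k):
    """Minimum possible area of the largest sub-rectangle when a w x h
    rectangle is guillotine-cut into k integral sub-rectangles."""
    if k == 1:
        return w * h
    if k > w * h:
        return float("inf")            # cannot cut into more pieces than cells
    best = float("inf")
    for j in range(1, k):              # pieces assigned to the first half
        for x in range(1, w):          # vertical first cut at position x
            best = min(best, max(F(x, h, j), F(w - x, h, k - j)))
        for y in range(1, h):          # horizontal first cut at position y
            best = min(best, max(F(w, y, j), F(w, h - y, k - j)))
    return best
```

For example, a 4 × 4 rectangle split into 4 pieces gives four 2 × 2 subrectangles, so the minimized maximum area is 4.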

## Analyzing a random search in a tree

**Problem:**

This is a 500-points problem from TC SRM440 Div1. It can be summarized as follows:

We are given a matrix. Some cells are marked closed and others are marked open. We can only move from one open cell to an adjacent open cell. The matrix is defined such that there is one and only one way from one open cell to another.

Our target is to reach a specific open cell, let’s call it the target cell. Our starting point is uniformly random (can be the target cell itself). Moving from one cell to another takes one second. Our motion is also uniformly random; that is, whenever we are in an open cell, the probability to move from that cell to any of the adjacent open cells is the same for all adjacent open cells (but we must move, we can’t stand still).

It is required to calculate the expected time to reach the target cell.

**Solution:**

Since the starting cell is uniformly random, we will calculate the expected time to reach the target cell for each possible starting cell; then the required overall expected time is just the average of all those expected times.

Let the open cells be c₁, c₂, …, c_n, where c_n is the target cell. For the special case where c_n is the starting cell, the expected time equals zero. We need to calculate the expected time for the rest.

For any cell c_i, where i ≠ n, let the expected time to reach c_n be E_i. If the degree (the number of adjacent open cells) of c_i is d_i, then the probability of going to any of the adjacent cells is 1/d_i. If we are in cell c_i, and we chose to go to the adjacent cell c_j, then the expected time to reach c_n will be one second (for moving from c_i to c_j) plus E_j (the expected time to reach c_n from c_j). We end up with the following formula for E_i:

E_i = Σ_{c_j adjacent to c_i} (1/d_i) · (1 + E_j) = 1 + (1/d_i) · Σ_{c_j adjacent to c_i} E_j,  with E_n = 0

This is a system of linear equations that can be solved with any appropriate method, for example Gauss-Jordan elimination.
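A sketch that builds the linear system and solves it with Gauss-Jordan elimination, given the open cells as adjacency lists of the tree (this representation, and the function name, are my own):

```python
def expected_search_time(adj, target):
    """adj: adjacency lists of the open cells 0..n-1 (a tree).
    Returns the average, over a uniformly random starting cell, of the
    expected time to reach `target` under a uniformly random walk."""
    n = len(adj)
    # Build A * E = b encoding  E_i - (1/d_i) * sum_j E_j = 1,  E_target = 0
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for i in range(n):
        A[i][i] = 1.0
        if i != target:
            for j in adj[i]:
                A[i][j] -= 1.0 / len(adj[i])
            b[i] = 1.0
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        f = A[col][col]
        A[col] = [v / f for v in A[col]]
        b[col] /= f
        for r in range(n):
            if r != col and A[r][col]:
                factor = A[r][col]
                A[r] = [rv - factor * cv for rv, cv in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    return sum(b) / n        # b now holds the solution E_0..E_{n-1}
```

On a path of three open cells with the target at one end, the expected times are 4, 3, and 0, so the answer is their average, 7/3.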

**Comments:**

The interesting thing with this problem is that it is asking for an algorithm to analyze the performance of a heuristic! We can think of this matrix as an undirected graph, where open cells of the matrix are vertices of the graph, and adjacent open cells correspond to adjacent vertices. Since there is one and only one way between any two open cells then it is actually a tree, and the algorithm is just analyzing the performance of a random search in a tree.