Superstrings
Just decided to revive this blog after long hibernation! First of all, congratulations to the team of Ain Shams for ranking 64 in their first participation in the ACM ICPC World Finals this year. AFAIK, this year’s world finals was supposed to be in Sharm El-Sheikh in Egypt, but because of the revolution the contest was moved to the United States.
This post is about the solution to one of the three problems that I wrote for the local ACM ICPC at Ain Shams this year. The problem statement can be summarized as follows: Given an alphabet A, a string S that is constructed from the alphabet A, and an integer value L, calculate N[A,L,S], which is the number of possible strings of length L that can be constructed from the alphabet A and are superstrings of S. To make the problem easier, it is assumed that L and |A| are small integers, and the string S consists of distinct characters (i.e., no character exists twice within S). The complete problem statement can be found here.
The problem can be solved using dynamic programming as follows. First note that, given a superstring S’, S may exist multiple times within S’, but all the existences of S within S’ can’t overlap with each others since overlapping would require the existence of a sequence of characters that is both a prefix and a postfix of S, which is impossible because we assumed that S consists of distinct characters.
Let S* be the left-most occurrence of S within S’. The start position of S* within S’ (i.e., the position of the first character of S* within S) can be any index from 1 through L-len(S)+1. For example, if S=xyz and L=6, there are 4 possible start positions of S*; either 1, 2, 3, or 4.
If L-len(S)+1 < 1, then the solution to the problem is N[A,L,S]=0. Otherwise, let N[A,L,S,i] denote the number of possible superstrings of S in which the start position of S* is i. The value of N[A,L,S] is just the summation of N[A,L,S,i] for all i.
To compute N[A,L,S,i] let us divide S’ into three parts: a prefix that comprises the characters before S*, S* itself, and finally a postfix that comprises the characters after S*. Given that the start position of S* is i, the length of the prefix is i-1, and the length of the postfix is L-len(S)-i+1 characters.
Letbe the number of possible prefixes, and
be the number of possible postfixes, given that S* starts at position i. Then
. Using simple counting, the value of
can be computed as
, where |A| is the size of the alphabet. To compute
note that S* by definition is the first occurrence of S within S’, thus a prefix should never contain S as a substring. Consequently,
. That is, the set of all possible strings of length i-1 excluding superstrings of S. In conclusion,
The last equation defines N[A,L,S] recursively as a function in N[A,i-1,S]. Obviously, i-1 is always less than L. Hence, the problem can be solved using dynamic programming by saving the values of N[A,L,S], for all L, in a table indexed by L.
P.S. I wrote this post while sitting in Istanbul’s international airport, waiting for a flight. Twelve hours of transit is really too much!
Alexandria University ranked 147 worldwide
This year’s THE ranking was a surprising boost for the Egyptian academia. Alexandria University from Egypt is ranked 147 worldwide. This is higher than many of the European universities that Egyptians go to do their PhDs in. So probably from now on Egyptians should try to get scholarships in Alexandria University!
Times Higher Education (THE) World University Ranking is one of three major university rankings, the other two are ARWU and USNews.
While the ranking took into consideration many factors of evaluation like learning environment and industry income, Alexandria’s one main boost came from research influence as measured by citations.
The only regional competitors that got higher ranks than Alexandria University are University of Cape Town from South Africa, which is ranked 107, and Bilkent University from Turkey, which is ranked 112. Turkey also has another university in the top 200, which is Middle East Technical University, ranked 183. Otherwise, no other Mideastern or African universities showed up in the ranking list.
In 2009, Cairo University was the only Egyptian university that appeared in THE ranking, and was ranked in the range 401-500. That was unsatisfactory compared to regional competitors: 4 Arab universities, 4 Israeli universities, 1 Turkish university, and 1 Iranian university were in top 400.
Then, a final comment that everybody knows but is worth mentioning again: University ranking is not everything. You can never capture the quality of a whole institution in a single number. But it can be a good rough estimate of the current direction in which the institution is heading!
Maximum number of crossed matchings between two sequences
Back to OJ problems …
Problem
Given two sequences of integers,and
, we are allowed to match an integer
from the first sequence with an integer
from the second sequence only if
. No integer in either sequence can be matched with more than one integer in the other sequence.
Each matching must cross with another one and only one different matching. Two matchingsand
are considered crossed iff
and
, or if
and
. The two matchings are considered different iff
.
We are required to know the maximum number of crossed matchings that can be done. For example, in the following figure, the maximum is 4.

Solution
The key to solve this problem is to notice its similarity to the well-known longest common subsequence (LCS) problem. However, while the LCS problem asks for the maximum number of uncrossed matchings, this problem asks for the maximum number of crossed matchings.
Letbe a sequence that contains the first
elements of
in order. Similarly,
is a sequence that contains the first
elements of
in order. Let
be the maximum number of crossed matchings between the two sequences
and
. We define the the predicate
to be true if and only if there is a crossed matching between
and
that involves the last element of
and the last element of
; that is,
The importance of this predicate is that it determines whether adding/dropping the last element inor
will add/drop a crossed matching. If the predicate
is satisfied, then we are interested to know what elements are involved in this boundary crossed matching. Let
denote the index of the last occurrence of
in
; that is,
. Similarly,
is the index of the last occurrence of
in
. If we are going to count this boundary crossed matching in our maximal set of crossed matchings, then we will have to uncount any other crossed matchings that involve elements between
and
inclusive, or elements between
and
inclusive. This effectively means going back to
. Therefore we shouldn’t count a boundary crossed matching unless there are no other crossed matchings that will be uncoutned for this purpose. We define
recursively as follow:
Function
If
or
Then Return
If
Then
Return
which is quite similar to the recursive solution of the LCS problem. The problem can then be solved efficiently using dynamic programming.
First publication, to appear in SIGMOD 2010.
I have just gotten my first paper accepted for publishing in SIGMOD 2010. Here are some learned lessons:
- A paper is evaluated based on originality, significance, quality of presentation, technical depth, quality of experiments and whether the experimental results really support the work. You don’t have to be perfect in all aspects, do the best you can in each of them.
- Make sure your introduction and related work sections clarify the significance and originality of your work. If the reviewer gets beyond those two sections unconvinced of the significance and originality of your paper, then he will never get convinced of that, no matter what comes later! This is a very precious quote from my advisor: “if the reviewer gets beyond the first page without getting convinced, then he will never get convinced”!
- Related to the previous point, the stuff you leave to the end usually doesn’t receive as much attention as that at the beginning, even if it is of better quality!
- Don’t stick yourself behind other researchers’ work, trying to improve their results. Instead, spend as much time as it takes searching for something that wasn’t investigated a lot, then your job will be much easier.
- Don’t read any papers published in low-quality conferences/journals.
- Read and attend lectures about other subjects, even if they are not obviously related to your research, they may inspire you with new ways to approach your research.
- Spend more time thinking and imagining than reading and learning.
- Keep cool and avoid stress!
- There is nothing called “part-time research” !!
Two problems from the local ACM ICPC in FCIS Ain Shams
It has been a long time since I last posted here. Since this is the Eid and I have enough time to waste, it seems appropriate to add one more post. This post contains my solutions to the two problems that I wrote for the local ACM ICPC contest of FCIS Ain Shams University that was held in October 2009.
The first problem is about PQ trees. It is a simple counting problem where the input is a PQ tree, written in parenthesized form, and the output is the number of possible permutations that the tree represents. The problem statement can be accessed here. The solution is as follows. To parse the input, a simple recursive descent parser is sufficient. The number of permutations is initialized by
. For each
node
, the number of permutations
is multiplied by the factorial of the number of children of
; that is,
. And for each
node
, if
, then the number of permutations is doubled; that is,
. Otherwise, if
, then
is not affected.
The second problem is about infinite geometric series. The problem statement can be accessed here. The problem can be summarized as follows. We have an amount of data that is initially zero. Each day kilobytes are added. All the data are compressed at the end of each day by a compression ratio
, where the compression ratio is the size before compression divided by the size after compression. Note that the data are compressed multiple times. The data that are
days old are compressed
times. Given
and
, the problem asks for the the minimum storage capacity needed so that we may never run out of storage; in other words, the maximum size that our data might reach. The solution of the problem is as follows. After
days we have
chunks of data added, each was originally
kilobytes. The chunk of data that was added
days ago has been compressed
times and thus its size has been divided by
for
times. Therefore, the amount of data that we have after the
day is:
Since we seek the maximum amount of data that we will ever have, therefore we compute the amount of data after infinite number of days; that is, the amount of data as the value of tends to infinity:
The number of trailing zeros in the factorial of a large integer
Problem
Given a (large) integer K, find the number of zeros on the right of the factorial of K. For example, factorial 15 = 1,307,674,368,000, so the number of zeros on right of the factorial (let’s denote it Z(1,307,674,368,000) ) equals 3.
The input integer K can be as large as 1000,000,000.
Solution
Given an integer N, let the prime factorization of N be .
The number of zeros on the right of N, that is Z(N), is simply the number of times N can be divided by 10 without getting a fraction. It is obvious that .
Since N, in our case, is the factorial of another integer, K, then the prime factorization of N is just the product of the prime factorizations of all the integers from 1 to K inclusive. This can be written as
So, and
.
Consequently,
In other words, that is the minimum between the sum of the number of times 2 has appeared as a factor for all integers in the range 1 through K versus the sum of the number of times 5 has appeared as a factor for all integers in the same range. Obviously the number of times 5 has appeared is smaller, because it appears every 5 integers, versus 2 in the case of 2. So actually we are interested in .
We might go through all the integers in the range 1 to K and find, for each integer in such range, how many times 5 appears as a factor. But due to the fact that we are dealing with large integers that won’t be feasible. Alternatively, we may solve it as follows: the integer division K/5 gives us the number of integers in the range 1 through K for which 5 is a factor at least once, K/25 gives the number of integers in the same range for which 5 is a factor at least twice, … and so on.
Consequently, , where the division is integer division.
The formula is logarithmic in the value of K, so it is linear in the number of bits. That is efficient enough to handle the required input.
Second shortest path
Problem
Given a simple weighted graph (having no loops and no multi-edges), we want to find for every two vertices a path between those vertices whose length is strictly greater than the length of the shortest path between them but still less than or equal to the length of any other path between them. Loops are allowed in our target path.
Solution
First use Floyd-Warshall to obtain shortest paths between all pairs. Let denote the shortest path between the vertices
and
. The shortest path between a vertex and itself is zero. If there is no path between
and
then
.
Now use the following algorithm to obtain all-pairs second shortest paths. We will denote the second shortest path between a pair of vertices and
as
. Initially
for any pair of vertices
.
For each edge in
For each pair of vertices
in
Let the path
be the concatenation:
,
,
If
The rationale behind that algorithm is as follows: Moving from a source vertex to a target vertex, if -at each vertex along our path- we move from our current vertex to an adjacent vertex that lies on the shortest path between the current vertex and the target vertex, we will eventually reach the target vertex throught shortest path. If, at one step, we made a mistake by moving to an adjacent vertex that doesn’t lie on the shortest path between the current vertex and the target vertex, we won’t reach our target vertex throught the shortest path. We need to know which wrong step will make us arrive through a path that is worse than the shortest path but still better than any other possible path. We also know that we have to make no more than one mistake since two mistakes will give us a path that is even worse than the second shortest path. So, for each pair of vertices and
, we try to find all paths between
and
that is composed of a sequence of right steps (the shortest path between
and some reachable vertex
) followed by an arbitrary step that could be our mistake (the edge
) then another sequence of right steps (the shortest path between
and
). We search all such possible paths for a path that is longer than the shortest path (obtained through Floyd-Warshall) but still shorter than the rest.
Minimum-maximum integral rectangle partitioning
Problem:
We are given a rectangle with integral dimensions . We need to partition this rectangle into
“sub-rectangles” with integral dimensions (not necessarily equal) such that the area of the largest sub-rectangle is minimized. Partitioning into sub-rectangles is achieved by splitting the rectangle either vertically or horizontally to get two smaller rectangles, then splitting the smaller rectangles to get even smaller rectangles … and so on.
Solution:
The problem is NP-Hard by first look, though I didn’t try to prove that formally. For small input values, the problem lends itself directly to a pseudo-polynomial dynamic programming solution. Let be the minimum maximum integral subrectangle area that results from partitioning a rectangle of dimensions
into
subrectangles. Let
and
be defined as follows:
Then, for , the following recursive formula holds:
And the base case of the recursion is .
This is a recursive formula where the value of given the arguments
is a function in the value of
given smaller arguments; and thus comes dynamic programming.
The recursive formula can be explained as follows: we need to cut the given rectangular area successive cuts to get
subrectangles. For a
rectangular area, we have
possible first cuts to be performed: one for each integral step along the width and one for each integral step along the height. If the first cut was done vertically at position
, then we get two sub-rectangles:
with dimensions
and
with dimensions
. Similarly, a horizontal cut at position
will give us two sub-rectangles
and
. In either case, the required
subrectangles will be divided among
and
. That is, for some
,
will be paritioned into
subrectanges while
will be partitioned into the remaining
subrectangles. So we divided the original problem into two subproblems with smaller inputs. The process will go on recursively as long as the value of
is greater than 1. Otherwise, if
the only possible parition is the original rectangle itself.
We still need to decide whether to cut vertically or horizontally and, in either case, we need to pick the values of and
or the values of
and
so as to minimize the maximum subrectangle area we might get. That is achieved by checking all possible values of
and
before deciding where to cut and how to divide
. Dynamic programming makes that feasible as the range of the input is relatively small.
Note that we can still get a more optimized recursive formula by utilizing the symmetric nature of rectangles, but the forumla mentioned above is sufficient to get the solution accepted on the online judge.
Analyzing a random search in a tree
Problem:
This is a 500-points problem from TC SRM440 Div1. It can be summarized as follows:
We are given a matrix. Some cells are marked closed and others are marked open. We can only move from one open cell to an adjacent open cell. The matrix is defined such that there is one and only one way from one open cell to another.
Our target is to reach a specific open cell, let’s call it the target cell. Our starting point is uniformly random (can be the target cell itself). Moving from one cell to another takes one second. Our motion is also uniformly random; that is, whenever we are in an open cell, the probability to move from that cell to any of the adjacent open cells is the same for all adjacent open cells (but we must move, we can’t stand still).
It is required to calculate the expected time to reach the target cell.
Solution:
Since the starting cell is uniformly random, we will calculate the expected time to reach the target cell for each possible starting cell; then the required overall expected time is just the average of all those expected times.
Let the open cells be where
is the target cell. For the special case where
is the starting cell, the expected time equals
. We need to calculate the expected time for the rest.
For any cell , where
, let the expected time to reach
be
. If the degree (the number of adjacent cells) of
is
then the probability of going to any of the adjacent cells is
. If we are in cell
, and we chose to go to the adjacent cell
then the expected time to reach
will be one second (for moving from
to
) plus
(the expected time to reach
from
). We end up with the following formula for
:
This is a system of linear equations that can be solved with any appropriate method, for example Gauss-Jordan elimination.
Comments:
The interesting thing with this problem is that it is asking for an algorithm to analyze the performance of a heuristic! We can think of this matrix as an undirected graph, where open cells of the matrix are vertices of the graph, and adjacent open cells correspond to adjacent vertices. Since there is one and only one way between any two open cells then it is actually a tree, and the algorithm is just analyzing the performance of a random search in a tree.
The number of the factors of the factors of a number!
Problem:
Given an integer N, find all the factors of N including the trivial factors (1 and N). For each factor f, find the cube of the number of factors of f. Then sum all those cubes.
For example, if N=4, the set of all possible factors are 1, 2 and 4. 1 has only one factor which is 1. 2 has 2 factors: 1 and 2. 4 has 3 factors: 1, 2 and 4. So the result equals
Solution:
A naive solution won’t work since the input is too large (up to 500,000 test cases and N can be as large as 5,000,000).
The idea is as follows: According to the fundamental theorem of arithmetic, any integer N>1 has a unique prime factorization (i.e., there is one and only one way to factorize this integer into a product of prime numbers). For example, 420 has the following unique prime factorization:
420 = 2 * 2 * 3 * 5 * 7
For a given integer N, let the prime factorization of N be where n is the number of prime factors of N,
is the
distinct prime factor which is raised to the power
.
We can see that the total number of general factors of N (all factors, not only the primes) equals .
This value follows from the fact that any general factor of N is just the product of some of the prime factors of N. So any such general factor
has the prime factorization
where
. The case where
for all i will give us the general factor
. The case where
for all i will give us the general factor
(the number itself).
However, we are not interested in the number of general factors of N; we are interested in the number of general factors of each general factor of N (!!) to be cubed each and summed together.
This value can be obtained as follows: using the same reasoning mentioned above, for each general factor of N, the number of general factors of
equals
. For each j and i, the value of
ranges from 0 to
inclusively, therefore the required value, the summation of cubes of numbers of general factors of all general factors of N, equals:
Re-distributing the summation we get the following formula:
Using the well-known series we can re-write our formula again as follows:
Thus, all we need to calculate is the prime factorization of N, from which we can get the value of for each i.
Necessary Optimizations:
Due to large input sizes, calculating the prime factorization of every input value from the scratch will give TLE. The process can be optimized by using simple dynamic programming.
For every integer N, the value we need to calculate is
where N’ is some integer less than N. So actually we can re-use the values of res( ) computed for smaller input values to compute larger ones. All we need for any given integer N is to find one of its values (the power of one of its prime factors), then we can calculate res(N) recursively, and there comes dynamic programming.
Finding a power of one of the prime factors
of N can be done easily once the prime factor itself
is found. We can use a modified version of Sieve of Eratosthenes to find a prime factor for each integer in a given range.
The original Sieve was used to determine which natural numbers are prime and which are not, up to some maximum number MAX. Starting from counter=2, we mark all multiples of the counter as not-prime until MAX is reached. Move the counter forward to the next unmarked integer (which will always be prime since it is not a multiple of any smaller integer).
In the modified version, instead of marking the multiples of the counter as not-prime, we will associate the counter with its multiples as a prime factor of them. This way we get a prime factor for each natural number up to MAX (which equals 5,000,000 is our case).
Having reached such step, dynamic programming can be done easily by iterating through natural numbers up to MAX, for each natural number N, we can calculate res(N) using the prime factor of N we have obtained before and res(N’) for some smaller natural number N’ also calculated before.