Yet Another Blog

Superstrings

with 3 comments

Just decided to revive this blog after long hibernation! First of all, congratulations to the team of Ain Shams for ranking 64 in their first participation in the ACM ICPC World Finals this year. AFAIK, this year’s world finals was supposed to  be in Sharm El-Sheikh in Egypt, but because of the revolution the contest was moved to the United States.

This post is about the solution to one of the three problems that I wrote for the local ACM ICPC at Ain Shams this year. The problem statement can be summarized as follows: Given an alphabet A, a string S that is constructed from the alphabet A, and an integer value L, calculate N[A,L,S], which is the number of possible strings of length L that can be constructed from the alphabet A and are superstrings of S. To make the problem easier, it is assumed that L and |A| are small integers, and the string S consists of distinct characters (i.e., no character exists twice within S). The complete problem statement can be found here.

The problem can be solved using dynamic programming as follows. First note that, given a superstring S’, S may exist multiple times within S’, but all the existences of S within S’ can’t overlap with each others since overlapping would require the existence of a sequence of characters that is both a prefix and a postfix of S, which is impossible because we assumed that S consists of distinct characters.

Let S* be the left-most occurrence of S within S’. The start position of S* within S’ (i.e., the position of the first character of S* within S) can be any index from 1 through L-len(S)+1. For example, if S=xyz and L=6, there are 4 possible start positions of S*; either 1, 2, 3, or 4.

If L-len(S)+1 < 1, then the solution to the problem is N[A,L,S]=0. Otherwise, let N[A,L,S,i] denote the number of possible superstrings  of S in which the start position of  S* is i. The value of N[A,L,S] is just the summation of N[A,L,S,i] for all i.

To compute N[A,L,S,i] let us divide S’ into three parts: a prefix that comprises the characters before S*, S* itself, and finally a postfix that comprises the characters after S*. Given that the start position of S* is i, the length of the prefix is i-1, and the length of the postfix is L-len(S)-i+1 characters.

LetN_{pre}[A,L,S,i]be the number of possible prefixes, andN_{post}[A,L,S,i]be the number of possible postfixes, given that S* starts at position i. ThenN[A,L,S,i] = N_{pre}[A,L,S,i] * N_{post}[A,L,S,i]. Using simple counting, the value ofN_{post}[A,L,S,i]can be computed as\vert A\vert^{L-len(S)-i+1}, where |A| is the size of the alphabet. To computeN_{pre}[A,L,S,i]note that S* by definition is the first occurrence of S within S’, thus a prefix should never contain S as a substring. Consequently,N_{pre} [A,L,S,i] = \vert A\vert^{i-1} - N [A, i-1 ,S ] . That is, the set of all possible strings of length i-1 excluding superstrings of S. In conclusion,

N[A,L,S] = \sum_{i} (\vert A\vert^{L-len(S)-i+1} ) ( \vert A\vert^{i-1} - N[A,i-1,S] )

The last equation defines N[A,L,S] recursively as a function in N[A,i-1,S]. Obviously, i-1 is always less than L. Hence, the problem can be solved using dynamic programming by saving the values of N[A,L,S], for all L, in a table indexed by L.

P.S. I wrote this post while sitting in Istanbul’s international airport, waiting for a flight. Twelve hours of transit is really too much!

Written by Hatem Abdelghani

June 4, 2011 at 2:52 am

Maximum number of crossed matchings between two sequences

leave a comment »

Back to OJ problems …

Problem

PKU – 1692

Given two sequences of integers,A=a_1,a_2,a_3...andB=b_1,b_2,b_3..., we are allowed to match an integera_ifrom the first sequence with an integerb_jfrom the second sequence only ifa_i=b_j.  No integer in either sequence can be matched with more than one integer in the other sequence.

Each matching must cross with another one and only one different matching. Two matchings(a_i,b_j)and(a_k,b_l)are considered crossed iffi<kandl<j, or ifk<iandj<l. The two matchings are considered different iffa_i\neq a_k.

We are required to know the maximum number of crossed matchings that can be done.  For example, in the following figure, the maximum is 4.

Solution

The key to solve this problem is to notice its similarity to the well-known longest common subsequence (LCS) problem. However, while the LCS problem asks for the maximum number of uncrossed matchings, this problem asks for the maximum number of crossed matchings.

LetA_mbe a sequence that contains the firstmelements ofAin order. Similarly,B_nis a sequence that contains the firstnelements ofBin order. LetMCM(A_m,B_n)be the maximum number of crossed matchings between the two sequencesA_mandB_n. We define the the predicateboundary(A_m,B_n)to be true if and only if there is a crossed matching betweenA_mandB_nthat involves the last element ofA_mand the last element ofB_n; that is,

boundary(A_m,B_n) \Leftrightarrow\ a_m\neq b_n\wedge (\exists\ i<m\ \exists\ j<n\ (a_m=b_j\wedge b_n=a_i))

The importance of this predicate is that it determines whether adding/dropping the last element inA_morB_nwill add/drop a crossed matching. If the predicateboundary(A_m,B_n)is satisfied, then we are interested to know what elements are involved in this boundary crossed matching. Letl\_idx(A_m, b_n)denote the index of the last occurrence of b_ninA_m; that is,l\_idx(A_m,b_n)=\displaystyle\max\{i:i<m\ and\ a_i=b_n\}. Similarly,l\_idx(B_n,a_m)=\displaystyle\max\{j:j<n\ and\ b_j=a_m\}is the index of the last occurrence ofa_minB_n. If we are going to count this boundary crossed matching in our maximal set of crossed matchings, then we will have to uncount any other crossed matchings that involve elements betweenl\_idx(A_m,b_n)anda_minclusive, or elements betweenl\_idx(B_n,a_m)andb_ninclusive. This effectively means going back toMCM(A_{l\_idx(A_m,b_n)-1},B_{l\_idx(B_n,a_m)-1}). Therefore we shouldn’t count a boundary crossed matching unless there are no other crossed matchings that will be uncoutned for this purpose. We defineMCMrecursively as follow:

Function MCM(A_m,B_n)
\ If m=0orn=0 Then Return 0
\ res\leftarrow max(MCM(A_m,B_{n-1}),MCM(A_{m-1},B_n))
\ If boundary(A_m,B_n) Then
\ \ res\leftarrow\max(res, MCM(A_{l\_idx(A_m,b_n)-1},B_{l\_idx(B_n,a_m)-1})+1)
\ Return res

which is quite similar to the recursive solution of the LCS problem. The problem can then be solved efficiently using dynamic programming.

Written by Hatem Abdelghani

March 23, 2010 at 6:15 pm

Two problems from the local ACM ICPC in FCIS Ain Shams

with 5 comments

It has been a long time since I last posted here. Since this is the Eid and I have enough time to waste, it seems appropriate to add one more post. This post contains my solutions to the two problems that I wrote for the local ACM ICPC contest of FCIS Ain Shams University that was held in October 2009.

The first problem is about PQ trees. It is a simple counting problem where the input is a PQ tree, written in parenthesized form, and the output is the number of possible permutations that the tree represents. The problem statement can be accessed here. The solution is as follows. To parse the input, a simple recursive descent parser is sufficient. The number of permutations nPerm is initialized by 1. For each P node p, the number of permutations nPerm is multiplied by the factorial of the number of children of p; that is, nPerm\leftarrow nPerm*nChildren(p) !. And for each Q node q, if nChildren(q)>1, then the number of permutations is doubled; that is, nPerm\leftarrow nPerm * 2. Otherwise, if nChildren(q)\le 1, then nPerm is not affected.

The second problem is about infinite geometric series. The problem statement can be accessed here. The problem can be summarized as follows. We have an amount of data that is initially zero. Each day N kilobytes are added. All the data are compressed at the end of each day by a compression ratio R, where the compression ratio is the size before compression divided by the size after compression. Note that the data are compressed multiple times. The data that are k days old are compressed k times. Given N and R, the problem asks for the the minimum storage capacity needed so that we may never run out of storage; in other words, the maximum size that our data might reach. The solution of the problem is as follows. After k days we have k chunks of data added, each was originally N kilobytes. The chunk of data that was added i days ago has been compressed i times and thus its size has been divided by R for i times. Therefore, the amount of data that we have after the k^{th} day is:

\displaystyle\sum_{i=0}^{k}N.(1/R)^i\ =\ N\displaystyle\sum_{i=0}^{k}(1/R)^i\ =\ N\frac{1-(1/R)^{k+1}}{1-(1/R)}

Since we seek the maximum amount of data that we will ever have, therefore we compute the amount of data after infinite number of days; that is, the amount of data as the value of k tends to infinity:

\displaystyle\lim_{k\rightarrow \infty}N\frac{1-(1/R)^{k+1}}{1-(1/R)}\ =\ N\frac{1}{1-(1/R)}\ =\ \frac{NR}{R-1}

Written by Hatem Abdelghani

November 27, 2009 at 12:07 pm

Posted in Problem Solving

Tagged with , , ,

The number of trailing zeros in the factorial of a large integer

with 4 comments

Problem

PKU – 1401

Given a (large) integer K, find the number of zeros on the right of the factorial of K. For example, factorial 15 = 1,307,674,368,000, so the number of zeros on right of the factorial (let’s denote it Z(1,307,674,368,000) ) equals 3.

The input integer K can be as large as 1000,000,000.

Solution

Given an integer N, let the prime factorization of N be 2^{p^N_2}.3^{p^N_3}.5^{p^N_5} ... .

The number of zeros on the right of N, that is Z(N), is simply the number of times N can be divided by 10 without getting a fraction. It is obvious that Z(N)=min\{p^N_2,p^N_5\}.

Since N, in our case, is the factorial of another integer, K, then the prime factorization of N is just the product of the prime factorizations of all the integers from 1 to K inclusive. This can be written as

\displaystyle\prod_{i=1}^K{2^{p^i_2}.3^{p^i_3}.5^{p^i_5} ...}

So, p^N_2=\displaystyle\sum_{i=1}^K{p^i_2} and p^N_5=\displaystyle\sum_{i=1}^K{p^i_5}.

Consequently, Z(N)=min\{\displaystyle\sum_{i=1}^K{p^i_2},\displaystyle\sum_{i=1}^K{p^i_5}\}

In other words, that is the minimum between the sum of the number of times 2 has appeared as a factor for all integers in the range 1 through K versus the sum of the number of times 5 has appeared as a factor for all integers in the same range. Obviously the number of times 5 has appeared is smaller, because it appears every 5 integers, versus 2 in the case of 2. So actually we are interested in p^N_5.

We might go through all the integers in the range 1 to K and find, for each integer in such range, how many times 5 appears as a factor. But due to the fact that we are dealing with large integers that won’t be feasible. Alternatively, we may solve it as follows: the integer division K/5 gives us the number of integers in the range 1 through K for which 5 is a factor at least once, K/25 gives the number of integers in the same range for which 5 is a factor at least twice, … and so on.

Consequently, p^N_5=\displaystyle\sum_{i=1}^{\lfloor log_5 K\rfloor}{K/5^i}, where the division is integer division.

The formula is logarithmic in the value of K, so it is linear in the number of bits. That is efficient enough to handle the required input.

Written by Hatem Abdelghani

July 28, 2009 at 9:22 am

Second shortest path

with 10 comments

Problem

UVa – 10342

Given a simple weighted graph G=(V,E) (having no loops and no multi-edges), we want to find for every two vertices a path between those vertices whose length is strictly greater than the length of the shortest path between them but still less than or equal to the length of any other path between them. Loops are allowed in our target path.

Solution

First use Floyd-Warshall to obtain shortest paths between all pairs. Let \pi_1(i,j) denote the shortest path between the vertices i and j. The shortest path between a vertex and itself is zero. If there is no path between i and j then \pi_1(i,j)=\infty.

Now use the following algorithm to obtain all-pairs second shortest paths. We will denote the second shortest path between a pair of vertices i and j as \pi_2(i,j). Initially \pi_2(i,j)=\infty for any pair of vertices (i,j).

For each edge (x,y) in E
\ For each pair of vertices (i,j) in V\times V
\ \ Let the path tmp(i,j) be the concatenation: \pi_1(i,x) , (x,y) , \pi_1(y,j)
\ \ If len(\pi_1(i,j)) < len(tmp(i,j)) < len(\pi_2(i,j))
\ \ \ \pi_2(i,j)=tmp(i,j)

The rationale behind that algorithm is as follows: Moving from a source vertex to a target vertex, if -at each vertex along our path- we move from our current vertex to an adjacent vertex that lies on the shortest path between the current vertex and the target vertex, we will eventually reach the target vertex throught shortest path. If, at one step, we made a mistake by moving to an adjacent vertex that doesn’t lie on the shortest path between the current vertex and the target vertex, we won’t reach our target vertex throught the shortest path. We need to know which wrong step will make us arrive through a path that is worse than the shortest path but still better than any other possible path. We also know that we have to make no more than one mistake since two mistakes will give us a path that is even worse than the second shortest path. So, for each pair of vertices i and j, we try to find all paths between i and j that is composed of a sequence of right steps (the shortest path between i and some reachable vertex x) followed by an arbitrary step that could be our mistake (the edge (x,y)) then another sequence of right steps (the shortest path between y and j). We search all such possible paths for a path that is longer than the shortest path (obtained through Floyd-Warshall) but still shorter than the rest.

Written by Hatem Abdelghani

July 4, 2009 at 2:24 am

Minimum-maximum integral rectangle partitioning

leave a comment »

Problem:

PKU-2951

We are given a rectangle with integral dimensions w*h. We need to partition this rectangle into m “sub-rectangles” with integral dimensions (not necessarily equal) such that the area of the largest sub-rectangle is minimized. Partitioning into sub-rectangles is achieved by splitting the rectangle either vertically or horizontally to get two smaller rectangles, then splitting the smaller rectangles to get even smaller rectangles … and so on.

Solution:

The problem is NP-Hard by first look, though I didn’t try to prove that formally. For small input values, the problem lends itself directly to a pseudo-polynomial dynamic programming solution. Let sol(w,h,m) be the minimum maximum integral subrectangle area that results from partitioning a rectangle of dimensions w*h into m subrectangles. Let sol1(w,m,h) and sol2(w,m,h) be defined as follows:

sol1(w,h,m) = min_{1\le w'< w, 1\le m'< m}\{ max\{ sol(w',h,m') , sol(w-w',h,m-m') \} \}

sol2(w,h,m) = min_{1\le h'< h, 1\le m'< m}\{ max\{ sol(w,h',m') , sol(w,h-h',m-m') \} \}

Then, for m>1, the following recursive formula holds:

sol(w,h,m) = min \{ sol1(w,h,m) , sol2(w,h,m) \}

And the base case of the recursion is sol(w,h,1)=w*h.

This is a recursive formula where the value of sol given the arguments w,h,m is a function in the value of sol given smaller arguments; and thus comes dynamic programming.

The recursive formula can be explained as follows: we need to cut the given rectangular area m-1 successive cuts to get m subrectangles. For a w*h rectangular area, we have w+h possible first cuts to be performed: one for each integral step along the width and one for each integral step along the height. If the first cut was done vertically at position w', then we get two sub-rectangles: r1 with dimensions w'*h and r2 with dimensions (w-w')*h. Similarly, a horizontal cut at position h' will give us two sub-rectangles r1=w*h' and r2=w*(h-h'). In either case, the required m subrectangles will be divided among r1 and r2. That is, for some 1\le m'< m, r1 will be paritioned into m' subrectanges while r2 will be partitioned into the remaining m-m' subrectangles. So we divided the original problem into two subproblems with smaller inputs. The process will go on recursively as long as the value of m is greater than 1. Otherwise, if m=1 the only possible parition is the original rectangle itself.

We still need to decide whether to cut vertically or horizontally and, in either case, we need to pick the values of w' and m' or the values of h' and m' so as to minimize the maximum subrectangle area we might get. That is achieved by checking all possible values of w',h' and m' before deciding where to cut and how to divide m. Dynamic programming makes that feasible as the range of the input is relatively small.

Note that we can still get a more optimized recursive formula by utilizing the symmetric nature of rectangles, but the forumla mentioned above is sufficient to get the solution accepted on the online judge.

Written by Hatem Abdelghani

May 22, 2009 at 2:06 am

Analyzing a random search in a tree

with 2 comments

Problem:
This is a 500-points problem from TC SRM440 Div1. It can be summarized as follows:

We are given a matrix. Some cells are marked closed and others are marked open. We can only move from one open cell to an adjacent open cell. The matrix is defined such that there is one and only one way from one open cell to another.

Our target is to reach a specific open cell, let’s call it the target cell. Our starting point is uniformly random (can be the target cell itself). Moving from one cell to another takes one second. Our motion is also uniformly random; that is, whenever we are in an open cell, the probability to move from that cell to any of the adjacent open cells is the same for all adjacent open cells (but we must move, we can’t stand still).

It is required to calculate the expected time to reach the target cell.

Solution:
Since the starting cell is uniformly random, we will calculate the expected time to reach the target cell for each possible starting cell; then the required overall expected time is just the average of all those expected times.

Let the open cells be c_0, c_1, c_2, ... , c_n where c_0 is the target cell. For the special case where c_0 is the starting cell, the expected time equals 0. We need to calculate the expected time for the rest.

For any cell c_i, where 0<i\le n, let the expected time to reach c_0 be E_i. If the degree (the number of adjacent cells) of c_i is d_i then the probability of going to any of the adjacent cells is \frac{1}{d_i}. If we are in cell c_i, and we chose to go to the adjacent cell c_j then the expected time to reach c_0 will be one second (for moving from c_i to c_j) plus E_j (the expected time to reach c_0 from c_j). We end up with the following formula for E_i:

E_i = \displaystyle\sum_{j:\ c_j\ adjacent\ to\ c_i} {\displaystyle\frac{1}{d_i} (1+E_j)} = 1 + \displaystyle\frac{1}{d_i} \displaystyle\sum_{j:\ c_j\ adjacent\ to\ c_i} E_j

This is a system of linear equations that can be solved with any appropriate method, for example Gauss-Jordan elimination.

Comments:
The interesting thing with this problem is that it is asking for an algorithm to analyze the performance of a heuristic! We can think of this matrix as an undirected graph, where open cells of the matrix are vertices of the graph, and adjacent open cells correspond to adjacent vertices. Since there is one and only one way between any two open cells then it is actually a tree, and the algorithm is just analyzing the performance of a random search in a tree.

Written by Hatem Abdelghani

May 13, 2009 at 10:59 pm