Optimal runtime for simple compression scheme

Here is a simple puzzle with some applications in bio-informatics. It's an abstracted version of something that came up at a friend's job.

Consider a very simple compression scheme, whose output consists of a stream of two operations:

put(a): outputs a character a

dup(): duplicates all output written so far

For notational convenience, write the character x itself for put('x') and * for dup().

For example, "a**b*c" expands to "aaaabaaaabc".

To compress a given string s, find the shortest list of these two operations to generate it.

For example for "aaaaaaaaaab" shortens to a**a*b. (a***aab is also possible, but one character longer.)

My question is: what's the best achievable runtime for optimal compression? (And what's the algorithm that achieves that runtime.)

I believe linear runtime is possible, but I haven't found anything definitely better than quadratic, yet. (Not too worried about using extra space.)

edited Nov 14 '18 at 17:02

asked Nov 14 '18 at 15:00

Matthias

589

1

How? Please add an answer, if you have an idea.

– Matthias
Nov 14 '18 at 17:02

add a comment |

Here is a simple puzzle with some applications in bio-informatics. It's an abstracted version of something that came up at a friend's job.

Consider a very simple compression scheme, whose output consists of a stream of two operations:

put(a): outputs a character a

dup(): duplicates all output written so far

For notational convenience, write the character x itself for put('x') and * for dup().

For example, "a**b*c" expands to "aaaabaaaabc".

To compress a given string s, find the shortest list of these two operations to generate it.

For example for "aaaaaaaaaab" shortens to a**a*b. (a***aab is also possible, but one character longer.)

My question is: what's the best achievable runtime for optimal compression? (And what's the algorithm that achieves that runtime.)

I believe linear runtime is possible, but I haven't found anything definitely better than quadratic, yet. (Not too worried about using extra space.)

edited Nov 14 '18 at 17:02

asked Nov 14 '18 at 15:00

Matthias

589

1

How? Please add an answer, if you have an idea.

– Matthias
Nov 14 '18 at 17:02

add a comment |

Here is a simple puzzle with some applications in bio-informatics. It's an abstracted version of something that came up at a friend's job.

Consider a very simple compression scheme, whose output consists of a stream of two operations:

put(a): outputs a character a

dup(): duplicates all output written so far

For notational convenience, write the character x itself for put('x') and * for dup().

For example, "a**b*c" expands to "aaaabaaaabc".

To compress a given string s, find the shortest list of these two operations to generate it.

For example for "aaaaaaaaaab" shortens to a**a*b. (a***aab is also possible, but one character longer.)

My question is: what's the best achievable runtime for optimal compression? (And what's the algorithm that achieves that runtime.)

I believe linear runtime is possible, but I haven't found anything definitely better than quadratic, yet. (Not too worried about using extra space.)

edited Nov 14 '18 at 17:02

asked Nov 14 '18 at 15:00

Matthias

589

Here is a simple puzzle with some applications in bio-informatics. It's an abstracted version of something that came up at a friend's job.

Consider a very simple compression scheme, whose output consists of a stream of two operations:

put(a): outputs a character a

dup(): duplicates all output written so far

For notational convenience, write the character x itself for put('x') and * for dup().

For example, "a**b*c" expands to "aaaabaaaabc".

To compress a given string s, find the shortest list of these two operations to generate it.

For example for "aaaaaaaaaab" shortens to a**a*b. (a***aab is also possible, but one character longer.)

My question is: what's the best achievable runtime for optimal compression? (And what's the algorithm that achieves that runtime.)

I believe linear runtime is possible, but I haven't found anything definitely better than quadratic, yet. (Not too worried about using extra space.)

algorithm big-o lossless-compression

edited Nov 14 '18 at 17:02

asked Nov 14 '18 at 15:00

Matthias

589

edited Nov 14 '18 at 17:02

asked Nov 14 '18 at 15:00

Matthias

589

edited Nov 14 '18 at 17:02

asked Nov 14 '18 at 15:00

Matthias

589

asked Nov 14 '18 at 15:00

Matthias

589

asked Nov 14 '18 at 15:00

Matthias

589

1

How? Please add an answer, if you have an idea.

– Matthias
Nov 14 '18 at 17:02

add a comment |

1

How? Please add an answer, if you have an idea.

– Matthias
Nov 14 '18 at 17:02

How? Please add an answer, if you have an idea.

– Matthias
Nov 14 '18 at 17:02

add a comment |

1 Answer
1

active

oldest

votes

Yes, a linear run time is possible for this compression scheme.

Create a list dp. The ith element of this list will the best compression possible for the first i elements of the string.

dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
 dp[i] = min(dp[i], dp[i/2] + 1)

To check if first i/2 elements are equal to second i/2 elements, you can find the longest common prefix between the string and the suffix starting at index i/2. If this prefix is greater than or equal to i/2 in length, then the first i/2 elements are indeed equal to the second i/2 elements.

Speeding up this operation is possible using a modified LCP array.

First, build a suffix array for the string in O(n).

Then, build a longest common prefix array for the suffix array in O(n).

Now, find the index of the complete string in the suffix array. Let's say it's i. Now, iterate from i to end of LCP array replacing each value with the minimum seen so far. Similarly, iterate down from i-1 to 0 in the LCP array replacing each value with the minimum seen so far.

Once this is done, each value in the LCP array represents the longest common prefix of that suffix with the complete string, which is what the algorithm required.
Note that these values are ordered according to the sorted suffixes rather than the position of the suffixes in the string. Mapping it is fairly simple using the suffix array though.

answered Nov 15 '18 at 6:20

merlyn

1,75011223

Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.

– Matthias
Nov 15 '18 at 14:49

I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.

– Matthias
Nov 15 '18 at 14:51

@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.

– merlyn
Nov 15 '18 at 14:55

Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.

– Matthias
Nov 16 '18 at 3:12

BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.

– Matthias
Dec 18 '18 at 16:03

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53303119%2foptimal-runtime-for-simple-compression-scheme%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Yes, a linear run time is possible for this compression scheme.

Create a list dp. The ith element of this list will the best compression possible for the first i elements of the string.

dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
 dp[i] = min(dp[i], dp[i/2] + 1)

Speeding up this operation is possible using a modified LCP array.

First, build a suffix array for the string in O(n).

Then, build a longest common prefix array for the suffix array in O(n).

answered Nov 15 '18 at 6:20

merlyn

1,75011223

Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.

– Matthias
Nov 15 '18 at 14:49

I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.

– Matthias
Nov 15 '18 at 14:51

@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.

– merlyn
Nov 15 '18 at 14:55

Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.

– Matthias
Nov 16 '18 at 3:12

BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.

– Matthias
Dec 18 '18 at 16:03

add a comment |

Yes, a linear run time is possible for this compression scheme.

Create a list dp. The ith element of this list will the best compression possible for the first i elements of the string.

dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
 dp[i] = min(dp[i], dp[i/2] + 1)

Speeding up this operation is possible using a modified LCP array.

First, build a suffix array for the string in O(n).

Then, build a longest common prefix array for the suffix array in O(n).

answered Nov 15 '18 at 6:20

merlyn

1,75011223

Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.

– Matthias
Nov 15 '18 at 14:49

I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.

– Matthias
Nov 15 '18 at 14:51

@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.

– merlyn
Nov 15 '18 at 14:55

Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.

– Matthias
Nov 16 '18 at 3:12

BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.

– Matthias
Dec 18 '18 at 16:03

add a comment |

Yes, a linear run time is possible for this compression scheme.

Create a list dp. The ith element of this list will the best compression possible for the first i elements of the string.

dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
 dp[i] = min(dp[i], dp[i/2] + 1)

Speeding up this operation is possible using a modified LCP array.

First, build a suffix array for the string in O(n).

Then, build a longest common prefix array for the suffix array in O(n).

answered Nov 15 '18 at 6:20

merlyn

1,75011223

Yes, a linear run time is possible for this compression scheme.

Create a list dp. The ith element of this list will the best compression possible for the first i elements of the string.

dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
 dp[i] = min(dp[i], dp[i/2] + 1)

Speeding up this operation is possible using a modified LCP array.

First, build a suffix array for the string in O(n).

Then, build a longest common prefix array for the suffix array in O(n).

answered Nov 15 '18 at 6:20

merlyn

1,75011223

answered Nov 15 '18 at 6:20

merlyn

1,75011223

answered Nov 15 '18 at 6:20

merlyn

1,75011223

answered Nov 15 '18 at 6:20

merlyn

1,75011223

Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.

– Matthias
Nov 15 '18 at 14:49

I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.

– Matthias
Nov 15 '18 at 14:51

@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.

– merlyn
Nov 15 '18 at 14:55

Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.

– Matthias
Nov 16 '18 at 3:12

BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.

– Matthias
Dec 18 '18 at 16:03

add a comment |

Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.

– Matthias
Nov 15 '18 at 14:49

I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.

– Matthias
Nov 15 '18 at 14:51

@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.

– merlyn
Nov 15 '18 at 14:55

Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.

– Matthias
Nov 16 '18 at 3:12

BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.

– Matthias
Dec 18 '18 at 16:03

Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.

– Matthias
Nov 15 '18 at 14:49

I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.

– Matthias
Nov 15 '18 at 14:51

@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.

– merlyn
Nov 15 '18 at 14:55

Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.

– Matthias
Nov 16 '18 at 3:12

BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.

– Matthias
Dec 18 '18 at 16:03

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Pfthb