Optimal runtime for simple compression scheme
Here is a simple puzzle with some applications in bio-informatics. It's an abstracted version of something that came up at a friend's job.
Consider a very simple compression scheme, whose output consists of a stream of two operations:
put(a)
: outputs a charactera
dup()
: duplicates all output written so far
For notational convenience, write the character x
itself for put('x')
and *
for dup()
.
For example, "a**b*c"
expands to "aaaabaaaabc"
.
To compress a given string s
, find the shortest list of these two operations to generate it.
For example for "aaaaaaaaaab"
shortens to a**a*b
. (a***aab
is also possible, but one character longer.)
My question is: what's the best achievable runtime for optimal compression? (And what's the algorithm that achieves that runtime.)
I believe linear runtime is possible, but I haven't found anything definitely better than quadratic, yet. (Not too worried about using extra space.)
algorithm big-o lossless-compression
add a comment |
Here is a simple puzzle with some applications in bio-informatics. It's an abstracted version of something that came up at a friend's job.
Consider a very simple compression scheme, whose output consists of a stream of two operations:
put(a)
: outputs a charactera
dup()
: duplicates all output written so far
For notational convenience, write the character x
itself for put('x')
and *
for dup()
.
For example, "a**b*c"
expands to "aaaabaaaabc"
.
To compress a given string s
, find the shortest list of these two operations to generate it.
For example for "aaaaaaaaaab"
shortens to a**a*b
. (a***aab
is also possible, but one character longer.)
My question is: what's the best achievable runtime for optimal compression? (And what's the algorithm that achieves that runtime.)
I believe linear runtime is possible, but I haven't found anything definitely better than quadratic, yet. (Not too worried about using extra space.)
algorithm big-o lossless-compression
1
How? Please add an answer, if you have an idea.
– Matthias
Nov 14 '18 at 17:02
add a comment |
Here is a simple puzzle with some applications in bio-informatics. It's an abstracted version of something that came up at a friend's job.
Consider a very simple compression scheme, whose output consists of a stream of two operations:
put(a)
: outputs a charactera
dup()
: duplicates all output written so far
For notational convenience, write the character x
itself for put('x')
and *
for dup()
.
For example, "a**b*c"
expands to "aaaabaaaabc"
.
To compress a given string s
, find the shortest list of these two operations to generate it.
For example for "aaaaaaaaaab"
shortens to a**a*b
. (a***aab
is also possible, but one character longer.)
My question is: what's the best achievable runtime for optimal compression? (And what's the algorithm that achieves that runtime.)
I believe linear runtime is possible, but I haven't found anything definitely better than quadratic, yet. (Not too worried about using extra space.)
algorithm big-o lossless-compression
Here is a simple puzzle with some applications in bio-informatics. It's an abstracted version of something that came up at a friend's job.
Consider a very simple compression scheme, whose output consists of a stream of two operations:
put(a)
: outputs a charactera
dup()
: duplicates all output written so far
For notational convenience, write the character x
itself for put('x')
and *
for dup()
.
For example, "a**b*c"
expands to "aaaabaaaabc"
.
To compress a given string s
, find the shortest list of these two operations to generate it.
For example for "aaaaaaaaaab"
shortens to a**a*b
. (a***aab
is also possible, but one character longer.)
My question is: what's the best achievable runtime for optimal compression? (And what's the algorithm that achieves that runtime.)
I believe linear runtime is possible, but I haven't found anything definitely better than quadratic, yet. (Not too worried about using extra space.)
algorithm big-o lossless-compression
algorithm big-o lossless-compression
edited Nov 14 '18 at 17:02
Matthias
asked Nov 14 '18 at 15:00
MatthiasMatthias
589
589
1
How? Please add an answer, if you have an idea.
– Matthias
Nov 14 '18 at 17:02
add a comment |
1
How? Please add an answer, if you have an idea.
– Matthias
Nov 14 '18 at 17:02
1
1
How? Please add an answer, if you have an idea.
– Matthias
Nov 14 '18 at 17:02
How? Please add an answer, if you have an idea.
– Matthias
Nov 14 '18 at 17:02
add a comment |
1 Answer
1
active
oldest
votes
Yes, a linear run time is possible for this compression scheme.
Create a list dp
. The ith element of this list will the best compression possible for the first i
elements of the string.
dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
dp[i] = min(dp[i], dp[i/2] + 1)
To check if first i/2
elements are equal to second i/2
elements, you can find the longest common prefix between the string and the suffix starting at index i/2
. If this prefix is greater than or equal to i/2
in length, then the first i/2
elements are indeed equal to the second i/2
elements.
Speeding up this operation is possible using a modified LCP array.
First, build a suffix array for the string in O(n)
.
Then, build a longest common prefix array for the suffix array in O(n)
.
Now, find the index of the complete string in the suffix array. Let's say it's i
. Now, iterate from i
to end of LCP array replacing each value with the minimum seen so far. Similarly, iterate down from i-1
to 0
in the LCP array replacing each value with the minimum seen so far.
Once this is done, each value in the LCP array represents the longest common prefix of that suffix with the complete string, which is what the algorithm required.
Note that these values are ordered according to the sorted suffixes rather than the position of the suffixes in the string. Mapping it is fairly simple using the suffix array though.
Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.
– Matthias
Nov 15 '18 at 14:49
I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.
– Matthias
Nov 15 '18 at 14:51
@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.
– merlyn
Nov 15 '18 at 14:55
Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.
– Matthias
Nov 16 '18 at 3:12
BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.
– Matthias
Dec 18 '18 at 16:03
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53303119%2foptimal-runtime-for-simple-compression-scheme%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Yes, a linear run time is possible for this compression scheme.
Create a list dp
. The ith element of this list will the best compression possible for the first i
elements of the string.
dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
dp[i] = min(dp[i], dp[i/2] + 1)
To check if first i/2
elements are equal to second i/2
elements, you can find the longest common prefix between the string and the suffix starting at index i/2
. If this prefix is greater than or equal to i/2
in length, then the first i/2
elements are indeed equal to the second i/2
elements.
Speeding up this operation is possible using a modified LCP array.
First, build a suffix array for the string in O(n)
.
Then, build a longest common prefix array for the suffix array in O(n)
.
Now, find the index of the complete string in the suffix array. Let's say it's i
. Now, iterate from i
to end of LCP array replacing each value with the minimum seen so far. Similarly, iterate down from i-1
to 0
in the LCP array replacing each value with the minimum seen so far.
Once this is done, each value in the LCP array represents the longest common prefix of that suffix with the complete string, which is what the algorithm required.
Note that these values are ordered according to the sorted suffixes rather than the position of the suffixes in the string. Mapping it is fairly simple using the suffix array though.
Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.
– Matthias
Nov 15 '18 at 14:49
I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.
– Matthias
Nov 15 '18 at 14:51
@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.
– merlyn
Nov 15 '18 at 14:55
Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.
– Matthias
Nov 16 '18 at 3:12
BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.
– Matthias
Dec 18 '18 at 16:03
add a comment |
Yes, a linear run time is possible for this compression scheme.
Create a list dp
. The ith element of this list will the best compression possible for the first i
elements of the string.
dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
dp[i] = min(dp[i], dp[i/2] + 1)
To check if first i/2
elements are equal to second i/2
elements, you can find the longest common prefix between the string and the suffix starting at index i/2
. If this prefix is greater than or equal to i/2
in length, then the first i/2
elements are indeed equal to the second i/2
elements.
Speeding up this operation is possible using a modified LCP array.
First, build a suffix array for the string in O(n)
.
Then, build a longest common prefix array for the suffix array in O(n)
.
Now, find the index of the complete string in the suffix array. Let's say it's i
. Now, iterate from i
to end of LCP array replacing each value with the minimum seen so far. Similarly, iterate down from i-1
to 0
in the LCP array replacing each value with the minimum seen so far.
Once this is done, each value in the LCP array represents the longest common prefix of that suffix with the complete string, which is what the algorithm required.
Note that these values are ordered according to the sorted suffixes rather than the position of the suffixes in the string. Mapping it is fairly simple using the suffix array though.
Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.
– Matthias
Nov 15 '18 at 14:49
I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.
– Matthias
Nov 15 '18 at 14:51
@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.
– merlyn
Nov 15 '18 at 14:55
Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.
– Matthias
Nov 16 '18 at 3:12
BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.
– Matthias
Dec 18 '18 at 16:03
add a comment |
Yes, a linear run time is possible for this compression scheme.
Create a list dp
. The ith element of this list will the best compression possible for the first i
elements of the string.
dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
dp[i] = min(dp[i], dp[i/2] + 1)
To check if first i/2
elements are equal to second i/2
elements, you can find the longest common prefix between the string and the suffix starting at index i/2
. If this prefix is greater than or equal to i/2
in length, then the first i/2
elements are indeed equal to the second i/2
elements.
Speeding up this operation is possible using a modified LCP array.
First, build a suffix array for the string in O(n)
.
Then, build a longest common prefix array for the suffix array in O(n)
.
Now, find the index of the complete string in the suffix array. Let's say it's i
. Now, iterate from i
to end of LCP array replacing each value with the minimum seen so far. Similarly, iterate down from i-1
to 0
in the LCP array replacing each value with the minimum seen so far.
Once this is done, each value in the LCP array represents the longest common prefix of that suffix with the complete string, which is what the algorithm required.
Note that these values are ordered according to the sorted suffixes rather than the position of the suffixes in the string. Mapping it is fairly simple using the suffix array though.
Yes, a linear run time is possible for this compression scheme.
Create a list dp
. The ith element of this list will the best compression possible for the first i
elements of the string.
dp[1] = 1
dp[i] = dp[i-1] + 1
if i is even and first i/2 elements of string are equal to second i/2 elements:
dp[i] = min(dp[i], dp[i/2] + 1)
To check if first i/2
elements are equal to second i/2
elements, you can find the longest common prefix between the string and the suffix starting at index i/2
. If this prefix is greater than or equal to i/2
in length, then the first i/2
elements are indeed equal to the second i/2
elements.
Speeding up this operation is possible using a modified LCP array.
First, build a suffix array for the string in O(n)
.
Then, build a longest common prefix array for the suffix array in O(n)
.
Now, find the index of the complete string in the suffix array. Let's say it's i
. Now, iterate from i
to end of LCP array replacing each value with the minimum seen so far. Similarly, iterate down from i-1
to 0
in the LCP array replacing each value with the minimum seen so far.
Once this is done, each value in the LCP array represents the longest common prefix of that suffix with the complete string, which is what the algorithm required.
Note that these values are ordered according to the sorted suffixes rather than the position of the suffixes in the string. Mapping it is fairly simple using the suffix array though.
answered Nov 15 '18 at 6:20
merlynmerlyn
1,75011223
1,75011223
Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.
– Matthias
Nov 15 '18 at 14:49
I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.
– Matthias
Nov 15 '18 at 14:51
@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.
– merlyn
Nov 15 '18 at 14:55
Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.
– Matthias
Nov 16 '18 at 3:12
BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.
– Matthias
Dec 18 '18 at 16:03
add a comment |
Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.
– Matthias
Nov 15 '18 at 14:49
I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.
– Matthias
Nov 15 '18 at 14:51
@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.
– merlyn
Nov 15 '18 at 14:55
Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.
– Matthias
Nov 16 '18 at 3:12
BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.
– Matthias
Dec 18 '18 at 16:03
Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.
– Matthias
Nov 15 '18 at 14:49
Thanks. I'll go through this in detail tomorrow. From my literature research, I was already expecting that LCP and suffix array might be useful.
– Matthias
Nov 15 '18 at 14:49
I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.
– Matthias
Nov 15 '18 at 14:51
I also made a sketch for a simple algorithm with two rolling hashes: Basically, roll one from [0:i] the other from [i:2*i]. A Rabin fingerprint might work.
– Matthias
Nov 15 '18 at 14:51
@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.
– merlyn
Nov 15 '18 at 14:55
@Matthias Yeah, fingerprinting sounds like a much simpler idea that might be good enough in practical scenarios.
– merlyn
Nov 15 '18 at 14:55
Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.
– Matthias
Nov 16 '18 at 3:12
Oh, I think in practice most of the time the naive approach would work fast (much faster than it's worst case quadratic behaviour). Btw, I think you can drop the dynamic programming in your solution and use greedy, if you walk backwards through your string. For fingerprinting to work in theory, I have to get a good handle on error probabilities. (So far I've only managed to achieve O(n log n) runtime via fingerprinting.) PS Please have a look at cstheory.stackexchange.com/q/41251/26936 too.
– Matthias
Nov 16 '18 at 3:12
BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.
– Matthias
Dec 18 '18 at 16:03
BTW, the dynamic programming bit at the beginning is overkill. Greedy from the back works just fine. I have found a new an simpler linear scheme, too. Will post soon.
– Matthias
Dec 18 '18 at 16:03
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53303119%2foptimal-runtime-for-simple-compression-scheme%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
How? Please add an answer, if you have an idea.
– Matthias
Nov 14 '18 at 17:02