Passing an iterator to dask.delayed function
I'm trying to pass an iterator over a (non-standard) file-like object to a dask.delayed function. When I try to compute(), I get the following message from dask, and the traceback below.
distributed.protocol.pickle - INFO - Failed to serialize
([<items>, ... ], OrderedDict(..)).
Exception: self.ptr cannot be converted to a Python object for pickling
Traceback (most recent call last):
File "/home/user/miniconda3/lib/python3.6/site-packages/distributed/protocol/pickle.py", line 38, in dumps
result = pickle.dumps(x, protocol=pickle.HIGHEST_PROTOCOL)
File "stringsource", line 2, in pysam.libcbcf.VariantRecord.__reduce_cython__
TypeError: self.ptr cannot be converted to a Python object for pickling
The corresponding part of the source looks like this:
delayed(to_arrow)(vf.fetch(..), ordered_dict)
vf is the file-like object, and vf.fetch(..) returns the iterator over the records present in the file (this is a VCF file, and I'm using the pysam library to read it). I hope this provides sufficient context.
The message from dask shows that the iteration happens during the function call instead of inside the function, which led me to believe that maybe passing iterators is not okay. So I did a quick check with sum(range(..)), which seems to work. Now I'm stumped; what am I missing?
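For reference, the quick check was along these lines (the numbers are arbitrary); a range is a small, picklable Python object, so it travels through the scheduler without trouble:

    from dask import delayed

    # range objects pickle fine, so the delayed task's argument serializes
    # without problems and the computation works as expected.
    total = delayed(sum)(range(1_000_000))
    print(total.compute())  # 499999500000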
Providing a minimal working example for this is a bit difficult, but maybe the following helps (a rough sketch putting these steps together follows the list):
- Download a VCF file (and its index) from here: say, ALL.chrY*vcf.gz and the matching .tbi
- Install pysam: pip3 install --user pysam
- Open the file: vf = VariantFile('/path/to/file.vcf.gz', mode='r')
- Use something like this as the iterator: vf.fetch("Y", 2_600_000, 2_700_000)
- For the delayed function, you could have an empty loop.
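Putting those steps together, a rough sketch of a reproducer is below (untested as written; the path, the dask.distributed Client, and the to_arrow stand-in are assumptions, not the real code):

    from collections import OrderedDict

    from dask import delayed
    from dask.distributed import Client
    from pysam import VariantFile

    client = Client()  # distributed scheduler, so task arguments must pickle

    def to_arrow(records, columns):
        # Stand-in for the real conversion: just drain the iterator.
        for _ in records:
            pass

    vf = VariantFile('/path/to/file.vcf.gz', mode='r')
    ordered_dict = OrderedDict()  # placeholder for the real column spec

    task = delayed(to_arrow)(vf.fetch("Y", 2_600_000, 2_700_000), ordered_dict)
    # Expected to fail while dask serializes the arguments, with
    # "TypeError: self.ptr cannot be converted to a Python object for pickling".
    task.compute()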
Tags: python, dask, dask-delayed
asked Nov 10 at 7:04 by suvayu
1 Answer
The short answer is: restructure your delayed function such that the file opening stage happens inside the function, and you instead pass the arguments (e.g., path) required to point to that particular file.
If you are interested, you can look into how Dask does this internally, in the class dask.bytes.core.OpenFile, which is a serializable thing that defers opening until it is used in a with block. That's one convenient way to do it, but you can probably do something simpler.
answered Nov 18 at 15:43 by mdurant
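A minimal sketch of the suggested restructuring, assuming pysam and a helper rewritten to take a path and region instead of a live iterator (the function name and the placeholder body are made up for illustration, not taken from the answer):

    from collections import OrderedDict

    from dask import delayed
    from pysam import VariantFile

    def to_arrow_from_path(path, contig, start, stop, columns):
        # Only plain strings and ints cross the scheduler boundary; the
        # file is opened (and the iterator created) inside the task.
        with VariantFile(path, mode='r') as vf:
            return sum(1 for _ in vf.fetch(contig, start, stop))  # placeholder work

    columns = OrderedDict()  # placeholder for the real column spec
    task = delayed(to_arrow_from_path)('/path/to/file.vcf.gz',
                                       "Y", 2_600_000, 2_700_000, columns)
    result = task.compute()

Since the region is now just plain values, building one such task per region is also an easy way to run several of them in parallel.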
I did restructure my code to open the file inside the function as a workaround. But that has a cost: it's a rather large gzipped file, which now has to be decompressed multiple times. Seeking to the right record is probably okay, since there's an external index. I'll have a look at OpenFile, thanks for the pointer. – suvayu, Nov 18 at 19:05
Yes, a gzipped file will not allow parallel/random access, and you might want to try other formats or chunking into several files. – mdurant, Nov 18 at 22:28