🐛 Bug
It is not possible to use a custom index_path with StreamingDataset when input_dir is a local path. This appears to be caused by a check in subsample_streaming_dataset that determines whether input_dir is a URL (litData/src/litdata/utilities/dataset_utilities.py, line 83 in 36431bd):
if not os.path.exists(cache_index_filepath) and isinstance(input_dir.url, str):
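To illustrate why I suspect this check, here is a minimal sketch. The `Dir` dataclass and `index_branch_taken` function are hypothetical stand-ins written for this report, not litdata's code. I am also assuming that a plain local path resolves to `url=None`, while a `local:` path keeps a string `url`:

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dir:
    """Hypothetical stand-in for the resolved input_dir (not litdata's class)."""
    path: Optional[str] = None
    url: Optional[str] = None


def index_branch_taken(input_dir: Dir, cache_index_filepath: str) -> bool:
    """Evaluate the same condition as the line quoted above."""
    return not os.path.exists(cache_index_filepath) and isinstance(input_dir.url, str)


# Assumption: a plain local path has no URL.
local_dir = Dir(path="tiny_stories_data/data", url=None)
# Assumption: the "local:" form is kept as a URL string.
prefixed_dir = Dir(path=None, url="local:tiny_stories_data/data")

# An index path inside a fresh, empty temp directory, so it does not exist yet.
missing_index = os.path.join(tempfile.mkdtemp(), "index.json")

print(index_branch_taken(local_dir, missing_index))     # False
print(index_branch_taken(prefixed_dir, missing_index))  # True
```

If those assumptions hold, the branch guarded by this condition is never entered for a plain local input_dir, which would match the behavior I am seeing.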
To Reproduce
As a simple example, download a small dataset from Hugging Face as Parquet files and index it with index_parquet_dataset:
from huggingface_hub import snapshot_download
import litdata as ld

def main():
    # Download the Parquet files, then write the index to a separate cache folder.
    snapshot_download(repo_id="roneneldan/TinyStories", local_dir="tiny_stories_data", repo_type="dataset")
    ld.index_parquet_dataset("tiny_stories_data/data", cache_dir="my-custom-cache")

if __name__ == "__main__":
    main()
Then create a StreamingDataset with:
from litdata import StreamingDataset
from litdata.streaming.item_loader import ParquetLoader
dataset = StreamingDataset("tiny_stories_data/data", shuffle=True, index_path="my-custom-cache/index.json", item_loader=ParquetLoader())
This raises:
ValueError: The provided dataset `tiny_stories_data/data` doesn't contain any index.json file.
HINT: Did you successfully optimize a dataset to the provided `input_dir`?
Expected behavior
The index_path docstring states:
Path to index.json for the Parquet dataset. If index_path is a directory, the function will look for index.json within it. If index_path is a full file path, it will use that directly.
Therefore, to my understanding, StreamingDataset should accept an index_path that points to an index.json located outside input_dir (for example, in a separate folder) and use that index regardless of whether input_dir is a local path or a URL.
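As a rough sketch of how I read that docstring, the resolution could look like the following. `resolve_index_path` is a hypothetical helper I wrote for illustration. It is not a litdata function, and it only encodes the directory-versus-file rule quoted above:

```python
import os
import tempfile


def resolve_index_path(index_path: str) -> str:
    """Hypothetical helper: return the index.json file that index_path refers to."""
    if os.path.isdir(index_path):
        # Directory: look for index.json inside it.
        return os.path.join(index_path, "index.json")
    # Otherwise treat index_path as the full path to the index file.
    return index_path


with tempfile.TemporaryDirectory() as cache_dir:
    index_file = os.path.join(cache_dir, "index.json")
    with open(index_file, "w") as f:
        f.write("{}")
    # The directory form and the full-file form should point at the same file.
    same = resolve_index_path(cache_dir) == resolve_index_path(index_file)

print(same)  # True
```

Nothing in this rule depends on where input_dir lives, which is why I expected a local input_dir to work.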
Additional context
The issue can be worked around by prefixing the path with local:, but the documentation does not make it clear whether that prefix should be required. Thank you in advance!
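For reference, this is roughly how I apply the workaround. `with_local_prefix` and `open_dataset` are hypothetical helpers, and I am assuming the prefix goes directly in front of the input_dir string (i.e. `local:tiny_stories_data/data`):

```python
def with_local_prefix(input_dir: str) -> str:
    """Hypothetical helper: add 'local:' to a plain local path; leave URLs and prefixed paths alone."""
    if input_dir.startswith("local:") or "://" in input_dir:
        return input_dir
    return "local:" + input_dir


def open_dataset(input_dir: str, index_path: str):
    # Imported lazily so the helper above can be used without litdata installed.
    from litdata import StreamingDataset
    from litdata.streaming.item_loader import ParquetLoader

    return StreamingDataset(
        with_local_prefix(input_dir),
        shuffle=True,
        index_path=index_path,
        item_loader=ParquetLoader(),
    )


print(with_local_prefix("tiny_stories_data/data"))  # local:tiny_stories_data/data
```

Calling `open_dataset("tiny_stories_data/data", "my-custom-cache/index.json")` then avoids the ValueError for me, but I would prefer not to need the prefix at all.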