easycv.datasets.loader package
- class easycv.datasets.loader.GroupSampler(dataset, samples_per_gpu=1)
Bases: Generic[torch.utils.data.sampler.T_co]
- class easycv.datasets.loader.DistributedGroupSampler(dataset, samples_per_gpu=1, seed=0, num_replicas=None, rank=None)
Bases: Generic[torch.utils.data.sampler.T_co]
Sampler that restricts data loading to a subset of the dataset. It is especially useful in conjunction with torch.nn.parallel.DistributedDataParallel. In that case, each process can pass a DistributedSampler instance as a DataLoader sampler and load a subset of the original dataset that is exclusive to it.
Note: The dataset is assumed to be of constant size.
- Parameters
dataset – Dataset used for sampling.
seed (int, optional) – Random seed. Defaults to 0.
num_replicas (optional) – Number of processes participating in distributed training.
rank (optional) – Rank of the current process within num_replicas.
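The idea of giving each process its own subset can be illustrated with a small, dependency-free sketch. It is not EasyCV's implementation: `partition_indices` and its `epoch` argument are hypothetical names, and the pad-then-stride scheme is an assumption modelled on how distributed samplers commonly split a shuffled index list.

```python
import random
from itertools import cycle, islice


def partition_indices(dataset_size, num_replicas, rank, seed=0, epoch=0):
    """Return the indices one rank would load (illustrative only)."""
    # Every rank builds the same shuffled order from the shared seed.
    rng = random.Random(seed + epoch)
    indices = list(range(dataset_size))
    rng.shuffle(indices)
    # Pad by wrapping around so the list divides evenly among ranks.
    per_rank = -(-dataset_size // num_replicas)  # ceiling division
    total = per_rank * num_replicas
    padded = list(islice(cycle(indices), total))
    # Strided slice: rank r takes positions r, r + num_replicas, ...
    return padded[rank:total:num_replicas]


# Four hypothetical processes sharing a 10-sample dataset.
shards = [partition_indices(10, num_replicas=4, rank=r) for r in range(4)]
```

Because all ranks shuffle with the same seed, the shards together cover the whole dataset. Only the few padded indices appear on more than one rank.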
- easycv.datasets.loader.build_dataloader(dataset, imgs_per_gpu, workers_per_gpu, num_gpus=1, dist=True, shuffle=True, replace=False, seed=None, reuse_worker_cache=False, odps_config=None, persistent_workers=False, collate_hooks=None, use_repeated_augment_sampler=False, sampler=None, pin_memory=False, **kwargs)
Build a PyTorch DataLoader. In distributed training, each GPU/process has its own dataloader. In non-distributed training, a single dataloader serves all GPUs.
- Parameters
dataset (Dataset) – A PyTorch dataset.
imgs_per_gpu – Number of images on each GPU, i.e., the batch size of each GPU.
workers_per_gpu (int) – How many subprocesses to use for data loading for each GPU.
num_gpus (int) – Number of GPUs. Only used in non-distributed training.
dist (bool) – Distributed training/test or not. Default: True.
shuffle (bool) – Whether to shuffle the data at every epoch. Default: True.
replace (bool) – Whether to sample with replacement in the random shuffle. Only takes effect when shuffle is True.
seed (int, optional) – Random seed. Defaults to None.
reuse_worker_cache (bool) – If True, worker processes are reused so that data cached in them can be reused.
persistent_workers (bool) – Available since PyTorch 1.7. If True, worker processes are kept alive between epochs instead of being recreated, which shortens the startup time of each epoch.
use_repeated_augment_sampler (bool) – If True, RASampler is used. Default: False.
kwargs – Any keyword arguments used to initialize the DataLoader.
- Returns
A PyTorch dataloader.
- Return type
DataLoader
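The difference between the two modes can be sketched as a sizing rule. The function below is hypothetical, and the rule itself is an assumption inferred from the parameter descriptions above (num_gpus is only used in non-distributed training, where one dataloader serves all GPUs). It has not been checked against EasyCV's source.

```python
def loader_sizes(imgs_per_gpu, workers_per_gpu, num_gpus=1, dist=True):
    """Return (batch_size, num_workers) for one DataLoader (assumed rule)."""
    if dist:
        # One loader per process/GPU: the per-GPU values are used directly.
        return imgs_per_gpu, workers_per_gpu
    # A single loader feeds every GPU, so both values scale with num_gpus.
    return num_gpus * imgs_per_gpu, num_gpus * workers_per_gpu


distributed = loader_sizes(32, 4, num_gpus=8, dist=True)
single_process = loader_sizes(32, 4, num_gpus=8, dist=False)
```

Under this assumption, the global batch size is the same in both modes. Only the number of dataloaders that share the work differs.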
- class easycv.datasets.loader.DistributedGivenIterationSampler(dataset, total_iter, batch_size, num_replicas=None, rank=None, last_iter=-1)
Bases: Generic[torch.utils.data.sampler.T_co]
- class easycv.datasets.loader.DistributedMPSampler(dataset, num_replicas=None, rank=None, shuffle=True, split_huge_listfile_byrank=False, **kwargs)
Bases: torch.utils.data.sampler.Sampler[torch.utils.data.distributed.T_co]
- __init__(dataset, num_replicas=None, rank=None, shuffle=True, split_huge_listfile_byrank=False, **kwargs)
A distributed sampler that supports sampling m instances from one class at a time for classification datasets.
- Parameters
dataset – PyTorch dataset object.
num_replicas (optional) – Number of processes participating in distributed training.
rank (optional) – Rank of the current process within num_replicas.
shuffle (optional) – If True (default), the sampler shuffles the indices.
split_huge_listfile_byrank – If True, return all indices for each rank, because the list file has already been split per rank before the dataset is built in distributed training.
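As a rough illustration of drawing m instances per class, the sketch below groups indices by label and emits m indices for each class in turn. It is a generic version of the technique under stated assumptions, not the sampler's actual logic: `m_per_class_order`, the `labels` list, and the fallback to sampling with replacement for small classes are all hypothetical.

```python
import random
from collections import defaultdict


def m_per_class_order(labels, m, seed=0):
    """Return an index order in which each run of m indices shares a class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    order = []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) >= m:
            # Enough items: draw m distinct indices.
            order.extend(rng.sample(members, m))
        else:
            # Too few items: fall back to sampling with replacement.
            order.extend(rng.choices(members, k=m))
    return order


labels = ["cat", "cat", "cat", "dog", "dog", "bird"]
order = m_per_class_order(labels, m=2)
```

Each consecutive pair in `order` points at samples of a single class, which is the property that batch construction would rely on.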
- class easycv.datasets.loader.RASampler(dataset, num_replicas=None, rank=None, shuffle=True, num_repeats: int = 3, **kwargs)
Bases: Generic[torch.utils.data.sampler.T_co]
Sampler that restricts data loading to a subset of the dataset for distributed training, with repeated augmentation. It ensures that each augmented version of a sample is visible to a different process (GPU). Heavily based on torch.utils.data.DistributedSampler.
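Repeated augmentation can be pictured as an index layout. The following is a minimal sketch that assumes the common repeat-then-stride scheme. `repeated_aug_indices` is a hypothetical helper, and shuffling and any truncation of the per-rank list are left out.

```python
from itertools import cycle, islice


def repeated_aug_indices(indices, num_replicas, rank, num_repeats=3):
    """Return the indices one rank sees under repeated augmentation."""
    # Repeat each index back to back: [0, 1] -> [0, 0, 0, 1, 1, 1].
    repeated = [i for i in indices for _ in range(num_repeats)]
    # Pad by wrapping around so every rank gets the same count.
    per_rank = -(-len(repeated) // num_replicas)  # ceiling division
    total = per_rank * num_replicas
    padded = list(islice(cycle(repeated), total))
    # Strided slice: adjacent copies of a sample land on adjacent ranks.
    return padded[rank:total:num_replicas]


# Three hypothetical processes, four samples, three copies of each sample.
per_rank = [repeated_aug_indices(range(4), num_replicas=3, rank=r)
            for r in range(3)]
```

In this sketch, the copies of one sample go to different ranks whenever num_replicas is at least num_repeats. Each process then applies its own random augmentation to its copy.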
- class easycv.datasets.loader.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, replace=False, split_huge_listfile_byrank=False)
Bases: torch.utils.data.sampler.Sampler[torch.utils.data.distributed.T_co]
- __init__(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, replace=False, split_huge_listfile_byrank=False)
A distributed sampler that supports sampling m instances from one class at a time for classification datasets.
- Parameters
dataset – PyTorch dataset object.
num_replicas – Number of processes participating in distributed training.
rank (optional) – Rank of the current process within num_replicas.
shuffle (optional) – If True (default), the sampler shuffles the indices.
seed (int, optional) – Random seed. Defaults to 0.
split_huge_listfile_byrank – If True, return all indices for each rank, because the list file has already been split per rank before the dataset is built in distributed training.
Submodules
easycv.datasets.loader.build_loader module
- easycv.datasets.loader.build_loader.build_dataloader(dataset, imgs_per_gpu, workers_per_gpu, num_gpus=1, dist=True, shuffle=True, replace=False, seed=None, reuse_worker_cache=False, odps_config=None, persistent_workers=False, collate_hooks=None, use_repeated_augment_sampler=False, sampler=None, pin_memory=False, **kwargs)
Build a PyTorch DataLoader. In distributed training, each GPU/process has its own dataloader. In non-distributed training, a single dataloader serves all GPUs.
- Parameters
dataset (Dataset) – A PyTorch dataset.
imgs_per_gpu – Number of images on each GPU, i.e., the batch size of each GPU.
workers_per_gpu (int) – How many subprocesses to use for data loading for each GPU.
num_gpus (int) – Number of GPUs. Only used in non-distributed training.
dist (bool) – Distributed training/test or not. Default: True.
shuffle (bool) – Whether to shuffle the data at every epoch. Default: True.
replace (bool) – Whether to sample with replacement in the random shuffle. Only takes effect when shuffle is True.
seed (int, optional) – Random seed. Defaults to None.
reuse_worker_cache (bool) – If True, worker processes are reused so that data cached in them can be reused.
persistent_workers (bool) – Available since PyTorch 1.7. If True, worker processes are kept alive between epochs instead of being recreated, which shortens the startup time of each epoch.
use_repeated_augment_sampler (bool) – If True, RASampler is used. Default: False.
kwargs – Any keyword arguments used to initialize the DataLoader.
- Returns
A PyTorch dataloader.
- Return type
DataLoader
- easycv.datasets.loader.build_loader.worker_init_fn(worker_id, num_workers, rank, seed, odps_config=None)
- class easycv.datasets.loader.build_loader.InfiniteDataLoader(*args, **kwargs)
Bases: Generic[torch.utils.data.dataloader.T_co]
Dataloader that reuses workers. See https://github.com/pytorch/pytorch/issues/15849. Uses the same syntax as the vanilla DataLoader.
- dataset: torch.utils.data.dataset.Dataset[torch.utils.data.dataloader.T_co]
- batch_size: Optional[int]
- num_workers: int
- pin_memory: bool
- drop_last: bool
- timeout: float
- sampler: Union[torch.utils.data.sampler.Sampler, Iterable]
- pin_memory_device: str
- prefetch_factor: int
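The worker-reuse idea behind InfiniteDataLoader can be sketched without PyTorch. The `ReusingLoader` class below is hypothetical and only mirrors the general pattern assumed here: one endless underlying stream is created once, and each epoch consumes a fixed number of items from it. The `starts` counter stands in for the costly worker startup that a regular loader would repeat every epoch.

```python
class ReusingLoader:
    """Toy loader that keeps one underlying stream alive across epochs."""

    def __init__(self, batches):
        self.batches = list(batches)
        self.starts = 0  # how many times the underlying stream was started
        self._stream = self._endless()  # created once, reused every epoch

    def _endless(self):
        # Runs on the first next() call only; models worker startup.
        self.starts += 1
        while True:
            yield from self.batches

    def __len__(self):
        return len(self.batches)

    def __iter__(self):
        # One epoch is len(self) items pulled from the shared stream.
        for _ in range(len(self)):
            yield next(self._stream)


loader = ReusingLoader(["batch0", "batch1", "batch2"])
epochs = [list(loader) for _ in range(3)]
```

Iterating the loader three times yields the same three batches per epoch while the stream is started only once. This is the effect that reusing workers is meant to achieve.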
easycv.datasets.loader.sampler module
- class easycv.datasets.loader.sampler.DistributedMPSampler(dataset, num_replicas=None, rank=None, shuffle=True, split_huge_listfile_byrank=False, **kwargs)
Bases: torch.utils.data.sampler.Sampler[torch.utils.data.distributed.T_co]
- __init__(dataset, num_replicas=None, rank=None, shuffle=True, split_huge_listfile_byrank=False, **kwargs)
A distributed sampler that supports sampling m instances from one class at a time for classification datasets.
- Parameters
dataset – PyTorch dataset object.
num_replicas (optional) – Number of processes participating in distributed training.
rank (optional) – Rank of the current process within num_replicas.
shuffle (optional) – If True (default), the sampler shuffles the indices.
split_huge_listfile_byrank – If True, return all indices for each rank, because the list file has already been split per rank before the dataset is built in distributed training.
- class easycv.datasets.loader.sampler.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, replace=False, split_huge_listfile_byrank=False)
Bases: torch.utils.data.sampler.Sampler[torch.utils.data.distributed.T_co]
- __init__(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, replace=False, split_huge_listfile_byrank=False)
A distributed sampler that supports sampling m instances from one class at a time for classification datasets.
- Parameters
dataset – PyTorch dataset object.
num_replicas – Number of processes participating in distributed training.
rank (optional) – Rank of the current process within num_replicas.
shuffle (optional) – If True (default), the sampler shuffles the indices.
seed (int, optional) – Random seed. Defaults to 0.
split_huge_listfile_byrank – If True, return all indices for each rank, because the list file has already been split per rank before the dataset is built in distributed training.
- class easycv.datasets.loader.sampler.GroupSampler(dataset, samples_per_gpu=1)
Bases: Generic[torch.utils.data.sampler.T_co]
- class easycv.datasets.loader.sampler.DistributedGroupSampler(dataset, samples_per_gpu=1, seed=0, num_replicas=None, rank=None)
Bases: Generic[torch.utils.data.sampler.T_co]
Sampler that restricts data loading to a subset of the dataset. It is especially useful in conjunction with torch.nn.parallel.DistributedDataParallel. In that case, each process can pass a DistributedSampler instance as a DataLoader sampler and load a subset of the original dataset that is exclusive to it.
Note: The dataset is assumed to be of constant size.
- Parameters
dataset – Dataset used for sampling.
seed (int, optional) – Random seed. Defaults to 0.
num_replicas (optional) – Number of processes participating in distributed training.
rank (optional) – Rank of the current process within num_replicas.
- class easycv.datasets.loader.sampler.DistributedGivenIterationSampler(dataset, total_iter, batch_size, num_replicas=None, rank=None, last_iter=-1)
Bases: Generic[torch.utils.data.sampler.T_co]
- class easycv.datasets.loader.sampler.RASampler(dataset, num_replicas=None, rank=None, shuffle=True, num_repeats: int = 3, **kwargs)
Bases: Generic[torch.utils.data.sampler.T_co]
Sampler that restricts data loading to a subset of the dataset for distributed training, with repeated augmentation. It ensures that each augmented version of a sample is visible to a different process (GPU). Heavily based on torch.utils.data.DistributedSampler.