Add suggestions for distributing large datasets to book
type: documentation pr: https://github.com/casey/intermodal/pull/360
This commit is contained in:
parent
ff6f6d4c3d
commit
1c9ff0cde4
|
@ -4,7 +4,8 @@ Changelog
|
|||
|
||||
UNRELEASED - 2020-04-11
|
||||
-----------------------
|
||||
- :white_check_mark: [`xxxxxxxxxxxx`](https://github.com/casey/intermodal/commits/master) Test that `--glob`s match entire file paths ([#357](https://github.com/casey/intermodal/pull/357)) - _Casey Rodarmor <casey@rodarmor.com>_
|
||||
- :books: [`xxxxxxxxxxxx`](https://github.com/casey/intermodal/commits/master) Add suggestions for distributing large datasets to book ([#360](https://github.com/casey/intermodal/pull/360)) - _Casey Rodarmor <casey@rodarmor.com>_
|
||||
- :white_check_mark: [`ff6f6d4c3de1`](https://github.com/casey/intermodal/commit/ff6f6d4c3de1a14c6b2ebef270c0ec542300f0de) Test that `--glob`s match entire file paths ([#357](https://github.com/casey/intermodal/pull/357)) - _Casey Rodarmor <casey@rodarmor.com>_
|
||||
- :books: [`b914c175949f`](https://github.com/casey/intermodal/commit/b914c175949fa6063b6fb0428f4ebd66a51fdda3) Add buildtorretn to prior art section of book ([#355](https://github.com/casey/intermodal/pull/355)) - _Casey Rodarmor <casey@rodarmor.com>_
|
||||
|
||||
|
||||
|
|
|
@ -15,6 +15,7 @@ Summary
|
|||
- [`imdl torrent verify`](./commands/imdl-torrent-verify.md)
|
||||
|
||||
- [Bittorrent](./bittorrent.md)
|
||||
- [Distributing Large Datasets](./bittorrent/distributing-large-datasets.md)
|
||||
- [BEP Support](./bittorrent/bep-support.md)
|
||||
- [Alternatives & Prior Art](./bittorrent/prior-art.md)
|
||||
- [UDP Tracker Protocol](./bittorrent/udp-tracker-protocol.md)
|
||||
|
|
151
book/src/bittorrent/distributing-large-datasets.md
Normal file
|
@ -0,0 +1,151 @@
|
|||
Distributing Large Data Sets
|
||||
============================
|
||||
|
||||
Even though BitTorrent is well-suited for distributing large amounts of data,
|
||||
very large torrents can still cause problems. Here are some of the problems you
|
||||
might encounter, as well as suggestions for how to avoid or ameliorate those
|
||||
issues.
|
||||
|
||||
Intermodal currently uses a single-threaded piece hashing algorithm. If you're
|
||||
distributing a large data set and hashing time is a problem, please open an
|
||||
issue! I'm eager to improve hashing performance, but want to make sure I do it
|
||||
in such a way that real workloads benefit.
|
||||
|
||||
|
||||
Background
|
||||
----------
|
||||
|
||||
In order to support incremental download and verification, as well as
|
||||
resumption of partial downloads, the contents of a torrent are broken into
|
||||
pieces.
|
||||
|
||||
The length of pieces is configurable, and the ideal choice of piece
length depends on many factors, but values between 16 KiB and 256 KiB are common.
Very large torrents may use much larger piece lengths, like 16 MiB.
|
||||
|
||||
Each piece is hashed, and `.torrent` files, also referred to as metainfo,
|
||||
contain a list of those hashes.
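
As a rough illustration of what "each piece is hashed" means, here is a minimal sketch that splits a single file into fixed-size pieces and prints one SHA-1 digest per piece. It is not how `imdl` is implemented: the function name `piece_hashes` is made up for this example, it assumes `split`, `mktemp`, and `sha1sum` are available (as on most Linux systems), and it ignores multi-file torrents, where pieces can span file boundaries. Real metainfo stores the raw 20-byte digests concatenated together rather than hex strings.

```shell
#!/bin/sh
# Hypothetical helper: print one SHA-1 hex digest per piece of a single file.
# Assumes `split`, `mktemp`, and `sha1sum` are installed.
piece_hashes() {
  file=$1
  piece_length=$2   # piece length in bytes
  tmp=$(mktemp -d) || return 1
  # Write each piece to its own temporary file; -a 8 allows many pieces.
  split -a 8 -b "$piece_length" "$file" "$tmp/piece."
  for piece in "$tmp"/piece.*; do
    [ -e "$piece" ] || continue   # an empty input produces no pieces
    sha1sum < "$piece" | cut -d ' ' -f 1
  done
  rm -rf "$tmp"
}

# Example: a 10-byte file with 4-byte pieces yields three hashes,
# the last one covering a short final piece.
example=$(mktemp)
printf 'aaaabbbbcc' > "$example"
piece_hashes "$example" 4
rm -f "$example"
```

Because every piece has its own hash, a client can verify and resume a download piece by piece instead of re-checking the whole data set.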
|
||||
|
||||
For all the example commands, I'll be using `dir` for the directory containing
|
||||
the data set you want to share.
|
||||
|
||||
|
||||
Issues
|
||||
------
|
||||
|
||||
### `.torrent` file too large
|
||||
|
||||
When the amount of data is large, or the piece length is small, the number of
|
||||
pieces can make the `.torrent` file very big.
|
||||
|
||||
To avoid this, you can either break the data into multiple torrents, or make
|
||||
the piece length larger, so the `.torrent` file contains fewer pieces.
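
To get a feel for how piece length affects `.torrent` size, the following sketch estimates the size of the piece hash list alone. It assumes one 20-byte SHA-1 hash per piece, as in the original BitTorrent metainfo format, and ignores everything else in the file (file names, tracker URLs, and so on). The function names `piece_count` and `hash_list_bytes` are made up for this example.

```shell
#!/bin/sh
# Hypothetical helpers for estimating the piece hash list size.
# Assumption: one 20-byte SHA-1 hash per piece; other metainfo is ignored.
piece_count() {
  total_bytes=$1
  piece_length=$2
  # Round up: a final partial piece still gets its own hash.
  echo $(( (total_bytes + piece_length - 1) / piece_length ))
}

hash_list_bytes() {
  echo $(( $(piece_count "$1" "$2") * 20 ))
}

one_tib=$(( 1024 * 1024 * 1024 * 1024 ))
hash_list_bytes "$one_tib" $(( 256 * 1024 ))         # 1 TiB, 256 KiB pieces
hash_list_bytes "$one_tib" $(( 16 * 1024 * 1024 ))   # 1 TiB, 16 MiB pieces
```

For 1 TiB of data, this prints 83886080 (80 MiB of hashes) with 256 KiB pieces and 1310720 (1.25 MiB) with 16 MiB pieces.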
|
||||
|
||||
#### Breaking data into multiple torrents
|
||||
|
||||
`imdl torrent create` has a `--glob` option that can be used to control which
|
||||
files are included in a torrent. If your data set is divided into multiple
|
||||
files, ideally with a consistent naming scheme, this can be used to easily
|
||||
create multiple torrents with different subsets of the data.
|
||||
|
||||
The name of the created torrent is usually derived from the name of the input,
|
||||
so the output torrent name should be given manually to avoid conflicts:
|
||||
|
||||
$ imdl torrent create -i dir -o a.torrent --glob 'dir/0*'
|
||||
$ imdl torrent create -i dir -o b.torrent --glob 'dir/1*'
|
||||
$ imdl torrent create -i dir -o c.torrent --glob 'dir/2*'
|
||||
# etc…
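
If there are many subsets, the commands above can be generated in a loop. This is a sketch only: the `torrent_commands` helper is made up for this example, it prints the commands rather than running them so that they can be reviewed first, and it names each torrent after its prefix, which assumes prefixes without spaces or shell metacharacters. It uses only the `-i`, `-o`, and `--glob` options shown above.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) one `imdl torrent create` command
# per filename prefix, using only the options shown in this guide.
torrent_commands() {
  input=$1
  shift
  for prefix in "$@"; do
    printf "imdl torrent create -i %s -o %s.torrent --glob '%s/%s*'\n" \
      "$input" "$prefix" "$input" "$prefix"
  done
}

# Print commands for files in `dir` starting with 0, 1, and 2.
torrent_commands dir 0 1 2
```

Once the printed commands look right, they can be pasted into a shell or piped to `sh`.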
|
||||
|
||||
#### Making the piece length larger
|
||||
|
||||
`imdl` has an automatic piece length picker, which should choose a good piece
|
||||
length. You can see what choices it makes for different torrent sizes with:
|
||||
|
||||
$ imdl torrent piece-length
|
||||
|
||||
Some torrent clients don't do well with piece lengths over 16 MiB, so the piece
|
||||
length picker will never pick piece lengths over 16 MiB. This can be
|
||||
overridden by specifying `--piece-length` manually. `--piece-length` takes
|
||||
binary (IEC) units, like `KiB`, `MiB`, and `GiB`:
|
||||
|
||||
$ imdl torrent create -i dir --piece-length 128mib
|
||||
|
||||
|
||||
### Too many files
|
||||
|
||||
Torrents containing a large number of separate files can cause performance
|
||||
issues. It's not clear if these performance issues are due to BitTorrent client
|
||||
implementations, host OS file system issues, or both.
|
||||
|
||||
#### Distributing your data set as an ISO image
|
||||
|
||||
By distributing your data set as an ISO image, all the files in your torrent
|
||||
will be packed into a single `.iso` file. Additionally, recipients of the ISO
|
||||
won't have to decompress the whole data set to browse or extract individual
|
||||
files.
|
||||
|
||||
You can create an ISO with `genisoimage`, which can be installed on Debian or
|
||||
Ubuntu with:
|
||||
|
||||
$ sudo apt install genisoimage
|
||||
|
||||
To create a compressed ISO containing your data set:
|
||||
|
||||
# Flags: compress data in the ISO, don't mangle filenames, verbose output,
# output path, volume name; the final argument is the input path.
$ genisoimage \
  -transparent-compression \
  -untranslated-filenames \
  -verbose \
  -output data.iso \
  -V DATA_SET_NAME \
  dir
|
||||
|
||||
The same command, but with short flags:
|
||||
|
||||
$ genisoimage -zUvo data.iso -V DATA_SET_NAME dir
|
||||
|
||||
A torrent can then be created containing the ISO:
|
||||
|
||||
$ imdl torrent create --input data.iso
|
||||
|
||||
Users can mount and unmount the ISO on Linux:
|
||||
|
||||
$ sudo mkdir -p /mnt # create mount point
|
||||
$ sudo mount --read-only data.iso /mnt # mount ISO
|
||||
$ sudo umount /mnt # unmount when finished
|
||||
|
||||
Or on macOS:
|
||||
|
||||
$ hdiutil mount data.iso # mount ISO
|
||||
$ hdiutil unmount /Volumes/DATA_SET_NAME # unmount when finished
|
||||
|
||||
On Windows, macOS, and some Linux desktop environments, ISOs can also be
|
||||
mounted by double-clicking the file.
|
||||
|
||||
|
||||
### Torrent Client Issues
|
||||
|
||||
Some torrent clients don't do well with torrents with large piece sizes, many
|
||||
files, or a large amount of data.
|
||||
|
||||
#### Switch to a `libtorrent`-based client
|
||||
|
||||
If you're experiencing issues downloading a large data set, switching torrent
|
||||
clients may help.
|
||||
|
||||
In my personal experience, torrent clients that use Arvid Norberg's
|
||||
`libtorrent` have done well with large amounts of data.
|
||||
|
||||
`libtorrent`'s [Wikipedia page](https://en.wikipedia.org/wiki/Libtorrent) has a
|
||||
[list](https://en.wikipedia.org/wiki/Libtorrent#Applications) of torrent
|
||||
clients that use `libtorrent`.
|
||||
|
||||
|
||||
Conclusion
|
||||
----------
|
||||
|
||||
If you have suggestions for this guide, please don't hesitate to open an
|
||||
[issue](https://github.com/casey/intermodal/issues).
|
||||
|
||||
In particular, if you've found specific torrent clients to be good or bad at
|
||||
downloading large data sets, or have run into issues or found solutions not
|
||||
covered by this guide, I would love to know!
|