In simple terms, how does a BitTorrent client initially discover peers using DHT?
I've already read this SuperUser answer and this Wikipedia article but both are too technical for me to really wrap my head around.
I understand the idea of a tracker: clients connect to a central server which maintains a list of peers in a swarm.
I also understand the idea of peer exchange: clients already in a swarm send the complete list of their peers to each other. If new peers are discovered, they are added to the list.
My question is, how does DHT work? That is, how can a new client join a swarm without either a tracker or the knowledge of at least one member of the swarm to exchange peers with?
(Note: simple explanations are best.)
Solution 1:
Summary
How can a new client join a swarm without either a tracker or the knowledge of at least one member of the swarm to exchange peers with?
You can't. It is impossible.*
* (Unless a node on your local area network happens to already be a node in the DHT. In this case, you could use a broadcasting mechanism, such as Avahi, to "discover" this peer, and bootstrap from them. But how did they bootstrap themselves? Eventually, you'll hit a situation where you need to connect to the public Internet. And the public Internet is unicast-only, not multicast, so you're stuck with using pre-determined lists of peers.)
References
BitTorrent's DHT is based on a protocol known as Kademlia, which is a special case of the theoretical concept of a distributed hash table.
Exposition
With the Kademlia protocol, when you join the network, you go through a bootstrapping procedure, which absolutely requires that you know, in advance, the IP address and port of at least one node already participating in the DHT network. The tracker that you connect to, for instance, may itself be a DHT node. Once you are connected to one DHT node, you download information from the DHT that gives you connectivity information for more nodes. You then navigate that "graph" structure to reach more and more nodes, which lead you both to further nodes and, eventually, to the peers that hold the payload data (chunks of the download).
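The bootstrapping idea above can be sketched with a toy, in-memory network. This is only an illustration under stated assumptions: the `NETWORK` table and the `discover` function are made up for this example, a dictionary lookup stands in for a real network query, and a real Kademlia client performs targeted lookups rather than visiting every node it hears about.

```python
from collections import deque

# Hypothetical toy network: each node knows a few other nodes (its routing table).
NETWORK = {
    "bootstrap": ["a", "b"],
    "a": ["bootstrap", "c"],
    "b": ["a", "d"],
    "c": ["d", "e"],
    "d": ["b"],
    "e": ["c"],
}

def discover(known_nodes, network):
    """Return every node reachable by repeatedly asking known nodes for their contacts."""
    seen = set(known_nodes)
    queue = deque(known_nodes)
    while queue:
        node = queue.popleft()
        # The dictionary lookup stands in for sending a query over the network.
        for contact in network.get(node, []):
            if contact not in seen:
                seen.add(contact)
                queue.append(contact)
    return seen

if __name__ == "__main__":
    print(sorted(discover(["bootstrap"], NETWORK)))  # one known node is enough
    print(sorted(discover([], NETWORK)))             # no known node: nothing found
```

Starting from a single known node, the whole toy network becomes reachable; starting from an empty list, nothing can be discovered, which is the point made above.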
I think your actual question in bold -- that of how to join a Kademlia DHT network without knowing any other members -- is based on a false assumption.
The simple answer to your question in bold is, you don't. If you do not know ANY information at all about even one host which might contain DHT metadata, you are stuck -- you can't even get started. I mean, sure, you could attempt a brute-force scan of the public Internet for an IP with an open port that happens to serve DHT information. But more likely, your BT client is hard-coded with some specific static IP address or DNS name that resolves to a stable DHT node, which just provides the DHT metadata.
Basically, the DHT is only as decentralized as the joining mechanism, and because the joining mechanism is fairly brittle (there's no way to "broadcast" over the entire Internet! so you have to unicast to an individual pre-assigned host to get the DHT data), Kademlia DHT isn't really decentralized. Not in the strictest sense of the word.
Imagine this scenario: Someone who wants P2P to stop goes out and prepares an attack on all commonly used stable DHT nodes which are used for bootstrapping. Once they've staged their attack, they spring it on all nodes at once. Wham; every single bootstrapping DHT node is down in one fell swoop. Now what? You're stuck with connecting to centralized trackers to download traditional lists of peers from those. Well, if they attack the trackers too, then you're really, really up a creek. In other words, Kademlia and the entire BT network are constrained by the limitations of the Internet itself, in that there is a finite (and relatively small) number of computers that you would have to successfully attack or take offline to prevent >90% of users from connecting to the network.
Once the "pseudo-centralized" bootstrapping nodes are all gone, the interior nodes of the DHT, which are not bootstrapping because nobody on the outside of the DHT knows about the interior nodes, are useless; they can't bring new nodes into the DHT. So, as each interior node disconnects from the DHT over time, either due to people shutting down their computers, rebooting for updates, etc., the network would collapse.
Of course, to get around this, someone could deploy a patched BitTorrent client with a new list of pre-determined stable DHT nodes or DNS addresses, and loudly advertise to the P2P community to use this new list instead. But this would become a "whack-a-mole" situation where the aggressor (the node-eater) would progressively download these lists themselves, and target the brave new bootstrapping nodes, then take them offline, too.
Solution 2:
Short answer: It gets it from the .torrent file.
When a BitTorrent client generates a trackerless .torrent file (that is, when someone is getting ready to share something new via BitTorrent), it adds a "nodes" key (key as in "key/value pair"; like a section header, not a crypto key) to the .torrent file that contains the K closest DHT nodes known to that client.
http://www.bittorrent.org/beps/bep_0005.html#torrent-file-extensions
A trackerless torrent dictionary does not have an "announce" key. Instead, a trackerless torrent has a "nodes" key. This key should be set to the K closest nodes in the torrent generating client's routing table. Alternatively, the key could be set to a known good node such as one operated by the person generating the torrent. Please do not automatically add "router.bittorrent.com" to torrent files or automatically add this node to clients routing tables.
So when you feed your BitTorrent client the .torrent file of a trackerless torrent that you want to download, it uses the value of that "nodes" key from the .torrent file to find its first few DHT nodes.
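As a rough sketch of that step, the snippet below decodes a bencoded .torrent file and reads its "nodes" key, which BEP 5 describes as a list of [host, port] pairs. The function names `bdecode` and `bootstrap_nodes` and the `SAMPLE` bytes are made up for this example; the sample is deliberately trimmed and is not a complete, valid torrent.

```python
def bdecode(data, i=0):
    """Decode one bencoded value from bytes at index i; return (value, next_index)."""
    c = data[i:i + 1]
    if c == b"i":                       # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":                       # list: l<items>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = bdecode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":                       # dictionary: d<key><value>...e
        i += 1
        result = {}
        while data[i:i + 1] != b"e":
            key, i = bdecode(data, i)
            value, i = bdecode(data, i)
            result[key] = value
        return result, i + 1
    colon = data.index(b":", i)         # byte string: <length>:<bytes>
    length = int(data[i:colon])
    start = colon + 1
    return data[start:start + length], start + length

def bootstrap_nodes(torrent_bytes):
    """Return the (host, port) pairs stored under the "nodes" key, if any."""
    meta, _ = bdecode(torrent_bytes)
    return [(host.decode(), port) for host, port in meta.get(b"nodes", [])]

# Made-up, trimmed trackerless torrent: an "info" dictionary plus a "nodes" list.
SAMPLE = (b"d4:infod6:lengthi12e4:name8:demo.txte"
          b"5:nodesll9:127.0.0.1i6881eel11:example.orgi4804eeee")

if __name__ == "__main__":
    for host, port in bootstrap_nodes(SAMPLE):
        print(host, port)
```

A torrent without a "nodes" key simply yields an empty list, in which case the client has to fall back on some other source of bootstrap nodes.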
Solution 3:
You can't! You have to know at least one IP address of a member of the swarm; this is the weakness of a P2P network. You could blindly broadcast to find the first IP, but in a large network, if everybody did that, we would have a congestion problem. You can use a cache, but that only works for large swarms (which have a larger peer address cache). Otherwise you always have to connect to a tracker, if only to ask for the first IP.
"Distributed" in DHT means that no single client has to hold the whole list that maps the SHA-1 hash (infohash) of each shared torrent to its corresponding peers. The list of hashes is split into roughly equal parts and distributed with redundancy throughout the network. If a peer disconnects, another one somewhere holds the same part of the hash list. The peers tell each other the address of the right holder of each part of the hash list.
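To make the "split into parts" idea concrete: in Kademlia, the nodes responsible for a given hash are the ones whose IDs are closest to it under XOR distance. The sketch below is a simplification with assumed names (`sha1_int`, `closest_nodes`); real node IDs are chosen randomly, whereas here they are derived from made-up node names only to keep the example deterministic.

```python
import hashlib

def sha1_int(text):
    """Map a string to a 160-bit integer, the same size as a DHT key."""
    return int.from_bytes(hashlib.sha1(text.encode()).digest(), "big")

def closest_nodes(target_id, node_ids, k=3):
    """Return the k node IDs closest to target_id under XOR distance."""
    return sorted(node_ids, key=lambda node_id: node_id ^ target_id)[:k]

if __name__ == "__main__":
    # Hypothetical node IDs, derived from names purely for illustration.
    nodes = {sha1_int(f"node-{i}"): f"node-{i}" for i in range(20)}
    infohash = sha1_int("some torrent metadata")
    for node_id in closest_nodes(infohash, nodes, k=3):
        print(nodes[node_id], "stores the peer list for this infohash")
```

Because several nodes (k of them) hold each entry, one of them going offline does not lose the peer list, which is the redundancy described above.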
TorrentFreak wrote a post on this subject.
Solution 4:
How can a new client join a swarm without either a tracker or the knowledge of at least one member of the swarm to exchange peers with?
It asks for it.
BitTorrent clients that support the DHT run two separate peer-to-peer applications.
The first one does the file sharing: a swarm, in BitTorrent lingo, is a group of peers sharing a BitTorrent object (e.g. a file or directory structure). Each BitTorrent object has some metadata that is saved in a .torrent file. (It includes the object size, the name of the folder, possibly tracker information or nodes, etc.) The hash of the metadata required to download this BitTorrent object is called the infohash.
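More precisely, the infohash is the SHA-1 hash of the bencoded "info" dictionary inside the .torrent file. The following is a minimal sketch of that computation; the `bencode` helper and the sample `info` dictionary are written for this example and contain placeholder values, not data from a real torrent.

```python
import hashlib

def bencode(value):
    """Encode ints, bytes, str, lists and dicts in bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, str):
        return bencode(value.encode("utf-8"))
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")

def infohash(info):
    """Return the hex SHA-1 of the bencoded info dictionary."""
    return hashlib.sha1(bencode(info)).hexdigest()

if __name__ == "__main__":
    # Placeholder metadata; "pieces" would normally hold real SHA-1 piece hashes.
    info = {"name": "demo.txt", "length": 12,
            "piece length": 16384, "pieces": b"\x00" * 20}
    print(infohash(info))
```

Any change to the info dictionary, such as a different file name, produces a different infohash, so the infohash identifies exactly one torrent.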
The DHT is basically a second P2P application that aims to replace trackers: it stores pairs of (infohash, swarm) and updates the swarm when it receives announce messages. A new client must know some "node" (BitTorrent lingo for a peer of the DHT) to bootstrap its information about the DHT. Here the arguments given by @allquixotic apply. As the MDHT (Mainline DHT) currently consists of over 7 million peers, a sustained denial-of-service attack seems unlikely.
The client can then query the DHT for an infohash, and does not have to use a tracker or already know a peer that is part of the swarm. If one of the peers it contacts supports sharing metadata, it needs only the infohash and can retrieve the .torrent file from the swarm.
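The (infohash, swarm) bookkeeping can be pictured with a small single-process stand-in. The class `ToyDHT` is hypothetical: the method names mirror the `announce_peer` and `get_peers` queries of BEP 5, but a real DHT spreads this table across many nodes and talks over UDP, and the addresses below are documentation-only examples.

```python
from collections import defaultdict

class ToyDHT:
    """Single-process stand-in for the table a DHT keeps: infohash -> swarm."""

    def __init__(self):
        self._swarms = defaultdict(set)

    def announce_peer(self, infohash, peer):
        """Record that peer (a (host, port) tuple) is sharing this infohash."""
        self._swarms[infohash].add(peer)

    def get_peers(self, infohash):
        """Return the known peers for this infohash, or an empty list."""
        return sorted(self._swarms.get(infohash, set()))

if __name__ == "__main__":
    dht = ToyDHT()
    key = "0123456789abcdef0123456789abcdef01234567"  # made-up infohash
    dht.announce_peer(key, ("192.0.2.10", 6881))
    dht.announce_peer(key, ("198.51.100.7", 51413))
    print(dht.get_peers(key))        # a new client learns the swarm from the DHT
    print(dht.get_peers("unknown"))  # nobody has announced this infohash
```

A new client that knows only the infohash can call the equivalent of `get_peers` and receive swarm members without ever contacting a tracker.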