Trawling Tor Hidden Service – Mapping the DHT

Update 2013-08-15: I have been really enthused by reactions I received to this blog post. It has been referenced from Forbes, Gawker and the Daily Mail and a number of people have been in contact about tracking the DHT for themselves. I would recommend the IEEE S&P paper, “Trawling for Tor Hidden Services: Detection, Measurement, Deanonymization” which presents the same issues allowing the DHT to be trawled. They also present some very serious attacks allowing an adversary to locate hidden services with practical resources. It is well worth checking out if you have an interest in Tor.

Tor hidden services have got more media attention lately as a result of some notorious sites like the Silk Road marketplace, an online black market. On a basic level, Tor hidden services allow you to make TCP services available while keeping your server’s physical location hidden via the Tor anonymity network.

TL;DR

  • Tor hidden service directories (HSDir’s) receive a subset of hidden service look-ups from users, allowing them to map relative popularity/usage of hidden service.
  • An adversary with minimal resources can carry out complete DoS attacks of Tor hidden services by running malicious Tor hidden service directories and positioning them in a particular part of the router list.
  • Many look-ups for Tor hidden services go to the incorrect hidden service directories which negatively affects the initial time to access the site.
  • Hidden services such as are popular, sites such as the Silk Road marketplace receive more than 60,000 unique user sessions a day.

Introduction

For users to access a hidden service they must first retrieve a hidden service descriptor. This is a short signed message created by the hidden service approximately every hour contain a list of introduction nodes and some other identifying data such as the descriptor id (desc id). The desc ID is based on a hash of some hidden service information and it changes every 24 hours. This calculation is outlined later in the post. The hidden service then publishes its updated descriptor to a set of 6 responsible hidden service directories (HSDir’s) every hour. These responsible HSDir’s are regular node on the Tor network which have up-time longer than 24 hours and which have received the HSDir flag from the directory authorities. The set of responsible HSDir’s is based on their position of the current descriptor id in a list of all current HSDir’s ordered by their node fingerprint. This is an implementation of a simple DHT (Distributed Hash Table).

A client who would like to retrieve the full HS descriptor will calculate the time based descriptor id and will request it from the responsible HSDir’s directly.

Descriptor is published to a set of HSDir's

Descriptor is published to a set of HSDir’s

Once the client has a copy of the hidden service descriptor they can attempt to connect to one the introduction point and create a complete 7 hop circuit to the hidden service.

This is a very brief explanation which the Tor Project has outlined much more clearly on their hidden service protocol page but it should provide enough information to understand the following information.

The Problems

  • Any one can set up a Tor node (HSDir) and begin logging all hidden service descriptors published to their node. They will also receive all client requests allowing them to observe the number of look-ups for particular hidden services.
  • The list of responsible HSDir’s for a hidden service is based on the calculated descriptor ID and an ordered list of HSDir fingerprints. As the descriptor ID’s are predictable, and the node fingerprint is controlled by an adversary. The can position themselves to be the responsible directories for a targeted hidden service and subsequently perform a DoS attack by not returning the hidden service descriptor to clients. There is nothing the hidden service can do about this attack. They must rely on the Tor Project authority directories to remove the malicious HSDir’s from the consensus.

Information Gathering from the DHT

Hidden services must publish their descriptors to allow clients to reach them. These descriptor id’s are deterministic but over time any HSDir should have an equal chance of receiving the descriptor for any particular hidden service as part of the DHT. As the descriptor is replicated across 6 HSDir’s, any single responsible HSDir should receive up to 1/6 of all client look-ups for that hidden service providing a good means of estimating hidden service popularity.

This data can be made more accurate by running more Tor HSDir’s which have identity digest‘s in the responsible range, to the point where all the responsible HSDir’s are controlled by the observer.

There are currently about 1400 nodes with the ‘HSDir’ flag running. For the past 2 months I have run 4 Tor nodes and logged all hidden service requests that they have received. Each hidden service publishes to 6 HSDir’s every day, and will publish to 360 or approximately 25% of HSDir’s in a 2 month period. As I am running 4 nodes I statistically should have received a copy of every hidden service that was online for those 60 days. While the distribution will not be perfect, my data should contain a representative set of hidden service activity. Here are some stats from those 4 nodes:

  • Received 1.3 million requests for 3815 unique hidden services my nodes were responsible for.
  • Got 16.03 million requests for descriptors my nodes were not currently responsible for.
  • Received published descriptors for 40,500 unique hidden services
  • Received requests from clients for 25,600 unique descriptor id’s.

Even these basic stats point out a number of issues with the way hidden services currently function. 16.03 million or 92.5% of all descriptor requests my nodes received were for hidden services I did not currently have descriptors for. This could be a result of the clients having an out-of-date network consensus and not choosing the correct HSDir’s as responsible. It could also be a result of an out of sync time which causes the clients to look for the wrong/old descriptor id’s. Either way, a significant amount of time is being wasted on descriptor look-up which slows down the time when first accessing a hidden service.

Some More Stats..

The following table contains data on requests my nodes received for some well known Tor based marketplaces. There are also requests for related phishing sites. I have confirmed that some users were directed to these phishing pages from links on the “The Hidden Wiki” (.onion). The number of requests is what my nodes observed. Descriptor look-ups from clients will be divided at random between the 6 responsible HSDir’s and clients will keep trying the remaining HSDir’s until the descriptor is found or all HSDir’s have been checked. As a result the total number of requests received by the following sites per day may be up to 6 times more than the figures below, but these still offer a relative guide to popularity.

Phishing site with a directory listing via forced browsing

Phishing site with a directory listing via forced browsing

Text file containing phished credentials

Text file containing phished credentials

It seems a lot of people, especially scammer’s running phishing sites on Tor hidden services don’t know a lot about web security which leads to sites like this.

Please users, if you enter your credentials on a phishing page, and it doesn’t log you into the site. Don’t try again!

I also observed a large number of requests to the command and control servers of the “Skynet” Tor based botnet which got attention after an AMA with the botnet owner on Reddit. Contact him @skynetbnet on Twitter for more info.

Many of the “Skynet” onion addresses above and other popular addresses are running Bitcoin mining proxies. They generally responded with a basic authentication request for “bitcoin-mining-proxy”. The other services may be Tor based bitcoin mining pools or part of Skynet and/or other botnets. It should be straight forward to find these sites by scanning the service id’s from the raw data on Github.

DoS Attacks on Tor Hidden Services

Tor hidden service desc_id‘s are calculated deterministically and if there is no ‘descriptor cookie’ set in the hidden service Tor config anyone can determine the desc id‘s for any hidden service at any point in time.This is a requirement for the current hidden service protocol as clients must calculate the current descriptor id to request hidden service descriptors from the HSDir’s. The descriptor ID’s are calculated as follows:

descriptor-id = H(permanent-id | H(time-period | descriptor-cookie | replica))

The replica is an integer, currently either 0 or 1 which will generate two separate descriptor ID’s, distributing the descriptor to two sets of 3 consecutive nodes in the DHT. The permanent-id is derived from the service public key. The hash function is SHA1.

time-period = (current-time + permanent-id-byte * 86400 / 256) / 86400

The time-period changes every 24 hours. The first byte of the permanent_id is added to make sure the hidden services do not all try to update their descriptors at the same time.

identity-digest = H(server-identity-key)

The identity-digest is the SHA1 hash of the public key generated from the secret_id_key file in Tor’s keys directory. Normally it should never change for a node as it is used for to determine the router’s long-term fingerprint, but the key is completely user controlled.

A HSDir is responsible if it is one of the three HSDir’s after the calculated desc id in a descending lists of all nodes in the Tor consensus with the HSDir flag, sorted by their identity digest.  The HS descriptor is published to two replica‘s (two set’s of 3 HSDir’s at different points of the router list) based on the two descriptor id’s generated as a result of the ’0′ or ’1′ replica value in the descriptor id hash calculation.

I have implemented a script calculating the descriptor ID’s for a particular hidden service at an arbitrary time and it is available on my Github account. I have also created a modified version of ‘Shallot‘ which can be used to generate keys with an identity key in a specified range. The is more usage information on its Github page.

The Attack

The code listed above could be used to generate identity keys and identity digests for an adversary’s HSDir nodes so that they’ll be selected as 6 of the HSDir’s for a targeted hidden service. These adversary controlled hidden service directories could simple return no data (404 Response) to a client requesting the targeted hidden service’s descriptor and in turn prevent them from finding introduction nodes. As there are no other sources for this hidden service descriptor it would be impossible for a user to set up a circuit a complete circuit to the hidden service and there would be a complete denial of service until the descriptor id changes.

For an adversary to continue this attack over a longer time-frame, they would need to set up their nodes for the upcoming desc id’s of the targeted hidden service more than 24 hours in advance to make sure they will have received the HSDir flag.

An adversary would need to run 12-18 nodes to keep up a complete, persistent DoS on the targeted hidden service. Six nodes would be the “responsible HSDir’s” and the other nodes would be running with identity digests in the range of the upcoming desc id’s, to gain the HSDir flags after 24 hours of up-time. An adversary can cut the resources needed by running two Tor instances/nodes per IPv4 IP or by running the Tor nodes on compromised servers on high, unprivileged ports.

These attacks a quite a real, practical threat against the availability of Tor hidden services. For whatever reason (extortion, censorship etc.) adversary can perform complete DoS attacks with minimal resources and there are no actions hidden service owner can do to mitigate, besides switching to descriptor cookie based authentication or multiple private address. The Tor project can try deal with these attacks by removing known malicious HSDir’s from the network consensus but I don’t see a straight forward way to identify these malicious nodes.

Unfortunately there are no easy solutions to this problem at the moment. I can foresee adversaries employing these attacks in the wild against popular hidden services.

Conclusions

Tor hidden services were originally implemented as some a simple feature on top of the Tor network and unfortunately they haven’t received the attention and love the deserve for such a popular feature. There are discussions under way to re-implement hidden services to alllow them to scale more efficiently. There are also people looking at reducing the ability of HSDir’s to sit and gather data on onion address and look-ups like I have done, by implementing a PIR protocol. A good summary of work that needs to be done is available in a blog post on the Tor Project’s website. I’d urge any developers with an interest to join the tor-dev mailing list and see if there is something  you can contribute! A lot of work is needed.

Anyone interested in learning more about the issues with the current Tor hidden service implementation please check out the presentation “Trawling for Tor Hidden Services: Detection, Measurement, Deanonymization” at the IEEE S&P conference this Monday by Alex Biryukov, Ivan Pustogarov and RalfPhilipp Weinmann will probably provide a much more in-depth, formal investigation and I too look forward to reading it. We have been researching similar areas so I’m very interested in their approach and results.

All raw data and my modified Tor clients are available from my Github repo. This also contains scripts for calculating hidden service descriptors and generating OR private keys with fingerprints in a particular range. This data includes all descriptor requests and hidden service ID’s my nodes observed. Please check it out if you are interested in analyzing a random subset of services on the Tor network.

I’d like to thank @mikesligo for having a heated debate in the pub with me about hidden services and getting me interested in how they work. I’d also like to thank @CiaranmaK for reading a draft of this post and pointing out some corrections.

Thank you for reading my first blog post. I’ll have to work on presenting things better as the information in this post is a bit all over the place. Please let me know if you have any questions or feedback in the comments below.

13 thoughts on “Trawling Tor Hidden Service – Mapping the DHT

  1. gwern

    Interesting… so if I am interpreting your .onion table right, that implies that in April & May 2013, you found a lower bound of 27,836 visitors to SR & 327 to SR phishing sites (so 1.17% of would-be SR visitors were exposed to a phishing site?) and an upper bound of 167,016/1,962 (respectively).

    These directory lookups are one per visit to a hidden service, I take it and the results are subsequently cached?

    Reply
    1. Donncha Post author

      Hi Gwern,

      It is a small sample set. My nodes were only responsible for SR on two days. So it would be an absolute lower bound of 16387 in one day and an upper bound of 98322. This is a very wide range but it gives an idea of the order of magnitude. I wrote a script today to check how many responsible HSDir’s are returning the descriptor. I realize now I should of ran it as I found each service ID to be able to utilize this data better. It seems pretty changeable.

      I think it reasonable that ~1% of would be SR visitors were directed to a phishing page on a particular day it was promoted. Someone was spamming the site on Reddit – Reddit – SR Registration Down. I observed users getting directed to other phishing pages from the hidden wiki.

      I’d just like to reiterate that the absolute values only give a rough estimate of users. But its probably the only way we have of measuring hidden service usage without being an administator on those sites.

      Reply
  2. Pingback: Bitcoin Black Market Competition Heats Up, With Pro Marketing And Millions At Stake - Forbes

  3. Pingback: Bitcoin Black Market Competition Heats Up, With Pro Marketing And Millions At ...

  4. Pingback: Bitcoin Black Market Competition Heats Up, With Pro Marketing And Millions At Stake | The Freedom Watch

  5. Pingback: Bitcoin-accepting Atlantis takes on the Silk Road | BTC World News - BitCoin Network

  6. Pingback: Inside Atlantis: The online black market that lets users buy and sell drugs, forgeries and hacking services anonymously | Talesfromthelou's Blog

  7. Pingback: Inside Atlantis: The online black market that lets users buy and sell drugs …

  8. Pingback: Bitcoin-accepting Atlantis & Sheep takes on the Silk Road | unSpy

  9. Pingback: Bitcoin-accepting Atlantis & Sheep take on the Silk Road | unSpy

  10. Pingback: Study: Estimating hidden service traffic from DNS leaks | Deep Dot Web

  11. enquirer

    Hi Donncha, why are there multiple descriptor ids for these services?
    Does that mean that they were hosting their sites on several servers?

    Reply
    1. Donncha Post author

      Hi, no it doesn’t mean the sites were hosted on seperate servers. If you look at the formula for the descriptor-id you will see that it has a replica variable which can be either 0 or 1 and there is also a time-period included. This results in two different descriptor ids being generated for each hidden service every day.

      Reply

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>