From be2da83b1be1f3f112abef6baef232a7b320d0f1 Mon Sep 17 00:00:00 2001 From: Simone Basso Date: Tue, 14 Jun 2022 14:38:29 +0200 Subject: [PATCH] doc: publish the step-by-step design document (#814) This pull request publishes the step-by-step design document that I have been discussing with @hellais and @DecFox recently. Compared to the document that was approved, this one has been edited for readability. While there, I figured it was also be beneficial to publish the few ooni/probe-cli related design documents we produced in the past, because they probably help someone to get acquainted with the codebase. Reference issue for this pull request: https://github.com/ooni/probe/issues/2148 --- DESIGN.md | 4 + docs/design/README.md | 17 + docs/design/dd-001-oonimkall.md | 149 +++ docs/design/dd-002-netx.md | 409 +++++++ docs/design/dd-003-step-by-step.md | 1661 ++++++++++++++++++++++++++++ internal/engine/netx/doc.go | 3 + internal/netxlite/doc.go | 3 + pkg/oonimkall/doc.go | 4 + 8 files changed, 2250 insertions(+) create mode 100644 DESIGN.md create mode 100644 docs/design/README.md create mode 100644 docs/design/dd-001-oonimkall.md create mode 100644 docs/design/dd-002-netx.md create mode 100644 docs/design/dd-003-step-by-step.md diff --git a/DESIGN.md b/DESIGN.md new file mode 100644 index 0000000..f698ce9 --- /dev/null +++ b/DESIGN.md @@ -0,0 +1,4 @@ +# Design documents + +Please, see the [docs/design](docs/design) directory for a collection +of important design documents related to this repo. diff --git a/docs/design/README.md b/docs/design/README.md new file mode 100644 index 0000000..b77d1d8 --- /dev/null +++ b/docs/design/README.md @@ -0,0 +1,17 @@ +# Design documents + +This directory contains ooni/probe-cli design documents. We generally +use issues as short design documents, but some more-complex discussions +require proper design documents. We started collecting existing design +documents into this repository on 2022-06-13, to provide additional +clarity regarding our more-complex choices. + +Implementation aspects _may_ change freely over time and there's no need +to rush back and update the related design document. We want to keep a +design document around to document the _why_ of specific choices. +Most design documents are, therefore, informational and historic by +nature, given that we do not aim to continuously update them. + +Rather than radically revamp a design document in case of radical direction +changes, it is advisable to create a new design document referencing the +previous design document and explaining what changed. diff --git a/docs/design/dd-001-oonimkall.md b/docs/design/dd-001-oonimkall.md new file mode 100644 index 0000000..eb57d80 --- /dev/null +++ b/docs/design/dd-001-oonimkall.md @@ -0,0 +1,149 @@ +# OONIMKAll: Replacing Measurement Kit C API + +| | | +|:-------------|:-------------| +| Author | [@bassosimone](https://github.com/bassosimone) | +| Last-Updated | 2020-07-09 | +| Status | historic | + +*Abstract.* We describe our plan of replacing Measurement Kit for OONI +Probe Android and iOS (in particular) and (also) the CLI. + +## Introduction + +We want to write experiments in Go. This reduces our burden +compared to writing them using C/C++ code. + +Go consumers of probe-engine shall directly use its Go API. We +will discuss the Go API in a future revision of this spec. + +For mobile apps, we want to replace these MK APIs: + +- [measurement-kit/android-libs](https://github.com/measurement-kit/android-libs) + +- [measurement-kit/mkall-ios](https://github.com/measurement-kit/mkall-ios) + +We also want consumers of [measurement-kit's FFI API](https://git.io/Jv4Rv) +to be able to replace measurement-kit with probe-engine. + +## APIs to replace + +### Mobile APIs + +We define a Go API that `gomobile` binds to a Java/ObjectiveC +API that is close enough to the MK's mobile APIs. + +### FFI API + +We define a CGO API such that `go build -buildmode=c-shared` +yields an API reasonably close to MK's FFI API. + +## API for running experiments + +It seems the generic API for enabling running experiments both on +mobile devices and for FFI consumers is like: + +```Go +type Task struct{ ... } + +func StartTask(input string) (*Task, error) + +func (t *Task) Interrupt() + +func (t *Task) IsDone() bool + +func (t *Task) WaitForNextEvent() string +``` + +This should be enough to generate a suitable mobile API when +using the `gomobile` Go subcommand. + +We can likewise generate a FFI API as follows: + +```Go +package main + +import ( + "C" + "sync" + + "github.com/ooni/probe-engine/oonimkall" +) + +var ( + idx int64 = 1 + m = make(map[int64]*oonimkall.Task) + mu sync.Mutex +) + +//export ooni_task_start +func ooni_task_start(settings string) int64 { + tp, err := oonimkall.StartTask(settings) + if err != nil { + return 0 + } + mu.Lock() + handle := idx + idx++ + m[handle] = tp + mu.Unlock() + return handle +} + +//export ooni_task_interrupt +func ooni_task_interrupt(handle int64) { + mu.Lock() + if tp := m[handle]; tp != nil { + tp.Interrupt() + } + mu.Unlock() +} + +//export ooni_task_is_done +func ooni_task_is_done(handle int64) bool { + isdone := true + mu.Lock() + if tp := m[handle]; tp != nil { + isdone = tp.IsDone() + } + mu.Unlock() + return isdone +} + +//export ooni_task_wait_for_next_event +func ooni_task_wait_for_next_event(handle int64) (event string) { + mu.Lock() + tp := m[handle] + mu.Unlock() + if tp != nil { + event = tp.WaitForNextEvent() + } + return +} + +func main() {} +``` + +This is close enough to [measurement-kit's FFI API](https://git.io/Jv4Rv) that +a few lines of C allow to implement an ABI-compatible replacement. + +## Other APIs of interest + +We currently don't have plans for replacing other MK APIs. We will introduce +new APIs specifically tailored for our OONI needs, but they will be out of +scope with respect to the main goal of this design document. + +## measurement_kit command line client + +We replace the `measurement_kit` CLI tool with `miniooni`, +which will use the ~same set of CLI flags for QA reasons. + +## History + +[The initial version of this design document]( +https://github.com/measurement-kit/engine/blob/master/DESIGN.md) +lived in the measurement-kit namespace at GitHub. It discussed +a bunch of broad, extra topics, e.g., code bloat that are not +discussed in this document. More details regarding the migration +from MK to probe-engine are at [measurement-kit/measurement-kit#1913]( +https://github.com/measurement-kit/measurement-kit/issues/1913). diff --git a/docs/design/dd-002-netx.md b/docs/design/dd-002-netx.md new file mode 100644 index 0000000..74b8bf1 --- /dev/null +++ b/docs/design/dd-002-netx.md @@ -0,0 +1,409 @@ +# OONI Network Extensions + +| | | +|:-------------|:-------------| +| Author | [@bassosimone](https://github.com/bassosimone) | +| Last-Updated | 2020-04-02 | +| Status | historic | +| Obsoleted-by | [ooni/probe-engine#359](https://github.com/ooni/probe-engine/issues/359) | +| Obsoleted-by | [ooni/probe-engine#522](https://github.com/ooni/probe-engine/pull/522) | +| Obsoleted-by | [dd-003-step-by-step.md](dd-003-step-by-step.md) | + +*Abstract.* Rationale and design of [ooni/netx](https://github.com/ooni/netx), +which was later merged into [ooni/probe-engine]( +https://github.com/ooni/probe-engine) and +[ooni/probe-cli](https://github.com/ooni/probe-cli). + +## Introduction + +OONI experiments send and/or receive network traffic to +determine if there is blocking. We want the implementation +of OONI experiments to be as simple as possible. We also +_want to attribute errors to the major network or protocol +operation that caused them_. + +At the same time, _we want an experiment to collect as much +low-level data as possible_. For example, we want to know +whether and when the TLS handshake completed; what certificates +were provided by the server; what TLS version was selected; +and so forth. These bits of information are very useful +to analyze a measurement and better classify it. + +We also want to _automatically or manually run follow-up +measurements where we change some configuration properties +and repeat the measurement_. For example, we may want to +configure DNS over HTTPS (DoH) and then attempt to +fetch again an URL. Or we may want to detect whether +there is SNI bases blocking. This package allows us to +do that in other parts of probe-engine. + +## Rationale + +As we observed [ooni/probe-engine#13]( +https://github.com/ooni/probe-engine/issues/13), every +experiment consists of two separate phases: + +1. measurement gathering + +2. measurement analysis + +During measurement gathering, we perform specific actions +that cause network data to be sent and/or received. During +measurement analysis, we process the measurement on the +device. For some experiments (e.g., Web Connectivity), this +second phase also entails contacting OONI backend services +that provide data useful to complete the analysis. + +This package implements measurement gathering. The analysis +is performed by other packages in probe-engine. The core +design idea is to provide OONI-measurements-aware replacements +for Go standard library interfaces, e.g., the +`http.RoundTripper`. On top of that, we'll create all the +required interfaces to achive the measurement goals mentioned above. + +We are of course writing test templates in `probe-engine` +anyway, because we need additional abstraction, but we can +take advantage of the fact that the API exposed by this package +is stable by definition, because it mimics the stdlib. Also, +for many experiments we can collect information pertaining +to TCP, DNS, TLS, and HTTP with a single call to `netx`. + +This code used to live at `github.com/ooni/netx`. On 2020-03-02 +we merged github.com/ooni/netx@4f8d645bce6466bb into `probe-engine` +because it was more practical and enabled easier refactoring. + +## Definitions + +Consistently with Go's terminology, we define +_HTTP round trip_ the process where we get a request +to send; we find a suitable connection for sending +it, or we create one; we send headers and +possibly body; and we receive response headers. + +We also define _HTTP transaction_ the process starting +with an HTTP round trip and terminating by reading +the full response body. + +We define _netx replacement_ a Go struct of interface that +has the same interface of a Go standard library object +but additionally performs measurements. + +## Enhanced error handling + +This library MUST wrap `error` such that: + +1. we can classify all errors we care about; and + +2. we can map them to major operations. + +The `github.com/ooni/netx/modelx` MUST contain a wrapper for +Go `error` named `ErrWrapper` that is at least like: + +```Go +type ErrWrapper struct { + Failure string // error classification + Operation string // operation that caused error + WrappedErr error // the original error +} + +func (e *ErrWrapper) Error() string { + return e.Failure +} +``` + +Where `Failure` is one of the errors we care about, i.e.: + +- `connection_refused`: ECONNREFUSED +- `connection_reset`: ECONNRESET +- `dns_bogon_error`: detected bogon in DNS reply +- `dns_nxdomain_error`: NXDOMAIN in DNS reply +- `eof_error`: unexpected EOF on connection +- `generic_timeout_error`: some timer has expired +- `ssl_invalid_hostname`: certificate not valid for SNI +- `ssl_unknown_autority`: cannot find CA validating certificate +- `ssl_invalid_certificate`: e.g. certificate expired +- `unknown_failure `: any other error + +Note that we care about bogons in DNS replies because they are +often used to censor specific websites. + +And where `Operation` is one of: + +- `resolve`: domain name resolution +- `connect`: TCP connect +- `tls_handshake`: TLS handshake +- `http_round_trip`: reading/writing HTTP + +The code in this library MUST wrap returned errors such +that we can cast back to `ErrWrapper` during the analysis +phase, using Go 1.13 `errors` library as follows: + +```Go +var wrapper *modelx.ErrWrapper +if errors.As(err, &wrapper) == true { + // Do something with the error +} +``` + +## Netx replacements + +We want to provide netx replacements for the following +interfaces in the Go standard library: + +1. `http.RoundTripper` + +2. `http.Client` + +3. `net.Dialer` + +4. `net.Resolver` + +Accordingly, we'll define the following interfaces in +the `github.com/ooni/probe-engine/netx/modelx` package: + +```Go +type DNSResolver interface { + LookupHost(ctx context.Context, hostname string) ([]string, error) +} + +type Dialer interface { + Dial(network, address string) (net.Conn, error) + DialContext(ctx context.Context, network, address string) (net.Conn, error) +} + +type TLSDialer interface { + DialTLS(network, address string) (net.Conn, error) + DialTLSContext(ctx context.Context, network, address string) (net.Conn, error) +} +``` + +We won't need an interface for `http.RoundTripper` +because it is already an interface, so we'll just use it. + +Our replacements will implement these interfaces. + +Using an API compatible with Go's standard libary makes +it possible to use, say, our `net.Dialer` replacement with +other libraries. Both `http.Transport` and +`gorilla/websocket`'s `websocket.Dialer` have +functions like `Dial` and `DialContext` that can be +overriden. By overriding such function pointers, +we could use our replacements instead of the standard +libary, thus we could collect measurements while +using third party code to implement specific protocols. + +Also, using interfaces allows us to combine code +quite easily. For example, a resolver that detects +bogons is easily implemented as a wrapper around +another resolve that performs the real resolution. + +## Dispatching events + +The `github.com/ooni/netx/modelx` package will define +an handler for low level events as: + +```Go +type Handler interface { + OnMeasurement(Measurement) +} +``` + +We will provide a mechanism to bind a specific +handler to a `context.Context` such that the handler +will receive all the measurements caused by code +using such context. This mechanism is like: + +```Go +type MeasurementRoot struct { + Beginning time.Time // the "zero" time + Handler Handler // the handler to use +} +``` + +You will be able to assign a `MeasurementRoot` to +a context by using the following function: + +```Go +func WithMeasurementRoot( + ctx context.Context, root *MeasurementRoot) context.Context +``` + +which will return a clone of the original context +that uses the `MeasurementRoot`. Pass this context to +any method of our replacements to get measurements. + +Given such context, or a subcontext, you can get +back the original `MeasurementRoot` using: + +```Go +func ContextMeasurementRoot(ctx context.Context) *MeasurementRoot +``` + +which will return the context `MeasurementRoot` or +`nil` if none is set into the context. This is how our +internal code gets access to the `MeasurementRoot`. + +## Constructing and configuring replacements + +The `github.com/ooni/probe-engine/netx` package MUST provide an API such +that you can construct and configure a `net.Resolver` replacement +as follows: + +```Go +r, err := netx.NewResolverWithoutHandler(dnsNetwork, dnsAddress) +if err != nil { + log.Fatal("cannot configure specifc resolver") +} +var resolver modelx.DNSResolver = r +// now use resolver ... +``` + +where `DNSNetwork` and `DNSAddress` configure the type +of the resolver as follows: + +- when `DNSNetwork` is `""` or `"system"`, `DNSAddress` does +not matter and we use the system resolver + +- when `DNSNetwork` is `"udp"`, `DNSAddress` is the address +or domain name, with optional port, of the DNS server +(e.g., `8.8.8.8:53`) + +- when `DNSNetwork` is `"tcp"`, `DNSAddress` is the address +or domain name, with optional port, of the DNS server +(e.g., `8.8.8.8:53`) + +- when `DNSNetwork` is `"dot"`, `DNSAddress` is the address +or domain name, with optional port, of the DNS server +(e.g., `8.8.8.8:853`) + +- when `DNSNetwork` is `"doh"`, `DNSAddress` is the URL +of the DNS server (e.g. `https://cloudflare-dns.com/dns-query`) + +When the resolve is not the system one, we'll also be able +to emit events when performing resolution. Otherwise, we'll +just emit the `DNSResolveDone` event defined below. + +Any resolver returned by this function may be configured to return the +`dns_bogon_error` if any `LookupHost` lookup returns a bogon IP. + +The package will also contain this function: + +```Go +func ChainResolvers( + primary, secondary modelx.DNSResolver) modelx.DNSResolver +``` + +where you can create a new resolver where `secondary` will be +invoked whenever `primary` fails. This functionality allows +us to be more resilient and bypass automatically certain types +of censorship, e.g., a resolver returning a bogon. + +The `github.com/ooni/probe-engine/netx` package MUST also provide an API such +that you can construct and configure a `net.Dialer` replacement +as follows: + +```Go +d := netx.NewDialerWithoutHandler() +d.SetResolver(resolver) +d.ForceSpecificSNI("www.kernel.org") +d.SetCABundle("/etc/ssl/cert.pem") +d.ForceSkipVerify() +var dialer modelx.Dialer = d +// now use dialer +``` + +where `SetResolver` allows you to change the resolver, +`ForceSpecificSNI` forces the TLS dials to use such SNI +instead of using the provided domain, `SetCABundle` +allows to set a specific CA bundle, and `ForceSkipVerify` +allows to disable certificate verification. All these funcs +MUST NOT be invoked once you're using the dialer. + +The `github.com/ooni/probe-engine/netx` package MUST contain +code so that we can do: + +```Go +t := netx.NewHTTPTransportWithProxyFunc( + http.ProxyFromEnvironment, +) +t.SetResolver(resolver) +t.ForceSpecificSNI("www.kernel.org") +t.SetCABundle("/etc/ssl/cert.pem") +t.ForceSkipVerify() +var transport http.RoundTripper = t +// now use transport +``` + +where the functions have the same semantics as the +namesake functions described before and the same caveats. + +We also have syntactic sugar on top of that and legacy +methods, but this fully describes the design. + +## Structure of events + +The `github.com/ooni/probe-engine/netx/modelx` will contain the +definition of low-level events. We are interested in +knowing the following: + +1. the timing and result of each I/O operation. + +2. the timing of HTTP events occurring during the +lifecycle of an HTTP request. + +3. the timing and result of the TLS handshake including +the negotiated TLS version and other details such as +what certificates the server has provided. + +4. DNS events, e.g. queries and replies, generated +as part of using DoT and DoH. + +We will represent time as a `time.Duration` since the +beginning configured either in the context or when +constructing an object. The `modelx` package will also +define the `Measurement` event as follows: + +```Go +type Measurement struct { + Connect *ConnectEvent + HTTPConnectionReady *HTTPConnectionReadyEvent + HTTPRoundTripDone *HTTPRoundTripDoneEvent + ResolveDone *ResolveDoneEvent + TLSHandshakeDone *TLSHandshakeDoneEvent +} +``` + +The events above MUST always be present, but more +events will likely be available. The structure +will contain a pointer for every event that +we support. The events processing code will check +what pointer or pointers are not `nil` to known +which event or events have occurred. + +To simplify joining events together the following holds: + +1. when we're establishing a new connection there is a nonzero +`DialID` shared by `Connect` and `ResolveDone` + +2. a new connection has a nonzero `ConnID` that is emitted +as part of a successful `Connect` event + +3. during an HTTP transaction there is a nonzero `TransactionID` +shared by `HTTPConnectionReady` and `HTTPRoundTripDone` + +4. if the TLS handshake is invoked by HTTP code it will have a +nonzero `TrasactionID` otherwise a nonzero `ConnID` + +5. the `HTTPConnectionReady` will also see the `ConnID` + +6. when a transaction starts dialing, it will pass its +`TransactionID` to `ResolveDone` and `Connect` + +7. when we're dialing a connection for DoH, we pass the `DialID` +to the `HTTPConnectionReady` event as well + +Because of the following rules, it should always be possible +to bind together events. Also, we define more events than the +above, but they are ancillary to the above events. Also, the +main reason why `HTTPConnectionReady` is here is because it is +the event allowing to bind `ConnID` and `TransactionID`. diff --git a/docs/design/dd-003-step-by-step.md b/docs/design/dd-003-step-by-step.md new file mode 100644 index 0000000..3e47c02 --- /dev/null +++ b/docs/design/dd-003-step-by-step.md @@ -0,0 +1,1661 @@ +# Step-by-step measurements + +| | | +|:-------------|:-------------| +| Author | [@bassosimone](https://github.com/bassosimone) | +| Last-Updated | 2022-06-13 | +| Reviewed-by | [@hellais](https://github.com/hellais) | +| Reviewed-by | [@DecFox](https://github.com/DecFox/) | +| Status | approved | +| Obsoletes | [dd-002-netx.md](dd-002-netx.md) | + +*Abstract.* The original [netx design document](dd-002-netx.md) is now two +years old. Since we wrote such a document, we amended the overall design +several times. The four major design changes were: + +1. saving rather than emitting +[ooni/probe-engine#359](https://github.com/ooni/probe-engine/issues/359); + +2. switching to save measurements using the decorator +pattern [ooni/probe-engine#522](https://github.com/ooni/probe-engine/pull/522); + +3. the netx "pivot" [ooni/probe-cli#396](https://github.com/ooni/probe-cli/pull/396); + +4. measurex [ooni/probe-cli#528](https://github.com/ooni/probe-cli/pull/528). + +In this (long) design document, we will revisit the original problem proposed by +[df-002-netx.md](df-002-netx.md), in light of what we changed and learned from the +changes we applied. We will highlight the significant pain points of the current +implementation, which are the following: + +1. that the measurement library API is significantly different from the Go stdlib +API, therefore violating this original `netx` design goal: that writing a new +experiment means using slightly different constructors that deviate from the standard +library only to meet specific measurement goals we have; + +2. that the decorator pattern leads to complexity in creating measurement types, +which in turn seems to be the reason why the previous issue exists; + +3. that the decorator pattern does not allow us to precisely collect all the +data that matters for events such as TCP connect and DNS round trips using +a custom transport, thus suggesting that we should revisit our choice of using +decorators and revert back to some form of _constructor based injection_ to +inject a data type suitable for saving events. + +In doing that, we will also propose an incremental plan for moving the tree +forward from [the current state](https://github.com/ooni/probe-cli/tree/1685ef75b5a6a0025a1fd671625b27ee989ef111) +to a state where complexity is transferred from the measurement-support library to +the implementation of each individual network experiment. + +## Index + +In [netxlite: the underlying library](#netxlite-the-underlying-network-library) we +describe the current design of the underlying network library. + +In [measurement tactics](#measurement-tactics) we provide an historical perspective +on the measurement tactics, we adopted or just tried. + +The [step-by-step refactoring proposal](#step-by-step-refactoring-proposal) +contains the main contribution of this design document: a proposal to refactor +the existing codebase to address our current measurement-code problems. + +The [reviews](#reviews) section contains information about reviews. + +## netxlite: the underlying network library + +This section describes `netxlite`, the underlying network library, from a +historical perspective. We explain our data collection needs, and what types +from the standard library we're using as patterns. + +### Measurement Observations + +Most OONI experiments need to observe and give meaning to these events: + +1. DNSLookup + +2. TCPConnect + +3. TLSHandshake + +4. QUICHandshake + +5. HTTP GET + +6. TCP/UDP Read + +7. TCP/UDP Write + +8. UDP ReadFrom + +9. UDP WriteTo + +Observing Read, Write, ReadFrom, and WriteTo is optional. However, these +observations provide [valuable information beyond just discussing the +blocking of resources](https://ooni.org/post/2022-russia-blocks-amid-ru-ua-conflict/#twitter-throttled). + +As part of its life cycle, an OONI experiment performs these operations +multiple times. We call *observation* the result of each of these +network operations. + +For each observation, we want to collect when the related operation +started and terminated. + +We also want to collect input parameters and output results. + +When using a custom DNS transport (e.g., DNS over HTTPS), we +should also collect the exchanged DNS messages (query and response). In +this scenario, we may also want to record the child events caused by a +DNS round trip (e.g., TCPConnect, TLSHandshake). + +When we're using getaddrinfo, we should [call it directly and collect +its return code](https://github.com/ooni/probe/issues/2029). + +When we measure HTTP, there are redirections. Each redirection may reuse +an existing TCP or QUIC connection, and each redirection has +an HTTP request and response. (Redirections are more complex than it +seems because of cookies; not entering into details for now but still +worth mentioning.) + +The [OONI data format](https://github.com/ooni/spec/tree/master/data-formats) +defines how we archive experiment results as a set of observations. +(Orthogonally, we may also want to improve the data format, but this is +not under discussion now.) + +### Error Wrapping + +The OONI data format also specifies [how we should represent +errors](https://github.com/ooni/spec/blob/master/data-formats/df-007-errors.md). +Go generates its own errors, and we should *reduce* those errors to the +set of strings specified in the OONI data format. (Orthogonally, we may +also want to introduce better errors when possible.) + +We should also attribute the error to the operation that failed. In +principle, this seems easy. Yet, depending on how we're performing +measurements, it is not. More details later when appropriate. + +A semi-orthogonal aspect is that we would also like to include in +collected measurements the underlying raw syscall or library errors. That would +be, e.g., getaddrinfo's return code or the Rcode of +DNS response messages or the syscall error returned by a Read call. By +adding this, we would give those who analyze the data information to +evaluate the correctness of a measurement. + +### Go Stdlib + +The Go standard library provides the following structs and interfaces +that we can use for measuring: + +```Go +// package net + +type Resolver struct {} + +func (r *Resolver) LookupHost(ctx context.Context, domain string) ([]string, error) +``` + +The Resolver is \~equivalent to calling getaddrinfo. However, we cannot +observe the error returned by getaddrinfo, and we do not have the +guarantee that we're actually calling getaddrinfo. (On Unix, in +particular, [we use the "netgo" resolver](https://github.com/ooni/probe/issues/2118), which +reads `/etc/resolv.conf`, when `CGO_ENABLED=0`.) + +```Go +// package net + +type Dialer struct {} + +func (d *Dialer) DialContext(ctx context.Context, network, address string) (net.Conn, error) +``` + +The Dialer combines DNSLookup and TCPConnect when the address contains a +TCP/UDP endpoint in which the hostname is not an IP address (e.g., +`dns.google:443`). To observe a TCPConnect, we need to make sure that we're +passing an address argument containing an IP address (e.g., `8.8.8.8:443`) +otherwise, the whole operation will be a DNS lookup plus one or more +TCP-connect attempts. + +```Go +// package crypto/tls + +type Conn struct {} + +func Client(conn net.Conn, config *tls.Config) *Conn + +func (c *Conn) HandshakeContext(ctx context.Context) error + +func (c *Conn) ConnectionState() tls.ConnectionState +``` + +The above APIs in `crypto/tls` allow us to perform a TLS handshake and +observe its results. The `crypto/tls` library is quite limited, and this +[caused TLS fingerprinting issues in the +past](https://ooni.org/post/making-ooni-probe-android-more-resilient/). +To overcome this issue we devised two solutions: +[ooni/go](https://github.com/ooni/go) (which forks golang/go) and +[ooni/oocrypto](https://github.com/ooni/oocrypto) (which +is leaner, but still has +[some issues](https://github.com/ooni/probe/issues/2122)). + +```Go +// package net/http + +type Transport struct { + DialContext func(ctx context.Context, network, address string) (net.Conn, error) + + DialTLSContext func (ctx context.Context, network, address string) (net.Conn, error) + + // ... +} + +func (txp *Transport) RoundTrip(req *http.Request) (*http.Response, error) + +func (txp *Transport) CloseIdleConnections() + +type RoundTripper interface { + RoundTrip(req *http.Request) (*http.Response, error) +} + +type HTTPClient struct { + Transport http.RoundTripper +} +``` + +These APIs in `net/http` allow us to create connections and observe HTTP +round trips. The stdlib assumes we're using crypto/tls for TLS +connections and fails to establish HTTP2 connections otherwise because +it cannot read the ALPN array. So we [forked +net/http](https://github.com/ooni/oocrypto) to use +alternative TLS libs (e.g., refraction-networking/utls). + +We could say more here. But I am trying to be brief. Because of that, I +am glossing over HTTP3, which is not part of the standard library but is +implemented by +[lucas-clemente/quic-go](https://github.com/lucas-clemente/quic-go). +Apart from the stdlib and quic-go, the only other significant network code +dependency is [miekg/dns](https://github.com/miekg/dns) +for custom DNS resolvers (e.g., DNS-over-HTTPS). + +### Network Extensions + +A reasonable idea is to try to use types as close as possible to the +standard library. By following this strategy, we can compose our code +with stdlib code. We've been doing this [since day +zero](df-002-netx.md). + +We use the `netx` name to identify **net**work e**x**tensions in ooni/probe-cli. + +What is great about using stdlib-like types is that we're using code +patterns that people already know. + +In this document, we're not going to discuss how netx should be +internally implemented. What matters is that we have a way to mimic +stdlib-like types. See +[internal/model/netx.go](https://github.com/ooni/probe-cli/blob/v3.15.1/internal/model/netx.go) +for details on those types. + +In fact, what remains to discuss in this document is how we use these netx types to +perform measurements. And this seems more of a software engineering +problem than anything else. + +Yet, before jumping right into this topic, I think it is worth +mentioning that netx should do the following: + +1. implement logging (we want ooniprobe -v to provide useful debug +information); + +2. implement error wrapping and failed-operation mapping (as defined +above); + +3. implement reasonable watchdog timeouts for every operation (OONI +runs in weird networks where censorship may cause OONI to become +stuck; see, for example, [ooni/probe#1609](https://github.com/ooni/probe/issues/1609)). + +All network connections we create in OONI (for measuring or +communicating with support services) have these concerns. Measurement +code has additional responsibilities, such as collecting and +interpreting the network observations. Therefore, separation of concerns +suggests that measurement code be implemented by other packages using +netxlite as a dependency. + +(The "lite" in `netxlite` reflects the fact that it does not concern +itself with measurements unlike [the original netx](df-002-netx.md), which contained +both basic networking wrappers and network measurement code.) + +## Measurement Tactics + +Each subsection presents a different tactic for collecting measurement +observations. + +### Context-Based Tracing + +This tactic is the first one we implemented. We call this approach +"tracing" because it produces a trace of the events, and it's +"context-based" because we use features of the context package to +*inject* a *Trace* in the place where we need to collect data. + +The general idea is that we have stdlib-like data structures that a +programmer is already familiar with. So, we tell the programmer to do +what they do best (i.e., use the patterns they already know), and we +superimpose data collection using `Context.WithValue` and `context.Value`. + +It is worth mentioning that historically we chose this approach because +the stdlib allows one to use the context to perform network tracing +([net/http/httptrace](https://pkg.go.dev/net/http/httptrace)), then +we progressively abandoned `httptrace` as our tracing needs +become more complex than what `httptrace` could provide us with. + +#### How this tactic feels like + +I tried to adapt how this code would look if we used it now. As +[dd-002-netx.md](dd-002-netx.md) suggests, here I am trying to separate +data collection and interpretation, which looked like a great idea at +the time but has now become a bottleneck. + +```Go +// doWhatYouKnowBest uses stdlib-like constructs to collect data. +func doWhatYouKnowBest(ctx context.Context, URL string) (*http.Response, []byte, error) { + clnt := netxlite.NewHTTPClientStdlib(log.Log) + defer clnt.CloseIdleConnections() + req, err := http.NewRequestWithContext(ctx, "GET", URL, nil) + if err != nil { + return nil, nil, err + } + resp, err := clnt.Do(req) + if err != nil { + return nil, nil, err + } + r := io.LimitReader(maxBodySnapshotSize) + data, err := netxlite.ReadAllContext(ctx, r) + if err != nil { + return resp, nil, err + } + return resp, data, nil +} + +func ExperimentMain(ctx context.Context, URL string) { + saver := tracex.NewSaver() + ctx = saver.WrapContext(ctx) + resp, body, err := doWhatYouKnowBest(ctx, URL) + // ... + trace := saver.Read() + // figure out what happened by looking into the trace 🔥🔥🔥 +} +``` + +As you can see, I have marked with fire emojis where we +need to figure out what happened by reading the trace. We are going +to discuss this issue in the next section. + +#### Issue #1: distance between collection and interpretation + +The nice part of this approach is that the network-interaction part of +the experiment is \~easy. The bad part is that we must figure out +what happened after the fact by inspecting the trace. In principle, this +*seems* easy. In practice, the code for producing an interpretation of the +results could become quite fragile for several experiments. + +To illustrate what I mean, here's how we process the trace produced by +the telegram experiment in ooni/probe-cli@v3.15.1: + +```Go +func (tk *TestKeys) Update(v urlgetter.MultiOutput) { + // copy data from the trace into the final test keys first + tk.NetworkEvents = append(tk.NetworkEvents, v.TestKeys.NetworkEvents...) + tk.Queries = append(tk.Queries, v.TestKeys.Queries...) + tk.Requests = append(tk.Requests, v.TestKeys.Requests...) + tk.TCPConnect = append(tk.TCPConnect, v.TestKeys.TCPConnect...) + tk.TLSHandshakes = append(tk.TLSHandshakes, v.TestKeys.TLSHandshakes...) + + // then process access points + if v.Input.Config.Method != "GET" { + if v.TestKeys.Failure == nil { + tk.TelegramHTTPBlocking = false + tk.TelegramTCPBlocking = false + return // found successful access point connection + } + if v.TestKeys.FailedOperation == nil || + *v.TestKeys.FailedOperation != netxlite.ConnectOperation { + tk.TelegramTCPBlocking = false + } + return + } + + // now take care of web endpoints + if tk.TelegramWebStatus != "ok" { + return // we already flipped the state + } + if v.TestKeys.Failure != nil { + tk.TelegramWebStatus = "blocked" + tk.TelegramWebFailure = v.TestKeys.Failure + return + } + title := `Telegram Web` + if strings.Contains(v.TestKeys.HTTPResponseBody, title) == false { + failureString := "telegram_missing_title_error" + tk.TelegramWebFailure = &failureString + tk.TelegramWebStatus = "blocked" + return + } +} +``` + +While the code above is compact, it makes me sad. The underpinning reason +for my sadness seems to be that, with this tactic, we have lost any code +*locality*. The code for computing results is *conceptually* far away from +the code that collects observations. Producing results look more like +writing a compiler for a trace than enhancing the results of individual +operations with extra insights. + +Just reflect on this fact: we have a single function for producing all +observations (not shown here) and a single function for interpreting +observations. It would feel more natural to have a single function for +each submeasuremement and have such a function deal with data collection +and interpretation. (So, how we implemented the insights +at [ooni/probe-engine#13](https://github.com/ooni/probe-engine/issues/13), +was probably misguiding us, and I think the conceptually wrong culprit is [this +comment](https://github.com/ooni/probe-engine/issues/13#issuecomment-548900579)). + +We are not using the tactic we're currently analyzing in ooni/probe-cli; still, +the code for determining the results of experiments is the same. Perhaps, the +first take-home lesson of this historical survey is that we should improve in this +respect and make result-determining code more obvious and closer to the code +that performs the measurement. We will eventually come to fix this issue later +in this document. For now, let us continue to analyze this tactic. + +#### Issue #2: the Context is magic and implicit + +Another pain point is that we're using the Context's magic. What happens +there feels more obscure than explicit initialization for performing +measurements. Compare this code: + +```Go +saver := &Saver{} +dialer := netxlite.NewQUICDialerStdlib(log.Log) +ctx := saver.WrapContext(ctx) +conn, err := dialer.DialContext(ctx, /* ... */) +``` + +With this code: + +```Go +saver := &Saver{} +dialer := netxlite.NewQUICDialerStdlib(log.Log) +dialer := saver.WrapQUICDialer(dialer) +conn, err := dialer.DialContext(ctx, /* ... */) +``` + +In the latter case, it's evident that we're *decorating* the original +dialer with an extended dialer that knows how to perform network +measurements. In the former case, it feels magic that we're setting some +value as an opaque any inside of the context, and there's a documented +promise we'll use this value. +(In fairness, however, the second implementation could be extended with wrappers +that makes it look like the first one, which should solve the clarity +problem entailed by using a context *to do dependency injection*.) + +Debugging, in particular, feels clumsy. Suppose we are taking a code path +that, for some reason, does not honor the value put inside the context. +In such a case, we would be more puzzled than we would be when we're +explicitly wrapping a type. I will discuss this topic when we +analyze the next tactic because the next tactic is all about reducing +the cognitive burden and avoiding the context. + +#### Issue #3: we obtain a flat trace + +The most straightforward implementation of this approach yields a flat trace. This +means that one needs to be careful to understand, say, which events are +caused by a DNS-over-HTTPS lookup and which events instead belong to the +HTTP round trip that used a DNS-over-HTTPS resolver as part of the round +trip. When eyeballing a measurement, this is relatively easy. But +programmatically walking the trace is more error-prone. + +A possible solution to overcome this flat-trace problem is to +assign numbers to distinct HTTP round trips and DNS lookups so that +later it is possible to make sense of the trace. This was indeed the +approach we chose initially. However, this approach puts more pressure +on the context magic because it does not just suffice to wrap the +context once with WithValue, but you need to additionally wrap it when +you descend into sub-operations. (Other approaches are possible, but I +will discuss this one because it looks conceptually cleaner to +create a new "node" in the trace like it was a DOM.) + +To illustrate what I mean: when you enter into a DNS lookup using a DNS +transport, you need to provide sub-contexts to each query, such that +each query has a different unique ID. Then, you also need to communicate +such IDs to the parent, so it can reference the two queries as its +children. This strategy leads to fragile and non-obvious code like the +following (where I emphasized the lines one needs to add using patch syntax): + +```Diff + func (r *Resolver) LookupHost(ctx context.Context, domain string) ([]string, error) { ++ trace := ContextTraceOrDefault(ctx) ++ // This basically means: give me a new traceID, reference it into the parent ++ // and then bind the new trace with a new wrapped context using it ++ childTraceA := trace.NewChild() ++ ctxA := trace.WrapContext(ctx) + outputA := make(chan *dnsLookupResult) ++ childTraceAAAA := trace.NewChild() ++ ctxAAAA := trace.WrapContext(ctx) + outputAAAA := make(chan *dnsLookupResult) + go r.lookup(ctxA, /* ... */, outputA) + go r.lookup(ctxAAAA, /* ... */, outputAAAA) + resA := <-outputA + resAAAA := <-outputAAAA + // Continue processing the result... + } +``` +(I reckon we could improve the API +and make it prettier, but my actual point here is that the context magic +combined with other complexities leads to non-obvious code.) + +An alternative approach is to say: look, we just want to run +small-scope operations. For example, we run a DNS lookup and save all +the data. Then we see what IP addresses we've resolved and move onto +the next operation, which could be a GET using each IP +address. The underpinning idea of this simplification is to try to +produce flat and simple traces. Of course, we are +moving away from the "just do what you know best" approach, albeit not +that aggressively. But certainly, we need to tell the contributor about +how to split large operations into smaller chunks using `netxlite` +primitives. + +(We will discuss this idea of performing small operations +several times. Here it is worth anticipating that, at the very least, we +should probably separate DNS lookups from other operations in +experiments so we have a chance to explicitly try \~all IP addresses. +We should be doing that because we are increasingly seeing cases where +different IP addresses for the same domain behave differently regarding +censorship. See our [DoT in Iran +research](https://ooni.org/post/2020-iran-dot/) for more +information.) + +### Decoration-Based Tracing + +In [probe-engine#359](https://github.com/ooni/probe-engine/issues/359), we +started planning refactoring of `netx` to solve the identified issues +in our context-based tracing implementation. Because the context magic was +a significant issue at the time, this refactoring focused on +avoiding the context. After this refactoring, we obtained the tactic +we currently use, i.e., *decoration-based tracing*. + +In retrospect, it might be that we +were just using the context in a complex way and a simpler context-based +implementation was possible. Nonetheless, re-reading my assessment at +the time, it feels like I perceived all these problems as entangled. Hence, +the context needed to go along with other sources of complexity. + +#### Overview + +We defined a wrapper implementing the same interface and saving results +for each `internal/model/netx.go`'s type. For example: + +```Go +// package github.com/ooni/probe-cli/v3/internal/tracex + +type Dialer struct { + Dialer model.Dialer + Saver *Saver +} + +// [...] omitting code for creating this Dialer wrapper + +func (d *Dialer) DialContext(ctx context.Context, network, address string) (net.Conn, error) { + started := time.Now() + conn, err := d.Dialer.DialContext(ctx, network, address) + d.saveConnectEvent(started, network, address, conn, err, time.Now()) + return conn, err +} +``` + +Actual code is more complex than this. However, the concept is as simple as +that. + +All good, then? No. + +We moved complexity from the Context magic to construction. Now one +needs to construct, say, a Dialer by composing netxlite's Dialer +with a bunch of wrappers also implementing model.Dialer. + +It is no coincidence that the code above omits the code to compose a +base Dialer with a saving wrapper. Since we adopted this tactic, we +spent some time wrestling with composing data types. + +#### Decorator ~madness and how to overcome it: netx.Config + +To illustrate my point about construction, please consider this excerpt from +measurex where I am trying to create a `model.Resolver` that saves lookup +and DNS round trip events manually: + +```Go +func (mx *Measurer) NewResolverUDP(db WritableDB, logger model.Logger, address string) model.Resolver { + return mx.WrapResolver(db, netxlite.WrapResolver( + logger, netxlite.NewUnwrappedSerialResolver( + mx.WrapDNSXRoundTripper(db, netxlite.NewUnwrappedDNSOverUDPTransport( + mx.NewDialerWithSystemResolver(db, logger), + address, + )))), + ) +} +``` + +This kind of complexity is probably wrong. We hit +such a complexity wall twice: once initially with the +lockdown-2020 netx refactoring; the second time when trying +to design the alternative approach that eventually became measurex. + +We have already seen vanilla construction complexity +for `measurex`. Let us focus on `netx` own flavor of complexity. +So, back in 2020, we were refactoring netx, and we saw that it was cumbersome +to construct and wrap the basic types we had defined. There's a lot of +wrapping, as we have seen in the measurex example above. This reckoning +led us to design APIs such as the Config-based API in netx and +the (again) Config-based API in urlgetter. + +Here's a (perhaps disgusting?) excerpt from netx in ooni/probe-cli@v3.15.1 +that shows how we overcome constructor complexity by declaring a flat +`Config` structure from which to construct a `model.Resolver`: + +```Go +type Config struct { + BaseResolver model.Resolver // default: system resolver + BogonIsError bool // default: bogon is not error + ByteCounter *bytecounter.Counter // default: no explicit byte counting + CacheResolutions bool // default: no caching + CertPool *x509.CertPool // default: use vendored gocertifi + ContextByteCounting bool // default: no implicit byte counting + DNSCache map[string][]string // default: cache is empty + DialSaver *trace.Saver // default: not saving dials + Dialer model.Dialer // default: dialer.DNSDialer + FullResolver model.Resolver // default: base resolver + goodies + QUICDialer model.QUICDialer // default: quicdialer.DNSDialer + HTTP3Enabled bool // default: disabled + HTTPSaver *trace.Saver // default: not saving HTTP + Logger model.DebugLogger // default: no logging + NoTLSVerify bool // default: perform TLS verify + ProxyURL *url.URL // default: no proxy + ReadWriteSaver *trace.Saver // default: not saving read/write + ResolveSaver *trace.Saver // default: not saving resolves + TLSConfig *tls.Config // default: attempt using h2 + TLSDialer model.TLSDialer // default: dialer.TLSDialer + TLSSaver *trace.Saver // default: not saving TLS +} + +func NewResolver(config Config) model.Resolver { + if config.BaseResolver == nil { + config.BaseResolver = &netxlite.ResolverSystem{} + } + var r model.Resolver = config.BaseResolver + r = &netxlite.AddressResolver{ + Resolver: r, + } + if config.CacheResolutions { + r = &resolver.CacheResolver{Resolver: r} + } + if config.DNSCache != nil { + cache := &resolver.CacheResolver{Resolver: r, ReadOnly: true} + for key, values := range config.DNSCache { + cache.Set(key, values) + } + r = cache + } + if config.BogonIsError { + r = resolver.BogonResolver{Resolver: r} + } + r = &netxlite.ErrorWrapperResolver{Resolver: r} + if config.Logger != nil { + r = &netxlite.ResolverLogger{ + Logger: config.Logger, + Resolver: r, + } + } + if config.ResolveSaver != nil { + r = resolver.SaverResolver{Resolver: r, Saver: config.ResolveSaver} + } + return &netxlite.ResolverIDNA{Resolver: r} +} +``` + +This code could be made prettier ([see how it looks +now](https://github.com/ooni/probe-cli/blob/2502a237fb5e2dd3dc4e0db23dd19eabb292f6a1/internal/engine/netx/resolver.go#L13)). But +we are not here to compliment ourselves on our +now-prettier code, rather we want to analyze +what makes our code more obscure and less +scalable than it could be. + +I would argue that the central issue of this code is that it's +declaring it will take the burden of constructing the Resolver +types required by *any* experiment (and, of course, we have +similar constructors for other fundamental types). Thus, we +are putting lots of pressure on a single fundamental +library (a topic on which we'll return later). It is also +worth noting that we're putting this pressure as +a side effect of +trying to cope with the complexity of constructing +measurement-able types using decoration. + +At the same time, the usage is still at `doWhatYouKnowBest` +levels. We have a declarative way of constructing something, but then we +use that something in a stdlib-like way. However, the +need for a flexible experiment that we could invoke from the command +line moved the complexity to another level: enter the +`urlgetter` experiment and library. + +#### Urlgetter and the perils of too many layers of abstraction + +As I mentioned just now, if we're using netx, now the code looks like: + +```Go +func ExperimentMain(ctx context.Context, URL string) { + config := &netx.Config{ + // set possibly as many as 10 fields + } + txp := netx.NewHTTPTransport(config) + client := netxlite.NewHTTPClient(txp) + resp, body, err := doWhatYouKnowBest(ctx, client, URL) + // ... + trace := saver.Read() + // figure out what happened by looking into the trace +} +``` + +That is, we're still in `doWhatYouKnowBest` territory. Incidentally, this +also means that (1) there's still distance between collection and +interpretation, and (2) we're still producing a flat trace. We have +basically traded the context magic with construction complexity, which +we overcome by adding abstraction without changing all the other pain +points of the original netx implementation. + +Conversely, if we're using urlgetter, the code is even more abstract +(which is not necessarily good): + +```Go + +func ExperimentMain(ctx context.Context, URL string) { + config := &urlgetter.Config{ + // set possibly as many as 10 fields + } + g := urlgetter.Getter{ + Config: config, + Session: sess, + Target: URL, + } + tk, _ := g.Run(ctx) + // figure out what happened by looking into the tk (aka TestKeys) +} + +``` + +With this code, the `doWhatYouKnowBest` feeling is gone forever. A +programmer needs to learn an entirely new API for implementing OONI +measurements (unless they want to hack at the `netx` or `netxlite` library level). + +Crucially, we still have the same post-processing pain we had with netx. There +is always a "now we figured out what happened" step +that runs over a full trace. + +Also, very importantly: debugging was hard with the original netx +because of the context magic. Now it is hard because there are many +layers of abstraction. If you see a line like: + +```Go +dialer := netx.NewDialer(config) +``` + +It's completely opaque to you what the dialer could or could not do. To +understand, you need to read the content of the config. The matter +becomes even more complex if that config is not just declared in the +codebase but is actually code generated, as happens with urlgetter, which +generates a Config for netx. + +#### Difficulty in collecting precise observations + +Some observations happen at places that are not so easy to collect using +decoration. Here is a simple example. Assume we wanted to implement +lightweight measurements where we only record which major operation +failed (i.e., dialing, DNS lookup, TLS handshake), completely ignoring +read/write events and also not bothering ourselves with collecting +traces. In this simplified scenario, we could use this decorator: + +```Go +func (d *Dialer) DialContext(ctx context.Context, network, address string) (net.Conn, error) { + started := time.Now() + conn, err := d.Dialer.DialContext(ctx, network, address) + d.saveConnectEvent(started, network, address, conn, err, time.Now()) + return conn, err +} +``` + +However, even in this +simplified scenario, a fine catch leads to less than +optimal data collection. The address parameter contains +an IP address because our wrapper is part of a chain of model.Dialer +decorators and is inserted in a point in this chain after we have +resolved a domain to IP addresses. However, it would be much more +interesting to record also the domain name that caused us to start +connecting. For example, it's more powerful to know we're failing to +connect to a Cloudflare IP address for a specific domain than not +knowing which is the domain for which we're connecting. Yet, we cannot +do that using a strict decoration-based approach because of the +signature of DialContext and the type chain we build. + +However, consider what we could do with the context (I'm using a +patch syntax to show what needs to change): + +```Diff + func (d *dialerResolver) DialContext(ctx context.Context, network, address string) (net.Conn, error) { + onlyhost, onlyport, err := net.SplitHostPort(address) + if err != nil { + return nil, err + } + addrs, err := d.lookupHost(ctx, onlyhost) + if err != nil { + return nil, err + } + addrs = quirkSortIPAddrs(addrs) + var errorslist []error + for _, addr := range addrs { + target := net.JoinHostPort(addr, onlyport) ++ trace := ContextWithTraceOrDefault(ctx) ++ started := time.Now() + conn, err := d.Dialer.DialContext(ctx, network, target) ++ // XXX: error already wrapped because of how we construct a full dialer, ++ // still we may want to refactor code to make this obvious ++ trace.OnConnect(started, network, onlyhost, target, conn, err) + if err == nil { + return conn, nil + } + errorslist = append(errorslist, err) + } + return nil, quirkReduceErrors(errorslist) +} +``` + +The above snippet allows us to collect both the TCP endpoint containing +the IP address and the port, i.e., the `target` variable and the original +hostname passed to lookupHost, named `onlyhost`. + +The key take-home message here is that the context is more flexible +because we can jump into the implementation's middle and observe what +we need. We are not constrained by the boundaries imposed +by the type signatures we're wrapping. (To be fair, the context here is +just an *opaque method to inject a dependency* called `trace` that +implements methods to register what happened, so the real point here is +that dependency injection could overcome our construction fatigue.) + +A similar problem occurs for DNS lookup when using a transport. The +current implementation of both tracex and netx collects the messages +exchanged in the DNS round trip and the resolved addresses separately. +Then, when one is performing data analysis, it is their responsibility +to find out a sensible way of gluing together related events. We could +do better (and provide the results along with the original DNS messages +and extra information such as the response's Rcode) by injecting a trace +using the context (or possibly also using other dependency injection +mechanisms if the context is too bad): + +```Diff + // In internal/netxlite/resolverparallel.go + + // lookupHost issues a lookup host query for the specified qtype (e.g., dns.A). + func (r *ParallelResolver) lookupHost(ctx context.Context, hostname string, + qtype uint16, out chan\<- \*parallelResolverResult) { + encoder := &DNSEncoderMiekg{} + query := encoder.Encode(hostname, qtype, r.Txp.RequiresPadding()) ++ started := time.Now() + response, err := r.Txp.RoundTrip(ctx, query) ++ trace := ContextTraceOrDefault(ctx) ++ finished := time.Now() + if err != nil { ++ trace.OnDNSRoundTripForLookupHost(started, query, response, []string{}, err, finished) + out <- ¶llelResolverResult{ + addrs: []string{}, + err: err, + } + return + } + addrs, err := response.DecodeLookupHost() ++ trace.OnDNSRoundTripForLookupHost(started, query, response, addrs, err, finished) + out <- ¶llelResolverResult{ + addrs: addrs, + err: err, + } +} +``` + +The code above allows us to create an OONI DNS Lookup event in a +straightforward way. Now, instead, we need to do work to reconstruct +what could have been the original messages basing our judgment only on +the results of LookupHost. Likewise, the decorator approach requires us +to wrap things, while we can collect getaddrinfo results in a way that +almost feels embarrassing in its simplicity (and where we could get +the `CNAME` and getaddrinfo's return code very easily): + +```Diff + // In internal/netxlite/getaddrinfo_cgo.go + + func getaddrinfoLookupANY(ctx context.Context, domain string) ([]string, string, error) { ++ started := time.Now() + addrs, cname, err := getaddrinfoStateSingleton.LookupANY(ctx, domain) ++ // XXX: err isn't wrapped here but, yeah, we can call the error wrapping function ++ // also from this function since wrapping is mostly idempotent ++ trace := ContextTraceOrDefault(ctx) ++ trace.OnGetaddrinfo(started, addrs, cname, err) + return addrs, cname, err + } +``` + +#### Concluding remarks + +Historically, the decorator-based approach helped simplify +the codebase (probably because what we had previously was +messier). Yet it replaced context magic with constructor +bureaucracies. Additionally, it did not allow us to solve most of the +issues we had with flat traces and the distance between collection +and interpretation. Moreover, because this approach is relatively rigid, +it is more difficult to collect precise observations than it would be +using the context to do dependency injection. + +Looking backward, it has not been a bad idea to declare in `netx` docs +that we're OK with keeping this approach, but we would like new +experiments to, ehm, experiment with leaner techniques. + +Whatever choice we make, it should probably include some form of +dependency injection for a trace that allows us to collect the events we +care about more precisely and with less effort. + +### Step-by-Step Measurements + +We had previous conversations around simplifying how we perform +measurements (e.g., I remember a conversation with +[Vinicius](https://github.com/fortuna), where he advocated +for decomposing measurements in simple operations, and he +rightfully pointed out that tracing is excellent for debugging +but complicating assigning meaning to measurements). + +We also mentioned that, in the codebase, we documented that `netx` was discouraged as an +approach for new experiments. The first chance to try a different tactic +was the development of the `websteps` prototype. We tried to implement +step-by-step measurements, which, in its most radical form, calls for +performing each relevant step in isolation, immediately saving a +small trace and interpreting it before moving on to the next step +(unless there's an error, in which case you typically stop). + +Looking back at the Go stdlib API, the main blocker to implementing +this tactic is how to reconcile it with HTTP transports, which expects to +dial and control their own connections. Luckily, +[Kathrin](https://github.com/kelmenhorst) +[implemented](https://github.com/ooni/probe-cli/pull/432) the +following trick that solved such an issue: + +```Go +// NewSingleUseDialer returns a "single use" dialer. The first +// dial will succeed and return the conn regardless of the network +// and address arguments passed to DialContext. Any subsequent +// dial returns ErrNoConnReuse. +func NewSingleUseDialer(conn net.Conn) model.Dialer { + return &dialerSingleUse{conn: conn} +} + +// dialerSingleUse is the Dialer returned by NewSingleDialer. +type dialerSingleUse struct { + mu sync.Mutex + conn net.Conn +} + +var _ model.Dialer = &dialerSingleUse{} + +func (s *dialerSingleUse) DialContext(ctx context.Context, network string, addr string) (net.Conn, error) { + defer s.mu.Unlock() + s.mu.Lock() + if s.conn == nil { + return nil, ErrNoConnReuse + } + var conn net.Conn + conn, s.conn = s.conn, nil + return conn, nil +} +``` + +With a "single-use" dialer, we provide an HTTPTransport with a fake +dialer that provides a previously-established +connection to the transport itself. The following snippet shows +code from my first naive attempt at writing code using this tactic, +where originally identified pain points have been emphasized: + +```Go +dialer := netxlite.NewDialerWithoutResolver() +conn, err := dialer.DialContext(ctx, network, address) +// omitting code showing how to obtain and save a trace for the dial 🔥🔥🔥 +if err != nil { + meas.FailedOperation = netxlite.ConnectOperation + meas.Failure = err + return +} +defer conn.Close() + +thx := netxlite.NewTLSHandshakerStdlib() +tlsConn, _, err := thx.Handshake(ctx, conn, tlsConfig) +// omitting code showing how to obtain and save a trace for the handshake 🔥🔥🔥 +if err != nil { + meas.FailedOperation = netxlite.TLSHandshakeOperation + meas.Failure = err + return +} +defer tlsConn.Close() + +txp := netxlite.NewHTTPTransport( + netxlite.NewNullDialer(), netxlite.NewSingleUseTLSDialer(tlsConn)) +resp, err := txp.RoundTrip(req) +// omitting code showing how to obtain and save a trace for the HTTP round trip 🔥🔥🔥 +if err != nil { + meas.FailedOperation = netxlite.HTTPRoundTrip + meas.Failure = err + return +} +// omitting more code for brevity... 🔥🔥🔥 +``` + +Let's discuss the good parts before moving on to the bad parts. It is +dead obvious which operation failed and why, and we know what went +wrong and can analyze the observations immediately. + +Additionally, if you ask someone who knows the standard library to write +an experiment that provides information about TCP connect, TLS handshake, and +HTTP round trip using `netxlite`, they would probably write something +similar to the above code. + +#### Issue #1: no persistent connections + +Without adding extra complexity, we lose the possibility of using +persistent connections. This may not be a huge problem except for +redirects. Each redirect will require us to set up a new connection, +even though an ordinary http.Transport would probably have reused an +existing one. + +Because we're measuring censorship, I would argue it's OK to not reuse +connections. Sure, the measurement could be slower, but we'd get +more data points on TCP/IP and TLS blocking. + +#### Issue #2: requires manual handling of redirects + +Because we cannot reuse connections easily, we cannot rely on +&http.Client{} to perform redirections automatically for us. This is why +websteps implements HTTP redirection manually. + +While it may seem that discussing redirects is out of scope, +historically I had been reluctant to switch to a step-by-step model +because I felt manually handling redirects was hard, so I wanted +to avoid doing that. + +The bassosimone/websteps-illustrated repository contains primitives for +efficiently handling redirects as part of its fork of the measurex package. +The relevant data struct is +[URLRedirectDeque](https://github.com/bassosimone/websteps-illustrated/blob/b3bf45c44f92bc6d0473f57b27eb73c4bb67e3c0/internal/measurex/url.go#L696), +which encapsulates the details for ensuring we're redirecting correctly. +Here are the crucial aspects to consider: + +1. we should normalize the URL because different endpoints of the same +website may give us different redirect URLs that normalize to the +same URL; + +2. we should consider the name of the cookies to distinguish between +redirects: two redirects to the same URL with a different set of +cookies are actually two distinct redirects; + +3. we need to remember which URLs we have already visited to avoid +recommending a client to follow a redirect we have already +followed previously; + +4. we should limit the total depth of the redirect queue to avoid +looping forever; + +5. we should consider the 301, 302, 307, and 308 status codes as +redirect candidates. + +Because this problem of redirection is fundamental to many experiments +(not only webconnectivity but also, e.g., whatsapp), any step-by-step +approach library needs this functionality. + +#### Issue #3: DRY pressure + +Before, I said that saving traces seems complicated. That is not entirely +true. Depending on the extent to which we are willing to suffer the pain of +spelling out each operation, we can observe everything. + +For example, just combining code we \~already have in tree, we obtain: + +```Diff ++ saver := &Saver{} + dialer := netxlite.NewDialerWithoutResolver() ++ dialer = saver.WrapDialer(dialer) + conn, err := dialer.DialContext(ctx, network, address) ++ meas.ReadTCPConnectFromSaver(saver) + if err != nil { + meas.FailedOperation = netxlite.ConnectOperation + meas.Failure = err + return + } + defer conn.Close() +``` + +The hand waving now is limited to assuming we have a `meas` method called +`ReadTCPConnectFromSaver` that can read the trace and only save TCP +connect events (which seems easy to write.) + +In any case, should we use this tactic, I see a significant DRY +pressure. Many experiments need to perform the same operations +repeatedly. Consider, e.g., how sending a DoH request looks like: + +```Go +addrs, cname, err := netxlite.GetaddrinfoLike(ctx, domain) +if err != nil { + meas.FailedOperation = GetaddrinfoLookup + meas.Failure = err + return +} + +for _, addr := range addrs { + endpoint := net.JoinHostPort(addr, "443") + dialer := netxlite.NewDialerWithoutResolver() + var err error + conn, err := dialer.DialContext(ctx, "tcp", endpoint) + if err != nil { + // What to save here? + continue + } + defer conn.Close() + thx := netxlite.NewTLSHandshaker() + tconn, _, err := thx.Handshake(ctx, conn, config) + if err != nil { + // What to save here? + continue + } + defer tconn.Close() + txp := netxlite.NewHTTPTransport( + netxlite.NewNullDialer(), netxlite.NewSingleUseTLSDialer(tconn)) + clnt := netxlite.NewHTTPClient(txp) + dnstxp := netxlite.NewUnwrappedDNSOverHTTPSTransport(clnt, URL) + defer dnstxp.CloseIdleConnections() + resp, err := dnstxp.RoundTrip(ctx, query) + if err != nil { + // What to save here? + continue + } + // Should we continue measuring here? +} +``` + +Now, if we do that in a single experiment, this code does not feel +terrible. My main concern is the amount of similar code +we'll need to write and what we'll need to do should we discover +that we need to perform some sort of refactoring that includes changing +the code. (How likely is this refactoring if we use basic primitives, though?) + +#### Notes on spreading complexity + +Please see Arturo's comments in his 2022-06-09 +review. The gist of these comments is that we would like more +complexity in experiments than in the support library. This seems like a +scalability argument. If all experiments extensively use the same +support library, then this library will feel the pressure of all the +needs of all experiments. Conversely, the pressure is much lower +if experiments are written in +terms of basic primitives plus a set of support +functions. Hence, my time would be spent +more on improving experiments than polishing and enhancing the +support library. + +Additionally, Arturo argues that we should strive to keep the +implementation of experiments (including the implementation of the +primitives they use) as constant in time as possible to ensure that we +can always keep comparing our results. By +spreading complexity, we reduce the risk that a well-meaning refactoring +in a support library has side effects for an experiment. + +A few additional points in favor of the step-by-step approach are +that we make it easier for people to inspect and understand the +logic of a specific OONI Probe test implementation without having to +learn the OONI test writing framework. This is helpful to data analysts +looking at the data and needing additional information beyond +what is already documented in ooni/spec. Ultimately, the code is the +source of truth; as such, we should make it as easy to inspect as +possible. + +Another aspect worth considering is that some tests might have specific +needs that go beyond what we have thought to provide in the high level +trace-based or decorator-based APIs. For example, I may want to have a +test where I want to do HTTP, but the IP I use doesn't come from the DNS +resolver, but it comes from some other source. I think it's going to be +more explicit and natural to implement this with the base golang +primitives instead of having to define a custom config or have to +implement a specific tracing method for this use case. + +The last point is that there are scenarios in which one might be +interested in triggering some follow-up tests based on the results of +some network operation. If the API is doing too much, I +might not have the ability to hook me into it and run the follow-up +experiment right after the operation I needed to do. + +#### Concluding remarks + +If we have a way to collect observations, +this approach certainly has the advantage of having some +"what you already know" vibes. It is also possible that we will end up +writing very easy-to-maintain code using this style. + +Another interesting consideration is whether we want to use this style for all +experiments. A tracing-based approach seems to work fine for all +the experiments where we are primarily investigating and logging events. +When interpretation is critical, this tactic is definitely one of the best +ways to ensure that our code is correct. It builds on a basic +netxlite-based vocabulary for expressing micro-operations and calls for +only using such a vocabulary. + +For all the other experiments, we just +probably want to split DNS and other operations to get a chance to +test all (or many) of the available IP addresses and use tracing within DNS and +other operations. + +### DNSLookup and Endpoint Measurements + +This approach is currently implemented as measurex. A DNSLookup +measurement should be an obvious concept, while it's probably less +clear what's an endpoint measurement. So, let's clarify. + +An endpoint measurement is one of the following operations: + +1. given a TCP endpoint (IP address and port), TCP connect to it; + +2. given a TCP endpoint and an SNI, TCP connect and perform a TLS handshake; + +3. given a QUIC endpoint and an SNI, QUIC handshake with it; + +4. given a TCP endpoint and an URL, TCP connect, then HTTP GET; + +5. given a TCP endpoint, an SNI, and an URL, TCP connect, TLS handshake, then HTTP GET; + +6. given a QUIC endpoint, an SNI, and an URL, QUIC handshake, then HTTP GET. + +Since +[dnscheck](https://github.com/ooni/spec/blob/master/nettests/ts-028-dnscheck.md), +it's evident to us that we want to split DNS lookup and subsequent +operations. The reason is that there's [endpoint-based blocking and +routing-based +censorship](https://ooni.org/post/2020-iran-dot/). So, at +least in the grand scheme of things, we want to test all the available +IP addresses. (Then, there are concerns with doing that strictly, given +that there may be many, but still...). Also, it's interesting to note +that, at least in part, web connectivity was already testing many IP +addresses, so that pattern was already in OONI somehow (at least for the +most important experiment measuring web censorship.) + +Another interesting observation about the above set of operations is +that each of them could fail exactly once. The DNSLookup could fail or +yield addresses, and TCP connect could fail or succeed. In the TCP connect +plus TLS handshake case, you stop there if you fail the TCP connect. And +so on. + +Because of this reasoning, one could say that the measurex tactic +is equivalent to the previous one in relatively easily identifying +the failed operation and filling the measurement. That seems to be an argument +for having a library containing code to simplify measurements. +However, at the same time, *we are again asking developers to learn +an entirely new API*. After careful consideration, it seems preferable to +select an API that is closer to what a typical Go programmer would +expect. + +## Step-by-step refactoring proposal + +I tried to reimplement the telegram experiment using a pure step-by-step approach +([here's the +gist](https://gist.github.com/bassosimone/f6e680d35805174d1f150bc15ef754af)). +It looks fine, but one ends up writing a support library such as measurex. Yet, +as noted above, the API exposed by such a measurement +library matters, and an API familiar to Go developers seems preferable +to the API implemented by measurex. + +There are two key insights I derived from my telegram PoC. + +The first insight is in line with my previous observation. If the +measurement library provides an API equivalent to the one provided by +netxlite (of course, with some form of measurement saving), we can really +ask a developer to write code like they would in Go. Then, we can apply +some minor refactoring to make measurements collection possible. +Consider, for example, this code extracted from my telegram +[PoC](https://gist.github.com/bassosimone/f6e680d35805174d1f150bc15ef754af) +where I have clearly highlighted the required refactoring changes: + +```Go +const webDomain = "web.telegram.org" + +// measureWebEndpointHTTPS measures a web.telegram.org endpoint using HTTPS +// +// Arguments: +// +// - ctx is the context that allows us to stop early; +// +// - wg is the wait group that this goroutine should signal when done; +// +// - logger is the logger to use; +// +// - zeroTime is when we started measuring; +// +// - tk contains the test keys; +// +// - address is the TCP endpoint address we should use, and it should consist +// of an IP address and a port, separated by a colon. +// +// This method does not return any value and writes results directly inside +// the test keys, which have thread safe methods for that. +func (mx *Measurer) measureWebEndpointHTTPS(ctx context.Context, wg *sync.WaitGroup, +logger model.Logger, zeroTime time.Time, tk \*TestKeys, address string) { + // 0. setup + const webTimeout = 7 * time.Second + ctx, cancel := context.WithTimeout(ctx, webTimeout) + defer cancel() + defer wg.Done() // synchronize with the controller + weburl := measurexlite.NewURL("https", webDomain, "", "") + endpoint := net.JoinHostPort(address, "443") + ol := measurexlite.NewOperationLogger(logger, "GET %s @ %s", weburl.String(), endpoint) + index := tk.newIndex() + tk.registerSubmeasurement(index, endpoint, "web_https") + + // 1. establish a TCP connection with the endpoint + + // dialer := nextlite.NewDialerwithoutResolver(logger) // --- (removed line) + trace := measurexlite.NewTrace(index, logger, zeroTime) // +++ (added line) + dialer := trace.NewDialerWithoutResolver() // +++ (...) + defer tk.addTCPConnectResults(trace.TCPConnectResults()) // +++ (...) + + conn, err := dialer.DialContext(ctx, "tcp", endpoint) + if err != nil { + switch err.Error() { + case netxlite.FailureHostUnreachable: // happens when IPv6 not available + case netxlite.FailureNetworkUnreachable: // ditto + default: + tk.onWebFailure(err) + } + ol.Stop(err) + return + } + defer conn.Close() + + // 2. perform TLS handshake with the endpoint + + // thx := netxlite.NewTLSHandshakerStdlib(logger) // --- + conn = trace.WrapConn(conn) // +++ + defer tk.addNetworkEvents(trace.NetworkEvents()) // +++ + thx := trace.NewTLSHandshakerStdlib() // +++ + defer tk.addTLSHandshakeResult(trace.TLSHandshakeResults()) // +++ + + config := &tls.Config{ + NextProtos: []string{"h2", "http/1.1"}, + RootCAs: netxlite.NewDefaultCertPool(), + ServerName: webDomain, + } + tlsConn, _, err := thx.Handshake(ctx, cw, config) + if err != nil { + tk.onWebFailure(err) + ol.Stop(err) + return + } + defer tlsConn.Close() + + // 3. fetch the webpage at the endpoint + req, err := measurexlite.NewHTTPRequestWithContext(ctx, "GET", weburl.String(), nil) + runtimex.PanicOnError(err, "measurexlite.NewHTTPRequestWithContext failed unexpectedly") + req.Host = webDomain + + // txp := nextlite.NewHTTPTransportWithTLSConn(tlsConn) // --- + const maxBodySnapshotSize = 1 << 17 // +++ + txp := trace.NewHTTPTransportWithTLSConn(tlsConn, maxBodySnapshotSize) // +++ + defer tk.addHTTPRequestResult(trace.HTTPRequestResults()) // +++ + + resp, err := txp.RoundTrip(req) + if err != nil { + tk.onWebFailure(err) + ol.Stop(err) + return + } + resp.Body.Close() + + // const maxBodySnapshotSize = 1 << 17 // --- + + reader := io.LimitReader(resp.Body, maxBodySnapshotSize) + body, err := netxlite.ReadAllContext(ctx, reader) + if err != nil { + tk.onWebFailure(err) + ol.Stop(err) + return + } + + // 4. we expect to see a successful reply + if resp.StatusCode != 200 { + tk.onWebRequestFailed() + ol.StopString(netxlite.FailureHTTPRequestFailed) + return + } + + // 5. we expect to see the telegram web title + if !webCheckForTitle(body) { + tk.onWebMissingTitle() + ol.StopString(netxlite.FailureTelegramMissingTitleError) + return + } + + // 6. it seems we're all good + ol.Stop(nil) +} +``` + +I am satisfied with the above PoC because it shows how writing an +experiment could be done in netxlite primitives and how saving +observations requires patching the already-written experiment to +*inject a trace* into the equation. Another aspect worth mentioning in +the PoC is that *traces are numbered*. This is not what happens +currently in OONI Probe, but it is *beneficial*. By numbering +traces, we can quite easily tell which existing event belongs to which +specific submeasurement. (If there's no need to number traces, we can +just set a zero index to all the traces we collect, *e passa la paura*.) + +One minor aspect to keep in mind in this design is that we need to +communicate to developers that the *trace* will cause the body snapshot +to be read as part of the round trip. This fact occurs because OONI's +definition of a request-response transaction includes the response body +(or a snapshot) while Go does not include a body in http.Response +but allows for streaming the body on demand. Because reading all the +body with netxlite.ReadAllContext without any limit bound is unsafe (as +it could consume lots of RAM, and we're not always running on systems +with lots of RAM), the example was already limiting the response +body length before we introduced data collection. Yet, with the introduction of +data collection, the explicit netxlite.ReadAllContext is now reading +from memory rather than from the network because the body snapshot has +already been read. So, we need to ensure that developers know that +netxlite.ReadAllContext cannot be used to measure/estimate the +download speed. (In such a case, one would need to either use a +different transport or not collect any snapshot and then read +the whole body directly from the network--so perhaps we +need more than a single transport.) + +The second insight, already exemplified in the above code +snippet, is that step-by-step *could be just a style of writing +experiments* and we can use tracing as the underlying mechanism. Instead +of collecting a single trace using the available mechanism, +step-by-step calls for performing smaller operations and collecting +subtraces. In the above snippet, for example, we collect a single trace +for measuring accessing a single endpoint (which is what we also do with +measurex). This insight is handy: we only need to +implement *a single library* for collecting observations. + +The experiments that need to compute results (e.g., telegram) would +likely need to be rewritten in a pure step-by-step style while +experiments that just collect data could use tracing. I suppose there +will always be some form of *limited* step-by-step, where we will always +split DNS lookup and endpoint measurements as we already do in +dnscheck to ensure we measure \~all IP addresses. + +Compared to measurex, I think step-by-step is \~better because it does +not require anyone to learn more beyond how to use netxlite instead of +the standard library. (BTW, we cannot really get rid of netxlite because +we have measurement requirements that call for wrapping and extending +the standard library or to provide enhancements beyond the stdlib +functionality.) + +Regarding the way to implement tracing, from the above discussion, it is +clear that we should move away from the wrapping approach because it +does not allow us to correctly collect specific events. (To be fair, it +could allow us to do that, but it would entail +significant wrapping efforts.) I would therefore recommend rewriting +tracing to use the context (ugh!) but to wrap this implementation inside +an API that *hides* how we're actually collecting the traces. To +be specific, here's what I mean: + +```Go +// package internal/measurexlite + +type Trace struct { /* ... */ } + +var _ model.Trace = &Trace{} + +func NewTrace(index int64, logger model.Logger, zeroTime time.Time) *Trace { + const ( + tlsHandshakeBuffer = 16, + // ... + ) + return &Trace{ + Index: index, + Logger: logger, + TLS: make(chan *model.ArchivalTLSOrQUICHandshakeResult, tlsHandshakeBuffer), + ZeroTime: zeroTime, + // ... + } +} + +func (tx *Trace) NewTLSHandshakerStdlib() model.TLSHandshaker { + return &tlsHandshakerTrace{ + TLSHandshaker: netxlite.NewTLSHandshakerStdlib(tx.Logger), + Trace: tx, + } +} + +type tlsHandshakerTrace { /* ... */ } + +var _ model.TLSHandshaker = &tlsHandshakerTrace{} + +func (thx *tlsHandshakerTrace) Handshake(ctx context.Context, + conn net.Conn, config \*tls.Config) (net.Conn, tls.ConnectionState, error) { + ctx = netxlite.ContextWithTrace(ctx, thx.Trace) // <- here we setup the context magic + return thx.TLSHandshaker.Handshake(ctx, conn, config) +} + +func (tx *Trace) OnTLSHandshake(started time.Time, remoteAddr string, + config *tls.Config, state tls.ConnectionState, err error, finished time.Time) { + const network = "tcp" + event := NewArchivalTLSOrQUICHandshakeResult(tx.Index, started.Sub(tx.ZeroTime), network, + remoteAddr, config, state, err, finished.Sub(tx.ZeroTime)) + select { + case tx.TLS <- event: + default: // buffer full, stop collecting + } +} + +func (tx *Trace) TLSHandshakeResults() (out []\*model.ArchivalTLSOrQUICHandshakeResult) { + for { + select { + case ev := <-tx.TLS: + out = append(out, ev) + default: + return // drained buffer, we're done here + } + } +} + +func NewArchivalTLSOrQUICHandshakeResult(index int64, started time.Time, + network, remoteAddr string, config *tls.Config, state + tls.ConnectionState, err error, finished time.Time) *model.ArchivalTLSOrQUICHandshakeResult { + // ... +} + +// package internal/netxlite + +func (thx *tlsHandshakerConfigurable) Handshake(ctx context.Context, + conn net.Conn, config \*tls.Config) (net.Conn, tls.ConnectionState, error) { + // ... + remoteAddr := conn.RemoteAddr().String() // +++ (i.e., added line) + trace := ContextTraceOrDefault(ctx) // +++ + started := time.Now() // +++ + err := tlsconn.HandshakeContext(ctx) + finished := time.Now() // +++ + if err != nil { + trace.OnTLSHandshake(started, remoteAddr, config, // +++ + tls.ConnectionState{}, err, finished) // +++ + return nil, tls.ConnectionState{}, err + } + state := tlsconn.connectionState() + trace.OnTLSHandshake(started, remoteAddr, config, // +++ + state, nil, finished) // +++ + return tlsconn, state, nil +} + +func ContextTraceOrDefault(ctx context.Context) model.Trace { /* ... */ } + +func ContextWithTrace(ctx context.Context, trace model.Trace) context.Context { /* ... */ } + +// package internal/model + +type Trace interface { + OnTLSHandshake(started time.Time, remoteAddr string, config *tls.Config, + state tls.ConnectionState, err error, finished time.Time) + + // ... +} +``` + +Unlike the current code, in this design, I am using *buffered channels* +to limit the maximum extent of data collection, which is excellent in terms +of avoiding generating giant traces. Massive traces lead to huge JSONs, but +there may be cases where one is not satisfied with the defaults. +The fields of Trace are thus public, so one could possibly choose to use +different buffers for channels. If we don't like this specific change, we +can stick with the current model where we have possibly unbounded +traces. (This is just an implementation detail that does not matter much +regarding the overall design.) + +### Smooth transition + +We should do incremental refactoring. We should create a few issues +describing these design aspects and summarize what would be the way +forward. I propose to freeze measurex, tracex, netx, and urlgetter (the +libs we're currently using for measurements). The rationale for freezing +is that this set of proposed changes contains some tricky bits, and it could +be \~dangerous to apply these changes to libraries we're currently +using. We should select experiments to refactor and migrate each +to the new model independently, adding the required functionality to the +support library while we do that (I called the library measurexlite +above). By proceeding this way, we should have confidence that we're +not changing the fundamental way we perform measurements and +are not breaking existing experiments. We should also probably start +this change from the *least used* support libraries (which means +measurex, only used by tor and some \*ping experiments) so we can end up +earlier with a simplified tree with less measurement-supporting +libraries. It also seems that dash, hhfm, and hirl can be migrated quite +easily away from netx and urlgetter. + +### Netxlite scope change + +If we move forward with this plan, we will slightly change the scope of +netxlite to include lightweight support for collecting traces. We +initially said that we wanted to cleanly separate networking from +measurements, but it's also true that we need *some support* for +measuring. If we cannot use wrapping efficiently, it makes sense +for netxlite to provide a +mechanism for tracing, while the new measurement library would provide a +policy for saving measurements by implementing model.Trace properly. So, +we should also amend the documentation of netxlite to +explicitly mention support for tracing as a new concern. + +### Cleanups + +If we implement this step-by-step change, we no longer need a "flat" data +format. We use the flat data format for processing the results +of measurements. With step-by-step, we already have enough information +from the Go APIs we're wrapping to make decisions. Therefore, we can +directly produce the OONI data format for archival reasons without the +need to introduce an intermediate format. + +Once netx is gone, we can also clean up the code for creating type +chains. We can simplify the internal implementation inside +netxlite and possibly merge a couple of internal types. (My main +concern is with the error wrapping, which probably should be in the same +place where we are using the context to inject a trace to ensure +that error wrapping and tracing happen together.) + +## Reviews + +### 2022-06-09 - Review: Arturo + +According to Arturo, with whom I discussed this matter recently, it is +preferable to have more complexity inside the experiments than the core +engine. This gives people who read the code more confidence about its +correctness and reduces the amount of magic required to understand OONI. +Also, if we are using basic primitives, it is less likely that we will +have to refactor often. Finally, if we are generating observations +step-by-step, in most cases, we don't need a "flat" data format, and we +can directly produce the OONI data format, thus cutting the "flat" data +format layer entirely (which is quite complex). + +### 2022-06-11 - Review: Mehul + +We discussed the proposal in the "step-by-step refactoring +proposal" section. Hiding the context behind an API seems cleaner and +more robust to refactoring. An API that reads the whole body +and returns it as part of the round trip seems fine, but there's a weak +preference for sticking to the model employed by the stdlib. + +### 2022-06-11 - Status change + +The document is now complete and ready for design review. + +### 2022-06-13 - Review: Arturo + +Proposed additional arguments in favor of step-by-step measurements +and approved this document. diff --git a/internal/engine/netx/doc.go b/internal/engine/netx/doc.go index f4495a7..1e4edbf 100644 --- a/internal/engine/netx/doc.go +++ b/internal/engine/netx/doc.go @@ -43,4 +43,7 @@ // // Additionally, new code that does not need to perform measurements // SHOULD NOT use netx and SHOULD instead use netxlite. +// +// See docs/design/dd-002-nets.md in the probe-cli repository for +// the design document describing this package. package netx diff --git a/internal/netxlite/doc.go b/internal/netxlite/doc.go index fa369d2..ba738fd 100644 --- a/internal/netxlite/doc.go +++ b/internal/netxlite/doc.go @@ -34,6 +34,9 @@ // // We want to have reasonable watchdog timeouts for each operation. // +// See also the design document at docs/design/dd-003-step-by-step.md, +// which provides an overview of netxlite's main concerns. +// // Operations // // This package implements the following operations: diff --git a/pkg/oonimkall/doc.go b/pkg/oonimkall/doc.go index b28551d..3377b22 100644 --- a/pkg/oonimkall/doc.go +++ b/pkg/oonimkall/doc.go @@ -9,6 +9,10 @@ // reimplementation of the mkall API implemented by Measurement Kit // in, e.g., https://github.com/measurement-kit/mkall-ios. // +// Design +// +// See docs/design/dd-001-oonimkall.md in the probe-cli repository. +// // Semantic versioning policy // // This package is public for technical reasons. We cannot use `go