doc(step-by-step): readability improvements (#820)

This diff contains readability improvements for the step-by-step design document.

Co-authored-by: Simone Basso <bassosimone@gmail.com>
2022-06-30 09:55:18 +02:00 · 2022-06-30 09:55:18 +02:00 · 74aebedac3
commit 74aebedac3
parent 797dd27ffc
4 changed files with 144 additions and 122 deletions
--- a/docs/design/dd-003-step-by-step.md
+++ b/docs/design/dd-003-step-by-step.md
@@ -1,16 +1,19 @@
 # Step-by-step measurements

 |              |                                                |
-|:-------------|:-------------|
+|--------------|------------------------------------------------|
 | Author       | [@bassosimone](https://github.com/bassosimone) |
-| Last-Updated | 2022-06-13   |
+| Last-Updated | 2022-06-29                                     |
 | Reviewed-by  | [@hellais](https://github.com/hellais)         |
 | Reviewed-by  | [@DecFox](https://github.com/DecFox/)          |
+| Reviewed-by  | [@ainghazal](https://github.com/ainghazal/)    |
 | Status       | approved                                       |
 | Obsoletes    | [dd-002-netx.md](dd-002-netx.md)               |

-*Abstract.* The original [netx design document](dd-002-netx.md) is now two
-years old. Since we wrote such a document, we amended the overall design
+## Abstract
+
+The original [netx design document](dd-002-netx.md) is now two
+years old. Since we wrote that document, we amended the overall design
 several times. The four major design changes were:

 1. saving rather than emitting
@@ -24,50 +27,56 @@ pattern [ooni/probe-engine#522](https://github.com/ooni/probe-engine/pull/522);
 4. measurex [ooni/probe-cli#528](https://github.com/ooni/probe-cli/pull/528).

 In this (long) design document, we will revisit the original problem proposed by
-[df-002-netx.md](df-002-netx.md), in light of what we changed and learned from the
-changes we applied. We will highlight the significant pain points of the current
-implementation, which are the following:
+[dd-002-netx.md](dd-002-netx.md), in light of all the changes and lessons
+learned since then. We will highlight the significant pain points of the
+current implementation, which are the following:

-1. that the measurement library API is significantly different from the Go stdlib
-API, therefore violating this original `netx` design goal: that writing a new
-experiment means using slightly different constructors that deviate from the standard
-library only to meet specific measurement goals we have;
+1. The measurement library API is significantly different from the Go stdlib
+API. This violates one of the central design goals for `netx`: that writing a new
+experiment would involve using constructors very similar to the standard
+library. Such deviations were supposed to be made only to meet our specific
+measurement goals;

-2. that the decorator pattern leads to complexity in creating measurement types,
+2. The decorator pattern has led to complexity in creating measurement types,
 which in turn seems to be the reason why the previous issue exists;

-3. that the decorator pattern does not allow us to precisely collect all the
-data that matters for events such as TCP connect and DNS round trips using
-a custom transport, thus suggesting that we should revisit our choice of using
-decorators and revert back to some form of _constructor based injection_ to
+3. The decorator pattern does not allow us to precisely collect all the
+data that matters for certain events (such as TCP connect and DNS round trips using
+a custom transport). This suggests that we should revisit our choice of using
+decorators and revert to some form of _constructor-based injection_ to
 inject a data type suitable for saving events.

-In doing that, we will also propose an incremental plan for moving the tree
+Finally, this document also proposes an incremental plan for moving the tree
 forward from [the current state](https://github.com/ooni/probe-cli/tree/1685ef75b5a6a0025a1fd671625b27ee989ef111)
-to a state where complexity is transferred from the measurement-support library to
-the implementation of each individual network experiment.
+to a state in which the complexity has been transferred from the
+measurement-support library to the implementation of each individual network
+experiment.

 ## Index

-In [netxlite: the underlying library](#netxlite-the-underlying-network-library) we
-describe the current design of the underlying network library.
+There are four main sections in this document:

-In [measurement tactics](#measurement-tactics) we provide an historical perspective
-on the measurement tactics, we adopted or just tried.
+[1. Netxlite: the underlying library](#1-netxlite-the-underlying-network-library)
+describes the current design of the underlying network library.

-The [step-by-step refactoring proposal](#step-by-step-refactoring-proposal)
-contains the main contribution of this design document: a proposal to refactor
-the existing codebase to address our current measurement-code problems.
+[2. Measurement tactics](#2-measurement-tactics) gives a historical perspective
+on different measurement tactics we adopted or tried in the past, and reflects
+on their merits and downsides.

-The [reviews](#reviews) section contains information about reviews.
+[3. Step-by-step refactoring proposal](#3-step-by-step-refactoring-proposal)
+contains the main contribution of this design document: a concrete proposal to
+refactor the existing codebase to address our current measurement-code
+problems.

-## netxlite: the underlying network library
+[4. Document reviews](#4-document-reviews) contains information about reviews of this document.

-This section describes `netxlite`, the underlying network library, from a
+## 1. netxlite: the underlying network library
+
+This section describes `netxlite`, the underlying network library, from a
 historical perspective. We explain our data collection needs, and what types
 from the standard library we're using as patterns.

-### Measurement Observations
+### 1.1. Measurement Observations

 Most OONI experiments need to observe and give meaning to these events:

@@ -121,7 +130,7 @@ defines how we archive experiment results as a set of observations.
 (Orthogonally, we may also want to improve the data format, but this is
 not under discussion now.)

-### Error Wrapping
+### 1.2. Error Wrapping

 The OONI data format also specifies [how we should represent
 errors](https://github.com/ooni/spec/blob/master/data-formats/df-007-errors.md).
@@ -140,7 +149,7 @@ DNS response messages or the syscall error returned by a Read call. By
 adding this, we would give those who analyze the data information to
 evaluate the correctness of a measurement.

-### Go Stdlib
+### 1.3. Go Stdlib

 The Go standard library provides the following structs and interfaces
 that we can use for measuring:
@@ -235,7 +244,7 @@ Apart from the stdlib and quic-go, the only other significant network code
 dependency is [miekg/dns](https://github.com/miekg/dns)
 for custom DNS resolvers (e.g., DNS-over-HTTPS).

-### Network Extensions
+### 1.4. Network Extensions

 A reasonable idea is to try to use types as close as possible to the
 standard library. By following this strategy, we can compose our code
@@ -281,12 +290,20 @@ netxlite as a dependency.
 itself with measurements unlike [the original netx](dd-002-netx.md), which contained
 both basic networking wrappers and network measurement code.)

-## Measurement Tactics
+## 2. Measurement Tactics

 Each subsection presents a different tactic for collecting measurement
-observations.
+observations, while reflecting on their pros and cons.

-### Context-Based Tracing
+We revisit four distinct tactics:
+
+* [(1) Context-based tracing](#21-context-based-tracing),
+* [(2) Decorator-based tracing](#22-decorator-based-tracing),
+* [(3) Step-by-step measurements](#23-step-by-step-measurements), and
+* [(4) Measurex: splitting DNSLookup and Endpoint Measurements](#24-measurex-splitting-dnslookup-and-endpoint-measurements).
+
+
+### 2.1. Context-Based Tracing

 This tactic is the first one we implemented. We call this approach
 "tracing" because it produces a trace of the events, and it's
@@ -304,7 +321,7 @@ the stdlib allows one to use the context to perform network tracing
 we progressively abandoned `httptrace` as our tracing needs
 became more complex than what `httptrace` could provide us with.

-#### How this tactic feels like
+#### What context tracing feels like

 I tried to adapt how this code would look if we used it now. As
 [dd-002-netx.md](dd-002-netx.md) suggests, here I am trying to separate
@@ -346,7 +363,7 @@ As you can see, I have marked with fire emojis where we
 need to figure out what happened by reading the trace. We are going
 to discuss this issue in the next section.

-#### Issue #1: distance between collection and interpretation
+#### Issue #1 with context tracing: distance between collection and interpretation

 The nice part of this approach is that the network-interaction part of
 the experiment is \~easy. The bad part is that we must figure out
@@ -422,7 +439,7 @@ respect and make result-determining code more obvious and closer to the code
 that performs the measurement. We will eventually come to fix this issue later
 in this document. For now, let us continue to analyze this tactic.

-#### Issue #2: the Context is magic and implicit
+#### Issue #2 with context tracing: the Context is magic and implicit

 Another pain point is that we're using the Context's magic. What happens
 there feels more obscure than explicit initialization for performing
@@ -444,7 +461,7 @@ dialer := saver.WrapQUICDialer(dialer)
 conn, err := dialer.DialContext(ctx, /* ... */)
 ```

-In the latter case, it's evident that we're *decorating* the original
+In the latter case, it's evident that we're *decorating* the original
 dialer with an extended dialer that knows how to perform network
 measurements. In the former case, it feels magic that we're setting some
 value as an opaque any inside of the context, and there's a documented
@@ -460,7 +477,7 @@ explicitly wrapping a type. I will discuss this topic when we
 analyze the next tactic because the next tactic is all about reducing
 the cognitive burden and avoiding the context.

-#### Issue #3: we obtain a flat trace
+#### Issue #3 with context tracing: we obtain a flat trace

 The most straightforward implementation of this approach yields a flat trace. This
 means that one needs to be careful to understand, say, which events are
@@ -474,7 +491,7 @@ assign numbers to distinct HTTP round trips and DNS lookups so that
 later it is possible to make sense of the trace. This was indeed the
 approach we chose initially. However, this approach puts more pressure
 on the context magic because it does not just suffice to wrap the
-context once with WithValue, but you need to additionally wrap it when
+context once with `WithValue`, but you need to additionally wrap it when
 you descend into sub-operations. (Other approaches are possible, but I
 will discuss this one because it looks conceptually cleaner to
 create a new "node" in the trace like it was a DOM.)
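
The following is a minimal sketch of the "wrap the context again when you descend into sub-operations" idea, not the code OONI actually used. All the names (`TraceNode`, `WithRootNode`, `WithChildNode`, `Emit`) are hypothetical and exist only for illustration. The sketch is not safe for concurrent use.

```Go
package main

import (
	"context"
	"fmt"
)

// traceKey is the private key under which we store the current node.
type traceKey struct{}

// TraceNode is a hypothetical node of a tree-shaped (DOM-like) trace.
type TraceNode struct {
	Name     string
	Events   []string
	Children []*TraceNode
}

// WithRootNode wraps ctx once with the root node of the trace.
func WithRootNode(ctx context.Context, name string) (context.Context, *TraceNode) {
	root := &TraceNode{Name: name}
	return context.WithValue(ctx, traceKey{}, root), root
}

// WithChildNode wraps ctx again so that events emitted by a
// sub-operation end up in a new child of the current node.
func WithChildNode(ctx context.Context, name string) context.Context {
	parent, _ := ctx.Value(traceKey{}).(*TraceNode)
	if parent == nil {
		return ctx // tracing is not configured
	}
	child := &TraceNode{Name: name}
	parent.Children = append(parent.Children, child)
	return context.WithValue(ctx, traceKey{}, child)
}

// Emit appends an event to the node stored inside ctx, if any.
func Emit(ctx context.Context, event string) {
	if node, _ := ctx.Value(traceKey{}).(*TraceNode); node != nil {
		node.Events = append(node.Events, event)
	}
}

// buildExampleTrace simulates a DNS lookup followed by an HTTP round trip.
func buildExampleTrace() *TraceNode {
	ctx, root := WithRootNode(context.Background(), "measurement")
	dnsCtx := WithChildNode(ctx, "dns_lookup")
	Emit(dnsCtx, "resolve_start")
	Emit(dnsCtx, "resolve_done")
	httpCtx := WithChildNode(ctx, "http_round_trip")
	Emit(httpCtx, "connect")
	return root
}

func main() {
	for _, child := range buildExampleTrace().Children {
		fmt.Println(child.Name, child.Events)
	}
}
```

Every sub-operation must remember to call `WithChildNode`, and this is exactly the extra pressure on the context magic described above.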
@@ -529,7 +546,7 @@ censorship. See our [DoT in Iran
 research](https://ooni.org/post/2020-iran-dot/) for more
 information.)

-### Decoration-Based Tracing
+### 2.2. Decorator-Based Tracing

 In [probe-engine#359](https://github.com/ooni/probe-engine/issues/359), we
 started planning refactoring of `netx` to solve the identified issues
@@ -903,7 +920,7 @@ the `CNAME` and getaddrinfo's return code very easily):
 }
 ```

-#### Concluding remarks
+#### Concluding remarks on decorator-based tracing

 Historically, the decorator-based approach helped simplify
 the codebase (probably because what we had previously was
@@ -922,29 +939,29 @@ Whatever choice we make, it should probably include some form of
 dependency injection for a trace that allows us to collect the events we
 care about more precisely and with less effort.

-### Step-by-Step Measurements
+### 2.3. Step-by-step measurements

-We had previous conversations around simplifying how we perform
-measurements (e.g., I remember a conversation with
-[Vinicius](https://github.com/fortuna), where he advocated
-for decomposing measurements in simple operations, and he
-rightfully pointed out that tracing is excellent for debugging
-but complicating assigning meaning to measurements).
+We've had many conversations about how to simplify the way we do measurements.
+For instance, [Vinicius](https://github.com/fortuna) at some point advocated
+for decomposing measurements into simple operations. He rightfully pointed out
+that tracing is excellent for debugging, but it complicates assigning
+meaning to each measurement.

-We also mentioned that, in the codebase, we documented that `netx` was discouraged as an
-approach for new experiments. The first chance to try a different tactic
-was the development of the `websteps` prototype. We tried to implement
-step-by-step measurements, which, in its most radical form, calls for
-performing each relevant step in isolation, immediately saving a
-small trace and interpreting it before moving on to the next step
-(unless there's an error, in which case you typically stop).
+We had documented in the codebase that `netx` was discouraged as an approach
+for new experiments. We got the first chance to try a
+different tactic while developing the `websteps` prototype.
+
+In `websteps`, we tried to implement step-by-step measurements: in its most
+radical form, this calls for performing each relevant step in isolation,
+immediately saving a small trace and interpreting it before moving on to the
+next step (unless there's an error, in which case you typically stop).

 Looking back at the Go stdlib API, the main blocker to implementing
 this tactic is how to reconcile it with HTTP transports, which expect to
 dial and control their own connections. Luckily,
 [Kathrin](https://github.com/kelmenhorst)
 [implemented](https://github.com/ooni/probe-cli/pull/432) the
-following trick that solved such an issue:
+following trick that allows us to solve this issue:

 ```Go
 // NewSingleUseDialer returns a "single use" dialer. The first
@@ -978,8 +995,8 @@ func (s *dialerSingleUse) DialContext(ctx context.Context, network string, addr
 With a "single-use" dialer, we provide an HTTPTransport with a fake
 dialer that provides a previously-established
 connection to the transport itself. The following snippet shows
-code from my first naive attempt at writing code using this tactic,
-where originally identified pain points have been emphasized:
+code from my first naive attempt at writing code using this approach. The pain
+points we had originally identified have been emphasized:

 ```Go
 dialer := netxlite.NewDialerWithoutResolver()
@@ -1015,31 +1032,31 @@ if err != nil {
 ```

 Let's discuss the good parts before moving on to the bad parts. It is
-dead obvious which operation failed and why, and we know what went
-wrong and can analyze the observations immediately.
+dead obvious which operation failed and why, and **we know what went
+wrong and can analyze the observations immediately**.

 Additionally, if you ask someone who knows the standard library to write
 an experiment that provides information about TCP connect, TLS handshake, and
 HTTP round trip using `netxlite`, they would probably write something
 similar to the above code.
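
To make the single-use dialer trick concrete, here is a self-contained sketch that uses only the standard library. It is an approximation written for illustration, not the `netxlite` implementation: `singleUseDialer`, `fetchOverConn`, and `demoSingleUse` are hypothetical names, and a local `httptest` server stands in for the real endpoint.

```Go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync"
)

// errNoConnReuse is returned from the second dial attempt onwards.
var errNoConnReuse = errors.New("single-use dialer: connection already used")

// singleUseDialer hands out a previously-established conn exactly once.
type singleUseDialer struct {
	mu   sync.Mutex
	conn net.Conn
}

// DialContext ignores network and addr and returns the stored conn.
func (d *singleUseDialer) DialContext(ctx context.Context, network, addr string) (net.Conn, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.conn == nil {
		return nil, errNoConnReuse
	}
	conn := d.conn
	d.conn = nil
	return conn, nil
}

// fetchOverConn performs a GET round trip using only the given conn.
func fetchOverConn(conn net.Conn, URL string) (string, error) {
	dialer := &singleUseDialer{conn: conn}
	txp := &http.Transport{DialContext: dialer.DialContext}
	defer txp.CloseIdleConnections()
	req, err := http.NewRequest("GET", URL, nil)
	if err != nil {
		return "", err
	}
	resp, err := txp.RoundTrip(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// demoSingleUse runs the two steps against a local test server.
func demoSingleUse() (string, error) {
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "hello")
	}))
	defer server.Close()
	// Step 1: TCP connect, which we can observe and interpret in isolation.
	conn, err := net.Dial("tcp", server.Listener.Addr().String())
	if err != nil {
		return "", err
	}
	// Step 2: HTTP round trip over the connection we already have.
	return fetchOverConn(conn, server.URL)
}

func main() {
	body, err := demoSingleUse()
	fmt.Println(body, err) // prints "hello <nil>"
}
```

Because the transport can never dial again, any attempt to open a second connection fails loudly with `errNoConnReuse` instead of silently performing an unmeasured connect.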

-#### Issue #1: no persistent connections
+#### Issue #1 with the step-by-step approach: no persistent connections

-Without adding extra complexity, we lose the possibility of using
-persistent connections. This may not be a huge problem except for
+Without adding extra complexity, we lose the possibility of using
+persistent connections. This may not be a huge problem, except for
 redirects. Each redirect will require us to set up a new connection,
-even though an ordinary http.Transport would probably have reused an
+even though an ordinary `http.Transport` would probably have reused an
 existing one.

 Because we're measuring censorship, I would argue it's OK to not reuse
-connections. Sure, the measurement could be slower, but we'd get
+connections. Sure, the measurement can be slower, but we'll also get
 more data points on TCP/IP and TLS blocking.

-#### Issue #2: requires manual handling of redirects
+#### Issue #2 with the step-by-step approach: requires manual handling of redirects

 Because we cannot reuse connections easily, we cannot rely on
-&http.Client{} to perform redirections automatically for us. This is why
-websteps implements HTTP redirection manually.
+`&http.Client{}` to perform redirections automatically for us. This is why
+`websteps` implements HTTP redirection manually.
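
As an illustration of what manual redirection involves, the sketch below follows redirects one request at a time using only the standard library. It is not the `websteps` code: `followRedirects`, `isRedirect`, and `demoRedirects` are hypothetical helpers, and disabling keep-alives is an assumption made here to mimic the lack of persistent connections.

```Go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// isRedirect tells us whether the status code is an HTTP redirect.
func isRedirect(status int) bool {
	switch status {
	case 301, 302, 303, 307, 308:
		return true
	}
	return false
}

// followRedirects issues one GET per step and returns the visited URLs.
// The client must be configured to not follow redirects on its own.
func followRedirects(client *http.Client, URL string, maxRedirects int) ([]string, error) {
	var visited []string
	for len(visited) <= maxRedirects {
		visited = append(visited, URL)
		resp, err := client.Get(URL)
		if err != nil {
			return visited, err
		}
		resp.Body.Close()
		if !isRedirect(resp.StatusCode) {
			return visited, nil // this is the final response
		}
		// Location resolves relative URLs against the request URL.
		loc, err := resp.Location()
		if err != nil {
			return visited, err
		}
		URL = loc.String()
	}
	return visited, errors.New("too many redirects")
}

// demoRedirects runs the loop against a local server: /a -> /b -> /c.
func demoRedirects() ([]string, error) {
	mux := http.NewServeMux()
	mux.HandleFunc("/a", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/b", http.StatusFound)
	})
	mux.HandleFunc("/b", func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/c", http.StatusMovedPermanently)
	})
	mux.HandleFunc("/c", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "done")
	})
	server := httptest.NewServer(mux)
	defer server.Close()
	client := &http.Client{
		// Every step uses a new connection, as in a step-by-step measurement.
		Transport: &http.Transport{DisableKeepAlives: true},
		// Returning ErrUseLastResponse stops automatic redirection.
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	visited, err := followRedirects(client, server.URL+"/a", 10)
	for i := range visited {
		visited[i] = strings.TrimPrefix(visited[i], server.URL)
	}
	return visited, err
}

func main() {
	fmt.Println(demoRedirects()) // prints "[/a /b /c] <nil>"
}
```

A real experiment would run the DNS, TCP, and TLS steps for each new URL rather than calling `client.Get`, but the loop structure would be the same.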

 While it may seem that discussing redirects is out of scope,
 historically I had been reluctant to switch to a step-by-step model
@@ -1075,7 +1092,7 @@ Because this problem of redirection is fundamental to many experiments
 (not only webconnectivity but also, e.g., whatsapp), any step-by-step
 approach library needs this functionality.

-#### Issue #3: DRY pressure
+#### Issue #3 with the step-by-step approach: DRY pressure

 Before, I said that saving traces seems complicated. That is not entirely
 true. Depending on the extent to which we are willing to suffer the pain of
@@ -1195,7 +1212,7 @@ some network operation. If the API is doing too much, I
 might not have the ability to hook into it and run the follow-up
 experiment right after the operation I needed to do.

-#### Concluding remarks
+#### Concluding remarks on step-by-step measurements

 If we have a way to collect observations,
 this approach certainly has the advantage of having some
@@ -1215,13 +1232,16 @@ probably want to split DNS and other operations to get a chance to
 test all (or many) of the available IP addresses and use tracing within DNS and
 other operations.

-### DNSLookup and Endpoint Measurements
+### 2.4. measurex: splitting DNSLookup and Endpoint Measurements

-This approach is currently implemented as measurex. A DNSLookup
-measurement should be an obvious concept, while it's probably less
-clear what's an endpoint measurement. So, let's clarify.
+This fourth and last approach we discuss is currently
+implemented in `measurex`.

-An endpoint measurement is one of the following operations:
+A **DNSLookup measurement** is perhaps an obvious concept, but an **endpoint
+measurement** is probably not obvious. So, let's first clarify the
+terminology:
+
+We define an **endpoint measurement** as one of the following operations:

 1. given a TCP endpoint (IP address and port), TCP connect to it;

@@ -1249,12 +1269,12 @@ addresses, so that pattern was already in OONI somehow (at least for the
 most important experiment measuring web censorship.)

 Another interesting observation about the above set of operations is
-that each of them could fail exactly once. The DNSLookup could fail or
+that each of them could fail **exactly once**. The DNSLookup could fail or
 yield addresses, and TCP connect could fail or succeed. In the TCP connect
 plus TLS handshake case, you stop there if you fail the TCP connect. And
 so on.

-Because of this reasoning, one could say that the measurex tactic
+Because of this reasoning, one could say that the `measurex` tactic
 is equivalent to the previous one in relatively easily identifying
 the failed operation and filling the measurement. That seems to be an argument
 for having a library containing code to simplify measurements.
@@ -1263,15 +1283,17 @@ an entirely new API*. After careful consideration, it seems preferable to
 select an API that is closer to what a typical Go programmer would
 expect.

-## Step-by-step refactoring proposal
+## 3. Step-by-step refactoring proposal
+
+With this discussion in place, we can now turn to a concrete proposal.

 I tried to reimplement the telegram experiment using a pure step-by-step approach
 ([here's the
 gist](https://gist.github.com/bassosimone/f6e680d35805174d1f150bc15ef754af)).
-It looks fine, but one ends up writing a support library such as measurex. Yet,
+It looks fine, but one ends up writing a support library such as `measurex`. Yet,
 as noted above, the API exposed by such a measurement
 library matters, and an API familiar to Go developers seems preferable
-to the API implemented by measurex.
+to the API implemented by `measurex`.

 There are two key insights I derived from my telegram PoC.

@@ -1415,7 +1437,7 @@ the PoC is that *traces are numbered*. This is not what happens
 currently in OONI Probe, but it is *beneficial*. By numbering
 traces, we can quite easily tell which existing event belongs to which
 specific submeasurement. (If there's no need to number traces, we can
-just set a zero index to all the traces we collect, *e passa la paura*.)
+just set a zero index to all the traces we collect, *[e passa la paura](https://context.reverso.net/traduzione/italiano-inglese/passa+la+paura)*.)
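
To show why numbering helps, here is a small sketch in which each trace stamps its index on the events it records, so that a flat list of events can later be split by submeasurement. The `Trace` and `Event` types below are hypothetical and are not the `model.Trace` interface mentioned in this document.

```Go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical archived event carrying its trace index.
type Event struct {
	TraceIndex int64
	Operation  string
}

// Trace is a hypothetical numbered trace collecting events.
type Trace struct {
	Index  int64
	mu     sync.Mutex
	events []Event
}

// NewTrace creates a trace with the given index.
func NewTrace(index int64) *Trace {
	return &Trace{Index: index}
}

// Record saves an event stamped with this trace's index.
func (tx *Trace) Record(operation string) {
	tx.mu.Lock()
	defer tx.mu.Unlock()
	tx.events = append(tx.events, Event{TraceIndex: tx.Index, Operation: operation})
}

// Events returns a copy of the events recorded so far.
func (tx *Trace) Events() []Event {
	tx.mu.Lock()
	defer tx.mu.Unlock()
	return append([]Event{}, tx.events...)
}

// groupByIndex splits a flat list of events by trace index.
func groupByIndex(events []Event) map[int64][]string {
	groups := make(map[int64][]string)
	for _, ev := range events {
		groups[ev.TraceIndex] = append(groups[ev.TraceIndex], ev.Operation)
	}
	return groups
}

// demoNumberedTraces measures two endpoints with distinct traces, flattens
// the events as an archived measurement would, and groups them again.
func demoNumberedTraces() map[int64][]string {
	first := NewTrace(1)
	first.Record("connect")
	first.Record("tls_handshake")
	second := NewTrace(2)
	second.Record("connect")
	var flat []Event
	flat = append(flat, first.Events()...)
	flat = append(flat, second.Events()...)
	return groupByIndex(flat)
}

func main() {
	groups := demoNumberedTraces()
	fmt.Println(groups[1], groups[2]) // prints "[connect tls_handshake] [connect]"
}
```

Setting every index to zero degrades gracefully to a single group, which matches the unnumbered behavior.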

 One minor aspect to keep in mind in this design is that we need to
 communicate to developers that the *trace* will cause the body snapshot
@@ -1423,14 +1445,14 @@ to be read as part of the round trip. This fact occurs because OONI's
 definition of a request-response transaction includes the response body
 (or a snapshot) while Go does not include a body in http.Response
 but allows for streaming the body on demand. Because reading all the
-body with netxlite.ReadAllContext without any limit bound is unsafe (as
+body with `netxlite.ReadAllContext` without any limit bound is unsafe (as
 it could consume lots of RAM, and we're not always running on systems
 with lots of RAM), the example was already limiting the response
 body length before we introduced data collection. Yet, with the introduction of
-data collection, the explicit netxlite.ReadAllContext is now reading
+data collection, the explicit `netxlite.ReadAllContext` is now reading
 from memory rather than from the network because the body snapshot has
 already been read. So, we need to ensure that developers know that
-netxlite.ReadAllContext cannot be used to measure/estimate the
+`netxlite.ReadAllContext` cannot be used to measure/estimate the
 download speed. (In such a case, one would need to either use a
 different transport or not collect any snapshot and then read
 the whole body directly from the network--so perhaps we
@@ -1453,16 +1475,16 @@ will always be some form of *limited* step-by-step, where we will always
 split DNS lookup and endpoint measurements as we already do in
 dnscheck to ensure we measure \~all IP addresses.

-Compared to measurex, I think step-by-step is \~better because it does
-not require anyone to learn more beyond how to use netxlite instead of
-the standard library. (BTW, we cannot really get rid of netxlite because
+Compared to `measurex`, I think step-by-step is \~better because it does
+not require anyone to learn more beyond how to use `netxlite` instead of
+the standard library. (BTW, we cannot really get rid of `netxlite` because
 we have measurement requirements that call for wrapping and extending
 the standard library or to provide enhancements beyond the stdlib
 functionality.)

 Regarding the way to implement tracing, from the above discussion, it is
-clear that we should move away from the wrapping approach because it
-does not allow us to correctly collect specific events. (To be fair, it
+clear that **we should move away from the wrapping approach because it
+does not allow us to correctly collect specific events**. (To be fair, it
 could allow us to do that, but it would entail
 significant wrapping efforts.) I would therefore recommend rewriting
 tracing to use the context (ugh!) but to wrap this implementation inside
@@ -1580,7 +1602,7 @@ can stick with the current model where we have possibly unbounded
 traces. (This is just an implementation detail that does not matter much
 regarding the overall design.)

-### Smooth transition
+### 3.1. Smooth transition

 We should do incremental refactoring. We should create a few issues
 describing these design aspects and summarize what would be the way
@@ -1600,7 +1622,7 @@ earlier with a simplified tree with less measurement-supporting
 libraries. It also seems that dash, hhfm, and hirl can be migrated quite
 easily away from netx and urlgetter.

-### Netxlite scope change
+### 3.2. Netxlite scope change

 If we move forward with this plan, we will slightly change the scope of
 netxlite to include lightweight support for collecting traces. We
@@ -1613,7 +1635,7 @@ policy for saving measurements by implementing model.Trace properly. So,
 we should also amend the documentation of netxlite to
 explicitly mention support for tracing as a new concern.

-### Cleanups
+### 3.3. Cleanups

 If we implement this step-by-step change, we no longer need a "flat" data
 format. We use the flat data format for processing the results
@@ -1629,7 +1651,7 @@ concern is with the error wrapping, which probably should be in the same
 place where we are using the context to inject a trace to ensure
 that error wrapping and tracing happen together.)

-## Reviews
+## 4. Document Reviews

 ### 2022-06-09 - Review: Arturo

--- a/internal/engine/experiment/README.md
+++ b/internal/engine/experiment/README.md
@@ -1,4 +1,4 @@
-# Directory github.com/ooni/probe-engine/experiment
+# Directory github.com/ooni/probe-cli/internal/engine/experiment

 This directory contains the implementation of all the supported
 experiments, one for each directory. The [OONI spec repository
--- a/internal/model/README.md
+++ b/internal/model/README.md
@@ -1,4 +1,4 @@
-# Package github.com/ooni/probe-engine/model
+# Package github.com/ooni/probe-cli/internal/model

 Shared data structures and interfaces. We include in this
 package the most fundamental types. Use `go doc` to get
--- a/internal/model/measurement.go
+++ b/internal/model/measurement.go
@@ -57,7 +57,7 @@ type Measurement struct {

 	// MeasurementStartTimeSaved is the moment in time when we
 	// started the measurement. This is not included into the JSON
-	// and is only used within probe-engine as a "zero" time.
+	// and is only used within the ./internal pkg as a "zero" time.
 	MeasurementStartTimeSaved time.Time `json:"-"`

 	// Options contains command line options