Commit d3dbafa80b "Use blocking IO in P2P to reduce CPU load" (merged
into v2.1.0) introduced the constant IO_TIMEOUT, setting it to 1 second.
On nodes with high-latency connections, this short timeout causes the
txhashset archive download during step 2 of the IBD process to
invariably fail before it completes. Since there's no mechanism for
resuming a failed download, this means the node gets stuck at this stage
and never syncs.
Increasing IO_TIMEOUT to 10 seconds solves the issue on my node; others
might suggest a more optimal value for the constant.
* Verify headers and blocks only when needed
Curretnly we have some lightweigt validation implemented as part of
entity deserialization, which is safer and allows us to not parse the
entire object if some part is invalid. At the same time this logic
always applies when we read an entity, eg when reading from DB.
This PR introduces UntrustedHeader/Block which is used when we read from
the network. It does partial validation during read, then it is supposed
to be converted into regular header/block which doesn't validate itself.
Also this PR adds "lightweight" validation to block header read like we have
for block body, so we don't parse block body if the header is invalid.
Fixes#1642
* Move version validation to untrusted header
* update fuzz tests
* wip
* exhaustive match
* write with fixed v1 strategy when writing for hashing
* local protocol version is 2
* cleanup "size" tests that exercise v1 vs v2 vs default protocol versions
* add proto version to Connected! log msg
* cleanup docs
* negotiate protocol version min(local, peer) when doing hand/shake
To convert option to error we generate an error message. In some places
it contains header or block hash code or other data which is costly to
produce. So during the initial header sync we spend 12% of all time on
generating those messages (in 99% cases we don't use it). This PR
introduces a lazy generation of error messages which completely
eliminates CPU load during the header sync.
* Add check for p2p connection limits
* Simplify undesirable connection shutdown
* Make inbound and outbound connections more explicit
* Cleanup inbound and outbound connections
* Cleanup an outbound peers check
* Rename healthy_peers_mix to enough_outbound_peers
* be a lot less restrictive when picking some candidate peers to connect to
keep the peer address queue drained but actually attempt a healthy number of connections
as most of these attempts are going to fail due to majority of nodes not being publicly accessible
* Ban on cannot get block header in txhashset_write
* Rusfmt
* Fix typo
* Missing error handling
* Rustfmt
* Only accept txhashset from corresponding peer
* Switch to AtomicBool instead of RwLock<bool>
* Rustfmt
* introduce protocol version to deserialize and read
* thread protocol version through our reader
* cleanup
* cleanup
* streaming_reader cleanup
* Pass protocol version into BinWriter to allow for version specific serialization rules.
* rustfmt
* read and write now protocol version specific
When we send a txhashet archive a peer's thread is busy with sending it
and can't send other messages, eg pings. If the network connection is
slow buffer capacity 10 may be not enough, hence the peer's drop.
Safer attempt to address #2929 in 2.0.0
* introduce protocol version to deserialize and read
* thread protocol version through our reader
* example protocol version access in kernel read
* fix our StreamingReader impl (WouldBlock woes)
* debug log progress of txhashset download
* create 2.0.0 branch
* fix humansize version
* update grin.yml version
* PoW HardFork (#2866)
* allow version 2 blocks for next 6 months
* add cuckarood.rs with working tests
* switch cuckaroo to cuckarood at right heights
* reorder to reduce conditions
* remove _ prefix on used args; fix typo
* Make Valid Header Version dependant on ChainType
* Rustfmt
* Add tests, uncomment header v2
* Rustfmt
* Add FLOONET_FIRST_HARD_FORK height and simplify logic
* assume floonet stays closer to avg 60s block time
* move floonet hf forward by half a day
* update version in new block when previous no longer valid
* my next commit:-)
* micro optimization
* Support new Bulletproof rewind scheme (#2848)
* Update keychain with new rewind scheme
* Refactor: proof builder trait
* Update tests, cleanup
* rustfmt
* Move conversion of SwitchCommitmentType
* Add proof build trait to tx builders
* Cache hashes in proof builders
* Proof builder tests
* Add ViewKey struct
* Fix some warnings
* Zeroize proof builder secrets on drop
* Modify mine_block to use wallet V2 API (#2892)
* update mine_block to use V2 wallet API
* rustfmt
* Add version endpoint to node API, rename pool/push (#2897)
* add node version API, tweak pool/push parameter
* rustfmt
* Upate version api call (#2899)
* Update version number for next (potential) release
* zeroize: Upgrade to v0.9 (#2914)
* zeroize: Upgrade to v0.9
* missed Cargo.lock
* [PENDING APPROVAL] put phase outs of C32 and beyond on hold (#2714)
* put phase outs of C32 and beyond on hold
* update tests for phaseouts on hold
* Don't wait for p2p-server thread (#2917)
Currently p2p.stop() stops and wait for all peers to exit, that's
basically all we need. However we also run a TCP listener in this thread
which is blocked on `accept` most of the time. We do an attempt to stop
it but it would work only if we get an incoming connection during the
shutdown, which is a week guarantee.
This fix remove joining to p2p-server thread, it stops all peers and
makes an attempt to stop the listener.
Fixes [#2906]
* rustfmt
* generate txhashset archives on 250 block intervals.
* moved txhashset_archive_interval to global and added a simple test.
* cleaning up the tests and adding license.
* increasing cleanup duration to 24 hours to prevent premature deletion of the current txhashset archive
* bug fixes and changing request_state to request height using archive_interval.
* removing stopstate from chain_test_helper to fix compile issue
I made an suboptimal (aka stupid) decision to stop and wait for peers
one by one which makes shutdown very slow - O(n). This PR decouples sending
stop signal from waiting a thread to exit. On top of it in Peers we
first send stop signal to all peers and only after that start waiting
for them to exit. It gives us a constant time of shutdown in most of the
cases.