Commit
blog: Add IPFS experiment post
kallisti5 committed May 8, 2021
1 parent f7d4d8d commit 9dd91bd
Showing 1 changed file with 96 additions and 0 deletions.
96 changes: 96 additions & 0 deletions content/blog/kallisti5/2021-05-03_ipfs_experiment.md
@@ -0,0 +1,96 @@
+++
type = "blog"
title = "Haiku's CDN, an IPFS Experiment"
author = "kallisti5"
date = "2021-05-03 15:50:05-05:00"
tags = ["haiku", "software"]
+++

Hello! I'm Alex, a member of our systems administration team and on the Haiku, Inc. board of directors. I've been playing with moving our repositories over to IPFS, and wanted to collect some user feedback.

@nielx

nielx May 8, 2021

Member

Suggestion: maybe make explicit what kind of feedback you want? It makes sense to make it more actionable: maybe invite readers/users with bandwidth and an interest to play around with this technology to try switching/pinning themselves and then ask them to report the outcome?

That's the gist of what I read further down, so it makes sense to mention that at the top to get people into action mode.


## First a little history

With the addition of package management in 2013, the amount of data Haiku has to manage has been steadily growing.

In ~2018 I moved our Haiku package repositories (along with nightly images and release images) to S3 object storage. This helped reduce the large amount of data we were lugging around on our core infrastructure, offloading it onto an externally managed service we could control programmatically. All of our CI/CD could securely and programmatically build artifacts into these S3 buckets. We found a great vendor which let us host a lot of data with unlimited egress bandwidth (that is, data transferred *out* of the service to end users) for an amazing price. This worked great through 2021; however, the vendor recently began walking back their "unlimited egress bandwidth" position. Last week they shut down our buckets, resulting in a repo and nightly outage of ~24 hours while we negotiated with their support team.

@nielx

nielx May 8, 2021

Member

Suggestion: explain 'egress bandwidth'.


## The problem

In these S3 buckets, we host around 1 TiB of data, and serve between 2 and 3 TiB of egress data monthly. We also serve around 4 TiB of egress data from our servers.

***This is almost 8 TiB of bandwidth a month***

We have a large number of wonderful organizations and individuals offering to mirror our package repositories and artifacts via rsync (the de facto way to mirror large amounts of data in the open-source world). However, one major issue has historically prevented us from taking people up on these offers for anything except release images: Haiku's package management kit doesn't have any kind of built-in signature checking of packages. While our CI/CD system **does** sign Haiku nightly images, releases, and repositories with minisign (and the haikuports buildmaster could be extended to do the same), our package management tools today perform zero checking of package or repository signatures.

This means a malicious actor could add tainted packages to a mirror, regenerate the repository file (which contains checksums of each package), and redistribute "bad things" to every Haiku user using the mirror.

Is this likely to happen? No. Is it possible? Yes.
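Since the images and repositories are already signed with minisign, verification today is a manual step on the user's side. A hedged sketch of what that looks like — the mirror URL and public key below are placeholders, not Haiku's real ones:

```shell
# Placeholder sketch: manually verifying a downloaded nightly image against
# its detached minisign signature. URL and key are illustrative only.
PUBKEY="RWQ...placeholder..."   # obtain the real public key out-of-band

curl -LO https://mirror.example.org/haiku-nightly-anyboot.iso
curl -LO https://mirror.example.org/haiku-nightly-anyboot.iso.minisig

# -V = verify, -m = file to check, -P = public key on the command line.
# minisign exits non-zero if the file or signature was tampered with.
minisign -Vm haiku-nightly-anyboot.iso -P "$PUBKEY"
```

This is exactly the checking that pkgman does not yet perform automatically, which is why rsync mirrors have been limited to release images.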

@nielx

nielx May 8, 2021

Member

Suggestion: I do not find the security argument particularly convincing for trying out IPFS. A better argument would be that (1) finding good mirrors is not easy and (2) IPFS enhances mirroring by optimizing downloads over multiple mirrors, whereas the load-balancer approach makes the decision beforehand and has no way to redirect you if another mirror with better performance is available.

@kallisti5

kallisti5 Jun 8, 2021

Author Contributor

The upcoming IPFS 0.9 release actually has some features to verify public gateways are presenting "truthful" data

ipfs/kubo#8058

That would give us tools to validate public gateways are presenting "real" copies of our repositories.

@kallisti5

kallisti5 Jun 8, 2021

Author Contributor

I could cut all the security stuff out… but people will ask "why not just let XYZ rsync mirror your repository?"


## The solution

In steps IPFS (the InterPlanetary File System). In mid-2020, I (quietly) set up http://ipfs.haiku-os.org as an IPFS mirror of our release images.

You can access this on any public IPFS gateway:

* https://ipfs.io/ipns/ipfs.haiku-os.org
* https://cloudflare-ipfs.com/ipns/ipfs.haiku-os.org

The official description calls it *"a peer-to-peer hypermedia protocol designed to make the web faster, safer, and more open."* In more technical terms, IPFS is a network of peer-to-peer nodes exchanging chunks of data, where each chunk is addressed by a hash of its contents. (Think BitTorrent, where every seed is also an HTTP gateway, and you're a quarter of the way there.) A great overview is available on their [website](https://ipfs.io/#how) (which is itself hosted on IPFS).
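The "addressed by a hash of its contents" idea is what makes untrusted mirrors safe to use. Here is a minimal Python sketch of the principle — real IPFS uses multihash-encoded CIDs and Merkle-DAG chunking, not bare SHA-256, so this only illustrates the concept:

```python
import hashlib

def chunk_id(data: bytes) -> str:
    """Toy content address: hex SHA-256 of the chunk's bytes.

    Real IPFS wraps the digest in a multihash/CID, but the principle
    is the same: the address is derived from the content itself.
    """
    return hashlib.sha256(data).hexdigest()

chunk = b"pretend this is a block of an .hpkg file"
cid = chunk_id(chunk)

# A requester asks any peer for `cid`, re-hashes whatever bytes come
# back, and rejects the response if the digest doesn't match -- so a
# malicious mirror can't substitute different data at the same address.
received = chunk  # imagine this arrived from an untrusted peer
assert chunk_id(received) == cid

tampered = b"malicious replacement bytes"
assert chunk_id(tampered) != cid  # tampering is always detectable
```

This is why "who is serving the bytes" matters much less on IPFS than it does with a plain rsync mirror.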

@nielx

nielx May 8, 2021

Member

Remark: I know it is outside your scope of control, but I find the ipfs website not particularly clear on explaining it. I think you have made it much more clear by simply mentioning bittorrent.

Maybe adding a link to the IPFS Course is useful? https://proto.school/course/ipfs


**Essentially:**

* We add repositories and artifacts to our "central" IPFS node (a Raspberry Pi 4 on my home network today)
* We update /ipns/hpkg.haiku-os.org (using our private key) to point to the latest "hash" of our repositories
* If you want to help host our repositories and artifacts, you *pin* /ipns/hpkg.haiku-os.org nightly, weekly, etc.
* People mirroring our repositories don't need a static IP; they only need to be able to expose port 4001 to the internet from their instance of IPFS
* Users can access our repositories, artifacts, etc. on **any** public IPFS gateway node
* Gateway nodes "pull in all of the chunks" from all of the users pinning /ipns/hpkg.haiku-os.org when requested, and serve them to users
* Haiku hosts a few dedicated gateway nodes globally; these act as regional gateways to the closest peers hosting the artifacts
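The steps above can be sketched with the go-ipfs (kubo) CLI. The directory path and key name here are illustrative assumptions, not Haiku's actual setup:

```shell
# --- On the "central" node: add the repo tree and move the IPNS pointer ---
CID=$(ipfs add -r -Q /srv/hpkg/master)     # -Q prints only the root CID
ipfs name publish --key=hpkg "/ipfs/$CID"  # repoint the IPNS name at it

# --- On a volunteer mirror: pin whatever the name currently resolves to ---
# (re-run from cron nightly or weekly to pick up the newest chunks)
ipfs pin add /ipns/hpkg.haiku-os.org

# Port 4001 (the libp2p swarm port) must be reachable from the internet
# so other peers and gateways can fetch chunks from this mirror.
```

Because pinning resolves the IPNS name at run time, mirrors automatically pick up whatever the central node last published.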

**Out of this we get:**

* Anyone can mirror Haiku (not just those with large amounts of free bandwidth or static IP addresses)
* We no longer have to worry about country IP blocks
  * Russia blocks our Digital Ocean VMs, for example.
  * Throw up some Russian gateways and have a few folks pin the data to mirror it.
* We get transparent deduplication
  * The repo today is ~140 GiB of data, but only ~95 GiB on disk to mirror.
* We get transparent repo signatures
  * As long as you trust the gateway node, the data is secure
* Users can just "access data", or mirror everything locally for a "hyper fast" software repository.

Cloudflare has been an early adopter and offers a free public IPFS gateway (with SSL): [cloudflare-ipfs.com](https://cloudflare-ipfs.com)

## The downsides

IPFS *is* a new technology, and there are a lot of pointy bits.

* Everyone needs to manually re-pin data constantly to mirror the latest repository chunks
* If few people are pinning the latest data, initial lookups can be a bit slow (3-4 minutes)
* IPFS has a steep learning curve for anyone mirroring; it takes time to figure out how to do what
* IPFS (the application) does have bugs. I've run into several.

## Summary

I have no idea if this will work, but the idea is great: it fixes "pretty much all" of our content distribution issues.

* We empower more tech-savvy individuals to leverage IPFS locally, while still offering "turn key" access to our software
* We decouple the "large amount of storage" from the "large amount of bandwidth", making it easier to find reasonable hosting solutions
* We enable getting Haiku's software into restrictive geographic regions
* It has built-in signature checking, ensuring some level of security
* It has deduplication built in, saving space

Time will tell if the implementation is viable and reliable enough. In the short term, our current, more traditional repositories
are not going away as long as we can continue to host data in our S3 buckets. I'm hopeful we can get enough
people playing with the new system to reduce S3 bandwidth and give us some time to investigate this alternative path.

A few people have mentioned adding native IPFS support to pkgman… this would enable Haiku to obtain updates
directly from a peer-to-peer network. That seems like an awesome potential future.

## What *you* can do

* Learn about IPFS
* [Try to pin Haiku's repositories at locations that don't have bandwidth caps](https://github.com/haiku/infrastructure/blob/master/docs/ipfs-pinning.md)
* Try out our IPFS gateway for Haiku updates: `pkgman add-repo https://de.hpkg.haiku-os.org/haiku/master/$(getarch)/current`
* Provide feedback below

1 comment on commit 9dd91bd

@nielx commented on 9dd91bd May 8, 2021

Member

I left some suggestions; it might make sense to make this a more active call to action. Also: is there a way to inspect the network and see how many nodes have pinned and/or are serving the content?
