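The Linux kernel ships with many built-in file systems, such as ext3, ext4, and xfs; you can list them with ls /lib/modules/`uname -r`/kernel/fs/. In addition, Linux provides FUSE, a framework for developing file systems that run in user space.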

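polefs-client, which we developed, is a custom user-space file system built on the jacobsa/fuse library. When pod1 consumes a file-storage volume pv1, kubelet calls the polefs-csi interface, which starts a polefs-client process for pv1 and mounts it on the specified directory.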

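1. Linux FUSE basics

The FUSE (Filesystem in Userspace) framework consists of three main parts:

FUSE kernel module: receives requests from the VFS, packages each I/O request, and passes it to the user-space program through the /dev/fuse character device. It is usually installed by default on Linux.
FUSE user-space module: usually an open-source library, such as jacobsa/fuse for Go, on top of which you build your own user-space file system.
Mount utility: the fusermount command. For example, to unmount you normally do not kill the polefs-fuse process directly; instead you run fusermount -u /mnt/volume1, which shuts down the mounting process.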

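The Linux VFS rests on two basic concepts, inode and dentry, both of which are widely documented.

inode: every file or directory has exactly one inode, which stores metadata about the file's or directory's data: file size, device identifier, file mode, extended attributes, link count, pointers to the disk blocks that hold the content, and so on. Given an inode, you can therefore locate the file's data.
dentry: a directory entry, cached by the kernel, that maps a file name to an inode. Without dentries, opening /mnt/volume1/1.txt would take the following steps:
Read the inode of / to find its data blocks, and look up the inode of the mnt entry there.
Go to the inode of /mnt, find its data blocks, and look up the inode of the volume1 entry.
Go to the inode of /mnt/volume1, find its data blocks, and look up the inode of 1.txt.
Use the inode of 1.txt to read its content.
With a cached dentry, the kernel finds the inode of 1.txt directly and reads the content right away.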
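
To make the lookup steps above concrete, here is a minimal sketch of path resolution with and without a cache. Everything in it (toyFS, resolve, the scans counter, the inode numbers) is hypothetical and exists only for illustration. It also simplifies the real kernel dentry cache, which caches individual path components rather than whole path strings.

```go
package main

import (
	"fmt"
	"strings"
)

// inode is a toy in-memory inode. For a directory, children maps entry
// names to inode numbers; for a regular file it is nil.
type inode struct {
	children map[string]uint64
}

// toyFS is a hypothetical stand-in for an on-disk file system.
type toyFS struct {
	inodes map[uint64]*inode
	dcache map[string]uint64 // full path -> inode number (simplified dentry cache)
	scans  int               // how many directories had to be scanned
}

const rootIno = 1

// newToyFS builds the tree / -> mnt -> volume1 -> 1.txt (inodes 1 to 4).
func newToyFS() *toyFS {
	return &toyFS{
		inodes: map[uint64]*inode{
			1: {children: map[string]uint64{"mnt": 2}},
			2: {children: map[string]uint64{"volume1": 3}},
			3: {children: map[string]uint64{"1.txt": 4}},
			4: {},
		},
		dcache: map[string]uint64{},
	}
}

// resolve returns the inode number for path. On a cache miss it walks the
// path one component at a time, scanning one directory per component.
func (fs *toyFS) resolve(path string) (uint64, error) {
	if ino, ok := fs.dcache[path]; ok {
		return ino, nil // cache hit: no directory is scanned
	}
	cur := uint64(rootIno)
	for _, name := range strings.Split(strings.Trim(path, "/"), "/") {
		if name == "" {
			continue
		}
		dir, ok := fs.inodes[cur]
		if !ok || dir.children == nil {
			return 0, fmt.Errorf("%s: not a directory", path)
		}
		fs.scans++
		next, ok := dir.children[name]
		if !ok {
			return 0, fmt.Errorf("%s: no such file or directory", path)
		}
		cur = next
	}
	fs.dcache[path] = cur // remember the result for the next lookup
	return cur, nil
}

func main() {
	fs := newToyFS()
	for i := 0; i < 2; i++ {
		ino, err := fs.resolve("/mnt/volume1/1.txt")
		fmt.Println(ino, err, fs.scans)
	}
}
```

The first call scans three directories; the second call is answered from the cache, so the scans counter does not grow.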

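2. FUSE protocol data format

The FUSE kernel module packs each request it receives from the VFS into a FUSE-format request, sends it to polefs-client through /dev/fuse, and then receives the FUSE-format response that polefs-client returns.

A FUSE request has two parts: Header and Payload.

Header: every request, such as open/read/write, starts with the header struct InHeader.
Len: total length of the request in bytes, header plus payload.
Opcode: type of the kernel request, for example the common OpReaddir/OpCreate operations (L45-L727).
Unique: unique request ID; it must match Unique in the response's OutHeader.
Nodeid: inode ID of the file the request targets; this is the key field.
Uid: user ID of the process that issued the request.
Gid: group ID of the process that issued the request.
Pid: process ID of the process that issued the request.
Padding: currently unused.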

type InHeader struct {
	Len     uint32
	Opcode  uint32
	Unique  uint64
	Nodeid  uint64
	Uid     uint32
	Gid     uint32
	Pid     uint32
	Padding uint32
}
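
As an illustration of the header layout, the sketch below encodes and decodes the 40-byte InHeader by hand. It is not how jacobsa/fuse does it (the library casts the buffer with unsafe.Pointer, as shown further down). The helper names encodeRequest and parseRequest are hypothetical, and little-endian byte order is an assumption: FUSE uses the host's native byte order, which is little-endian on x86-64 and arm64.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// InHeader mirrors the struct shown above.
type InHeader struct {
	Len     uint32
	Opcode  uint32
	Unique  uint64
	Nodeid  uint64
	Uid     uint32
	Gid     uint32
	Pid     uint32
	Padding uint32
}

// inHeaderSize is 4+4+8+8+4+4+4+4 bytes.
const inHeaderSize = 40

// encodeRequest (hypothetical helper) lays out header and payload the way
// a request would arrive from /dev/fuse. Len is computed, not taken from h.
func encodeRequest(h InHeader, payload []byte) []byte {
	buf := make([]byte, inHeaderSize+len(payload))
	le := binary.LittleEndian
	le.PutUint32(buf[0:], uint32(len(buf)))
	le.PutUint32(buf[4:], h.Opcode)
	le.PutUint64(buf[8:], h.Unique)
	le.PutUint64(buf[16:], h.Nodeid)
	le.PutUint32(buf[24:], h.Uid)
	le.PutUint32(buf[28:], h.Gid)
	le.PutUint32(buf[32:], h.Pid)
	copy(buf[inHeaderSize:], payload)
	return buf
}

// parseRequest (hypothetical helper) splits a raw message into its header
// and payload, validating the Len field against the buffer.
func parseRequest(buf []byte) (InHeader, []byte, error) {
	if len(buf) < inHeaderSize {
		return InHeader{}, nil, errors.New("message shorter than header")
	}
	le := binary.LittleEndian
	h := InHeader{
		Len:     le.Uint32(buf[0:]),
		Opcode:  le.Uint32(buf[4:]),
		Unique:  le.Uint64(buf[8:]),
		Nodeid:  le.Uint64(buf[16:]),
		Uid:     le.Uint32(buf[24:]),
		Gid:     le.Uint32(buf[28:]),
		Pid:     le.Uint32(buf[32:]),
		Padding: le.Uint32(buf[36:]),
	}
	if int(h.Len) < inHeaderSize || int(h.Len) > len(buf) {
		return InHeader{}, nil, errors.New("bad Len field")
	}
	return h, buf[inHeaderSize:h.Len], nil
}

func main() {
	// Opcode 1 is LOOKUP in the kernel protocol; the payload is the
	// NUL-terminated name to look up under directory inode 1.
	raw := encodeRequest(InHeader{Opcode: 1, Unique: 42, Nodeid: 1, Uid: 1000, Gid: 1000, Pid: 4321}, []byte("1.txt\x00"))
	h, payload, err := parseRequest(raw)
	fmt.Printf("%+v %q %v\n", h, payload, err)
}
```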

A request message from the FUSE kernel module is represented by the InMessage struct, which holds the Header and Payload:

type InMessage struct {
	remaining []byte
	storage   []byte
	size      int
}

func (m *InMessage) Header() *fusekernel.InHeader {
	return (*fusekernel.InHeader)(unsafe.Pointer(&m.storage[0]))
}

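After polefs-client finishes its business logic, the response it returns to the FUSE kernel module likewise has two parts: Header and Payload.

Header: the header of the response returned to the FUSE kernel module, represented by the OutHeader struct.
Len: total length of the response in bytes, header plus payload.
Error: response error code; 0 on success, otherwise a value corresponding to a system errno (L759-L768).
Unique: unique request ID; it must match Unique in the request's InHeader.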

type OutHeader struct {
	Len    uint32
	Error  int32
	Unique uint64
}
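
The following sketch shows how a reply could be laid out by hand. The helpers buildReply, replyLen, replyError, and replyUnique are hypothetical, and little-endian byte order is again an assumption. One detail worth knowing: on the wire the kernel expects the Error field to hold the negated errno, so ENOENT (2) is sent as -2.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"syscall"
)

// outHeaderSize is 4 (Len) + 4 (Error) + 8 (Unique) bytes.
const outHeaderSize = 16

// buildReply (hypothetical helper) serializes an OutHeader followed by the
// payload. Error replies carry no payload.
func buildReply(unique uint64, errno syscall.Errno, payload []byte) []byte {
	if errno != 0 {
		payload = nil
	}
	buf := make([]byte, outHeaderSize+len(payload))
	le := binary.LittleEndian
	le.PutUint32(buf[0:], uint32(len(buf)))       // Len: header + payload
	le.PutUint32(buf[4:], uint32(-int32(errno)))  // Error: negated errno, 0 on success
	le.PutUint64(buf[8:], unique)                 // Unique: copied from the request
	copy(buf[outHeaderSize:], payload)
	return buf
}

// The three accessors below read the header fields back out of a reply.
func replyLen(buf []byte) uint32    { return binary.LittleEndian.Uint32(buf[0:]) }
func replyError(buf []byte) int32   { return int32(binary.LittleEndian.Uint32(buf[4:])) }
func replyUnique(buf []byte) uint64 { return binary.LittleEndian.Uint64(buf[8:]) }

func main() {
	ok := buildReply(42, 0, []byte("hello"))
	bad := buildReply(43, syscall.ENOENT, nil)
	fmt.Println(replyLen(ok), replyError(ok), replyUnique(ok))
	fmt.Println(replyLen(bad), replyError(bad), replyUnique(bad))
}
```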

A response message returned to the FUSE kernel module is represented by the OutMessage struct:

type OutMessage struct {
	header fusekernel.OutHeader
	Sglist [][]byte
}

3. FUSE file system

With the request and response formats in place, we still need a filesystem server that listens on /dev/fuse and reads the requests sent by the kernel FUSE module.

The jacobsa/fuse library already provides a filesystem server that handles all I/O requests, so we do not need to write one ourselves; the code is at file_system.go#L97-L242. It handles the following groups of requests:

Remaining capacity and resource operations: StatFSOp and similar requests.
Inode and attribute lookups: LookUpInodeOp/GetInodeAttributesOp/SetInodeAttributesOp/ForgetInodeOp/BatchForgetOp.
Inode creation, for files, directories, and so on: MkDirOp/MkNodeOp/CreateFileOp/CreateLinkOp/CreateSymlinkOp.
Unlink (delete) and rename operations: RenameOp/RmDirOp/UnlinkOp.
Opening, reading, and releasing directories: OpenDirOp/ReadDirOp/ReleaseDirHandleOp.
Opening, reading, writing, syncing, flushing, and releasing files: OpenFileOp/ReadFileOp/WriteFileOp/SyncFileOp/FlushFileOp/ReleaseFileHandleOp.
Reading symlink inodes: ReadSymlinkOp.
Reading, setting, and removing extended attributes: RemoveXattrOp/GetXattrOp/ListXattrOp/SetXattrOp.
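
Conceptually, a server of this kind maps each decoded op to the matching method of the file system. The sketch below shows that idea with a type switch. It is a self-contained illustration, not the library's code: the op structs are heavily reduced stand-ins for the real fuseops types, and miniFileSystem, memFS, and dispatch are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Reduced, hypothetical op types; the real ones carry many more fields.
type LookUpInodeOp struct {
	Parent uint64
	Name   string
	Child  uint64 // filled in by the file system
}

type ReadFileOp struct {
	Inode  uint64
	Offset int64
	Dst    []byte // buffer to fill
	N      int    // bytes read, filled in by the file system
}

// miniFileSystem is a two-method stand-in for the full interface.
type miniFileSystem interface {
	LookUpInode(op *LookUpInodeOp) error
	ReadFile(op *ReadFileOp) error
}

var (
	errNotImplemented = errors.New("operation not implemented")
	errNotFound       = errors.New("no such file or directory")
)

// dispatch routes one decoded op to the matching file system method.
func dispatch(fs miniFileSystem, op interface{}) error {
	switch o := op.(type) {
	case *LookUpInodeOp:
		return fs.LookUpInode(o)
	case *ReadFileOp:
		return fs.ReadFile(o)
	default:
		return errNotImplemented
	}
}

// memFS is a flat in-memory file system with a single directory.
type memFS struct {
	names map[string]uint64
	data  map[uint64][]byte
}

func newMemFS() *memFS {
	return &memFS{
		names: map[string]uint64{"1.txt": 2},
		data:  map[uint64][]byte{2: []byte("hello")},
	}
}

func (m *memFS) LookUpInode(op *LookUpInodeOp) error {
	ino, ok := m.names[op.Name]
	if !ok {
		return errNotFound
	}
	op.Child = ino
	return nil
}

func (m *memFS) ReadFile(op *ReadFileOp) error {
	content, ok := m.data[op.Inode]
	if !ok {
		return errNotFound
	}
	if op.Offset < 0 || op.Offset > int64(len(content)) {
		op.N = 0 // reading outside the file yields no data
		return nil
	}
	op.N = copy(op.Dst, content[op.Offset:])
	return nil
}

func main() {
	fs := newMemFS()
	look := &LookUpInodeOp{Parent: 1, Name: "1.txt"}
	if err := dispatch(fs, look); err != nil {
		fmt.Println("lookup:", err)
		return
	}
	read := &ReadFileOp{Inode: look.Child, Dst: make([]byte, 16)}
	if err := dispatch(fs, read); err != nil {
		fmt.Println("read:", err)
		return
	}
	fmt.Printf("%q\n", read.Dst[:read.N])
}
```

Each method receives the op as a pointer and writes its results back into it, which is the same calling shape the FileSystem interface uses.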

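Although the fuse library provides the basic filesystem server framework for handling every request from the FUSE kernel module, we still have to write our own user-space filesystem to process the concrete requests.

The fuse library defines the filesystem interface at file_system.go#L26-L71, so polefs-client has to implement its methods: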

type FileSystem interface {
	StatFS(context.Context, *fuseops.StatFSOp) error
	LookUpInode(context.Context, *fuseops.LookUpInodeOp) error
	GetInodeAttributes(context.Context, *fuseops.GetInodeAttributesOp) error
	SetInodeAttributes(context.Context, *fuseops.SetInodeAttributesOp) error
	ForgetInode(context.Context, *fuseops.ForgetInodeOp) error
	BatchForget(context.Context, *fuseops.BatchForgetOp) error
	MkDir(context.Context, *fuseops.MkDirOp) error
	MkNode(context.Context, *fuseops.MkNodeOp) error
	CreateFile(context.Context, *fuseops.CreateFileOp) error
	CreateLink(context.Context, *fuseops.CreateLinkOp) error
	CreateSymlink(context.Context, *fuseops.CreateSymlinkOp) error
	Rename(context.Context, *fuseops.RenameOp) error
	RmDir(context.Context, *fuseops.RmDirOp) error
	Unlink(context.Context, *fuseops.UnlinkOp) error
	OpenDir(context.Context, *fuseops.OpenDirOp) error
	ReadDir(context.Context, *fuseops.ReadDirOp) error
	ReleaseDirHandle(context.Context, *fuseops.ReleaseDirHandleOp) error
	OpenFile(context.Context, *fuseops.OpenFileOp) error
	ReadFile(context.Context, *fuseops.ReadFileOp) error
	WriteFile(context.Context, *fuseops.WriteFileOp) error
	SyncFile(context.Context, *fuseops.SyncFileOp) error
	FlushFile(context.Context, *fuseops.FlushFileOp) error
	ReleaseFileHandle(context.Context, *fuseops.ReleaseFileHandleOp) error
	ReadSymlink(context.Context, *fuseops.ReadSymlinkOp) error
	RemoveXattr(context.Context, *fuseops.RemoveXattrOp) error
	GetXattr(context.Context, *fuseops.GetXattrOp) error
	ListXattr(context.Context, *fuseops.ListXattrOp) error
	SetXattr(context.Context, *fuseops.SetXattrOp) error
	Fallocate(context.Context, *fuseops.FallocateOp) error

	// Regard all inodes (including the root inode) as having their lookup counts
	// decremented to zero, and clean up any resources associated with the file
	// system. No further calls to the file system will be made.
	Destroy()
}

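The basic logic of the polefs-client process is:

Call the meta API to store, delete, and update inode and dentry metadata.
For operations on file content, also read, update, or delete the actual file data in S3.
To improve performance, keep metadata read from the meta API in an LRU cache so that not every operation needs a network request.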
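
The caching step can be sketched as a small LRU cache in front of the metadata lookup. This is only an illustration under stated assumptions, not the polefs-client implementation: attrCache, inodeAttr, and the fetch callback (standing in for a meta API call) are hypothetical names, and the cache is not safe for concurrent use without a mutex.

```go
package main

import (
	"container/list"
	"fmt"
)

// inodeAttr is a hypothetical, reduced set of inode attributes.
type inodeAttr struct {
	Size uint64
	Mode uint32
}

type cacheEntry struct {
	ino  uint64
	attr inodeAttr
}

// attrCache keeps the most recently used attributes in memory and falls
// back to fetch (standing in for a meta API call) on a miss.
type attrCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[uint64]*list.Element // inode -> element in order
	fetch    func(ino uint64) (inodeAttr, error)
	misses   int
}

func newAttrCache(capacity int, fetch func(uint64) (inodeAttr, error)) *attrCache {
	return &attrCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[uint64]*list.Element),
		fetch:    fetch,
	}
}

// Get returns the cached attributes, fetching and caching them on a miss
// and evicting the least recently used entry when the cache is full.
func (c *attrCache) Get(ino uint64) (inodeAttr, error) {
	if el, ok := c.items[ino]; ok {
		c.order.MoveToFront(el)
		return el.Value.(*cacheEntry).attr, nil
	}
	c.misses++
	attr, err := c.fetch(ino)
	if err != nil {
		return inodeAttr{}, err
	}
	c.items[ino] = c.order.PushFront(&cacheEntry{ino: ino, attr: attr})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*cacheEntry).ino)
	}
	return attr, nil
}

// Invalidate drops an entry, for example after the inode was updated.
func (c *attrCache) Invalidate(ino uint64) {
	if el, ok := c.items[ino]; ok {
		c.order.Remove(el)
		delete(c.items, ino)
	}
}

func main() {
	cache := newAttrCache(2, func(ino uint64) (inodeAttr, error) {
		// A real client would issue a network request here.
		return inodeAttr{Size: ino * 10, Mode: 0644}, nil
	})
	for _, ino := range []uint64{1, 2, 1, 3, 2} {
		attr, _ := cache.Get(ino)
		fmt.Println(ino, attr.Size, cache.misses)
	}
}
```

Writes have to call Invalidate (or update the entry) so that the cache does not serve stale metadata after an inode changes.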

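4. Conclusion

Linux FUSE provides a framework for developing custom user-space file systems. We use jacobsa/fuse, a well-regarded community Go library, so we only need to implement the corresponding interface methods.

The polefs-client process is a FUSE user-space process: it reads all I/O requests from /dev/fuse, runs its business logic, and returns the responses. When several application pods on the current node consume the same pv1, only one polefs-client process is started, mounted on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pv1/globalmount. That directory is then bind-mounted into each pod's directory, and finally the container runtime containerd maps it to the data directory inside the pod's containers, for example /var/www/html. With this mechanism, every application pod on the node can use all files and directories under /var/www/html. This is a key part of how container storage works internally.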
