tests: move spell check tool to main repo

Move tool as part of static checks migration.

Fixes #8187

Signed-off-by: Bo Chen <chen.bo@intel.com>
Signed-off-by: Carlos Venegas <jos.c.venegas.munoz@intel.com>
Signed-off-by: Chao Wu <chaowu@linux.alibaba.com>
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
Signed-off-by: Dan Middleton <dan.middleton@intel.com>
Signed-off-by: Derek Lee <derlee@redhat.com>
Signed-off-by: Eric Ernst <eric.ernst@intel.com>
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Signed-off-by: Gabriela Cervantes <gabriela.cervantes.tellez@intel.com>
Signed-off-by: Graham Whaley <graham.whaley@intel.com>
Signed-off-by: Hui Zhu <teawater@antfin.com>
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Signed-off-by: Jimmy Xu <xjmmyshcn@gmail.com>
Signed-off-by: Liu Xiaodong <xiaodong.liu@intel.com>
Signed-off-by: Mikko Ylinen <mikko.ylinen@intel.com>
Signed-off-by: Shiming Zhang <wzshiming@foxmail.com>
Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
Signed-off-by: Wainer dos Santos Moschetta <wainersm@redhat.com>
Chelsea Mafrica 2023-11-21 17:47:16 -08:00
parent 8ad433d4ad
commit 7f3c12f1dd
12 changed files with 1406 additions and 0 deletions

View File

@ -0,0 +1,178 @@
# Spell check tool
## Overview
The `kata-spell-check.sh` tool is used to check a markdown file for
typographical (spelling) mistakes.
## Approach
The spell check tool is based on
[`hunspell`](https://github.com/hunspell/hunspell). It uses standard Hunspell
English dictionaries and supplements these with a custom Hunspell dictionary.
The document is cleaned of several entities before the spell-check begins.
These entities include the following:
- URLs
- Email addresses
- Code blocks
- Most punctuation
- GitHub userids
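As a rough illustration of this clean-up step, the sketch below removes URLs, email addresses and GitHub userids from a line of text using `sed`. The `clean_text` function name is hypothetical and the patterns are simplified assumptions; the filters the tool actually applies may differ.

```sh
#!/bin/bash
# Hypothetical helper (not part of the tool): strip URLs, email
# addresses and GitHub @userids from the text read on stdin.
clean_text()
{
	sed -E \
		-e 's#https?://[^[:space:]()]+##g' \
		-e 's/[a-zA-Z0-9.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+//g' \
		-e 's/@[a-zA-Z0-9]+//g'
}

echo "Ask @alice or mail bob@example.com, see https://example.com/docs for details" | clean_text
```

Email addresses are removed before userids so that the userid pattern does not leave half an address behind.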
## Custom words
A custom dictionary is required to accept specific words that are either well
understood by the community or are defined in various document files, but do
not appear in standard dictionaries. The custom dictionaries allow those words
to be accepted as correct. The following lists common examples of such words:
- Abbreviations
- Acronyms
- Company names
- Product names
- Project names
- Technical terms
## Spell check a document file
```sh
$ ./kata-spell-check.sh check /path/to/file
```
> **Note:** If you have made local edits to the dictionaries, you must
> [re-create the master dictionary files](#create-the-master-dictionary-files),
> as documented in the [Adding a new word](#adding-a-new-word) section,
> for your local edits to take effect.
## Other options
List all available options and commands:
```sh
$ ./kata-spell-check.sh -h
```
## Technical details
### Hunspell dictionary format
A Hunspell dictionary comprises two text files:
- A word list file
This file defines a list of words (one per line). The list includes optional
references to one or more rules defined in the rules file as well as optional
comments. Specify fixed words (e.g. company names) verbatim. Enter "normal"
words in their root form.
The root form of a "normal" word is the simplest and shortest form of that
word. For example, the following words are all formed from the root
word "computer":
- Computers
- Computer's
- Computing
- Computed
Each word in the previous list is constructed from the word "computer" by
applying one or both of the following manipulations:
- Remove one or more characters from the end of the word.
- Add a new ending.
Therefore, you list the root word "computer" in the word list file.
- A rules file
This file defines named manipulations to apply to root words to form new
words. For example, rules that make a root word plural.
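To make the relationship between a root word and the rules concrete, the following sketch expands a word list entry such as `computer/AB` into the forms that a plural rule (`A`) and a possessive rule (`B`) would accept. It is a simplified emulation under stated assumptions: the `expand_entry` function is hypothetical, and only "append `s`" and "append `'s`" are modeled. Hunspell itself applies the real rules from the rules file.

```sh
#!/bin/bash
# Hypothetical helper (not part of the tool): print the forms accepted
# for a "root/FLAGS" entry, modelling only two simplified rules:
#   A: plural     (append "s")
#   B: possession (append "'s")
expand_entry()
{
	local entry="$1"
	local root="${entry%%/*}"
	local flags=""

	# Only entries containing a slash carry rule references.
	case "$entry" in
		*/*) flags="${entry#*/}" ;;
	esac

	echo "$root"
	case "$flags" in *A*) echo "${root}s" ;; esac
	case "$flags" in *B*) echo "${root}'s" ;; esac
}

expand_entry "computer/AB"
```

An entry without a slash, such as `kata`, yields only the word itself.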
### Source files
The rules file and the word list file for the custom dictionary are generated
from "source" fragment files in the [`data`](data/) directory.
All the fragment files allow comments using the hash (`#`) comment
symbol and all files contain a comment header explaining their content.
#### Word list file fragments
The `*.txt` files are word list file fragments. Splitting the word list
into fragments makes updates easier and clearer as each fragment is a
grouping of related terms. The name of the file gives a clue as to the
contents but the comments at the top of each file provide further
detail.
Every line that does not start with a comment symbol contains a single
word. An optional comment for a word may appear after the word and is
separated from the word by whitespace followed by the comment symbol:
```
word # This is a comment explaining this particular word list entry.
```
You *may* suffix each word by a forward slash followed by one or more
upper-case letters. Each letter refers to a rule name in the rules file:
```
word/AC # This word references the 'A' and 'C' rules.
```
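The fragment format can be illustrated with a short sketch that reduces fragment lines to bare entries: it drops comment lines and blank lines, then keeps only the first whitespace-separated field so that inline comments disappear. The `fragment_entries` function name and the sample input are made up for illustration.

```sh
#!/bin/bash
# Hypothetical helper (not part of the tool): read fragment lines on
# stdin and print one "word" or "word/RULES" entry per line.
fragment_entries()
{
	grep -v '^#' | grep -v '^$' | awk '{print $1}'
}

printf '%s\n' \
	'# Description: sample fragment.' \
	'' \
	'word      # An inline comment.' \
	'config/AB # References rules A and B.' |
	fragment_entries
```

Any rule references stay attached to the word because they are part of the first field.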
#### Rules file
The [rules file](data/rules.aff) contains a set of general rules that can be
applied to one or more root words in the word list files. You can make
comments in the rules file.
For an explanation of the format of this file see
[`man 5 hunspell`](http://www.manpagez.com/man/5/hunspell)
([source](https://github.com/hunspell/hunspell/blob/master/man/hunspell.5)).
## Adding a new word
### Update the word list fragment
To add a new word to the dictionary:
- Check that you really do need to add the word.
Is the word valid and correct? If the word is a project, product,
or company name, is the capitalization correct?
- Add the new word to the appropriate [word list fragment file](data).
Specifically, if it is a general word, add the *root* of the word to
the appropriate fragment file.
- To reference rules, add a `/` suffix to the word followed by the letter
for each rule to apply.
### Optionally update the rules file
It should not generally be necessary to update the rules file since it
already contains rules for most scenarios. However, if you need to
update the file, [read the documentation carefully](#rules-file).
### Create the master dictionary files
Every time you change the dictionary files you must recreate the master
dictionary files:
```sh
$ ./kata-spell-check.sh make-dict
```
As a convenience, [checking a file](#spell-check-a-document-file)
automatically creates the master dictionary files if they do not already exist.
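Conceptually, the master word list is the sorted, de-duplicated set of entries from all the fragments, preceded by a line giving the number of entries (the tool's script notes that Hunspell requires this header). The sketch below shows the idea using a temporary directory; the `build_word_list` function, the paths and the sample fragments are hypothetical.

```sh
#!/bin/bash
# Hypothetical helper (not part of the tool): print a word list built
# from every "*.txt" fragment in the specified directory.
build_word_list()
{
	local fragment_dir="$1"
	local words

	words=$(cat "$fragment_dir"/*.txt |
		grep -v '^#' |
		grep -v '^$' |
		awk '{print $1}' |
		sort -u)

	# First line: number of entries. Remaining lines: the entries.
	echo "$words" | wc -l | awk '{print $1}'
	echo "$words"
}

tmp_dir=$(mktemp -d)
printf '%s\n' '# acronyms' 'API/AB' 'CLI/AB' > "$tmp_dir/acronyms.txt"
printf '%s\n' '# general' 'repo/A' 'API/AB # duplicate entry' > "$tmp_dir/main.txt"

build_word_list "$tmp_dir"
rm -rf "$tmp_dir"
```

Because the count is derived from the de-duplicated list, an entry that appears in more than one fragment is counted once.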
### Test the changes
You must test any changes to the [word list file
fragments](#word-list-file-fragments) or the [rules file](#rules-file)
by doing the following:
1. Recreate the [master dictionary files](#create-the-master-dictionary-files).
1. [Run the spell checker](#spell-check-a-document-file) on a file containing the
words you have added to the dictionary.

View File

@ -0,0 +1,123 @@
# Copyright (c) 2019 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#
# Description: List of acronyms and abbreviations.
ACPI/AB
ACS/AB
API/AB
AUFS # Another Union FS
AWS/AB
BDF/AB
CFS/AB
CLI/AB
CNI/AB
CNM/AB
CPUID/AB
CRI/AB
CVE/AB
DAX/AB
DinD/B # Docker in Docker
dind/B
DMA/AB
DPDK/AB
FaaS/B # Function as a Service
FS/AB
fs/B # For terms like "virtio-fs"
GCE/AB
GOPATH/AB
GPG/AB
GPU/AB
gRPC/AB
GSC/AB
GVT/AB
IaaS/B # Infrastructure as a Service
IOMMU/AB
IoT/AB # Internet of Things
IOV/AB
JSON/AB
k8s/B
KCSA/AB
KSM/AB
KVM/AB
LTS/AB
MACVTAP/AB
mem/B # For terms like "virtio-mem"
memdisk/B
MDEV/AB
NEMU/AB
NIC/AB
NVDIMM/AB
OCI/AB
OVMF/AB
OverlayFS/B
PaaS/B # Platform as a Service
PCDIMM/AB
PCI/AB
PCIe/AB
PID/AB
pmem/B # persistent memory
PNG/AB
POD/AB
PR/AB
PSS/AB
QA/AB
QAT/AB
QEMU/AB
RBAC/AB
RDMA/AB
RNG/AB
SaaS/B # Software as a Service
SCSI/AB
SDK/AB
seccomp # secure computing mode
SHA/AB
SPDX/AB
SRIOV/AB
SVG/AB
TBD/AB
TOC/AB
TOML/AB
TTY/AB
UI/AB
UTS/AB
UUID/AB
vCPU/AB
VETH/AB
VF/AB
VFIO/AB
VGPU/AB
vhost/AB
VHOST/AB
virtio/AB
VirtIO/AB
Virtio-fs/AB
Virtio-mem/AB
VLAN/AB
VM/AB
VMCache/AB
vmm
VMM/AB
VMX/AB
VPP/AB
VSOCK/AB
VSS/AB
WIP/AB # Work In Progress
WRT/AB # With Respect To
XIP/AB
YAML/AB
irq/AB
mmio/AB
APIC
msg/AB
UDS
dbs # Dragonball Sandbox
TDX
tdx
mptable
fdt
gic
msr
cpuid
pio

View File

@ -0,0 +1,21 @@
# Copyright (c) 2019 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#
# Description: List of architectures.
# Architectures
aarch64/B
amd64/B
arm64/B
ppc64el/B
ppc64le/B
s390x/B
x86_64/B
x86/B
# Micro architecture names
Haswell/B
Ivybridge/B

View File

@ -0,0 +1,18 @@
# Copyright (c) 2019 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#
# Description: List of Linux Distributions.
CentOS/B
Debian/B
EulerOS/B
Fedora/B
macOS/B
MacOS/B
minikube/B
openSUSE/B
OpenSUSE/B
RHEL/B
SLES/B
Ubuntu/B

View File

@ -0,0 +1,25 @@
# Copyright (c) 2019 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#
# Description: Names of commands, files and packages.
#
# Notes: These *should* strictly be placed in backticks but alas this
# doesn't always happen.
#
# References: https://github.com/kata-containers/kata-containers/blob/main/docs/Documentation-Requirements.md#files-and-command-names
cgroup/AB
coredump/A
cpuset/AB
Dockerfile/AB
init/AB
initramfs/AB
initrd/AB
netns/AB
rootfs/AB
stderr/AB
stdin/AB
stdout/AB
syslog/AB
Vagrantfile/B

View File

@ -0,0 +1,13 @@
# Copyright (c) 2019 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#
# Description: List of words that are missing from Hunspell dictionaries
# on some platforms.
committer/AB # Not available on Ubuntu 16.04 or CentOS 7
plugin/AB # Not available on Ubuntu 16.04
regexp/AB # Not available on Ubuntu 16.04
screenshot/AB # Not available on Ubuntu 16.04 or CentOS 7
tarball/AB # Not available on Ubuntu 16.04
uninstall # Not available on Ubuntu 16.04

View File

@ -0,0 +1,135 @@
# Copyright (c) 2019 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#
# Description: General word list.
ack/A
arg # Argument
auditability
backend
backport/ACD
backtick/AB
backtrace
bootloader/AB
centric/B
checkbox/A
chipset/AB
codebase
commandline
config/AB
crypto # Cryptography
cryptoprocessor/AB
DaemonSet/AB
deliverable/AB
dev
devicemapper/B
deploy
dialer
dialog/A
Diffie/B # Diffie-Hellman (cryptography)
distro/AB
emptydir/A
enablement/AB
entrypoint/AB
ethernet
filename/AB
filesystem/AB
freeform
goroutine/AB
hostname/AB
hotplug/ACD
howto/AB
HugePage/AB
hugepage/AB
Hyp
hypercall/A
hypervisor/AB
implementer/A
implementor/A
Infiniband
iodepth/A
ioengine/A
iptables
Itanium/AB
kata
Kat/AB # "Kat Herding Team" :)
keypair/A
lifecycle/A
linter/AB
logfile/A
Longterm
longterm
loopback
memcpy/A
mergeable
metadata
microcontroller/AB
miniOS
mmap/AB
nack/AB
namespace/ABCD
netlink
NVIDIA/A
nvidia/A
onwards
OpenAPI
OS/AB
parallelize/AC
passthrough
patchset/A
pluggable/AB
portmapper/AB
portmapping/A
pre
prefetch/ACD
prestart
programmatically
proxying
Quadro
ramdisk/A
readonly
rebase/ACD
refactor/ACD
remediate
repo/A
runtime/AB
scalability
serverless
signoff/A
stalebot/B
startup
subdirectory/A
swappiness
sysctl/AB
teardown
templating
timestamp/AB
tracability
ttRPC/B
udev/B
uevent/AB
unbootable
uncomment/ACD
unported
unskip/AC
untrusted
untrusting
userid/AB
userspace/B
vendored
vendoring
versioning
vGPU
virtualization
virtualized
webhook/AB
whitespace
workflow/A
Xeon/A
yaml
upcall
Upcall
ioctl/A
struct/A # struct in Rust
Struct/A

View File

@ -0,0 +1,101 @@
# Copyright (c) 2019-2023 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#
# Description: Names of projects, companies and services.
Ansible/B
AppArmor/B
blogbench/B
BusyBox/B
Cassandra/B
ccloudvm/B
codecov/B
containerd/B
cnn/B
cri-o/B
CRI-O/B
DevStack/B
Django/B
Docker/B
dracut/B
Dragonball/B
Facebook/B
fio/B
Fluentd/B
Frakti/B
Git/B
GitHub/B
GoDoc/B
golang/B
Golang/B
Grafana/B
Gramine/B
Huawei/B
Inclavare/B
iPerf/B
IPerf/B
Istio/B
Jaeger/B
Jenkins/B
Jupyter/B
journald/B
jq/B
Kata/B
Kibana/B
Kubelet/B
Kubernetes/B
Launchpad/B
LevelDB/B
libcontainer/B
libelf/B
libvirt/B
Linkerd/B
LinuxONE/B
Logrus/B
Logstash/B
Mellanox/B
Minikube/B
MITRE/B
musl/B
Netlify/B
Nginx/B
OpenCensus/B
OpenPGP/B
OpenShift/B
OpenSSL/B
OpenStack/B
OpenTelemetry/B
OpenTracing/B
osbuilder/B
packagecloud/B
Pandoc/B
Podman/B
PullApprove/B
Pytorch/B
QuickAssist/B
R/B
raytracer/B
rkt/B
runc/B
runV/B
rustlang/B
Rustlang/B
SELinux/B
SemaphoreCI/B
snapcraft/B
snapd/B
SQLite/B
SUSE/B
Sysbench/B
systemd/B
tf/B
TravisCI/B
Tokio/B
Vexxhost/B
virtcontainers/B
VMWare/B
vSphere/B
Yamux/B
yq/B
Zun/B

View File

@ -0,0 +1,36 @@
#
# Copyright (c) 2019 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#
SET UTF-8
# Add the following characters so they are accepted as part of a word
WORDCHARS 0123456789'
# Disable hyphenation
BREAK 0
# plural
SFX A N 3
SFX A 0 s [^x]
SFX A 0 es x
SFX A y ies
# possession
SFX B N 1
SFX B 0 's
# past tense
SFX C N 4
SFX C 0 d e
SFX C 0 ed [rt]
SFX C 0 ped p
SFX C 0 ged g
# present continuous
SFX D N 3
SFX D 0 ging g
SFX D 0 ing [rt]
SFX D e ing e

View File

@ -0,0 +1,36 @@
#
# Copyright (c) 2019 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#
SET UTF-8
# Add the following characters so they are accepted as part of a word
WORDCHARS 0123456789'
# Disable hyphenation
BREAK 0
# plural
SFX A N 3
SFX A 0 s [^x]
SFX A 0 es x
SFX A y ies
# possession
SFX B N 1
SFX B 0 's
# past tense
SFX C N 4
SFX C 0 d e
SFX C 0 ed [rt]
SFX C 0 ped p
SFX C 0 ged g
# present continuous
SFX D N 3
SFX D 0 ging g
SFX D 0 ing [rt]
SFX D e ing e

View File

@ -0,0 +1,384 @@
383
ACPI/AB
ACS/AB
API/AB
APIC
AUFS
AWS/AB
Ansible/B
AppArmor/B
BDF/AB
BusyBox/B
CFS/AB
CLI/AB
CNI/AB
CNM/AB
CPUID/AB
CRI-O/B
CRI/AB
CVE/AB
Cassandra/B
CentOS/B
DAX/AB
DMA/AB
DPDK/AB
DaemonSet/AB
Debian/B
DevStack/B
Diffie/B
DinD/B
Django/B
Docker/B
Dockerfile/AB
Dragonball/B
EulerOS/B
FS/AB
FaaS/B
Facebook/B
Fedora/B
Fluentd/B
Frakti/B
GCE/AB
GOPATH/AB
GPG/AB
GPU/AB
GSC/AB
GVT/AB
Git/B
GitHub/B
GoDoc/B
Golang/B
Grafana/B
Gramine/B
Haswell/B
Huawei/B
HugePage/AB
Hyp
IOMMU/AB
IOV/AB
IPerf/B
IaaS/B
Inclavare/B
Infiniband
IoT/AB
Istio/B
Itanium/AB
Ivybridge/B
JSON/AB
Jaeger/B
Jenkins/B
Jupyter/B
KCSA/AB
KSM/AB
KVM/AB
Kat/AB
Kata/B
Kibana/B
Kubelet/B
Kubernetes/B
LTS/AB
Launchpad/B
LevelDB/B
Linkerd/B
LinuxONE/B
Logrus/B
Logstash/B
Longterm
MACVTAP/AB
MDEV/AB
MITRE/B
MacOS/B
Mellanox/B
Minikube/B
NEMU/AB
NIC/AB
NVDIMM/AB
NVIDIA/A
Netlify/B
Nginx/B
OCI/AB
OS/AB
OVMF/AB
OpenAPI
OpenCensus/B
OpenPGP/B
OpenSSL/B
OpenSUSE/B
OpenShift/B
OpenStack/B
OpenTelemetry/B
OpenTracing/B
OverlayFS/B
PCDIMM/AB
PCI/AB
PCIe/AB
PID/AB
PNG/AB
POD/AB
PR/AB
PSS/AB
PaaS/B
Pandoc/B
Podman/B
PullApprove/B
Pytorch/B
QA/AB
QAT/AB
QEMU/AB
Quadro
QuickAssist/B
R/B
RBAC/AB
RDMA/AB
RHEL/B
RNG/AB
Rustlang/B
SCSI/AB
SDK/AB
SELinux/B
SHA/AB
SLES/B
SPDX/AB
SQLite/B
SRIOV/AB
SUSE/B
SVG/AB
SaaS/B
SemaphoreCI/B
Struct/A
Sysbench/B
TBD/AB
TDX
TOC/AB
TOML/AB
TTY/AB
Tokio/B
TravisCI/B
UDS
UI/AB
UTS/AB
UUID/AB
Ubuntu/B
Upcall
VETH/AB
VF/AB
VFIO/AB
VGPU/AB
VHOST/AB
VLAN/AB
VM/AB
VMCache/AB
VMM/AB
VMWare/B
VMX/AB
VPP/AB
VSOCK/AB
VSS/AB
Vagrantfile/B
Vexxhost/B
VirtIO/AB
Virtio-fs/AB
Virtio-mem/AB
WIP/AB
WRT/AB
XIP/AB
Xeon/A
YAML/AB
Yamux/B
Zun/B
aarch64/B
ack/A
amd64/B
arg
arm64/B
auditability
backend
backport/ACD
backtick/AB
backtrace
blogbench/B
bootloader/AB
ccloudvm/B
centric/B
cgroup/AB
checkbox/A
chipset/AB
cnn/B
codebase
codecov/B
commandline
committer/AB
config/AB
containerd/B
coredump/A
cpuid
cpuset/AB
cri-o/B
crypto
cryptoprocessor/AB
dbs
deliverable/AB
deploy
dev
devicemapper/B
dialer
dialog/A
dind/B
distro/AB
dracut/B
emptydir/A
enablement/AB
entrypoint/AB
ethernet
fdt
filename/AB
filesystem/AB
fio/B
freeform
fs/B
gRPC/AB
gic
golang/B
goroutine/AB
hostname/AB
hotplug/ACD
howto/AB
hugepage/AB
hypercall/A
hypervisor/AB
iPerf/B
implementer/A
implementor/A
init/AB
initramfs/AB
initrd/AB
ioctl/A
iodepth/A
ioengine/A
iptables
irq/AB
journald/B
jq/B
k8s/B
kata
keypair/A
libcontainer/B
libelf/B
libvirt/B
lifecycle/A
linter/AB
logfile/A
longterm
loopback
macOS/B
mem/B
memcpy/A
memdisk/B
mergeable
metadata
microcontroller/AB
miniOS
minikube/B
mmap/AB
mmio/AB
mptable
msg/AB
msr
musl/B
nack/AB
namespace/ABCD
netlink
netns/AB
nvidia/A
onwards
openSUSE/B
osbuilder/B
packagecloud/B
parallelize/AC
passthrough
patchset/A
pio
pluggable/AB
plugin/AB
pmem/B
portmapper/AB
portmapping/A
ppc64el/B
ppc64le/B
pre
prefetch/ACD
prestart
programmatically
proxying
ramdisk/A
raytracer/B
readonly
rebase/ACD
refactor/ACD
regexp/AB
remediate
repo/A
rkt/B
rootfs/AB
runV/B
runc/B
runtime/AB
rustlang/B
s390x/B
scalability
screenshot/AB
seccomp
serverless
signoff/A
snapcraft/B
snapd/B
stalebot/B
startup
stderr/AB
stdin/AB
stdout/AB
struct/A
subdirectory/A
swappiness
sysctl/AB
syslog/AB
systemd/B
tarball/AB
tdx
teardown
templating
tf/B
timestamp/AB
tracability
ttRPC/B
udev/B
uevent/AB
unbootable
uncomment/ACD
uninstall
unported
unskip/AC
untrusted
untrusting
upcall
userid/AB
userspace/B
vCPU/AB
vGPU
vSphere/B
vendored
vendoring
versioning
vhost/AB
virtcontainers/B
virtio/AB
virtualization
virtualized
vmm
webhook/AB
whitespace
workflow/A
x86/B
x86_64/B
yaml
yq/B

View File

@ -0,0 +1,336 @@
#!/bin/bash
# Copyright (c) 2019 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#
# Description: spell-check utility.
[ -n "$DEBUG" ] && set -x
set -o errexit
set -o pipefail
set -o nounset
# Ensure we spell check in English
LANG=C
LC_ALL=C
script_name=${0##*/}
if [ "$(uname -s)" == "Darwin" ]
then
# Hunspell dictionaries are not easily available
# on this platform it seems.
echo "INFO: $script_name: OSX not supported - exiting"
exit 0
fi
self_dir=$(dirname "$(readlink -f "$0")")
cidir="${self_dir}/../../.ci"
source "${cidir}/lib.sh"
# Directory containing word lists.
#
# Each file in this directory must:
#
# - Have the ".txt" extension.
# - Contain one word per line.
#
# Additionally, the files may contain blank lines and comments
# (lines beginning with '#').
KATA_DICT_FRAGMENT_DIR=${KATA_DICT_FRAGMENT_DIR:-data}
KATA_DICT_NAME="${KATA_DICT_NAME:-kata-dictionary}"
# Name of dictionary file suitable for using with hunspell(1)
# as a personal dictionary.
KATA_DICT_FILE="${KATA_DICT_FILE:-${KATA_DICT_NAME}.dic}"
KATA_RULES_FILE="${KATA_RULES_FILE:-${KATA_DICT_FILE/.dic/.aff}}"
# command to remove code from markdown (inline and blocks)
strip_cmd="${cidir}/kata-doc-to-script.sh"
fragment_dir="${self_dir}/${KATA_DICT_FRAGMENT_DIR}"
# Name of file containing dictionary rules that apply to the
# KATA_DICT_FILE word list.
rules_file_name="rules.aff"
# Command to spell check a file
spell_check_cmd="${KATA_SPELL_CHECK_CMD:-hunspell}"
# Command to convert a markdown file into plain text
md_convert_tool="${KATA_MARKDOWN_CONVERT_TOOL:-pandoc}"
KATA_DICT_DIR="${KATA_DICT_DIR:-${self_dir}}"
dict_file="${KATA_DICT_DIR}/${KATA_DICT_FILE}"
rules_file="${KATA_DICT_DIR}/${KATA_RULES_FILE}"
# Hunspell refers to custom dictionaries by their path followed by the name of
# the dictionary (without the file extension).
kata_dict_ref="${KATA_DICT_DIR}/${KATA_DICT_NAME}"
# All project documentation must be written in English,
# with American English taking priority.
#
# We also use a custom dictionary which has to be specified by its
# "directory and name prefix" and which must also be the first specified
# dictionary.
dict_languages="${kata_dict_ref},en_US,en_GB"
make_dictionary()
{
[ -d "$fragment_dir" ] || die "invalid fragment directory"
[ -z "$dict_file" ] && die "missing dictionary output file name"
# Note: the first field is extracted to allow for inline
# comments in each fragment. For example:
#
# word # this text describes why the word is in the dictionary.
#
local dict
dict=$(cat "$fragment_dir"/*.txt |\
grep -v '^\#' |\
grep -v '^$' |\
awk '{print $1}' |\
sort -u || true)
[ -z "$dict" ] && die "generated dictionary is empty"
# Now, add in the number of words as a header (required by Hunspell)
local count
count=$(echo "$dict"| wc -l | awk '{print $1}' || true)
[ -z "$count" ] && die "cannot determine dictionary length"
[ "$count" -eq 0 ] && die "invalid dictionary length"
# Construct the dictionary
(echo "$count"; echo "$dict") > "$dict_file"
cp "${fragment_dir}/${rules_file_name}" "${rules_file}"
}
spell_check_file()
{
local file="$1"
[ -z "$file" ] && die "need file to check"
[ -e "$file" ] || die "file does not exist: '$file'"
[ -e "$dict_file" ] || make_dictionary
info "Spell checking file '$file'"
# Determine the pandoc input format.
local pandoc_input_fmts
local pandoc_input_fmt
pandoc_input_fmts=$(pandoc --list-input-formats 2>/dev/null || true)
if [ -z "$pandoc_input_fmts" ]
then
# We're using a very old version of pandoc that doesn't
# support listing its available input formats, so
# specify a default.
pandoc_input_fmt="markdown_github"
else
# Pandoc has multiple names for the gfm parser so find one of them
pandoc_input_fmt=$(echo "$pandoc_input_fmts" |\
grep -E "gfm|github" |\
head -1 || true)
fi
[ -z "$pandoc_input_fmt" ] && die "cannot find usable pandoc input format"
local stripped_doc
local pandoc_doc
local utf8_free_doc
local pre_hunspell_doc
local hunspell_results
local final_results
# First strip out all code blocks and convert all
# "quoted apostrophe's" ('\'') back into a single apostrophe.
stripped_doc=$("$strip_cmd" -i "$file" -)
# Next, convert the remainder into plain text to remove the
# remaining markdown syntax.
#
# Before pandoc gets hold of it:
#
# - Replace pipes with spaces. This
# fixes an issue with old versions of pandoc (Ubuntu 16.04)
# which completely mangle tables into nonsense.
#
# - Remove empty reference links.
#
# For example, this markdown
#
# blah [`qemu-lite`][qemu-lite] blah.
# :
# [qemu-lite]: https://...
#
# Gets converted into
#
# blah [][qemu-lite] blah.
# :
# [qemu-lite]: https://...
#
# And the empty set of square brackets confuses pandoc.
#
# After pandoc has processed the data, remove any remaining
# "inline links" in this format:
#
# [link name](#link-address)
#
# This is strictly only required for old versions of pandoc.
pandoc_doc=$(echo "$stripped_doc" |\
tr '|' ' ' |\
sed 's/\[\]\[[^]]*\]//g' |\
"$md_convert_tool" -f "${pandoc_input_fmt}" -t plain - |\
sed 's/\[[^]]*\]([^\)]*)//g' || true)
# Convert the file into "pure ASCII" by removing all awkward
# Unicode characters that won't spell check.
#
# Necessary since pandoc is "clever" and will convert things like
# GitHub's colon emojis (such as ":smile:") into the actual utf8
# character where possible.
utf8_free_doc=$(echo "$pandoc_doc" | iconv -c -f utf-8 -t ascii)
# Next, perform the following simplifications:
#
# - Remove URLs.
# - Remove email addresses.
# - Replace most punctuation symbols with a space
# (excluding the dash, aka hyphen).
# - Carefully remove non-hyphen dashes.
# - Remove GitHub @userids.
pre_hunspell_doc=$(echo "$utf8_free_doc" |\
sed 's,https*://[^[:space:]()][^[:space:]()]*,,g' |\
sed -r 's/[a-zA-Z0-9.-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+//g' |\
tr '[,\[\]()\*\\/\|=]' ' ' |\
sed -e 's/^ *-//g' -e 's/- $//g' -e 's/ -//g' |\
sed 's/@[a-zA-Z0-9][a-zA-Z0-9]*\b//g')
# Call the spell checker
hunspell_results=$(echo "$pre_hunspell_doc" | $spell_check_cmd -d "${dict_languages}")
# Finally, post-process the hunspell output:
#
# - Parse the output to ignore:
# - Hunspell banner.
# - Correctly spelt words (lines starting with '*', '+' or '-').
# - All words containing numbers (like "100MB").
# - All words that appear to be acronyms / abbreviations
# (at least two upper-case letters and which may be plural or
# possessive).
# - All words that appear to be numbers.
# - All possessives and the dreaded isolated "'s" which occurs
# for input like this:
#
# `kata-shim`'s
#
# which gets converted by $strip_cmd into simply:
#
# 's
#
# - Sort output.
final_results=$(echo "$hunspell_results" |\
grep -Evi "(ispell|hunspell)" |\
grep -Ev '^(\*|\+|-)' |\
grep -Evi "^(&|#) [^ ]*[0-9][^ ]*" |\
grep -Ev "^. [A-Z][A-Z][A-Z]*(s|'s)*" |\
grep -Ev "^. 's" |\
sort -u || true)
local line
local incorrects
local near_misses
near_misses=$(echo "$final_results" | grep '^&' || true)
incorrects=$(echo "$final_results" | grep '^\#' | awk '{print $2}' || true)
local -i failed=0
[ -n "$near_misses" ] && failed+=1
[ -n "$incorrects" ] && failed+=1
echo "$near_misses" | while read -r line
do
[ "$line" = "" ] && continue
local word
local possibles
word=$(echo "$line" | awk '{print $2}')
possibles=$(echo "$line" | cut -d: -f2- | sed 's/^ *//g')
warn "Word '${word}': did you mean one of the following?: ${possibles}"
done
local incorrect
for incorrect in $incorrects
do
warn "Incorrect word: '$incorrect'"
done
[ "$failed" -gt 0 ] && die "Spell check failed for file: '$file'"
info "Spell check successful for file: '$file'"
}
delete_dictionary()
{
# Remove the files at the same paths that make_dictionary() writes to.
rm -f "${dict_file}" "${rules_file}"
}
setup()
{
local cmd
for cmd in "$spell_check_cmd" "$md_convert_tool"
do
command -v "$cmd" &>/dev/null || die "Need $cmd command"
done
}
usage()
{
cat <<-EOF
Usage: ${script_name} <command> [arguments]
Description: Spell-checking utility.
Commands:
check <file> : Spell check the specified file
(implies 'make-dict').
delete-dict : Delete the dictionary.
help : Show this usage.
make-dict : Create the dictionary.
EOF
}
main()
{
setup
[ -z "${1:-}" ] && usage && echo && die "need command"
case "$1" in
check) shift && spell_check_file "${1:-}" ;;
delete-dict) delete_dictionary ;;
help|-h|--help) usage && exit 0 ;;
make-dict) make_dictionary ;;
*) die "invalid command: '$1'" ;;
esac
}
main "$@"