Package 'UniversalCVI' reference manual

Title:	Hard and Soft Cluster Validity Indices
Description:	Algorithms for checking the accuracy of a clustering result with known classes, computing cluster validity indices, and generating plots for comparing them. The package is compatible with K-means, fuzzy C means, EM clustering, and hierarchical clustering (single, average, and complete linkage). The details of the indices in this package can be found in: J. C. Bezdek, M. Moshtaghi, T. Runkler, C. Leckie (2016) <doi:10.1109/TFUZZ.2016.2540063>, T. Calinski, J. Harabasz (1974) <doi:10.1080/03610927408827101>, C. H. Chou, M. C. Su, E. Lai (2004) <doi:10.1007/s10044-004-0218-1>, D. L. Davies, D. W. Bouldin (1979) <doi:10.1109/TPAMI.1979.4766909>, J. C. Dunn (1973) <doi:10.1080/01969727308546046>, F. Haouas, Z. Ben Dhiaf, A. Hammouda, B. Solaiman (2017) <doi:10.1109/FUZZ-IEEE.2017.8015651>, M. Kim, R. S. Ramakrishna (2005) <doi:10.1016/j.patrec.2005.04.007>, S. H. Kwon (1998) <doi:10.1049/EL:19981523>, S. H. Kwon, J. Kim, S. H. Son (2021) <doi:10.1049/ell2.12249>, G. W. Miligan (1980) <doi:10.1007/BF02293907>, M. K. Pakhira, S. Bandyopadhyay, U. Maulik (2004) <doi:10.1016/j.patcog.2003.06.005>, M. Popescu, J. C. Bezdek, T. C. Havens, J. M. Keller (2013) <doi:10.1109/TSMCB.2012.2205679>, S. Saitta, B. Raphael, I. Smith (2007) <doi:10.1007/978-3-540-73499-4_14>, A. Starczewski (2017) <doi:10.1007/s10044-015-0525-8>, Y. Tang, F. Sun, Z. Sun (2005) <doi:10.1109/ACC.2005.1470111>, N. Wiroonsri (2024) <doi:10.1016/j.patcog.2023.109910>, N. Wiroonsri, O. Preedasawakul (2023) <doi:10.48550/arXiv.2308.14785>, C. H. Wu, C. S. Ouyang, L. W. Chen, L. W. Lu (2015) <doi:10.1109/TFUZZ.2014.2322495>, X. Xie, G. Beni (1991) <doi:10.1109/34.85677> and Rousseeuw (1987) and Kaufman and Rousseeuw(2009) <doi:10.1016/0377-0427(87)90125-7> and <doi:10.1002/9780470316801> C. Alok. (2010).
Authors:	Nathakhun Wiroonsri [cre, aut] , Onthada Preedasawakul [aut]
Maintainer:	Nathakhun Wiroonsri <[email protected]>
License:	GPL (>= 3)
Version:	1.2.0
Built:	2025-03-28 06:05:49 UTC
Source:	https://github.com/cran/UniversalCVI

Accuracy detection for a clustering result with known classes

Description

Computes the accuracy of a clustering result of a dataset with known classes from the k-means, fuzzy c-means, or EM algorithm.

Usage

AccClust(x, label.names = "label", algorithm = "FCM", fzm = 2,
  scale = TRUE, nstart = 100, iter = 100)
AccClust(x, label.names = "label", algorithm = "FCM", fzm = 2,
  scale = TRUE, nstart = 100, iter = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`label.names`	a character string indicating the true label column name. The default is `"label"`
`algorithm`	a character string indicating which clustering methods to be used (`"FCM"`, `"EM"`, `"Kmeans"`). More than one methods may be selected. The default is `"FCM"`.
`fzm`	a number greater than 1 giving the degree of fuzzification for `method = "FCM"`. The default is `2`.
`scale`	logical, if `TRUE` (default), the dataset is normalized before clustering.
`nstart`	a maximum number of initial random sets for FCM for `method = "FCM" or "Kmeans" or c("Kmeans","FCM")`. The default is `20`.
`iter`	a maximum number of iterations for `method = "FCM"`. The default is `100`.

Value

`kmeans`	Accuracy score from `0` to `1` of the k-means result
`FCM`	Accuracy score from `0` to `1` of the FCM result
`EM`	Accuracy score from `0` to `1` of the EM result

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data

# Check accuracy of clustering results obtained by kmeans, FCM, and EM clustering
AccClust(x, label.names = "label",algorithm = c("Kmeans","FCM","EM"), fzm = 2,
  scale = TRUE, nstart = 20,iter = 100)

# Check accuracy of a clustering result obtained by the FCM algoritm
AccClust(x, label.names = "label",algorithm = "FCM", fzm = 2,
  scale = TRUE, nstart = 20,iter = 100)
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data

# Check accuracy of clustering results obtained by kmeans, FCM, and EM clustering
AccClust(x, label.names = "label",algorithm = c("Kmeans","FCM","EM"), fzm = 2,
  scale = TRUE, nstart = 20,iter = 100)

# Check accuracy of a clustering result obtained by the FCM algoritm
AccClust(x, label.names = "label",algorithm = "FCM", fzm = 2,
  scale = TRUE, nstart = 20,iter = 100)

Correlation Cluster Validity (CCV) index

Description

Computes the CCVP and CCVS (M. Popescu et al., 2013) indexes for a result of either FCM or EM clustering from user specified cmin to cmax.

Usage

CCV.IDX(x, cmax, cmin = 2, indexlist = "all", method = 'FCM', fzm = 2,
  iter = 100, nstart = 20)
CCV.IDX(x, cmax, cmin = 2, indexlist = "all", method = 'FCM', fzm = 2,
  iter = 100, nstart = 20)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`cmax`	a maximum number of clusters to be considered.
`cmin`	a minimum number of clusters to be considered. The default is `2`.
`indexlist`	a character string indicating which The generalized C index be computed ("`all`","`CCVP`","`CCVS`"). More than one indexes can be selected.
`method`	a character string indicating which clustering method to be used (`"FCM"` or `"EM"`). The default is `"FCM"`.
`fzm`	a number greater than 1 giving the degree of fuzzification for `method = "FCM"`. The default is `2`.
`iter`	a maximum number of iterations for `method = "FCM"`. The default is `100`.
`nstart`	a maximum number of initial random sets for FCM for `method = "FCM"`. The default is `20`.

Details

A new cluster validity framework that compares the structure in the data to the structure of dissimilarity matrices induced by a matrix transformation of the partition being tested. The largest value of $CCV(c)$ indicates a valid optimal partition.

Value

Each of the followings shows the values of each index for c from cmin to cmax in a data frame.

`CCVP`	the Pearson Correlation Cluster Validity index.
`CCVS`	the Spearman’s (rho) Correlation Cluster Validity index.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

M. Popescu, J. C. Bezdek, T. C. Havens and J. M. Keller (2013). "A Cluster Validity Framework Based on Induced Partition Dissimilarity." https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6246717&isnumber=6340245

Examples


library(UniversalCVI)

# Iris data
x = iris[,1:4]

# ---- FCM algorithm ----


# Compute all the indices by CCV.IDX
FCM.ALL.CCV = CCV.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "all",
  method =  'FCM', fzm = 2, iter = 100, nstart = 20)
print(FCM.ALL.CCV)

# Compute CCVP index
FCM.CCVP = CCV.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "CCVP",
  method =  'FCM', fzm = 2, iter = 100, nstart = 20)
print(FCM.CCVP)


# ---- EM algorithm ----

# Compute all the indices by CCV.IDX
EM.ALL.CCV = CCV.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "all",
  method =  'EM', iter = 100, nstart = 20)
print(EM.ALL.CCV)

# Compute CCVP index
EM.CCVP = CCV.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "CCVP",
  method =  'EM', iter = 100, nstart = 20)
print(EM.CCVP)
library(UniversalCVI)

# Iris data
x = iris[,1:4]

# ---- FCM algorithm ----


# Compute all the indices by CCV.IDX
FCM.ALL.CCV = CCV.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "all",
  method =  'FCM', fzm = 2, iter = 100, nstart = 20)
print(FCM.ALL.CCV)

# Compute CCVP index
FCM.CCVP = CCV.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "CCVP",
  method =  'FCM', fzm = 2, iter = 100, nstart = 20)
print(FCM.CCVP)


# ---- EM algorithm ----

# Compute all the indices by CCV.IDX
EM.ALL.CCV = CCV.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "all",
  method =  'EM', iter = 100, nstart = 20)
print(EM.ALL.CCV)

# Compute CCVP index
EM.CCVP = CCV.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "CCVP",
  method =  'EM', iter = 100, nstart = 20)
print(EM.CCVP)

Calinski–Harabasz (CH) index

Description

Computes the CH (T. Calinski and J. Harabasz, 1974) index for a result either kmeans or hierarchical clustering from user specified kmin to kmax.

Usage

CH.IDX(x, kmax, kmin = 2, method = "kmeans", nstart = 100)
CH.IDX(x, kmax, kmin = 2, method = "kmeans", nstart = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`kmax`	a maximum number of clusters to be considered.
`kmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"kmeans"`, `"hclust_complete"`, `"hclust_average"`, `"hclust_single"`). The default is `"kmeans"`.
`nstart`	a maximum number of initial random sets for kmeans for `method = "kmeans"`. The default is `100`.

Details

The CH index is defined as

$CH(k) = \frac{n-k}{k-1}\frac{\sum_{i=1}^k|C|_id(v_i,\bar{x})}{\sum_{i=1}^k\sum_{x_j\in C_i}d(x_j,v_i)}$

The largest value of $CH(k)$ indicates a valid optimal partition.

Value

`CH`	the CH index for `k` from `kmin` to `kmax` shown in a data frame where the first and the second columns are `k` and the CH index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

T. Calinski, J. Harabasz, "A dendrite method for cluster analysis," Communications in Statistics, 3, 1-27 (1974).

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute the CH index
K.CH = CH.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans", nstart = 100)
print(K.CH)

# The optimal number of cluster
K.CH[which.max(K.CH$CH),]

# ---- Hierarchical ----

# Average linkage

# Compute the CH index
H.CH = CH.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average")
print(H.CH)

# The optimal number of cluster
H.CH[which.max(H.CH$CH),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute the CH index
K.CH = CH.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans", nstart = 100)
print(K.CH)

# The optimal number of cluster
K.CH[which.max(K.CH$CH),]

# ---- Hierarchical ----

# Average linkage

# Compute the CH index
H.CH = CH.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average")
print(H.CH)

# The optimal number of cluster
H.CH[which.max(H.CH$CH),]

Chou-Su-Lai (CSL) index

Description

Computes the CSL (C. H. Chou et al., 2004) index for a result either kmeans or hierarchical clustering from user specified kmin to kmax.

Usage

CSL.IDX(x, kmax, kmin = 2, method = "kmeans", nstart = 100)
CSL.IDX(x, kmax, kmin = 2, method = "kmeans", nstart = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`kmax`	a maximum number of clusters to be considered.
`kmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"kmeans"`, `"hclust_complete"`, `"hclust_average"`, `"hclust_single"`). The default is `"kmeans"`.
`nstart`	a maximum number of initial random sets for kmeans for `method = "kmeans"`. The default is `100`.

Details

The CSL index is defined as

$CSL(k) = \frac{\sum_{i=1}^k \left\{\frac{1}{|C_i|}\sum_{x_j \in C_i} \max_{x_l \in C_i} d(x_j,x_l)\right\}}{\sum_{i=1}^k \left\{\min_{j:j \ne i}d(v_i,v_j)\right\}}.$

The smallest value of $CSL(k)$ indicates a valid optimal partition.

Value

CSL

the CSL index for k from kmin to kmax shown in a data frame where the first and the second columns are k and the CSL index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

C. H. Chou, M. C. Su, E. Lai, "A new cluster validity measure and its application to image compression," Pattern Anal Applic, 7, 205-220 (2004).

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute the CSL index
K.CSL = CSL.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans", nstart = 100)
print(K.CSL)

# The optimal number of cluster
K.CSL[which.min(K.CSL$CSL),]

# ---- Hierarchical ----

# Average linkage

# Compute the CSL index
H.CSL = CSL.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average")
print(H.CSL)

# The optimal number of cluster
H.CSL[which.min(H.CSL$CSL),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute the CSL index
K.CSL = CSL.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans", nstart = 100)
print(K.CSL)

# The optimal number of cluster
K.CSL[which.min(K.CSL$CSL),]

# ---- Hierarchical ----

# Average linkage

# Compute the CSL index
H.CSL = CSL.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average")
print(H.CSL)

# The optimal number of cluster
H.CSL[which.min(H.CSL$CSL),]

D1 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 6 different Gaussian distributions labeled as 1-6.

Usage

D1_data
D1_data

Format

A data frame with 1500 data points and 3 variables

x: Numeric values generated from Gaussian distributions
y: Numeric values generated from Gaussian distributions
label: Categorical labels 1,2,3,4,5,6

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

D10 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 3 different Gaussian and 2 Uniform distributions labeled as 1-5.

Usage

D10_data
D10_data

Format

A data frame with 1250 data points and 3 variables

x: Numeric values generated from Gaussian and Uniform distributions
y: Numeric values generated from Gaussian and Uniform distributions
label: Categorical labels 1,2,3,4,5

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

D2 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 6 different Gaussian distributions labeled as 1-6.

Usage

D2_data
D2_data

Format

A data frame with 1200 data points and 3 variables

x: Numeric values generated from Gaussian distributions
y: Numeric values generated from Gaussian distributions
label: Categorical labels 1,2,3,4,5,6

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

D3 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 4 different Gaussian distributions labeled as 1-4.

Usage

D3_data
D3_data

Format

A data frame with 1400 data points and 3 variables

x: Numeric values generated from Gaussian distributions
y: Numeric values generated from Gaussian distributions
label: Categorical labels 1,2,3,4

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

D4 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 4 different Gaussian distributions labeled as 1-4.

Usage

D4_data
D4_data

Format

A data frame with 2400 data points and 3 variables

x: Numeric values generated from Gaussian distributions
y: Numeric values generated from Gaussian distributions
label: Categorical labels 1,2,3,4

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

D5 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 5 different Gaussian distributions labeled as 1-5.

Usage

D5_data
D5_data

Format

A data frame with 350 data points and 3 variables

x: Numeric values generated from Gaussian distributions
y: Numeric values generated from Gaussian distributions
label: Categorical labels 1,2,3,4,5

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

D6 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 5 different Gaussian distributions labeled as 1-5.

Usage

D6_data
D6_data

Format

A data frame with 1100 data points and 3 variables

x: Numeric values generated from Gaussian distributions
y: Numeric values generated from Gaussian distributions
label: Categorical labels 1,2,3,4,5

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

D7 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 6 different Gaussian distributions labeled as 1-6.

Usage

D7_data
D7_data

Format

A data frame with 1500 data points and 3 variables

x: Numeric values generated from Gaussian distributions
y: Numeric values generated from Gaussian distributions
label: Categorical labels 1,2,3,4,5,6

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

D8 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 6 different Gaussian distributions labeled as 1-6.

Usage

D8_data
D8_data

Format

A data frame with 2000 data points and 3 variables

x: Numeric values generated from Gaussian distributions
y: Numeric values generated from Gaussian distributions
label: Categorical labels 1,2,3,4,5,6

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

D9 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 3 different Uniform distributions labeled as 1-3.

Usage

D9_data
D9_data

Format

A data frame with 1000 data points and 3 variables

x: Numeric values generated from Uniform distributions
y: Numeric values generated from Uniform distributions
label: Categorical labels 1,2,3

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

Davies–Bouldin (DB) and DB* (DBs) indexes

Description

Computes the DB (D. L. Davies and D. W. Bouldin, 1979) and DBs (M. Kim and R. S. Ramakrishna, 2005) indexes for a result either kmeans or hierarchical clustering from user specified kmin to kmax.

Usage

DB.IDX(x, kmax, kmin = 2, method = "kmeans",
  indexlist = "all", p = 2, q = 2, nstart = 100)
DB.IDX(x, kmax, kmin = 2, method = "kmeans",
  indexlist = "all", p = 2, q = 2, nstart = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`kmax`	a maximum number of clusters to be considered.
`kmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"kmeans"`, `"hclust_complete"`, `"hclust_average"`, `"hclust_single"`). The default is `"kmeans"`.
`indexlist`	a character string indicating which cluster validity indexes to be computed (`"all"`, `"DB"`, `"DBs"`). More than one indexes can be selected.
`p`	the power of the Minkowski distance between centroids of clusters. The default is `2`.
`q`	the power of dispersion measure of a cluster. The default is `2`.
`nstart`	a maximum number of initial random sets for kmeans for `method = "kmeans"`. The default is `100`.

Details

The lowest value of $DB(k),DBs(k)$ indicates a valid optimal partition.

Value

`DB`	the DB index for `k` from `kmin` to `kmax` shown in a data frame where the first and the second columns are `k` and the DB index, respectively.
`DBs`	the DBs index for `k` from `kmin` to `kmax` shown in a data frame where the first and the second columns are `k` and the DBs index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

D. L. Davies, D. W. Bouldin, "A cluster separation measure," IEEE Trans Pattern Anal Machine Intell, 1, 224-227 (1979).

M. Kim, R. S. Ramakrishna, "New indices for cluster validity assessment," Pattern Recognition Letters, 26, 2353-2363 (2005).

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute all the indices by DB.IDX
K.ALL = DB.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans",
  indexlist = "all", p = 2, q = 2, nstart = 100)
print(K.ALL)

# Compute DB index
K.DB = DB.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans",
  indexlist = "DB", p = 2, q = 2, nstart = 100)
print(K.DB)

# ---- Hierarchical ----

# Average linkage

# Compute all the indices by DB.IDX
H.ALL = DB.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average",
  indexlist = "all", p = 2, q = 2)
print(H.ALL)

# Compute DB index
H.DB = DB.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average",
  indexlist = "DB", p = 2, q = 2)
print(H.DB)
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute all the indices by DB.IDX
K.ALL = DB.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans",
  indexlist = "all", p = 2, q = 2, nstart = 100)
print(K.ALL)

# Compute DB index
K.DB = DB.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans",
  indexlist = "DB", p = 2, q = 2, nstart = 100)
print(K.DB)

# ---- Hierarchical ----

# Average linkage

# Compute all the indices by DB.IDX
H.ALL = DB.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average",
  indexlist = "all", p = 2, q = 2)
print(H.ALL)

# Compute DB index
H.DB = DB.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average",
  indexlist = "DB", p = 2, q = 2)
print(H.DB)

Dunn index

Description

Computes the DI (J. C. Dunn, 1973) index for a result either kmeans or hierarchical clustering from user specified kmin to kmax.

Usage

DI.IDX(x, kmax, kmin = 2, method = "kmeans", nstart = 100)
DI.IDX(x, kmax, kmin = 2, method = "kmeans", nstart = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`kmax`	a maximum number of clusters to be considered.
`kmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"kmeans"`, `"hclust_complete"`, `"hclust_average"`, `"hclust_single"`). The default is `"kmeans"`.
`nstart`	a maximum number of initial random sets for kmeans for `method = "kmeans"`. The default is `100`.

Details

The DI index is defined as

$DI(k) = \min_{i \ne j \in [k]}\left\{\frac{\min\left\{d(x_u,x_v)|x_u\in C_i,x_v \in C_j\right\}}{\max_{l \in [k]}\max\left\{d(x_u,x_v)|x_u,x_v \in C_l\right\}}\right\}.$

The largest value of $DI(k)$ indicates a valid optimal partition.

Value

`DI`	the DI index for `k` from `kmin` to `kmax` shown in a data frame where the first and the second columns are `k` and the DI index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," J Cybern, 3(3), 32-57 (1973).

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute the DI index
K.DI = DI.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans", nstart = 100)
print(K.DI)

# The optimal number of cluster
K.DI[which.max(K.DI$DI),]

# ---- Hierarchical ----

# Average linkage

# Compute the DI index
H.DI = DI.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average")
print(H.DI)

# The optimal number of cluster
H.DI[which.max(H.DI$DI),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute the DI index
K.DI = DI.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans", nstart = 100)
print(K.DI)

# The optimal number of cluster
K.DI[which.max(K.DI$DI),]

# ---- Hierarchical ----

# Average linkage

# Compute the DI index
H.DI = DI.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average")
print(H.DI)

# The optimal number of cluster
H.DI[which.max(H.DI$DI),]

Fuzzy cluster validity indexes used in Wiroonsri and Preedasawakul (2023)

Description

Computes the cluster validity indexes for a result of either FCM or EM clustering from user specified cmin to cmax used in Wiroonsri and Preedasawakul (2023). It includes the XB (X. L. Xie and G. Beni, 1991) index, KWON (S. H. Kwon, 1998) index, KWON2 (S. H. Kwon et al., 2021) index, TANG (Y. Tang et al., 2005) index , HF (F. Haouas et al., 2017) index, WL (C. H. Wu et al., 2015) index, PBM (M. K. Pakhira et al., 2004) index, KPBM (C. Alok, 2010) index, CCVP and CCVS (M. Popescu et al., 2013) index, GC1, GC2, GC3, and GC4 (J. C. Bezdek et al., 2016) indexes , WPC, WP, WPCI1, and, WPCI2 (N. Wiroonsri and O. Preedasawakul, 2023) indexes.

Usage

FzzyCVIs(x, cmax, cmin = 2, indexlist = 'all', corr = 'pearson',
  method = 'FCM', fzm = 2, gamma = (fzm^2*7)/4, sampling = 1,
  iter = 100, nstart = 20, NCstart = TRUE)
FzzyCVIs(x, cmax, cmin = 2, indexlist = 'all', corr = 'pearson',
  method = 'FCM', fzm = 2, gamma = (fzm^2*7)/4, sampling = 1,
  iter = 100, nstart = 20, NCstart = TRUE)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`cmax`	a maximum number of clusters to be considered.
`cmin`	a minimum number of clusters to be considered. The default is `2`.
`indexlist`	a character string indicating which cluster validity indexes to be computed (`"all"`, `"WPC"`, `"WP"`, `"WPCI1"`, `"WPCI2"`, `"XB"`, "`"KWON"`", "`"KWON2"`", "`"TANG"`", "`"HF"`", `"WL"`, `"PBM"`, `"KPBM"`, `"CCVP"`, `"CCVS"`, `"GC1"`, `"GC2"`, `"GC3"`, `"GC4"`). More than one indexes can be selected.
`corr`	a character string indicating which correlation coefficient is to be computed (`"pearson"`, `"kendall"` or `"spearman"`) for `indexlist` = (`"WP"`, `"WPC"`, `"WPCI1"`,`"WPCI2"`, `"CCVP"`, `"CCVS"`, `"GC1"`, `"GC2"`, `"GC3"` or `"GC4"`). The default is `"pearson"`.
`method`	a character string indicating which clustering method to be used (`"FCM"` or `"EM"`). The default is `"FCM"`.
`fzm`	a number greater than 1 giving the degree of fuzzification for `method = "FCM"`. The default is `2`.
`gamma`	adjusted fuzziness parameter for `indexlist` = (`"WP"`, `"WPC"`, `"WPCI1"`, `"WPCI2"`). The default is $7fzm^2/4$ .
`sampling`	a number greater than 0 and less than or equal to 1 indicating the undersampling proportion of data to be used. This argument is intended for handling a large dataset. The default is `1`.
`iter`	a maximum number of iterations for `method = "FCM"`. The default is `100`.
`nstart`	a maximum number of initial random sets for FCM for `method = "FCM"`. The default is `20`.
`NCstart`	logical for `indexlist` includes either of the `"WP"`, `"WPC"`, `"WPCI1"`, and `"WPCI2"`), if `TRUE`, the WP correlation at `c=1` is defined as the ratio introduced in the reference. Otherwise, it is assigned as `0`.

Details

The well-known cluster validity indexes for either FCM or EM clustering. It includes the XB (X. L. Xie and G. Beni., 1991) index, KWON (S. H. Kwon, 1998) index, KWON2 (S. H. Kwon et al., 2021) index, TANG (Y. Tang et al., 2005) index , HF (F. Haouas et al., 2017) index, WL (C. H. Wu et al., 2015) index, PBM (M. K. Pakhira et al., 2004) index, KPBM (C. Alok, 2010) index, CCVP and CCVS (M. Popescu et al., 2013) index, GC1, GC2, GC3, and GC4 (J. C. Bezdek et al., 2016) indexes , WPC, WP, WPCI1, and, WPCI2 (N. Wiroonsri and O. Preedasawakul, 2023) indexes.

The WPC computes the correlation between the actual distance between a pair of data points and the distance between adjusted centroids with respect to the pair. WPCI1 and WPCI2 are the proportion and the subtraction, respectively, of the same two ratios. The first ratio is the WPC improvement from c-1 clusters to c clusters over the entire room for improvement. The second ratio is the WPC improvement from c clusters to c+1 clusters over the entire room for improvement. WP is defined as a combination of WPCI1 and WPCI2.

Value

WPC

the WP correlation from c from cmin-1 to cmax+1 shown in a data frame.

Each of the followings shows the values of each index for c from cmin to cmax in a data frame.

`WP`	the WP index.
`WPCI1`	the WPCI1 index.
`WPCI2`	the WPCI2 index.
`XB`	the XB index.
`KWON`	the KWON index.
`KWON2`	the KWON2 index.
`TANG`	the TANG index.
`HF`	the HF index.
`WL`	the WL index.
`PBM`	the PBM index
`KPBM`	the KPBM index
`CCVP`	the Pearson Correlation Cluster Validity index.
`CCVS`	the Spearman’s (rho) Correlation Cluster Validity index.
`GC1`	the generalized C index ( $\sum\cdot \sim$ Sum-Product).
`GC2`	the generalized C index ( $\sum\wedge \sim$ Sum-Min).
`GC3`	the generalized C index ( $\vee\cdot \sim$ Max-Product).
`GC4`	the generalized C index ( $\vee\wedge \sim$ Max-Min).

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

C. Alok. (2010). "An investigation of clustering algorithms and soft computing approaches for pattern recognition," Department of Computer Science, Assam University.

J. C. Bezdek, M. Moshtaghi, T. Runkler, C. Leckie, “The generalized c index for internal fuzzy cluster validity,” IEEE Transactions on Fuzzy Systems, vol. 24, no. 6, pp. 1500–1512, 2016.

F. Haouas, Z. Ben Dhiaf, A. Hammouda, B. Solaiman, "A new efficient fuzzy cluster validity index: Application to images clustering," 2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Naples, Italy, 2017, pp. 1-6.

S. H. Kwon, “Cluster validity index for fuzzy clustering,” Electronics letters, vol. 34, no. 22, pp. 2176–2177, 1998.

S. H. Kwon, J. Kim, S. H. Son, “Improved cluster validity index for fuzzy clustering,” Electronics Letters, vol. 57, no. 21, pp. 792–794, 2021.

M. K. Pakhira, S. Bandyopadhyay, U. Maulik, “Validity index for crisp and fuzzy clusters,” Pattern recognition, vol. 37, no. 3, pp. 487–501, 2004.

M. Popescu, J. C. Bezdek, T. C. Havens, J. M. Keller, "A Cluster Validity Framework Based on Induced Partition Dissimilarity," in IEEE Transactions on Cybernetics, vol. 43, no. 1, pp. 308-320, Feb. 2013.

Y. Tang, F. Sun, Z. Sun, “Improved validation index for fuzzy clustering,” in Proceedings of the 2005, American Control Conference, 2005., pp. 1120–1125 vol. 2, 2005.

N. Wiroonsri, O. Preedasawakul, "A correlation-based fuzzy cluster validity index with secondary options detector," arXiv:2308.14785, 2023

C. H. Wu, C. S. Ouyang, L. W. Chen, L. W. Lu, “A new fuzzy clustering validity index with a median factor for centroid-based clustering,” IEEE Transactions on Fuzzy Systems, vol. 23, no. 3, pp. 701–718, 2015.

X. Xie, G. Beni, “A validity measure for fuzzy clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.

Examples


library(UniversalCVI)

# Iris data
x = iris[,1:4]

# ---- FCM algorithm ----


# Compute selected a set of indices ("WPC","WP","XB") using default gamma
F.s = FzzyCVIs(scale(x), cmax = 10, cmin = 2, indexlist = c("WPC","WP","XB"),
  corr = 'pearson', method = 'FCM', fzm = 2, iter = 100, nstart = 20, NCstart = TRUE)

# Plot the computed indexes
plot_idx(F.s)

# ---- EM algorithm ----

# Compute all the indices by FzzyCVIs using default gamma
E.all = FzzyCVIs(scale(x), cmax = 10, cmin = 2, indexlist = 'all', corr = 'pearson',
  method = 'EM', iter = 100, nstart = 20, NCstart = TRUE)

# Plot the computed indexes
plot_idx(E.all)

library(UniversalCVI)

# Iris data
x = iris[,1:4]

# ---- FCM algorithm ----


# Compute selected a set of indices ("WPC","WP","XB") using default gamma
F.s = FzzyCVIs(scale(x), cmax = 10, cmin = 2, indexlist = c("WPC","WP","XB"),
  corr = 'pearson', method = 'FCM', fzm = 2, iter = 100, nstart = 20, NCstart = TRUE)

# Plot the computed indexes
plot_idx(F.s)

# ---- EM algorithm ----

# Compute all the indices by FzzyCVIs using default gamma
E.all = FzzyCVIs(scale(x), cmax = 10, cmin = 2, indexlist = 'all', corr = 'pearson',
  method = 'EM', iter = 100, nstart = 20, NCstart = TRUE)

# Plot the computed indexes
plot_idx(E.all)

The generalized C index

Description

Computes the GC1 GC2 GC3 and GC4 (J. C. Bezdek et al., 2016) indexes for a result of either FCM or EM clustering from user specified cmin to cmax.

Usage

GC.IDX(x, cmax, cmin = 2, indexlist = "all", method = 'FCM', fzm = 2,
  iter = 100, nstart = 20)
GC.IDX(x, cmax, cmin = 2, indexlist = "all", method = 'FCM', fzm = 2,
  iter = 100, nstart = 20)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`cmax`	a maximum number of clusters to be considered.
`cmin`	a minimum number of clusters to be considered. The default is `2`.
`indexlist`	a character string indicating which The generalized C index be computed ("`all`","`GC1`","`GC2`","`GC3`","`GC4`"). More than one indexes can be selected.
`method`	a character string indicating which clustering method to be used (`"FCM"` or `"EM"`). The default is `"FCM"`.
`fzm`	a number greater than 1 giving the degree of fuzzification for `method = "FCM"`. The default is `2`.
`iter`	a maximum number of iterations for `method = "FCM"`. The default is `100`.
`nstart`	a maximum number of initial random sets for FCM for `method = "FCM"`. The default is `20`.

Details

The GC index is a soft version of the C-index, formulated based on relational transformations of the membership degree matrix $\mu$ . It comprises four distinct variants, each with its own definition.
The smallest value of $GC(c)$ indicates a valid optimal partition.

Value

Each of the followings shows the values of each index for c from cmin to cmax in a data frame.

`GC1`	the generalized C index ( $\sum\cdot \sim$ Sum-Product).
`GC2`	the generalized C index ( $\sum\wedge \sim$ Sum-Min).
`GC3`	the generalized C index ( $\vee\cdot \sim$ Max-Product).
`GC4`	the generalized C index ( $\vee\wedge \sim$ Max-Min).

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

J. C. Bezdek, M. Moshtaghi, T. Runkler, and C. Leckie, “The generalized c index for internal fuzzy cluster validity,” IEEE Transactions on Fuzzy Systems, vol. 24, no. 6, pp. 1500–1512, 2016. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7429723&isnumber=7797168

Examples


library(UniversalCVI)

# Iris data
x = iris[,1:4]

# ---- FCM algorithm ----

# Compute all the indices by GC.IDX
FCM.all.GC = GC.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "all",
  method =  'FCM', fzm = 2, iter = 100, nstart = 5)
print(FCM.all.GC)

# Compute GC2 index
FCM.GC2 = GC.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "GC2",
  method = 'FCM', fzm = 2, iter = 100, nstart = 5)
print(FCM.GC2)

# ---- EM algorithm ----

# Compute all the indices by GC.IDX
EM.all.GC = GC.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "all",
  method =  'EM', iter = 100, nstart = 5)
print(EM.all.GC)

# Compute GC2 index
EM.GC2 = GC.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "GC2",
  method =  'EM', iter = 100, nstart = 5)
print(EM.GC2)
library(UniversalCVI)

# Iris data
x = iris[,1:4]

# ---- FCM algorithm ----

# Compute all the indices by GC.IDX
FCM.all.GC = GC.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "all",
  method =  'FCM', fzm = 2, iter = 100, nstart = 5)
print(FCM.all.GC)

# Compute GC2 index
FCM.GC2 = GC.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "GC2",
  method = 'FCM', fzm = 2, iter = 100, nstart = 5)
print(FCM.GC2)

# ---- EM algorithm ----

# Compute all the indices by GC.IDX
EM.all.GC = GC.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "all",
  method =  'EM', iter = 100, nstart = 5)
print(EM.all.GC)

# Compute GC2 index
EM.GC2 = GC.IDX(scale(x), cmax = 10, cmin = 2, indexlist = "GC2",
  method =  'EM', iter = 100, nstart = 5)
print(EM.GC2)

HF index

Description

Computes the HF (F. Haouas et al., 2017) index for a result of either FCM or EM clustering from user specified cmin to cmax.

Usage

HF.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)
HF.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`cmax`	a maximum number of clusters to be considered.
`cmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"FCM"` or `"EM"`). The default is `"FCM"`.
`fzm`	a number greater than 1 giving the degree of fuzzification for `method = "FCM"`. The default is `2`.
`nstart`	a maximum number of initial random sets for FCM for `method = "FCM"`. The default is `20`.
`iter`	a maximum number of iterations for `method = "FCM"`. The default is `100`.

Details

The HF index is defined as

$HF(c) = \frac{\sum_{j=1}^c \sum_{i=1}^n\mu_{ij}^m\| {x}_i-{v}_j\|^2 + \frac{1}{c(c-1)}\sum_{j\neq k}\| {v}_j-{v}_k\|^2}{\frac{n}{2c}\left(\min_{j \neq k}\{\| {v}_j-{v}_k\|^2\} +\text{median}_{j \neq k }\{\| {v}_j-{v}_k\|^2\}\right)}.$

The smallest value of $HF(c)$ indicates a valid optimal partition.

Value

`HF`	the HF index for `c` from `cmin` to `cmax` shown in a data frame where the first and the second columns are `c` and the HF index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

F. Haouas, Z. Ben Dhiaf, A. Hammouda and B. Solaiman, "A new efficient fuzzy cluster validity index: Application to images clustering," 2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Naples, Italy, 2017, pp. 1-6. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8015651&isnumber=8015374

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the HF index
FCM.HF = HF.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.HF)

# The optimal number of cluster
FCM.HF[which.min(FCM.HF$HF),]

# ---- EM algorithm ----

# Compute the HF index
EM.HF = HF.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.HF)

# The optimal number of cluster
EM.HF[which.min(EM.HF$HF),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the HF index
FCM.HF = HF.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.HF)

# The optimal number of cluster
FCM.HF[which.min(FCM.HF$HF),]

# ---- EM algorithm ----

# Compute the HF index
EM.HF = HF.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.HF)

# The optimal number of cluster
EM.HF[which.min(EM.HF$HF),]

Wiroonsri(2024) correlation-based cluster validity indices and other well-known cluster validity indices

Description

Computes the cluster validity indexes for a result of either kmeans or hierarchical clustering from user specified kmin to kmax used in Wiroonsri(2024). It includes the DI (J. C. Dunn, 1973) index, CH (T. Calinski and J. Harabasz, 1974) index, DB (D. L. Davies and D. W. Bouldin, 1979) index, PB (G. W. Miligan, 1985) index, CSL (C. H. Chou et al., 2004) index, PBM (M. K. Pakhira et al., 2004) index, DBs (M. Kim and R. S. Ramakrishna, 2005), Score function (S. Saitta et al., 2007), STR (A. Starczewski, 2017) index, NC, NCI, NCI1, and, NCI2 (N. Wiroonsri, 2024) indexes.

Usage

Hvalid(x, kmax, kmin = 2, indexlist = "all", method = "kmeans",
  p = 2, q = 2, corr = "pearson", nstart = 100, sampling = 1, NCstart = TRUE)
Hvalid(x, kmax, kmin = 2, indexlist = "all", method = "kmeans",
  p = 2, q = 2, corr = "pearson", nstart = 100, sampling = 1, NCstart = TRUE)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`kmax`	a maximum number of clusters to be considered.
`kmin`	a minimum number of clusters to be considered. The default is `2`.
`indexlist`	a character string indicating which cluster validity indexes to be computed (`"all"`, `"NC"`, `"NCI"`, `"NCI1"`, `"NCI2"`, `"PB"`, `"CSL"`, `"CH"`, `"DB"`, `"DBs"`, `"SF"`, `"DI"`, `"STR"`, `"PBM"`). More than one indexes can be selected.
`method`	a character string indicating which clustering method to be used (`"kmeans"`, `"hclust_complete"`, `"hclust_average"`, `"hclust_single"`). The default is `"kmeans"`.
`p`	the power of the Minkowski distance between centroids of clusters for `indexlist = c("DB","DBs")`. The default is `2`.
`q`	the power of dispersion measure of a cluster for `indexlist = c("DB","DBs")`. The default is `2`.
`corr`	a character string indicating which correlation coefficient is to be computed (`"pearson"`, `"kendall"` or `"spearman"`). The default is `"pearson"`.
`nstart`	a maximum number of initial random sets for kmeans for `method = "kmeans"`. The default is `100`.
`sampling`	a number greater than 0 and less than or equal to 1 indicating the undersampling proportion of data to be used. This argument is intended for handling a large dataset. The default is `1`.
`NCstart`	logical for `indexlist` includes the `"NC"`, `"NCI"`, `"NCI1"`, and `"NCI2"`), if `TRUE`, the NC correlation at `k=1` is defined as the ratio introduced in the reference. Otherwise, it is assigned as `0`.

Details

The well-known cluster validity indices used in Wiroonsri(2024). It includes the DI (J. C. Dunn, 1973) index, CH (T. Calinski and J. Harabasz, 1974) index, DB (D. L. Davies and D. W. Bouldin, 1979) index, PB (G. W. Miligan, 1980) index, CSL (C. H. Chou et al., 2004) index, PBM (M. K. Pakhira et al., 2004) index, DBs (M. Kim and R. S. Ramakrishna, 2005), Score function (S. Saitta et al., 2007), STR (A. Starczewski, 2017), NC, NCI, NCI1, and, NCI2 (N. Wiroonsri, 2024) indexes.

The NC correlation computes the correlation between an actual distance between a pair of data points and a centroid distance of clusters that the two points locate in. NCI1 and NCI2 are the proportion and the subtraction, respectively, of the same two ratios. The first ratio is the NC improvement from k-1 clusters to k clusters over the entire room for improvement. The second ratio is the NC improvement from k clusters to k+1 clusters over the entire room for improvement. NCI is a combination of NCI1 and NCI2.

Value

`NC`	the NC correlations for `k` from `kmin-1` to `kmax+1` shown in a data frame where the first and the second columns are `k` and the NC, respectively.

Each of the followings shows the values of each index for k from kmin to kmax in a data frame.

`NCI`	the NCI index.
`NCI1`	the NCI1 index.
`NCI2`	the NCI2 index.
`PB`	the PB index.
`DI`	the DI index.
`DB`	the DB index.
`DBs`	the DBs index.
`CSL`	the CSL index.
`CH`	the CH index.
`SF`	the Score function.
`STR`	the STR index.
`PBM`	the PBM index.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

J. C. Bezdek, N. R. Pal, "Some new indexes of cluster validity," IEEE Transactions on Systems, Man, and Cybernetics, Part B, 28, 301-315 (1998).

T. Calinski, J. Harabasz, "A dendrite method for cluster analysis," Communications in Statistics, 3, 1-27 (1974).

C. H. Chou, M. C. Su, E. Lai, "A new cluster validity measure and its application to image compression," Pattern Anal Applic, 7, 205-220 (2004).

D. L. Davies, D. W. Bouldin, "A cluster separation measure," IEEE Trans Pattern Anal Machine Intell, 1, 224-227 (1979).

J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," J Cybern, 3(3), 32-57 (1973).

M. Kim, R. S. Ramakrishna, "New indices for cluster validity assessment," Pattern Recognition Letters, 26, 2353-2363 (2005).

G. W. Miligan, "An examination of the effect of six types of error perturbation on fifteen clustering algorithms," Psychometrika, 45, 325-342 (1980).

M. K. Pakhira, S. Bandyopadhyay and U. Maulik, "Validity index for crisp and fuzzy clusters," Pattern Recogn 37(3):487–501 (2004).

S. Saitta, B. Raphael, I. Smith, "A bounded index for cluster validity," In Perner, P.: Machine Learning and Data Mining in Pattern Recognition, Lecture Notes in Computer Science, 4571, Springer (2007).

A. Starczewski, "A new validity index for crisp clusters," Pattern Anal Applic 20, 687–700 (2017).

N. Wiroonsri, "Clustering performance analysis using a new correlation based cluster validity index," Pattern Recognition, 145, 109910, 2024.

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]


# ---- Kmeans ----

# Compute all the indices by Hvalid
Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = "all",
  method = "kmeans", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)

# Compute selected a set of indices ("NC","NCI","DI","DB")
Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = c("NC","NCI","DI","DB"),
  method = "kmeans", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)

# ---- Hierarchical ----

# Average linkage

# Compute all the indices by Hvalid
Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = "all",
  method = "hclust_average", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)

# Compute selected a set of indices ("NC","NCI","DI","DB")
Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = c("NC","NCI","DI","DB"),
  method = "hclust_average", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)

#---Plot and compare the indexes---

# Compute six cluster validity indexes of a kmeans clustering result for k from 2 to 15
IDX.list = c("NCI", "DI", "DB", "DBs", "CSL", "CH")

Hvalid.result = Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = IDX.list,
  method = "hclust_average", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)

# Plot the computed indexes
plot_idx(Hvalid.result)
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]


# ---- Kmeans ----

# Compute all the indices by Hvalid
Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = "all",
  method = "kmeans", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)

# Compute selected a set of indices ("NC","NCI","DI","DB")
Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = c("NC","NCI","DI","DB"),
  method = "kmeans", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)

# ---- Hierarchical ----

# Average linkage

# Compute all the indices by Hvalid
Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = "all",
  method = "hclust_average", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)

# Compute selected a set of indices ("NC","NCI","DI","DB")
Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = c("NC","NCI","DI","DB"),
  method = "hclust_average", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)

#---Plot and compare the indexes---

# Compute six cluster validity indexes of a kmeans clustering result for k from 2 to 15
IDX.list = c("NCI", "DI", "DB", "DBs", "CSL", "CH")

Hvalid.result = Hvalid(scale(x), kmax = 15, kmin = 2, indexlist = IDX.list,
  method = "hclust_average", p = 2, q = 2, corr = "pearson", nstart = 100, NCstart = TRUE)

# Plot the computed indexes
plot_idx(Hvalid.result)

Modified Kernel form of Pakhira-Bandyopadhyay-Maulik (KPBM) index

Description

Computes the KPBM (C. Alok, 2010) index for a result of either FCM or EM clustering from user specified cmin to cmax.

Usage

KPBM.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)
KPBM.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`cmax`	a maximum number of clusters to be considered.
`cmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"FCM"` or `"EM"`). The default is `"FCM"`.
`fzm`	a number greater than 1 giving the degree of fuzzification for `method = "FCM"`. The default is `2`.
`nstart`	a maximum number of initial random sets for FCM for `method = "FCM"`. The default is `20`.
`iter`	a maximum number of iterations for `method = "FCM"`. The default is `100`.

Details

The KPBM index is defined as

$KPBM(c) = \left(\frac{\max_{j \neq k}\| {v}_j-{v}_k\|}{c\sum_{j=1}^c\sum_{i=1}^n\mu_{ij}\| {x}_i-{v}_j\|}\right)^2.$

The largest value of $KPBM(c)$ indicates a valid optimal partition.

Value

KPBM

the KPBM index for c from cmin to cmax shown in a data frame where the first and the second columns are c and the KPBM index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

C. Alok. (2010). "An investigation of clustering algorithms and soft computing approaches for pattern recognition", Department of Computer Science, Assam University.

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the KPBM index
FCM.KPBM = KPBM.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.KPBM)

# The optimal number of cluster
FCM.KPBM[which.max(FCM.KPBM$KPBM),]

# ---- EM algorithm ----

# Compute the KPBM index
EM.KPBM = KPBM.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.KPBM)

# The optimal number of cluster
EM.KPBM[which.max(EM.KPBM$KPBM),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the KPBM index
FCM.KPBM = KPBM.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.KPBM)

# The optimal number of cluster
FCM.KPBM[which.max(FCM.KPBM$KPBM),]

# ---- EM algorithm ----

# Compute the KPBM index
EM.KPBM = KPBM.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.KPBM)

# The optimal number of cluster
EM.KPBM[which.max(EM.KPBM$KPBM),]

KWON index

Description

Computes the KWON (S. H. Kwon, 1998) index for a result of either FCM or EM clustering from user specified cmin to cmax.

Usage

KWON.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)
KWON.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`cmax`	a maximum number of clusters to be considered.
`cmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"FCM"` or `"EM"`). The default is `"FCM"`.
`fzm`	a number greater than 1 giving the degree of fuzzification for `method = "FCM"`. The default is `2`.
`nstart`	a maximum number of initial random sets for FCM for `method = "FCM"`. The default is `20`.
`iter`	a maximum number of iterations for `method = "FCM"`. The default is `100`.

Details

The KWON index is defined as

$KWON(c) = \frac{\sum_{j=1}^c\sum_{i=1}^n \mu_{ij}^2 \|{x}_i-{v}_j\|^2 +\frac{1}{c}\sum_{j=1}^c\| {v}_j-{v}_0\|^2}{\min_{i \neq j} \| {v}_i-{v}_j\|^2}.$

The smallest value of $KWON(c)$ indicates a valid optimal partition.

Value

KWON

the KWON index for c from cmin to cmax shown in a data frame where the first and the second columns are c and the KWON index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

S. H. Kwon, “Cluster validity index for fuzzy clustering,” Electronics letters, vol. 34, no. 22, pp. 2176–2177, 1998. doi:10.1049/el:19981523

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the KWON index
FCM.KWON = KWON.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.KWON)
# The optimal number of cluster
FCM.KWON[which.min(FCM.KWON$KWON),]

# ---- EM algorithm ----

# Compute the KWON index
EM.KWON = KWON.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.KWON)
# The optimal number of cluster
EM.KWON[which.min(EM.KWON$KWON),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the KWON index
FCM.KWON = KWON.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.KWON)
# The optimal number of cluster
FCM.KWON[which.min(FCM.KWON$KWON),]

# ---- EM algorithm ----

# Compute the KWON index
EM.KWON = KWON.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.KWON)
# The optimal number of cluster
EM.KWON[which.min(EM.KWON$KWON),]

KWON2 index

Description

Computes the KWON2 (S. H. Kwon et al., 2021) index for a result of either FCM or EM clustering from user specified cmin to cmax.

Usage

KWON2.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)
KWON2.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`cmax`	a maximum number of clusters to be considered.
`cmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"FCM"` or `"EM"`). The default is `"FCM"`.
`fzm`	a number greater than 1 giving the degree of fuzzification for `method = "FCM"`. The default is `2`.
`nstart`	a maximum number of initial random sets for FCM for `method = "FCM"`. The default is `20`.
`iter`	a maximum number of iterations for `method = "FCM"`. The default is `100`.

Details

KWON2 is defined as

$KWON2(c) = \frac{w_1\left[w_2\sum_{j=1}^c\sum_{i=1}^n \mu_{ij}^{2^{\sqrt{\frac{m}{2}}}} \|{x}_i-{v}_j\|^2 + \frac{\sum_{j=1}^c\| {v}_j-{v}_0\|^2}{\max_j \|{v}_j-{v}_0\|^2 } + w_3 \right]}{\min_{i \neq j} \| {v}_i-{v}_j\|^2 + \frac{1}{c}+\frac{1}{c^m-1}}.$

where $w_1 = \frac{n-c+1}{n}$ , $w_2 = \left(\frac{c}{c-1}\right)^{\sqrt{2}}$ and $w_3=\frac{nc}{(n-c+1)^2}$ .

The smallest value of $KWON2(c)$ indicates a valid optimal partition.

Value

KWON2

the KWON2 index for c from cmin to cmax shown in a data frame where the first and the second columns are c and the KWON2 index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

S. H. Kwon, J. Kim, and S. H. Son, “Improved cluster validity index for fuzzy clustering,” Electronics Letters, vol. 57, no. 21, pp. 792–794, 2021.

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the KWON2 index
FCM.KWON2 = KWON2.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.KWON2)

# The optimal number of cluster
FCM.KWON2[which.min(FCM.KWON2$KWON2),]

# ---- EM algorithm ----

# Compute the KWON2 index
EM.KWON2 = KWON2.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.KWON2)

# The optimal number of cluster
EM.KWON2[which.min(EM.KWON2$KWON2),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the KWON2 index
FCM.KWON2 = KWON2.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.KWON2)

# The optimal number of cluster
FCM.KWON2[which.min(FCM.KWON2$KWON2),]

# ---- EM algorithm ----

# Compute the KWON2 index
EM.KWON2 = KWON2.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.KWON2)

# The optimal number of cluster
EM.KWON2[which.min(EM.KWON2$KWON2),]

Point biserial correlation (PB)

Description

Computes the PB (G. W. Miligan, 1980) index for a result either kmeans or hierarchical clustering from user specified kmin to kmax.

Usage

PB.IDX(x, kmax, kmin = 2, method = "kmeans", corr = "pearson", nstart = 100)
PB.IDX(x, kmax, kmin = 2, method = "kmeans", corr = "pearson", nstart = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`kmax`	a maximum number of clusters to be considered.
`kmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"kmeans"`, `"hclust_complete"`, `"hclust_average"`, `"hclust_single"`). The default is `"kmeans"`.
`corr`	a character string indicating which correlation coefficient is to be computed (`"pearson"`, `"kendall"` or `"spearman"`). The default is `"pearson"`.
`nstart`	a maximum number of initial random sets for kmeans for `method = "kmeans"`. The default is `100`.

Details

The largest value of $PB(k)$ indicates a valid optimal partition.

Value

`PB`	the PB index for `k` from `kmin` to `kmax` shown in a data frame where the first and the second columns are `k` and the PB index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

G. W. Miligan, "An examination of the effect of six types of error perturbation on fifteen clustering algorithms," Psychometrika, 45, 325-342 (1980).

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute PB index
K.PB = PB.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans",
  corr = "pearson", nstart = 100)
print(K.PB)

# The optimal number of cluster
K.PB[which.max(K.PB$PB),]

# ---- Hierarchical ----

# Average linkage

# Compute PB index
H.PB = PB.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average",
  corr = "pearson")
print(H.PB)

# The optimal number of cluster
H.PB[which.max(H.PB$PB),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute PB index
K.PB = PB.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans",
  corr = "pearson", nstart = 100)
print(K.PB)

# The optimal number of cluster
K.PB[which.max(K.PB$PB),]

# ---- Hierarchical ----

# Average linkage

# Compute PB index
H.PB = PB.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average",
  corr = "pearson")
print(H.PB)

# The optimal number of cluster
H.PB[which.max(H.PB$PB),]

Pakhira-Bandyopadhyay-Maulik (PBM) index

Description

Computes the PBM (M. K. Pakhira et al., 2004) index for a result of either FCM or EM clustering from user specified cmin to cmax.

Usage

PBM.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)
PBM.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`cmax`	a maximum number of clusters to be considered.
`cmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"FCM"` or `"EM"`). The default is `"FCM"`.
`fzm`	a number greater than 1 giving the degree of fuzzification for `method = "FCM"`. The default is `2`.
`nstart`	a maximum number of initial random sets for FCM for `method = "FCM"`. The default is `20`.
`iter`	a maximum number of iterations for `method = "FCM"`. The default is `100`.

Details

The PBM index is defined as

$PBM(c) = \left(\frac{\sum_{i=1}^n \| {x}_i-{v}_0\| \cdot \max_{j \neq k}\| {v}_j-{v}_k\|}{c\sum_{j=1}^c\sum_{i=1}^n\mu_{ij}\| {x}_i-{v}_j\|}\right)^2.$

The largest value of $PBM(c)$ indicates a valid optimal partition.

Value

PBM

the PBM index for c from cmin to cmax shown in a data frame where the first and the second columns are c and the PBM index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

M. K. Pakhira, S. Bandyopadhyay, and U. Maulik, “Validity index for crisp and fuzzy clusters,” Pattern recognition, vol. 37, no. 3, pp. 487–501, 2004.

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the PBM index
FCM.PBM = PBM.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.PBM)

# The optimal number of cluster
FCM.PBM[which.max(FCM.PBM$PBM),]

# ---- EM algorithm ----

# Compute the PBM index
EM.PBM = PBM.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.PBM)

# The optimal number of cluster
EM.PBM[which.max(EM.PBM$PBM),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the PBM index
FCM.PBM = PBM.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.PBM)

# The optimal number of cluster
FCM.PBM[which.max(FCM.PBM$PBM),]

# ---- EM algorithm ----

# Compute the PBM index
EM.PBM = PBM.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.PBM)

# The optimal number of cluster
EM.PBM[which.max(EM.PBM$PBM),]

Plots for visualizing CVIs

Description

Plot and compare upto 8 indices computed by the algorithms in this package.

Usage

plot_idx(idxresult,selected.idx = NULL)
plot_idx(idxresult,selected.idx = NULL)

Arguments

`idxresult`	a result from one of the algorithms `FzzyCVIs, WP.IDX, GC.IDX, CCV.IDX, XB.IDX, WL.IDX, TANG.IDX, PBM.IDX, KWON.IDX, KWON2.IDX, KPBM.IDX, HF.IDX, Hvalid, Wvalid, SF.IDX, PB.IDX, DI.IDX, DB.IDX, CSL.IDX, CH.IDX or STRPBM.IDX`.
`selected.idx`	a numeric vector indicates a part of the indexes from the `idxresult` in respective order selected by a user. For instance, `selected.idx = 3` or `selected.idx = c(1,3,5)` may be selected. If not specified, the full idxresult will be considered.

Value

Plots of upto 8 cluster validity indices computed from FzzyCVIs, WP.IDX, GC.IDX, CCV.IDX, XB.IDX, WL.IDX, TANG.IDX, PBM.IDX, KWON.IDX, KWON2.IDX, KPBM.IDX, HF.IDX, Hvalid, Wvalid, SF.IDX, PB.IDX, DI.IDX, DB.IDX, CSL.IDX, CH.IDX or STRPBM.IDX. When using the isolated index algorithm, all the plots computed by that algorithm will be shown. When using FzzyCVIs or Hvalid with more than 8 selected indices, the first 8 indices will be plotted.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, "A correlation-based fuzzy cluster validity index with secondary options detector," arXiv:2308.14785, 2023

Examples


library(UniversalCVI)

# Iris data
x = iris[,1:4]

# ----Compute all the indices by FzzyCVIs ----
FCVIs = FzzyCVIs(scale(x), cmax = 10, cmin = 2, indexlist = 'all', corr = 'pearson',
                 method = 'FCM', fzm = 2, iter = 100, nstart = 20, NCstart = TRUE)

# plots of the eight indices by default
plot_idx(idxresult = FCVIs)

# plots of a specific selected.idx
plot_idx(idxresult = FCVIs, selected.idx = c(2,5,7))

# ----Compute all the indices by Wvalid ----
FCM.NC = Wvalid(scale(x), kmax = 10, kmin=2, method = 'kmeans',
  corr='pearson', nstart=100, NCstart = TRUE)

# plots of the four indices by default
plot_idx(idxresult = FCM.NC)

# ----Compute all the indices by XB.IDX ----

FCM.XB = XB.IDX(scale(x), cmax = 10, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
plot_idx(idxresult = FCM.XB)
library(UniversalCVI)

# Iris data
x = iris[,1:4]

# ----Compute all the indices by FzzyCVIs ----
FCVIs = FzzyCVIs(scale(x), cmax = 10, cmin = 2, indexlist = 'all', corr = 'pearson',
                 method = 'FCM', fzm = 2, iter = 100, nstart = 20, NCstart = TRUE)

# plots of the eight indices by default
plot_idx(idxresult = FCVIs)

# plots of a specific selected.idx
plot_idx(idxresult = FCVIs, selected.idx = c(2,5,7))

# ----Compute all the indices by Wvalid ----
FCM.NC = Wvalid(scale(x), kmax = 10, kmin=2, method = 'kmeans',
  corr='pearson', nstart=100, NCstart = TRUE)

# plots of the four indices by default
plot_idx(idxresult = FCM.NC)

# ----Compute all the indices by XB.IDX ----

FCM.XB = XB.IDX(scale(x), cmax = 10, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
plot_idx(idxresult = FCM.XB)

R1 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 9 different Gaussian distributions labeled as 1-9.

Usage

R1_data
R1_data

Format

A data frame with 450 data points and 3 variables

x: Numeric values generated from Gaussian distributions
y: Numeric values generated from Gaussian distributions
label: Categorical labels 1,2,3,4,5,6,7,8,9

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

R2 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 7 different Gaussian distributions labeled as 1-7.

Usage

R2_data
R2_data

Format

A data frame with 1750 data points and 3 variables

x: Numeric values generated from Gaussian distributions
y: Numeric values generated from Gaussian distributions
label: Categorical labels 1,2,3,4,5,6,7

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

R3 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 16 different Gaussian distributions labeled as 1-16.

Usage

R3_data
R3_data

Format

A data frame with 1600 data points and 3 variables

x: Numeric values generated from Gaussian distributions
y: Numeric values generated from Gaussian distributions
label: Categorical labels 1,2,3,...,16

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

R4 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 5 different Gaussian distributions labeled as 1-5.

Usage

R4_data
R4_data

Format

A data frame with 1250 data points and 3 variables

x: Numeric values generated from Gaussian distributions
y: Numeric values generated from Gaussian distributions
label: Categorical labels 1,2,3,4,5

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

R5 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 6 different Gaussian distributions labeled as 1-6.

Usage

R5_data
R5_data

Format

A data frame with 1200 data points and 3 variables

x: Numeric values generated from Gaussian distributions
y: Numeric values generated from Gaussian distributions
label: Categorical labels 1,2,3,4,5,6

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

R6 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 6 different Gaussian distributions labeled as 1-6.

Usage

R6_data
R6_data

Format

A data frame with 1500 data points and 3 variables

x: Numeric values generated from Gaussian distributions
y: Numeric values generated from Gaussian distributions
label: Categorical labels 1,2,3,4,5,6

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

R7 Artificial Dataset

Description

A 2-dimensional dataset from Wiroonsri and Preedasawakul (2023) generated from 6 different Gaussian and 3 Uniform distributions labeled as 1-3.

Usage

R7_data
R7_data

Format

A data frame with 1200 data points and 3 variables

x: Numeric values generated from Gaussian and Uniform distributions
y: Numeric values generated from Gaussian and Uniform distributions
label: Categorical labels 1,2,3

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, A correlation-based fuzzy cluster validity index with secondary options detector, arXiv:2308.14785, 2023

The score function

Description

Computes the SF (S. Saitta et al., 2007) index for a result either kmeans or hierarchical clustering from user specified kmin to kmax.

Usage

SF.IDX(x, kmax, kmin = 2, method = "kmeans", nstart = 100)
SF.IDX(x, kmax, kmin = 2, method = "kmeans", nstart = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`kmax`	a maximum number of clusters to be considered.
`kmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"kmeans"`, `"hclust_complete"`, `"hclust_average"`, `"hclust_single"`). The default is `"kmeans"`.
`nstart`	a maximum number of initial random sets for kmeans for `method = "kmeans"`. The default is `100`.

Details

The smallest value of $SF(k)$ indicates a valid optimal partition.

Value

`SF`	the Score function index for `k` from `kmin` to `kmax` shown in a data frame where the first and the second columns are `k` and the SF index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute the SF index
K.SF = SF.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans", nstart = 100)
print(K.SF)

# The optimal number of cluster
K.SF[which.min(K.SF$SF),]

# ---- Hierarchical ----

# Average linkage

# Compute the SF index
H.SF = SF.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average")
print(H.SF)

# The optimal number of cluster
H.SF[which.min(H.SF$SF),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute the SF index
K.SF = SF.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans", nstart = 100)
print(K.SF)

# The optimal number of cluster
K.SF[which.min(K.SF$SF),]

# ---- Hierarchical ----

# Average linkage

# Compute the SF index
H.SF = SF.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average")
print(H.SF)

# The optimal number of cluster
H.SF[which.min(H.SF$SF),]

Silhouette index

Description

Computes the SH (Rousseeuw, 1987; Kaufman and Rousseeuw, 2009) index for a result either kmeans or hierarchical clustering from user specified kmin to kmax.

Usage

SH.IDX(x, kmax, kmin = 2, method = "kmeans", nstart = 100)
SH.IDX(x, kmax, kmin = 2, method = "kmeans", nstart = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`kmax`	a maximum number of clusters to be considered.
`kmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"kmeans"`, `"hclust_complete"`, `"hclust_average"`, `"hclust_single"`). The default is `"kmeans"`.
`nstart`	a maximum number of initial random sets for kmeans for `method = "kmeans"`. The default is `100`.

Details

For $i \in [n]$ , $l \in [k]$ , and $x_i \in C_l$ , let

$a(i) = \dfrac{1}{|C_l|-1}\sum_{y \in C_l} \left\|x_i-y\right\| and$

$b(i) = \min_{r \neq l} \dfrac{1}{|C_r|} \sum_{y \in C_r} \left\|x_i-y\right\|.$

The silhouette value of one data point $x_j$ is defined as:

$s(j) = \begin{cases} \dfrac{b(j) - a(j)}{\max\{a(j),b(i)\}} &\text{ \ \ if \ } |C_j| > 1 \\ 0 &\text{ \ \ if \ } |C_j| = 1 \end{cases}.$

The silhouette index is defined as

$SH(k) = \dfrac{1}{n} \sum_{i = 1}^n s(i).$

The largest value of $SH(k)$ indicates a valid optimal partition.

Value

`SH`	the SH index for `k` from `kmin` to `kmax` shown in a data frame where the first and the second columns are `k` and the SH index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

Rousseeuw, P.J., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65.

Kaufman, L. and Rousseeuw, P.J., 2009. Finding groups in data: an introduction to cluster analysis. John Wiley & Sons.

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Hierarchical ----

# Average linkage

# Compute the SH index
H.SH = SH.IDX(scale(x), kmax = 10, kmin = 2, method = "hclust_average", nstart = 1)
print(H.SH)

# The optimal number of cluster
H.SH[which.max(H.SH$SH),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Hierarchical ----

# Average linkage

# Compute the SH index
H.SH = SH.IDX(scale(x), kmax = 10, kmin = 2, method = "hclust_average", nstart = 1)
print(H.SH)

# The optimal number of cluster
H.SH[which.max(H.SH$SH),]

Starczewski and Pakhira-Bandyopadhyay-Maulik for crisp clustering indexes

Description

Computes the STR (A. Starczewski, 2017) and PBM (M. K. Pakhira et al., 2004) indexes for a result either kmeans or hierarchical clustering from user specified kmin to kmax.

Usage

STRPBM.IDX(x, kmax, kmin = 2, method = "kmeans", indexlist = "all", nstart = 100)
STRPBM.IDX(x, kmax, kmin = 2, method = "kmeans", indexlist = "all", nstart = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`kmax`	a maximum number of clusters to be considered.
`kmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"kmeans"`, `"hclust_complete"`, `"hclust_average"`, `"hclust_single"`). The default is `"kmeans"`.
`indexlist`	a character string indicating which cluster validity indexes to be computed (`"all"`, `"STR"`, `"PBM"`). More than one indexes can be selected.
`nstart`	a maximum number of initial random sets for kmeans for `method = "kmeans"`. The default is `100`.

Details

PBM index can be used with both crisp and fuzzy clustering algorithms.
The largest value of $STR(k)$ indicates a valid optimal partition.
The largest value of $PBM(k)$ indicates a valid optimal partition.

Value

`STR`	the STR index for `k` from `kmin` to `kmax` shown in a data frame where the first and the second columns are `k` and the STR index, respectively.
`PBM`	the PBM index for `k` from `kmin` to `kmax` shown in a data frame where the first and the second columns are `k` and the PBM index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

M. K. Pakhira, S. Bandyopadhyay and U. Maulik, "Validity index for crisp and fuzzy clusters," Pattern Recogn 37(3):487–501 (2004).

A. Starczewski, "A new validity index for crisp clusters," Pattern Anal Applic 20, 687–700 (2017).

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute all the indices by STRPBM.IDX
K.ALL = STRPBM.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans",
  indexlist = "all", nstart = 100)
print(K.ALL)

# Compute STR index
K.STR = STRPBM.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans",
  indexlist = "STR", nstart = 100)
print(K.STR)

# ---- Hierarchical ----

# Average linkage

# Compute all the indices by STRPBM.IDX
H.ALL = STRPBM.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average",
  indexlist = "all")
print(H.ALL)

# Compute STR index
H.STR = STRPBM.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average",
  indexlist = "STR")
print(H.STR)

library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute all the indices by STRPBM.IDX
K.ALL = STRPBM.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans",
  indexlist = "all", nstart = 100)
print(K.ALL)

# Compute STR index
K.STR = STRPBM.IDX(scale(x), kmax = 15, kmin = 2, method = "kmeans",
  indexlist = "STR", nstart = 100)
print(K.STR)

# ---- Hierarchical ----

# Average linkage

# Compute all the indices by STRPBM.IDX
H.ALL = STRPBM.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average",
  indexlist = "all")
print(H.ALL)

# Compute STR index
H.STR = STRPBM.IDX(scale(x), kmax = 15, kmin = 2, method = "hclust_average",
  indexlist = "STR")
print(H.STR)

Tang index

Description

Computes the TANG (Y. Tang et al., 2005) index for a result of either FCM or EM clustering from user specified cmin to cmax.

Usage

TANG.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)
TANG.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`cmax`	a maximum number of clusters to be considered.
`cmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"FCM"` or `"EM"`). The default is `"FCM"`.
`fzm`	a number greater than 1 giving the degree of fuzzification for `method = "FCM"`. The default is `2`.
`nstart`	a maximum number of initial random sets for FCM for `method = "FCM"`. The default is `20`.
`iter`	a maximum number of iterations for `method = "FCM"`. The default is `100`.

Details

The Tang index is defined as

$TANG(c) = \frac{\sum_{j=1}^c \sum_{i=1}^n\mu_{ij}^2\| {x}_i-{v}_j\|^2 + \frac{1}{c(c-1)}\sum_{j\neq k}\| {v}_j-{v}_k\|^2}{\min_{j\neq k} \{ \| {v}_j-{v}_k\|^2 \}+\frac{1}{c}}.$

The smallest value of $TANG(c)$ indicates a valid optimal partition.

Value

TANG

the TANG index for c from cmin to cmax shown in a data frame where the first and the second columns are c and the TANG index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

Y. Tang, F. Sun, and Z. Sun, “Improved validation index for fuzzy clustering,” in Proceedings of the 2005, American Control Conference, 2005., pp. 1120–1125 vol. 2, 2005. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1470111&isnumber=31519

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the TANG index
FCM.TANG = TANG.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.TANG)

# The optimal number of cluster
FCM.TANG[which.min(FCM.TANG$TANG),]

# ---- EM algorithm ----

# Compute the TANG index
EM.TANG = TANG.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.TANG)

# The optimal number of cluster
EM.TANG[which.min(EM.TANG$TANG),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the TANG index
FCM.TANG = TANG.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.TANG)

# The optimal number of cluster
FCM.TANG[which.min(FCM.TANG$TANG),]

# ---- EM algorithm ----

# Compute the TANG index
EM.TANG = TANG.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.TANG)

# The optimal number of cluster
EM.TANG[which.min(EM.TANG$TANG),]

Wu and Li (WL) index

Description

Computes the WL (C. H. Wu et al., 2015) index for a result of either FCM or EM clustering from user specified cmin to cmax.

Usage

WL.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)
WL.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`cmax`	a maximum number of clusters to be considered.
`cmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"FCM"` or `"EM"`). The default is `"FCM"`.
`fzm`	a number greater than 1 giving the degree of fuzzification for `method = "FCM"`. The default is `2`.
`nstart`	a maximum number of initial random sets for FCM for `method = "FCM"`. The default is `20`.
`iter`	a maximum number of iterations for `method = "FCM"`. The default is `100`.

Details

The WL index is defined as

$WL(c) = \frac{\sum_{j=1}^c\left(\frac{\sum_{i=1}^n\mu_{ij}^2\| {x}_i-{v}_j\|^2}{\sum_{i=1}^n\mu_{ij}}\right)}{min_{j \neq k}\{\| {v}_j-{v}_k\|^2\} +median_{j \neq k }\{\| {v}_j-{v}_k\|^2\}}.$

The smallest value of $WL(c)$ indicates a valid optimal partition.

Value

`WL`	the WL index for `c` from `cmin` to `cmax` shown in a data frame where the first and the second columns are `c` and the WL index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

C. H. Wu, C. S. Ouyang, L. W. Chen, and L. W. Lu, “A new fuzzy clustering validity index with a median factor for centroid-based clustering,” IEEE Transactions on Fuzzy Systems, vol. 23, no. 3, pp. 701–718, 2015.https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6811211&isnumber=7115244

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the WL index
FCM.WL = WL.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.WL)

# The optimal number of cluster
FCM.WL[which.min(FCM.WL$WL),]

# ---- EM algorithm ----

# Compute the WL index
EM.WL = WL.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.WL)

# The optimal number of cluster
EM.WL[which.min(EM.WL$WL),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the WL index
FCM.WL = WL.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.WL)

# The optimal number of cluster
FCM.WL[which.min(FCM.WL$WL),]

# ---- EM algorithm ----

# Compute the WL index
EM.WL = WL.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.WL)

# The optimal number of cluster
EM.WL[which.min(EM.WL$WL),]

Wiroonsri and Preedasawakul (WP) index

Description

Computes the WPC (WP correlation), WP, WPCI1 and WPCI2 (N. Wiroonsri and O. Preedasawakul, 2023) indexes for a result of either FCM or EM clustering from user specified cmin to cmax.

Usage

WP.IDX(x, cmax, cmin = 2, corr = 'pearson', method = 'FCM', fzm = 2,
  gamma = (fzm^2*7)/4, sampling = 1, iter = 100, nstart = 20, NCstart = TRUE)
WP.IDX(x, cmax, cmin = 2, corr = 'pearson', method = 'FCM', fzm = 2,
  gamma = (fzm^2*7)/4, sampling = 1, iter = 100, nstart = 20, NCstart = TRUE)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`cmax`	a maximum number of clusters to be considered.
`cmin`	a minimum number of clusters to be considered. The default is `2`.
`corr`	a character string indicating which correlation coefficient is to be computed (`"pearson"`, `"kendall"` or `"spearman"`). The default is `"pearson"`.
`method`	a character string indicating which clustering method to be used (`"FCM"` or `"EM"`). The default is `"FCM"`.
`fzm`	a number greater than 1 giving the degree of fuzzification for `method = "FCM"`. The default is `2`.
`gamma`	adjusted fuzziness parameter for `indexlist` = (`"WP"`, `"WPC"`, `"WPCI1"`, `"WPCI2"`). The default is computed from $7fzm^2/4$ .
`sampling`	a number greater than 0 and less than or equal to 1 indicating the undersampling proportion of data to be used. This argument is intended for handling a large dataset. The default is `1`.
`iter`	a maximum number of iterations for `method = "FCM"`. The default is `100`.
`nstart`	a maximum number of initial random sets for FCM for `method = "FCM"`. The default is `20`.
`NCstart`	logical for `indexlist` = (`"WP"`, `"WPC"`, `"WPCI1"`,`"WPCI2"`), if `TRUE`, the WP correlation at c=1 is defined as an adjusted sd of the distances between all data points and their mean. Otherwise, the WP correlation at c=1 is defined as 0.

Details

The newly introduced index was inspired by the recently introduced Wiroonsri index which is only compatible with hard clustering methods.

Value

WPC

the WP correlations for c from cmin-1 to cmax+1 shown in a data frame where the first and the second columns are c and the WPC, respectively.

Each of the followings show the value of each index for c from cmin to cmax in a data frame.

`WP`	the WP index.
`WPCI1`	the WPCI1 index.
`WPCI2`	the WPCI2 index.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, O. Preedasawakul, "A correlation-based fuzzy cluster validity index with secondary options detector," arXiv:2308.14785, 2023

Examples

library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute all the indices by WP.IDX using default gamma
FCM.WP = WP.IDX(scale(x), cmax = 10, cmin = 2, corr = 'pearson', method = 'FCM', fzm = 2,
  iter = 100, nstart = 20, NCstart = TRUE)
print(FCM.WP$WP)

# The optimal number of cluster
FCM.WP$WP[which.max(FCM.WP$WP$WPI),]


# ---- EM algorithm ----

# Compute all the indices by WP.IDX using default gamma
EM.WP = WP.IDX(scale(x), cmax = 10, cmin = 2, corr = 'pearson', method = 'EM',
  iter = 100, nstart = 20, NCstart = TRUE)
print(EM.WP$WP)

# The optimal number of cluster
EM.WP$WP[which.max(EM.WP$WP$WPI),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute all the indices by WP.IDX using default gamma
FCM.WP = WP.IDX(scale(x), cmax = 10, cmin = 2, corr = 'pearson', method = 'FCM', fzm = 2,
  iter = 100, nstart = 20, NCstart = TRUE)
print(FCM.WP$WP)

# The optimal number of cluster
FCM.WP$WP[which.max(FCM.WP$WP$WPI),]


# ---- EM algorithm ----

# Compute all the indices by WP.IDX using default gamma
EM.WP = WP.IDX(scale(x), cmax = 10, cmin = 2, corr = 'pearson', method = 'EM',
  iter = 100, nstart = 20, NCstart = TRUE)
print(EM.WP$WP)

# The optimal number of cluster
EM.WP$WP[which.max(EM.WP$WP$WPI),]

Wiroonsri(2024) correlation-based cluster validity indices

Description

Computes the NC correlation, NCI, NCI1 and NCI2 cluster validity indices for the number of clusters from user specified kmin to kmax obtained from either K-means or hierarchical clustering based on the recent paper by Wiroonsri(2024).

Usage

Wvalid(x, kmax, kmin = 2, method = "kmeans",
  corr = "pearson", nstart = 100, sampling = 1, NCstart = TRUE)
Wvalid(x, kmax, kmin = 2, method = "kmeans",
  corr = "pearson", nstart = 100, sampling = 1, NCstart = TRUE)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`kmax`	a maximum number of clusters to be considered.
`kmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"kmeans"`, `"hclust_complete"`, `"hclust_average"`, `"hclust_single"`). The default is `"kmeans"`.
`corr`	a character string indicating which correlation coefficient is to be computed (`"pearson"`, `"kendall"` or `"spearman"`). The default is `"pearson"`.
`nstart`	a maximum number of initial random sets for kmeans for `method = "kmeans"`. The default is `100`.
`sampling`	a number greater than 0 and less than or equal to 1 indicating the undersampling proportion of data to be used. This argument is intended for handling a large dataset. The default is `1`.
`NCstart`	logical for `indexlist` includes the `"NC"`, `"NCI"`, `"NCI1"`, and `"NCI2"`), if `TRUE`, the NC correlation at `k=1` is defined as the ratio introduced in the reference. Otherwise, it is assigned as `0`.

Details

Value

`NC`	the NC correlations for `k` from `kmin-1` to `kmax+1` shown in a data frame where the first and the second columns are `k` and the NC, respectively.

Each of the followings shows the values of each index for k from kmin to kmax in a data frame.

`NCI`	the NCI index.
`NCI1`	the NCI1 index.
`NCI2`	the NCI2 index.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

N. Wiroonsri, "Clustering performance analysis using a new correlation based cluster validity index," Pattern Recognition, 145, 109910, 2024. doi:10.1016/j.patcog.2023.109910

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute all the indices by Wvalid
K.NC = Wvalid(scale(x), kmax = 15, kmin=2, method = 'kmeans',
  corr='pearson', nstart=100, NCstart = TRUE)
print(K.NC)

# The optimal number of cluster
K.NC$NCI[which.max(K.NC$NCI$NCI),]

# ---- Hierarchical ----

# Average linkage

# Compute all the indices by Wvalid
H.NC = Wvalid(scale(x), kmax = 15, kmin=2, method = 'hclust_average',
  corr='pearson', nstart=100, NCstart = TRUE)
print(H.NC)

# The optimal number of cluster
H.NC$NCI[which.max(H.NC$NCI$NCI),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- Kmeans ----

# Compute all the indices by Wvalid
K.NC = Wvalid(scale(x), kmax = 15, kmin=2, method = 'kmeans',
  corr='pearson', nstart=100, NCstart = TRUE)
print(K.NC)

# The optimal number of cluster
K.NC$NCI[which.max(K.NC$NCI$NCI),]

# ---- Hierarchical ----

# Average linkage

# Compute all the indices by Wvalid
H.NC = Wvalid(scale(x), kmax = 15, kmin=2, method = 'hclust_average',
  corr='pearson', nstart=100, NCstart = TRUE)
print(H.NC)

# The optimal number of cluster
H.NC$NCI[which.max(H.NC$NCI$NCI),]

Xie and Beni (XB) index

Description

Computes the XB (X. L. Xie and G. Beni, 1991) index for a result of either FCM or EM clustering from user specified cmin to cmax.

Usage

XB.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)
XB.IDX(x, cmax, cmin = 2, method = "FCM", fzm = 2, nstart = 20, iter = 100)

Arguments

`x`	a numeric data frame or matrix where each column is a variable to be used for cluster analysis and each row is a data point.
`cmax`	a maximum number of clusters to be considered.
`cmin`	a minimum number of clusters to be considered. The default is `2`.
`method`	a character string indicating which clustering method to be used (`"FCM"` or `"EM"`). The default is `"FCM"`.
`fzm`	a number greater than 1 giving the degree of fuzzification for `method = "FCM"`. The default is `2`.
`nstart`	a maximum number of initial random sets for FCM for `method = "FCM"`. The default is `20`.
`iter`	a maximum number of iterations for `method = "FCM"`. The default is `100`.

Details

The XB index is defined as

$XB(c) = \frac{\sum_{j=1}^c\sum_{i=1}^n\mu_{ij}^2\| {x}_i-{v}_j\|^2} {n \cdot \min_{j\neq k} \{ \| {v}_j-{v}_k\|^2 \}}.$

The lowest value of $XB(c)$ indicates a valid optimal partition.

Value

`XB`	the XB index for `c` from `cmin` to `cmax` shown in a data frame where the first and the second columns are `c` and the XB index, respectively.

Author(s)

Nathakhun Wiroonsri and Onthada Preedasawakul

References

X. Xie and G. Beni, “A validity measure for fuzzy clustering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.

Examples


library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the XB index
FCM.XB = XB.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.XB)

# The optimal number of cluster
FCM.XB[which.min(FCM.XB$XB),]

# ---- EM algorithm ----

# Compute the XB index
EM.XB = XB.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.XB)

# The optimal number of cluster
EM.XB[which.min(EM.XB$XB),]
library(UniversalCVI)

# The data is from Wiroonsri (2024).
x = R1_data[,1:2]

# ---- FCM algorithm ----

# Compute the XB index
FCM.XB = XB.IDX(scale(x), cmax = 15, cmin = 2, method = "FCM",
  fzm = 2, nstart = 20, iter = 100)
print(FCM.XB)

# The optimal number of cluster
FCM.XB[which.min(FCM.XB$XB),]

# ---- EM algorithm ----

# Compute the XB index
EM.XB = XB.IDX(scale(x), cmax = 15, cmin = 2, method = "EM",
  nstart = 20, iter = 100)
print(EM.XB)

# The optimal number of cluster
EM.XB[which.min(EM.XB$XB),]

Package 'UniversalCVI'

Help Index

Accuracy detection for a clustering result with known classes

Description

Usage

Arguments

Value

Author(s)

References

See Also

Examples

Correlation Cluster Validity (CCV) index

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Calinski–Harabasz (CH) index

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

Chou-Su-Lai (CSL) index

Description

Usage

Arguments

Details

Value

Author(s)

References

See Also

Examples

D1 Artificial Dataset

Description

Usage

Format

Author(s)

References

See Also

D10 Artificial Dataset

Description

Usage

Format

Author(s)

References

See Also

D2 Artificial Dataset

Description

Usage

Format

Author(s)

References

See Also

D3 Artificial Dataset

Description

Usage

Format

Author(s)

References

See Also

D4 Artificial Dataset

Description

Usage

Format

Author(s)

References

See Also

D5 Artificial Dataset

Description

Usage

Format