.. _EDH-net: *************************** Epigraphic Networks using R *************************** .. --- .. layout: post .. title: "Epigraphic Networks: Measures of similarity of artefact assemblages and geographic proximity" .. date: 10-03-2020 12:00:00 .. author: jaro .. categories: Short-reports .. --- .. This post is about Epigraphic Networks based on measures of similarity of artefact assemblages and geographic proximity. | Measuring similarity of artefact assemblages and geographic proximity ===================================================================== .. function [`simil() `_] To measure similarity of artefact assemblages and geographic proximity, [R] package ``sdam`` provides the function ``simil()``, which allows assesing similarity by comparing columns representing --in this case-- different attributes for epigraphic inscriptions. | Function usage -------------- .. function:: simil .. code-block:: R # arguments supported (currently) R> simil(x, vars, type=c("sm","ja","ra"), uniq, diag.incl) .. R> simil(x, vars, type=c("sm","ja","ra"), na.rm, uniq, diag.incl) Which returns a square and valued matrix with similarity meassures based on simple match among variables. | Parameters ---------- Formal arguments of ``simil()`` are: - `x:` a data frame with an `id` column - `vars:` (vector) column(s) in ``x`` representing attributes or variables - `type:` whether the similarity measure is by simple matching ``"sm"``, Jaccard ``"ja"``, or Rand index ``"ra"`` | Optional parameters ------------------- .. - `na.rm:` .. (optional and logical) include ``NA``s (and ``NULL``s)? - `uniq:` (optional and logical) only unique elements? - `diag.incl:` include entries in the matrix diagonal? | Note that at this point the ID column represents the labels of the nodes. In case that an ID column does not exists, then the first column is taken as *id* provided that there are not duplicated entry names in ``x``. | Similarity measures =================== For cases where duplication does not matter, a good option is *Jaccard similarity* whose index is the proportion of the number of observations in both sets to the number in either set. This index is formally expressed as :math:`J(A,B)= \left| A \cap B \right| / \left| A \cup B \right|`. (Otherwise .. todo:: Rand index plain and corrected by chance. | Structures of similarity in the EDH dataset =========================================== We illustrate the use of the ``simil()`` function with ancient inscriptions from the Epigraphic Database Heidelberg, and we first follow the entry :ref:`Epigraphic Database Heidelberg ` to see how accessing the ``EDH`` dataset using ``sdam`` [R] package. | .. code-block:: R # devtools::install_github("mplex/cedhar", subdir="pkg/sdam") # devtools::install_github("sdam-au/sdam") R> library("sdam") # load the EDH data from this package R> data("EDH") | Epigraphic network data ----------------------- Creating epigraphic network data list with variables measures of similarity of artefact assemblages and geographic proximity. For example, a list object named ``epinet`` with the ``ID`` of the inscription plus seven other characteristics from the ``EDH`` dataset is produced by ``edhw()``. .. See :ref:`"Relative Dating" in Epigraphic Database Heidelberg ` for another use of functional programming with ``lapply()``. .. code-block:: R # choose variables of interest and record it as a data frame R> epinet <- edhw(vars=c("type_of_inscription", "language", "material", "country", "findspot_ancient", "not_after", "not_before"), as="df") .. .. code-block:: R .. .. # choose variables of interest .. R> EDH1 <- lapply(EDH, `[`, c("ID", "type_of_monument", "language", "material", .. "country", "findspot_ancient", "not_after", "not_before") ) .. .. .. .. | .. .. Data frame .. ^^^^^^^^^^ .. .. Now we execute a function call on the list to convert list of data into a data frame object named ``epinet``. .. .. .. code-block:: R .. .. # make a data frame from this variable .. R> epinet <- as.data.frame(do.call(rbind, EDH1)) | Take a look at this data: .. code-block:: R # first eight entries in the data frame R> head(epinet, 8) # id type_of_inscription not_before not_after material language findspot_ancient country #1 HD000001 epitaph 0071 0130 Marmor, geädert / farbig Latin Cumae, bei Italy #2 HD000002 epitaph 0051 0200 marble: rocks - metamorphic rocks Latin Roma Italy #3 HD000003 honorific inscription 0131 0170 marble: rocks - metamorphic rocks Latin Spain #4 HD000004 votive inscription 0151 0200 limestone: rocks - clastic sediments Latin Ipolcobulcula Spain #5 HD000005 epitaph 0001 0200 Latin Roma Italy #6 HD000006 epitaph 0071 0150 limestone: rocks - clastic sediments Latin Sabora, bei Spain #7 HD000007 epitaph -0100 -0051 travertine: rocks - chemische Sedimente Latin Roma Italy #8 HD000008 epitaph 0101 0200 marble: rocks - metamorphic rocks Latin Roma? Italy For instance, entry ``8`` indicates that this ancient findspot is uncertain. | .. Regular expressions .. +++++++++++++++++++ .. To remove question marks, for example, we use a regular expression .. .. code-block:: R .. .. R> epinet2 <- as.data.frame(sapply(epinet, function (x) as.list(gsub('\\?', '', x)) )) .. .. code-block:: R .. .. # a new data frame for the same variables but now without question marks .. R> epinet2 <- as.data.frame(do.call(rbind, lapply(EDH1, function (x) as.list(gsub('\\?', '', x)) )) ) .. R> colnames(epinet2) <- names(EDH1[[1]]) .. | .. .. todo:: .. Make another function for cleaning? After all, ``sapply()`` converts ```` into ``NA``. Function ``cln()`` .. code-block:: R R> epinet2 <- cln(epinet) | And then we take a look at ``epinet2`` again, and we assume the questioned entries. .. code-block:: R # first eight entries in the new data frame R> head(epinet2, 8) # id type_of_inscription not_before not_after material language findspot_ancient country #1 HD000001 epitaph 0071 0130 Marmor, geädert / farbig Latin Cumae, bei Italy #2 HD000002 epitaph 0051 0200 marble: rocks - metamorphic rocks Latin Roma Italy #3 HD000003 honorific inscription 0131 0170 marble: rocks - metamorphic rocks Latin NA Spain #4 HD000004 votive inscription 0151 0200 limestone: rocks - clastic sediments Latin Ipolcobulcula Spain #5 HD000005 epitaph 0001 0200 NA Latin Roma Italy #6 HD000006 epitaph 0071 0150 limestone: rocks - clastic sediments Latin Sabora, bei Spain #7 HD000007 epitaph -0100 -0051 travertine: rocks - chemische Sedimente Latin Roma Italy #8 HD000008 epitaph 0101 0200 marble: rocks - metamorphic rocks Latin Roma Italy | The countries in ``epinet2`` are: .. code-block:: R # need first to unlist the component object R> unique(unlist(epinet2$country)) # [1] "Italy" "Spain" "United Kingdom" "Portugal" # [5] "France" "Libyan Arab Jamahiriya" "Germany" "Hungary" # [9] "Austria" "Bulgaria" "Bosnia and Herzegovina" "Montenegro" #[13] "Netherlands" "Tunisia" "Romania" "Algeria" #[17] "Jordan" NA "Croatia" "Switzerland" #[21] "Belgium" "Albania" "Serbia" "Egypt" #[25] "Syrian Arab Republic" "Morocco" "Turkey" "Lebanon" #[29] "Kosovo" "Macedonia" "Slovakia" "Greece" #[33] "Slovenia" "Iraq" "Israel" "unknown" #[37] "Vatican City State" "Ukraine" "Cyprus" "Yemen" #[41] "Sudan" "Luxembourg" "Czech Republic" "Malta" #[45] "Poland" "Armenia" "Monaco" "Azerbaijan" #[49] "Sweden" "Denmark" "Moldova" "Saudi Arabia" #[53] "Uzbekistan" "Liechtenstein" "Georgia" | Subsetting the data +++++++++++++++++++ For example, we use the ``base`` [R] function ``subset()`` to substract epigraphic material in "Greek-Latin" from Egypt. .. code-block:: R # a subset of a subset R> subset(subset(epinet2, country=="Egypt"), language=="Greek-Latin") # id type_of_inscription not_before not_after material language findspot_ancient country #2003 HD002003 identification inscription -0116 NA NA Greek-Latin Philae Egypt #23091 HD023091 NA 0145 NA Holz, Wachs Greek-Latin NA Egypt #23138 HD023138 votive inscription -0029 NA NA Greek-Latin Syene Egypt #27345 HD023091 NA 0145 NA Holz, Wachs Greek-Latin NA Egypt #27351 HD023138 votive inscription -0029 NA NA Greek-Latin Syene Egypt #32500 HD030147 NA 0010 0011 NA Greek-Latin Alexandria Egypt #34436 HD032079 NA NA NA NA Greek-Latin Schedia Egypt #51198 HD048625 identification inscription 0006 NA NA Greek-Latin Berenice Egypt #54194 HD051485 NA 0155 0225 NA Greek-Latin Alexandria Egypt #58318 HD055974 NA 0001 0200 NA Greek-Latin Berenice Egypt #70110 HD067781 public legal inscription -0037 -0030 NA Greek-Latin Leontopolis Egypt | Ranked frequency ++++++++++++++++ A ranked frequency of different *kinds of inscriptions* including missing information is computed as follows: .. code-block:: R R> as.data.frame(sort(table(unlist(epinet2$type_of_inscription), useNA="ifany"), decreasing=TRUE)) # Var1 Freq #1 epitaph 28522 #2 22222 #3 votive inscription 14683 #4 owner/artist inscription 5164 #5 honorific inscription 4338 #6 building/dedicatory inscription 3450 #7 mile-/leaguestone 1766 #8 identification inscription 1600 #9 acclamation 525 #10 military diploma 507 #11 list 363 #12 defixio 311 #13 label 287 #14 boundary inscription 258 #15 public legal inscription 256 #16 elogium 154 #17 seat inscription 88 #18 letter 81 #19 prayer 57 #20 private legal inscription 37 #21 assignation inscription 15 #22 calendar 14 #23 adnuntiatio 3 That is, a decreasing sorted table given as data frame of the ``type_of_inscription`` component of ``epinet2``. Since ``epinet2`` is a list object, it is required to "unlist" the data object to produce a table with the frequencies. | Example: Similarity among Egyptian epigraphs ============================================ We can compute similarity among Egyptian epigraphs with function ``simil()``. For this, we look at the attribute types stored in different columns. .. code-block:: R R> as.data.frame(colnames(epinet2)) # colnames(epinet2) #1 id #2 type_of_inscription #3 not_before #4 not_after #5 material #6 language #7 findspot_ancient #8 country | For instance, in case we want to choose ``"type_of_inscription"``, ``"material"``, and ``"findspot_ancient"``, these correspond to columns 8, 5, and 3. Similarities among Egyptian epigraphs `by simple matching` or default type ``"sm"`` with the above attribute variables are recorded in a matrix object named ``epEgs`` where the ``ID`` in ``epinet2`` corresponds to the dimensions labels. .. code-block:: R # similarity function on the subset for the three variables R> epEgs <- simil(subset(epinet2, country=="Egypt"), vars=c(8,5,6)) # number of rows in this square matrix R> nrow(epEgs) #[1] 170 | And then we look at some cell entries .. code-block:: R # similarity between the first six inscriptions in 'epEgs' R> epEgs[1:6, 1:6] # HD000744 HD002009 HD003137 HD006817 HD006820 HD008184 #HD000744 0 1 1 0 0 0 #HD002009 1 0 0 0 0 0 #HD003137 1 0 0 0 0 0 #HD006817 0 0 0 0 1 0 #HD006820 0 0 0 1 0 0 #HD008184 0 0 0 0 0 0 where we observe six records of a single similarity. | Plot similarities ----------------- To produce a graph for the similarity among Egyptian epigraphs, we employ the [R] package ``multigraph`` that depends on ``multiplex``. .. code-block:: R # define scope for the graph R> scp <- list(directed=FALSE, valued=TRUE, ecol=8, pos=0) # load "multigraph" where "multiplex" gets invoked R> library(multigraph) # plot similarity graph of 'epEgs' for the chosen variables R> multigraph(epEgs, scope=scp, layout="force", maxiter=70, main="Similarity among Egyptian epigraphs") | .. index:: graph .. image:: ../images/epEgs.png :width: 450 .. meta:: :description: Compute similarity in Epigraphic Networks :keywords: epigraphic, documentation, dataset, networks