  Computational thematics: comparing algorithms for clustering the genres of literary fiction

Sobchuk, O., & Šeļa, A. (2023). Computational thematics: comparing algorithms for clustering the genres of literary fiction. arXiv, 2305.11251. doi:10.48550/arXiv.2305.11251.


Files

gea0060pre.pdf (Preprint), 4MB
Name: gea0060pre.pdf
Description: OA
OA-Status: Not specified
Visibility: Public
MIME-Type / Checksum: application/pdf / [MD5]
Technical Metadata:
Copyright Date: -
Copyright Info: -

Locators

Locator: R scripts, data and methods, (Supplementary material)
Description: (last seen: May 2023)
OA-Status: Not specified

Creators

 Creators:
Sobchuk, Oleg (1), Author
Šeļa, Artjoms, Author
Affiliations:
(1) The MINT independent research group, Max Planck Institute of Geoanthropology, Max Planck Society, ou_3504342

Content

Free keywords: text mining, computational literary studies, genre, topic modeling
 Abstract: What are the best methods of capturing thematic similarity between literary texts? Knowing the answer to this question would be useful for automatic clustering of book genres, or any other thematic grouping. This paper compares a variety of algorithms for unsupervised learning of thematic similarities between texts, which we call "computational thematics". These algorithms belong to three steps of analysis: text preprocessing, extraction of text features, and measuring distances between the lists of features. Each of these steps includes a variety of options. We test all the possible combinations of these options: every combination of algorithms is given a task to cluster a corpus of books belonging to four pre-tagged genres of fiction. This clustering is then validated against the "ground truth" genre labels. Such comparison of algorithms allows us to learn the best and the worst combinations for computational thematic analysis. To illustrate the sharp difference between the best and the worst methods, we then cluster 5000 random novels from the HathiTrust corpus of fiction.
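The three analysis steps named in the abstract (preprocessing, feature extraction, distance measurement), followed by clustering and validation against genre labels, can be illustrated with a minimal sketch. It is not the authors' implementation (their supplementary material is in R). The toy corpus, the stopword list, the cosine distance, the average-linkage clustering, and the adjusted Rand index as the agreement score are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus with hypothetical genre labels (illustrative, not data from the paper).
DOCS = [
    ("detective", "the detective found the clue and the murder weapon"),
    ("detective", "a murder clue led the detective to the killer"),
    ("scifi", "the ship left the planet and the robot crew slept"),
    ("scifi", "a robot on the planet repaired the ship engine"),
]
STOPWORDS = {"the", "a", "and", "to", "on"}  # assumed, deliberately tiny


def preprocess(text):
    """Step 1: lowercase, split on whitespace, drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def features(token_lists, n_mfw=10):
    """Step 2: relative frequencies of the n most frequent words in the corpus."""
    totals = Counter(w for toks in token_lists for w in toks)
    vocab = [w for w, _ in totals.most_common(n_mfw)]
    return [[toks.count(w) / len(toks) for w in vocab] for toks in token_lists]


def cosine_distance(u, v):
    """Step 3: one minus the cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def cluster(vectors, k):
    """Average-linkage agglomerative clustering, merged down to k clusters."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(pair):
        a, b = pair
        dists = [cosine_distance(vectors[i], vectors[j])
                 for i in clusters[a] for j in clusters[b]]
        return sum(dists) / len(dists)

    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a] += clusters.pop(b)  # a < b, so index a stays valid
    labels = [0] * len(vectors)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


def adjusted_rand_index(truth, pred):
    """Chance-corrected agreement between two labelings (needs >= 2 items)."""
    def pairs(n):
        return n * (n - 1) / 2

    sum_ij = sum(pairs(n) for n in Counter(zip(truth, pred)).values())
    sum_a = sum(pairs(n) for n in Counter(truth).values())
    sum_b = sum(pairs(n) for n in Counter(pred).values())
    expected = sum_a * sum_b / pairs(len(truth))
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def run_pipeline(docs, k=2, n_mfw=10):
    """Run one combination of options and score it against the genre labels."""
    truth = [genre for genre, _ in docs]
    vectors = features([preprocess(text) for _, text in docs], n_mfw)
    pred = cluster(vectors, k)
    return pred, adjusted_rand_index(truth, pred)


if __name__ == "__main__":
    labels, score = run_pipeline(DOCS)
    print(labels, score)
```

Comparing combinations of options, as the abstract describes, would amount to calling `run_pipeline` repeatedly with different preprocessing, feature, and distance choices and ranking the resulting scores.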

Details

Language(s): eng - English
 Dates: 2023-05-18
 Publication Status: Published online
 Pages: 59
 Publishing info: -
 Table of Contents: Introduction
Materials and Methods
- Data: The “ground truth” genres
- Analysis: The race of algorithms
- Step 1. Choosing a combination of thematic foregrounding, features, and distance
- Step 2. Sampling for robust results
- Step 3. Clustering
Results
- Conclusion 1. Thematic foregrounding improves genre clustering
- Conclusion 2. Various feature types show similarly good performance
- Conclusion 3. The performance of LDA does not seem to depend on k of topics and n of most frequent words
- Conclusion 4. Bag-of-words approach requires a balance of thematic foregrounding and n of most frequent words
- Comparison of algorithms on a larger dataset
Discussion
 Rev. Type: No review
 Identifiers: DOI: 10.48550/arXiv.2305.11251
Other: gea0060
 Degree: -

Source 1

Title: arXiv
Source Genre: Web Page
 Creator(s):
Ginsparg, Paul, Developer
Affiliations:
-
Publ. Info: -
Pages: -
Volume / Issue: -
Sequence Number: 2305.11251
Start / End Page: -
Identifier: URN: https://arxiv.org