Computational thematics: comparing algorithms for clustering the genres of 
literary fiction

Sobchuk, Oleg; Šeļa, Artjoms

doi:10.48550/arXiv.2305.11251

Sobchuk, O., & Šeļa, A. (2023). Computational thematics: comparing algorithms for clustering the genres of literary fiction. arXiv, 2305.11251. doi:10.48550/arXiv.2305.11251.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/21.11116/0000-000D-385D-D
Version Permalink: https://hdl.handle.net/21.11116/0000-000E-510E-8

Genre: Preprint

Files

gea0060pre.pdf (Preprint), 4MB

File Permalink:
https://hdl.handle.net/21.11116/0000-000D-385F-B

Name:
gea0060pre.pdf

Description:
OA

OA-Status:
Not specified

Visibility:
Public

MIME-Type / Checksum:
application/pdf / [MD5]

Copyright Date:
-

Copyright Info:
-

License:
https://creativecommons.org/licenses/by/4.0/

Locators

Locator:
R scripts, data and methods (Supplementary material); Open Access status unknown

Description:
(last seen: May 2023)

OA-Status:
Not specified

Creators

Creators:
Sobchuk, Oleg¹, Author
Šeļa, Artjoms, Author

Affiliations:
¹ The MINT independent research group, Max Planck Institute of Geoanthropology, Max Planck Society, ou_3504342

Content

Free keywords: text mining, computational literary studies, genre, topic modeling

Abstract: What are the best methods of capturing thematic similarity between literary texts? Knowing the answer to this question would be useful for automatic clustering of book genres, or any other thematic grouping. This paper compares a variety of algorithms for unsupervised learning of thematic similarities between texts, which we call "computational thematics". These algorithms belong to three steps of analysis: text preprocessing, extraction of text features, and measuring distances between the lists of features. Each of these steps includes a variety of options. We test all the possible combinations of these options: every combination of algorithms is given a task to cluster a corpus of books belonging to four pre-tagged genres of fiction. This clustering is then validated against the "ground truth" genre labels. Such comparison of algorithms allows us to learn the best and the worst combinations for computational thematic analysis. To illustrate the sharp difference between the best and the worst methods, we then cluster 5000 random novels from the HathiTrust corpus of fiction.
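
The three analysis steps named in the abstract (preprocessing, feature extraction, distance measurement), followed by clustering and validation against genre labels, can be illustrated with a minimal Python sketch. It is not the authors' code (their supplementary scripts are in R) and it does not reproduce their option grid. Every name below is hypothetical, the four-sentence corpus is invented, and the specific choices (a tiny stopword list as preprocessing, relative frequencies of the most frequent words as features, cosine distance, average-linkage clustering, the adjusted Rand index as validation score) are assumptions made for illustration only.

```python
import re
from collections import Counter
from itertools import combinations
from math import comb, sqrt

# Hypothetical stopword list standing in for the preprocessing step.
STOPWORDS = {"the", "a", "and", "to", "on"}

def tokenize(text, drop_stopwords=True):
    """Lowercase, keep alphabetic tokens, optionally drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if not (drop_stopwords and t in STOPWORDS)]

def bag_of_words(docs, n_mfw):
    """Relative frequencies of the n_mfw most frequent words in the corpus."""
    tokenized = [tokenize(d) for d in docs]
    totals = Counter(t for doc in tokenized for t in doc)
    vocab = [w for w, _ in totals.most_common(n_mfw)]
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        length = len(doc) or 1
        vectors.append([counts[w] / length for w in vocab])
    return vectors

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def average_linkage(vectors, k):
    """Merge the two closest clusters until only k clusters remain."""
    n = len(vectors)
    dist = {(i, j): cosine_distance(vectors[i], vectors[j])
            for i, j in combinations(range(n), 2)}

    def link(a, b):
        total = sum(dist[min(i, j), max(i, j)] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]  # b > a, so index a stays valid
    labels = [0] * n
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels

def adjusted_rand_index(truth, pred):
    """Chance-corrected agreement between two partitions of the same items."""
    cells = sum(comb(v, 2) for v in Counter(zip(truth, pred)).values())
    rows = sum(comb(v, 2) for v in Counter(truth).values())
    cols = sum(comb(v, 2) for v in Counter(pred).values())
    expected = rows * cols / comb(len(truth), 2)
    maximum = (rows + cols) / 2
    if maximum == expected:
        return 1.0
    return (cells - expected) / (maximum - expected)

# Invented toy corpus with two "ground truth" genres.
docs = [
    "the detective found the clue and the detective solved the murder",
    "a murder clue led the detective to the killer",
    "the starship left the planet and the robot guided the starship",
    "a robot on the planet repaired the starship",
]
truth = ["detective", "detective", "scifi", "scifi"]

labels = average_linkage(bag_of_words(docs, n_mfw=20), k=2)
score = adjusted_rand_index(truth, labels)
print(labels, score)
```

Comparing algorithms in the sense of the abstract would mean repeating the last three lines for every combination of preprocessing, feature and distance options and ranking the combinations by their validation score; the sketch fixes one combination only.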

Details

Language(s): eng - English

Dates: Published Online: 2023-05-18

Publication Status: Published online

Pages: 59

Publishing info: -

Table of Contents:
Introduction
Materials and Methods
- Data: The “ground truth” genres
- Analysis: The race of algorithms
- Step 1. Choosing a combination of thematic foregrounding, features, and distance
- Step 2. Sampling for robust results
- Step 3. Clustering
Results
- Conclusion 1. Thematic foregrounding improves genre clustering
- Conclusion 2. Various feature types show similarly good performance
- Conclusion 3. The performance of LDA does not seem to depend on k of topics and n of most frequent words
- Conclusion 4. Bag-of-words approach requires a balance of thematic foregrounding and n of most frequent words
- Comparison of algorithms on a larger dataset
Discussion

Rev. Type: No review

Identifiers: DOI: 10.48550/arXiv.2305.11251
Other: gea0060

Degree: -

Source 1

Title: arXiv

Source Genre: Web Page

Creator(s):
Ginsparg, Paul, Developer

Affiliations:
-

Publ. Info: -

Pages: -
Volume / Issue: -
Sequence Number: 2305.11251
Start / End Page: -
Identifier: URN: https://arxiv.org