Urban Dictionary Embeddings for Slang NLP Applications
Steven Wilson, Walid Magdy, Barbara McGillivray, Kiran Garimella, Gareth Tyson
Abstract
The choice of the corpus on which word embeddings are trained can have a sizable effect on the learned representations, the types of analyses that can be performed with them, and their utility as features for machine learning models. To contribute to the existing sets of pre-trained word embeddings, we introduce and release the first set of word embeddings trained on the content of Urban Dictionary, a crowd-sourced dictionary for slang words and phrases. We show that although these embeddings are trained on fewer total tokens (by at least an order of magnitude compared to most popular pre-trained embeddings), they have high performance across a range of common word embedding evaluations, ranging from semantic similarity to word clustering tasks. Further, for some extrinsic tasks such as sentiment analysis and sarcasm detection, where we expect some knowledge of colloquial language on social media data to be required, initializing classifiers with the Urban Dictionary embeddings resulted in improved performance compared to initializing with a range of other well-known, pre-trained embeddings that are an order of magnitude larger in size.
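The abstract mentions initializing classifiers with the pre-trained Urban Dictionary embeddings. As a rough sketch of that general initialization strategy (not the authors' actual pipeline; the toy vectors and words below are hypothetical stand-ins for the released embeddings), one might build a classifier's embedding matrix like this:

```python
import numpy as np

# Hypothetical toy "pre-trained" slang vectors, standing in for the
# released Urban Dictionary embeddings (which are not reproduced here).
pretrained = {
    "lit":   np.array([0.9, 0.1, 0.3]),
    "salty": np.array([0.2, 0.8, 0.5]),
}

vocab = ["lit", "salty", "unseenword"]  # classifier's task vocabulary
dim = 3
rng = np.random.default_rng(0)

# Copy pre-trained rows where available; fall back to small random
# vectors for out-of-vocabulary words, as is common practice.
embedding_matrix = np.stack(
    [pretrained.get(w, rng.normal(scale=0.1, size=dim)) for w in vocab]
)
```

The resulting matrix would then serve as the initial weights of an embedding layer, optionally fine-tuned during training.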
- Anthology ID:
- 2020.lrec-1.586
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- SIG:
- Publisher:
- European Language Resources Association
- Note:
- Pages:
- 4764–4773
- Language:
- English
- URL:
- https://aclanthology.org/2020.lrec-1.586
- DOI:
- Bibkey:
- wilson-etal-2020-urban
- Cite (ACL):
- Steven Wilson, Walid Magdy, Barbara McGillivray, Kiran Garimella, and Gareth Tyson. 2020. Urban Dictionary Embeddings for Slang NLP Applications. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4764–4773, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Urban Dictionary Embeddings for Slang NLP Applications (Wilson et al., LREC 2020)
- PDF:
- https://aclanthology.org/2020.lrec-1.586.pdf
Export citation
- BibTeX
- MODS XML
- Endnote
- Preformatted
@inproceedings{wilson-etal-2020-urban,
    title = "Urban Dictionary Embeddings for Slang {NLP} Applications",
    author = "Wilson, Steven and Magdy, Walid and McGillivray, Barbara and Garimella, Kiran and Tyson, Gareth",
    editor = "Calzolari, Nicoletta and B{\'e}chet, Fr{\'e}d{\'e}ric and Blache, Philippe and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.586",
    pages = "4764--4773",
    abstract = "The choice of the corpus on which word embeddings are trained can have a sizable effect on the learned representations, the types of analyses that can be performed with them, and their utility as features for machine learning models. To contribute to the existing sets of pre-trained word embeddings, we introduce and release the first set of word embeddings trained on the content of Urban Dictionary, a crowd-sourced dictionary for slang words and phrases. We show that although these embeddings are trained on fewer total tokens (by at least an order of magnitude compared to most popular pre-trained embeddings), they have high performance across a range of common word embedding evaluations, ranging from semantic similarity to word clustering tasks. Further, for some extrinsic tasks such as sentiment analysis and sarcasm detection where we expect to require some knowledge of colloquial language on social media data, initializing classifiers with the Urban Dictionary Embeddings resulted in improved performance compared to initializing with a range of other well-known, pre-trained embeddings that are an order of magnitude larger in size.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
<?xml version="1.0" encoding="UTF-8"?><modsCollection xmlns="http://www.loc.gov/mods/v3"><mods ID="wilson-etal-2020-urban"> <titleInfo> <title>Urban Dictionary Embeddings for Slang NLP Applications</title> </titleInfo> <name type="personal"> <namePart type="given">Steven</namePart> <namePart type="family">Wilson</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Walid</namePart> <namePart type="family">Magdy</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Barbara</namePart> <namePart type="family">McGillivray</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Kiran</namePart> <namePart type="family">Garimella</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Gareth</namePart> <namePart type="family">Tyson</namePart> <role> <roleTerm authority="marcrelator" type="text">author</roleTerm> </role> </name> <originInfo> <dateIssued>2020-05</dateIssued> </originInfo> <typeOfResource>text</typeOfResource> <language> <languageTerm type="text">English</languageTerm> <languageTerm type="code" authority="iso639-2b">eng</languageTerm> </language> <relatedItem type="host"> <titleInfo> <title>Proceedings of the Twelfth Language Resources and Evaluation Conference</title> </titleInfo> <name type="personal"> <namePart type="given">Nicoletta</namePart> <namePart type="family">Calzolari</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Frédéric</namePart> <namePart type="family">Béchet</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart 
type="given">Philippe</namePart> <namePart type="family">Blache</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Khalid</namePart> <namePart type="family">Choukri</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Christopher</namePart> <namePart type="family">Cieri</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Thierry</namePart> <namePart type="family">Declerck</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Sara</namePart> <namePart type="family">Goggi</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Hitoshi</namePart> <namePart type="family">Isahara</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Bente</namePart> <namePart type="family">Maegaard</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Joseph</namePart> <namePart type="family">Mariani</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Hélène</namePart> <namePart type="family">Mazo</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Asuncion</namePart> <namePart type="family">Moreno</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Jan</namePart> <namePart 
type="family">Odijk</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <name type="personal"> <namePart type="given">Stelios</namePart> <namePart type="family">Piperidis</namePart> <role> <roleTerm authority="marcrelator" type="text">editor</roleTerm> </role> </name> <originInfo> <publisher>European Language Resources Association</publisher> <place> <placeTerm type="text">Marseille, France</placeTerm> </place> </originInfo> <genre authority="marcgt">conference publication</genre> <identifier type="isbn">979-10-95546-34-4</identifier> </relatedItem> <abstract>The choice of the corpus on which word embeddings are trained can have a sizable effect on the learned representations, the types of analyses that can be performed with them, and their utility as features for machine learning models. To contribute to the existing sets of pre-trained word embeddings, we introduce and release the first set of word embeddings trained on the content of Urban Dictionary, a crowd-sourced dictionary for slang words and phrases. We show that although these embeddings are trained on fewer total tokens (by at least an order of magnitude compared to most popular pre-trained embeddings), they have high performance across a range of common word embedding evaluations, ranging from semantic similarity to word clustering tasks. 
Further, for some extrinsic tasks such as sentiment analysis and sarcasm detection where we expect to require some knowledge of colloquial language on social media data, initializing classifiers with the Urban Dictionary Embeddings resulted in improved performance compared to initializing with a range of other well-known, pre-trained embeddings that are an order of magnitude larger in size.</abstract> <identifier type="citekey">wilson-etal-2020-urban</identifier> <location> <url>https://aclanthology.org/2020.lrec-1.586</url> </location> <part> <date>2020-05</date> <extent unit="page"> <start>4764</start> <end>4773</end> </extent> </part></mods></modsCollection>
%0 Conference Proceedings%T Urban Dictionary Embeddings for Slang NLP Applications%A Wilson, Steven%A Magdy, Walid%A McGillivray, Barbara%A Garimella, Kiran%A Tyson, Gareth%Y Calzolari, Nicoletta%Y Béchet, Frédéric%Y Blache, Philippe%Y Choukri, Khalid%Y Cieri, Christopher%Y Declerck, Thierry%Y Goggi, Sara%Y Isahara, Hitoshi%Y Maegaard, Bente%Y Mariani, Joseph%Y Mazo, Hélène%Y Moreno, Asuncion%Y Odijk, Jan%Y Piperidis, Stelios%S Proceedings of the Twelfth Language Resources and Evaluation Conference%D 2020%8 May%I European Language Resources Association%C Marseille, France%@ 979-10-95546-34-4%G English%F wilson-etal-2020-urban%X The choice of the corpus on which word embeddings are trained can have a sizable effect on the learned representations, the types of analyses that can be performed with them, and their utility as features for machine learning models. To contribute to the existing sets of pre-trained word embeddings, we introduce and release the first set of word embeddings trained on the content of Urban Dictionary, a crowd-sourced dictionary for slang words and phrases. We show that although these embeddings are trained on fewer total tokens (by at least an order of magnitude compared to most popular pre-trained embeddings), they have high performance across a range of common word embedding evaluations, ranging from semantic similarity to word clustering tasks. Further, for some extrinsic tasks such as sentiment analysis and sarcasm detection where we expect to require some knowledge of colloquial language on social media data, initializing classifiers with the Urban Dictionary Embeddings resulted in improved performance compared to initializing with a range of other well-known, pre-trained embeddings that are order of magnitude larger in size.%U https://aclanthology.org/2020.lrec-1.586%P 4764-4773
Markdown (Informal)
[Urban Dictionary Embeddings for Slang NLP Applications](https://aclanthology.org/2020.lrec-1.586) (Wilson et al., LREC 2020)