Urban Dictionary Embeddings for Slang NLP Applications

Steven Wilson, Walid Magdy, Barbara McGillivray, Kiran Garimella, Gareth Tyson

Abstract

The choice of the corpus on which word embeddings are trained can have a sizable effect on the learned representations, the types of analyses that can be performed with them, and their utility as features for machine learning models. To contribute to the existing sets of pre-trained word embeddings, we introduce and release the first set of word embeddings trained on the content of Urban Dictionary, a crowd-sourced dictionary for slang words and phrases. We show that although these embeddings are trained on fewer total tokens (by at least an order of magnitude compared to most popular pre-trained embeddings), they have high performance across a range of common word embedding evaluations, ranging from semantic similarity to word clustering tasks. Further, for some extrinsic tasks such as sentiment analysis and sarcasm detection, where we expect some knowledge of colloquial language on social media to be required, initializing classifiers with the Urban Dictionary Embeddings resulted in improved performance compared to initializing with a range of other well-known, pre-trained embeddings that are an order of magnitude larger in size.
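As a rough, illustrative sketch (not taken from the paper), the snippet below shows how pre-trained embeddings of this kind might be loaded with gensim, checked intrinsically (nearest neighbours and the WordSim-353 similarity benchmark bundled with gensim's test data), and turned into an embedding matrix for initializing a downstream sentiment or sarcasm classifier. The file name ud_embeddings.vec, the word2vec text format, and the example vocabulary are assumptions; the released Urban Dictionary Embeddings may be distributed under a different name or format.

```python
# Minimal sketch, assuming the embeddings are distributed as a word2vec-format
# text file named "ud_embeddings.vec" (hypothetical name, not confirmed by the paper).
import numpy as np
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

# Load the (hypothetical) Urban Dictionary embedding file.
kv = KeyedVectors.load_word2vec_format("ud_embeddings.vec", binary=False)

# Intrinsic sanity check: nearest neighbours of a slang term.
print(kv.most_similar("yeet", topn=5))

# Intrinsic evaluation on a standard semantic-similarity benchmark
# (WordSim-353 ships with gensim's test data).
pearson, spearman, oov_ratio = kv.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(f"Spearman rho: {spearman[0]:.3f}, OOV: {oov_ratio:.1f}%")

# Extrinsic use: build an embedding matrix that could initialize the embedding
# layer of a sentiment or sarcasm classifier. Out-of-vocabulary words fall back
# to small random vectors.
def build_embedding_matrix(vocab, kv, seed=0):
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(len(vocab), kv.vector_size))
    for i, word in enumerate(vocab):
        if word in kv.key_to_index:
            matrix[i] = kv[word]
    return matrix

example_vocab = ["the", "yeet", "pookie", "<unk>"]
embedding_matrix = build_embedding_matrix(example_vocab, kv)
print(embedding_matrix.shape)  # (vocabulary size, embedding dimension)
```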

Anthology ID:
2020.lrec-1.586
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
SIG:
Publisher:
European Language Resources Association
Note:
Pages:
4764–4773
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.586
DOI:
Bibkey:
wilson-etal-2020-urban
Cite (ACL):
Steven Wilson, Walid Magdy, Barbara McGillivray, Kiran Garimella, and Gareth Tyson. 2020. Urban Dictionary Embeddings for Slang NLP Applications. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4764–4773, Marseille, France. European Language Resources Association.
Cite (Informal):
Urban Dictionary Embeddings for Slang NLP Applications (Wilson et al., LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.586.pdf


Export citation (BibTeX):

@inproceedings{wilson-etal-2020-urban,
    title = "Urban Dictionary Embeddings for Slang {NLP} Applications",
    author = "Wilson, Steven and Magdy, Walid and McGillivray, Barbara and Garimella, Kiran and Tyson, Gareth",
    editor = "Calzolari, Nicoletta and B{\'e}chet, Fr{\'e}d{\'e}ric and Blache, Philippe and Choukri, Khalid and Cieri, Christopher and Declerck, Thierry and Goggi, Sara and Isahara, Hitoshi and Maegaard, Bente and Mariani, Joseph and Mazo, H{\'e}l{\`e}ne and Moreno, Asuncion and Odijk, Jan and Piperidis, Stelios",
    booktitle = "Proceedings of the Twelfth Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2020.lrec-1.586",
    pages = "4764--4773",
    language = "English",
    ISBN = "979-10-95546-34-4",
}


Markdown (Informal)

[Urban Dictionary Embeddings for Slang NLP Applications](https://aclanthology.org/2020.lrec-1.586) (Wilson et al., LREC 2020)


FAQs

How reliable is the Urban Dictionary? ›

Urban Dictionary can be a resource for slang phrases that are not in traditional dictionaries. However, it is not an authoritative source, since it is mainly an entertainment site. For questions that are not specifically about slang or obscure phrases, it is better to consult another source if you can find one.

How long does Urban Dictionary take to approve? ›

The approval timeframe for Urban Dictionary submissions can vary, but it typically takes a few days to a couple of weeks. The main factor influencing the wait time is submission volume: with a large number of submissions received daily, the review process can take longer.

Does anyone use Urban Dictionary anymore? ›

Today, Urban Dictionary averages around 65 million visitors a month, according to data from SimilarWeb, with almost 100 percent of its traffic originating via organic search.

How does Urban Dictionary make money? ›

According to founder Aaron Peckham, the company makes money mostly from advertising and a small collection of Urban Dictionary-related products, such as calendars, greeting cards and books, that he sells through the site.

Can Urban Dictionary be used as a source? ›

Urban Dictionary is not a reliable source for definitions. It consists mainly of user-generated content, which is usually not considered reliable for facts.

What is a twatwaffle? ›

When someone refers to another person as a "twatwaffle," they are essentially calling them an idiot, fool, or expressing their disapproval in a rude and disrespectful manner. It is important to note that this term is considered offensive and impolite, and its usage may be inappropriate in many situations.

How many people use Urban Dictionary? ›

It was founded in 1999 by computer science student Aaron Peckham to make fun of the comparatively staid Dictionary.com. Yet Urban Dictionary has become much more than a parody site, drawing approximately 65 million visitors every month.

Who runs the Urban Dictionary? ›

Urban Dictionary
Available in: English
Owner: Aaron Peckham
Created by: Aaron Peckham
URL: www.urbandictionary.com
Commercial: Owned company

Who is behind Urban Dictionary? ›

Urban Dictionary
[Screenshot of Urban Dictionary front page (2018)]
Owner: Aaron Peckham
Created by: Aaron Peckham
URL: urbandictionary.com
Launched: December 9, 1999

What does "pookie" mean in slang? ›

“Pookie” is a term of endearment people use to describe something cute. Call someone like your significant other, friend, or even pet “pookie” to express your love and affection. Other cute nicknames include things like babe, love, and cutie pie.

What does bop mean in slang? ›

In the video, Brian explains that a "bop" is a word for anyone but is typically used to refer to women: "Somebody who posts their body on the internet…or somebody who just be getting around with everybody, who be linking with every dude, who be around all the dudes."

What does IDEK anymore mean? ›

IDEK is an acronym used in texting and social media that means "I don't even know." It expresses sheer puzzlement over something that seems inexplicable. Related terms: "I know!" and "the more you know."

Is YEET in the Urban Dictionary? ›

Yes. An Urban Dictionary entry from 2008 defined yeet as an excited exclamation, particularly in sports and sexual contexts.

What is mogging? ›

Mogging refers to being more physically attractive than others, and is part of the trend of looksmaxing, which focuses on improving one's appearance.

What does sigma mean in slang? ›

What the sigma? Some know "sigma" as the 18th letter of the Greek alphabet, but it's also teen slang for a cool dude. According to Know Your Meme, sigma is "referring to a supposed classification for men who are successful and popular, but also silent and rebellious."

Is the Urban Dictionary trustable? ›

If you come across a slang word that you don't know the meaning of, Urban Dictionary is a good place to look it up. If there are multiple meanings, use your common sense to see which one is correct in the context in which the word was used. Urban Dictionary is not a professional dictionary.

Is the Urban Dictionary a real dictionary? ›

Urban Dictionary is a crowdsourced online dictionary of slang words and phrases that was founded in 1999 as a parody of Dictionary.com and Vocabulary.com by then-college freshman Aaron Peckham. Some of the definitions on the website can be found as early as 1999, but most early definitions are from 2003.

Is Urban Dictionary a safe site? ›

The site's usage terms say the content "is frequently presented in a coarse and direct manner that some may find offensive" and suggest the site isn't for users under 13. However, kids can easily find a number of edgy definitions; the terms are listed alphabetically, and there's also a search function.
