Large Language Model Standards

We end National Poetry Month today. Something inestimable happens between the period of one sentence and the capital letter of the next that can only be imagined by humans.


April 30, 2024
mike@standardsmichigan.com

Perhaps the World Ends Here | Joy Harjo

 

The world begins at a kitchen table. No matter what, we must eat to live.
The gifts of earth are brought and prepared, set on the table.
So it has been since creation, and it will go on.
We chase chickens or dogs away from it. Babies teethe at the corners. They scrape their knees under it.
It is here that children are given instructions on what it means to be human.
We make men at it, we make women.
At this table we gossip, recall enemies and the ghosts of lovers.
Our dreams drink coffee with us as they put their arms around our children.
They laugh with us at our poor falling-down selves and as we put ourselves back together once again at the table.
This table has been a house in the rain, an umbrella in the sun.
Wars have begun and ended at this table. It is a place to hide in the shadow of terror.
A place to celebrate the terrible victory.
We have given birth on this table, and have prepared our parents for burial here.
At this table we sing with joy, with sorrow. We pray of suffering and remorse. We give thanks.
Perhaps the world will end at the kitchen table, while we are laughing and crying, eating of the last sweet bite.

 

There are a number of standards and benchmarks for evaluating large language models (LLMs). Some of the most commonly used include:

  1. GLUE (General Language Understanding Evaluation): GLUE is a benchmark designed to evaluate and analyze the performance of models across a diverse range of natural language understanding tasks, such as sentiment analysis, sentence similarity, and natural language inference (a brief loading sketch follows this list).
  2. SuperGLUE: SuperGLUE is an extension of the GLUE benchmark, featuring more difficult language understanding tasks to provide a more challenging evaluation for models.
  3. CoNLL (Conference on Computational Natural Language Learning): CoNLL has historically hosted shared tasks on named entity recognition, coreference resolution, dependency parsing, and other syntactic and semantic problems.
  4. SQuAD (Stanford Question Answering Dataset): SQuAD is a benchmark dataset for evaluating the performance of question answering systems. It consists of questions posed on a set of Wikipedia articles, where the model is tasked with providing answers based on the provided context.
  5. RACE (Reading Comprehension from Examinations): RACE is a dataset designed to evaluate reading comprehension models. It consists of English exam-style reading comprehension passages and accompanying multiple-choice questions.
  6. WMT (Workshop on Machine Translation): The WMT shared tasks focus on machine translation, providing benchmarks and evaluation metrics for assessing the quality of machine translation systems across different languages.
  7. BLEU (Bilingual Evaluation Understudy): BLEU is a metric used to evaluate the quality of machine-translated text relative to human-translated reference texts. It compares n-gram overlap between the generated translation and the reference translations.
  8. ROUGE (Recall-Oriented Understudy for Gisting Evaluation): ROUGE is a set of metrics used for evaluating automatic summarization and machine translation. It measures the overlap between generated and reference summaries or translations (a sketch of the n-gram overlap that both ROUGE and BLEU rely on follows below).
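
Many of these benchmarks are distributed as public datasets, so sampling one directly is straightforward. Below is a minimal sketch, assuming the Hugging Face datasets library is installed; "sst2" is GLUE's sentiment-analysis sub-task, chosen here only as an example.

    # Minimal sketch: load one GLUE sub-task with the Hugging Face
    # `datasets` library (pip install datasets). Other sub-tasks and
    # benchmarks such as SQuAD load the same way.
    from datasets import load_dataset

    glue_sst2 = load_dataset("glue", "sst2")

    # Each split is a table of labeled examples.
    example = glue_sst2["validation"][0]
    print(example["sentence"], example["label"])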
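
BLEU and ROUGE both reduce to counting overlapping n-grams between a candidate text and a reference; BLEU scores the overlap as a clipped precision, while ROUGE-N scores it as a recall. The sketch below is our own illustration, not any standard library's implementation, and it omits refinements such as BLEU's brevity penalty and its geometric mean over n-gram orders 1 through 4.

    from collections import Counter

    def ngrams(tokens, n):
        # Count every n-gram (as a tuple) in a token list.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu_style_precision(candidate, reference, n):
        # Clipped n-gram precision: each candidate n-gram is credited at
        # most as many times as it appears in the reference.
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        return overlap / sum(cand.values()) if cand else 0.0

    def rouge_style_recall(candidate, reference, n):
        # n-gram recall: the fraction of the reference's n-grams that
        # also appear in the candidate.
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
        return overlap / sum(ref.values()) if ref else 0.0

    reference = "the cat sat on the mat".split()
    candidate = "the cat is on the mat".split()

    print(bleu_style_precision(candidate, reference, 2))  # 0.6
    print(rouge_style_recall(candidate, reference, 2))    # 0.6

In practice, established implementations such as sacrebleu or rouge-score handle tokenization, smoothing, and multiple references; the point here is only the shared counting machinery.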

These benchmarks and standards play a crucial role in assessing the performance and progress of large language models, helping researchers and developers understand their strengths, weaknesses, and areas for improvement.

Yann LeCun & Lex Fridman: Limits of LLMs

Large language models are a new topic for us, with time today to cover only the basics.  We have, however, followed language generally every month, because discovering and promulgating best practice in conceiving, designing, building, occupying, and maintaining the architectural character of education settlements depends upon a common vocabulary.  The struggle to agree upon that vocabulary presents an outsized challenge to our work.

Large language models hold significant potential for the building construction industry by streamlining various processes. They can analyze vast amounts of data to aid in architectural design, structural analysis, and project management. These models can generate detailed plans, suggest optimized construction techniques, and assist in cost estimation. Moreover, they facilitate better communication among stakeholders by providing natural language interfaces for discussing complex concepts. By harnessing the power of large language models, the construction industry can enhance efficiency, reduce errors, and ultimately deliver better-designed and more cost-effective buildings.

Join us today at the usual hour.  Use the login credentials at the upper right of our home page.

Related:

print("Python")

Standards January: Language

Standard for Large Language Model Agent Interface

 
