The AI Grading Controversy: Machine Intelligence and the Future of Gemstone Assessment

How artificial intelligence is reshaping — and unsettling — the centuries-old practice of grading colour, clarity, and cut

Few debates in contemporary gemmology have generated as much heat, or as little consensus, as the question of whether artificial intelligence can reliably grade gemstones. Since the early 2020s, a growing number of laboratories and technology firms have introduced machine-learning systems capable of analysing colour, clarity, cut geometry, and even origin-related characteristics in diamonds and coloured stones. The results have been uneven, the claims sometimes extravagant, and the response from the established trade — ranging from cautious interest to outright scepticism — has exposed deep fault lines about what grading actually is, what it is for, and who should be trusted to perform it.

What AI Grading Systems Actually Do

The term "AI grading" encompasses a spectrum of technologies. At the simpler end sit computer-vision systems that measure objectively quantifiable parameters: the angles and proportions of a cut diamond, the surface area of inclusions visible under standardised magnification, or the spectrophotometric coordinates of a stone's colour, expressed in a colour space such as CIE L*a*b* or mapped to a colour-order system such as Munsell notation. These tasks are, in principle, well suited to automation. A machine can measure a pavilion angle to a fraction of a degree far more consistently than a human grader, and a calibrated spectrophotometer will assign colour coordinates without the fatigue, adaptation, or ambient-lighting variation that affects human observers.
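To make the idea of colour coordinates concrete, the following minimal sketch computes the CIE76 colour difference, which is simply the Euclidean distance between two CIE L*a*b* triples. The two readings are invented for illustration and do not come from any real instrument or stone; commercial systems may well use a more refined formula such as CIEDE2000.

```python
import math


def delta_e_cie76(lab1, lab2):
    """Return the CIE76 colour difference between two (L*, a*, b*) triples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(lab1, lab2)))


# Hypothetical readings of the same stone on two occasions (illustrative values only).
reading_a = (52.0, 10.0, -40.0)
reading_b = (52.0, 13.0, -44.0)

difference = delta_e_cie76(reading_a, reading_b)
print(difference)  # 5.0
```

A number of this kind lets a laboratory state how far apart two colour measurements are, which is a different matter from deciding which colour grade a stone deserves.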

At the more ambitious end sit neural-network systems trained on large datasets of previously graded stones, which attempt to replicate the holistic judgements that experienced gemmologists make — assigning a clarity grade, for instance, not merely by counting inclusions but by weighing their nature, position, relief, and effect on transparency. Some systems have further claimed the ability to assist with geographic origin determination by identifying spectroscopic or inclusion-pattern signatures associated with specific deposits. It is in this second, more ambitious category that the controversy is sharpest.

The Case for Automation

Proponents of AI-assisted grading advance several arguments that deserve serious consideration. The first is consistency. Human grading, even within a single laboratory, is subject to inter-grader variation: studies published in Gems & Gemology have documented that experienced graders can disagree by a full clarity grade on the same stone, and colour-grade disagreements of one to two grades are not uncommon. A well-calibrated machine system, operating under fixed illumination and using a fixed algorithm, will produce the same result on the same stone every time it is presented. For a trade in which a single grade boundary can represent thousands of dollars of value, that reproducibility is not trivial.
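Inter-grader variation of this kind is usually expressed with an agreement statistic. As a rough sketch, the code below computes Cohen's kappa, a standard chance-corrected agreement measure, for two graders who assessed the same stones. The grade lists are hypothetical, and nothing here is drawn from the studies mentioned above.

```python
from collections import Counter


def cohens_kappa(grades_a, grades_b):
    """Chance-corrected agreement between two graders on the same stones."""
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("both graders must grade the same non-empty set of stones")
    n = len(grades_a)
    # Proportion of stones on which the two graders gave the same grade.
    observed = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    # Agreement expected by chance, from each grader's own grade frequencies.
    counts_a, counts_b = Counter(grades_a), Counter(grades_b)
    expected = sum(counts_a[g] * counts_b[g] for g in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders used a single identical grade throughout
    return (observed - expected) / (1 - expected)


# Hypothetical clarity grades from two graders for the same four stones.
grader_1 = ["VS1", "VS1", "VS2", "VS2"]
grader_2 = ["VS1", "VS2", "VS2", "VS2"]

kappa = cohens_kappa(grader_1, grader_2)
print(kappa)  # 0.5
```

A deterministic system compared with its own earlier output would score 1.0 on this measure. That figure describes repeatability only and says nothing about whether the grades are correct.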

The second argument is throughput. Major grading laboratories process hundreds of thousands of stones annually. The bottleneck is human attention, and the cost of that attention is reflected in laboratory fees. Automation of routine measurements — proportions, fluorescence intensity, surface condition — could in principle free gemmologists to focus on the genuinely difficult judgements that require expertise, reducing turnaround times and costs simultaneously.

The third argument is the elimination of certain forms of unconscious bias. Human graders are not immune to expectation effects: knowing a stone's declared origin, its price, or the identity of the submitting dealer can subtly influence assessment. A properly blinded algorithmic system has no such susceptibilities, at least in principle.

Finally, there is the argument from democratisation. If reliable grading can be delivered at lower cost and higher speed, smaller dealers and consumers in markets currently underserved by major laboratories might gain access to credible assessments that were previously economically out of reach.

The Case for Scepticism

The critics of AI grading are numerous, and their objections are not merely self-interested protectionism, though that element is present. The most substantive objections concern the nature of coloured-stone grading specifically.

Coloured gemstones present a fundamentally different challenge from diamonds. Diamond grading, for all its complexity, operates within a relatively constrained colour space (near-colourless to light yellow or brown) and a well-defined clarity nomenclature developed over decades by a single dominant institution. Coloured stones — sapphires, rubies, emeralds, alexandrites, spinels, and hundreds of other species — exhibit colour zoning, pleochroism, colour-shift phenomena, and inclusion assemblages of extraordinary variety. The jardin of an emerald, for instance, is not merely a defect to be penalised but a complex feature whose character, distribution, and relationship to the stone's transparency must be weighed holistically. A Colombian emerald with a fine, evenly distributed jardin may be more desirable than a technically cleaner stone from another origin, because the inclusions themselves are evidence of authenticity and origin. No current AI system has demonstrated the ability to make that kind of contextualised, market-informed judgement reliably.

Pleochroism presents a further difficulty. A fine alexandrite displays different colours in different crystallographic directions; a sapphire may show colour zoning that is invisible face-up but apparent from the side. The way a stone is oriented in its setting, the direction of illumination, and the angle of observation all affect the colour a human observer perceives. Capturing this multidimensional optical behaviour in a training dataset requires imaging from multiple angles under multiple illumination conditions — a technical challenge that most current commercial systems have not fully solved.

There is also the question of training data quality. Machine-learning systems are only as reliable as the datasets on which they are trained. If a neural network is trained on grades assigned by human graders — who are themselves subject to the inconsistencies noted above — the system learns to replicate human inconsistency rather than to transcend it. Worse, if the training dataset is drawn predominantly from one laboratory's historical grades, the system will encode that laboratory's particular house style, which may differ from the standards of other institutions. The result is not an objective standard but a digitised version of one institution's subjective practice.
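The house-style problem can be shown with a deliberately simplified sketch. Assume a single hypothetical "saturation score" per stone and two imaginary laboratories that draw the line between "commercial" and "fine" at different scores. A one-feature cutoff learned from Lab A's labels reproduces Lab A perfectly and disagrees with Lab B wherever their conventions differ. All names, scores, and thresholds are invented for illustration; real systems are far more complex.

```python
def learn_threshold(scores, labels):
    """Fit a cutoff midway between the highest 'commercial' and lowest 'fine' score."""
    fine = [s for s, label in zip(scores, labels) if label == "fine"]
    commercial = [s for s, label in zip(scores, labels) if label == "commercial"]
    return (max(commercial) + min(fine)) / 2


def predict(score, threshold):
    """Grade a stone using the learned cutoff."""
    return "fine" if score >= threshold else "commercial"


def agreement(predicted, reference):
    """Fraction of stones on which two sets of grades coincide."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


# Hypothetical saturation scores and two hypothetical house conventions.
scores = [40, 50, 58, 62, 66, 72, 80, 90]
lab_a = ["fine" if s >= 60 else "commercial" for s in scores]
lab_b = ["fine" if s >= 70 else "commercial" for s in scores]

threshold = learn_threshold(scores, lab_a)          # learned only from Lab A
predictions = [predict(s, threshold) for s in scores]

print(agreement(predictions, lab_a))  # 1.0
print(agreement(predictions, lab_b))  # 0.75
```

The model is perfectly consistent, yet what it has learned is Lab A's convention and not a standard independent of either laboratory.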

Critics also raise concerns about adversarial manipulation. Any published algorithm creates a roadmap for optimising a stone's presentation to score well by the algorithm's criteria, which may not correspond to genuine quality. The history of diamond grading has already seen cases in which stones were deliberately fashioned to sit just above grade boundaries; an AI system whose decision boundaries are known or discoverable creates analogous incentive structures.

Laboratory Responses and Industry Developments

The major established laboratories — the Gemological Institute of America (GIA), Gübelin Gem Lab, SSEF Swiss Gemmological Institute, and Lotus Gemology among them — have responded to the AI grading wave with a mixture of internal research investment and public caution. The GIA has invested substantially in spectroscopic databases and computer-vision research, particularly for diamond cut grading and the detection of laboratory-grown stones, while maintaining that the holistic assessment of coloured stones remains a domain requiring trained human expertise. Gübelin and SSEF have similarly incorporated advanced spectroscopic and chemical-analysis tools — including laser ablation inductively coupled plasma mass spectrometry (LA-ICP-MS) and photoluminescence spectroscopy — into their origin-determination workflows, but frame these as instruments that inform expert judgement rather than replace it.

Several technology start-ups have taken a more aggressive position, marketing AI grading platforms directly to dealers and consumers as alternatives to traditional laboratory reports. These platforms vary enormously in their methodological rigour, transparency, and independent validation. Some have published peer-reviewed studies supporting their accuracy claims; others have relied on proprietary testing that cannot be independently assessed. Trade publications such as National Jeweler, JCK, and Rapaport Magazine, along with the GIA's journal Gems & Gemology, have covered these developments with increasing frequency since approximately 2021, and the coverage has generally reflected the industry's ambivalence: acknowledging the technology's potential while questioning the boldness of some claims.

The International Coloured Gemstone Association (ICA) and the American Gem Trade Association (AGTA) have both engaged with the question of AI grading standards, though neither had issued definitive policy positions as of the mid-2020s. The broader question of how AI-generated grades should be disclosed on laboratory reports — whether as primary grades, supplementary data, or internal quality-control tools — remained unresolved across the industry.

The Transparency Problem

Perhaps the most consequential unresolved issue is transparency. When a human gemmologist assigns a grade, there is at least a theoretical chain of accountability: a named individual, trained to a documented standard, operating under a laboratory's published methodology, whose work can in principle be reviewed and challenged. When an algorithm assigns a grade, the situation is murkier. Proprietary machine-learning models are typically protected as trade secrets; their training data, architecture, and decision logic are not disclosed. A consumer or dealer who disputes an AI-generated grade has no meaningful recourse beyond resubmission — and resubmission to a deterministic system will produce the same answer.

This opacity is particularly troubling in the context of high-value coloured stones, where the difference between a "fine" and "exceptional" colour grade, or between "minor" and "moderate" clarity, can represent tens of thousands of dollars. The established laboratories have built their authority over decades through published research, gemmologist training programmes, and a degree of methodological transparency. AI grading systems, to earn equivalent trust, will need to demonstrate not only accuracy but accountability — a standard that most current offerings have not yet met.

The Question of What Grading Is For

Underlying the technical debate is a more philosophical question: what is a gemstone grade actually measuring, and for whom? A grade is not a purely physical description; it is a communication between seller and buyer, embedded in a market context, shaped by conventions that have evolved over time and vary between trade communities. The "pigeon's blood" designation for Burmese ruby, the "royal blue" descriptor for fine Ceylon sapphire, the premium placed on a Colombian emerald's particular hue — these are not objective measurements but culturally and historically situated value judgements that the trade has, over time, partially codified into grading language.

An AI system trained on market data will learn to replicate the market's current preferences, including its biases and its fashions. It will not necessarily identify when those preferences are arbitrary, when they reflect historical contingency rather than genuine quality differences, or when they are shifting. A human expert, embedded in the trade and in dialogue with its history, is at least capable of that kind of reflective judgement, even if individual graders do not always exercise it.

Likely Trajectories

The most plausible near-term outcome is not the replacement of human gemmologists by AI systems but a deepening integration of the two. Automated systems are already demonstrably superior for certain tasks: measuring cut proportions, detecting fluorescence, screening for laboratory-grown stones using spectroscopic signatures, and flagging stones that warrant closer human attention. These applications are likely to become standard across major laboratories regardless of the broader controversy.

For coloured stones specifically, the integration is likely to be slower and more cautious. The complexity of the assessment task, the diversity of the material, the relatively smaller training datasets available (compared to the millions of graded diamonds in laboratory archives), and the higher stakes of individual assessments all argue for a longer transition period. The role of AI in coloured-stone grading is more likely to be that of a sophisticated instrument, like a spectrophotometer or an SSEF fluorescence microscope, than that of an autonomous decision-maker.

What the controversy has already achieved, regardless of its ultimate resolution, is a salutary pressure on the established laboratories to articulate and defend their methodologies more explicitly than they have historically needed to do. If the AI grading debate forces greater transparency about how grades are assigned, how graders are trained, and how inter-grader variation is managed, it will have served a useful function even if the machines themselves never fully replace the gemmologist's loupe.