
AI Study Notes 10: What Does an Embedding Do? From Technical Principles to the Interview Answer


I. Technical Principle: The Core Idea in One Sentence

The core job of an embedding is to map data that cannot be computed with directly (text, images, and so on) into low-dimensional dense vectors in a shared space, so that items with similar meanings end up close to each other.
Put simply, it is "meaning encoded as numbers" for the computer: it turns the "meaning in a person's head" into "coordinates in a space" that a computer can actually calculate with.


II. An Intuitive Picture: A Map of Meaning

Imagine a map like this (a real embedding space has far more dimensions, but the idea is the same):

  • Cat → [0.92, 0.31, -0.45, …]
  • Dog → [0.88, 0.29, -0.42, …]
  • Truck → [0.15, -0.87, 0.53, …]

The vectors for cat and dog are very close to each other, while the truck's vector is far away.
With embeddings, the computer no longer treats words as isolated symbols; it can compare texts along a "dimension of meaning".
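
As a minimal sketch of what "close in the space" means in practice, the snippet below computes cosine similarity on the toy vectors above, truncated to three dimensions; the numbers are illustrative placeholders, and only the relative distances matter.

    # Cosine similarity on the toy cat/dog/truck vectors (illustrative values).
    import numpy as np

    def cosine(a, b):
        a, b = np.asarray(a), np.asarray(b)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    cat   = [0.92, 0.31, -0.45]
    dog   = [0.88, 0.29, -0.42]
    truck = [0.15, -0.87, 0.53]

    print(cosine(cat, dog))    # close to 1.0 -> similar meaning
    print(cosine(cat, truck))  # much lower (negative here) -> unrelated meaning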


III. Technical Principle (A Bit Deeper): How Is It Learned?

The linguistic intuition: "a word's meaning is determined by the contexts it appears in."

  • By training on large amounts of text (e.g., Word2Vec, or BERT's embedding layer), the model adjusts the vector assigned to each word.
  • After training, words that occur in similar contexts (cat and dog both show up around animals, homes, and feeding) end up close together.
  • This process needs no human labeling; the signal emerges from how language is actually used.

A key bonus: the vector space can also capture analogical relations, such as king - man + woman ≈ queen.
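
A hedged sketch of checking that analogy, assuming the gensim library and a small pretrained GloVe model are available (neither is named in this note):

    # Query the classic analogy against pretrained GloVe vectors.
    # The model choice is an assumption; any small word-vector model works.
    import gensim.downloader as api

    vectors = api.load("glove-wiki-gigaword-50")   # small pretrained model (~66 MB)
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=1))
    # typically prints something like [('queen', 0.85...)]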


IV. Where Does the Embedding Do Its Work in a RAG Pipeline?

  1. At indexing time: each document chunk is turned into a vector → stored in a vector database → giving every chunk a "semantic address".
  2. At query time: the user's question is embedded into the same space → the most similar vectors are looked up in the database → the semantically closest passages are returned.

A concrete example:
A user asks "How do I keep my dog healthy?". The documents never use that exact wording; what they actually say is "dogs need a walk every day, which supports their mental health". The embedding can still surface the right passage, because the meanings of "happy", "healthy", and "dog" sit close together in the space. It matches on meaning, not on surface form.
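
A minimal in-memory sketch of the two phases above, using the sentence-transformers library; the model name and the toy documents are assumptions made for illustration, not part of this note:

    # Index a few chunks, then answer the dog question by vector similarity.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # 1. Indexing: embed each chunk once and keep the vectors.
    chunks = [
        "Dogs need a walk every day; it supports their mental health.",
        "The office closes at 18:00 on Fridays.",
    ]
    index = model.encode(chunks, normalize_embeddings=True)

    # 2. Querying: embed the question into the same space and rank by similarity.
    query = model.encode(["How do I keep my dog healthy?"], normalize_embeddings=True)
    scores = index @ query.T                # cosine similarity (vectors are normalized)
    print(chunks[int(np.argmax(scores))])   # -> the dog-walking chunk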


V. Interview Answer Structure (a complete 2-3 minute answer)

The outline below is meant to sound intuitive, land the core definition, and still show how the technique is used in practice.

[Lead with the definition]

"Embeding ƒe dɔ vevitɔ nye, be wòaɖo data siwo menya alesi woatsɔ woawo ŋu le la ɖe vektor siwo le teƒe ɖeka ɖe eme, eye wòazu low-dimensional, be nu siwo ƒe gɔmeɖeɖe le sɔsɔe la nato afi ma afi. Le akpa me la, enye 'gɔmeɖeɖe ƒe dzesi ƒoƒo' na kɔmpiuta."

[Go one level deeper and name the key terms]

"One-hot encoding siwo tsa la mena be nyawo nanɔ didime o, gake embedding to nɛral netwɔk dzi hena hehe tso gbea geɖe me — 'nya ɖeka ƒe gɔmeɖeɖe nye eƒe nuŋlɔɖi.' Ne wowo vɔ, nya ɖesiaɖe/sɛtɛns ɖesiaɖe nye dense vektor, eye vektor siwo dome aɖakpa yɔxɔi ate ŋu aɖo gɔmeɖeɖe ƒe sɔsɔe ɖa. Ekpɔa nɔnɔmetata siwo sɔ hã, abe fia - ŋutsu + nyɔnu ≈ fiaɖuɖo."

[Tie it to real work: the key part]

"Le dɔ si mewɔ do ŋgɔ, RAG nyaɖeɖe ƒe nuɖoɖo me, metsɔ embedding wɔ dɔ tẽ. Mewo tiatia text-embedding-3-small, hetsɔ dɔwɔƒe ƒe agbalẽwo lã 500 karakter ƒe akpawo, akpa ɖesiaɖe trɔ ɖe vektor me heɖo ɖe Qdrant me.
Ƒe ɖeka la, amedzi bia be 'Aleke mawɔ be makpɔ ŋkeke siwo wotsɔna wɔa ŋkeke?' be keyword mekpɔ o, elabe agbalẽ me wole 'ɖoɖo le alesi woatsɔ wɔa ŋkeke.' Gake embedding ate ŋu aɖo 'ŋkeke' kple 'dɔwɔwɔ' ɖe teƒe siwo sɔ gbɔe, eye wòaɖo eŋu nyuie.
Mekpɔ tsɔtsrɔ̃ aɖe hã: gɔmedzedzea me metsɔ generic embedding wɔ, gake mewɔ dzeɖe se ƒe nyawo ŋu o, eyome metsɔ domain-tuned BGE-large ɖe teƒe, eye retrieval hit rate tso 72% yi 89%. Eya ta embedding model tiatia la doa vevie na dɔ si le eƒe xexlẽme."
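
A hedged sketch of the pipeline described above (500-character chunks, text-embedding-3-small, Qdrant); the collection name, the sample document, and the client setup are illustrative assumptions, not the actual project code:

    # Index ~500-character chunks into Qdrant, then query them.
    from openai import OpenAI
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams

    oai = OpenAI()                      # needs OPENAI_API_KEY in the environment
    qdrant = QdrantClient(":memory:")   # in-memory instance, just for the sketch

    def embed(texts):
        resp = oai.embeddings.create(model="text-embedding-3-small", input=texts)
        return [d.embedding for d in resp.data]

    # A stand-in document; real company docs would go here.
    docs = ["Arrangements for how days off are scheduled: submit a request to "
            "your manager at least three working days in advance ..."]
    chunks = [d[i:i + 500] for d in docs for i in range(0, len(d), 500)]

    qdrant.create_collection(
        collection_name="handbook",
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )
    qdrant.upsert(
        collection_name="handbook",
        points=[PointStruct(id=i, vector=v, payload={"text": c})
                for i, (c, v) in enumerate(zip(chunks, embed(chunks)))],
    )

    hits = qdrant.search(collection_name="handbook",
                         query_vector=embed(["How do I take a day off?"])[0],
                         limit=3)
    print([h.payload["text"] for h in hits])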

[Raise the perspective to show senior-level thinking]

"Le akpa aɖe me la, mele be mawu hã: embedding le mɔnu aɖe nu enye semantic compression si me nane bu — etsɔa nya ƒe nɔnɔme, sentence ƒe ɖoɖo kple bubuwo ɖa, eye wòtɔna ɖe 'gɔmeɖeɖe vevitɔ' me. Eya ta le nɔnɔmetata siwo hiã be wòadze ŋgɔ (abe product model 'iPhone12' vs 'iPhone13'), pure vektor retrieval ate ŋu aɖo kpe o. Le mɔ̃ɖaŋudɔwɔwɔ me la, míedzrana mixed retrieval (vektor + BM25) héna kpekpeɖeŋu."

[Close]

"Le akpa aɖe me la, embedding ɖea nya si nye 'Aleke wɔ be kɔmpiuta nase gɔmeɖeɖe ƒe sɔsɔe?' la ŋu. Enye modern NLP kple RAG ƒe kpe ɖeka le teƒe."


VI. Questions the Interviewer May Ask, and How to Answer Them

Question → key points for the answer:

  • "How is an embedding trained?" → Point to Word2Vec's CBOW/Skip-gram (predicting a word from its context), or to modern contrastive learning (SimCSE, Sentence-BERT). Focus on the core training objective.
  • "How do you judge the quality of an embedding?" → Hit rate and MRR on the tasks you care about; public benchmarks such as MTEB. In practice you can also A/B test the retrieval stage (a small sketch of these metrics follows this list).
  • "Which embedding models have you used? Strengths and weaknesses?" → OpenAI's models are convenient but paid; BGE works well for Chinese; M3E is lightweight; E5 is multilingual. Choose according to the scenario.
  • "How do you choose the vector dimension?" → Higher dimensions are more expressive but cost more compute and storage; too low and quality suffers. Common sizes are 384/768/1536, chosen to fit the task.

VII. Pitfalls to Avoid (in the Interview Answer)

  • ❌ Don't just say "an embedding turns words into vectors" and stop; that is too shallow, and the interviewer will ask "and then what?"
  • ❌ Don't pile on heavy math (say, reciting Hilbert-space definitions); it can come across as quoting a textbook rather than understanding the idea.
  • ✅ Do talk about how you applied it to a concrete problem, even if it was only a course project. One concrete number (for example, a hit rate that improved by 17 points) is worth more than ten abstract definitions.
