AI Interview Questions, Series 10: What Does Embedding Actually Do? From Technical Essence to Interview Answers

I. Technical Essence: The Core Idea in One Sentence

The core job of an embedding is to map discrete, unstructured data (text, images, and so on) into a continuous, low-dimensional vector space in which semantically similar items lie close together.
Put plainly, it builds a "semantic coordinate system" for the computer, translating the fuzzy "meaning" that humans perceive into "spatial coordinates" that a computer can calculate with.


II. A Deeper Look: The Map of Meaning

Picture a two-dimensional map (real embeddings usually have hundreds of dimensions, but the principle is the same):

  • Cat → [0.92, 0.31, -0.45, …]
  • Dog → [0.88, 0.29, -0.42, …]
  • Car → [0.15, -0.87, 0.53, …]

The cat and dog vectors are very close to each other, while the car vector is far away.
Embeddings let the computer stop treating words as isolated symbols and instead compare text by "semantic distance".
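
To make "semantic distance" concrete, here is a minimal sketch that compares the three example vectors using cosine similarity. It assumes only the first three components shown above (the elided remainder is ignored), so the numbers are illustrative rather than real model output.

```python
from math import sqrt

# First three components of the example vectors; the rest is elided in the text.
cat = [0.92, 0.31, -0.45]
dog = [0.88, 0.29, -0.42]
car = [0.15, -0.87, 0.53]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(cat, dog)
cat_car = cosine_similarity(cat, car)
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

With these truncated vectors, cat and dog score very close to 1, while cat and car score below 0.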


III. How It Works (Simplified): How Are Embeddings Learned?

It rests on a hypothesis from linguistics (the distributional hypothesis): "A word's meaning is determined by the contexts it appears in."

  • By training on large text corpora (for example Word2Vec, or the input embedding layer of BERT), the model keeps adjusting the vector of every word.
  • Eventually, words that appear in similar contexts (cat and dog both occur around "pet", "stroke", "feed") end up at nearby positions.
  • This process needs no manual labels; the geometric structure emerges on its own from how language is used.
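
The idea that shared contexts produce nearby vectors can be illustrated without any neural network. The sketch below is a count-based stand-in, not Word2Vec or BERT: it represents each word by the counts of the words that co-occur with it in a tiny, made-up corpus. The corpus and the function names are assumptions for illustration only.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus; each string is one context window.
corpus = [
    "feed the cat", "stroke the cat",
    "feed the dog", "stroke the dog", "walk the dog",
    "drive the car", "park the car",
]

def context_vector(word, sentences):
    """Count every other word that shares a sentence with `word`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts

def cosine(a, b):
    """Cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a)  # missing keys count as 0
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

cat_vec = context_vector("cat", corpus)
dog_vec = context_vector("dog", corpus)
car_vec = context_vector("car", corpus)

sim_cat_dog = cosine(cat_vec, dog_vec)
sim_cat_car = cosine(cat_vec, car_vec)
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

No labels are involved: cat and dog score higher than cat and car simply because they share "feed" and "stroke" as neighbours. Learned embeddings compress the same kind of co-occurrence signal into dense, low-dimensional vectors.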

Key property: the vector space can capture analogy relations, such as King - Man + Woman ≈ Queen.
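
The analogy arithmetic can be sketched with hand-built vectors. The values below are not learned embeddings: they are assumed three-dimensional vectors whose axes loosely stand for "royal", "male" and "female", chosen only to show the mechanics of the calculation.

```python
from math import dist

# Hypothetical hand-crafted vectors (axes: royal, male, female).
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.0, 0.0],
}

def analogy(a, b, c, table):
    """Return the word closest to a - b + c, skipping the three input words."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    candidates = [word for word in table if word not in (a, b, c)]
    return min(candidates, key=lambda word: dist(table[word], target))

answer = analogy("king", "man", "woman", vectors)
print(answer)
```

In real embedding spaces the relation only holds approximately, which is why the text writes ≈ rather than =.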


IV. In a RAG System, What Exactly Does the Embedding Do at Each Step?

  1. Indexing time: convert each document chunk into a vector → store it in a vector database → this creates a "semantic address".
  2. Query time: convert the user's question into a vector in the same space → find the nearest document vectors in the database → return the semantically related knowledge passages.
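
The two phases can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_embed`, `build_index` and `search` are hypothetical names, the in-memory list stands in for a vector database, and `toy_embed` merely counts a tiny fixed vocabulary, so it matches shared words and is not semantic. In a real system an embedding model would be passed in as `embed` instead.

```python
from math import sqrt

# Hypothetical stand-in for an embedding model: counts a tiny fixed vocabulary.
VOCAB = ["dog", "walk", "leave", "annual"]

def toy_embed(text):
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # all-zero vectors match nothing

def build_index(chunks, embed):
    """Indexing time: embed every chunk and keep (vector, chunk) pairs."""
    return [(embed(chunk), chunk) for chunk in chunks]

def search(query, index, embed, k=1):
    """Query time: embed the query in the same space, return the k nearest chunks."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

chunks = [
    "a dog needs a walk every day",
    "annual leave is requested through the hr portal",
]
index = build_index(chunks, toy_embed)
top = search("how often should i walk my dog", index, toy_embed, k=1)
print(top)
```

The important design constraint is that the same `embed` function is used for both indexing and querying; vectors from different models do not share a coordinate system.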

Example result:
A user asks "How do I make my dog happy?" Even if the knowledge base contains only "Dogs need a walk every day; it helps their mental health", the embedding can still retrieve it, because "happy / health / dog" are semantically close. This is "semantic matching", not "surface-form matching".


V. Interview Answer Strategy (a complete 2-3 minute answer)

Below is a structured answer template that shows both theoretical depth and hands-on experience.

[Open by stating the purpose]

"The core job of an embedding is to map discrete, unstructured data into a continuous, low-dimensional vector space in which semantically similar items lie close together. Put plainly, it builds a 'semantic coordinate system' for the computer."

[Explain the principle and mention the key properties]

"Traditional one-hot encoding has no notion of distance between words, whereas embeddings are learned by a neural network from large amounts of text, following the idea that 'a word's meaning is determined by its context'. In the end, each word or sentence is represented by a dense vector, and the cosine of the angle between two vectors measures their similarity directly. The space can even capture analogy relations, such as King - Man + Woman ≈ Queen."

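The one-hot contrast in this part of the answer is easy to verify. A minimal sketch, assuming a three-word vocabulary and illustrative three-component dense values (not real model output):

```python
from math import sqrt

vocab = ["cat", "dog", "car"]

def one_hot(word):
    """One position per vocabulary word; exactly one entry is 1."""
    return [1.0 if entry == word else 0.0 for entry in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Illustrative dense vectors (assumed values).
dense = {"cat": [0.92, 0.31, -0.45], "dog": [0.88, 0.29, -0.42]}

one_hot_sim = cosine(one_hot("cat"), one_hot("dog"))
dense_sim = cosine(dense["cat"], dense["dog"])
print(f"one-hot cat vs dog: {one_hot_sim}")
print(f"dense   cat vs dog: {dense_sim:.3f}")
```

Any two distinct one-hot vectors are orthogonal, so their cosine similarity is always 0 regardless of what the words mean; dense vectors are free to place related words close together.
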
[Connect it to project experience (important)]

"In a RAG question-answering system I built earlier, I worked with embeddings directly. I chose text-embedding-3-small, split the company's internal documents into 500-character chunks, converted each chunk into a vector, and stored it in Qdrant.
One day a user asked 'How do I apply for annual leave?' Keyword search found nothing, because the document was titled 'procedure for requesting time off work'. The embedding, however, placed 'annual leave' and 'time off work' close together and returned the correct passage.
I also ran into a problem: at first I used a general-purpose embedding model, and on legal terminology the results were poor. I later switched to a domain-fine-tuned BGE-large, and recall rose from 72% to 89%. So the choice of embedding model has a large effect on downstream performance."
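
The chunking step mentioned in this answer can be sketched as below. This is a simple fixed-size splitter with no overlap and no sentence-boundary handling; the function name `chunk_text` and those simplifications are assumptions, since the answer only states the 500-character size.

```python
def chunk_text(text, size=500):
    """Split text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[start:start + size] for start in range(0, len(text), size)]

# Placeholder document of 1234 characters.
document = "x" * 1234
chunks = chunk_text(document)
print([len(chunk) for chunk in chunks])
```

Each chunk would then be embedded and stored along with its text. Production pipelines often add overlap between neighbouring chunks so that a sentence cut at a boundary is still retrievable.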

[Add a deeper insight to show senior-level ability]

"I'd like to add one more point: an embedding is essentially lossy semantic compression. It discards surface information such as word order and syntax, and keeps only the 'overall meaning'. So in scenarios that require exact matches (for example, the product models 'iPhone12' vs 'iPhone13'), vector search alone may not be enough. In engineering practice we often use hybrid search (vector + BM25) to compensate."

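One common way to combine the two result lists in hybrid search is reciprocal rank fusion (RRF). The answer above does not say which fusion method was used, so treat this as one possible sketch. The document IDs and both rankings are hard-coded, hypothetical stand-ins for the output of a vector search and a BM25 search.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for the query "iPhone12 battery".
vector_ranking = ["iphone13_spec", "iphone12_spec", "android_faq"]  # near-identical meaning
bm25_ranking = ["iphone12_spec", "charger_faq", "iphone13_spec"]    # exact token match first

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
print(fused)
```

Here the vector ranking alone puts the wrong model first, while the fused list promotes the exact match because BM25 ranks it at the top. The constant `k=60` is a conventional default, not a tuned value.
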
[Closing]

"In short, embeddings solve a fundamental problem: 'how to let a computer measure semantic similarity'. They are one of the cornerstones of modern NLP and RAG."


VI. Follow-up Questions the Interviewer May Ask, and How to Answer Them

Question: "How are embeddings trained?"
Key points: Briefly explain Word2Vec's CBOW/Skip-gram (use the context to predict the center word, or the reverse), or modern contrastive learning (SimCSE, Sentence-BERT). Stress that training exploits co-occurrence statistics.

Question: "How do you measure the quality of an embedding?"
Key points: On a specific task, use recall and MRR; for broad comparison, use a benchmark such as MTEB. In practice you can also run A/B tests on search.

Question: "Which embedding models have you used? What are their strengths and weaknesses?"
Key points: OpenAI is convenient but costly, BGE is strong on Chinese, M3E is lightweight, E5 is multilingual. Choose according to the scenario.

Question: "How do you choose the vector dimension?"
Key points: A higher dimension gives more expressive power but costs more compute and storage; too low a dimension can underfit. 384/768/1536 are common choices; settle on a value by experiment.

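The two retrieval metrics named in the evaluation answer can be computed as below. This is a minimal sketch: the function names and the three evaluation queries with their document IDs are invented for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mean_reciprocal_rank(results):
    """Average of 1 / rank of the first relevant hit per query (0 if none)."""
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)

# Hypothetical evaluation set: (ranked results, set of relevant documents).
results = [
    (["d3", "d1", "d7"], {"d1"}),        # first hit at rank 2
    (["d2", "d5", "d9"], {"d2", "d9"}),  # first hit at rank 1
    (["d4", "d6", "d8"], {"d0"}),        # no hit
]

mrr = mean_reciprocal_rank(results)
recall = recall_at_k(["d2", "d5", "d9"], {"d2", "d9"}, k=2)
print(f"MRR: {mrr:.3f}, recall@2: {recall:.3f}")
```

Recall measures whether the relevant passages were found at all, while MRR rewards placing the first relevant passage near the top. Reporting both on the same labelled query set makes model comparisons easier to defend.
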
VII. Pitfalls to Avoid (in Interviews)

  • ❌ Don't just say "an embedding turns text into a vector". That is too shallow, and the interviewer will ask "and then?"
  • ❌ Don't go too deep into the math (for example, by starting to talk about Hilbert spaces); it can sound like reciting a textbook rather than describing real work.
  • Do say what you used it for and which problem it solved, even if it was only a course project. A single number (such as raising recall by 17 percentage points) carries more weight than ten sentences of theory.
