Programming in Object Pascal with the Delphi CE IDE (and maybe other stuff).

Tuesday, June 18, 2024

[eng] A dumb journey into the AI world. (part 1 of ?)

June 18, 2024 Posted by TikoTako

AI

more like a pAIn in the butt?

I recently embarked on a small project that required the use of AI. The goal was simple: to add a human touch to my home automation system, which, at the moment, is as basic as “press button, receive bacon” 🤣.

Jokes aside, the system is just a local server running on a modest PC, with an open TCP port that receives commands to control various devices and read sensor data via an Arduino.
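
Just to make that setup concrete, here's a minimal sketch of what sending a command to that server could look like. The address, port, and command string are made-up placeholders; the real protocol is whatever my server actually expects.

```python
# Minimal sketch of sending one command to the home-automation server over TCP.
# The address, port, and command string are hypothetical placeholders.
import socket

SERVER = ("192.168.1.50", 5000)  # hypothetical address/port of the local server

def send_command(cmd: str) -> str:
    """Open a connection, send one command, return the server's reply."""
    with socket.create_connection(SERVER, timeout=5) as sock:
        sock.sendall(cmd.encode("utf-8") + b"\n")
        return sock.recv(1024).decode("utf-8").strip()

if __name__ == "__main__":
    # e.g. ask the Arduino-side sensor for a reading
    print(send_command("READ_TEMP"))
```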

My plan was to learn how AI works (not in a technical sense, but the big picture), and then create one that could translate human text into commands.

Since I knew nothing about AI, the first thing I did was to try out the free language models (LMs) on my main PC. And there, the horror began.

My PC, a 13600k with 32 GB of RAM and no GPU (I can’t afford one), struggled to run a 13B model. It consumed almost all the RAM and was so slow that I could make a coffee while waiting for it to respond.

A 7B model was better. It used less RAM and was faster. However, as the models got smaller, they became too simple to do anything useful, even though their speed increased significantly and their RAM usage decreased.

So, I thought, why not pick a 7B model and train it to understand what I say and generate some kind of text? Well, nope. The hardware requirements for training a 7B model are way too high. And that’s just to fine-tune a 7B, which isn’t even “real training”.

Real training involves feeding an algorithm with data, like billions of pieces of text. The algorithm converts the data into tokens and then builds connections between those tokens, organized into layers. In this way, terabytes of data become gigabytes of interconnected tokens. At that point, the LM can understand things, like where a word fits best.
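
If the tokenization step sounds abstract, here's a toy example using the Hugging Face transformers library. The "gpt2" tokenizer is just a convenient example; every model family has its own.

```python
# Toy illustration of the tokenization step: text in, integer token ids out.
# "gpt2" is just an example tokenizer; every model family has its own.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Qual è la temperatura dell'acqua dell'acquario?"
ids = tokenizer.encode(text)

print(ids)                                    # the integer token ids
print(tokenizer.convert_ids_to_tokens(ids))   # the text pieces they correspond to
```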

Then comes the fine-tuning, which teaches the LM what a request means so it can respond accordingly. There are various ways to do this: the most common is a series of Q/A pairs, and another common method is to provide one question and two answers, one good and one bad, so the model learns to prefer the good one.
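
For the curious, here's roughly what those two kinds of fine-tuning data can look like on disk, as JSON lines. The field names (prompt/response, chosen/rejected) follow a common convention but vary between training toolkits, and the questions and answers are placeholders, so treat this purely as an example layout.

```python
# Example of the two data layouts described above, written as JSON lines.
# Field names (prompt/response, chosen/rejected) vary between toolkits;
# the questions and answers here are placeholders.
import json

# 1) Plain Q/A pairs (supervised fine-tuning)
qa_example = {
    "prompt": "Qual è la capitale d'Italia?",
    "response": "La capitale d'Italia è Roma.",
}

# 2) One question with a good and a bad answer (preference tuning)
preference_example = {
    "prompt": "Qual è la capitale d'Italia?",
    "chosen": "La capitale d'Italia è Roma.",
    "rejected": "La capitale d'Italia è Parigi.",
}

with open("dataset.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(qa_example, ensure_ascii=False) + "\n")
    f.write(json.dumps(preference_example, ensure_ascii=False) + "\n")
```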

Now, back to the 7B model. Training from scratch is literally impossible. Fine-tuning is possible if I buy a video card with a few GB of VRAM, but I can’t. So, I thought I’d pick one of the smallest models and teach it with Q/A to generate some magic string when I ask something. For example, if I ask “What’s the water temperature in the fish tank?”, it could generate something like “##temp##fish##tank##”. Then the UI intercepts that, sends the temperature request to the server, converts the reply, and tells me as usual.
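
Here's a rough sketch of that interception step. The marker vocabulary, the command mapping, and the server address are all invented for illustration; the real ones will depend on what I end up teaching the model and what the server expects.

```python
# Sketch of the "magic string" idea: the model outputs a marker like
# ##temp##fish##tank##, the UI spots it and turns it into a server request.
# The marker vocabulary, command mapping, address, and port are all invented.
import re
import socket

MAGIC = re.compile(r"##(\w+)##(\w+)##(\w+)##")

# hypothetical mapping from the marker parts to the server's TCP command
COMMANDS = {
    ("temp", "fish", "tank"): "READ_TEMP_AQUARIUM",
}

def handle_model_output(text: str) -> str:
    match = MAGIC.search(text)
    if not match:
        return text  # no magic string: pass the model's answer through as-is
    cmd = COMMANDS.get(match.groups())
    if cmd is None:
        return "Unknown command."
    with socket.create_connection(("192.168.1.50", 5000), timeout=5) as sock:
        sock.sendall(cmd.encode("utf-8") + b"\n")
        reply = sock.recv(1024).decode("utf-8").strip()
    return f"The water temperature is {reply} °C."

if __name__ == "__main__":
    print(handle_model_output("##temp##fish##tank##"))
```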

The good news is that there are LMs under 1B that can understand what one writes, and I can run the fine-tuning on my PC. The bad news is that they are all in English, and I need it in Italian because I’m not the only one who will use it.

I was about to give up when I heard that the Qwen2 0.5B model has some Italian in it. So, I did a test, and yes, it does have some Italian. It’s very bad, but it’s there 😮.
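
If you want to poke at it yourself, something like this should do (using Hugging Face transformers; the hub id below is what I'd expect for the instruct variant, so double-check it before running):

```python
# Quick test of Qwen2 0.5B's Italian with Hugging Face transformers.
# The hub id below is assumed to be the instruct variant; verify it first.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Qual è la temperatura ideale per un acquario tropicale?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

output = model.generate(inputs, max_new_tokens=80)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```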

And again, I had the brilliant idea to feed it lots of Italian text to do some kind of post-pre-training (yeah, I’m confused too). And I can’t believe I’ve spent ELEVEN HOURS trying to find a decent Italian dataset. Most are either query-online-only or dead links. The few that can be downloaded are scraped web pages that have been “cleaned”. I checked them, and yes, the Italian text is there, but it’s a total mess.

I won’t go into details, but the biggest dataset I found was almost 2GB. It took them two years to make, and it contains:

  • Site errors (MySQL, 404, 403, etc.)
  • Poorly written Italian
  • Random symbols
  • User comments
  • Emails
  • Links
  • etc…

At this point, I feel kind of demoralized, to be honest. Even the Italian Wikipedia dataset has lots of crap inside besides the text.

Fast forward to two weeks later:

  • Any raw "corpus" is, well, raw, so it contains tons of crap.
  • Any processed "corpus" still contains crap. Way less, but it's still there.
  • Any dataset that is from a processed "corpus" still contains crap.

I tried to clean some of it, but it’s almost impossible. I used a mix of regex and language_tool_python. It “works” for the big stuff (links, MySQL errors, server errors, ads, etc.), but the output still needs to be verified manually because it was a 50/50 mix of good and bad. And even the “bad” lines often contained good text, just with typos.
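
For reference, the cleaning pass looked roughly like this. The junk patterns and the thresholds are arbitrary; this is a sketch of the idea, not the exact script.

```python
# Sketch of the cleaning pass: regexes drop the obvious junk (links, SQL and
# server errors, emails), language_tool_python rejects lines with too many
# grammar/spelling issues per word. Patterns and thresholds are arbitrary.
import re
import language_tool_python

JUNK = [
    re.compile(r"https?://\S+"),                          # links
    re.compile(r"\b(mysql|sql syntax|warning:)", re.I),   # database error spam
    re.compile(r"\b(403|404|500)\b.*\b(forbidden|not found|error)\b", re.I),
    re.compile(r"\S+@\S+\.\S+"),                          # email addresses
]

tool = language_tool_python.LanguageTool("it")  # Italian rules; this is slow

def keep_line(line: str) -> bool:
    line = line.strip()
    if len(line) < 20:
        return False
    if any(pattern.search(line) for pattern in JUNK):
        return False
    issues = len(tool.check(line))
    return issues / max(len(line.split()), 1) < 0.2  # arbitrary cut-off

with open("corpus_raw.txt", encoding="utf-8") as src, \
     open("corpus_clean.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if keep_line(line):
            dst.write(line)
```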

So, anyway, now I have a 34k-line text file (which is a pitiful thing). I’m going to try to increase the model’s understanding of Italian with that…
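
Something along these lines could work for that continued pretraining step, with Hugging Face transformers and datasets. The hyperparameters are placeholders, and on a CPU-only machine this will still be painfully slow.

```python
# Rough sketch of continued pretraining on the Italian text file, using
# Hugging Face transformers + datasets. Hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "Qwen/Qwen2-0.5B"  # base model, not the instruct variant
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

dataset = load_dataset("text", data_files="corpus_clean.txt")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwen2-ita",
                           num_train_epochs=1,
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=8),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```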

I hope it’s enough since it already has some Italian 😒.

 

English fixed by Copilot.
