AI-Enhanced Robotics: GPT-4V and myCobot in Action
The mylangrobot project featured in this article was created by neka-nat, and its author, Shirokuma, has authorized the editing and reproduction of this write-up. A big shout-out to neka-nat :)
https://twitter.com/neka_nat

Introduction
This project demonstrates a simple pick-and-place operation with a robotic arm, using GPT-4V and myCobot. It relies on a method called SoM (Set of Mark) to generate robotic movements from natural language. In simpler terms, the system receives a natural-language instruction, locates the requested object in the camera image, and then commands the robotic arm to grab it. The highlight of this project is the combination of GPT-4V's image understanding with the SoM marking method, which lets the robotic arm be driven through natural language interaction.
Software
SoM
Set of Mark (SoM) is a method designed to enhance the visual understanding capabilities of large language models. After an image is processed with SoM, a series of numbered markers is overlaid on it. These markers can be referenced by the language model, which helps it identify and reason about objects and content within the image more accurately, improving its performance on visual tasks.
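The idea can be illustrated with a small sketch: each detected region is overlaid with a numeric label so that a language model can refer to objects by number. The snippet below is a minimal illustration of this marking step using OpenCV; the bounding boxes are hypothetical placeholders rather than output from a real detector.

# Minimal illustration of the SoM idea: overlay numeric marks on detected regions.
# The bounding boxes below are hypothetical placeholders for detector output.
import cv2

def draw_marks(image, boxes):
    # Draw a rectangle and a numbered label for each (x, y, w, h) box.
    marked = image.copy()
    for i, (x, y, w, h) in enumerate(boxes):
        cv2.rectangle(marked, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(marked, str(i), (x + w // 2, y + h // 2),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 255), 2)
    return marked

image = cv2.imread("scene.jpg")                   # any test image
boxes = [(50, 60, 120, 80), (300, 200, 90, 90)]   # placeholder detections
cv2.imwrite("scene_marked.jpg", draw_marks(image, boxes))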
GPT-4V
The familiar GPT models are large language models that we interact with conversationally. GPT-4V is a multimodal extension that handles image input in addition to text: it can understand images and describe their contents. It combines GPT-4's powerful natural language processing with image analysis, offering efficient and accurate integrated vision-and-language capabilities.
Below is an example provided by OpenAI:

User: "These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video."

GPT-4V: "🐺 Survival of the Fittest: An Epic Tale in the Snow ❄️ - Witness the intense drama of nature as a pack of wolves face off against mighty bison in a harsh winter landscape. This raw footage captures the essence of the wild where every creature fights for survival. With each frame, experience the tension, the strategy, and the sheer force exerted in this life-or-death struggle. See nature's true colors in this gripping encounter on the snowy plains. 🦬" Remember to respect wildlife and nature. This video may contain scenes that some viewers might find intense or distressing, but they depict natural animal behaviors important for ecological studies and understanding the reality of life in the wilderness.
Openai-whisper
https://openai.com/research/whisper
An automatic speech recognition (ASR) system developed by OpenAI. It utilizes advanced deep learning techniques to transcribe, translate, and understand human speech. Whisper is characterized by its ability to support multiple languages and its exceptional performance in recognizing voices with various accents and in noisy environments. Additionally, it can handle different audio qualities, making it suitable for various applications such as transcribing meeting records, automatically generating subtitles, and assisting in translations.
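As a rough sketch, transcribing a recorded audio file with the open-source whisper package looks like this; the model size and file name are assumptions (the project itself calls Whisper through the speech_recognition library, as shown later).

# Minimal sketch of local transcription with the open-source whisper package.
# The model size ("base") and file name are assumptions; adjust to your setup.
import whisper

model = whisper.load_model("base")        # download/load the "base" model
result = model.transcribe("command.wav")  # transcribe a recorded voice command
print(result["text"])                     # the recognized text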
pymycobot
https://github.com/elephantrobotics/pymycobot/
pymycobot is a Python library for the myCobot robot series. myCobot is a compact, multifunctional collaborative robotic arm suitable for education, research, and lightweight industrial applications. The library provides a simple programming interface that lets developers control and program myCobot robots for operations such as movement, grabbing, and sensing. It supports multiple operating systems and development environments, making it easy to integrate into robotics and automation projects. Because it uses Python, a widely used programming language, pymycobot makes operating and experimenting with myCobot robots accessible and flexible.
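As a quick illustration (not taken from the project itself), a minimal pymycobot session might look like the sketch below; the serial port and baud rate are assumptions that depend on your setup.

# Minimal pymycobot sketch; the serial port and baud rate are assumptions.
from pymycobot.mycobot import MyCobot

mc = MyCobot("/dev/ttyUSB0", 115200)    # connect to the arm over serial
mc.send_angles([0, 0, 0, 0, 0, 0], 50)  # move all six joints to 0 degrees at speed 50
print(mc.get_angles())                  # read back the current joint angles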
Hardware

myCobot 280 M5
The myCobot 280 M5 is a compact, desktop-level six-axis collaborative robot produced by Elephant Robotics, designed for education, research, and light industrial applications. It supports various programming and control methods and is compatible with multiple operating systems and programming languages. Key specifications:
●Main and auxiliary control chips: ESP32
●Supports Bluetooth (2.4G/5G) and wireless (2.4G 3D Antenna)
●Multiple input and output ports
●Supports free movement, joint movement, Cartesian movement, trajectory recording, and wireless control
●Compatible operating systems: Windows, Linux, macOS
●Supported programming languages: Python, C++, C#, JavaScript
●Supported programming platforms and tools: RoboFlow, myblockly, Mind+, UiFlow, Arduino, mystudio
●Supported communication protocols: Serial port control protocol, TCP/IP, MODBUS
These features make the myCobot 280 M5 a versatile, user-friendly robot solution suitable for a variety of application scenarios.
myCobot Vertical Suction Pump V2.0
The suction pump operates on the principle of vacuum adhesion and is controlled through 3.3 V IO signals, making it easy to use in the development of various embedded devices.
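In this project the pump is switched through one of the arm's IO pins (the grab code shown later pulls the pin low to start suction). A hedged sketch of that control follows; the pin number is an assumption, and the on/off levels may differ for your pump.

# Sketch of switching the suction pump through a basic IO pin.
# The pin number is an assumption; on/off levels may differ for your pump.
import time
from pymycobot.mycobot import MyCobot

mc = MyCobot("/dev/ttyUSB0", 115200)
SUCTION_PIN = 5                      # assumed IO pin wired to the pump

mc.set_basic_output(SUCTION_PIN, 0)  # pull the pin low to start suction (as in the project's grab code)
time.sleep(2)                        # hold the object
mc.set_basic_output(SUCTION_PIN, 1)  # release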
Camera
The camera has standard USB and LEGO interfaces. The USB interface connects to various PC devices, while the LEGO interface allows it to be mounted conveniently. It is suitable for machine vision, image recognition, and similar applications.
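Capturing a frame from the USB camera with OpenCV is straightforward; a minimal sketch follows (the camera index 0 is an assumption and may differ on your machine).

# Minimal sketch of grabbing one frame from the USB camera with OpenCV.
# The camera index (0) is an assumption and may differ on your machine.
import cv2

cap = cv2.VideoCapture(0)
ok, frame = cap.read()
cap.release()
if ok:
    cv2.imwrite("capture.jpg", frame)  # save the frame for later processing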
mylangrobot Software Analysis
The specific workflow of the project described at the beginning is as follows:
- Audio Input: Record the audio instruction first.
- Audio Processing: Use "openai-whisper" to process the audio and convert it into text.
- Language Model Interaction: Use the GPT-4 model to process the converted text instructions and understand the user's commands.
- Image Processing: Use GPT-4V and the enhanced image capability of SoM to process images and find the target mentioned in the instructions.
- Robotic Arm Control: Control the robotic arm to grab the identified target.
Audio Processing
This part uses the speech_recognition library to capture audio data from the microphone so that the computer can recognize it.
Libraries used:

import io
import os
from enum import Enum
from typing import Protocol

import openai
import speech_recognition as sr
from pydub import AudioSegment
from pydub.playback import play
Define an interface for capturing user input and providing output to the user:
class Interface(Protocol):
    def input(self, prefix: str = "") -> str:
        return prefix + self._input_impl()

    def _input_impl(self) -> str:
        ...

    def output(self, message: str) -> None:
        ...
Initialize the microphone and the OpenAI client for audio input and output:
class Audio(Interface):
    def __init__(self):
        self.r = sr.Recognizer()
        self.mic = sr.Microphone()
        # openai-whisper API key
        self.client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
Capture the audio input and convert it into text:
def _input_impl(self) -> str:
    print("Please tell me your command.")
    with self.mic as source:
        self.r.adjust_for_ambient_noise(source)
        audio = self.r.listen(source)
    try:
        return self.r.recognize_whisper(audio, language="japanese")
    except sr.UnknownValueError:
        print("could not understand audio")
    except sr.RequestError as e:
        print("Could not request results from the Whisper service; {0}".format(e))
The value returned by _input_impl is the transcribed text of the audio, which is then used for interaction with the GPT-4 model.
Image Processing and GPT-4 Language Interaction
When the text is sent to the GPT-4 model, the camera image is sent along with it, so image processing and language-model interaction are discussed together.
Libraries used for image processing:

import cv2
import numpy as np
import supervision as sv
import torch
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

from .utils import download_sam_model_to_cache
The code primarily uses SamAutomaticMaskGenerator (from Segment Anything) to detect objects and draw numbered markers on them:
# Convert image to RGB format
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Image processing: target detection, filtering detections by relative area
sam_result = self.mask_generator.generate(image_rgb)
detections = sv.Detections.from_sam(sam_result=sam_result)
height, width, _ = image.shape
image_area = height * width
min_area_mask = (detections.area / image_area) > self.MIN_AREA_PERCENTAGE
max_area_mask = (detections.area / image_area) < self.MAX_AREA_PERCENTAGE
detections = detections[min_area_mask & max_area_mask]

# Marker rendering: mask_annotator and label_annotator are supervision annotators
# (e.g. sv.MaskAnnotator / sv.LabelAnnotator) created elsewhere in the class.
# Returns the annotated image and the detection information.
labels = [str(i) for i in range(len(detections))]
annotated_image = mask_annotator.annotate(scene=image_rgb.copy(), detections=detections)
annotated_image = label_annotator.annotate(scene=annotated_image, detections=detections, labels=labels)
return annotated_image, detections
This produces an annotated image with a numbered marker on each detected object.
Note: the functions below require a GPT-4 API key.
The resulting annotated image is then passed to GPT-4V together with the user's instruction. Before the request is sent, the image needs some preparation: it is base64-encoded and packaged into the request payload. GPT-4V then returns information about the image content and the corresponding objects.

import requests  # used to call the OpenAI HTTP API

def prepare_inputs(message: str, image: np.ndarray) -> dict:
    # Getting the base64 string from the OpenCV image
    base64_image = encode_image_from_cv2(image)

    # metaprompt is the system prompt defined elsewhere in the project
    payload = {
        "model": "gpt-4-vision-preview",
        "messages": [
            {"role": "system", "content": [metaprompt]},
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": message,
                    },
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
                ],
            },
        ],
        "max_tokens": 800,
    }
    return payload

def request_gpt4v(message: str, image: np.ndarray) -> str:
    # headers (including the API key) are defined elsewhere in the project
    payload = prepare_inputs(message, image)
    response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
    res = response.json()["choices"][0]["message"]["content"]
    return res
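As a hypothetical usage sketch, the annotated image and the user's instruction can be sent together, and the model's reply is expected to reference one of the numbered marks. The prompt wording and the parsing below are illustrative assumptions, not the project's actual metaprompt or parsing logic.

# Hypothetical usage sketch; the prompt wording and reply parsing are assumptions.
import re

reply = request_gpt4v("Which mark corresponds to the chocolate?", annotated_image)
match = re.search(r"\d+", reply)   # pull the first mark number out of the reply
if match:
    object_no = int(match.group())
    print("Target object number:", object_no)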
Robotic Arm Control and Overall Integration
After image processing and interpretation by GPT-4V, the instruction yields the target's position information. This position is passed to the robotic arm control system, which moves the arm to the corresponding location and performs the grab.
Key methods involved:
Move to the target object:

def move_to_object(self, object_no: int, speed: Optional[float] = None) -> None:
    object_no = self._check_and_correct_object_no(object_no)
    print("[MyCobotController] Move to Object No. {}".format(object_no))
    detection = (
        np.array([-self._detections[object_no][0], -self._detections[object_no][1]])
        + self.capture_coord.pos[:2]
    )
    print("[MyCobotController] Object pos:", detection[0], detection[1])
    self.move_to_xy(detection[0], detection[1], speed)
Grab action:

def grab(self, speed: Optional[float] = None) -> None:
    print("[MyCobotController] Grab to Object")
    current_pos = self.current_coords().pos
    # Lower the end effector to the object, switch on the suction pump, then lift back up
    self.move_to_z(self.object_height + self.end_effector_height, speed)
    self._mycobot.set_basic_output(self._suction_pin, 0)
    time.sleep(2)
    self.move_to_z(current_pos[2], speed)

Drop action:

def move_to_place(self, place_name: str, speed: Optional[float] = None) -> None:
    print("[MyCobotController] Move to Place {}".format(place_name))
    self._current_position = self.positions[place_name]
    self._mycobot.sync_send_angles(
        np.array(self._current_position) + self.calc_gravity_compensation(self._current_position),
        speed or self._default_speed,
        self._command_timeout,
    )
    print("Current coords: {}".format(self.current_coords()))
Once each function is implemented, the whole process is coordinated into a single workflow that completes the task end to end, as sketched below.
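Conceptually, the orchestration boils down to a short loop like the following. The class and method names follow the snippets above where possible; capture_image, parse_object_no, the detector object, and the "drop" place name are assumptions used only for illustration.

# Conceptual orchestration sketch; capture_image, parse_object_no, the detector
# object, and the "drop" place name are assumptions, not the project's actual code.
def run_once(audio: Audio, detector, controller) -> None:
    command = audio.input()                               # 1. speech -> text
    image = capture_image()                               # 2. grab a camera frame (assumed helper)
    annotated_image, detections = detector.detect(image)  # 3. SoM-style marking, as in the detection code above
    reply = request_gpt4v(command, annotated_image)       # 4. ask GPT-4V which mark matches the command
    object_no = parse_object_no(reply)                    # 5. extract the chosen mark number (assumed helper)
    controller.move_to_object(object_no)                  # 6. drive the arm to the target and pick it up
    controller.grab()
    controller.move_to_place("drop")                      # place name is an assumption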
The specific code can be viewed in the operator.py file.

Example
Below is an example test showing the project's outcome: the user says "pick up the chocolate," and the robotic arm carries out the task.
https://youtu.be/Eda1m7DnIhQ

Summary
This project demonstrates how to leverage advanced artificial intelligence and robotics technologies to accomplish complex automation tasks. By integrating voice recognition, natural language processing, image analysis, and precise robotic arm control, the project has successfully created a robotic system capable of understanding and executing spoken instructions. This not only enhances the naturalness and efficiency of robot-human interaction but also opens up new possibilities for robotic technology in various practical applications, such as automated manufacturing, logistics, assistive robots, and more.
Finally, thanks again to Shirokuma for sharing this case with us. If you have better examples, feel free to contact us!